Healthcare Analytics in the Cloud

Below is the continuation of the transcript of a webinar hosted by InetSoft on the topic of Machine Learning Big Data Analytics in Healthcare. The presenter is Abhishek Gupta, Product Manager at InetSoft, and the guest is Jim Reynolds, CTO at Health Analytica.

Abhishek: Now Jim, there are also sensitive data and privacy issues that impact healthcare analytics in the cloud. There are regulations and potential audits involved. How do you manage to protect the data even as you have to go through a lot of these cleansing and joining steps across different formats, types, and even sources of data?

Jim: So there's actually lots of encryption involved at various points along the pipeline, and so we do keep the data in our archives encrypted. When we move data from one part of the pipeline to another, we keep control of the environment by having really good controls on each of the stages.
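As a rough illustration of the encryption at rest Jim describes between pipeline stages, here is a minimal Python sketch using the cryptography library's Fernet API. The file names and key handling are placeholders rather than Health Analytica's actual implementation; a production system would fetch keys from a key management service.

```python
# Minimal sketch: encrypting a data extract before it is archived, and
# decrypting it when the next pipeline stage needs it. Illustrative only;
# key management in production would use a KMS/HSM, not an in-process key.
from cryptography.fernet import Fernet

def encrypt_file(plain_path: str, cipher_path: str, key: bytes) -> None:
    """Encrypt one extract so it can sit in the archive at rest."""
    f = Fernet(key)
    with open(plain_path, "rb") as src:
        ciphertext = f.encrypt(src.read())
    with open(cipher_path, "wb") as dst:
        dst.write(ciphertext)

def decrypt_file(cipher_path: str, key: bytes) -> bytes:
    """Decrypt an archived extract for the next stage of the pipeline."""
    f = Fernet(key)
    with open(cipher_path, "rb") as src:
        return f.decrypt(src.read())

if __name__ == "__main__":
    key = Fernet.generate_key()  # in practice, fetched from a key vault
    encrypt_file("claims_extract.csv", "claims_extract.enc", key)
    records = decrypt_file("claims_extract.enc", key)
```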

This is where Vertica actually helps us out quite a bit, because we can go in and assign roles and put protections in place, and that was one of the things we were looking for in a data store: the ability to have controls over the data.
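For readers curious what those role-based controls can look like in practice, below is a rough sketch of assigning a read-only role in Vertica, driven from Python with the vertica_python client. The role, schema, and connection details are invented for illustration and are not drawn from Health Analytica's environment.

```python
# Rough sketch of role-based access control in Vertica, issued from Python
# with the vertica_python client. Role, schema, and connection values are
# placeholders.
import vertica_python

conn_info = {
    "host": "vertica.example.internal",
    "port": 5433,
    "user": "dbadmin",
    "password": "********",
    "database": "analytics",
}

statements = [
    "CREATE ROLE claims_reader;",
    "GRANT USAGE ON SCHEMA staging TO claims_reader;",
    "GRANT SELECT ON ALL TABLES IN SCHEMA staging TO claims_reader;",
    "GRANT claims_reader TO analyst_user;",
]

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for stmt in statements:
        cur.execute(stmt)  # each grant narrows who can read the staged data
    conn.commit()
```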


Healthcare Data Pipeline

A healthcare data pipeline serves as the backbone of modern healthcare analytics, enabling the seamless flow of data from various sources to its ultimate destination, where it can be leveraged for critical insights and decision-making. At its core, a healthcare data pipeline consists of a series of interconnected stages or processes designed to collect, ingest, transform, store, and analyze healthcare-related data. The pipeline typically begins with data acquisition, where information is gathered from disparate sources such as electronic health records (EHRs), medical devices, wearables, and patient portals. This initial stage requires robust mechanisms for data extraction and ingestion to ensure that data is efficiently collected and integrated into the pipeline.
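As a simple illustration of this acquisition stage, the sketch below pulls patient records from a hypothetical FHIR-style EHR endpoint and lands the raw JSON in a staging directory. The endpoint, query parameters, and paths are assumptions made for the example, not a reference to any particular vendor's API.

```python
# Illustrative sketch of the acquisition stage: pull patient records from a
# hypothetical FHIR-style EHR endpoint and land the raw JSON in a staging
# directory for later processing.
import json
import pathlib
import requests

EHR_URL = "https://ehr.example.org/fhir/Patient"  # hypothetical endpoint
STAGING_DIR = pathlib.Path("staging/raw")

def ingest_patients(page_size: int = 100) -> int:
    """Fetch one page of patient records and stage the raw bundle."""
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    resp = requests.get(EHR_URL, params={"_count": page_size}, timeout=30)
    resp.raise_for_status()
    bundle = resp.json()
    out_path = STAGING_DIR / "patients_bundle.json"
    out_path.write_text(json.dumps(bundle, indent=2))
    return len(bundle.get("entry", []))

if __name__ == "__main__":
    print(f"Staged {ingest_patients()} patient records")
```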

Once the data is collected, the next stage in the pipeline involves data preprocessing and transformation. This step is essential for cleaning, standardizing, and harmonizing the data to ensure consistency and accuracy. Given the heterogeneous nature of healthcare data, which may vary in format, structure, and quality, preprocessing tasks may include data cleansing, normalization, de-identification, and enrichment. Moreover, advanced techniques such as natural language processing (NLP) and machine learning algorithms may be employed to extract valuable insights from unstructured data sources such as clinical notes and medical imaging reports. By standardizing and transforming the data into a unified format, healthcare organizations can facilitate seamless integration and analysis across disparate datasets.
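The following is a simplified pandas sketch of what that preprocessing might look like: cleansing incomplete rows, normalizing dates and codes, and de-identifying records by hashing the patient identifier. The column names are invented and do not reflect any particular EHR schema.

```python
# Simplified sketch of the preprocessing stage: cleanse, normalize, and
# de-identify a staged extract with pandas. Column names are illustrative.
import hashlib
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Cleansing: drop rows missing the fields downstream analytics depend on.
    out = out.dropna(subset=["patient_id", "encounter_date"])
    # Normalization: standardize dates and categorical codes.
    out["encounter_date"] = pd.to_datetime(out["encounter_date"], errors="coerce")
    out["diagnosis_code"] = out["diagnosis_code"].str.upper().str.strip()
    # De-identification: replace the direct identifier with a one-way hash
    # and drop free-text fields that may contain PHI.
    out["patient_key"] = out["patient_id"].apply(
        lambda pid: hashlib.sha256(str(pid).encode()).hexdigest()
    )
    return out.drop(
        columns=["patient_id", "patient_name", "clinical_notes"],
        errors="ignore",
    )
```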

Finally, the processed data is stored in a secure and scalable repository for further analysis and utilization. This may involve deploying a data warehouse, data lake, or other storage solutions capable of handling large volumes of structured and unstructured data. In addition to storage, the data pipeline may incorporate data governance and security measures to ensure compliance with regulatory requirements such as HIPAA (Health Insurance Portability and Accountability Act) and GDPR (General Data Protection Regulation). Furthermore, the stored data can be leveraged for various analytics use cases, including clinical decision support, population health management, predictive modeling, and research. By establishing a robust healthcare data pipeline, organizations can unlock the full potential of their data assets to drive improvements in patient care, operational efficiency, and healthcare outcomes.
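As a minimal example of the storage stage, the sketch below writes the processed extract to a partitioned Parquet layout that a data lake or a warehouse such as Vertica could then load or query. The paths and partition key are illustrative, and the to_parquet call assumes pyarrow is installed.

```python
# Minimal sketch of the storage stage: persist the processed extract to a
# partitioned Parquet layout. Paths and partition keys are illustrative.
import pandas as pd

def store(df: pd.DataFrame, lake_root: str = "lake/encounters") -> None:
    df = df.assign(encounter_year=df["encounter_date"].dt.year)
    # Partitioning by year keeps scans narrow for year-bounded analytics and
    # makes retention and compliance policies easier to apply per partition.
    df.to_parquet(lake_root, partition_cols=["encounter_year"], index=False)
```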

So as data moves along the pipeline we keep these controls in place, and everywhere along the way where we have an attack surface, we have to keep the data protected either by network access controls or by encryption, and the infrastructure that we build has to deal with that.

Abhishek: Okay, you mentioned Hewlett-Packard Enterprise Vertica as part of your overall solution. Tell us a little bit about what you're using Vertica for specifically, and perhaps tell us the journey that led you to Vertica.

Jim: Sure, so Vertica is a very fast, easy to manage, and cost-effective column store for our problem, and the reason that's important is that traditional relational databases work really well when you're dealing with things in a row-wise fashion. They're really good for online transaction processing, like updating your bank account as a classic example, where you want all of that data to be transactionally secure and consistent.

But when you're dealing with large scale analytics you really need two things from the environment: the ability to move analytics to the data in a cost-effective fashion, and the ability to do data scans really fast. That was the big breakthrough with column stores, and Vertica has been a very solid platform for us on which to build highly embeddable analytics. It lets us perform a lot of complicated analytics in the database, as well as satisfy a lot of the interactive analytics use cases we have for our customers.
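To make the column-scan point concrete, here is a toy Python illustration contrasting a row-oriented layout with a column-oriented one. It is a teaching sketch of the idea behind column stores, not a description of Vertica's internals.

```python
# Toy illustration of why columnar layouts make analytic scans cheap: a
# per-column layout lets an aggregate touch only the values it needs, while
# a row layout drags every field of every record through memory.
rows = [
    {"patient": "a1", "age": 64, "cost": 1200.0},
    {"patient": "b2", "age": 47, "cost": 310.5},
    {"patient": "c3", "age": 71, "cost": 980.0},
]

# Row store: the scan visits every field of every record.
avg_cost_row = sum(r["cost"] for r in rows) / len(rows)

# Column store: the same data kept as one contiguous array per column,
# so the aggregate reads only the "cost" column.
columns = {
    "patient": ["a1", "b2", "c3"],
    "age": [64, 47, 71],
    "cost": [1200.0, 310.5, 980.0],
}
avg_cost_col = sum(columns["cost"]) / len(columns["cost"])

assert avg_cost_row == avg_cost_col
```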


Abhishek: So just so I understand where Vertica fits into your machine learning analytics solution: are you using it for both data acquisition and management and also for analytics, or one or the other? How do your solution and platform relate to the Vertica technology?

Jim: Right, so we divide our data analytics pipeline up into three large segments. One is our data curation and staging, and that's where we perform the activities of publishing, and we use Vertica both for the staging and the publishing.

Then the next stage of our pipeline is our large scale compute, and in that stage we also use Vertica, because we do a lot of metric scans and metric calculations, which requires a lot of column scans, so we use Vertica for that compute environment as well.

But it also does really well for interactive data filtering and interactive metric calculations, so we use Vertica for our user presentation layer too, and it serves all of those parts of our pipeline quite well.
