An Interview with Somalee Datta, Stanford Medicine

Somalee Datta is a computational physicist by training and a biotechnologist by profession. She believes that with the explosion of data in healthcare and with new methods to analyze such large amounts of data, we will see massive changes in how human diseases are addressed. She wants to be part of this revolution.

Q: Your current focus is on creating a large-scale biomedical data analytics platform, a “data lake” that aggregates and provides access to various types of data, including EMR, omics, clinical data, ECG, imaging, population health, and wearable devices data and is linked for analytical access in support of biomedical research and translation. Can you provide some details on the strategic approach and the processes you are putting in place to build this data lake?

A: In the context of Big Data, you have heard the paradigm “bring compute to data”. Our approach at Stanford Medicine is to “bring scientists to data”. The platform is designed from the ground up to reduce time to science. The data lake receives data from the Hospitals; that is necessary but not sufficient. Our researchers need high-quality data quickly and transparently. They need a flexible research computing environment that meets their data science needs. We need to support state-of-the-art privacy and security measures to protect our patient data, including data de-identification and HIPAA-compliant data centers. Researchers also need training and access to expertise and collaboration opportunities. In essence, we are working to bring a systems engineering approach to provide an ecosystem of solutions and services in which the research community can thrive.

Q: Why a “data lake” and not a “data warehouse”?

A: We start with a data lake where the raw data pours in from the Hospital. On top of this data lake, we have a number of traditional and modern data warehouses and applications that meet different downstream needs.

Q: Who/what are the intended primary users/applications of this data analytics platform?

A: The platform is for our research community, which is a multi-disciplinary one. It is not unusual to find research teams composed of physicians and basic science, health science, and computer science researchers. These teams are supported by service providers who bring expertise in data, research computing, statistics, analytics, and software application development.

Q: What is your approach of getting all stakeholders engaged to help create this platform?

A: The Stanford Medicine community is designed to foster a high degree of collaboration across its leaders. For example, we have a governance committee, composed of leadership from the school and the two hospitals, that oversees our various initiatives. We partner with other subject matter experts at Stanford for complementary services, e.g., Honest Broker or Research Computing. We co-create with our research community. We strive to keep ourselves abreast of research initiatives. In many cases, we are part of their research group and therefore eat our own dogfood. Permit me to use a cliché: “we try to be the best at getting better”.

Q: What are some of the major challenges associated with bringing the various data into one platform and making it accessible via analytics tools? How will you overcome these challenges?

A: Perhaps the most significant challenge is using data for research that comes from clinical-grade devices designed to meet operational needs at a medical facility. The devices have research-unfriendly file formats, and the operating systems and data access protocols are idiosyncratic. We are investing significant IT effort to mobilize data and are deploying a complex software stack to convert data to open standards. For newer applications, research requirements are becoming an integral part of the Hospital’s data architecture.

Q: Is all data HIPAA regulated and how do you control/address access to individual patient data?

A: The data lake, with its raw data, is HIPAA regulated. A subset of our research environments are also HIPAA regulated. Protected Health Information (PHI) lives in the HIPAA regulated environments. Many of our researchers work with de-identified data. Such data live in environments that meet NIH’s dbGaP security guidelines.
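To make the de-identification step concrete, here is a toy sketch in the spirit of the HIPAA Safe Harbor method: direct identifiers are dropped, ages over 89 are aggregated, and dates are reduced to the year. The field names and rules are illustrative assumptions, not Stanford’s actual pipeline.

```python
# Toy de-identification sketch (Safe Harbor style). The record schema and
# the set of identifier fields below are hypothetical examples.

DIRECT_IDENTIFIERS = {"name", "mrn", "phone", "email", "address"}

def deidentify(record):
    # Drop direct identifiers entirely.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Safe Harbor: ages 90 and above are collapsed into one category.
    if isinstance(out.get("age"), int) and out["age"] > 89:
        out["age"] = "90+"
    # Safe Harbor: all date elements except the year are removed.
    if "visit_date" in out:
        out["visit_date"] = out["visit_date"][:4]
    return out

record = {
    "name": "Jane Doe", "mrn": "12345", "age": 93,
    "visit_date": "2019-07-14", "diagnosis_code": "E11.9",
}
print(deidentify(record))
# {'age': '90+', 'visit_date': '2019', 'diagnosis_code': 'E11.9'}
```

A real pipeline must also handle free-text notes and the remaining Safe Harbor identifier categories, which is where most of the engineering effort goes.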

Q: The integration of clinical genomics data with other clinical and/or patient related data has been notoriously slow over the past years, though we detect a positive trend towards a change. By when can we expect full integration and adoption of genomics data and applications in the clinic?

A: Stanford offers a clinical genomics program. Centers of Excellence like Stanford need to go above and beyond offering a clinical test. They need to be able to improve patients’ health and their experience. The challenges are clearly numerous and include evolving standards of data interpretation, ethics, cost management and recovery, and demonstrable impact on human health. Evidence is growing and trends are favorable. But what do we ultimately need to do to make genomics a ubiquitous offering in clinics around the world? I think that we need to make the cost/benefit favorable given variable socioeconomic conditions. This will continue to be a challenge for at least another decade.

Q: How can we overcome the challenges of a non-harmonized terminology of medical/phenotype data?

A: With the popularization of common data models like OHDSI OMOP, PCORI, PEDSNet, Sentinel, etc., we are starting to tackle harmonization. These common data models increase insight sharing without data sharing, and reduce time to science through code sharing. Stanford supports OMOP. We are excited about the focus on data QA and analytical packages. There is a robust community of end users, developers, and thought leaders who are actively engaged in discussion forums, trainings, and workshops.
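The “insight sharing without data sharing” idea can be sketched minimally: because every OMOP site uses the same table and column names, a single cohort query is the shareable artifact, and each site returns only an aggregate count. The table shapes follow the OMOP CDM (person, condition_occurrence); the data rows are invented, and 201826 is assumed here to be the standard concept ID for type 2 diabetes mellitus.

```python
# Sketch: one OMOP-shaped site database and a shareable cohort query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        year_of_birth INTEGER
    );
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        condition_concept_id INTEGER  -- standard OMOP concept ID
    );
    INSERT INTO person VALUES (1, 1960), (2, 1985), (3, 1972);
    INSERT INTO condition_occurrence VALUES
        (10, 1, 201826), (11, 3, 201826), (12, 2, 4329847);
""")

# The query, not the data, is what gets shared: any OMOP site can run it
# unchanged and report only the aggregate count, never patient rows.
COHORT_SQL = """
    SELECT COUNT(DISTINCT person_id)
    FROM condition_occurrence
    WHERE condition_concept_id = 201826
"""
(count,) = conn.execute(COHORT_SQL).fetchone()
print(count)  # patients in the cohort at this site
```

In practice, OHDSI tooling distributes far richer analysis packages the same way, but the principle is identical: identical schemas make code portable across institutions.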

Q: AI, machine learning, and sophisticated data mining are promising technologies to harness the power of biomedical data to improve patient care decisions or even prevent disease. What is it that we can realistically expect from these technologies to improve healthcare and in what timeframe?

A: I am hopeful that the earliest impact will come in the form of a physician or nurse assistant. AI approaches can augment a physician’s analysis or a nurse’s workflow. There are some examples of fully automated, FDA-approved AI, but generalizing these to other domains may not be trivial. An increasingly advanced airplane cockpit is a good analogy: we want our pilots assisted, not yet replaced.

Q: What is still required to take advantage of the full potential of these technologies?

A: As a community, we are in the early stages of AI-driven research. Much of our data is not yet mobilized for research. Efforts like NIST Genome in a Bottle that are truly designed to democratize data analytics are few and far between. Collaborative science continues to be challenging due to privacy and ethical issues. Even though computing has become ubiquitous, data security continues to be hard. Where AI technologies can help save lives is in regions of the world where physicians are few and far between, but deploying technologies in such regions poses its own unique challenges.

Q: What are you expecting PMWC attendees will walk away with from the PMWC 2020 Silicon Valley conference? What is the call to action that we as a community should focus on?

A: How do we mobilize data to innovate faster while preserving patient privacy? AI brings a new paradigm where we can share learning models without sharing data. Where is the healthcare AI Commons?
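The paradigm of sharing learning models without sharing data can be illustrated with a bare-bones federated averaging sketch: each hospital trains locally and only the fitted parameters leave the site. The two sites, their data, and the single-weight linear model are hypothetical, chosen to keep the example self-contained.

```python
# Minimal federated-averaging sketch: share parameters, never patient data.

def local_fit(xs, ys, w=0.0, lr=0.01, steps=100):
    """One site's training: gradient descent on y ~ w * x, local data only."""
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

# Each site's raw data stays behind its own firewall (values invented,
# both roughly following y = 2x).
site_a = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
site_b = ([1.0, 2.0, 4.0], [1.9, 4.1, 8.0])

# Only the fitted parameters leave each site...
w_a = local_fit(*site_a)
w_b = local_fit(*site_b)

# ...and a commons-style aggregator averages them into a global model.
w_global = (w_a + w_b) / 2
print(round(w_global, 2))  # close to the true slope of 2
```

Production federated learning adds secure aggregation and differential privacy on top of this loop, since model parameters themselves can leak information, but the data-stays-home structure is the same.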

Q: Is there anything else you would like to share with the PMWC audience?

A: I would like to take the opportunity to start with a historical tidbit. Stanford served as the hotbed of AI in Medicine (AIM) between 1973 and 1992 with the SUMEX-AIM project. SUMEX-AIM stands for Stanford University Medical EXperimental computer for Artificial Intelligence in Medicine, a national computer resource funded by the NIH and the only non-DoD node on the ARPANET network at the time. One of the artifacts of the project is the first workshop on AIM, held at Rutgers in 1975, which discussed the tenet “First, do no harm” in the context of AI. Perhaps now, armed with more data, we can discuss this tenet more fully.