Join me in a walkthrough of my positive journey in collaborative learning through Omdena’s real-world challenges: how I went through the complete life cycle of a machine learning project for an impactful cause, analyzing and detecting anomalies in cancer clinical data, and how we deployed a REST API to detect clinical data anomalies for FlaskData in this drug safety challenge.
Author: V Mukundan
It took about a year into the COVID-19 pandemic for an mRNA-based vaccine to be developed, which underscored the importance of clinical trials of drugs. This has been a powerful development in quelling the gruesome effects of the virus, with its millions of deaths and positive cases and low herd immunity. The most important game-changer separating the parts of the world that are partially recovering from those still reeling from the pandemic has been access to vaccines.
A clinical trial procedure
A drug goes through a lengthy procedure of tests, research, and analysis in clinical trials before it becomes available for use by the public.
Phase 1 consists of testing the drug on healthy individuals, checking its safety, how the body metabolizes it, dosage, and side effects.
Phase 2 requires testing on larger groups of affected individuals, probing the drug’s efficacy, feasible protocols & tolerability.
Phase 3 involves testing on wider demographics, examining long-term effectiveness, safety for public use, adverse events, and placebo & standard comparators in randomized controlled trials.
Finally, the drug is sent to medical regulatory bodies for approval and further testing before it is deemed safe for use.
A project to increase drug safety using data science
Clinical data is exploding in variety, volume & velocity. It comes from multiple unsynchronized sources and is unstructured in time, place, or doctor visits, especially with the growing number of healthcare apps and wearable devices. The project provides a much-needed independent validation of medical decisions that affect everyone, especially vaccines for a pandemic such as COVID-19. Regulatory bodies and tech companies have enormous resources to sift through this exploding health-related data, but the modus operandi on data privacy laws, public data accessibility, and data-driven decisions affecting the entire world’s population is still in its nascent stages. Omdena and FlaskData came together to collaborate on increasing drug safety using data science. The project was initiated by FlaskData, an Israeli data automation platform that provides robust digital data management and automated monitoring tools enabling tight adherence to clinical protocols for biotech, biopharma, and connected medical devices.
Omdena is a game-changer
From a couple of connections on my LinkedIn network, I came to know about Omdena’s challenges. I was taken by this sentence on the Omdena page: ‘The world’s only place for truly collaborative AI projects to apply your skills on real-world data with change-makers from around the world.’ My initial enthusiasm led me to read a bunch of interesting blogs on Omdena & Medium about their projects involving data science for good.
Omdena is a game-changer in using AI to create social impact and makes a difference in tackling/gaining insights into societal and infrastructural challenges. It was exciting to see how people come together from across the world to work on a challenge and realize an impactful solution.
This prompted me to apply for the drug safety challenge, and I was excited to be selected to work on this project. As a novice in data science, I also found it inspiring to read about the paths taken by members of the Omdena community, from beginners to experts landing jobs in many different companies.
Detecting anomalies in clinical data
The actual challenge involved detecting anomalies in clinical data for early-stage drug & device projects and building an automated API that is model-free, works well on high-dimensional data, and scales well. The API is the minimum viable product (MVP), to be delivered after two sprints within a timeline of 8 weeks for the entire project. The data posed fundamental challenges: lack of data before clinical trials, smaller training datasets for certain ailments, high-dimensional datasets, bias, noise, lack of reliability, and uniqueness within datasets, apart from the usual challenges such as feature engineering based on domain expertise, missing data, time stamps, time series drifts & spikes, etc. The datasets explored included breast cancer datasets, migraine pain report datasets, and adverse event/placebo datasets in pancreatic cancer.
The beginning (Weeks 1 & 2) of this project was a bit overwhelming. I had never worked with so many experts from different backgrounds and skills before. The first couple of weeks were spent reading to understand the data and the technical vocabulary, as well as the garbled column names & data instances.
Even for people with ‘learn by doing‘ beliefs, it was hard to get started with just an initial look/plot of the data.
At this crucial juncture, the sense of community via the Slack channel became clear. As per Omdena’s model of collaborative learning, no one is ever assigned a task, and the project progresses in a self-driven environment where members take the initiative.
Collaboration and teamwork
Slowly, in week 3, different tasks were delineated, and various experienced members volunteered to be task leaders (TLs). Figure 1 shows the plan that materialized almost midway into the project. Figure 1(A) shows the workflow for the initial labeling, ETL, and modeling, while Figure 1(B) shows the design of the API to be constructed at the end of this project.
A great learning process came from shared code, members trying out different plots/labeling with different datasets, documentation, testing different models, weekly meetings with subject matter experts from FlaskData, and members sharing a wide variety of reading resources. As a first-timer on Omdena, I joined different tasks & decided to follow the communication between the task leaders & other members on the Slack channel, hoping to maximize my learning. When it became difficult to keep up with everything, one of the team members advised me to focus on one area rather than all of it. It was also intriguing to see that task leaders did not have all the answers and that different members worked iteratively towards the final MVP as a team.
Another important aspect of the project was the weekly status meeting, where breakthroughs related to different tasks, ideas that did not work, and the road ahead were presented. Asking for help was never frowned upon, and it helped build rapport between members. I was surprised to see posts asking for help and other members sharing ideas & resources for the benefit of all.
How real-time experience is built
In the beginning, the dataset looked indecipherable to an untrained eye, especially for people with no prior knowledge of clinical trial data, and the data types did not conform to the usual Python data types. This truly made the project a relevant real-time experience. The raw data had columns of timestamps, the device from which the data originated, heart rate, glucose levels, etc. The ETL process for this project involved more than the usual steps of correcting data types, encoding categorical variables, removing columns with 90% missing data, removing highly correlated columns, imputing data, computing the interquartile range on numerical data, etc. Iterative imputers with many regressors such as AdaBoost, Stacking, KNN, decision trees, and Bayesian Ridge, as well as simple imputers, were tested on breast cancer data.
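To make these ETL steps concrete, here is a minimal sketch using pandas and scikit-learn on synthetic data. It is not the project's pipeline: the column names, the 0.95 correlation cutoff, the 1.5 × IQR outlier rule, and the tree depth are all assumptions chosen for illustration. Only the 90% missing-data threshold and the idea of an iterative imputer wrapped around a decision tree regressor come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a clinical table; all column names are hypothetical.
heart_rate = rng.normal(75, 8, n)
heart_rate[0] = 190.0  # plant one obvious spike
df = pd.DataFrame({
    "heart_rate": heart_rate,
    "heart_rate_copy": heart_rate * 1.01,  # redundant, perfectly correlated column
    "glucose": rng.normal(100, 15, n),
    "device": rng.choice(["wearable", "clinic"], n),
    "mostly_empty": np.nan,                # column with 100% missing values
})
df.loc[rng.choice(n, 20, replace=False), "glucose"] = np.nan  # gaps to impute

# 1. Drop columns with at least 90% missing values.
df = df.loc[:, df.isna().mean() < 0.9]

# 2. One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["device"], drop_first=True, dtype=float)

# 3. Drop one column from each highly correlated pair (|r| > 0.95, assumed cutoff).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

# 4. Fill the remaining gaps with an iterative imputer around a decision tree.
imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=0),
    max_iter=10,
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# 5. Flag single-variable outliers with the 1.5 * IQR rule (assumed rule).
numeric = imputed[["heart_rate", "glucose"]]
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outliers.sum())
```

Swapping the `estimator` argument for another regressor (for example `BayesianRidge` or `KNeighborsRegressor`) is one way such imputers could be compared against each other on the same data.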
Clinical data and best models
The imputation tests concluded that the decision tree regressor was the best. The data were in JSON format along with data dictionaries. The anomalies to be classified were separated into single-variable (L1) and multivariate (L2+) anomalies. Unsupervised methods such as DBSCAN (density-based spatial clustering of applications with noise), Gaussian mixture models, and the decision tree-based isolation forest were explored. Semi-supervised methods with label propagation and KNN were also probed. Additional insight into racial & ethnic bias was also explored.
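As an illustration of the semi-supervised idea, the sketch below runs scikit-learn's `LabelPropagation` with a KNN kernel on synthetic two-dimensional data. It assumes that a handful of records have been labeled by experts as normal (0) or anomalous (1) and that the rest are unlabeled (-1). The data, the number of labeled records, and `n_neighbors=7` are assumptions for illustration, not the project's setup.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)

# Synthetic data: a large "normal" cluster and a small, distant "anomalous" one.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalous = rng.normal(loc=8.0, scale=0.5, size=(20, 2))
X = np.vstack([normal, anomalous])
y_true = np.array([0] * 200 + [1] * 20)

# Pretend only 15 records carry an expert label; -1 marks "unlabeled".
y_partial = np.full(len(X), -1)
labeled = np.concatenate([np.arange(10), np.arange(200, 205)])
y_partial[labeled] = y_true[labeled]

# Labels spread from labeled records to their nearest neighbors.
model = LabelPropagation(kernel="knn", n_neighbors=7, max_iter=1000)
model.fit(X, y_partial)

predicted = model.transduction_  # one inferred label per record
accuracy = (predicted == y_true).mean()
print(f"share of records labeled correctly: {accuracy:.2f}")
```

The appeal of this approach is that expert labeling effort stays small, although it only works when anomalies sit close to other labeled anomalies in feature space.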
DBSCAN is advantageous in separating high- and low-density clusters, but the epsilon parameter specifying the radius of the neighborhood was not easy to choose for some datasets, and anomaly scores were not available. Gaussian mixture models are attractive because their thresholds can be set dynamically, but the number of clusters must be decided beforehand, which is not preferable when the nature of the critical event is not completely defined. In the case of a critical/adverse event or an anomaly, the number and nature of changes in different variables and the patterns in the data are not fully known, so detecting them is difficult. Isolation forests use recursive partitioning that can be represented by tree structures with built-in anomaly scores. The main disadvantage of the isolation forest is that the branching of trees introduces an artifact of rectangular regions (a bias) in either the horizontal or the vertical direction when the anomaly scores are mapped for all instances in the data (https://en.wikipedia.org/wiki/Isolation_forest, https://ieeexplore.ieee.org/document/8888179). This is solved in the extended isolation forest algorithm by selecting a branch cut with a random slope. Its performance & precision scores turned out to be high, and it worked well with categorical and unbalanced data. The final model was chosen after several iterations and several meetings with the staff & subject matter experts of both the FlaskData and Omdena teams.
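The trade-offs between these three detectors can be seen in a small scikit-learn sketch on synthetic data with a few planted anomalies. The parameter values (`eps`, `n_components`, the 2% thresholds) are assumptions for illustration. scikit-learn ships the standard isolation forest only, so the extended variant with random-slope cuts is not shown here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Synthetic data: 300 "normal" points plus 5 planted anomalies far from the bulk.
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0], [-8.0, -9.0], [0.0, 12.0]])
X = np.vstack([normal, planted])

# DBSCAN: points outside every dense region get the label -1.
# It yields a yes/no flag only, and the result depends heavily on eps.
dbscan_flags = DBSCAN(eps=0.8, min_samples=5).fit_predict(X) == -1

# Gaussian mixture: n_components must be fixed up front; a low log-likelihood
# under the fitted mixture serves as the anomaly signal.
gmm = GaussianMixture(n_components=1, random_state=0).fit(X)
log_likelihood = gmm.score_samples(X)
gmm_flags = log_likelihood < np.percentile(log_likelihood, 2)  # adjustable threshold

# Isolation forest: built-in anomaly scores (lower means more anomalous).
forest = IsolationForest(n_estimators=200, contamination=0.02, random_state=0).fit(X)
forest_scores = forest.score_samples(X)
forest_flags = forest.predict(X) == -1

for name, flags in [("DBSCAN", dbscan_flags), ("GMM", gmm_flags), ("IsolationForest", forest_flags)]:
    print(f"{name}: flagged {flags.sum()} points, {flags[-5:].sum()} of the 5 planted ones")
```

Because the Gaussian mixture and the isolation forest both return a continuous score, their thresholds can be tuned after fitting, whereas DBSCAN has to be refit whenever `eps` changes.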
Deployment and REST API
As a test for minimum functionality and as a first step towards MVP delivery, two sprints spanning seven consecutive days were implemented. This involved a customer-facing API that executes a service with a data payload and returns response data with an ‘isAnomaly’ tag to the user in case of an adverse event. Moving to the final MVP required deploying a customer-centric solution in the FlaskData cloud production environment. The REST API was constructed with three different endpoints: missing data anomalies (L1), other anomalies (L2+), and race & ethnic bias with variables such as investigator, event, subject, country, ethnicity, age group, etc. The future of the designed scalable API looks very promising, with customizable endpoints, model monitoring & validation, integrated active learning across many more datasets, and benchmarking of different models.
The final takeaway from my first project with Omdena was that no one knows all the solutions to different challenges, but they can be found with the combined strength of the team.
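The request/response shape of such an endpoint can be sketched with the Python standard library alone. This is not the deployed FlaskData API: the path `/anomalies/l1`, the field names, and the `missingFields` key are hypothetical, and only the L1 (missing data) check and the ‘isAnomaly’ tag follow the description above. A production service would more likely sit behind a web framework and call a trained model for the L2+ endpoint.

```python
import io
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical fields that every record is expected to carry.
REQUIRED_FIELDS = ("subject", "timestamp", "heart_rate", "glucose")


def l1_missing_data(record):
    """Return the required fields that are absent or null in one record."""
    return [field for field in REQUIRED_FIELDS if record.get(field) is None]


def respond(start_response, status, body):
    """Send a JSON response with the given status line."""
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(body).encode("utf-8")]


def app(environ, start_response):
    """Minimal WSGI app with one hypothetical endpoint: POST /anomalies/l1."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/anomalies/l1":
        return respond(start_response, "404 Not Found", {"error": "unknown endpoint"})

    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        records = json.loads(environ["wsgi.input"].read(length) or b"[]")
    except json.JSONDecodeError:
        records = None
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        return respond(start_response, "400 Bad Request",
                       {"error": "payload must be a JSON list of records"})

    results = []
    for record in records:
        missing = l1_missing_data(record)
        results.append({"subject": record.get("subject"),
                        "isAnomaly": bool(missing),
                        "missingFields": missing})
    return respond(start_response, "200 OK", {"results": results})


def call(path, payload):
    """Invoke the app in-process, without opening a network socket."""
    body = json.dumps(payload).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update({"REQUEST_METHOD": "POST", "PATH_INFO": path,
                    "CONTENT_LENGTH": str(len(body)), "wsgi.input": io.BytesIO(body)})
    status = []
    chunks = app(environ, lambda s, headers: status.append(s))
    return status[0], json.loads(b"".join(chunks))


status, response = call("/anomalies/l1", [
    {"subject": "S-001", "timestamp": "2021-03-01T08:00:00", "heart_rate": 72, "glucose": 95},
    {"subject": "S-002", "timestamp": "2021-03-01T08:05:00", "heart_rate": None, "glucose": 110},
])
print(status, response)
```

To try it over HTTP, the same `app` callable can be handed to `wsgiref.simple_server.make_server("", 8000, app)` and served with `serve_forever()`.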
To make the most of the collaborative environment, it is important to communicate openly and ask for help from other team members.
The beginning of the project may be frustrating, as it is an unstructured new team, but the team will come together to figure out the individual tasks, stage gates, and milestones on the way to the final goal.
I am grateful to be part of this team, which taught me a lot about project management, technical resources, small individual contributions, software literacy, and much more.
At the beginning of the project, I did not know anything about anomaly detection or isolation forests, but with this project and the support of the team, I was able to understand this model and adopt it for smaller projects.
It was heartening to see a team mostly made of software engineers, data scientists, and budding data enthusiasts from non-biology or non-pharmacology backgrounds build an application to monitor data quality, adverse events, and bias in clinical trials.