Improving the Lives of Cancer Patients by Identifying Existing Non-Cancer Generic Drugs
Reboot Rx is a tech nonprofit startup dedicated to fast-tracking the development of affordable cancer treatments. Its strategy leverages repurposed generic drugs, AI technology, and innovative funding models. In this project, 50 technology changemakers developed automated methods to extract and store patient counts and study results from unstructured text in published clinical studies.
The problem
Reboot Rx is solving a big problem. Each year worldwide, 17 MILLION people are diagnosed with cancer, 10 MILLION people die from cancer, and $1 TRILLION is spent on cancer care.
Clinical studies, like clinical trials and observational studies, assess whether a drug intervention is effective for treating a disease. Many clinical studies have evaluated non-cancer generic drugs for the treatment of cancer. Reboot Rx is interested in synthesizing information from publications describing these studies in order to identify the most promising repurposing opportunities.
Publications of clinical studies report numerical values, including patient counts (referred to as ‘sample size’) and study results (referred to as ‘outcome measures’), in an unstructured narrative text format. The goal of this project was to automatically extract these values and store them in a structured format for analysis.
For example, the sentence “the overall response rate was 45% and 30% with metformin and pravastatin respectively” reports the response to two drugs, metformin and pravastatin, in a patient population. Reboot Rx wanted to extract the numerical outcome measures (45% and 30%) from the text and label these values with an outcome label (‘response rate’). The difficulty of this task ranges from easy to very challenging depending on the type and scope of the clinical study. Clinical studies reporting a single outcome for two treatment groups would be considered easy to tackle. On the other hand, clinical trials with multiple comparison groups (three or more drug arms evaluated simultaneously) or studies with drug combinations (each intervention arm containing two or more drugs) add complexity to the task.
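The easy two-arm case from the example sentence can be sketched with a few rules. This is a minimal illustration only, not the method the team used: the label list, the regular expressions, and the function name `extract_outcomes` are all assumptions made for this example, and the sketch deliberately gives up on anything other than exactly two arms and two percentages.

```python
import re

# Hypothetical label list; in the project, Reboot Rx supplied the real one.
OUTCOME_LABELS = ["response rate"]

# A number followed by a percent sign, e.g. "45%" or "12.5 %".
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")
# The two-arm phrasing "with A and B respectively" (single-word drug names only).
ARMS = re.compile(r"with\s+(\w+)\s+and\s+(\w+),?\s+respectively")


def extract_outcomes(sentence):
    """Return one record per drug arm, or an empty list if the sentence
    does not fit the simple two-arm pattern assumed here."""
    text = sentence.lower()
    label = next((name for name in OUTCOME_LABELS if name in text), None)
    values = [float(v) for v in PERCENT.findall(text)]
    arms = ARMS.search(text)
    if label is None or arms is None or len(values) != 2:
        return []
    # "respectively" pairs the values with the drugs in order of appearance.
    return [
        {"drug": drug, "outcome": label, "value": value, "unit": "%"}
        for drug, value in zip(arms.groups(), values)
    ]


records = extract_outcomes(
    "the overall response rate was 45% and 30% with metformin "
    "and pravastatin respectively"
)
```

Rules like these break quickly on three-arm trials or drug combinations, which is why the harder cases call for labeled training data and learned models.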
Current NLP techniques and language models do a reasonable job of extracting numerical values. But these methods are not tailored for the specific task proposed above. To label extracted values with a high level of accuracy, participants needed to enhance supervised and semi-supervised modeling techniques by applying unique rules-based methods and/or curating a training dataset of labeled data.
The project outcomes
The team built an automated data extraction and classification pipeline to create a structured database, at scale, of outcome measures contained in text-based clinical study abstracts. Data extraction was limited to text contained in study abstracts (not the full study text). Reboot Rx provided a list of clinical studies with study ID, title, and abstract. To contain the scope of this exercise, Reboot Rx also provided a list of outcome measures, outcome labels, and expected data formats (%, months, days) for extraction. The accuracy of extracted data and labels was evaluated using a test dataset curated by the Reboot Rx team.
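One way to picture the structured output and its evaluation is sketched below. The record fields, the label-to-unit mapping, and the scoring function are assumptions made for illustration; the source does not describe the actual schema or the metric Reboot Rx used.

```python
from dataclasses import dataclass

# Hypothetical mapping from outcome label to expected unit;
# the real list of labels and formats came from Reboot Rx.
EXPECTED_UNITS = {
    "response rate": "%",
    "overall survival": "months",
}


@dataclass(frozen=True)
class OutcomeRecord:
    """One extracted value, keyed by the study it came from."""
    study_id: str
    outcome: str
    value: float
    unit: str


def is_valid(record):
    """True when the record's unit matches the unit expected for its label."""
    return EXPECTED_UNITS.get(record.outcome) == record.unit


def fraction_recovered(predicted, curated):
    """Share of curated test records that the extraction reproduced exactly."""
    if not curated:
        return 0.0
    return len(set(predicted) & set(curated)) / len(curated)


# Illustrative records only; "S1" is a made-up study ID.
good = OutcomeRecord("S1", "response rate", 45.0, "%")
bad_unit = OutcomeRecord("S1", "response rate", 45.0, "months")
missed = OutcomeRecord("S1", "response rate", 30.0, "%")
score = fraction_recovered([good], [good, missed])
```

Exact matching is the strictest possible comparison; a real evaluation might score values and labels separately.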
Project outcomes included machine learning models to classify defined variables, curated training annotations, and extracted datasets using an NLP pipeline.
Your benefits
Join a thriving AI community in 85 countries
Work with changemakers from around the world
Address a real-world problem with your skills
Build up your skill-set while setting the stage for a meaningful career
Requirements
Good English
A good/very good grasp of computer science and/or mathematics
Student, (aspiring) data scientist, (senior) ML engineer, data engineer, or domain expert (no need for AI expertise)
Programming experience with C/C++, C#, Java, Python, JavaScript, or similar
Understanding of ML and deep learning algorithms