Machine Learning for Risk Prediction of Colon and Lung Cancer


The Omdena team developed and deployed two machine learning-enabled applications, one for lung cancer prediction and one for colorectal cancer (CRC) prediction. Users of the apps can enter a patient's input data and receive the likelihood of cancer as output.

The partner for this project, Radmol AI, is a Dublin-based company, supported by Microsoft for Startups, on a mission to minimize the risk of delays and errors in medical diagnosis.


The problem

According to the WHO (World Health Organization), 80% of countries do not have early detection programs or guidelines for childhood cancer, while 67% of countries do not have a defined referral system for childhood cancer.

As a result, more than 80 percent of children diagnosed with cancer in high-income countries are cured, while the figure is only around 20 percent in many emerging countries. The disparity results largely from late or inaccurate diagnosis, among other factors.

Many cancer patients could be saved from premature death and suffering if they had timely access to early detection programs and adequate treatment. Radmol AI's solution will facilitate early detection and prompt intervention, minimizing needless loss of life and strain on the economic resources of individuals and countries.


The project outcomes

Within 10 weeks, the team covered the following steps:

  1. Collecting datasets for cancer patients with previous symptoms, demographics data, and ailments
  2. Labeling datasets appropriately
  3. Training and testing various machine learning models for lung and colorectal cancer (CRC) prediction
  4. Deploying two applications to visualize the predictions (see the screenshot below) 
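The prediction step behind such an app can be sketched as a probabilistic classifier over tabular patient data. The features, synthetic data, and model choice below are illustrative assumptions for the sketch, not the team's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative features: [age, smoking_years, family_history (0/1)]
X = np.array([
    [45, 0, 0], [62, 30, 1], [55, 20, 0], [38, 0, 0],
    [70, 40, 1], [50, 10, 0], [65, 35, 1], [42, 5, 0],
])
y = np.array([0, 1, 0, 0, 1, 0, 1, 0])  # 1 = cancer diagnosed

model = LogisticRegression().fit(X, y)

# For a new patient, predict_proba returns [P(no cancer), P(cancer)]
patient = np.array([[60, 25, 1]])
likelihood = model.predict_proba(patient)[0, 1]
print(f"Estimated cancer likelihood: {likelihood:.2f}")
```

In a deployed app, the same `predict_proba` call would sit behind the form where the user enters the patient's data.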



Application Screenshot



Cancer Drugs Survival Analysis to Support Affordability of Immunotherapy Treatments


The team explored various models available in the survival analysis literature and identified the best-performing algorithms. The model predicts the survival probability of a patient and the next treatment period for specific, often less costly, drugs. 

The project partner, Mango Sciences, is a leading Boston-based data science company for emerging markets, connecting millions of underrepresented patients to precision medicine. The company's Querent™ platform utilizes industry-leading AI analytics to transform deep clinical data into key insights that drive global health improvements.


The problem

The vast majority of patients diagnosed with cancer cannot afford life-extending targeted immunotherapies due to their high price tags. As a result, most patients use older chemotherapy medications, which have significant side effects and poorer outcomes.

Survival Analysis is a branch of statistics developed initially to analyze the expected duration of lifespans of individuals. It is also known as duration analysis, time-to-event analysis, reliability analysis, and event history analysis. In the case of cancer treatments, it can be used to predict the survival probability of a patient or the next treatment period for specific drugs. 
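The central object of survival analysis, the survival curve, can be estimated non-parametrically with the classic Kaplan-Meier product-limit estimator. A minimal sketch in plain NumPy, with illustrative durations rather than the project's data:

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return event times and the survival probability after each one.
    events[i] = 1 if the event (e.g. death) was observed, 0 if censored."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    times = np.unique(durations[events == 1])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)               # patients still under observation
        died = np.sum((durations == t) & (events == 1))
        s *= 1.0 - died / at_risk                      # product-limit step
        surv.append(s)
    return times, np.array(surv)

# Months until death (event=1) or loss to follow-up (censored, event=0)
times, surv = kaplan_meier([3, 5, 5, 8, 12, 12, 15], [1, 1, 0, 1, 1, 0, 0])
```

Censored patients (those still alive when observation ended) remain in the at-risk count until their censoring time, which is what distinguishes this from a naive survival fraction.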


The project outcomes

Mango Sciences has developed a financing product for immunotherapies that lets patients' families pay for their drugs over a period of time, paying the full value only if the drug delivers a clinical benefit. The company is also building predictive algorithms to identify which type of drug works best in which patients based on their specific characteristics. Fundamentally, the right drug should go to the right patient at the right time.

The team explored the standard set of models available in the survival analysis literature and identified the best-performing model with the highest concordance index. An example visualization and prediction in Tableau can be found below. 
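The concordance index used to rank the models measures how often a model's predicted risks order patient pairs consistently with their observed survival times. A minimal sketch with illustrative numbers:

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs whose predicted risks are ordered
    consistently with observed survival; tied predictions count half."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if i's event occurred before time j
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:       # higher risk, earlier event: concordant
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Risks perfectly anti-ordered with survival times give a c-index of 1.0
c = concordance_index([2, 4, 6], [1, 1, 1], [0.9, 0.5, 0.1])  # → 1.0
```

A c-index of 0.5 corresponds to random ranking, so models are compared on how far above 0.5 they score.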


Survival Analysis using AI

Survival Analysis in Tableau;  Source: Omdena


Detecting Pathologies Through Computer Vision in Ultrasound


Envisionit Deep AI is an innovative medical technology company using artificial intelligence to transform medical imaging diagnosis and democratize access to healthcare. In this two-month Omdena Challenge, 50 technology changemakers built an ultrasound solution that detects the type and location of different pathologies. The solution works with 2D images and can also process a video stream.


The problem

Health care services in Africa are under-resourced and overused. Africa is the youngest continent in the world, and pneumonia is the number one cause of death there in children younger than five. Breast cancer is the most frequently diagnosed cancer among women, impacting over two million women worldwide each year and causing the greatest number of cancer-related deaths among women. While breast cancer rates are higher among women in more developed regions, rates are increasing in nearly every region globally, with some of the most rapidly rising incidence rates found in African countries.

Ultrasound is a relatively inexpensive and portable modality for diagnosing life-threatening diseases and for point-of-care use. The procedure is non-invasive and quickly gives doctors the information necessary to make a diagnosis. Sonography machines are also becoming steadily smaller, making them more accessible to developing countries.

An AI solution integrated with a mobile ultrasound tool aims to achieve radiologist-level diagnostic performance on ultrasound images. This will help deliver impactful and feasible medical solutions to countries facing significant resource challenges.


The project outcomes 

The AI solution is split into the following components:



1. Image preprocessing and normalization

Envisionit Deep AI has access to several ultrasound datasets that were provided for the project. These datasets come in a number of different formats, resolutions, and quality settings, as will also be the case in production environments: different practices and/or hospitals use different ultrasound equipment that stores images in a variety of formats and quality settings. The most common storage format is DICOM, archived in a practice or hospital PACS (picture archiving and communication system). Envisionit Deep AI already has the ability to interface with PACS platforms to retrieve and exchange images. However, for this project, an additional image normalization routine was developed to ensure the consistency of images across the training, testing, and production datasets.
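A hypothetical normalization routine along these lines rescales pixel intensities from arbitrary bit depths to 8-bit and resamples frames to a fixed resolution. The project's actual routine is not public, and DICOM loading (e.g. via pydicom) is omitted here; this is only a sketch of the idea:

```python
import numpy as np

def normalize(image, size=(256, 256)):
    """Min-max scale any-bit-depth image to uint8 and resample to `size`
    with nearest-neighbour indexing."""
    img = image.astype(np.float64)
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
    rows = (np.arange(size[0]) * img.shape[0] / size[0]).astype(int)
    cols = (np.arange(size[1]) * img.shape[1] / size[1]).astype(int)
    return (img[np.ix_(rows, cols)] * 255).astype(np.uint8)

# A 16-bit frame of any resolution comes out as a uniform 256x256 uint8 image
frame = (np.random.rand(480, 640) * 65535).astype(np.uint16)
out = normalize(frame)
```

Whatever the equipment vendor, every image entering training or inference then shares one intensity range and resolution.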


2. Model training

Although the title of this component refers to training an AI model, training must be preceded by algorithm selection and design. The algorithm must fulfill the following requirements:

  • Identification of pathologies from a set of pre-defined pathologies on a given image/video frame. Initially, the focus is on 10+ pathologies/labels, but the algorithm should be capable of identifying more.
  • Identification of the location of those pathologies – object detection rather than classification alone.
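To make "detection rather than classification" concrete: a detection output pairs each pathology label with a bounding box, and localization quality is typically scored with intersection-over-union (IoU). The names and numbers below are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str        # pathology name from the pre-defined set
    box: tuple        # (x_min, y_min, x_max, y_max) in pixels
    confidence: float

def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

pred = Detection("pneumonia", (10, 10, 50, 50), 0.91)
truth = (12, 12, 48, 52)
score = iou(pred.box, truth)  # overlap between prediction and ground truth
```

A classifier would only emit the label; a detector must also produce the box, which is what lets a radiologist verify where the model is looking.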


3. Model validation, including field specialist review

This step involves all the tools needed to extract model performance metrics, as well as a UI that lets a field specialist (radiologist) perform an independent model validation with their own dataset. This dataset may or may not include images from the testing dataset used during model training.

The aim of this validation step is to ensure adequate model performance. Envisionit Deep AI has set a performance threshold of 95% accuracy or above before a model is considered ready for field pilots and eventual deployment. Only models with a combined (automated and field specialist validation) accuracy of 98% or higher are considered for production environments.


4. Field deployment, including collecting concordance/discordance feedback

With all AI models deployed by Envisionit Deep AI, users are given the ability to provide concordance/discordance feedback. These statistics are collected not merely as status feedback (agree/disagree) but also by allowing users to augment AI predictions: adjusting the locations of identified pathologies as well as adding or removing pathologies from the identified set.

Envisionit Deep AI predominantly used NVIDIA-based GPUs, and all cloud and on-premise Docker hosts include support for NVIDIA GPUs in containers themselves.


5. Concordance/discordance field specialist review 

This is an important step to ensure that the model is not fed incorrect data for further training. Thus, concordance/discordance feedback is validated by a radiologist before it is added to the training set (uploaded to the AWS S3 bucket that training containers use).

A certain level of automation should be available to ensure that Envisionit Deep AI internal staff, as well as any Field Specialist consultants, are not overwhelmed with feedback validation.

This includes automated quorum testing using a number of different models (from different training stages or input datasets, perhaps limited to a specific image type, image quality, or image content, i.e., a specific body part). Only images whose discordance feedback deviates significantly from the expected norm should be forwarded for human validation.
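The quorum idea can be sketched as running several models over each feedback case and forwarding only those where the models disagree. The stand-in "models" below are simple threshold callables, an assumption made purely for illustration:

```python
def needs_human_review(case, models, max_distinct=1):
    """Forward a case for review when model predictions are not
    unanimous enough (more than `max_distinct` distinct answers)."""
    predictions = [m(case) for m in models]
    return len(set(predictions)) > max_distinct

# Three stand-in "models" that threshold a score differently
models = [lambda c: c > 0.5, lambda c: c > 0.4, lambda c: c > 0.6]
print(needs_human_review(0.55, models))  # models disagree → True
print(needs_human_review(0.9, models))   # unanimous → False
```

Cases where the quorum agrees can be auto-accepted or auto-rejected, so radiologists only see the genuinely ambiguous feedback.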


Improving the Lives of Cancer Patients by Identifying Existing Non-Cancer Generic Drugs


Reboot Rx is a tech nonprofit startup dedicated to fast-tracking the development of affordable cancer treatments. Its strategy leverages repurposed generic drugs, AI technology, and innovative funding models. 50 technology changemakers developed automated methods to extract and store patient counts and study results from unstructured text in published clinical studies.


The problem

Reboot Rx is solving a big problem. Each year worldwide, 17 million people are diagnosed with cancer, 10 million people die from cancer, and $1 trillion is spent on cancer care.


AI cancer drugs

Source: Reboot Rx

Clinical studies, like clinical trials and observational studies, assess whether a drug intervention is effective for treating a disease. Many clinical studies have evaluated non-cancer generic drugs for the treatment of cancer. Reboot Rx is interested in synthesizing information from publications describing these studies in order to identify the most promising repurposing opportunities.

Publications of clinical studies report numerical values, including patient counts (referred to as 'sample size') and study results (referred to as 'outcome measures'), in an unstructured narrative text format. The goal of this project was to automatically extract these values and store them in a structured format for analysis.

For example, the sentence "the overall response rate was 45% and 30% with metformin and pravastatin respectively" reports the response to two drugs, metformin and pravastatin, in a patient population. Reboot Rx wanted to extract the numerical outcome measures (45% and 30%) from the text and label these values with an outcome label ('response rate'). The difficulty of this task ranges from easy to very challenging depending on the type and scope of the clinical study. Clinical studies reporting a single outcome across two treatment groups are easy to tackle. On the other hand, clinical trials with multiple comparison groups (three or more drug arms evaluated simultaneously) or studies with drug combinations (each intervention arm containing two or more drugs) add complexity to the task.
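On the easy end of that spectrum, a rule-based pass over the example sentence can be sketched with regular expressions. The label vocabulary below is a tiny assumed subset; real abstracts need far more robust patterns plus the trained labeller described next:

```python
import re

sentence = ("the overall response rate was 45% and 30% "
            "with metformin and pravastatin respectively")

# Pull every percentage value (the numeric outcome measures)
values = re.findall(r"(\d+(?:\.\d+)?)%", sentence)

# Match against a small assumed vocabulary of outcome labels
label_match = re.search(r"(response rate|survival|remission)", sentence)
label = label_match.group(1) if label_match else None

print(values, label)  # ['45', '30'] response rate
```

Pairing each extracted value with the correct drug arm is exactly where the multi-arm and combination studies described above become hard for rules alone.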

Current NLP techniques and language models do a reasonable job of extracting numerical values, but these methods are not tailored to the specific task described above. To label extracted values with a high level of accuracy, participants needed to enhance supervised and semi-supervised modeling techniques by applying rule-based methods and/or curating a training dataset of labeled data.


The project outcomes

The team built an automated data extraction and classification pipeline to create a structured database, at scale, of outcome measures contained in text-based clinical study abstracts. Data extraction was limited to text contained in study abstracts (not the full study text). Reboot Rx provided a list of clinical studies with study ID, title, and abstract. To contain the scope, Reboot Rx also provided a list of outcome measures, outcome labels, and expected data formats (%, months, days) for extraction. The accuracy of extracted data and labels was evaluated using a test dataset curated by the Reboot Rx team.

Project outcomes included machine learning models to classify defined variables, curated training annotations, and extracted datasets using an NLP pipeline.