AI for Malaria Prevention: Identifying Water Bodies Through Satellite Imagery


By Tanmay Laud


We combined satellite images, topography data, population density, and other data sources to build an algorithm that identifies areas where stagnant water bodies (malaria mosquito breeding sites) are likely to exist. The model helps identify breeding sites more quickly and accurately.

You have probably read the famous quote by Andrew Ng that highlights the importance of data in today’s world: “It’s not who has the best algorithm that wins. It’s who has the most data.” The statement holds true for most data-driven applications. However, the required amount of data is not always available.

“Data is the new oil”- Clive Humby

If you come from the Kaggle world, then the problems of data sourcing might not be known to you. The online competitions begin with a rich corpus of data (that has been annotated and verified by large teams). But, at Omdena, our journey begins with the task of data acquisition. It is a challenging task, especially in the Social Good space, since not many tech giants are focusing on these problems, and as a result, data is not readily available for analysis and churning.

Thus, data collection from the right sources becomes a critical exercise in a machine learning project. That brings us to the Zzapp Malaria project, tackled by 50 collaborators from across the world with a common objective: to provide AI-driven mechanisms that detect water bodies prone to mosquito breeding in order to prevent malaria. Let me talk about how my team eventually built a productive dataset from an initially minimal one.


The Problem Statement


Explaining the purpose/ problem statement of this project

High-Level Objective


The project falls under the UN’s Sustainable Development Goal 3, which includes the target to “end the epidemics of AIDS, tuberculosis, malaria and neglected tropical diseases” by 2030. Given a region, our task was to automatically identify areas where water bodies are present. We achieved this by pre-surveying areas for potential mosquito breeding sites using AI models applied to satellite imagery, topography analysis, and geo-referenced data. This allows for more cost-effective surveys in new areas.

As you might have realized by now, money has a big impact on such a project. To cover a large area such as Ghana or Kenya, you need to direct your resources to the most susceptible regions in the most cost-efficient manner, and within a limited amount of time. Time is limited because the water bodies have to be treated before the wet season arrives and mosquito breeding rises.

The dataset that we received covered Ghana and the Amhara region of Ethiopia.

What’s interesting? The data did not come all at once.

Zzapp Malaria was surveying these regions during this phase, so the data came in periodic batches. Most of the data was sourced during the project itself, as it was the wet season in these areas. As an AI engineer or data scientist, you need to align your game plan with this flow of incoming information.


Highlighted grids have a higher risk of containing water bodies


The dataset comprised 3-meter resolution satellite images (each image covered 100×100 meters) with labels indicating the number of natural and artificial water bodies in these regions. The labels were based on a survey conducted by Zzapp field workers. Each image corresponded to a 100×100 meter MGRS (Military Grid Reference System) grid cell. We had approximately 200 grids to start with and around 1500 images by the end of the project.


Overcoming data challenges

We had the following challenges in this project:

  • Lack of enough data. As explained earlier, the Zzapp data arrived in periodic batches. Also, since not many organizations are working on data collection or using artificial intelligence for malaria elimination, there was no pre-annotated dataset (at the time of this project) that one could download to get started.
  • Lack of high-resolution data. The dataset had a 3-meter resolution, which is better than most satellite image sources but still not as detailed as a Google Maps image.
  • Imagery alone cannot always convey the presence (or probability) of water accumulation in an area. Consider, for example, water collected in a canal covered by a roof. This would be impossible to detect with satellite imagery.


How we did it

To address the lack of data, we devised a two-step approach. First, we detected the presence of large water bodies (lakes, rivers, streams, ponds). We did this with state-of-the-art vision models such as DeepWaterMap, which produces a water probability map for a given grid. This was, in itself, a useful way to trace the surrounding regions of interest, since humans tend to settle around large water bodies.

Next, we used the output of the above model as a proxy variable to further detect the risk of water accumulating in smaller cross-sections.
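As a rough illustration of the proxy idea, the sketch below summarizes a water-probability raster into per-grid-cell features (mean and maximum probability). The raster shape, the cell size in pixels, and the function name `grid_water_features` are assumptions made for this example; they are not taken from the project code.

```python
import numpy as np

def grid_water_features(prob_map: np.ndarray, cell_px: int) -> np.ndarray:
    """Summarize a water-probability raster into per-cell features.

    prob_map: 2-D array of probabilities in [0, 1].
    cell_px:  side length of a square grid cell in pixels (assumed).
    Returns an array of shape (rows, cols, 2) holding the mean and the
    maximum probability inside each cell.
    """
    h, w = prob_map.shape
    rows, cols = h // cell_px, w // cell_px
    # Crop so the raster divides evenly into cells, then block-reshape.
    cropped = prob_map[: rows * cell_px, : cols * cell_px]
    blocks = cropped.reshape(rows, cell_px, cols, cell_px)
    mean = blocks.mean(axis=(1, 3))
    peak = blocks.max(axis=(1, 3))
    return np.stack([mean, peak], axis=-1)

# Toy example: a 4x4 raster split into 2x2-pixel cells.
demo = np.array([
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.2, 0.4, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
])
features = grid_water_features(demo, cell_px=2)
```

Features like these could then be fed to a second model alongside the other data sources described below.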


Provided images v/s Google Hybrid images


To compensate for the lack of resolution, we created a pipeline that extracts more detailed images from the Google Satellite Hybrid service for the grids given to us. You can see the difference in detail in the three reference images. You might wonder, why not use super-resolution instead? Super-resolution could introduce artifacts that deviate from the ground truth.

Further, since these images alone cannot comprehensively convey water presence, we created more features using population density, vegetation indices, topography, and landcover classifications. Let’s look at each of these factors briefly.


1. Population Density 


Population density v/s Land distance to water


Research suggests that as the population density in a given region rises, the land distance to water decreases. We leveraged this relationship to estimate the risk of mosquito breeding grounds based on how densely populated a region is. The graph roughly highlights this inverse relationship.


2. Vegetation Indices and Height Above Nearest Drainage (HAND)



Vegetation masks calculated over the region of Ghana and Amhara


A Vegetation Index (VI) is a spectral transformation of two or more bands designed to enhance the contribution of vegetation properties and allow reliable spatial and temporal inter-comparisons of terrestrial photosynthetic activity and canopy structural variations. Dual-polarised (VV and VH) Sentinel-1 Ground Range Detected (GRD) scenes were acquired from Google Earth Engine. All scenes were pre-processed using the following steps:

  • Thermal noise removal
  • Radiometric calibration
  • Terrain correction

The HAND data was also exported using Earth Engine. This was used to help eliminate false positives located above the drainage line.
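One way to picture the HAND filter is as a simple mask over the predicted water probabilities. The sketch below is illustrative only: the 10-meter cutoff and the function name `mask_by_hand` are assumptions, not values reported by the project.

```python
import numpy as np

def mask_by_hand(water_prob, hand_m, max_hand_m=10.0):
    """Zero out water probabilities where HAND exceeds a cutoff.

    water_prob: array of predicted water probabilities.
    hand_m:     array of Height Above Nearest Drainage values in meters.
    max_hand_m: assumed cutoff; pixels higher than this above the
                drainage line are treated as unlikely to hold water.
    """
    water_prob = np.asarray(water_prob, dtype=float)
    hand_m = np.asarray(hand_m, dtype=float)
    return np.where(hand_m <= max_hand_m, water_prob, 0.0)

prob = np.array([[0.9, 0.8], [0.7, 0.1]])
hand = np.array([[1.0, 25.0], [5.0, 40.0]])
filtered = mask_by_hand(prob, hand)
```

In this toy case, the high-probability pixel sitting 25 m above the drainage line is suppressed, while the two low-lying pixels keep their scores.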


3. Landcover Classification


Landcover classification labels sourced from the LandCoverNet dataset


To gain information about the terrain, we used a labeled dataset that was released specifically for the African continent. LandCoverNet is a labeled global land cover classification dataset based on Sentinel-2 data. Version 1 of the dataset contains data across the entire African continent. The dataset is labeled on a pixel-by-pixel basis, where each pixel is identified as one of 10 land cover classes: “trees cover areas”, “shrubs cover areas”, “grassland”, “cropland”, “vegetation aquatic or regularly flooded”, “lichen and mosses / sparse vegetation”, “bare areas”, “built-up areas”, “snow and/or ice or clouds” and “open water”.


4. Topography


Digital Elevation Model data for Ghana and Amhara region


All the topographic features were calculated using SRTM v3 DEM (Digital Elevation Model) data. We used the SAGA API to pre-process the DEM dataset and generate topographic features. The DEMs were smoothed to fill in isolated elevation pits (or spikes), which typically represent errors or areas of internal drainage that interrupt the estimate of water flow. Then 17 topographic features were generated from the pre-processed elevation TIFF, including:

  • Relative Slope Position
  • Topographic Wetness Index
  • Topographic Position Index (tpi500)
  • Channel Network Distance
  • Convergence Index
  • LS Factor

After generating raster datasets for the above topographic features, these features were projected onto the polygons of interest (positive and negative scan chunks) in Ghana and Amhara. The mean, max, and min of all the pixel values within a given grid were calculated for all of the above features to aggregate them at the MGRS grid level.
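The aggregation step can be sketched with a pandas group-by. The grid IDs, feature names, and values below are made up for illustration; the project's actual pipeline worked on raster data projected onto polygons.

```python
import pandas as pd

# Hypothetical pixel-level samples: each row is one raster pixel tagged
# with the MGRS grid cell it falls in (IDs and values are invented).
pixels = pd.DataFrame({
    "grid_id": ["A", "A", "A", "B", "B"],
    "twi":     [4.0, 6.0, 8.0, 10.0, 12.0],
    "slope":   [0.1, 0.3, 0.2, 0.5, 0.7],
})

# Aggregate every feature to mean / max / min per grid cell.
grid_stats = pixels.groupby("grid_id").agg(["mean", "max", "min"])
# Flatten the (feature, statistic) column MultiIndex into single names.
grid_stats.columns = [f"{feat}_{stat}" for feat, stat in grid_stats.columns]
```

Each grid cell ends up as one row with columns such as `twi_mean` or `slope_max`, which is a convenient shape for tabular models.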

The topographic features were instrumental in detecting natural sources of water (both large and small in size) with high AUC which is evident below:


Actual v/s Predicted Labels for water bodies in Ghana Region


Bringing it all together

Using the aforementioned data sources, we ended up generating 81 features, and after a round of exploratory data analysis we settled on the 20 most relevant ones. We then set out to build and validate ensemble models that could best capture the information in each of the data sources. This allowed us to detect both natural and artificial sources of water with a high degree of recall. Higher recall was preferred because capturing all water sources mattered more than avoiding the occasional region wrongly labeled as having water. The data flow diagram aims to highlight this effort.
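To make the recall preference concrete, here is a minimal sketch of picking a decision threshold that meets a target recall. The helper `threshold_for_recall`, the toy labels, and the scores are hypothetical; the article does not state how the team tuned its models.

```python
import numpy as np

def threshold_for_recall(y_true, scores, target_recall=0.9):
    """Return the highest score threshold whose recall meets the target."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Try candidate thresholds from the highest score to the lowest.
    for t in np.sort(np.unique(scores))[::-1]:
        recall = (y_true & (scores >= t)).sum() / n_pos
        if recall >= target_recall:
            return float(t), float(recall)
    raise ValueError("target recall must be at most 1")

labels = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]
thr, rec = threshold_for_recall(labels, probs, target_recall=1.0)
```

Lowering the threshold this way trades extra false positives for fewer missed water bodies, which matches the priority described above.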



Data Flow Diagram



Analyzing Mental Health and Youth Sentiment Through NLP and Social Media


By Mateus Broilo and Andrea Posada Cardenas


We are living in an era where life passes so quickly that mental illness has become a pivotal issue, and perhaps a bridge to some other diseases.

As Ferris Bueller once said:

“Life moves pretty fast. If you don’t stop and look around once in awhile, you could miss it.”

This fear of missing out has caused people of all ages to suffer from mental health issues like anxiety, depression, and even suicidal ideation. Contemporary psychology tells us that this is expected, simply because we live on an emotional roller coaster every day.

The way our society functions in the modern day can present us with a range of external contributing factors that impact our mental health — often beyond our control. The message here is not that the odds are hopelessly stacked against us, but that our vulnerability to anxiety and depression is not our fault. — Students Against Depression

According to the WHO, good mental health is “a state of well-being in which every individual realizes his or her own potential, can cope with the normal stresses of life, can work productively and fruitfully, and is able to make a contribution to her or his community”. WordNet Search, in turn, defines it as “the psychological state of someone who is functioning at a satisfactory level of emotional and behavioral adjustment”. This is far from a perfect definition, but it hints at which indicator to look for, e.g. “emotional and behavioral adjustment”.

It’s estimated that this year (2020) around 1 in 4 people will experience mental health problems. Low-income countries in particular have an estimated treatment gap of 85%, compared with 35% to 50% in high-income countries.

Every single day, tons of information is thrown into the wormhole that is the internet. Millions of young people absorb this information and see the world through the glass of online events and others’ opinions. Social media is a playground for all this information and has a deep impact on the way our youth interact. Whether by contributing to a movement on Twitter or Facebook (#BlackLivesMatter), staying up to date with the latest news and discussions on Reddit (#COVID19), or engaging in campaigns simply for the greater good, the digital world is where the magic happens and makes worldwide interactions possible. The digital eco-not so friendly-system plays a crucial role and represents an excellent opportunity for analysts to understand what today’s youth think about their future.

Take a look at the article written by Fondation Botnar on young people’s aspirations.


The power of sentiment analysis

Sentiment analysis, a.k.a. opinion mining or emotional artificial intelligence (AI), uses text analysis and NLP to identify affective patterns present in data. A natural question, therefore, is: how do the polarities change over time?


Top Mental Health keywords from Reddit and Twitter



Violin plots

Considering a data set scraped from Reddit and Twitter from 2016–2020, these “dynamic” polarity distributions could be expressed using violin plots.



Sentiment Violin-Plot hued by Year

Sentiment Violin Plots by year. Here positive values refer to positive sentiments, whereas negative values indicate negative sentiments. A thicker part means the values in that section of the violin have a higher frequency, and a thinner part implies a lower frequency.




On the one hand, we see that as the years go by, polarity tends to become more and more neutral. On the other hand, it’s difficult to understand which sentiment falls into which category, and what the model categorizes as positive versus negative for each year. Also, text sentiment analysis is subjective and does not really capture complex expressions such as sarcasm.


Violin plots according to label

So now, the next attempt was to see polarities according to labels — anxiety, depression, self-harm, suicide, exasperation, loneliness, and bullying.



Sentiment Violin-Plot hued by Year

Sentiment Violin Plots by label



Even if we look at the polarities by label, we might end up with surface-level results instead of crisp insights. Look at Self-harm: what would positive self-harm even mean? Yet it is still there in the green plot.

We see that most of the polarities are distributed close to the limits of the neutral region, which is ambiguous since it can be viewed as either a lack of positiveness or a lack of accurate sentiment categorization. The question is — how do we gain better insights?

Maybe we try plotting the mean (average) sentiment per year per label.
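Computing that mean is a one-line group-by in pandas. The tiny data frame below is invented for illustration; the column names `year`, `label`, and `polarity` are assumptions about how the scraped posts might be stored.

```python
import pandas as pd

# Invented sample of posts with a polarity score in [-1, 1].
posts = pd.DataFrame({
    "year":     [2019, 2019, 2019, 2020, 2020, 2020],
    "label":    ["depression", "depression", "anxiety",
                 "depression", "anxiety", "anxiety"],
    "polarity": [0.2, 0.4, -0.1, 0.0, -0.3, -0.5],
})

# Mean polarity per label and year, with years spread across columns.
mean_sentiment = (
    posts.groupby(["label", "year"])["polarity"].mean().unstack("year")
)
```

The resulting table has one row per label and one column per year, which maps directly onto a line plot with one line per label.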



Mean Sentiment per Year hued by label



Notice that Depression was the only label whose mean sentiment decreased twice in a row, passing from positive (2017–2019) to neutral in 2020. Moreover, the Loneliness and Bullying classes are depicted with only one mark each, because they appear only in the data scraped from January to June 2020.


Depression-label word cloud

Before pressing on, let’s just take a look at the Depression-label word cloud. Here we can detect a lot of “emotions” besides the huge “depression” in green, e.g. “low”, “hopeless”, “financial”, “relationship”.


Keywords relating to mental health

Source: Omdena



These are just the most frequent words associated with posts labeled as Depression, and they do not necessarily convey the feelings behind the scenes. However, there is a huge “feel” there. Why? It is one of the most common words overall, the 6th most common word in the whole data set. In a more in-depth analysis aiming to find interconnections among topics, “feel” would certainly be one of the most prominent edges.



feel knowledge graph

“Feel” Knowledge Graph



This Knowledge graph shows all the nodes where “feel” is used as the edge connector. Very insightful but not very visible.

In fact, there’s a much better approach that performs text analysis across lexical categories. So now the question is: “What sort of other feelings related to mental health issues should we be looking for?”.




Empath analysis

The main objective of Empath analysis is to connect the text to a wide range of sentiments beyond just negative, neutral, and positive polarities. This lets us go far beyond trying to detect subjective feelings. For example, look at the second and third lexicons: “sadness” and “suffering”. Empath uses similarity comparisons to map the vocabulary of a text (our data set is composed of Reddit and Twitter posts) across Empath’s 200 categories.


AI Mental Health

Empathy Value VS Lexicon




The Empath value for a category is calculated by counting how many times words from that category appear, normalized by the total number of emotion words spotted and the total number of words analyzed. This lets us go much deeper and connect the sentiment in the text to a real emotion, rather than just counting the most frequent words and assuming whether they relate to something good or bad.
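The counting-and-normalizing idea can be sketched in a few lines of plain Python. This is not the Empath library itself: the two-category toy lexicon and the function `category_scores` are assumptions for illustration, and the sketch normalizes by word count only. The Empath package offers a comparable `analyze(text, normalize=True)` call over its own, much larger categories.

```python
import re
from collections import Counter

# Toy lexicon; the word lists are invented for this example.
LEXICON = {
    "sadness": {"sad", "cry", "lonely", "hopeless"},
    "nervousness": {"nervous", "anxious", "worried"},
}

def category_scores(text: str, lexicon=LEXICON) -> dict:
    """Count lexicon hits per category, normalized by total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {cat: 0.0 for cat in lexicon}
    counts = Counter()
    for word in words:
        for cat, vocab in lexicon.items():
            if word in vocab:
                counts[cat] += 1
    return {cat: counts[cat] / len(words) for cat in lexicon}

scores = category_scores("I feel sad and hopeless, and a bit anxious")
```

Normalizing by length keeps long posts from dominating short ones when the scores are averaged per year or per label.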



Empathy value vs Year

Emotion trends hued by lexicon



We chose five lexicons that might be more deeply associated with mental health issues, “nervousness”, “suffering”, “shame”, “sadness” and “hate”, and tracked these five emotions for each year analyzed, as shown in the plot. And guess what? Sadness skyrocketed in 2020.



Sentiment analysis in the post-COVID world

The year 2020 turned our lives upside down. From now on we will most likely have to rethink the way we eat, travel, have fun, work, connect,… In short, we will have to rethink our entire lives.

There’s absolutely no question that the COVID-19 pandemic plays an essential role in mental health analysis. To take these impacts into account, and since COVID-19 began to spread worldwide in January, we selected all the data from January to June 2020 to perform the analysis. Take a look at the word cloud related to the COVID-19 analysis from May and June.




COVID 19 Top keyword Analysis of Mental Health







We can see words like help, anxiety, loneliness, health, depression, and isolation. In this case, we can consider that they reflect the emotional state of people on social media. As noted earlier, sentiment analysis based on polarity tracking isn’t that insightful, but we display the violin plots below for comparison.



Sentiment Violin-plot for COVID 19 Analysis by Months



Sentiment Violin-plot for COVID 19 Analysis by Label



Now we see a very different pattern from the previous one, and why is that? Well, now we’re filtering by the COVID-19 keywords and indeed the sentiment distribution now seems to make sense. Looking more closely at the distribution of the data, the following is observed.



Graph of number of relatable words vs count of words



In the sample of texts from 2020, only 2.59% contain words related to COVID-19. The words we searched for are “corona”, “virus”, “COVID”, “covid19”, “COVID-19” and “coronavirus”. Furthermore, the number of texts decreases as the number of related words found per text increases; most texts that mention these words do so at most three times.
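A count like this can be reproduced with a simple token match. The sketch below uses the keyword list from the text, but the tokenization rule and the sample posts are assumptions; the team's exact matching logic is not described.

```python
import re

# Keywords from the article, lowercased for case-insensitive matching.
COVID_TERMS = {"corona", "virus", "covid", "covid19", "covid-19", "coronavirus"}

def covid_mentions(text: str) -> int:
    """Count tokens in the text that match one of the COVID-19 keywords."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    return sum(token in COVID_TERMS for token in tokens)

# Invented sample posts.
texts = [
    "Stuck at home because of the coronavirus",
    "COVID-19 anxiety is real, this virus scares me",
    "Exams were hard this week",
    "feeling lonely today",
]
counts = [covid_mentions(t) for t in texts]
share_with_mention = sum(c > 0 for c in counts) / len(texts)
```

Matching whole tokens avoids double counting “coronavirus” as both “corona” and “virus”, which a substring search would do.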

Till now, we have presented the distribution of sentiments for specific words related to COVID-19. Nonetheless, questions about how these words relate to the general sentiment during the time period under analysis haven’t been answered yet.

The general sentiment has been deteriorating, i.e. becoming more negative, since the beginning of 2020. In particular, June is the month with the most negative sentiment, which coincides with the month with the most COVID-19 infections in the period considered, with a total number of 241 million cases. Comparing texts that contain COVID-19-related words with texts that are completely unrelated, the former show more negative sentiment in general.



Graph between sentiment vs months in 2020

The sentiment by label is shown again, this time from January 2020 to June 2020 only.



Violin Plot by label 2020



Exasperation remains stable, with February standing out for its negativity compared to the rest. Likewise, self-harm is quite stable; the months that stand out for their negativity in this category are March and June. Contrary to self-harm, March is not a negative month for the suicide label. However, the remaining months between February and June show a decline in sentiment that worsens over time, and they are notably negative. June stands out for having both very positive and very negative sentiments (high polarities), which doesn’t happen in the other months. It has to be verified, but it could be that the number of suicides has been increasing in recent months. Regarding anxiety, a downward trend in sentiment is also observed between February and May. Finally, loneliness deserves attention, given the highly negative sentiment in May and June. Since there is only data for June 2020 for Bullying, this label isn’t analyzed.

The next figure presents the time series corresponding to the sentiment between 2019/05 and 2020/06. A slight downward trend can be observed. This means that the general sentiment has become more negative. Additionally, there are days that present greater negativity, indicated by the troughs. Most of the troughs in the present year are found in the last months since April.



Sentiment Analysis from 2019-05

Incidents that moved the youth




There are other major incidents, besides COVID-19, that have moved the youth to call for help and to speak up in 2020. The murder of George Floyd was a turning point that ignited the #BlackLivesMatter movement. Have a look at the word cloud, which shows the most frequent and insightful words.

The youth gathered to protest against racism and call for equality and freedom worldwide. The Empath values related to Racism and Mental Health are displayed below.


AI Mental Health

Normalized empathy analysis



The COVID-19 pandemic has led the world towards a global economic crisis, with massive unemployment and shortages of food and medicines. Perhaps the big question is: “How will the pandemic affect the younger generations and the generations to come?” Unfortunately, there’s no answer to this question yet, except that the economic crisis we’re living through is definitely going to affect the younger generation, because they’re the ones who will study, go to college, and look for a job in the near future. The big picture tells us that unemployment is increasing on a daily basis and that there are not enough resources for all of us. The word cloud at the opening of the article reflects some of the most frequent words related to the current economic crisis.

Building a Risk Classifier for a PTSD Assessment Chatbot



Using MLFlow to structure a machine learning project and support the backend of a PTSD risk classifier chatbot.



The Problem: Classification of Text for a PTSD Assessment Chatbot

The input

A text transcript similar to:


therapist and client conversation snapshot


The output

Low Risk -> 0 , High Risk -> 1

One of the requirements of this project was to have a productionized text classification model for PTSD risk that could communicate with a frontend, for example a chatbot.

As part of the solution to this problem, we decided to explore the MLFlow framework.



MLflow is an open-source platform to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment. It currently offers three components:

MLFlow Tracking: Allows you to track experiments and projects.

MLFlow Models: Provides a model and framework to persist, version, and serialize models in multiple platform formats.

MLFlow Projects: Provides a convention-based approach to setting up your ML project, so that you benefit as much as possible from the work the developer community puts into the platform.

Main benefits identified from my initial research were the following:

  • Works with any ML library and language
  • Runs the same way anywhere
  • Designed for small and large organizations
  • Provides a best-practices approach for your ML project
  • Serving layers (REST + batch) come almost for free if you follow the conventions



The Solution


The focus of this article is to show the baseline ML models and how MLFlow was used for experiment tracking during training and for productionization of the text classification model.


Installing MLFlow

pip install mlflow


Model development tracking

The snippet below represents our cleaned and pretty data, after data munging:


snapshot of a table containing transcript_id, text, and label as the column headings


The gist below describes our baseline (dummy) logistic regression pipeline:


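from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

train, test = train_test_split(final_dataset,
    random_state=42, test_size=0.33, shuffle=True)
X_train = train.text
X_test = test.text

LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5,
        norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')),
    # Assumed final step: the original gist is truncated after the vectorizer.
    ('clf', LogisticRegression()),
])

The link to this code is given here.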

One of the first useful things you can do with MLFlow during model development is to log a model training run. You would log, for instance, an accuracy metric, and the generated model will also be associated with this run.


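import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

with mlflow.start_run():
    # fit the pipeline on the training transcripts
    LogReg_pipeline.fit(X_train, train["label"])
    # compute the testing accuracy
    prediction = LogReg_pipeline.predict(X_test)
    accuracy = accuracy_score(test["label"], prediction)
    mlflow.log_metric("model_accuracy", accuracy)
    mlflow.sklearn.log_model(LogReg_pipeline, "LogisticRegressionPipeline")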


The link to the code above is given here.


At this point, the model above is saved and reproducible if needed at any point in time.

You can spin up the MLFlow tracker UI so you can look at the different experiments:


$ mlflow ui -p 60000
[2019-09-01 16:02:19 +0200] [5491] [INFO] Starting gunicorn 19.7.1
[2019-09-01 16:02:19 +0200] [5491] [INFO] Listening at: (5491)
[2019-09-01 16:02:19 +0200] [5491] [INFO] Using worker: sync
[2019-09-01 16:02:19 +0200] [5494] [INFO] Booting worker with pid: 5494


The backend of the tracker can be either the local file system or a cloud-based distributed file system (S3, Google Drive, etc.). It can be used locally by one team member or shared across a team so that runs stay reproducible.

The image below shows a couple of model training runs together with the metrics and model artifacts collected:


Experiment Tracker in MLFlow screenshot

Sample of experiment tracker in MLFlow for Text Classification


Once your models are stored, you can always go back to a previous version of the model and re-run it based on the ID of the artifact. The logs and metrics can also be committed to GitHub and stored in the context of a team, so everyone has access to the different experiments and resulting metrics.


MLFlow experiment tracker


Now that our initial model is stored and versioned, we can assess the artifact and the project at any point in the future. The integration with Sklearn is particularly good because the model is automatically pickled in a Sklearn-compatible format and a Conda file is generated. You could also log a reference to a URI and a checksum of the data used to generate the model, or the data itself if it is within reasonable limits (preferably if the information is stored in the cloud).


Setting up a training job

Once you are done with model development, you need to organize your project in a productionizable way.

The most basic component is the MLproject file. There are multiple options to package your project: Docker, Conda, or bespoke. We will use Conda for its simplicity in this context.


name: OmdenaPTSD

conda_env: conda.yaml

entry_points:
  main:
    command: "python"


The entry point runs the command that should be used when running the project, in this case, a training file.

The conda file contains a name and the dependencies to be used in the project:


name: omdenaptsd-backend
channels:
  - defaults
  - anaconda
dependencies:
  - python==3.6
  - scikit-learn=0.19.1
  - pip:
    - mlflow>=1.1


At this point you just need to run the command.


Setting up the REST API classifier backend

To set up a REST classifier backend, you don’t need any job setup. You can use a persisted model from a Jupyter notebook.

To run a model you just need to run the models serve command with the URI of the saved artifact:


mlflow models serve -m runs://0/104dea9ea3d14dd08c9f886f31dd07db/LogisticRegressionPipeline
2019/09/01 18:16:49 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2019/09/01 18:16:52 INFO mlflow.pyfunc.backend: === Running command 'source activate 
mlflow-483ff163345a1c89dcd10599b1396df919493fb2 1>&2 && gunicorn --timeout 60 -b -w 1 mlflow.pyfunc.scoring_server.wsgi:app'
[2019-09-01 18:16:52 +0200] [7460] [INFO] Starting gunicorn 19.9.0
[2019-09-01 18:16:52 +0200] [7460] [INFO] Listening at: (7460)
[2019-09-01 18:16:52 +0200] [7460] [INFO] Using worker: sync
[2019-09-01 18:16:52 +0200] [7466] [INFO] Booting worker with pid: 7466


And a scalable backend server (running gunicorn) is ready without any code apart from your model training and logging the artifact with the MLFlow packaging strategy. It frees machine learning engineering teams that want to iterate fast from the initial, cumbersome infrastructure work of setting up a repetitive and uninteresting boilerplate prediction API.

You can immediately start launching predictions to your server by:


curl -H 'Content-Type: application/json' -d 
'{"columns":["text"],"data":[[" concatenated text of the transcript"]]}'


The smart thing here is that the MLFlow scoring module uses the Sklearn model input (pandas schema) as a spec for the REST API. Sklearn was the example used here, but MLFlow also has bindings for H2O, Spark, Keras, TensorFlow, ONNX, PyTorch, and others. It infers the input from the model packaging format and offloads the data to the scoring function. It’s a very neat software engineering approach to a problem faced every day by machine learning teams, freeing engineers and scientists to innovate instead of working on repetitive boilerplate code.
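For a frontend written in Python, the request body from the curl example can be built programmatically. The sketch below only constructs the JSON payload in the pandas “split” layout shown above; the helper name `build_split_payload` is hypothetical, and the server URL is deliberately left out because it depends on where the model is served.

```python
import json

def build_split_payload(transcripts):
    """Build a pandas "split"-style JSON body with a single text column."""
    return json.dumps({
        "columns": ["text"],
        # One inner list per row, matching the single "text" column.
        "data": [[t] for t in transcripts],
    })

body = build_split_payload(["concatenated text of the transcript"])
```

The resulting string can be sent as the body of a POST request with the `Content-Type: application/json` header, mirroring the curl call.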

Going back to the Omdena challenge, this backend is available for the frontend team to connect the chatbot app to the risk classifier at the most convenient point (most likely after a critical mass of open-ended questions).




A Faster Way to Annotate Transcript Data in PTSD Therapy Sessions


The Problem


This project has been done with Christoph von Toggenburg, CEO of World Vision Switzerland, who was exposed to Post Traumatic Stress Disorder in an armed ambush in Africa. PTSD can be triggered when someone experiences a severe traumatic event, and instead of the trauma leveling off, it becomes a mental health condition.

Symptoms include panic attacks, anxiety, uncontrollable thoughts, and more, which can be triggered whenever they are reminded of the event.

“The difference between trauma and PTSD is that switch in your brain, and it becomes a part of your life. It is something you cannot reverse, but you can deal with the symptoms, and if treated properly, you can get much better” — Christoph

Christoph started BEATrauma, an initiative to help victims with PTSD therapy all around the world. His vision is to create a mobile Risk Assessment chatbot app that converses with users and determines a risk assessment for PTSD, using Cognitive Behavioral Therapy (CBT) and machine learning, and that is where we come in!


The Data Problems — Not Annotated, Not Enough

Data is not always easy to find, especially when dealing with sensitive user information like therapy sessions. Through our community network, though, we were able to get around 1700 transcripts of therapy sessions, of which only about 50 were for PTSD.


The Solution

From a traditional treatment standpoint, we found that CBT (Cognitive Behavioral Therapy) was the best fit for PTSD therapy using a Risk Assessment chatbot. In CBT, a therapist talks with the patient about their experiences and gradually “exposes” them to those memories until they become comfortable with them. Knowing that we could implement a conversational agent with NLP for this purpose, we set our sights on training data for the Risk Assessment chatbot.

We split into two groups. One was in charge of risk assessment, creating a rule-based algorithm in Rasa with sentiment analysis to converse with the user, along with a backend classification model trained on transcript data to determine if the user had PTSD. The other focused on CBT, training a seq-to-seq chatbot for therapy!

This article describes the data annotation part. Since the transcripts came completely unlabelled, we had to give each of them a score between 0 and 1 so that the model could learn which patients had PTSD and which didn’t. One of our project collaborators had experience with statistics and psychology and guided the team of seven through reading the transcripts and scoring them!


The Annotation Process

  • Understand each of the 6 criteria for PTSD, e.g., exposure to actual or threatened death, serious injury, or sexual violence; persistent avoidance of stimuli associated with the traumatic event(s); and more!
  • Keeping the criteria in mind, read an entire transcript (which can take from 45 min to 1 hr).
  • Score each of the 6 criteria with a 0, 0.5, or 1, where 0 means not displaying the symptom at all, 0.5 means somewhat displaying it, and 1 means a clear expression of that symptom.
  • Follow a formula that takes in all 6 numbers and returns a number between 0 and 1 as the risk assessment for PTSD.
  • Rinse and repeat for the other 49.
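The scoring step can be sketched in a few lines. The article does not state the actual formula, so the plain average used below is only an assumption chosen for illustration, and risk_score and ALLOWED_SCORES are hypothetical names.

```python
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def risk_score(criteria_scores):
    """Combine six per-criterion scores into one value in [0, 1].

    Assumption: a plain average. The formula the team used is not given.
    """
    if len(criteria_scores) != 6:
        raise ValueError("expected one score per criterion (6 in total)")
    if any(score not in ALLOWED_SCORES for score in criteria_scores):
        raise ValueError("each score must be 0, 0.5, or 1")
    return sum(criteria_scores) / len(criteria_scores)


# Example annotation for one transcript (made-up scores).
print(risk_score([1, 0.5, 0.5, 0, 1, 1]))
```

Any other aggregation (for example, weighting some criteria more heavily) would slot into the same function without changing the annotation workflow.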


Points explained of Criterion A(CAPS-5)

Criterion A’s description


We faced two problems in our annotation process. The first was that it took far too long to annotate all the data. Because of complications and busy schedules, it took around two weeks of hard work to finish. The second was that the transcripts were often a bit unclear and difficult to understand.

We brainstormed several solutions to the annotation problem:

  • Determine a bag of words and their embeddings for each criterion and run LDA (Latent Dirichlet Allocation) on top of them for classification of each criterion to completely automate the process
  • Use USE (Universal Sentence Encoder) to compute the cosine similarity between sentences and match sentences that belong to the same criterion
  • Use GPT-2 to summarize each transcript to get the main idea, speeding up the annotations
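To make the cosine-similarity idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are toy values standing in for sentence embeddings (they do not come from the Universal Sentence Encoder), and the criterion names and helper functions are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def best_matching_criterion(sentence_vec, criterion_vecs):
    """Return the criterion whose reference vector is most similar to the sentence."""
    return max(
        criterion_vecs,
        key=lambda name: cosine_similarity(sentence_vec, criterion_vecs[name]),
    )


# Toy embeddings; a real pipeline would obtain these from a sentence encoder.
criterion_vecs = {"exposure": [1.0, 0.0, 0.0], "avoidance": [0.0, 1.0, 0.0]}
sentence_vec = [0.9, 0.1, 0.0]
print(best_matching_criterion(sentence_vec, criterion_vecs))
```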


Creating the Risk Assessment Chatbot

From there, we had to create a classification model that takes in user conversations and determines whether the user has PTSD. Another task group had a breakthrough with ULMFiT’s transfer learning technique, which reached 80% accuracy. That is a very good start, and it is currently being improved further through data augmentation methods.


Ready to run the advanced models soon!




How To Estimate Possibly Undetected COVID-19 Infection Cases

Country-wide estimations for undetected Covid-19 cases and recommendations for enhancing testing facilities.


By Nikhel Gupta

Why is estimating undetected Covid-19 cases crucial?

An estimate of the undetected Covid-19 cases is important for authorities to plan economic policies, make decisions around different stages of lockdown, and work towards the production of intensive care units.


How far is a Covid-19 testing center from your home? (credit: link)

As we have crossed the psychological mark of 1 million Covid-19 patients around the globe, more questions are popping up regarding the capabilities of our health care systems to contain the virus. One of the major worries is the systematic uncertainty in the number of citizens who have hosted the virus. The major contributor to this uncertainty is possibly the small number of Covid-19 tests being performed.

The main test to confirm whether someone has Covid-19 looks for signs of the virus’s genetic material in a swab from their nose or throat. This test is not yet available for most people. Healthcare workers are morally obliged to reserve the testing apparatus for seriously ill patients in the hospital.

In this article, I will show a simple Bayesian approach to estimate the undetected Covid-19 cases. The Bayes theorem can be written as:

P(A|B) = P(B|A) × P(A) / P(B)

where P(A) is the probability of event A, P(B) is the probability of event B, P(A|B) is the probability of observing event A if B is true, and P(B|A) is the probability of observing event B if A is true.

The quantity of interest for us is P(infected|notTested) i.e. the probability of infections that are not tested. This is equivalent to the percentage of the population infected by Covid-19 but not tested and we can write it as:

P(infected|notTested) = P(notTested|infected) × P(infected) / P(notTested)

Here the other probabilities are:

  • P(notTested|infected): Probability of tests not done on people that are infected or percentage of the population not tested but infected.
  • P(infected): Prior probability of infection or known percentage of the infected population.
  • P(notTested): Probability or percentage of people not tested.

The following plot shows the total Covid-19 tests per million people and the total number of confirmed cases per million people for several countries. This suggests a clear relation between the Covid-19 tests and confirmed positive detections.


Figure 1: Tests per million versus positive Covid-19 cases per million as of 20 March 2020 (data source).


Assuming that all countries follow this relation between the Covid-19 tests and confirmed cases, we can make a rough estimate of the number of undetected cases in each country (I will come back to this assumption later in this post).

Let’s take Australia as an example. The plot gives the prior knowledge of infected cases as

P(infected) = 27.8/10⁶, and

P(notTested) = (10⁶ - 4473)/10⁶.

To estimate the P(notTested|infected), I used the relation between the Covid-19 tests and confirmed cases as in the above Figure 1. This is done by fitting a power law of the form: y = a * x**b, where a is normalization and b is the slope of this power law. The following plot shows a fit to the data points from the above plot, where the best fit a = 0.060±0.008 and b = 0.966±0.014.


Figure 2: The relation between Covid-19 tests and confirmed cases and a power-law best fit.
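A power law y = a * x**b becomes a straight line in log-log space, so one simple way to fit it is ordinary least squares on log(x) and log(y). The sketch below illustrates that idea on synthetic points generated from the quoted best-fit values. It is not necessarily the fitting procedure used for Figure 2, it does not estimate the parameter uncertainties, and fit_power_law is a name chosen here for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # normalization
    return a, b


# Synthetic data lying exactly on y = 0.06 * x**0.966 (not the real country data).
tests_per_million = [10, 100, 1000, 10000]
cases_per_million = [0.06 * x ** 0.966 for x in tests_per_million]

a, b = fit_power_law(tests_per_million, cases_per_million)
print(round(a, 3), round(b, 3))
```

With real, scattered data the log-space fit weights points differently from a direct nonlinear fit, so the two approaches can give slightly different parameters.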


Using the best-fit parameters, P(notTested|infected) = [(10⁶ - 4473)/10⁶] / [a * (10⁶ - 4473)**b / 10⁶].

With these three probabilities, I find P(infected|notTested) = 0.00073, i.e. about 0.073 per cent of the population of Australia. Multiplying this by the population of Australia indicates that there is a possibility of about 18,600 undetected Covid-19 cases in Australia. The following plot shows possible undetected Covid-19 cases as a function of tests per million for different countries as of 20 March 2020.
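The Australia estimate can be reproduced with a short script that follows the three probabilities defined in the text. The population figure is an assumed round number (it is not given in the article), undetected_fraction is a name chosen for illustration, and because the best-fit parameters are rounded, the output may differ slightly from the quoted values.

```python
TESTS_PER_MILLION = 4473      # Australia, as used in the text
CASES_PER_MILLION = 27.8      # confirmed cases per million, from the text
A, B = 0.060, 0.966           # best-fit power-law parameters from Figure 2
POPULATION = 25_500_000       # assumption: approximate population of Australia


def undetected_fraction(tests_pm, cases_pm, a, b):
    """P(infected|notTested) from Bayes' theorem, using the article's terms."""
    not_tested_pm = 1e6 - tests_pm
    p_infected = cases_pm / 1e6
    p_not_tested = not_tested_pm / 1e6
    # The article's ratio: untested fraction over power-law cases among the untested.
    p_not_tested_given_infected = p_not_tested / (a * not_tested_pm ** b / 1e6)
    return p_not_tested_given_infected * p_infected / p_not_tested


fraction = undetected_fraction(TESTS_PER_MILLION, CASES_PER_MILLION, A, B)
print(f"P(infected|notTested) = {fraction:.5f}")
print(f"Possible undetected cases = {round(fraction * POPULATION)}")
```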


Figure 3: Estimation of undetected Covid-19 cases (see assumptions in the text).

Note that several assumptions and considerations are made to estimate these undetected cases. For instance:

  • I assumed that all countries would follow the same power-law relation to estimate P(notTested|infected). However, this is not an extremely good assumption as there is huge scatter in this relation between different countries.
  • Our prior knowledge of the number of infections can be biased itself as P(infected) depends on the number of tests performed as of 20 March 2020.
  • I haven’t considered the susceptibility of a country’s populations to Covid-19, and the attack rate i.e. the biostatistical measure of the frequency of morbidity, which for Covid-19 is estimated around 50–80% (Verity et al. 2020).
  • The impact of government policies of these countries from 14 days before 20 March and 14 days after is not considered.
  • I haven’t considered how susceptible people are targeted for testing in different countries in the coming days.

Figure 4 below shows the total number of confirmed cases versus the tests per million as of 5 April 2020 for several countries (data source).

Sixteen days later, on 5 April, the confirmed positive cases in countries like Ukraine, India and the Philippines are consistent with the predictions in Figure 3. These countries had performed ≤ 10 tests per million people as of 20 March.

Note that the consistency between estimations as of 20 March and 5 April does not necessarily mean that all undetected cases as of 20 March are confirmed now. Several of the confirmed cases as of 5 April are expected to be new cases due to the spread between 20 March and 5 April (even in the presence of lockdowns).

The estimated undetected cases for countries like Colombia and South Africa (Figure 3) are about twice as large as the total confirmed cases as of 5 April (about 1,500 for both). Both countries have performed about 100 tests per million people.

Countries like Taiwan, Australia, and Iceland, on the other hand, have shown an order of magnitude fewer confirmed cases than the estimated numbers in Figure 3.

This indicates that the countries that have not boosted their testing efficiency to more than 1,000 tests per million people have significantly larger uncertainties on the number of current confirmed cases.

Figure 4: The total number of confirmed cases versus the tests per million as of 5 April 2020.

Given the data in Figure 4 from 5 April 2020, I repeated the whole exercise to estimate the undetected Covid-19 cases for these countries, cities, and states. The following figure shows the best-fit power law and data points, similar to Figure 2 but for the data as of 5 April 2020.


Figure 5: Best fit power law for data as of 5 April 2020.


The best-fit slope for the power-law relation in Figure 5 (b = 1.281±0.009) is consistent with the slope in Figure 2 at 2-σ confidence level. This supports our assumption of estimating P(notTested|infected) from the best-fit power-law relation (the slope is not changing); however, the other caveats are the same as before.

Finally, the following plot shows the estimated undetected Covid-19 cases for different countries as of 5 April 2020.


Figure 6: Estimated Undetected Covid-19 cases as of 5 April 2020 (see assumptions in the text).


The comparison between the undetected estimations as of 20 March (Figure 3) and the confirmed cases as of 5 April (Figure 4) shows that more tests per million people are required to capture the possible undetected cases. It is high time that authorities raise the testing efficiency in order to reduce the systematics from undetected Covid-19 cases. This seems to be the only good way to reduce the death rate of Covid-19 patients, as indicated by the large amount of Covid-19 testing in Germany and South Korea.

To make this happen, all countries need at least one testing center within a radius of 20 km and should arrange more drive-through testing facilities as soon as possible.

This work was done in collaboration with the people working on the Omdena Coronavirus AI challenge.

You can contact me on LinkedIn and follow my academic research on Orcid.



How Omdena is combating the Coronavirus

A good start to learn more about Omdena’s innovation platform is to read about our Coronavirus Policy AI Challenge, where more than 70 AI and domain experts are collaborating to build AI models that reveal the direct and indirect impact of pandemic policies on the economic health of marginalized communities.

Our aim is to support policymakers in identifying the most effective ways to minimize the economic suffering of those most vulnerable.


Using AI to Enable Data-Driven Response Actions During Pandemics

Palo Alto-based startup Omdena wants to use AI to help governments make data-driven decisions when dealing with pandemics like the coronavirus



By Laura Clark Murray 


Palo Alto, California, March 30, 2020 –  When travel is restricted, schools closed, businesses shut down, and communities put into quarantine, individuals in those ecosystems lose their sources of income. Omdena, a Palo Alto-based startup that unites AI and domain experts from around the globe, is launching an AI challenge to investigate the impact of such policy decisions on people’s financial stability.


In an effort to curb the coronavirus pandemic, more than 100 countries have imposed travel restrictions and 2.5 billion people, or 30 percent of the world’s population, have been directed by governments to stay at home. The resulting loss of wages is expected to be disastrous to those already on the economic margins, including wage workers. Omdena’s AI challenge aims to provide analysis of the economic effects of the COVID-19 crisis.

“The lockdowns in Europe, the US, and India affect the poorest in those regions and elsewhere. We must think about those hundreds of millions who do not have savings or a pantry full of food. When those people cannot go to work every day to earn a living, the impact is devastating,” said Rudradeb Mitra, Founder of Omdena. “We want that impact to be understood and considered by policymakers.”

Omdena, which is a partner of the United Nations’ AI for Global Good Summit 2020, comes with a track record of successfully completed AI projects. Those efforts include using machine learning to identify the safest routes in Istanbul for earthquake victims to reunite with their loved ones. It has also delivered AI solutions which helped detect the outbreak of fires in the Brazilian rainforest with 95 percent accuracy.

“Our goal is to minimize the human suffering that results from pandemic policies. We created this AI challenge to support policymakers with data-driven analyses that will help them make even more informed decisions in the future,” added Mitra.

Omdena runs collaborative AI projects in which global teams of 40 or more data scientists and experts build AI solutions to address significant real-world problems. To date, more than 900 people from over 75 countries have participated in Omdena’s challenges.

Omdena’s Coronavirus Policy AI Challenge is supported by the UN AI for Good Global Summit, AI for Peace, PWG, Fruitpunch AI, LabelBox, and Spell. Joining the challenge are economic, health and humanitarian policy experts from around the world, who bring experience with organizations including the World Health Organization, The World Bank, European Commission, and UNICEF USA.

“We are excited to join efforts with Omdena to protect those with the least capacity to manage the burdens of this crisis: the impoverished and economically marginalized,” said Branka Panic, Founder of the think tank AI for Peace. “We aspire to help governments and international organizations deal with this and future pandemics by taking an AI-enabled and evidence-based approach to policymaking.”

Omdena is a partner of several United Nations organizations, including the UN Refugee Agency and the UN World Food Programme, as well as an official Innovation Partner of the United Nations AI for Global Good Summit 2020.


For media inquiries contact: Laura Clark Murray, Omdena, 

About Omdena: Founded in May 2019, Omdena is an innovation platform for building AI solutions to real-world problems through global collaboration. The company’s partners include the UN World Food Programme and the UN Refugee Agency. Omdena is an Innovation Partner of the United Nations AI for Good Global Summit 2020. Learn more at

Learn more about Omdena’s Coronavirus Policy AI Challenge at

About Rudradeb Mitra: The India-born Rudradeb Mitra is a graduate of the University of Cambridge, UK, and an international AI expert. He has built six startups in four countries. His primary interest is to build products with social value. He is a mentor and AI advisor at several institutions, including Google Launchpad, ImpactHub, MIT Enterprise, and Founders Institute, and a senior AI advisor of EFMA Banking Group. Mitra founded Omdena in 2019 to address real-world problems through global collaboration.
