AI Meets 96-Year-Old NGO: Improving Case Management for Cross-Border Child Protection


How can AI and Natural Language Processing (NLP) help alleviate social workers’ administrative burden in case management?


By Shrey Grover and Jianna Park


The social service sector is showing increasing interest in data-driven practices, which until now were used predominantly by its commercial counterparts.

Some of the key reforms that social organizations expect from leveraging data include:

  1. Harness the potential of the underlying gold mine of expert knowledge
  2. Relieve the limited staff of repetitive administrative and operational tasks
  3. Address missions on a shorter timeline with enhanced efficiency

Yet, according to IBM’s 2017 study, most of the sector seems to be in the early stages of the data journey, as shown in the visual below. Constrained budgets and limited access to technology and talent are cited as the major hurdles in utilizing analytical services.



67% of nonprofits are in the preliminary stages of using data (source)


We at Omdena had an unparalleled opportunity to work on one such nascent-stage project for International Social Service (ISS), a 96-year-old NGO that contributes substantially to resolving child protection cases.

Why This Project?

ISS has a global network of expertise in providing essential social services to children across borders — mainly in the domains of child protection and migration. However, with over 70,000 open cases per year, ISS caseworkers were facing challenges in two aspects: managing time, and managing data. The challenges were often exacerbated by administrative backlogs and a high turnover rate not uncommon in the nonprofit sector.

If we could find a way to significantly reduce the percentage of time lost on repetitive administrative work, we could focus on the more direct, high-impact tasks, helping more children and families access better quality services. ISS saw an urgent need for a technological transformation in the way they managed cases — and this is where Omdena came into the picture.


Problem Statement

The overarching goal of this project was to improve the quality of case management and avoid unnecessary delays in service. Our main question remained: how can we help save caseworkers’ time and leverage their data in a meaningful way?

It was time to break down the problem into more specific targets. We identified factors that hinder ISS caseworkers from focusing on client-facing activities, shown in the following picture.


Graphic by Omdena collaborator Bima Putra Pratama 


We saw that these subproblems could each be solved with different tools, which would help organizations like ISS understand the various ways that machine learning can be integrated with their existing system.

We also concluded that if we could manage data in a more streamlined manner, we could manage time more efficiently. Therefore, we decided that introducing a sample database system would prove to be beneficial.


Initial Challenge

The biggest challenge we faced as a team was the data shortage. ISS had a strict confidentiality agreement with their clients, which meant they couldn’t simply give us raw case files and expose private information.

Initially, ISS gave us five completed case files with names either masked or altered. As manually editing cases would have taken up caseworkers’ client-facing time, our team at Omdena had to find another way to augment data.



Our team collectively tackled the main problem from various angles, as follows:


Graphic by Bima Putra Pratama


As we had only five data points to work with and were not authorized to access ISS’s data pool, we clarified that our final product would be a proof-of-concept rather than a production-ready system.

Additionally, keeping in mind that our product was going to be used by caseworkers who may not have a technical background, we consolidated the final deliverables into a web application with a simple user interface.


Data Collection

Because only a few cases were available at the start of the project, the first task at hand was to collect more child-related cases from various sources. We concentrated mainly on child abuse and migration cases. We gathered the case files and success stories that were publicly available on ISS partner websites, Malawi’s Child Protection Training Manual, Bolt Burdon Kemp, and Act For Kids. We also collected a catalog of child welfare court cases from the European Court of Human Rights (ECHR), WorldCourt, US Case Law, and LawCite. In the end, we had collated a dataset of about 230 cases and were ready to use them in our project pipeline.


Data Engineering

We relied on a supervised learning approach for our risk score prediction model. For this, we manually labeled risk scores for each of our cases. A risk score, a float value ranging from 0 to 1, is intended to highlight priority cases (i.e. cases that involve a threat to a child’s wellbeing or have tight time constraints) that require the immediate attention of caseworkers.

The scores were assigned by considering various factors such as the presence or history of abuse within the child’s network, access to education, the caretaker’s willingness to care for the child, and so on. To reduce bias, three collaborators provided a risk score for each case, and the average of the three was taken as the final risk score for that case.



Manual risk score assignment process


Finally, we divided the risk scores into three categories, using the following thresholds.

ai case management
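As a rough illustration of the labeling scheme, the sketch below averages three annotator scores and maps the result to one of three categories. The cut-off values and the category names "low", "medium", and "high" are assumptions made for this example only; the thresholds actually used in the project appear in the figure and are not reproduced here.

```python
from statistics import mean

# Hypothetical cut-offs, chosen only for illustration.
LOW_MAX = 0.33
MEDIUM_MAX = 0.66


def final_risk_score(annotator_scores):
    """Average the risk scores given by the annotators for one case."""
    if not all(0.0 <= s <= 1.0 for s in annotator_scores):
        raise ValueError("risk scores must lie between 0 and 1")
    return mean(annotator_scores)


def risk_category(score):
    """Map a score in [0, 1] to one of three (assumed) category names."""
    if score <= LOW_MAX:
        return "low"
    if score <= MEDIUM_MAX:
        return "medium"
    return "high"


scores_for_one_case = [0.8, 0.7, 0.9]  # one score per collaborator
final = final_risk_score(scores_for_one_case)
print(round(final, 2), risk_category(final))
```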


Additionally, we augmented our data — which originally only contained case text — by adding extra information such as case open and close dates, type of service requested, and country where the service is requested. Using this, we created a seed file to populate our sample database. These parameters would later help caseworkers see how applying a simple database search and filter system can enable dynamic data retrieval.



Next, we moved to data preprocessing, which is crucial in any data project pipeline. To generate clean, formatted data, we implemented the following steps:

  • Text Cleaning: Since the case texts were pulled from different sources, we had different kinds of noise to remove, including special characters, unnecessary numbers, and section titles.
  • Lowercasing: We converted the text to lower case so that differently capitalized forms of the same word would not be treated as separate words.
  • Tokenization: Case text was further split into sentence and word tokens so that they could be accessed individually.
  • Stop word Removal: As stop words did not contribute to some of the solutions we worked on, we removed them for those tasks.
  • Lemmatization: For certain tasks like keyword and risk factor extraction, it was necessary to reduce each word to its lemmatized form (e.g. “crying” to “cry,” “abused” to “abuse”), so that words with the same root are not counted multiple times.
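
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the stop word set and the lemma lookup table are tiny hypothetical stand-ins for a full stop word list and a real lemmatizer (such as those shipped with NLTK or spaCy).

```python
import re

# Tiny illustrative resources (assumptions for this sketch).
STOP_WORDS = {"the", "a", "an", "was", "by", "and", "of", "to", "in", "her", "his"}
LEMMAS = {"crying": "cry", "abused": "abuse", "children": "child"}


def clean_text(text):
    """Drop digits and special characters, keeping letters and sentence punctuation."""
    text = re.sub(r"[^A-Za-z.!?\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split on sentence-ending punctuation and discard empty fragments."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def preprocess(text):
    """Return one list of lemmatized, stop-word-free tokens per sentence."""
    processed = []
    for sentence in split_sentences(clean_text(text).lower()):
        tokens = [t for t in sentence.split() if t not in STOP_WORDS]
        processed.append([LEMMAS.get(t, t) for t in tokens])
    return processed


sample = "The child (aged 7) was crying. She was abused by her uncle in 2015!"
print(preprocess(sample))
# [['child', 'aged', 'cry'], ['she', 'abuse', 'uncle']]
```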


Feature Extraction

We had to convert the case texts into compact numerical representations of fixed length to make them machine-readable. We considered four different embedding methods: Term Frequency-Inverse Document Frequency (TF-IDF), Doc2Vec, Universal Sentence Encoder (USE), and Bidirectional Encoder Representations from Transformers (BERT).

To choose the one that works best for our case, we embedded all cases using each embedding method. Next, we reduced the embedding vector size to 100 dimensions using Principal Component Analysis (PCA). Then, we used a hierarchical clustering method to group similar cases as clusters. To find the optimal number of clusters for our problem, we referred to the dendrogram plot. We finally evaluated the quality of the clusters using Silhouette scores.

After performing these steps for all four methods, we observed the highest Silhouette score for the USE embeddings, so we selected USE as our embedding model.
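
The comparison procedure (PCA, hierarchical clustering, Silhouette score) could look roughly like the sketch below, written with scikit-learn. The two "embedding methods" here are synthetic random matrices standing in for real document embeddings, the number of PCA components is reduced from 100 to 10 to fit the toy data, and the number of clusters is fixed by hand rather than read from a dendrogram. All of these are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def clustering_quality(embeddings, n_components, n_clusters):
    """Reduce with PCA, cluster hierarchically, and return the Silhouette score."""
    reduced = PCA(n_components=n_components, random_state=0).fit_transform(embeddings)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(reduced)
    return silhouette_score(reduced, labels)


# Synthetic stand-ins: one matrix with three clear groups, one with pure noise.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 50)) * 5
well_separated = np.vstack([c + rng.normal(size=(20, 50)) for c in centers])
noisy = rng.normal(size=(60, 50))

scores = {
    "method_a": clustering_quality(well_separated, n_components=10, n_clusters=3),
    "method_b": clustering_quality(noisy, n_components=10, n_clusters=3),
}
best_method = max(scores, key=scores.get)  # highest Silhouette score wins
print(scores, best_method)
```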


Models & Algorithms

Text Summarization

Multiple pre-trained extractive summarizers were tried, including BART, XLNet, BERT-SUM, and GPT-2, which were available through the HuggingFace Transformers library. As evaluation metrics such as ROUGE-N and BLEU required far more reference summaries than we had, we opted for a relative performance comparison and checked the quality and noise level of each model’s output. Inference speed then played a major role in determining the final model for our use case, which was XLNet.


Time each model took to produce a sample summary, in seconds


Keyword & Entity Relation Extraction

Keywords were obtained from each case file using RAKE, a keyword extraction algorithm that determines high-importance phrases based on their frequencies in relation to other words in the text.
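
To make the idea concrete, here is a simplified RAKE-style scorer in plain Python. It is a sketch of the general technique, not the library used in the project: candidate phrases are runs of words between stop words or punctuation, each word is scored as degree divided by frequency, and a phrase scores the sum of its word scores. The stop word list is a small hypothetical one.

```python
import re
from collections import defaultdict

# Small illustrative stop word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "was", "is",
              "by", "with", "for", "her", "his", "after"}


def candidate_phrases(text):
    """Split text into runs of non-stop-words, breaking at stop words and punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_keywords(text, top_n=3):
    """Score each candidate phrase by the summed degree/frequency of its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, incl. itself
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


text = ("The child was placed in temporary foster care "
        "after repeated physical abuse by the stepfather.")
keywords = rake_keywords(text, top_n=2)
print(keywords)
```

Longer phrases made of words that rarely occur on their own score highest, which is why the two three-word phrases come out on top here.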

For entity relations, several techniques using OpenIE and AllenNLP were tried, but each had its own drawbacks, such as producing repetitive information. So we implemented our own custom relation extractor using spaCy, which better identified subject and object nodes as well as their relationships based on root dependency.


Entity relation graph made via Plotly


Similarity Clustering

Pairwise similarity was computed between a given case and the rest of the data based on USE embeddings. Among Euclidean distance, Manhattan distance, and cosine similarity, we chose cosine similarity as our similarity measure for two reasons.

First, it works well with unnormalized data. Second, it takes into account the orientation (i.e. the angle between the embedding vectors) rather than the magnitude of the vectors. This was favorable for our task as we had cases of various lengths, and needed to avoid missing out on cases with diluted yet similar embeddings.

After getting similarity scores for all cases in our database, we fetched the five cases that had the highest similarity values to the input case.
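
A minimal sketch of this retrieval step follows. The three-dimensional vectors and case identifiers are made up for the example (real USE embeddings have hundreds of dimensions); the point is that two vectors with the same direction but very different magnitudes receive the same cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_cases(query, case_embeddings, top_k=5):
    """Rank stored cases by cosine similarity to the query embedding."""
    scored = [(case_id, cosine_similarity(query, vec))
              for case_id, vec in case_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy embeddings.
case_embeddings = {
    "case_a": [0.9, 0.1, 0.0],
    "case_b": [9.0, 1.0, 0.0],   # same direction as case_a, ten times the magnitude
    "case_c": [0.0, 1.0, 0.0],
    "case_d": [-1.0, 0.0, 0.2],
}
query = [1.0, 0.0, 0.0]
top_matches = most_similar_cases(query, case_embeddings, top_k=2)
print(top_matches)
```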

Risk Score Prediction

A number of regression models were trained using document embeddings as input and manually labeled risk scores as output. TensorFlow-based AutoKeras and Keras, along with XGBoost, were some of the libraries used. The best-performing model, our custom Keras sequential neural network, was selected based on root mean square error (RMSE).
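
The selection criterion itself is simple to express. In the sketch below, the labels and the predictions of the two candidate models are invented numbers, used only to show how RMSE is computed and how the model with the lowest error would be picked.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between labeled and predicted risk scores."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out labels and model predictions.
y_true = [0.2, 0.5, 0.9, 0.7]
predictions = {
    "model_a": [0.3, 0.4, 0.8, 0.6],
    "model_b": [0.5, 0.5, 0.5, 0.5],  # always predicts the same value
}

errors = {name: rmse(y_true, pred) for name, pred in predictions.items()}
best_model = min(errors, key=errors.get)  # lower RMSE is better
print(errors, best_model)
```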


Comparison of risk prediction model accuracies


Abuse Type & Risk Factor Extraction

We created more domain-specific tools to generate another source of insight via algorithms to find primary abuse types and risk factors.

For the abuse type extraction, we defined eight abuse-related verb categories such as “beat,” “molest,” and “neglect.” spaCy’s pre-trained English model en_core_web_lg and part-of-speech (POS) tagging were used to extract verbs and transform them into word vectors. Using cosine similarity, we compared each verb against the eight categories to find which abuse types most accurately capture the topic of the case.

Risk factor extraction works in a similar way, in that text was also tokenized and preprocessed using spaCy. This algorithm, however, further extended the previous abuse verbs by including additional risk words, such as “trauma,” “sick,” “war,” and “lack.” Instead of only looking at verbs, we compared each word in the case text (excluding custom stop words) against the risk words. Words that had a similarity score of over 0.65 with any of the risk factors were presented in their original form. This addition aimed to provide more transparency over what words may have affected the risk score.
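
The thresholding logic can be sketched as follows. The word vectors here are hypothetical three-dimensional values standing in for the vectors of spaCy's en_core_web_lg model, and the risk word list is shortened; only the 0.65 threshold comes from the description above.

```python
import math

# Hypothetical toy vectors; a real run would look these up in a spaCy model.
WORD_VECTORS = {
    "trauma": [0.9, 0.1, 0.0],
    "sick": [0.1, 0.9, 0.1],
    "injured": [0.8, 0.3, 0.1],
    "ill": [0.2, 0.8, 0.0],
    "school": [0.0, 0.1, 0.9],
}
RISK_WORDS = ["trauma", "sick"]
THRESHOLD = 0.65


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def extract_risk_factors(tokens):
    """Return tokens whose vector is close enough to any risk word."""
    flagged = []
    for token in tokens:
        vector = WORD_VECTORS.get(token)
        if vector is None:  # no vector available for this word
            continue
        if any(cosine(vector, WORD_VECTORS[risk]) > THRESHOLD for risk in RISK_WORDS):
            flagged.append(token)  # keep the word in its original form
    return flagged


risk_factors = extract_risk_factors(["child", "injured", "ill", "school"])
print(risk_factors)
# ['injured', 'ill']
```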


Web Application

To bring these models together in a way that ISS caseworkers could easily understand and use, a simple user interface was developed using Flask, a lightweight Python web application framework. We also created forms via WTForms and graphs via Plotly, and let Bootstrap handle the overall styling of the UI.

JavaScript code that calls the Google Translate API was incorporated into the HTML templates, enabling the translation of any page within the app into 108 languages.

For the database, we used PostgreSQL, a relational database management system (RDBMS), along with SQLAlchemy, an object-relational mapper (ORM) that allows us to communicate with our database in a programmatic way. Our dataset, excluding the five confidential case files initially provided by ISS, was seeded into the database, which was then hosted on Amazon RDS.
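
To illustrate the kind of search and filter this enables, the sketch below seeds a few rows and queries them. It deliberately swaps in Python's built-in sqlite3 module with an in-memory database instead of PostgreSQL and SQLAlchemy so that it runs without any setup. Apart from case_text, the column names and all seed values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the hosted PostgreSQL database
conn.execute(
    """
    CREATE TABLE cases (
        id INTEGER PRIMARY KEY,
        country TEXT,
        service_type TEXT,
        open_date TEXT,
        risk_score REAL,
        case_text TEXT
    )
    """
)

# Invented seed rows for illustration only.
seed_rows = [
    ("Kenya", "child abuse", "2020-01-10", 0.82, "..."),
    ("France", "migration", "2020-02-03", 0.35, "..."),
    ("Brazil", "child abuse", "2020-03-21", 0.55, "..."),
    ("India", "child abuse", "2020-04-02", 0.91, "..."),
]
conn.executemany(
    "INSERT INTO cases (country, service_type, open_date, risk_score, case_text) "
    "VALUES (?, ?, ?, ?, ?)",
    seed_rows,
)


def find_cases(connection, service_type, min_risk):
    """Filter by service type and minimum risk score, highest risk first."""
    query = (
        "SELECT id, country, risk_score FROM cases "
        "WHERE service_type = ? AND risk_score >= ? "
        "ORDER BY risk_score DESC"
    )
    return connection.execute(query, (service_type, min_risk)).fetchall()


high_risk_abuse_cases = find_cases(conn, "child abuse", 0.6)
print(high_risk_abuse_cases)
```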



The seeded database also includes fields like summary, risk score, and other model outcomes



Running models on a new case



Querying the database


A public Tableau dashboard to visualize the case files was also added, should caseworkers wish to refer to external resources and gain further insight on case outcomes.



Dashboard showcasing additional child court cases as an external point of reference (source)


The aim of this project was to assist ISS in offering services in a timely manner, bearing in mind the organizational history of knowledge available. Within eight weeks, we achieved this goal by providing an application prototype that would help caseworkers understand some of the various ways to leverage the power of data.

This tool, upon continued development, will be the first step toward ISS’s AI journey. And with enhanced capabilities, both experienced and less experienced caseworkers will be able to make better-informed decisions.


Our models do come with some limitations, mainly stemming from limited data due to privacy reasons and time constraints.

As our dataset only accounted for two types of services (child abuse and migration), and came from a small number of geographical sources, the risk score prediction model may contain biases. Bias could also have been induced by the manual labeling of risk scores, which was done based on our assumptions.

The solutions provided are not meant to replace human involvement. As 100% accuracy of a machine learning model can be difficult to achieve, the tool works best in combination with the judgment of a caseworker.

Moving Forward: AI and case management

To bring this tool closer to a production level, a few improvements can be made.

Incorporating more of the official ISS data would allow fine-tuning of the models, which would yield better results. This can be done without breaching a client confidentiality agreement, by further training the models within the organization’s secure server, or introducing a differential privacy policy that allows sharing only patterns of data.

Furthermore, risk scores can be validated by the ISS caseworkers. They can provide even more risk scores as they use our prediction model, to enable continuous learning.

The database fields can be more granular and include additional attributes as caseworkers see fit. For example, the current field “case_text” can be divided into “background_info,” “outcome,” and so on. This will create more flexibility in document search and flagging missing information.

Finally, once the app is productionized and deployed on a platform like AWS, ISS caseworkers across the globe will have access to these tools, as well as access to the entire network’s pool of resources — truly empowering caseworkers to do the work that matters.

Adopting an Agile Navigation Approach in an AI Project


By Diana Roccaro

Applying Natural Language Processing in an agile AI project to investigate Online Violence against Children.

I had the honor to contribute to the 29th Omdena project, aimed at investigating Online Violence against Children (OVAC), conducted in partnership with Save The Children. I tried to follow our kick-off meeting on August 20, 2020, sitting on a train running along the Albula Line of the Rhaetian Railway, a very scenic route that is part of a UNESCO World Heritage Site. A bird would have chosen a more direct path to cover the actual distance of 47 kilometers, but our train had to cross 55 bridges and 39 tunnels spread across 62 kilometers.

Setting Project Goals

Only when you know your goal can you select the most efficient path. The kick-off meeting, which I mostly missed due to a constantly interrupted internet connection, focused heavily on Child Sexual Abuse Materials (CSAM), a term essentially synonymous with child pornography. The problem statement handed over to us, in contrast, was intentionally kept very broad and did not specify what kind of OVAC our project team was supposed to analyze. For this reason, every task team formed during the 8 weeks of our project set its focus somewhat differently.


My attempt to categorize Online Violence against Children into distinct subclasses and to allocate project tasks to these subclasses.


The aim of our project was to apply Natural Language Processing (NLP) techniques to investigate OVAC. This article is restricted to my analysis of news articles. I am a research neuroscientist (and later on a medical writer and machine learning engineer) by education and thus clearly no expert in either the domain of OVAC or the field of journalism. But whatever we humans write, our choice of words will inevitably be influenced by our background knowledge, and hopefully also adapted to our target audience. Human language is very dynamic. Thus, whatever we try to analyze using NLP, it can’t be a bad idea to first try to obtain some minimal level of domain knowledge from experts in the field. Understanding a domain and reading about definitions and human classification systems will help us not only to select the terms we want to look for but also to define classes that our machine counterparts shall learn to recognize and predict.


Collecting Data

To retrieve news articles in digital form, we need to decide on a provider of digital news articles and search for specific terms. If that provider’s search engine offered advanced search options, we could craft a sophisticated, goal-directed query syntax. Although standard in scientific repositories, such advanced functionality is rarely built into news article search engines, which forces us to spend some time and effort selecting the best terms to search for.



Photo by Ray Hennessy on Unsplash.


So how will we select our search terms, being neither OVAC domain experts nor experts in journalistic terminology? Just like the bird trying to fly towards its target, carefully paying attention to environmental cues telling it to adapt direction, we can adopt an agile methodology, as software developers would call it. As long as we lack the bird’s perfect navigation strategy, we can simply use trial and error. We input a search term, analyze what we get back, and adjust our search accordingly. Scientific terms are unlikely to be used by reporting agencies. It may thus be wise to use different search terms to collect news articles than to collect scientific publications.

I decided to retrieve news articles from the Thai e-newspaper “Bangkok Post” and started out comparing the specificity of various search terms to our topic of interest, Online Violence against Children. Since our kick-off presentation focused on CSAM, I decided to try its non-scientific equivalent “child pornography” and found every article in the results list to be related to OVAC, and thus relevant to our project. In contrast, most other search terms also returned many irrelevant articles. For example, only 6 out of the 52 news articles (about 12%) obtained from Bangkok Post by searching for “online grooming” proved to be about OVAC.


Specificity of results retrieved using various search terms related to Online Violence against Children. The higher the OVAC-relevant fraction, the more useful the search term.


If the goal is to include only OVAC-relevant articles in our analysis, three strategies are conceivable: 1) to use only search terms that produce results 100% specific to the problem of OVAC, 2) to use various search terms and manually check each article to include only the relevant ones, or 3) to find some method to (semi-)automate filtering for the relevant articles. Since 1) would restrict the analysis to very few subclasses of OVAC, and 3) has the potential to increase the efficiency of 2), but also since I’m a curious person who had never before built a machine learning classifier to categorize text documents, I decided to build a news article classifier.


Automating the Data Collection Process

The search results I got back using my search terms broadly fell into three categories: 1) news articles on OVAC (our target class), 2) articles on violence against children, but in physical, offline form rather than online (“PVAC”), and 3) articles on online violence, but against adults instead of children (“OVAA”). I wondered whether an algorithm would be capable of picking only the project-relevant OVAC articles out of a collection of news articles of all three classes. To find out, I trained a Support Vector Machine (SVM) classifier on 209 news articles from Bangkok Post (supervised ML; 131 OVAC + 43 PVAC + 35 OVAA articles).
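
A three-class text classifier of this kind can be set up with scikit-learn in a few lines. The sketch below is not the classifier trained in the project: the TF-IDF features, the linear kernel, and the six invented one-sentence "articles" are all assumptions chosen to keep the example small and runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented miniature training set, two examples per class.
train_texts = [
    "man arrested for sharing abusive images of minors online",
    "teenager groomed by stranger through a chat app",
    "child beaten at home by relative, police say",
    "school staff accused of hitting pupils in classroom",
    "actress harassed by online trolls after interview",
    "adult victims of internet romance scam lose savings",
]
train_labels = ["OVAC", "OVAC", "PVAC", "PVAC", "OVAA", "OVAA"]

# TF-IDF turns each text into a sparse vector; the SVM learns linear boundaries.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
classifier.fit(train_texts, train_labels)

new_articles = ["police investigate chat app used to groom minors online"]
predicted = classifier.predict(new_articles)
print(predicted)
```

With so little data the prediction itself means nothing; the sketch only shows the shape of the training and prediction steps.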

Whenever we train a machine learning model, we have to decide which metric we want to optimize. My goal was not to find every single article out there on OVAC, our class of interest, but rather to end up with a collection of news articles that are all relevant to our project (OVAC-related). In other words, my aim was to maximize precision, i.e. the fraction of true positives among all articles classified as positive for our target class OVAC. (Recall, by contrast, is the fraction of all actual OVAC articles that the classifier finds.) Already in the second training attempt, I reached a precision of 100%: every article predicted by the SVM to be on OVAC was in fact on OVAC.

A classifier as reliable as this may spare us from battling our way through every single article to confirm its relevance to our project. On the other hand, such performance can only be expected on a test set of news articles that shows a class distribution similar to the article set it was trained on. Most notably, a classifier not trained to recognize entirely irrelevant articles will inevitably fail to detect them.
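
For reference, the two metrics most often confused in this setting can be computed directly from true and predicted labels. The label lists below are invented for the example: every article predicted as OVAC really is OVAC, yet one OVAC article is missed.

```python
def precision_recall(y_true, y_pred, target):
    """Precision and recall for one target class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == target and p == target)
    false_pos = sum(1 for t, p in pairs if t != target and p == target)
    false_neg = sum(1 for t, p in pairs if t == target and p != target)
    # Precision: of everything predicted as target, how much is correct?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything that truly is target, how much was found?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical test labels and predictions.
y_true = ["OVAC", "OVAC", "OVAC", "PVAC", "OVAA"]
y_pred = ["OVAC", "OVAC", "PVAC", "PVAC", "OVAA"]

precision, recall = precision_recall(y_true, y_pred, target="OVAC")
print(precision, round(recall, 2))
# 1.0 0.67
```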


Classification metrics obtained for a Support Vector Machine (SVM) Classifier trained to recognize news articles related to our target class, OVAC.


One of the advantages of news articles, if compared to research publications, is that these are published with a very short delay. News articles thus have the power to provide insights about trends over time with a much shorter time lag bias than research publications.

Quantifying the Issue

One goal stated in our problem statement included capturing the severity of the situation. Will the severity of a problem be reflected by the number of news articles published on that problem? Rather not. A case of an actor condemned for possessing child pornography may be reported by 20 news articles and followed up closely, whereas an international ring of pedophiles acting over decades may receive much less media attention and be reported by only a specific reporting agency.

Features can be extracted from news articles with the help of NLP techniques, and then converted into numeric variables with the aim to quantify the magnitude of a problem. They can surely serve as indicators pointing towards potentially underlying trends, but need to be interpreted with a healthy portion of caution and skepticism. Too many biases may play their part as well, such as the reporting bias resulting in an over-representation of the case of the famous actor, as compared to a less attention-grabbing but potentially much more severe case.

Nevertheless, I decided to give it a try and investigate trends over time with respect to the reporting of child pornography cases. Instead of using advanced NLP, I decided to rely on basic NumPy functionality and created a dummy variable for 7 selected verbs, often reported in the context of child pornography. For each article, represented by separate rows of a pandas data frame, the dummy variable would assume a value of 1, if the article contains a certain text string such as “possess”, or alternatively a value of 0 if it does not.


Appending dummy variables to a pandas dataframe to indicate the presence (1) or absence (0) of a given text string in each article.


For every reporting year, I summed up these dummy variables to obtain the total number of articles per year in which a given string was mentioned together with child pornography. I used the streamgraph package for the statistical programming language R to visualize these yearly article frequencies over the past 10 years.
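
The dummy-variable and yearly-sum steps can be sketched with pandas and NumPy as follows. The four articles, the column names, and the reduced list of two search strings are invented for illustration. Because the match is on a substring, "possess" also catches "possession" and "possessed" without any stemming.

```python
import numpy as np
import pandas as pd

# Invented miniature article collection.
articles = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019],
    "text": [
        "Man arrested for possession of illegal images.",
        "Suspect shared files in a chat group.",
        "He possessed and shared the material online.",
        "The images were shared widely.",
    ],
})

search_strings = ["possess", "share"]
for string in search_strings:
    # 1 if the lowercased article text contains the string, else 0.
    contains = articles["text"].str.lower().str.contains(string, regex=False)
    articles[string] = np.where(contains, 1, 0)

# Number of articles per year that mention each string.
yearly_counts = articles.groupby("year")[search_strings].sum()
print(yearly_counts)
```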


Yearly frequency of selected verbs reported together with child pornography. Blue: “stream”, green: “spread”, light green: “share”, yellow: “record”, light orange: “publish”, dark orange: “possess”, red: “create”.


Different NLP techniques have different requirements with respect to the cleanness of a text string. To create a word cloud, it’s beneficial to lowercase every word, so that the frequencies of the capitalized and non-capitalized versions of the same word are added together. To analyze how often the possession of pornography is mentioned in news articles, it is beneficial to use word reduction techniques such as stemming and/or lemmatization, so that the frequencies of differently inflected or derived word forms like “possess”, “possessed”, “possession”, and “possessing” are added together. (Above, I solved the same problem with an alternative approach: matching the text string “possess”.)

To avoid wasting time, it is wise to think about such requirements before jumping into actual text cleaning. Some cleaning steps won’t be required for a certain NLP analysis, whereas others will make the analysis difficult or even impossible. A highly efficient, customizable function developed by my collaborator Sijuade allowed me to clean text strings in virtually no time.


Very efficient, customizable text cleaning function developed by my collaborator Sijuade. Booleans can be adjusted to specify which of the 6 standard steps shall be applied.


I created n-grams, document term matrices, and word clouds to identify the most common words and word sequences in news articles. These most common terms include “sexual”, “social media” and “facebook”, “law” and “government”, “police”, the “Philippines”, “women” and “teacher”. In analogy to the famous “garbage in, garbage out”, these results of course reflect the principle of “search terms in, keywords out”. Moreover, they illustrate that OVAC often occurs on social media platforms, that Facebook plays a major role in the Thai social media market, and that the Thai news agency “Bangkok Post” often reports on the Philippines.
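
Counting n-grams needs nothing beyond the standard library. The sketch below counts bigrams over three invented, already cleaned headlines; it illustrates the technique only and does not reproduce the results reported above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Invented example documents.
documents = [
    "police arrest suspect on social media",
    "social media platforms warn parents",
    "parents worry about social media",
]

bigram_counts = Counter()
for document in documents:
    bigram_counts.update(ngrams(document.lower().split(), 2))

most_common_bigram = bigram_counts.most_common(1)[0]
print(most_common_bigram)
# (('social', 'media'), 3)
```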

Besides identifying the most common words or word sequences (n-grams) in a collection of text documents, it’s also relevant to investigate how these words relate to each other. I followed the instructions provided by Jason Brownlee in a Machine Learning Mastery article to train my own word embeddings on the 209 news articles from Bangkok Post, with the aim of investigating word similarities and possibly detecting previously unknown words.


Visualization of word vector embeddings trained on 209 Thai news articles of all three classes (OVAC/PVAC/OVAA). Depicted are the 9 closest neighbors (used in a most similar context) to 8 selected words: pornography, CSAM, cyberbullying, trolling, grooming, sexting, cyberharassment, and cybercrime (see bottom-right for color labels).


I then calculated cosine similarities between selected pairs of word vector embeddings to quantify the similarity between specific words or, more precisely, the similarity between the contexts in which different words typically appear. I found (“sexting” and “stalking”) and (“sexting” and “bullying”) to occur in very similar contexts (cosine similarity scores of 0.86 and 0.83, respectively), whereas (“pornography” and “bullying”) occurred in rather different contexts (0.37).

Furthermore, this visualization of word embeddings allowed me to identify another term in the family of OVAC-related problems that I had not known before: “trolling”. I could now go back to the search engine of Bangkok Post to assess how specific the news articles obtained with that search term are to the problem of OVAC.



Photo by Matheo JBT on Unsplash.

Insights and Conclusion

So, did all of these NLP techniques help me to gain valuable insights about the problem of Online Violence against Children? I would say that a substantial portion of my personal insights rather came from applying human intelligence while reading about the problem and talking to subject matter experts.

I consider an initial exchange with domain experts, in our case from Save The Children, complemented by some literature review, as invaluable for defining the scope and search terms. Amongst others, I also had the chance to interview a lady working for ECPAT International, a global network of organizations working towards ending the sexual exploitation and abuse of children worldwide.

I came to the conclusion that the prevalence and importance of the various subclasses of OVAC seem to differ by geographic location. Likewise, the barriers to overcome in the fight against OVAC likely differ by geographic location. Poverty, cultural norms, the available infrastructure, current legislation, and data protection regulations all critically determine which forms of OVAC are most present in a society, and to what extent these are considered normal or something to be prevented in the future.

In some countries, coming into touch with child pornography is a part of everyday life, already in early childhood. As long as governments don’t provide their population with alternative opportunities to earn the minimum amount of money that would allow them to make an acceptable living, some of the problems around OVAC will be difficult to change. As long as teenagers enjoy being groomed by strangers online, making them aware of the associated risks may have little effect. Providers of social media platforms or chat forums have to respect the data privacy of their users, a goal typically in conflict with efforts to prevent potential online violence against minors.

I believe that the problem of Online Violence against Children can only be sustainably prevented if all stakeholders pull together. Parents can only truly care for the wellbeing of their children if they have enough to survive. Platform providers can only help to prevent OVAC if prevention methods can be aligned with the current, typically local, regulatory requirements. And Omdena collaborators can only select the best terms to retrieve news articles if they already understand the problem to some extent. Every new insight gained can help readjust the direction, and many tunnels and bridges will still need to be crossed on the long journey to the bright final destination: an internet providing a safe place for every child on this planet.

Exploring Scientific Literature on Online Violence Against Children Using Natural Language Processing


The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.

This article is written by Wen Qing Lim, Maria Guerra-Arias, and Sijuade Oguntayo.


Textual Data  –  A Trove of Information

The amount of information available in the world is increasing exponentially year by year and shows no signs of slowing. This rapid increase is driven by expansions in physical storage and the rise of cloud technologies, allowing more data to be exchanged and preserved than ever before. This boom, while great for scientific knowledge, also has possible downsides. As the volume of data grows, so also does the complexity in managing and extracting useful information from it.

More and more, organizations are turning to electronic storage to safeguard their data. Unstructured textual information like newspapers, scientific articles, and social media is now available in unprecedented volumes.

It is estimated that about 80% of enterprise data currently in existence is unstructured data, and this continues to increase at a rate of 55–65% per year.

Unstructured data, unlike structured data, does not have clearly defined types and isn’t easily searchable. This also makes it relatively more complex to perform analysis on.

Text mining processes utilize various analytics and AI technologies to analyze and generate meaningful insights from unstructured text data. Common text mining techniques include Text Analysis, Keyword Extraction, Entity Extraction/Recognition, Document Summarization, etc. A typical text mining pipeline includes data collection (from files, databases, APIs, etc.), data preprocessing (stemming, stopwords removal, etc.), and analytics to ascertain patterns and trends.

Just as data mining in the traditional sense has proven to be invaluable in extracting insights and making predictions from large amounts of data, so too can text mining help in understanding and deriving useful insights from the ever-increasing availability of text data.

Natural Language Processing (NLP) can be thought of as a way for computers to understand and generate human natural language. This is possible by simulating the human ability to comprehend natural language. NLP’s strength comes from the ability of computers to analyze large bodies of text without fatigue and in an unbiased manner (note: unbiased refers to the process, it is possible for the data to be biased).


Online Violence Against Children

As of July 2020, there are over 4.5 billion internet users globally, accounting for over half of the world’s population. About one-third of these are children under the age of 18 (one child in every three in the world). As these numbers rise, sadly, so too does the number of individuals looking to exploit children online. The FBI estimates that at any one time, there are about 750,000 predators going online with the intention of connecting with children.

For our project, we wanted to explore how text mining and NLP techniques could be applied to analyzing the scientific literature on online violence against children (OVAC). We picked scientific articles as our focus, as these can provide a wealth of information — from the different perspectives that have been used to study OVAC (i.e. criminology, psychology, medicine, sociology, law), to the topics that researchers have chosen to focus on, or the regions of the world where they have dedicated their efforts. Text mining allowed us to collect, process, and analyze a large amount of published scientific data on this topic — capturing a meaningful snapshot of the state of scientific knowledge on OVAC.


Data Collection and Preprocessing


Our overall process flow from data collection to analysis


Our first step was to collect datasets of articles that we could find online. The idea was to scrape a variety of repositories for scientific articles related to OVAC, using a set of keywords as search terms. We built scrapers for each repository, making use of the BeautifulSoup and Selenium libraries. Each scraper was unique to the repository and collected information such as the article metadata (e.g. Title, Authors, Publisher, Date Published), the article Abstract, and the article full-text URL (where available). We also built a script to convert the full-text articles from PDF to text, using Optical Character Recognition (OCR). Only one of the repositories, CORE, had an API that directly allowed us to retrieve the full text of the articles.

Having collected over 27,000 articles across 7 repositories, we quickly realized that many articles were not relevant to OVAC. For example, there were many scientific articles about physical sexual violence against children that also mentioned some sort of online survey. These articles matched the “online AND sexual AND violence AND children” search term but were irrelevant to OVAC. Hence, we had to manually filter the scientific articles for relevance, sieving out the 95% of articles that were not related to OVAC.

Faced with such a painfully manual task, some members of the team tried out semi-automated methods of filtering. One method used clustering to find groups of papers that were similar to each other. The idea was that relevant papers would show up in the same group, while irrelevant papers would show up in their own groups. We would then only need to sift through each cluster instead of going through each individual paper, reducing the effort by a factor of roughly 10–30. However, this assumed perfect clusters, which was often not the case. The clustering method was definitely faster and filtered out 41% of articles, but it also left more irrelevant articles undetected. An alternative to clustering would be to train classifiers to identify relevant articles based on a set of pre-labeled articles. This could potentially work better than clustering, but undetected articles would still remain a limitation.

One of the perks of working with scientific articles (read: texts that have been reviewed rigorously) is that minimal data cleaning is required. Steps that we would otherwise have to take when dealing with free texts (e.g. translating slang, abbreviations, and emojis, accounting for typos, etc.) are not needed here. Of course, text pre-processing steps like stemming, stop-word removal, punctuation removal, etc. are still required for some analysis, like clustering or keyword analysis.


Drawing insights from text analysis regarding online violence against children

Armed with a set of relevant articles, the team set off to explore methods for extracting insights from the dataset. We tried a variety of methods (e.g. TF-IDF, Bag of Words, clustering, and market basket analysis) in search of answers to a set of questions that we aimed to explore. Some analyses were limited by the nature of the dataset: in keyword analysis, for example, the data contains a lot of noise and random words, so some trends and patterns emerge but are not very conclusive. Others, such as the clustering and market basket analysis described below, showed great potential in picking out useful insights.



Keywords Analysis

Based on the title and abstract texts, we were able to generate a word cloud of the most frequent terms appearing in the OVAC scientific literature. We also used TF-IDF vector analysis to explore the most relevant words, bigrams, and trigrams appearing in the title and abstract texts in each publication year. This allowed us to chart the rise of certain research topics over time. For example, around the years 2015 and 2016, terms related to “travel” and “tourism” began to appear more often in the OVAC literature, suggesting that this problem received greater research attention in this period.
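As a rough illustration of the per-year keyword analysis, the sketch below pools the text of each publication year into one document and ranks single words by TF-IDF. The records, the tiny stop-word list, and the function name `top_terms_by_year` are hypothetical, and the project itself also looked at bigrams and trigrams, which are left out here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy records: (publication year, title + abstract text).
records = [
    (2014, "online grooming of children in chat rooms"),
    (2014, "risk factors for online grooming"),
    (2016, "child sex tourism and travel offenders"),
    (2016, "travel and tourism in online child exploitation"),
]

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOPWORDS = {"of", "in", "for", "and"}

def top_terms_by_year(records, k=3):
    # Treat all text published in one year as a single document.
    year_counts = defaultdict(Counter)
    for year, text in records:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        year_counts[year].update(tokens)
    n_years = len(year_counts)
    # Document frequency: in how many years does each term occur?
    df = Counter(term for counts in year_counts.values() for term in counts)
    result = {}
    for year, counts in year_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_years / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[year] = [term for term, _ in ranked[:k]]
    return result

top = top_terms_by_year(records)
```

In this toy data, "online" appears in both years, so its inverse document frequency is zero and it never ranks as a distinctive term, while words that occur in only one year rise to the top of that year's list.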


Word cloud of title and abstract texts from over 1300 scientific articles on online violence against children. Source.



Geographical Market Basket Analysis


Heat map of the Lift between country pairs. A lift of more than 1 suggests that the presence of one country increases the probability that the other country will also appear in the article. The larger the lift, the more likely they would appear together.


We conducted a Market Basket analysis to find out which countries were likely to appear in the same article. This could potentially give insight into the networks of countries involved in OVAC. While we noticed that many countries appear together because they were geographically close, there were also exceptions.

From the heat map above, this includes country pairs like Malaysia-US, Australia-Canada, Australia-Philippines, and Thailand-Germany. Upon investigation, we found that:

  • Most articles contain these pairs because of exemplification.
  • Some are listed as the countries where survey or study respondents were located. (E.g. Thailand and Germany were mentioned as part of a 6-country survey of adolescents.)
  • More interestingly, there were also articles that mentioned pairs of countries due to offender-victim relationships. (E.g. an article studying offenders in Australia mentioned that they preyed on child victims in the Philippines.)
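To make the lift measure concrete, here is a minimal sketch that computes it for every pair of countries that co-occur in at least one article. The article sets are invented for illustration, and `pairwise_lift` is a hypothetical helper rather than the code used in the project. Lift is the probability of the two countries appearing together divided by the product of their individual probabilities.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: the set of countries mentioned in each article.
articles = [
    {"Australia", "Philippines"},
    {"Australia", "Philippines", "Canada"},
    {"Thailand", "Germany"},
    {"Canada"},
    {"Germany"},
    {"Australia"},
]

def pairwise_lift(articles):
    n = len(articles)
    # How many articles mention each country, and each pair of countries.
    single = Counter(country for art in articles for country in art)
    pair = Counter(
        pair_key
        for art in articles
        for pair_key in combinations(sorted(art), 2)
    )
    lift = {}
    for (a, b), together in pair.items():
        # lift(a, b) = P(a and b) / (P(a) * P(b))
        lift[(a, b)] = (together / n) / ((single[a] / n) * (single[b] / n))
    return lift

lift = pairwise_lift(articles)
```

Here Australia appears in 3 of 6 articles, the Philippines in 2, and both together in 2, giving a lift of (2/6) / ((3/6) × (2/6)) = 2: the pair co-occurs twice as often as it would if the two countries were mentioned independently.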


Topics Clustering Analysis

Another of our solutions used machine learning to separate the documents into different clusters defined by topics. A secondary motive was to explore the possibility that the different documents can form a network of communities not only based on their topics, but also on how the documents relate to each other.

The Louvain method for community detection is a popular clustering algorithm used to understand the structure of large networks and to detect communities within them. The TF-IDF representations of the documents were used to build a matrix of pairwise cosine similarities between documents. The clustering algorithm detected 5 distinct communities.
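The similarity step can be sketched as follows. This is only the graph-building part, under the assumption that each document is a sparse dictionary of TF-IDF-style weights; the documents, weights, and threshold are made up, and the community detection itself would be handed to an existing Louvain implementation.

```python
import math

def cosine_similarity(u, v):
    # u and v are sparse vectors stored as {term: weight} dictionaries.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF-style weights for three documents.
docs = {
    "doc_a": {"offender": 0.8, "grooming": 0.6},
    "doc_b": {"offender": 0.6, "grooming": 0.8},
    "doc_c": {"legislation": 1.0},
}

def similarity_edges(docs, threshold=0.1):
    # Keep one weighted edge per document pair whose similarity exceeds
    # the threshold; weaker pairs are left unconnected.
    names = sorted(docs)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine_similarity(docs[a], docs[b])
            if sim > threshold:
                edges.append((a, b, sim))
    return edges

edges = similarity_edges(docs)
```

The resulting weighted edge list is the kind of input a community detection algorithm works on: documents that share vocabulary end up strongly connected, while a document with no terms in common (`doc_c` here) stays isolated.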

A manual inspection of the documents in each cluster suggested the following topics –

  • Institutional, Political (legislative) & Social Discourse
  • Online Child Protection — Vulnerabilities
  • Technology
  • Analysis of Offenders
  • Commercial Perspective & Trafficking


Bar chart of Frequency of Articles by Topic


The first two topics appear to be the most published, while Commercial Perspective & Trafficking is the least. We noticed that the clusters and structure detected by the algorithm could be visualized as a graph network. The articles are represented as nodes; nodes of the same topic are grouped together and share a color; and the strength of the relationship between nodes, defined by the cosine similarity, is represented by links (edges). Below is a visual representation of the structure of the graph network:


Structure of Graph Network — Articles were labeled according to community detection clustering and relationships defined by the cosine similarity between the documents. Other information like the text, date, published data, and URL of the papers were stored as properties of the nodes (vertices), and the links (edges) were defined as the cosine similarity value between documents.


One advantage of restructuring the data in this manner is that it allows the data to be stored in a graph database. While traditional relational databases work exceptionally well at capturing repetitive, tabular data, they do not do as well at storing and expressing relationships between the entities within the data. A database that embraces this structure can more efficiently store, process, and query connections. Complex analysis can be done on the data by using a pattern and specifying starting points. Graph databases can efficiently explore the data connected to those starting points, collecting and processing information from nodes and relationships while leaving out data outside the search pattern.


Challenges and Limitations

The major challenges we faced were related to compiling our dataset. Only one of the repositories we used, CORE, granted API access, which greatly sped up the process of obtaining data. For the rest, the need to build custom scraping scripts meant that we could only cover a limited number of repositories. Other open repositories, such as Semantic Scholar, resisted our scraping efforts, while others, such as Web of Science or EBSCOhost, are completely walled off to non-subscribers. The great white whale of scientific article repositories, Google Scholar, also eluded us. Here, search results are purposefully presented in such a way that it is not possible to extract the full abstract texts, although some other researchers with a lot of time and effort have had greater success with scraping it.

As shown, we were able to conduct a range of interesting and meaningful analyses using just the abstracts of the scientific articles, but to go further in our research would require overcoming the challenges related to obtaining the full text of the scientific articles. Even after developing a custom tool to extract text from PDFs, we still faced two challenges. Firstly, many articles were paywalled and could not be accessed, and secondly, the repositories we scraped did not systematically link to the PDF page of the article, so the tool could not be utilized across our whole dataset.

If a would-be data scientist is able to clear all these hurdles, a final barrier to extracting information from scientific articles remains. The way in which scientific texts are structured, with sections such as “Introduction”, “Methodology”, “Findings” and “Discussion”, varies greatly from one article to the next. This makes it especially difficult to answer specific questions such as “What are the risk factors of being an offender of OVAC?” that require searching for information in a specific section of the text, although it is less of an issue if you are seeking to answer more general questions, such as “How has the number of research papers changed over time?”. To overcome the difficulty of extracting specific information from unstructured text, we built a Neural Search Engine powered by Haystack that uses a DistilBERT transformer to search for answers to specific questions in the dataset. However, it is currently a proof of concept and requires further refinement before it can answer such questions accurately.

These challenges create some limitations for the findings of our analysis. We cannot say that our dataset captures the entirety of scientific research into online violence against children, but rather just that which was contained in the repositories that we were able to access — we do not know if this could bias our results in some way (for example, if these repositories were more likely to contain papers from certain fields, or from certain parts of the world). It is worth noting that we conducted all of our searches in English. Fortunately, scientific article abstracts are often translated, so we were able to analyze the text of these, even when the original language of the article was different.

Another limitation is that as scientific knowledge increases over time, more recent articles could be more relevant than historical ones if certain theories or assumptions are later found to be incorrect with further research. However, all articles were given equal weight in our analysis.

Given the challenges that we faced with accessing repositories and articles and the incalculable benefits of greater data openness in scientific research, it is interesting to discuss an initiative that is working toward that goal. A team at Jawaharlal Nehru University (JNU) in India is building a gigantic database of text and images extracted from over 70 million scientific articles. To overcome copyright restrictions, this text cannot be downloaded or read, but only queried through data mining techniques. This initiative has the potential to radically transform the way that scientific articles are used by researchers, opening them up for exploration using the entirety of the data science toolkit.



We have demonstrated in this case study how text mining and NLP techniques can aid the analysis of scientific literature at every step of the way: from data collection to cleaning to gaining meaningful insights from text. While full texts helped us to answer more specific questions, we found that using just abstracts was often sufficient to gain useful insights. This shows great potential for future abstract-only analysis in cases where access to full-text articles is limited.

Our work has helped Save the Children to understand OVAC and its research space better, and similar types of analysis can benefit other NGOs in many ways. These include:

  • understanding a topic
  • having an overview of the types of research efforts
  • understanding research gaps
  • identifying key resources (e.g. datasets often quoted in papers, most common citations, most active researchers/publishers, etc.)


There are also many other possibilities of NLP methods to extract insights from scientific papers that we have not tried. Here are some ideas for future exploration:

  • Extracting sections from scientific articles — articles are organized in sections, and if we can figure out a way to split articles up into sections, it would be a great first step towards a more structured dataset!
  • Named Entity Recognition — From figuring out which entities are being discussed to using these to answer specific questions, NER unlocks a ton of possible applications.
  • Network Analysis using Citations — This could be an alternative method to cluster the articles, or it could also help to identify the ‘influential’ articles or map out the progress of research.

Internet Safety for Children: Using NLP to Predict the Risk Level of Online Games, Websites, and Applications


The following work is part of the Omdena AI Challenge on improving internet safety for children, implemented in collaboration with John Zoltner at Save the Children US.

This blog was written by Sabrina Carlson and co-authored by Erum Afzal. Contributors include Anna Kolbasko, Juber Rahman, Erum Afzal, Mateus Broilo, Rahul Gopan, Rubens Carvalho, Vinod Rangayyan, Adele C, and Rosana de Oliveira Gomes.


The Problem

Save the Children is a humanitarian organization that aims to improve the lives of children across the globe. In line with the United Nations’ Sustainable Development Goal 16.2 to “end abuse, exploitation, trafficking, and all forms of violence and torture against children,” Save the Children and Omdena collaborated to use artificial intelligence to identify and prevent online violence against children. Utilizing numerous data sources and a combination of artificial intelligence techniques, such as natural language processing (NLP), this project’s collaborators aimed to produce meaningful insights into online violence against children and ways to prevent it. One area of concern is online games, websites, and applications that are popular with children, and a number of collaborators targeted this space in hopes of guarding children against online predators in the future.


What We Did

The Common Sense Media website provides expert advice, useful tools, and objective ratings for countless movies, television shows, games, websites, and applications to help parents make informed decisions about which content they want their children to consume. Particularly useful for this project, parents and children can review games, applications, and websites on the Common Sense Media site. A number of Omdena collaborators had the idea to build web scrapers to collect parent and child reviews of the games, applications, and websites that are popular with children, and to use natural language processing to identify which platforms are high risk for online violence against children.

The first step was to scrape Common Sense Media to collect game, application, and website reviews from both parents and children. To do so, we used ParseHub, a user-friendly tool for extracting data from websites, to build the web scrapers. Using ParseHub, we set up three different configurations to scrape parent and child reviews of all games, applications, and websites that Common Sense Media has determined to be popular among children.

The resulting dataset includes the following features:

  • 40,433 observations (reviews) from 995 different games/apps/websites
  • Platform type (game, application, website)
  • The risk level for online (sexual) violence against children
  • Indicators for each platform’s content related to positive messages, positive role models/representations, ease of play, violence, sex, language, and consumerism. Common Sense Media provides objective ratings (on a scale of 0–5) for these indicators for the digital content included on the site. We focused on the sex indicator and re-labeled it as CSAM (child sexual abuse material). We determined a platform to be high risk for CSAM if its sex rating was greater than 2 and assigned a platform a low-risk CSAM label if its sex rating was lower than 2.
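The labeling rule for the sex indicator can be written as a small function. This is a sketch of the rule as stated in the list above; the function name is hypothetical, and because the text does not say how a rating of exactly 2 was treated, the sketch returns `None` for that case rather than guessing.

```python
def csam_risk_label(sex_rating):
    # Ratings come from Common Sense Media's 0-5 scale for the sex indicator.
    if sex_rating > 2:
        return "high"
    if sex_rating < 2:
        return "low"
    # A rating of exactly 2 is not covered by the rule described in the text.
    return None

labels = [csam_risk_label(rating) for rating in (0, 1, 3, 5)]
```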

Figure 1 plots the top 20 platforms in terms of the number of reviews.


Figure 1. 20 Most Popular Platforms by Number of Reviews / Source: Omdena


Figure 2 displays the number of reviews for high and low-risk games, applications, and websites. As illustrated in the figure, there are nearly 25,000 reviews for low-risk platforms, whereas there are close to 16,000 reviews for high-risk platforms.


Figure 2. Number of Reviews by CSAM Risk Level / Source: Omdena


Data Sampling

We randomly sampled 50% of the data in order to process the data in a more efficient way. The following graphic illustrates the code used to sample 50% of the data.
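A minimal sketch of such a sampling step, assuming the reviews are held in a plain Python list, could look like the following. The seed value and the helper name `sample_half` are hypothetical; with the data in a pandas data frame, `DataFrame.sample(frac=0.5)` would be the usual equivalent.

```python
import random

def sample_half(rows, seed=42):
    # A dedicated, seeded generator keeps the sample reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    # Sampling without replacement, so no review is picked twice.
    return rng.sample(rows, k=len(rows) // 2)

reviews = [f"review {i}" for i in range(10)]
subset = sample_half(reviews)
```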



Data Cleaning

It is necessary to clean the data in order to build a successful NLP model. To clean the review messages, we created a function called “clean_text” and used it to perform several transformations, including the following:

  • Converting the review text to all lower-case letters
  • Tokenizing the review text (i.e., splitting the text into words) and removing punctuation marks
  • Removing numbers and stopwords (e.g., a, an, the, this)
  • Using the WordNet lexical database to assign Part-Of-Speech (POS) tags. The POS tags are used to attach labels to words that correspond to a noun, verb, etc.
  • Lemmatizing the words, i.e., transforming them to their roots (e.g., games → game, played → play)
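The first three steps above can be sketched with the standard library alone. This is not the project's `clean_text` function: the stop-word list is a tiny illustrative one, and the POS-tagging and lemmatization steps, which rely on WordNet, are deliberately left out, so "played" stays "played".

```python
import string

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "this", "is", "and", "i", "it", "to", "of"}

def clean_text(text):
    # Step 1: lower-case everything.
    text = text.lower()
    # Step 2: split on whitespace and strip punctuation from each token.
    tokens = [tok.strip(string.punctuation) for tok in text.split()]
    # Step 3: drop empty tokens, tokens containing digits, and stop words.
    tokens = [
        tok for tok in tokens
        if tok and not any(ch.isdigit() for ch in tok) and tok not in STOPWORDS
    ]
    return " ".join(tokens)

cleaned = clean_text("This game is FUN, and I played it 3 times!")
```

For the sample sentence, the result is `game fun played times`.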

Figure 3 provides an example of the reviews pre- and post-cleaning. In the “review” column, the text has not been cleaned, while the “review_clean” column includes text that has been lemmatized, tagged for POS, tokenized, etc.


Figure 3. Sample of Cleaned Text / Source: Omdena


Feature Engineering

Before applying the models, we performed some feature engineering, including sentiment analysis, vector extraction, and TF-IDF.


Sentiment Analysis

The first feature engineering step was conducting sentiment analysis. The sentiment analysis was performed on the reviews to gain insight into how parents and children feel about hundreds of games, applications, and websites that are popular with children. We used Vader, which is part of the NLTK module, for the sentiment analysis. Vader uses a lexicon of words to identify positive or negative sentiments in sentences. It also takes into account the context of the sentences to determine the sentiment scores. For each text, Vader returns the following four values:

  • Negative count score
  • Positive count score
  • Neutral count score
  • The compound — an overall score that summarizes the previous scores
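To give a feel for how a lexicon-based scorer produces these four values, here is a deliberately simplified toy version. It is not Vader: the lexicon entries and their valences are made up, context rules such as negation and intensifiers are ignored, and the negative, neutral, and positive values are computed as simple shares of tokens. The compound normalization, x / sqrt(x² + α) with α = 15, follows the open-source Vader code as we understand it.

```python
import math

# Made-up valence scores for illustration; Vader ships a much larger lexicon.
LEXICON = {
    "fun": 2.0, "cute": 1.5, "exciting": 2.0,
    "violence": -2.5, "horror": -2.0, "dead": -3.0,
}

def toy_sentiment(text, alpha=15):
    tokens = text.lower().split()
    valences = [LEXICON.get(tok, 0.0) for tok in tokens]
    total = sum(valences)
    n = len(tokens) or 1  # avoid dividing by zero on empty input
    return {
        # Share of tokens that are negative, neutral, and positive.
        "neg": sum(v < 0 for v in valences) / n,
        "neu": sum(v == 0 for v in valences) / n,
        "pos": sum(v > 0 for v in valences) / n,
        # Squash the summed valence into the open interval (-1, 1).
        "compound": total / math.sqrt(total * total + alpha),
    }

scores = toy_sentiment("fun and cute game")
```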

Figure 4 displays a sample of cleaned reviews containing negative, neutral, positive, and compound scores.


Figure 4. Sample of Sentiment Analysis Scores / Source: Omdena


Extracting Vectors

In the next step, we extracted vector representations for every review. Using the module Gensim, we were able to create a numerical vector representation for every word in the corpus using the contexts in which they appear (Word2Vec). This is performed using shallow neural networks. Extracting vectors in this way is interesting and informative because similar words will have similar representation vectors.

Entire texts can also be transformed into numerical vectors (Doc2Vec). We can use these vectors as training features because similar texts will also have similar representations.

It was first necessary to train a Doc2Vec model by feeding in our text data. By applying this model to the review text, we are able to obtain the representation vectors. Finally, we added the TF-IDF (Term Frequency — Inverse Document Frequency) values for every word and every document.

But why not simply count the number of times each word appears in every document? The problem with this approach is that it does not take into account the relative importance of the words in the text. For instance, a word that appears in nearly every review would not likely bring useful information for analysis. In contrast, rare words may be much more meaningful. The TF-IDF metric solves this problem:
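One common textbook form of the metric is given below; it may differ in smoothing details from the exact variant used in the project. Here $N$ is the total number of reviews, $\mathrm{tf}(t, d)$ is the number of times term $t$ appears in review $d$, and $\mathrm{df}(t)$ is the number of reviews that contain $t$.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

A word that appears in every review has $\mathrm{df}(t) = N$, so its weight is $\log 1 = 0$, whereas a word found in only a few reviews receives a large weight. Libraries often apply a smoothed version; scikit-learn's default, for example, uses $\ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$ for the inverse document frequency.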



The Term Frequency (TF) counts the number of times the word appears in the text, while the Inverse Document Frequency (IDF) computes the relative importance of the word depending on the number of texts (reviews) in which the specific word is found. We added TF-IDF columns for every word that appeared in at least 10 different texts. This step allowed us to filter out a number of words and, subsequently, reduce the size of the final output. Figure 5 provides the code used to apply TF-IDF and assign the resulting columns to the data frame, and Figure 6 displays the output of the sample code.


Figure 5. TF-IDF Code Sample / Source: Omdena

Figure 6. TF-IDF Sample Code Output / Source: Omdena


Exploratory Data Analysis

The EDA produced a number of interesting insights. Figure 7 provides a sample of reviews that received high negative sentiment scores, and Figure 8 displays a sample of reviews with high positive sentiment scores. The sentiment analysis successfully assigned negative sentiments to reviews with text such as “violence, horror, dead.” The analysis also effectively assigned positive sentiments to reviews containing words such as “fun, cute, exciting.”


Figure 7. Sample of Reviews with High Negative Scores / Source: Omdena


Figure 8. Sample of Reviews with High Positive Scores / Source: Omdena


Figure 9 shows the distribution of compound sentiment scores for reviews of high-risk and low-risk platforms. Vader tends to score reviews of low-risk platforms as positive, whereas reviews of high-risk platforms tend to have lower compound sentiments. This suggests that the extracted sentiment features are helpful in modeling the risk level.


Figure 9. High_Low Risk Distribution over Compound Sentiments / Source: Omdena


Modeling High-Risk Games/Applications/Websites

After we successfully scraped the reviews, built the dataset, cleaned the data, and performed feature engineering, we were able to build an NLP model. We chose which features (reviews and clean reviews) to use to train our model.

Then, we split our data into two parts:

  • A training set for fitting the model
  • A test set for assessing the model performance

After selecting the features and splitting the data into training and test sets, we fit a Random Forest classification model and used the reviews to predict whether a platform is high risk for CSAM. Figure 10 displays the code used to fit the Random Forest classifier and obtain the metrics.


Figure 10. Random Forest Classifier Code Sample / Source: Omdena


Figure 11 displays a sample of features and their respective importance. The most important features are indeed the ones that were obtained in the sentiment analysis. In addition, the vector representations of the texts were also important in our training. A number of words appear to be fairly important as well.


Figure 11. Feature Importance / Source: Omdena


The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) allow one to evaluate how well a model performs in terms of its ability to distinguish between classes (high/low risk for CSAM in this context). The ROC curve, which plots the true positive rate against the false positive rate, is displayed in Figure 12. The AUC is 0.77, which indicates that the classifier performed at an acceptable level.
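One way to read the AUC is as the probability that a randomly chosen high-risk example receives a higher score than a randomly chosen low-risk one. The sketch below computes it directly from that definition on invented labels and scores; it is an illustration of the metric, not the evaluation code used in the project, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = high risk) and predicted scores.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = roc_auc(labels, scores)
```

In this toy data, 5 of the 6 positive-negative pairs are ordered correctly, so the AUC is 5/6, or about 0.83. A model that ranks at random would score about 0.5.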


Figure 12. ROC Curve / Source: Omdena


The Precision-Recall (PR) Curve is illustrated in Figure 13. The PR curve is graphed by simply plotting the recall score (x-axis) against the precision score (y-axis). Ideally, we would achieve both a high recall score and a high precision score; however, there is often a trade-off between the two in machine learning. The scikit-learn documentation states that the Average Precision (AP) “summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.” The AP here is 0.72, which is an acceptable score.
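The quoted definition can be turned into a few lines of code. The sketch below walks down the examples from highest to lowest score and, each time a positive is found, adds the current precision weighted by the resulting increase in recall. The data is invented, tied scores are not given any special treatment, and the function is an illustration rather than a replacement for the library routine.

```python
def average_precision(labels, scores):
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    hits = 0
    ap = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision = hits / rank
            # Each positive found raises recall by 1 / total_pos.
            ap += precision / total_pos
    return ap

# Hypothetical labels (1 = high risk) and predicted scores.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(labels, scores)
```

The three positives sit at ranks 1, 2, and 4, with precisions 1, 1, and 3/4, so the AP is (1 + 1 + 0.75) / 3, or about 0.92.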


Figure 13. Precision-Recall Curve / Source: Omdena


It is evident in Figure 13 that the precision decreases as we increase the recall. This indicates that we have to choose a prediction threshold based on our specific needs. For instance, if the end goal is to have a high recall, we should set a low prediction threshold that will allow us to detect most of the observations of the positive class, though the precision will be low. On the contrary, if we want to be really confident about our predictions and are not set on identifying all the positive observations, we should set a high threshold that will allow us to obtain high precision and a low recall.
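The trade-off can be seen on a handful of invented predictions by computing precision and recall at two different thresholds. The helper name and the data are hypothetical; the convention of reporting a precision of 1.0 when nothing is predicted positive is a choice made for this sketch.

```python
def precision_recall_at(labels, scores, threshold):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    # With no positive predictions, precision is reported as 1.0 here.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = high risk) and predicted scores.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
low = precision_recall_at(labels, scores, threshold=0.5)
high = precision_recall_at(labels, scores, threshold=0.75)
```

With the low threshold, every high-risk example is caught (recall 1.0) at the cost of one false alarm (precision 0.75). With the high threshold, every flagged example is truly high risk (precision 1.0), but one of the three is missed (recall 2/3).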

To determine whether the model we built performs better than another classifier, we can simply compare AP scores. To assess the quality of our model, we can compare it to a simple baseline: a random classifier that assigns 0 half the time and 1 the other half of the time. Our model's scores (an AUC of 0.77 and an AP of 0.72) are better than those of a random classifier, whose AUC would be about 0.5.


Conclusion and Observations

It is possible to use only raw text as input to make predictions. The most important aspect is to be able to extract the relevant features from a raw data source. Such data can often complement data science projects, allowing one to extract more meaningful and useful features and increase the model’s predictive power.

We were only able to predict the platform’s risk through user reviews, and it is possible that the reviews are biased. To improve the precision of our predictive model, we can triangulate other features such as player sentiments, game titles, UX/UI features, and in-game chats. Used in combination, these features can provide a number of insightful recommendations. Our predictive model will shed light on CSAM risk in online games, applications, and websites that are popular with children by automatically detecting each platform’s risk level. In the future, we hope that parents will be able to better select platforms for their child’s use based on our use of AI.

A Chatbot Warning System Against Online Predators


Using Natural Language Processing to warn children against online predators.

The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.


Protecting Children

Today, children face an evolving threat: online violence. Violence against and harassment of children have been growing exponentially for more than 20 years, and with recent events closing schools for over 1 billion children around the world, children are more vulnerable than ever. Online predators use Internet avenues popular with children and adolescents, such as game chat rooms, to lure them into sexual exploitation and even in-person assaults.

Protection against online sexual violence varies greatly from platform to platform. Some gaming platforms include a profanity filter that looks for problematic words and replaces them with a string of asterisks. Outside the gaming platforms, many chat platforms still do not have any safeguards in place to protect children from predatory adult conversations. However, chat logs can provide information on how a predator might attempt to lure children and young adults into risky situations, such as sharing photos, using web cameras, and sexting (sexual texting).
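
A word-level profanity filter of the kind described above can be sketched in a few lines. This is an illustration only: the blocked-word list is a harmless placeholder, and the function name `mask_profanity` is hypothetical rather than taken from any platform.

```python
import re

# Placeholder list; a real platform would maintain a much larger one.
BLOCKED_WORDS = {"darn", "heck"}


def mask_profanity(message, blocked=BLOCKED_WORDS):
    """Replace each blocked word with asterisks of the same length."""
    alternatives = "|".join(re.escape(word) for word in sorted(blocked))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: "*" * len(match.group(0)), message)


print(mask_profanity("Well, heck, that is a DARN shame"))
```

Such a filter only reacts to individual words, so a conversation built entirely from innocent-looking vocabulary passes through unchanged.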

Often, pattern recognition techniques provide an automated identification of these conversations for potential law enforcement intervention. 

Unfortunately, this strategy uses many man-hours and spans many message categories, which makes it all the more difficult to identify these patterns. It is a challenging task, but one that is worth tackling, and we elaborate on our approach in the rest of the article.


Online Predators



First, let us establish our working definitions.

  • The New Oxford American Dictionary defines a predator as a “person or group that ruthlessly exploits others.”
  • Expanding the term, a sexual predator is “a person seen as obtaining or trying to obtain sexual contact or favor with another person in a metaphorically ‘predatory’ manner” (Daniel M. Filler, Virginia Journal of Social Policy & the Law, 2003).


The Solution — Data Engineers Unite!


The team focused on the solution’s Predator Analysis portion.


Our solution aimed to reduce man-hours and to add a near real-time warning system to the chat that alerts the child when the conversation changes sentiment. The team used a semi-supervised approach to evaluate whether a conversation poses a low, medium, or high risk to the child. The system evaluates each phrase or sentence and returns a sentiment warning if warranted. The data for our chatbot (Predator-Pseudo-Victim conversations) was collected from interactions between a predator and a law enforcement officer or volunteer posing as a child.

The chatbot was designed to learn from non-predatory and predatory conversations and distinguish between them. Additionally, it would have the ability to recognize inappropriate messages no matter whether they came from the predator or the child’s side. The corpus also had adult-like conversations initiated from the child’s side.


The Dataset

The team consolidated and cleaned nearly 500 chat log files that contained exchanges between a predator and a pseudo-victim. The collection grew into a corpus containing 807,000-plus messages ranging from “hello” to explicit remarks. Creating the dataset proved laborious: I volunteered more than 630 hours on labeling alone. The data received labels such as gender (male or female, as participants identified themselves in the chats), role (predator or victim), and the level of risk of the conversation. Nearly half of the project time was dedicated to properly building and parsing the dataset.

This dataset was split into training, development, and test sets. The training set held 75 percent of all messages, from which the chatbot learned the contextual format and nuances of conversation. The development set, 10 percent of the data, was held out from the chatbot until after model selection to confirm the validity of the model.
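
A split along these lines can be sketched as follows. The 75 and 10 percent shares come from the text; treating the remaining 15 percent as the test set is our assumption here, as are the function name `split_messages` and the fixed seed.

```python
import random


def split_messages(messages, train_frac=0.75, dev_frac=0.10, seed=42):
    """Shuffle once, then cut into training, development, and test sets."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # assumed: whatever remains (15 percent)
    return train, dev, test


# Stand-in messages; the real corpus held more than 807,000.
messages = ["message %d" % i for i in range(1000)]
train, dev, test = split_messages(messages)
print(len(train), len(dev), len(test))
```

This sketch splits at the message level, as the text describes. Splitting by whole conversation instead would keep messages from the same chat out of both the training and evaluation sets.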

This two-minute video briefly discusses how the team assembled the chatbot’s dataset.



Data Format and Storage

The data was housed in a relational database. It became large enough to serve as a nexus to provide uniquely formatted datasets for the machine learning pipeline.

During the labeling process, a few issues arose on how to semantically define a conversation. With many different log formats, ranging from AOL Messenger to SMS and other online platforms, the sentences would start and stop at different points. For conversations, I adopted a format similar to the one used in Cornell University’s widely used movie corpus, which provides a standard structure that makes the data easy to parse. Additionally, the corpus contained chat slang, abbreviations, and number-for-word substitutions, like “l8r” for “later” and “b4” for “before”, which required a team consensus on how to handle these tokens. The team did not focus on timestamps due to extremely varied formatting, missing values, and lack of importance to the overall project.
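
To illustrate both points, the sketch below parses one line in a Cornell-style layout (fields separated by “ +++$+++ ”) and expands the two slang examples mentioned above. The field names, the sample line, and the two-entry mapping are assumptions for illustration and not the project’s actual schema or slang table.

```python
SEPARATOR = " +++$+++ "  # field separator used by the Cornell movie corpus

# Only the two examples from the text; a real table would be far longer.
SLANG = {"l8r": "later", "b4": "before"}


def parse_line(raw):
    """Split one line into hypothetical fields: id, speaker, role, text."""
    line_id, speaker, role, text = raw.split(SEPARATOR, 3)
    return {"line_id": line_id, "speaker": speaker, "role": role, "text": text}


def normalize_slang(text, mapping=SLANG):
    """Replace whitespace-separated tokens that appear in the slang mapping."""
    return " ".join(mapping.get(token.lower(), token) for token in text.split())


raw = "L17 +++$+++ u2 +++$+++ pseudo_victim +++$+++ talk 2 u l8r b4 dinner"
record = parse_line(raw)
record["text"] = normalize_slang(record["text"])
print(record)
```

Tokens with attached punctuation (for example “l8r!”) would slip past this simple lookup, which is one reason handling such tokens needs an explicit team decision.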


The Model

Many models presented themselves as candidates for the chatbot’s internal workings. The main goal for the team was to have a local, offline solution for now, which reduces privacy concerns and legal issues. Future iterations of this project would evaluate these features with appropriate development operations.


Basic sequence-to-sequence model diagram.


The selected model centered on a Long Short-Term Memory (LSTM) network cell, arranged in a sequence-to-sequence configuration. LSTMs have long proven well suited to working with sequential data. Our application uses this ability to help the chatbot predict the next plausible word in its response.

For the sentiment analysis portion, we focused our efforts on an ensemble learning model as well as a support vector machine to help predict when the conversation changed from benign to risky.



Our team successfully built a chatbot and a sentiment analysis model independently. The chatbot learned from more than 807,000 messages how to parse sentences and structure a proper response. Its limited vocabulary stemmed from limited training time and framework limitations.

The greatest challenge to code performance centered on the chosen platform, TensorFlow 1.0.0, and its limitations. The code did provide a conversation-capable entity, but the model needs more training data if we want to go beyond proof-of-concept and deploy it in an application.

The project successfully employed message sentiment analysis and was able to warn the user of potentially risky conversations initiated by online predators. The sentiment analysis classified conversations as low, medium, or high risk.
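
The warning logic can be pictured as follows: each message receives a risk score from the sentiment model, the score is mapped to low, medium, or high, and an alert is raised whenever the level rises. This is a minimal sketch under assumptions, since the scores, the cut-offs of 0.4 and 0.7, and the function names are all hypothetical.

```python
LEVELS = ("low", "medium", "high")


def risk_level(score, medium_at=0.4, high_at=0.7):
    """Map a risk score in [0, 1] to a level; the cut-offs are assumptions."""
    if score >= high_at:
        return "high"
    if score >= medium_at:
        return "medium"
    return "low"


def warnings_for(scores):
    """Return (message index, level) each time the level rises above the previous one."""
    alerts = []
    previous = "low"
    for index, score in enumerate(scores):
        level = risk_level(score)
        if LEVELS.index(level) > LEVELS.index(previous):
            alerts.append((index, level))
        previous = level
    return alerts


# Hypothetical per-message scores from a sentiment model.
scores = [0.1, 0.2, 0.45, 0.5, 0.8, 0.3, 0.75]
alerts = warnings_for(scores)
print(alerts)
```

Alerting only on escalation avoids repeating the same warning on every message while a conversation stays at the same level.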

Future work will move this project into a fully functioning TensorFlow 2.1.0 environment, eliminating other frameworks, including PyTorch. The internal model will receive an update to the LSTM structure, and performance will be improved with graphics processing units (GPUs), such as those from NVIDIA with its cuDNN library.

Analyzing Mental Health and Youth Sentiment Through NLP and Social Media

By Mateus Broilo and Andrea Posada Cardenas


We are living in an era where life passes so quickly that mental illness has become a pivotal issue, and perhaps a bridge to some other diseases.

As Ferris Bueller once said:

“Life moves pretty fast. If you don’t stop and look around once in awhile, you could miss it.”

This fear of missing out has caused people of all ages to suffer from mental health issues like anxiety, depression, and even suicidal ideation. Contemporary psychology tells us that this is expected, simply because we live on an emotional roller coaster every day.

The way our society functions in the modern day can present us with a range of external contributing factors that impact our mental health — often beyond our control. The message here is not that the odds are hopelessly stacked against us, but that our vulnerability to anxiety and depression is not our fault. — Students Against Depression

According to WHO, good mental health is “a state of well-being in which every individual realizes his or her own potential, can cope with the normal stresses of life, can work productively and fruitfully, and is able to make a contribution to her or his community”. WordNet Search, meanwhile, defines it as “the psychological state of someone who is functioning at a satisfactory level of emotional and behavioral adjustment”. Notice that this is far from a perfect definition, but it gives us a hint about which indicator to look for, e.g. “emotional and behavioral adjustment”.

It’s foreseen that this year (2020) around 1 in 4 people will experience mental health problems. Low-income countries in particular have an estimated treatment gap of 85%, compared with 35% to 50% in high-income countries.

Every single day, tons of information is thrown into the wormhole that is the internet. Millions of young people absorb this information and see the world through the lens of online events and others’ opinions. Social media is a playground for all this information and has a deep impact on the way our youth interact. Whether by contributing to a movement on Twitter or Facebook (#BlackLivesMatter), staying up to date with the latest news and discussions on Reddit (#COVID19), or engaging in campaigns simply for the greater good, the digital world is where the magic happens and makes worldwide interactions possible. This not-so-friendly digital ecosystem plays a crucial role and represents an excellent opportunity for analysts to understand what today’s youth think about their future.

Take a look at the article written by Fondation Botnar on young people’s aspirations.


The power of sentiment analysis

Sentiment analysis, a.k.a. opinion mining or emotional artificial intelligence (AI), uses text analysis and NLP to identify affective patterns present in data. Therefore, a wise question could be: How do the polarities change over time?


Top Mental Health keywords from Reddit and Twitter



Violin plots

Considering a data set scraped from Reddit and Twitter from 2016–2020, these “dynamic” polarity distributions could be expressed using violin plots.



Sentiment Violin-Plot hued by Year

Sentiment violin plots by year. Positive values refer to positive sentiments, whereas negative values indicate negative sentiments. A thicker part means the values in that section of the violin have a higher frequency, and a thinner part implies a lower frequency.




On one hand, we see that as the years go by, polarity tends to become more and more neutral. On the other hand, it’s difficult to understand which sentiment falls into which category, and what the model categorizes as positive vs. negative sentiment for each year. Also, text sentiment analysis is subjective and does not really capture complex expressions like sarcasm.


Violin plots according to label

So now, the next attempt was to see polarities according to labels — anxiety, depression, self-harm, suicide, exasperation, loneliness, and bullying.



Sentiment Violin Plots by label



Even if we view the polarities by label, we might end up with surface-level results instead of crisp insights. Look at Self-harm: what would positive self-harm even mean? Yet it’s still there in the green plot.

We see that most of the polarities are distributed close to the limits of the neutral region, which is ambiguous since it can be viewed as either a lack of positiveness or a lack of accurate sentiment categorization. The question is — how do we gain better insights?

Maybe we try plotting the mean (average) sentiment per year per label.
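
Computing these means is a simple group-by over (year, label). The sketch below uses a handful of invented posts with polarity scores in [-1, 1]. The real data set was scraped from Reddit and Twitter, and the function name `mean_sentiment` is hypothetical.

```python
from collections import defaultdict


def mean_sentiment(posts):
    """Average polarity for every (year, label) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # (year, label) -> [total, count]
    for year, label, polarity in posts:
        sums[(year, label)][0] += polarity
        sums[(year, label)][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Invented (year, label, polarity) records for illustration only.
posts = [
    (2019, "depression", 0.2),
    (2019, "depression", 0.4),
    (2020, "depression", -0.1),
    (2020, "depression", 0.1),
    (2020, "anxiety", -0.3),
]

means = mean_sentiment(posts)
for (year, label), value in sorted(means.items()):
    print(year, label, round(value, 3))
```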



Mean Sentiment per Year hued by label



Notice that Depression was the only label whose mean sentiment decreased two years in a row, passing from positive (2017–2019) to neutral in 2020. Moreover, the Loneliness and Bullying classes are depicted with only one mark each, because they appear only in the data scraped from January to June 2020.


Depression-label word cloud

Before pressing on, let’s just take a look at the Depression-label word cloud. Here we can detect a lot of “emotions” besides the huge “depression” in green, e.g. “low”, “hopeless”, “financial”, “relationship”.


Keywords relating to mental health

Source: Omdena



These are just the most frequent words associated with posts labeled as Depression and do not necessarily convey the feelings behind the scenes. However, there is a huge “feel” there… Why? It is one of the most common words, in fact the 6th most common word in the whole data set. In a more in-depth analysis aiming to find interconnections among topics, “feel” would certainly be used as one of the most prominent edges.



“Feel” Knowledge Graph



This Knowledge graph shows all the nodes where “feel” is used as the edge connector. Very insightful but not very visible.

In fact, there’s a much better approach that performs text analysis across lexical categories. So now the question is: “What sort of other feelings related to mental health issues should we be looking for?”.




Empath analysis

The main objective of Empath analysis is to connect the text with a wide range of sentiments beyond just negative, neutral, and positive polarities. Now we’re able to go far beyond trying to detect subjective feelings. For example, look at the second and third lexicons: “sadness” and “suffering”. Empath uses similarity comparisons to map the vocabulary of the text (our data set is composed of Reddit and Twitter posts) across Empath’s 200 categories.


AI Mental Health

Empathy Value VS Lexicon




The Empath value of a category is calculated by counting how many times the category appears in the text, normalized by the total number of text emotions spotted and the total number of words analyzed. Now we’re able to go much deeper and truly connect the sentiment present in the text to some real emotion, rather than just counting the most frequent words and assuming whether they relate to something good or bad.
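
The counting-and-normalizing idea can be sketched with a toy lexicon. This is a stand-in and not the Empath library itself: the two categories and their word lists are invented, and the sketch normalizes by the number of words only, which is a simplification of the calculation described above.

```python
import re
from collections import Counter

# Invented mini-lexicon; Empath ships its own, much larger category lists.
TOY_LEXICON = {
    "sadness": {"sad", "cry", "lonely"},
    "nervousness": {"nervous", "anxious", "worried"},
}


def category_scores(text, lexicon=TOY_LEXICON):
    """Count category hits and divide by the number of words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    return {category: (counts[category] / total if total else 0.0)
            for category in lexicon}


scores = category_scores("I feel sad and lonely, and a bit anxious today")
print(scores)
```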



Empathy value vs Year

Emotion trends hued by lexicon



We chose five lexicons that might be more deeply associated with mental health issues (“nervousness”, “suffering”, “shame”, “sadness”, and “hate”) and tracked these five emotions for each year analyzed, as shown in the left plot. And guess what? Sadness skyrocketed in 2020.



Sentiment analysis in the post-COVID world

The year 2020 turned our lives upside down. From now on we will most likely have to rethink the way we eat, travel, have fun, work, connect,… In short, we will have to rethink our entire lives.

There’s absolutely no question that the COVID-19 pandemic plays an essential role in mental health analysis. To take its impact into account, and since COVID-19 began to spread worldwide in January, we selected all the data from January to June 2020 for the analysis. Take a look at the word cloud related to the COVID-19 analysis from May and June.




COVID 19 Top keyword Analysis of Mental Health






We can see words like help, anxiety, loneliness, health, depression, and isolation. In this case, we can consider that they reflect the emotional state of people on social media. As noted earlier, sentiment analysis based on polarity tracking isn’t that insightful, but we display the violin plots below for comparison.



Sentiment Violin-plot for COVID 19 Analysis by Months



Sentiment Violin-plot for COVID 19 Analysis by Label



Now we see a very different pattern from the previous one, and why is that? Well, now we’re filtering by the COVID-19 keywords and indeed the sentiment distribution now seems to make sense. Looking more closely at the distribution of the data, the following is observed.



Graph of number of relatable words vs count of words



In the sample of texts from 2020, only 2.59% contain words related to COVID-19. The words we used are “corona”, “virus”, “COVID”, “covid19”, “COVID-19” and “coronavirus”. Furthermore, the frequency of occurrence decreases as the number of related words found increases, with most matching texts containing at most three such words.
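
A keyword count of this kind can be sketched as below. The six keywords are the ones listed above, while the sample texts, the case-insensitive whole-word matching, and the function name `keyword_hits` are our assumptions, so the toy share printed here has nothing to do with the 2.59% figure.

```python
import re
from collections import Counter

KEYWORDS = ["corona", "virus", "COVID", "covid19", "COVID-19", "coronavirus"]

# Longest keywords first so "coronavirus" is not split into "corona" + "virus".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def keyword_hits(text):
    """Number of COVID-19-related keywords found in one text."""
    return len(PATTERN.findall(text))


# Invented posts for illustration only.
texts = [
    "Feeling anxious about the coronavirus lockdown",
    "COVID-19 cancelled my exams and covid19 ruined summer",
    "Had a great day at the park",
    "So lonely lately",
]

hits = [keyword_hits(text) for text in texts]
share = sum(1 for h in hits if h > 0) / len(texts)
distribution = Counter(h for h in hits if h > 0)  # hits per text -> number of texts

print(hits, share, dict(distribution))
```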

Till now, we have presented the distribution of sentiments for specific words related to COVID-19. Nonetheless, questions about how these words relate to the general sentiment during the time period under analysis haven’t been answered yet.

The general sentiment has been deteriorating, i.e. becoming more negative, since the beginning of 2020. In particular, June is the month with the most negative sentiment, which coincides with the month with the most COVID-19 cases in the period considered, with a total number of 241 million cases. Comparing texts that contain COVID-19-related words with texts that contain none, the former are generally more negative.



Graph between sentiment vs months in 2020

Sentiment by label is examined again, this time from January 2020 to June 2020 only.



Violin Plot by label 2020



Exasperation remains stable, with February standing out for its negativity compared to the rest. Likewise, self-harm is quite stable; the months that stand out for their negativity in this category are March and June. Contrary to self-harm, for suicide March is not a negative month. However, the remaining months between February and June not only show a deterioration in sentiment, which worsens over time, but are also notably negative. June stands out for having both very positive and very negative sentiments (high polarities), which doesn’t happen in the other months. It has to be verified, but it could be that the number of suicides has been increasing in recent months. Regarding anxiety, a downward trend in sentiment is also observed between February and May. Finally, one should be careful with loneliness, given the highly negative sentiment in May and June. Since Bullying only has data for June 2020, this label isn’t analyzed.

The next figure presents the sentiment time series between 2019/05 and 2020/06. A slight downward trend can be observed, meaning that the general sentiment has become more negative. Additionally, some days show greater negativity, indicated by the troughs. Most of the troughs in the present year occur from April onward.
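
One simple way to quantify such a trend is the slope of a least-squares line through the daily sentiment values, with troughs taken as days that are lower than both neighbors. The sketch below uses invented values, and both helper names are hypothetical. It is not how the figure was produced.

```python
def linear_trend(values):
    """Least-squares slope of the values against their position in time."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def troughs(values):
    """Indices of days whose sentiment is lower than both neighboring days."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


# Invented daily mean sentiment values for illustration only.
daily = [0.10, 0.08, 0.09, 0.02, 0.06, 0.04, -0.03, 0.01]

slope = linear_trend(daily)
low_days = troughs(daily)
print("slope per day: %.4f, troughs at %s" % (slope, low_days))
```

A negative slope corresponds to the downward trend described above.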



Sentiment Analysis from 2019-05

Incidents that moved the youth




There are other major incidents, besides COVID-19, that have moved the youth to call for help and to speak up in 2020. The recent murder of George Floyd was the turning point that ignited the #BlackLivesMatter movement. Have a look at the word cloud on the left, with the most frequent and insightful words.

The youth gathered to protest against racism and call for equality and freedom worldwide. The Empath values related to Racism and Mental Health are displayed below.


AI Mental Health

Normalized empathy analysis



The COVID-19 pandemic has pushed the world towards a global economic crisis: massive unemployment, lack of food, lack of medicines. Perhaps the big question is: “How will the pandemic affect the younger generations and the generations to come?” Unfortunately, there’s no answer to this question yet, except that the economic crisis we’re presently living through is definitely going to affect the younger generation, because they’re the ones who will study, go to college, and look for a job in the near future. The big picture tells us that unemployment is increasing on a daily basis and there are not enough resources for all of us. The Word Cloud in the opening of the article reflects some of the most frequent words related to the current economic crisis.
