Exploring Scientific Literature on Online Violence Against Children Using Natural Language Processing

Exploring Scientific Literature on Online Violence Against Children Using Natural Language Processing

The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.

This article is written by Wen Qing LimMaria Guerra-AriasSijuade Oguntayo


Textual Data  –  A Trove of Information

The amount of information available in the world is increasing exponentially year by year and shows no signs of slowing. This rapid increase is driven by expansions in physical storage and the rise of cloud technologies, allowing more data to be exchanged and preserved than ever before. This boom, while great for scientific knowledge, also has possible downsides. As the volume of data grows, so also does the complexity in managing and extracting useful information from it.

More and more, organizations are turning to electronic storage to safeguard their data. Unstructured textual information like newspapers, scientific articles, and social media is now available in unprecedented volumes.

It is estimated that about 80% of enterprise data currently in existence is unstructured data, and this continues to increase at a rate of 55–65% per year.

Unstructured data, unlike structured data, does not have clearly defined types and isn’t easily searchable. This also makes it relatively more complex to perform analysis on.

Text mining processes utilize various analytics and AI technologies to analyze and generate meaningful insights from unstructured text data. Common text mining techniques include Text Analysis, Keyword Extraction, Entity Extraction/Recognition, Document Summarization, etc. A typical text mining pipeline includes data collection (from files, databases, APIs, etc.), data preprocessing (stemming, stopwords removal, etc.), and analytics to ascertain patterns and trends.

Just as data mining in the traditional sense has proven to be invaluable in extracting insights and making predictions from large amounts of data, so too can text mining help in understanding and deriving useful insights from the ever-increasing availability of text data.

Natural Language Processing (NLP) can be thought of as a way for computers to understand and generate human natural language. This is possible by simulating the human ability to comprehend natural language. NLP’s strength comes from the ability of computers to analyze large bodies of text without fatigue and in an unbiased manner (note: unbiased refers to the process, it is possible for the data to be biased).


Online Violence Against Children

As of July 2020, there are over 4.5 billion internet users globally, accounting for over half of the world’s population. About one-third of these are children under the age of 18 (one child in every three in the world). As these numbers rise, sadly, so too does the number of individuals looking to exploit children online. The FBI estimates that at any one time, there are about 750,000 predators going online with the intention of connecting with children.

For our project, we wanted to explore how text mining and NLP techniques could be applied to analyzing the scientific literature on online violence against children (OVAC). We picked scientific articles as our focus, as these can provide a wealth of information — from the different perspectives that have been used to study OVAC (i.e. criminology, psychology, medicine, sociology, law), to the topics that researchers have chosen to focus on, or the regions of the world where they have dedicated their efforts. Text mining allowed us to collect, process, and analyze a large amount of published scientific data on this topic — capturing a meaningful snapshot of the state of scientific knowledge on OVAC.


Data Collection and Preprocessing


Our overall process flow from data collection to analysis


Our first step was to collect datasets of articles that we could find online. The idea was to scrape a variety of repositories for scientific articles related to OVAC, using a set of keywords as search terms. We built scrapers for each repository, making use of the BeautifulSoup and Selenium libraries. Each scraper was unique to the repository and collected information such as the article metadata (i.e Title, Authors, Publisher, Date Published, etc.), the article Abstract, and the article full-text URL (where available). We also built a script to convert the full-text articles from PDF to Text, using Optical Character Recognition (OCR). Only one of the repositories, CORE, had an API that directly allowed us to scrape the full text of the articles.

Having collected over 27,000 articles across 7 repositories, we quickly realized that many articles were not relevant to OVAC. For example, there were many scientific articles about physical sexual violence against children, that also mentioned some sort of online survey. These articles fulfilled the “online AND sexual AND violence AND children” search term but were irrelevant to OVAC. Hence, we had to manually filter the scientific articles for relevance, sieving out 95% of articles that were not related to OVAC.

Faced with such a painfully manual task, some members of the team tried out semi-automated methods of filtering. One method used clustering to find groups of papers that were similar to each other. The idea was that relevant papers would show up in the same group, while irrelevant papers would show up in their own groups. We would then only need to sift through each cluster instead of going through each individual paper, saving almost 10–30 times the effort. However, this assumed perfect clusters, which was often not true. The clustering method was definitely faster and filtered out 41% of articles, but it also left more irrelevant articles undetected. An alternative to clustering would be to train classifiers to identify relevant articles based on a set of pre-labeled articles. This could potentially work better than clustering, but having undetected articles still remains a limitation.

One of the perks of working with scientific articles (read: texts that have been reviewed rigorously) is that minimal data cleaning is required. Steps that we would otherwise have to take when dealing with free texts (e.g. translating slang, abbreviations, and emojis, accounting for typos, etc.) are not needed here. Of course, text pre-processing steps like stemming, stop-word removal, punctuation removal, etc. are still required for some analysis, like clustering or keyword analysis.


Drawing insights from text analysis regarding online violence against children

Armed with a set of relevant articles, the team set off to discover the various types of methods to extract insights from the dataset. We attempted a variety of methods (i.e. TF-IDF, Bag of Words, Clustering, Market Basket Analysis, etc.) in search of answers to a set of questions that we aimed to explore with the dataset. Some analyses were limited by the nature of the datasets (e.g. in keywords analysis, there is a lot of noise and random words in the data. Some trends/patterns emerge but it is not very conclusive), while others showed great potential in picking out useful insights (e.g. clustering, market basket analysis as described below).



Keywords Analysis

Based on the title and abstract texts, we were able to generate a word cloud of the most frequent terms appearing in the OVAC scientific literature. We also used TF-IDF vector analysis to explore the most relevant words, bigrams, and trigrams appearing in the title and abstract texts in each publication year. This allowed us to chart the rise of certain research topics over time — for example around the years 2015 and 2016, terms related to “travel” and “tourism” began to appear more often in the OVAC literature, suggesting that this problem received greater research attention in this period


Word cloud of title and abstract texts from over 1300 scientific articles on online violence against children. Source. www.omdena.com



Geographical Market Basket Analysis


Heat map of the Lift between country pairs. A lift of more than 1 suggests that the presence of one country increases the probability that the other country will also appear in the article. The larger the lift, the more likely they would appear together.


We conducted a Market Basket analysis to find out which countries were likely to appear in the same article. This could potentially give insight to the networks of countries involved in OVAC. While we noticed that many countries appear together because they were geographically close, there were also exceptions.

From the heat map above, this includes country pairs like Malaysia-US, Australia-Canada, Australia-Philippines, and Thailand-Germany. Upon investigation, we found that:

  • Most articles contain these pairs because of exemplification.
  • Some are mentioned as a breakdown of countries where respondents of surveys and studies were conducted. (E.g. Thailand and Germany were mentioned as part of a 6-country survey of adolescents.)
  • More interestingly, there were also articles that mentioned pairs of countries due to offender-victim relationships. (E.g. an article studying offenders in Australia mentioned that they preyed on child victims in the Philippines.)


Topics Clustering Analysis

Another of our solutions used machine learning to separate the documents into different clusters defined by topics. A secondary motive was to explore the possibility that the different documents can form a network of communities not only based on their topics, but also on how the documents relate to each other.

The Louvain Method for community detection is a popular clustering algorithm used to understand the structure, as well as to detect communities of large networks. The TF-IDF representation of the words in the vocabulary was used to build a co-occurrence matrix containing the cosine similarities between each document. The clustering algorithm detected 5 distinct communities.

A manual inspection of the documents in each cluster suggested the following topics –

  • Institutional, Political (legislative) & Social Discourse
  • Online Child Protection — Vulnerabilities
  • Technology
  • Analysis of Offenders
  • Commercial Perspective & Trafficking


Bar chart of Frequency of Articles by Topic


The first two topics appear to be the most published while Commercial Perspective & Trafficking is the least. The cluster and structure detected by the clustering algorithm we noticed, could be visualized in the shape of a Graph Network. The articles were represented as nodes, and nodes of the same topic are grouped together and colored the same, the strength of the relationship between nodes as defined by the cosine similarity is represented by links/edges. Below is a visual representation of the structure of the Graph Network:


Structure of Graph Network — Articles were labeled according to community detection clustering and relationships defined by the cosine similarity between the documents. Other information like the text, date, published data, and URL of the papers were stored as properties of the nodes (vertices), and the links (edges) were defined as the cosine similarity value between documents.


One advantage of restructuring the data in this manner is that it allows the data to be stored in a Graph Database. Traditional relational databases work exceptionally well at capturing repetitive and tabular data, they don’t do quite as well at storing and expressing relationships between the entities within the data elements. A database that embraces this structure can more efficiently store, process, and query connections. Complex analysis can be done on the data by using a pattern and specifying starting points. Graph Databases can efficiently explore connecting data to those initial starting points, collecting and processing information from nodes and relationships while leaving out data outside the search pattern.


Challenges and Limitations

The major challenges we faced were related to compiling our dataset. Only one of the repositories we used, CORE, granted API access which greatly sped up the process of obtaining data. For the rest, the need to build custom scraping scripts meant that we could only cover a limited number of repositories. Other open repositories, such as Semantic Scholar, resisted our scraping efforts, while others such as Web of Science or ESBSCOhost, are completely walled-off to non-subscribers. The great white whale of scientific article repositories, Google Scholar, also eluded us. Here, search results are purposefully presented in such a way that it is not possible to extract the full abstract texts — although some other researchers with a lot of time and effort have had greater success with scraping it.

As shown, we were able to conduct a range of interesting and meaningful analyses using just the abstracts of the scientific articles, but to go further in our research would require overcoming the challenges related to obtaining the full text of the scientific articles. Even after developing a custom tool to extract text from PDFs, we still faced two challenges. Firstly, many articles were paywalled and could not be accessed, and secondly, the repositories we scraped did not systematically link to the PDF page of the article, so the tool could not be utilized across our whole dataset.

If a would-be data scientist is able to surpass all these hurdles, a final barrier to extracting information from scientific articles remains. The way in which scientific texts are structured, with sections such as “Introduction”, “Methodology”, “Findings” and “Discussion”, varies greatly from one article to the next. This makes it especially difficult to answer specific questions such as “What are the risk factors of being an offender of OVAC” that require searching for information in a specific section of the text, although it is less of an issue if you are seeking to answer more general questions, such as “How has the number of research papers changed over time?”. To overcome the difficulty of extracting specific information from unstructured text, we built a Neural Search Engine powered by haystack that uses a distilbert transformer to search for answers to specific questions in the dataset. However, it is currently a proof-of-concept and requires further refinement to reach its full potential of being able to answer the questions accurately.

These challenges create some limitations for the findings of our analysis. We cannot say that our dataset captures the entirety of scientific research into online violence against children, but rather just that which was contained in the repositories that we were able to access — we do not know if this could bias our results in some way (for example, if these repositories were more likely to contain papers from certain fields, or from certain parts of the world). It is worth noting that we conducted all of our searches in English. Fortunately, scientific article abstracts are often translated, so we were able to analyze the text of these, even when the original language of the article was different.

Another limitation is that as scientific knowledge increases over time, more recent articles could be more relevant than historical ones if certain theories or assumptions are later found to be incorrect with further research. However, all articles were given equal weight in our analysis.

Given the challenges that we faced with accessing repositories and articles and the incalculable benefits of greater data openness in scientific research, it is interesting to discuss an initiative that is working toward that goal. A team at Jawaharlal Nehru University (JNU) in India is building a gigantic database of text and images extracted from over 70 million scientific articles. To overcome copyright restrictions, this text will not be able to be downloaded or read, but rather only queried through data mining techniques. This initiative has the potential to radically transform the way that scientific articles are used by researchers, opening them up for exploration using the entirety of the data science toolkit.



We have demonstrated in this case study how text mining and NLP techniques can aid the analysis of scientific literature at every step of the way — from data collection to cleaning and to gain meaningful insights from text. While full texts helped us to answer more specific questions, we found that using just abstracts was often sufficient to gain useful insights. This shows great potential for future abstract-only analysis in cases where access to full-text articles is limited.

Our work has helped Save the Children to understand OVAC and its research space better, and similar types of analysis can benefit other NGOs in many ways. These include:

  • understanding a topic
  • having an overview of the types of research efforts
  • understanding research gaps
  • identifying key resources (e.g. datasets often quoted in papers, most common citations, most active researchers/publishers, etc.)


There are also many other possibilities of NLP methods to extract insights from scientific papers that we have not tried. Here are some ideas for future exploration:

  • Extracting sections from scientific articles — articles are organized in sections, and if we can figure out a way to split articles up into sections, it would be a great first step towards a more structured dataset!
  • Named Entity Recognition — From figuring out which entities are being discussed to using these to answer specific questions, NER unlocks a ton of possible applications.
  • Network Analysis using Citations — This could be an alternative method to cluster the articles, or it could also help to identify the ‘influential’ articles or map out the progress of research.
Internet Safety for Children: Using NLP to Predict the Risk Level of Online Games, Websites, and Applications

Internet Safety for Children: Using NLP to Predict the Risk Level of Online Games, Websites, and Applications

The following work is part of the Omdena AI Challenge on improving internet safety for children, implemented in collaboration with John Zoltner at Save the Children US.

This blog was written by Sabrina Carlson and co-authored by Erum Afzal. Contributors include Anna Kolbasko, Juber Rahman, Erum Afzal, Mateus Broilo, Rahul Gopan, Rubens Carvalho, Vinod Rangayyan, Adele C, and Rosana de Oliveira Gomes.


The Problem

Save the Children is a humanitarian organization that aims to improve the lives of children across the globe. In line with the United Nations’ Sustainable Goal 16.2 to “end abuse, exploitation, trafficking, and all forms of violence and torture against children,” Save the Children and Omdena collaborated to use artificial intelligence to identify and prevent online internet violence against children for their safety. Utilizing numerous data sources and a combination of various artificial intelligence techniques, such as natural language processing (NLP), this project’s collaborators aimed to produce meaningful insights into and prevent online internet violence against children for their safety. One area of concern is online games, websites, and applications that are popular with children, and a number of collaborators targeted this space in hopes of guarding children against online predators in the future.


What We Did

The Common Sense Media website provides expert advice, useful tools, and objective ratings for countless movies, television shows, games, websites, and applications to help parents make informed decisions about which content they want their children to consume. Particularly useful for this project, parents, and children can review games, applications, and websites on the Common Sense Media site. A number of Omdena collaborators had the idea to build web scrapers to collect parent and child reviews of the games, applications, and websites that are popular with children and use natural language processing to identify which platforms are high risk for online internet violence against children for their safety.

The first step was to scrape Common Sense Media to collect game, application, and website reviews from both parents and children. To do so, we used ParseHub software to build web scrapers to collect reviews from this website. ParseHub is a powerful, user-friendly tool that allows one to easily extract data from websites. Using ParseHub, we set three different configurations to scrap parent and child reviews of all games, applications, and websites from the internet that Common Sense Media has determined to be popular among children for their safety.

The resulting dataset includes the following features:

  • 40,433 observations (reviews) from 995 different games/apps/websites
  • Platform type (game, application, website)
  • The risk level for online (sexual) violence against children
  • Indicators for each platform’s content related to positive messages, positive role models/representations, ease of play, violence, sex, language, and consumerism. Common Sense Media provides objective ratings (from a scale of 0–5) for these indicators for the digital content included on the site. We focused on the sex indicator and re-labeled it as CSAM (child sexual abuse material). We determined a platform to be high risk for CSAM if its sex rating was greater than 2 and assigned a platform a low-risk CSAM label if its sex rating was lower than 2.

Figure 1 plots the top 20 platforms in terms of the number of reviews.


Internet Safety Children

Figure 1. 20 Most Popular Platforms by Number of Reviews / Source: Omdena


Figure 2 displays the number of reviews for high and low-risk games, applications, and websites. As illustrated in the figure, there are nearly 25,000 reviews for low-risk platforms, whereas there are close to 16,000 reviews for high-risk platforms.


Internet Safety Children

Figure 2. Number of Reviews by CSAM Risk Level / Source: Omdena


Data Sampling

We randomly sampled 50% of the data in order to process the data in a more efficient way. The following graphic illustrates the code used to sample 50% of the data.


Internet Safety Children


Data Cleaning

It is necessary to clean the data in order to build a successful NLP model. To clean the review messages, we created a function called “clean_text” and used it to perform several transformations, including the following:

  • Converted the review text into all lower-case letters
  • Tokenizing the review text (i.e., splitting the text into words) and removing punctuation marks
  • Removing numbers and stopwords (e.g., a, an, the, this).
  • Using the WordNet lexical database to assign Part-Of-Speech (POS) tags. The POS tags are used to attach labels to words that correspond to a noun, verb, etc.
  • Lemmatizing and transforming the words to their roots (i.e., games→ game, Played→ play)

Figure 3 provides an example of the reviews pre-and post-cleaning. In the “review” column, the text has not been cleaned, while the “review_clean” column includes text that has been lemmatized, tagged for POS, tokenized, etc.


Internet Safety Children

Figure 3. Sample of Cleaned Text / Source: Omdena


Feature Engineering

Before applying the models, we performed some feature engineering, including sentiment analysis, vector extraction, and TF-IDF.


Sentiment Analysis

The first feature engineering step was conducting sentiment analysis. The sentiment analysis was performed on the features to gain insight into how parents and children feel about hundreds of games, applications, and internet websites that are popular with children for their safety. We used Vader, which is part of the NLTK module, for the sentiment analysis. Vader uses a lexicon of words to identify positive or negative sentiments in long sentences. It also takes into account the context of the sentences to determine the sentiment scores. For each text, Vader returns the following four values:

  • Negative count score
  • Positive count score
  • Neutral count score
  • The compound — an overall score that summarizes the previous scores

Figure 4 displays a sample of cleaned reviews containing negative, neutral, positive, and compound scores.


Internet Safety Children

Figure 4. Sample of Sentiment Analysis Scores / Source: Omdena


Extracting Vectors

In the next step, we extracted vector representations for every review. Using the module Gensim, we were able to create a numerical vector representation for every word in the corpus using the contexts in which they appear (Word2Vec). This is performed using shallow neural networks. Extracting vectors in this way is interesting and informative because similar words will have similar representation vectors.

All text can also be transformed into numerical vectors using word vectors (Doc2Vec). We can use these vectors as training features because the same texts will also have similar representations.

It was first necessary to train a Doc2Vec model by feeding in our text data. By applying this model to the review text, we are able to obtain the representation vectors. Finally, we added the TF-IDF (Term Frequency — Inverse Document Frequency) values for every word and every document.

But why not simply count the number of times each word appears in every document? The problem with this approach is that it does not take into account the relative importance of the words in the text. For instance, a word that appears in nearly every review would not likely bring useful information for analysis. In contrast, rare words may be much more meaningful. The TF-IDF metric solves this problem:



The Term Frequency (TF) computes the classic number of times the word appears in the text, while the Inverse Document Frequency (IDF) computes the relative importance of the word depending on the number of texts (reviews) in which the specific word is found. We added TF-IDF columns for every word that appeared in at least 10 different texts. This step allowed us to filter a number of words and, subsequently, reduce the size of the final output. Figure 5 provides the code used to apply TF-IDF and assign the resulting columns to the data frame, and Figure 6 displays the output of the sample code.


Internet Safety Children

Figure 5. TF-IDF Code Sample / Source: Omdena

Internet Safety Children

Figure 6. TF-IDF Sample Code Output / Source: Omdena


Exploratory Data Analysis

The EDA produced a number of interesting insights. Figure 7 provides a sample of reviews that received high negative sentiment scores, and Figure 8 displays a sample of reviews with high positive sentiment scores. The sentiment analysis successfully assigned negative sentiments to reviews with text such as “violence, horror, dead.” The analysis also effectively assigned positive sentiments to reviews containing words such as “fun, cute, exciting.”


Internet Safety Children

Figure 7. Sample of Reviews with High Negative Scores / Source: Omdena


Internet Safety Children

Figure 8. Sample of Reviews with High Positive Scores / Source: Omdena


Figure 9 shows the distribution of the trend of messages among high and low-risk games. Varder categorizes low-risk reviews as positive messages, whereas high-risk reviews should have lower compound sentiments. This shows that the sentiment feature extractions proved helpful in modeling the risk analysis.


Internet Safety Children

Figure 9. High_Low Risk Distribution over Compound Sentiments / Source: Omdena


Modeling High-Risk Games/Applications/Websites

After we successfully scraped the reviews, built the dataset, cleaned the data, and performed feature engineering, we were able to build an NLP model. We choose which features (reviews and clean reviews) to use to train our model.

Then, we split our data into two parts:

  • Training set for training purposes
  • The test set to assess the model performance

After selecting the features and splitting the data into test/training sets, we fit a Random Forest classification model and used the reviews to predict whether a platform is a high risk for CSAM. Figure 10 displays the code used to fit the Random Forest classifier and obtain the metrics.


Internet Safety Children

Figure 10. Random Forest Classifier Code Sample / Source: Omdena


Figure 11 displays a sample of features and their respective importance. The most important features are indeed the ones that were obtained in the sentiment analysis. In addition, the vector representations of the texts were also important in our training. A number of words appear to be fairly important as well.


Figure 11. Feature Importance / Source: Omdena


The Receiver Operating Characteristic Example (ROC) curve and Area Under the Curve (AUC) allow one to evaluate how well a model performs in terms of its ability to distinguish between classes (high/low risk for CSAM in this context). The ROC curve, which plots the true positive rate against the false-positive rate, is displayed in Figure 12. The AUC is 0.77, which indicates that the classifier performed at an acceptable level.


Internet Safety Children

Figure 12. ROC Curve / Source: Omdena


The Precision-Recall (PR) Curve is illustrated in Figure 13. The PR curve is graphed by simply plotting the recall score (x-axis) against the precision score (y-axis). Ideally, we would achieve both a high recall score and a high precision score; however, there is often a trade-off between the two in machine learning. The sci-kit learn documentation states that the Average Precision (AP) “summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.” The AP here is 0.72, which is an acceptable score.


Internet Safety Children

Figure 13. Precision-Recall Curve / Source: Omdena


It is evident in Figure 13 that the precision decreases as we increase the recall. This indicates that we have to choose a prediction threshold based on our specific needs. For instance, if the end goal is to have a high recall, we should set a low prediction threshold that will allow us to detect most of the observations of the positive class, though the precision will be low. On the contrary, if we want to be really confident about our predictions and are not set on identifying all the positive observations, we should set a high threshold that will allow us to obtain high precision and a low recall.

In order to determine whether or not the model we built performs better than another classifier, we can simply use the AP metric. To assess the quality of our model, we can compare it to a simple decision baseline. With a random classifier for the baseline, the model would simply assign 0 half the time and 1 the other half of the time. Our AP metric is 0.77, which is better than a random classifier.


Conclusion and Observations

It is nearly possible to use just raw text as input to make predictions. The most important aspect is to be able to extract the relevant features from a raw data source. Such data can often complement data science projects, allowing one to extract more meaningful/useful features and increase the model’s predictive power.

We were only able to predict the platform’s risk through user reviews, and it is possible that the reviews are biased. To improve the precision of our predictive model, we can triangulate other features such as player sentiments, game titles, UX/UI features, and in-game chats. Used in combination, these features can provide a number of insightful recommendations. Our predictive model will shed light on CSAM risk in online games, applications, and websites that are popular with children by automatically detecting each platform’s risk level. In the future, we hope that parents will be able to better select platforms for their child’s use based on our use of AI.

A Chatbot Warning System Against Online Predators

A Chatbot Warning System Against Online Predators

Using Natural Language Processing to warn children against online predators.

The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.


Protecting Children

Today, children face an evolving threat — online violence. Violence and harassment of children have been growing exponentially for more than 20 years but due to the recent events leading to the closing of schools for over 1 billion children around the world, children are more vulnerable than ever. Online predators use Internet avenues popular with children and adolescents such as game chat rooms to lure them into sexual exploitation and even in-person assaults.

Protection against online sexual violence greatly varies from platform to platform. Some gaming platforms include a profanity filter that looks for problematic words and replaces them with a string of asterisks. Outside the gaming platforms, many chat platforms still do not have any safeguards in place to protect children from predatory adult conversations. However, chat logs can provide information on how a predator might attempt to exploit children and young adults into risky situations, such as sharing photos, web cameras, and sexting (sexual texting).

Often, pattern recognition techniques provide an automated identification of these conversations for potential law enforcement intervention. 

Unfortunately, this strategy uses many man-hours and spans many message categories, which makes it all the more difficult to identify these patterns. It is a challenging task, but one that is worth tackling, and we elaborate on our approach in the rest of the article.


Online Predators



First, let us establish our working definitions.

  • The New Oxford American Dictionary defines a predator as a “person or group that ruthlessly exploits others.”
  • Expanding the term to a sexual predator as “a person seen as obtaining or trying to obtain sexual contact or favor with another person in a metaphorically ’predatory’ manner”, Daniel M. Filler, Virginia Journal of Social Policy & the Law (2003).


The Solution — Data Engineers Unite!


Online Predators

The team focused on the solution’s Predator Analysis portion.


Our solution looked to reduce man-hours and to develop a near real-time warning system to the chat that alerted the child when the conversation changes sentiment. The team used a semi-supervised approach to evaluate if the conversations provide a low, medium, or high risk to the child. The system would evaluate the phrase or sentence and return an effective sentiment warning if warranted. The data for our chatbot (Predator-Pseudo-Victim conversations) was collected from interactions between a predator and a law enforcement officer or volunteer posing as a child.

The chatbot was designed to learn from non-predatory and predatory conversations and distinguish between them. Additionally, it would have the ability to recognize inappropriate messages no matter whether they came from the predator or the child’s side. The corpus also had adult-like conversations initiated from the child’s side.


The Dataset

The team consolidated and cleaned nearly 500 chat log files that contained exchanges between a predator and a pseudo-victim. The collection grew into a corpus containing 807,000-plus messages ranging from “hello” to explicit remarks. The dataset creation proved laborious, where I voluntarily provided more than 630 hours in just labeling data. The dataset received labels, such as male or female as they identified themselves in the chats, predator or victim, and level of risk of the conversation. Nearly half of the project time was dedicated to a properly built and parsed dataset.

This dataset was split into a training, development, and test set. The training set held 75 percent of all messages for the chatbot to learn the contextual format and nuances of conversation. The development set, which was 10 percent of the data, was held away from the chatbot until after model selection, to prove the validity of the model.

The 2 mins video quickly discusses how the team assembled the chatbot’s dataset.



Data Format and Storage

The data was housed in a relational database. It became large enough to serve as a nexus to provide uniquely formatted datasets for the machine learning pipeline.

During the labeling process, few issues arose on how to semantically define a conversation. With many different log formats, ranging from AOL Messenger to SMS and other online platforms, the sentences would start and stop at different points. In conversations, I implemented a similar format as used in the competitively-used Cornell University’s movie corpus that provided a standard structure making it easy to parse the data. Additionally, the corpus contained chat slang, abbreviations, and number-for-words, like “l8r” for “later” and “b4” for “before”, which required a team consensus on how to handle these stopwords. The team did not focus on timestamps due to extremely varied formatting, missing values, and lack of importance to the overall project.


The Model

Many models presented as candidates to the chatbot’s internal workings. The main goal for the team was to have a local and offline solution for now. This was done to reduce privacy concerns and legal issues. Future considerations of this project would evaluate these features with appropriate development operations.


Online Predators

Basic sequence-to-sequence model diagram.


The selected model focused around a Long Short Term Memory (LSTM) network cell, arranged as a sequence-to-sequence configuration. LSTMs have long been proved well-suited to work with sequential data. Our application would use this ability to help the chatbot predict the next plausible word to use for response.

For the sentiment analysis portion, we focused our efforts on an ensemble learning model as well as a support vector machine to help predict when the conversation changed from benign to risky.



Our team successfully built a chatbot and a sentiment analysis model independently. The chatbot learned from its more than 807,000 messages to understand how to parse sentences and structure a proper response. The limited vernacular stemmed from the chatbot’s time to learn and framework limitations.

The greatest challenge to code performance-centered inside the platform chosen, TensorFlow 1.0.0, provided limitations. The code did provide a conversation-capable entity, but the model needs more training data if we want to go beyond proof-of-concept to deploy it in an application.

The project successfully employed message sentiment analysis and was able to warn the user of potentially risky conversations initiated by online predators. The sentiment analysis ranged from low, medium, or high levels of risk.

Future considerations will take this project into a full-functioning environment of TensorFlow 2.1.0, eliminating other frameworks, including PyTorch. The internal model will receive an update to the LSTM structure and performance will be improved with the use of graphics computing processors, such as NVIDIA and its cuDNN framework.


Stay in touch via our newsletter.

Be notified (a few times a month) about top-notch articles, new real-world projects, and events with our community of changemakers.

Sign up here