Exploring Scientific Literature on Online Violence Against Children Using Natural Language Processing

Exploring Scientific Literature on Online Violence Against Children Using Natural Language Processing

The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.

This article is written by Wen Qing LimMaria Guerra-AriasSijuade Oguntayo


Textual Data  –  A Trove of Information

The amount of information available in the world is increasing exponentially year by year and shows no signs of slowing. This rapid increase is driven by expansions in physical storage and the rise of cloud technologies, allowing more data to be exchanged and preserved than ever before. This boom, while great for scientific knowledge, also has possible downsides. As the volume of data grows, so also does the complexity in managing and extracting useful information from it.

More and more, organizations are turning to electronic storage to safeguard their data. Unstructured textual information like newspapers, scientific articles, and social media is now available in unprecedented volumes.

It is estimated that about 80% of enterprise data currently in existence is unstructured data, and this continues to increase at a rate of 55–65% per year.

Unstructured data, unlike structured data, does not have clearly defined types and isn’t easily searchable. This also makes it relatively more complex to perform analysis on.

Text mining processes utilize various analytics and AI technologies to analyze and generate meaningful insights from unstructured text data. Common text mining techniques include Text Analysis, Keyword Extraction, Entity Extraction/Recognition, Document Summarization, etc. A typical text mining pipeline includes data collection (from files, databases, APIs, etc.), data preprocessing (stemming, stopwords removal, etc.), and analytics to ascertain patterns and trends.

Just as data mining in the traditional sense has proven to be invaluable in extracting insights and making predictions from large amounts of data, so too can text mining help in understanding and deriving useful insights from the ever-increasing availability of text data.

Natural Language Processing (NLP) can be thought of as a way for computers to understand and generate human natural language. This is possible by simulating the human ability to comprehend natural language. NLP’s strength comes from the ability of computers to analyze large bodies of text without fatigue and in an unbiased manner (note: unbiased refers to the process, it is possible for the data to be biased).


Online Violence Against Children

As of July 2020, there are over 4.5 billion internet users globally, accounting for over half of the world’s population. About one-third of these are children under the age of 18 (one child in every three in the world). As these numbers rise, sadly, so too does the number of individuals looking to exploit children online. The FBI estimates that at any one time, there are about 750,000 predators going online with the intention of connecting with children.

For our project, we wanted to explore how text mining and NLP techniques could be applied to analyzing the scientific literature on online violence against children (OVAC). We picked scientific articles as our focus, as these can provide a wealth of information — from the different perspectives that have been used to study OVAC (i.e. criminology, psychology, medicine, sociology, law), to the topics that researchers have chosen to focus on, or the regions of the world where they have dedicated their efforts. Text mining allowed us to collect, process, and analyze a large amount of published scientific data on this topic — capturing a meaningful snapshot of the state of scientific knowledge on OVAC.


Data Collection and Preprocessing


Our overall process flow from data collection to analysis


Our first step was to collect datasets of articles that we could find online. The idea was to scrape a variety of repositories for scientific articles related to OVAC, using a set of keywords as search terms. We built scrapers for each repository, making use of the BeautifulSoup and Selenium libraries. Each scraper was unique to the repository and collected information such as the article metadata (i.e Title, Authors, Publisher, Date Published, etc.), the article Abstract, and the article full-text URL (where available). We also built a script to convert the full-text articles from PDF to Text, using Optical Character Recognition (OCR). Only one of the repositories, CORE, had an API that directly allowed us to scrape the full text of the articles.

Having collected over 27,000 articles across 7 repositories, we quickly realized that many articles were not relevant to OVAC. For example, there were many scientific articles about physical sexual violence against children, that also mentioned some sort of online survey. These articles fulfilled the “online AND sexual AND violence AND children” search term but were irrelevant to OVAC. Hence, we had to manually filter the scientific articles for relevance, sieving out 95% of articles that were not related to OVAC.

Faced with such a painfully manual task, some members of the team tried out semi-automated methods of filtering. One method used clustering to find groups of papers that were similar to each other. The idea was that relevant papers would show up in the same group, while irrelevant papers would show up in their own groups. We would then only need to sift through each cluster instead of going through each individual paper, saving almost 10–30 times the effort. However, this assumed perfect clusters, which was often not true. The clustering method was definitely faster and filtered out 41% of articles, but it also left more irrelevant articles undetected. An alternative to clustering would be to train classifiers to identify relevant articles based on a set of pre-labeled articles. This could potentially work better than clustering, but having undetected articles still remains a limitation.

One of the perks of working with scientific articles (read: texts that have been reviewed rigorously) is that minimal data cleaning is required. Steps that we would otherwise have to take when dealing with free texts (e.g. translating slang, abbreviations, and emojis, accounting for typos, etc.) are not needed here. Of course, text pre-processing steps like stemming, stop-word removal, punctuation removal, etc. are still required for some analysis, like clustering or keyword analysis.


Drawing insights from text analysis regarding online violence against children

Armed with a set of relevant articles, the team set off to discover the various types of methods to extract insights from the dataset. We attempted a variety of methods (i.e. TF-IDF, Bag of Words, Clustering, Market Basket Analysis, etc.) in search of answers to a set of questions that we aimed to explore with the dataset. Some analyses were limited by the nature of the datasets (e.g. in keywords analysis, there is a lot of noise and random words in the data. Some trends/patterns emerge but it is not very conclusive), while others showed great potential in picking out useful insights (e.g. clustering, market basket analysis as described below).



Keywords Analysis

Based on the title and abstract texts, we were able to generate a word cloud of the most frequent terms appearing in the OVAC scientific literature. We also used TF-IDF vector analysis to explore the most relevant words, bigrams, and trigrams appearing in the title and abstract texts in each publication year. This allowed us to chart the rise of certain research topics over time — for example around the years 2015 and 2016, terms related to “travel” and “tourism” began to appear more often in the OVAC literature, suggesting that this problem received greater research attention in this period


Word cloud of title and abstract texts from over 1300 scientific articles on online violence against children. Source. www.omdena.com



Geographical Market Basket Analysis


Heat map of the Lift between country pairs. A lift of more than 1 suggests that the presence of one country increases the probability that the other country will also appear in the article. The larger the lift, the more likely they would appear together.


We conducted a Market Basket analysis to find out which countries were likely to appear in the same article. This could potentially give insight to the networks of countries involved in OVAC. While we noticed that many countries appear together because they were geographically close, there were also exceptions.

From the heat map above, this includes country pairs like Malaysia-US, Australia-Canada, Australia-Philippines, and Thailand-Germany. Upon investigation, we found that:

  • Most articles contain these pairs because of exemplification.
  • Some are mentioned as a breakdown of countries where respondents of surveys and studies were conducted. (E.g. Thailand and Germany were mentioned as part of a 6-country survey of adolescents.)
  • More interestingly, there were also articles that mentioned pairs of countries due to offender-victim relationships. (E.g. an article studying offenders in Australia mentioned that they preyed on child victims in the Philippines.)


Topics Clustering Analysis

Another of our solutions used machine learning to separate the documents into different clusters defined by topics. A secondary motive was to explore the possibility that the different documents can form a network of communities not only based on their topics, but also on how the documents relate to each other.

The Louvain Method for community detection is a popular clustering algorithm used to understand the structure, as well as to detect communities of large networks. The TF-IDF representation of the words in the vocabulary was used to build a co-occurrence matrix containing the cosine similarities between each document. The clustering algorithm detected 5 distinct communities.

A manual inspection of the documents in each cluster suggested the following topics –

  • Institutional, Political (legislative) & Social Discourse
  • Online Child Protection — Vulnerabilities
  • Technology
  • Analysis of Offenders
  • Commercial Perspective & Trafficking


Bar chart of Frequency of Articles by Topic


The first two topics appear to be the most published while Commercial Perspective & Trafficking is the least. The cluster and structure detected by the clustering algorithm we noticed, could be visualized in the shape of a Graph Network. The articles were represented as nodes, and nodes of the same topic are grouped together and colored the same, the strength of the relationship between nodes as defined by the cosine similarity is represented by links/edges. Below is a visual representation of the structure of the Graph Network:


Structure of Graph Network — Articles were labeled according to community detection clustering and relationships defined by the cosine similarity between the documents. Other information like the text, date, published data, and URL of the papers were stored as properties of the nodes (vertices), and the links (edges) were defined as the cosine similarity value between documents.


One advantage of restructuring the data in this manner is that it allows the data to be stored in a Graph Database. Traditional relational databases work exceptionally well at capturing repetitive and tabular data, they don’t do quite as well at storing and expressing relationships between the entities within the data elements. A database that embraces this structure can more efficiently store, process, and query connections. Complex analysis can be done on the data by using a pattern and specifying starting points. Graph Databases can efficiently explore connecting data to those initial starting points, collecting and processing information from nodes and relationships while leaving out data outside the search pattern.


Challenges and Limitations

The major challenges we faced were related to compiling our dataset. Only one of the repositories we used, CORE, granted API access which greatly sped up the process of obtaining data. For the rest, the need to build custom scraping scripts meant that we could only cover a limited number of repositories. Other open repositories, such as Semantic Scholar, resisted our scraping efforts, while others such as Web of Science or ESBSCOhost, are completely walled-off to non-subscribers. The great white whale of scientific article repositories, Google Scholar, also eluded us. Here, search results are purposefully presented in such a way that it is not possible to extract the full abstract texts — although some other researchers with a lot of time and effort have had greater success with scraping it.

As shown, we were able to conduct a range of interesting and meaningful analyses using just the abstracts of the scientific articles, but to go further in our research would require overcoming the challenges related to obtaining the full text of the scientific articles. Even after developing a custom tool to extract text from PDFs, we still faced two challenges. Firstly, many articles were paywalled and could not be accessed, and secondly, the repositories we scraped did not systematically link to the PDF page of the article, so the tool could not be utilized across our whole dataset.

If a would-be data scientist is able to surpass all these hurdles, a final barrier to extracting information from scientific articles remains. The way in which scientific texts are structured, with sections such as “Introduction”, “Methodology”, “Findings” and “Discussion”, varies greatly from one article to the next. This makes it especially difficult to answer specific questions such as “What are the risk factors of being an offender of OVAC” that require searching for information in a specific section of the text, although it is less of an issue if you are seeking to answer more general questions, such as “How has the number of research papers changed over time?”. To overcome the difficulty of extracting specific information from unstructured text, we built a Neural Search Engine powered by haystack that uses a distilbert transformer to search for answers to specific questions in the dataset. However, it is currently a proof-of-concept and requires further refinement to reach its full potential of being able to answer the questions accurately.

These challenges create some limitations for the findings of our analysis. We cannot say that our dataset captures the entirety of scientific research into online violence against children, but rather just that which was contained in the repositories that we were able to access — we do not know if this could bias our results in some way (for example, if these repositories were more likely to contain papers from certain fields, or from certain parts of the world). It is worth noting that we conducted all of our searches in English. Fortunately, scientific article abstracts are often translated, so we were able to analyze the text of these, even when the original language of the article was different.

Another limitation is that as scientific knowledge increases over time, more recent articles could be more relevant than historical ones if certain theories or assumptions are later found to be incorrect with further research. However, all articles were given equal weight in our analysis.

Given the challenges that we faced with accessing repositories and articles and the incalculable benefits of greater data openness in scientific research, it is interesting to discuss an initiative that is working toward that goal. A team at Jawaharlal Nehru University (JNU) in India is building a gigantic database of text and images extracted from over 70 million scientific articles. To overcome copyright restrictions, this text will not be able to be downloaded or read, but rather only queried through data mining techniques. This initiative has the potential to radically transform the way that scientific articles are used by researchers, opening them up for exploration using the entirety of the data science toolkit.



We have demonstrated in this case study how text mining and NLP techniques can aid the analysis of scientific literature at every step of the way — from data collection to cleaning and to gain meaningful insights from text. While full texts helped us to answer more specific questions, we found that using just abstracts was often sufficient to gain useful insights. This shows great potential for future abstract-only analysis in cases where access to full-text articles is limited.

Our work has helped Save the Children to understand OVAC and its research space better, and similar types of analysis can benefit other NGOs in many ways. These include:

  • understanding a topic
  • having an overview of the types of research efforts
  • understanding research gaps
  • identifying key resources (e.g. datasets often quoted in papers, most common citations, most active researchers/publishers, etc.)


There are also many other possibilities of NLP methods to extract insights from scientific papers that we have not tried. Here are some ideas for future exploration:

  • Extracting sections from scientific articles — articles are organized in sections, and if we can figure out a way to split articles up into sections, it would be a great first step towards a more structured dataset!
  • Named Entity Recognition — From figuring out which entities are being discussed to using these to answer specific questions, NER unlocks a ton of possible applications.
  • Network Analysis using Citations — This could be an alternative method to cluster the articles, or it could also help to identify the ‘influential’ articles or map out the progress of research.
Internet Safety for Children: Using NLP to Predict the Risk Level of Online Games, Websites, and Applications

Internet Safety for Children: Using NLP to Predict the Risk Level of Online Games, Websites, and Applications

The following work is part of the Omdena AI Challenge on improving internet safety for children, implemented in collaboration with John Zoltner at Save the Children US.

This blog was written by Sabrina Carlson and co-authored by Erum Afzal. Contributors include Anna Kolbasko, Juber Rahman, Erum Afzal, Mateus Broilo, Rahul Gopan, Rubens Carvalho, Vinod Rangayyan, Adele C, and Rosana de Oliveira Gomes.


The Problem

Save the Children is a humanitarian organization that aims to improve the lives of children across the globe. In line with the United Nations’ Sustainable Goal 16.2 to “end abuse, exploitation, trafficking, and all forms of violence and torture against children,” Save the Children and Omdena collaborated to use artificial intelligence to identify and prevent online internet violence against children for their safety. Utilizing numerous data sources and a combination of various artificial intelligence techniques, such as natural language processing (NLP), this project’s collaborators aimed to produce meaningful insights into and prevent online internet violence against children for their safety. One area of concern is online games, websites, and applications that are popular with children, and a number of collaborators targeted this space in hopes of guarding children against online predators in the future.


What We Did

The Common Sense Media website provides expert advice, useful tools, and objective ratings for countless movies, television shows, games, websites, and applications to help parents make informed decisions about which content they want their children to consume. Particularly useful for this project, parents, and children can review games, applications, and websites on the Common Sense Media site. A number of Omdena collaborators had the idea to build web scrapers to collect parent and child reviews of the games, applications, and websites that are popular with children and use natural language processing to identify which platforms are high risk for online internet violence against children for their safety.

The first step was to scrape Common Sense Media to collect game, application, and website reviews from both parents and children. To do so, we used ParseHub software to build web scrapers to collect reviews from this website. ParseHub is a powerful, user-friendly tool that allows one to easily extract data from websites. Using ParseHub, we set three different configurations to scrap parent and child reviews of all games, applications, and websites from the internet that Common Sense Media has determined to be popular among children for their safety.

The resulting dataset includes the following features:

  • 40,433 observations (reviews) from 995 different games/apps/websites
  • Platform type (game, application, website)
  • The risk level for online (sexual) violence against children
  • Indicators for each platform’s content related to positive messages, positive role models/representations, ease of play, violence, sex, language, and consumerism. Common Sense Media provides objective ratings (from a scale of 0–5) for these indicators for the digital content included on the site. We focused on the sex indicator and re-labeled it as CSAM (child sexual abuse material). We determined a platform to be high risk for CSAM if its sex rating was greater than 2 and assigned a platform a low-risk CSAM label if its sex rating was lower than 2.

Figure 1 plots the top 20 platforms in terms of the number of reviews.


Internet Safety Children

Figure 1. 20 Most Popular Platforms by Number of Reviews / Source: Omdena


Figure 2 displays the number of reviews for high and low-risk games, applications, and websites. As illustrated in the figure, there are nearly 25,000 reviews for low-risk platforms, whereas there are close to 16,000 reviews for high-risk platforms.


Internet Safety Children

Figure 2. Number of Reviews by CSAM Risk Level / Source: Omdena


Data Sampling

We randomly sampled 50% of the data in order to process the data in a more efficient way. The following graphic illustrates the code used to sample 50% of the data.


Internet Safety Children


Data Cleaning

It is necessary to clean the data in order to build a successful NLP model. To clean the review messages, we created a function called “clean_text” and used it to perform several transformations, including the following:

  • Converted the review text into all lower-case letters
  • Tokenizing the review text (i.e., splitting the text into words) and removing punctuation marks
  • Removing numbers and stopwords (e.g., a, an, the, this).
  • Using the WordNet lexical database to assign Part-Of-Speech (POS) tags. The POS tags are used to attach labels to words that correspond to a noun, verb, etc.
  • Lemmatizing and transforming the words to their roots (i.e., games→ game, Played→ play)

Figure 3 provides an example of the reviews pre-and post-cleaning. In the “review” column, the text has not been cleaned, while the “review_clean” column includes text that has been lemmatized, tagged for POS, tokenized, etc.


Internet Safety Children

Figure 3. Sample of Cleaned Text / Source: Omdena


Feature Engineering

Before applying the models, we performed some feature engineering, including sentiment analysis, vector extraction, and TF-IDF.


Sentiment Analysis

The first feature engineering step was conducting sentiment analysis. The sentiment analysis was performed on the features to gain insight into how parents and children feel about hundreds of games, applications, and internet websites that are popular with children for their safety. We used Vader, which is part of the NLTK module, for the sentiment analysis. Vader uses a lexicon of words to identify positive or negative sentiments in long sentences. It also takes into account the context of the sentences to determine the sentiment scores. For each text, Vader returns the following four values:

  • Negative count score
  • Positive count score
  • Neutral count score
  • The compound — an overall score that summarizes the previous scores

Figure 4 displays a sample of cleaned reviews containing negative, neutral, positive, and compound scores.


Internet Safety Children

Figure 4. Sample of Sentiment Analysis Scores / Source: Omdena


Extracting Vectors

In the next step, we extracted vector representations for every review. Using the module Gensim, we were able to create a numerical vector representation for every word in the corpus using the contexts in which they appear (Word2Vec). This is performed using shallow neural networks. Extracting vectors in this way is interesting and informative because similar words will have similar representation vectors.

All text can also be transformed into numerical vectors using word vectors (Doc2Vec). We can use these vectors as training features because the same texts will also have similar representations.

It was first necessary to train a Doc2Vec model by feeding in our text data. By applying this model to the review text, we are able to obtain the representation vectors. Finally, we added the TF-IDF (Term Frequency — Inverse Document Frequency) values for every word and every document.

But why not simply count the number of times each word appears in every document? The problem with this approach is that it does not take into account the relative importance of the words in the text. For instance, a word that appears in nearly every review would not likely bring useful information for analysis. In contrast, rare words may be much more meaningful. The TF-IDF metric solves this problem:



The Term Frequency (TF) computes the classic number of times the word appears in the text, while the Inverse Document Frequency (IDF) computes the relative importance of the word depending on the number of texts (reviews) in which the specific word is found. We added TF-IDF columns for every word that appeared in at least 10 different texts. This step allowed us to filter a number of words and, subsequently, reduce the size of the final output. Figure 5 provides the code used to apply TF-IDF and assign the resulting columns to the data frame, and Figure 6 displays the output of the sample code.


Internet Safety Children

Figure 5. TF-IDF Code Sample / Source: Omdena

Internet Safety Children

Figure 6. TF-IDF Sample Code Output / Source: Omdena


Exploratory Data Analysis

The EDA produced a number of interesting insights. Figure 7 provides a sample of reviews that received high negative sentiment scores, and Figure 8 displays a sample of reviews with high positive sentiment scores. The sentiment analysis successfully assigned negative sentiments to reviews with text such as “violence, horror, dead.” The analysis also effectively assigned positive sentiments to reviews containing words such as “fun, cute, exciting.”


Internet Safety Children

Figure 7. Sample of Reviews with High Negative Scores / Source: Omdena


Internet Safety Children

Figure 8. Sample of Reviews with High Positive Scores / Source: Omdena


Figure 9 shows the distribution of the trend of messages among high and low-risk games. Varder categorizes low-risk reviews as positive messages, whereas high-risk reviews should have lower compound sentiments. This shows that the sentiment feature extractions proved helpful in modeling the risk analysis.


Internet Safety Children

Figure 9. High_Low Risk Distribution over Compound Sentiments / Source: Omdena


Modeling High-Risk Games/Applications/Websites

After we successfully scraped the reviews, built the dataset, cleaned the data, and performed feature engineering, we were able to build an NLP model. We choose which features (reviews and clean reviews) to use to train our model.

Then, we split our data into two parts:

  • Training set for training purposes
  • The test set to assess the model performance

After selecting the features and splitting the data into test/training sets, we fit a Random Forest classification model and used the reviews to predict whether a platform is a high risk for CSAM. Figure 10 displays the code used to fit the Random Forest classifier and obtain the metrics.


Internet Safety Children

Figure 10. Random Forest Classifier Code Sample / Source: Omdena


Figure 11 displays a sample of features and their respective importance. The most important features are indeed the ones that were obtained in the sentiment analysis. In addition, the vector representations of the texts were also important in our training. A number of words appear to be fairly important as well.


Figure 11. Feature Importance / Source: Omdena


The Receiver Operating Characteristic Example (ROC) curve and Area Under the Curve (AUC) allow one to evaluate how well a model performs in terms of its ability to distinguish between classes (high/low risk for CSAM in this context). The ROC curve, which plots the true positive rate against the false-positive rate, is displayed in Figure 12. The AUC is 0.77, which indicates that the classifier performed at an acceptable level.


Internet Safety Children

Figure 12. ROC Curve / Source: Omdena


The Precision-Recall (PR) Curve is illustrated in Figure 13. The PR curve is graphed by simply plotting the recall score (x-axis) against the precision score (y-axis). Ideally, we would achieve both a high recall score and a high precision score; however, there is often a trade-off between the two in machine learning. The sci-kit learn documentation states that the Average Precision (AP) “summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.” The AP here is 0.72, which is an acceptable score.


Internet Safety Children

Figure 13. Precision-Recall Curve / Source: Omdena


It is evident in Figure 13 that the precision decreases as we increase the recall. This indicates that we have to choose a prediction threshold based on our specific needs. For instance, if the end goal is to have a high recall, we should set a low prediction threshold that will allow us to detect most of the observations of the positive class, though the precision will be low. On the contrary, if we want to be really confident about our predictions and are not set on identifying all the positive observations, we should set a high threshold that will allow us to obtain high precision and a low recall.

In order to determine whether or not the model we built performs better than another classifier, we can simply use the AP metric. To assess the quality of our model, we can compare it to a simple decision baseline. With a random classifier for the baseline, the model would simply assign 0 half the time and 1 the other half of the time. Our AP metric is 0.77, which is better than a random classifier.


Conclusion and Observations

It is nearly possible to use just raw text as input to make predictions. The most important aspect is to be able to extract the relevant features from a raw data source. Such data can often complement data science projects, allowing one to extract more meaningful/useful features and increase the model’s predictive power.

We were only able to predict the platform’s risk through user reviews, and it is possible that the reviews are biased. To improve the precision of our predictive model, we can triangulate other features such as player sentiments, game titles, UX/UI features, and in-game chats. Used in combination, these features can provide a number of insightful recommendations. Our predictive model will shed light on CSAM risk in online games, applications, and websites that are popular with children by automatically detecting each platform’s risk level. In the future, we hope that parents will be able to better select platforms for their child’s use based on our use of AI.

A Chatbot Warning System Against Online Predators

A Chatbot Warning System Against Online Predators

Using Natural Language Processing to warn children against online predators.

The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.


Protecting Children

Today, children face an evolving threat — online violence. Violence and harassment of children have been growing exponentially for more than 20 years but due to the recent events leading to the closing of schools for over 1 billion children around the world, children are more vulnerable than ever. Online predators use Internet avenues popular with children and adolescents such as game chat rooms to lure them into sexual exploitation and even in-person assaults.

Protection against online sexual violence greatly varies from platform to platform. Some gaming platforms include a profanity filter that looks for problematic words and replaces them with a string of asterisks. Outside the gaming platforms, many chat platforms still do not have any safeguards in place to protect children from predatory adult conversations. However, chat logs can provide information on how a predator might attempt to exploit children and young adults into risky situations, such as sharing photos, web cameras, and sexting (sexual texting).

Often, pattern recognition techniques provide an automated identification of these conversations for potential law enforcement intervention. 

Unfortunately, this strategy uses many man-hours and spans many message categories, which makes it all the more difficult to identify these patterns. It is a challenging task, but one that is worth tackling, and we elaborate on our approach in the rest of the article.


Online Predators



First, let us establish our working definitions.

  • The New Oxford American Dictionary defines a predator as a “person or group that ruthlessly exploits others.”
  • Expanding the term to a sexual predator as “a person seen as obtaining or trying to obtain sexual contact or favor with another person in a metaphorically ’predatory’ manner”, Daniel M. Filler, Virginia Journal of Social Policy & the Law (2003).


The Solution — Data Engineers Unite!


Online Predators

The team focused on the solution’s Predator Analysis portion.


Our solution looked to reduce man-hours and to develop a near real-time warning system to the chat that alerted the child when the conversation changes sentiment. The team used a semi-supervised approach to evaluate if the conversations provide a low, medium, or high risk to the child. The system would evaluate the phrase or sentence and return an effective sentiment warning if warranted. The data for our chatbot (Predator-Pseudo-Victim conversations) was collected from interactions between a predator and a law enforcement officer or volunteer posing as a child.

The chatbot was designed to learn from non-predatory and predatory conversations and distinguish between them. Additionally, it would have the ability to recognize inappropriate messages no matter whether they came from the predator or the child’s side. The corpus also had adult-like conversations initiated from the child’s side.


The Dataset

The team consolidated and cleaned nearly 500 chat log files that contained exchanges between a predator and a pseudo-victim. The collection grew into a corpus containing 807,000-plus messages ranging from “hello” to explicit remarks. The dataset creation proved laborious, where I voluntarily provided more than 630 hours in just labeling data. The dataset received labels, such as male or female as they identified themselves in the chats, predator or victim, and level of risk of the conversation. Nearly half of the project time was dedicated to a properly built and parsed dataset.

This dataset was split into a training, development, and test set. The training set held 75 percent of all messages for the chatbot to learn the contextual format and nuances of conversation. The development set, which was 10 percent of the data, was held away from the chatbot until after model selection, to prove the validity of the model.

The 2 mins video quickly discusses how the team assembled the chatbot’s dataset.



Data Format and Storage

The data was housed in a relational database. It became large enough to serve as a nexus to provide uniquely formatted datasets for the machine learning pipeline.

During the labeling process, few issues arose on how to semantically define a conversation. With many different log formats, ranging from AOL Messenger to SMS and other online platforms, the sentences would start and stop at different points. In conversations, I implemented a similar format as used in the competitively-used Cornell University’s movie corpus that provided a standard structure making it easy to parse the data. Additionally, the corpus contained chat slang, abbreviations, and number-for-words, like “l8r” for “later” and “b4” for “before”, which required a team consensus on how to handle these stopwords. The team did not focus on timestamps due to extremely varied formatting, missing values, and lack of importance to the overall project.


The Model

Many models presented as candidates to the chatbot’s internal workings. The main goal for the team was to have a local and offline solution for now. This was done to reduce privacy concerns and legal issues. Future considerations of this project would evaluate these features with appropriate development operations.


Online Predators

Basic sequence-to-sequence model diagram.


The selected model focused around a Long Short Term Memory (LSTM) network cell, arranged as a sequence-to-sequence configuration. LSTMs have long been proved well-suited to work with sequential data. Our application would use this ability to help the chatbot predict the next plausible word to use for response.

For the sentiment analysis portion, we focused our efforts on an ensemble learning model as well as a support vector machine to help predict when the conversation changed from benign to risky.



Our team successfully built a chatbot and a sentiment analysis model independently. The chatbot learned from its more than 807,000 messages to understand how to parse sentences and structure a proper response. The limited vernacular stemmed from the chatbot’s time to learn and framework limitations.

The greatest challenge to code performance-centered inside the platform chosen, TensorFlow 1.0.0, provided limitations. The code did provide a conversation-capable entity, but the model needs more training data if we want to go beyond proof-of-concept to deploy it in an application.

The project successfully employed message sentiment analysis and was able to warn the user of potentially risky conversations initiated by online predators. The sentiment analysis ranged from low, medium, or high levels of risk.

Future considerations will take this project into a full-functioning environment of TensorFlow 2.1.0, eliminating other frameworks, including PyTorch. The internal model will receive an update to the LSTM structure and performance will be improved with the use of graphics computing processors, such as NVIDIA and its cuDNN framework.

Applying Machine Learning to Predict Illegal Dumpsites

Applying Machine Learning to Predict Illegal Dumpsites

By Ramansh Sharma, Rosana de Oliveira Gomes, Simone Vaccari, Emma Roscow, and Prejith Premkumar


Just like any other day, we start our morning with a coffee and a snack to go from our favorite bakery. Later on the same day, we check out our mail where we find letters, newspapers, magazines, and possibly a package that just arrived. Finally at night, after a rough week, we decide to go out to have drinks with friends. Sounds like a pretty uneventful day, right?

Except that we produced lots of trash in the form of plastic, glass, paper, ad more.

According to eurostat, it is estimated that an average person in Europe produces more than 1.3 kg of waste per day (in Canada and the USA, it can go up to more than 2 kg). This is equivalent to a person producing 800 kg of trash per year. Now imagine millions of… billions of people doing the same. Every day!

To give you an even clearer perspective: less than 40% of all the waste produced in Europe is recycled — and it is even less across the other continents. Even further, it is estimated that 20% of all generated waste ends up on illegal dumping (s) in Europe, and 50% in Africa.

TrashOut is an environmental project which aims to map and monitor all illegal dumping (s) around the world and to reduce waste generation by helping citizens to recycle more. This is done through a mobile and web application that helps users with locating and monitoring illegal dumping (s), finding the nearest recycling center or bin, joining local green organizations, reading sustainability-related news, and notifying users about updates on their reports.

In this article, we discuss our analysis of illegal dumping (s) across the world, both in local and global scales.


The problem


Photo by Ocean Cleanup Group on Unsplash


The problem statement for this project was to “build machine learning models on illegal dumping (s) to see if there are any patterns that can help to understand what causes illegal dumping (s), predict potential dumpsites, and eventually how to avoid them”. We decided to tackle this wordy problem statement by dividing it into three manageable sub-tasks to be worked on throughout the duration of the project:

  • Sub-task 1.1: Spatial patterns of existing TrashOut dumpsites
  • Sub-task 1.2: Predict potential dumpsites using Machine Learning
  • Sub-task 1.3: Understanding patterns of existing dumpsites to prevent future potential illegal dumping (s)



  • TrashOut: Reports on illegal dumping (s) provided by users through the TrashOut mobile App. For each report, a number of features are recorded, and the most relevant for this analysis were: location (latitude and longitude, city, country, and continent), date, picture, size, and type of waste.
  • Open Street Maps (OSM): Geospatial dataset and information on the cities road network, including the type of roads (e.g. motorway, primary, residential, etc)
  • Socioeconomic Data and Applications Center (SEDAC): Population density at 1km grid, from which we also calculated the population density gradient to account for population density in the neighboring cells
  • FourSquare: Information about nearby venues
  • World Bank Indicators, World Bank’s “What a Waste 2.0”, Eurostat, European Commission Directorate-General for Environment: Datasets for socio-economic indicators.
  • Non-dumpsites Control Dataset: we generated our own Control Dataset, which was required to train the model on where dumpsites do not occur. For every TrashOut dumpsite location, we selected a pseudo-random location 1 km away and assigned this as a potential non-dumpsite location.



The first challenge was to identify and extract meaningful information for the spatial analysis from the available datasets. Our assumption was that illegal dumping (s) are more likely to occur in highly populated places, in proximity to main roads and in proximity to venues of interest such as sports venues, museums, restaurants, etc. Based on this assumption, we used the available dataset to extract, for every TrashOut dumpsites as well as for every location of our Non-dumpsite/Control Non-dumpsites, the 17 features described in Table 1:


Table 1: Datasets and API’s used to acquire different features for dumpsites * For the control dataset, the source for Continent was pycountry-convert library.


Sub-task 1.1: Finding existing dumpsites

City-based Analysis of Illegal Dumpsites/ Dumping (s)

We performed an in-depth analysis focused on six shortlisted cities, with the goal to represent different social statuses and geographical locations so all continents were included, and based on the availability of a considerable number of TrashOut dumpsite reports. The cities analyzed were:

  • Bratislava, Slovakia (Europe)
  • Campbell River, British Columbia (Canada)
  • London, UK (Europe)
  • Mamuju, Indonesia (Asia)
  • Maputo, Mozambique (Africa)
  • Torreon, Mexico (Central America)

For the city-based analysis, we accessed the road network information from the OSM dataset by using the Python package OSMnx. This API allows easy interaction with the OSM data without needing to download it, which makes it very accessible in any location around the world. We structured the analysis in a Colab Notebook for consistency and analyzed the following features for each city: distance to three types of roads (motorway, main and residential), distance to the city center, population density, size, and type of waste.


Results for Bratislava

The proportion of TrashOut dumpsites vs. Control Non-dumpsites and their proximity to nearest roads within 1 km is shown in Figure 1, however, the statistical assessment was undertaken within 100 m using the two-proportion Z-test. The three graphs are generated for each road type (motorways, main roads, and residential roads) with the purpose to identify whether dumpsites are more likely to appear in proximity of a specific road type. In Bratislava, around one-fifth of dumpsites were found in proximity to the main road (within 100 m), and these were found more likely to be reported next to the main road (within 100 m) compared to locations of Control Non-dumpsites. However, most dumpsites are not reported on roadsides, and in fact, being further away from a road was found to be a slightly better predictor of where a dumpsite might occur.


Figure 1: Proximity to a nearest major road for dumpsites and control datasets


The location of TrashOut dumpsites across Bratislava, colored by reported size, is shown in Figure 2. The majority (around three-quarters) of dumpsites are estimated by TrashOut users to be too big to be taken away in a bag. Dumpsites of all sizes are found throughout the city, but the largest dumpsites tend to be further away from motorways.


Figure 2: Size of dumpsites in the city of Bratislava


Several types of waste were reported alongside other types of waste within the TrashOut dumpsites. The number of dumpsites containing each type of waste is shown in the bar chart in Figure 3.1, whereas in Figure 3.2 is shown the percentage of dumpsites containing several types of waste in a matrix. The majority of reported dumpsites in Bratislava contain what TrashOut users describe as domestic waste. Domestic waste often coincides with plastic waste, which itself is found in around half of the dumpsites. Around one-third of dumpsites are reported to contain construction waste.



Figure 3: Waste types in the TrashOut datasets for Bratislava



Visualizing the distribution of dumpsite reports throughout the city with the spatial analysis undertaken can be informative in preparation to clean up existing dumpsites, as well as for identifying potential new hotspots. The following observations were drawn from this city-level geospatial analysis.

Information about the type and size of dumpsites may be important for local authorities and decision-makers to consider how best to clean up dumpsites. Having a spatial visualization of the locations and characteristics of each dumpsite across each city area, not only helps to inform management efforts to clean up existing dumpsites, but also to try minimizing potential new dumpsites by introducing bins for specific types of waste, or holding events to increase recycling awareness.

Plastic waste is found alongside other types in many dumpsites, which is not surprising. Waste that can be separated occurs simultaneously in reports: domestic and plastic, glass or metal. This might suggest that infrastructure is lacking (i.e. waste collection facilities), or the population is not aware of waste sorting and recycling.

The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This might suggest that residents find construction waste difficult or costly to dispose of legally, or that construction companies are neglecting their responsibility to clean up.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of whether there are more dumpsites in those areas.

The use of these tools and analysis will always need to be supported by local knowledge, as well as with the involvement of local municipalities and authorities.


1.2: Predicting potential dumpsites 

Features to train the Machine Learning model

The second subtask focused on creating a Machine Learning model that could predict whether a location is at risk of becoming a dumpsite. Since we have already seen the variables that were considered to be of a strong influence on dumpsites in Table 1, these variables could be used to predict whether a new location could turn into a dumpsite.

When acquiring the venue categories, we set a radius parameter in the Foursquare library until which distance it is supposed to fetch venue categories information. Although we created datasets with radii 500m, 1km, 2km, and 3km, we came upon the conclusion that the 1km radius dataset was the most appropriate one with the best model performance. It was not too near to the location from which the data was being collected therefore not losing any vital information, and at the same time not too far so that irrelevant information needed to be fetched.

The features: Number of Venue Categories, Nearest Venue Categories, and Frequent Venue Categories were only acquired up until 1km from a given location. Moreover, the five nearest venue categories and five most frequent venue categories were acquired for each given location as separate variables. If the Foursquare API failed to acquire not all 5 (or even none in some cases) categories within a 1km radius, then a None string would be placed instead in the empty variables.

A similar approach was taken for the OSM library for the distance to roads features. The value was only collected for roads up till a 1km radius from a location, with the exception of few cases where the API returned a distance slightly beyond 1km.

For the population density feature, our team discussed different approach ideas, and eventually, we decided that, instead of having a singular value for the population density of the given location, the probability of a dumpsite occurring in (or in very close vicinity of) that location is also affected by the surrounding population. Therefore if the location is in the center of a 1×1 square km cell, then the population densities of the eight 1×1 square km cells around the center cell are also considered. This would be a rather good way to see if the dumpsite is in the middle of a highly-populated area, in the outskirts of the city, or in a nowhere land. Using these nine different population densities (with a more weightage on the center cell’s density), a population gradient is calculated for the location which is given to the model as a separate feature in addition to the population density.

These were the 17 features that would be used for the Machine Learning model part of this subtask. But how do we teach the model what a dumpsite is?


Control dataset

We wanted to train a Machine Learning model such that it learns to understand what constitutes a dumpsite in the 17 variables we gathered. We fetched and calculated the features for every one of the approximately 56,000 dumpsites in the TrashOut dataset that we had. However, it is not possible to train a model just by showing it what a dumpsite is. This is because when we show a location that is highly unlikely to become a dumpsite, we want our model to confidently tell us so. An analogous comparison would be to show 56,000 cats to a child and then expect he/she to recognize that a dog is not a cat.

The solution lies in creating a control dataset. In order to teach the Machine Learning model what a dumpsite is, we also need to teach it to understand what a dumpsite is not.

For the sake of simplicity, we can also call the control dataset non-dumpsites. So how do we go about finding non-dumpsites? Any location that is not a dumpsite is in essence a non-dumpsite. However, that will not help the model learn meaningful differences between the two classes: dumpsites and non-dumpsites. Instead, what we can do is find close geographical points to the dumpsites that we already have and use them as the control dataset. Once again, we experimented with multiple distances from the dumpsite to pick these points and found that a distance of 1 km works best. The advantages of choosing these points are:

  • The points are close enough to the dumpsites so that there are subtle changes in the features that the model will be able to learn and appropriately map to the two classes.
  • The points are not too close so that the model fails to realize key differences between the features of the two classes.
  • The points are not so far that there is no correlation between the two classes, therefore, preventing nullification of the purpose of the control dataset.
  • When we choose a point near a reported dumpsite, we assume that a location nearby a known dumpsite has active users of the TrashOut app in the area, so if there was no dumpsite report, we assumed there was no dumpsite in that location.


Figure 4: Illustration of determining the approach to creating a control dataset.


We also took measures to make sure that the non-dumpsite point generated for each dumpsite did not contain another dumpsite, was not in the vicinity of another dumpsite, and was not in a major water body.

The control dataset was made for every dumpsite so that the two classes were balanced for binary classification. Additionally, all the features that were used in the dumpsite dataset were used for the control dataset as well.



Our team investigated three different Machine Learning models throughout the project:

  1. Random forest classifier — This approach did not work because the model failed to understand the data in a thorough manner and yielded extremely low accuracy. ❌
  2. Neural Networks (Dense and Ensemble) — These series of models and its iterations did not work either because the model was tremendously overfitting. It would be unsuitable for real-world purposes. ❌
  3. Light Gradient Boosted Model (LGBM) — This model was our final model. It had good accuracy and the minimum generalization error among all three models. ✅



The final accuracy of our model was 80% on the test set. We employed the use of k-fold cross-validation to maximize the accuracy our model could achieve on the test set. We also observed how important the individual features were when it came to classifying a given location to be prone to a dumpsite or not. This analysis was done with the help of SHAPELY and PDP plots as shown in Figure 5 below.


Figure 5: Importance of every individual variable on the model performance.


Figure 6: Probability output change in the model induced by a change in distance to roads feature


The SHAPELY plot in Figure 5 shows the contribution of each feature towards the prediction. It depicts the importance that each feature has. A high value of importance signifies that the model considers it as a very important factor when determining if a given location is a dumpsite or not. It is indicated that the most important feature of the model for both classes is the distance to the road variable.

The Partial Dependence Plot in Figure 6 helps one understand the effect of a specific variable on the model output. As the value of a feature changes, its effect on the model will also change accordingly. We compute these plots for all the numerical values that we have, to understand its effect on the prediction of the model. The one shown above is the plot analyzing the effect of distance to roads variable. As it can be observed, as the distance from a given location to a major road increases (positive x-axis), the probability that that location is a dumpsite decreases. The soft spike from 1500 m to 2500 m is due to how we manually placed the value of 2500 m in an example when the API could not find a road up till that distance. Regardless, this situation can be manually handled in the deployment implementation.

One of the key achievements of the team was being able to generate a full city heat map of the city of Bratislava (most of our tests were based here) by running the model on more than 700 locations in the city. In the heat map, the actual dumpsites are plotted in blue markers while the roads are marked in black lines (major roads and highways are visibly thicker than minor roads). The spectrum goes from whitish-yellow to dark red with yellow regions resembling a low probability of becoming a dumpsite and the red regions resembling a high probability of becoming a dumpsite. The heat map provides many beneficial usages. For example, municipalities and local authorities can make smaller heat maps for regional neighborhoods to determine which areas are at a high risk of becoming a dumpsite.

Another important variational use of this heat map is to combine it with valuable insights about socio-economic factors, population density, distance to roads, etc. The reason being that even though the model has considerably good accuracy, the wisdom still lies in the intuition of local authorities and municipalities. These officials will be better equipped to analyze key neighborhoods and areas to find the places where there is a major road or highway nearby, has a high population density, and a certain set of venue categories in close proximity. Then, a heat map can be generated for that area and specific regions can be identified which require immediate attention to mitigate the possibility of becoming a dumpsite.


Figure 7: Heat map of the city of Bratislava using the ML model to predict dumpsites


Sub-task 1.3: Preventing future dumpsites

Global analysis of illegal dumpsites/ dumping (s)

In order to analyze illegal dumpsites/ dumping (s) on a global scale, we combined the data from TrashOut with two other datasets:

From this setup, it was possible to divide the countries analyzed into four clusters, using unsupervised learning:


Figure 8: Global analysis clustering summary. Source: Omdena


Small population developed countries (Blue cluster): countries with a small population and population growth, but high urban populations and access to electricity. These countries also have low urban population growth and GDP. The countries in this cluster present a high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste.

High population developed countries (Orange cluster): countries with the highest GDP, and also high access to electricity, urban population, tourism. Low inflation and urban population growth. also produces high amounts of glass and paper/cardboard and is responsible for the highest production of special waste and total municipal solid waste: These countries are also associated with the lowest production of organic waste.

High population developing countries (Green cluster): countries with the highest population and inflation; moderate access to electricity and GDP, and growing urban and total population. They generate high amounts of organic, rubber/leather and wood waste, and low amounts of glass. This can be associated with the high population being less concentrated in cities and also to the level of industrialization of such countries, which may be the ones that most produce factory materials such as leather and rubber.

Small population developing countries/low income (Red cluster): countries with a small population and GDP; lowest electricity access, but highest GDP and population (including urban) growth. Most of the waste produced in these countries is from food and organic sources. These are the countries that also produce the lowest amounts of glass, metal, and rubber/leather waste. Such scenarios can be associated with the low populations and also possible use of waste in a sustainable way, as these countries also present low production of green/yard and wood waste.


Figure 9: Some features in the global analysis in country clusters.


Combining this analysis to the illegal dumpsites/ dumping (s) data from TrashOut, our team obtained the following insights:

  • Plastic waste is highly produced across all clusters, indicating that this kind of trash needs to have a global awareness strategy.
  • Different types of illegal trash are associated with rural and urban areas, making it important on a global scale. Countries with a higher rural population (low income) will produce more illegal organic waste, whereas developed countries present more reports on illegal dumpings/ dumping (s) of plastic, cardboard, yard/green waste, rubber/leather, and special waste.
  • Socio-economic factors such as infrastructure, sanitation, inflation, and tourism play a moderate role in the different production of illegal waste worldwide.
  • We identify that population is the most important factor in the production of special waste, municipal solid waste, and the total amounts of waste per year. However, most of the special waste produced in developed countries (majority) is reported.



In this project, we have analyzed data on a local and global scale to understand which factors contribute to illegal dumping, as well as predict and finding possible ways to avoid it.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of the number of dumpsites in those areas.

Nevertheless, visualizing the distribution of dumpsite reports with the spatial analysis undertaken can be informative for identifying potential new hotspots. The following observations can be extracted from our analysis:


On a city level

  • The prediction of the ML model and the heatmap can be used as tools for targeted waste management interventions, but will always need to be supported by local knowledge as well as with the involvement of local municipalities and authorities.
  • The main road and motorway junctions are locations where illegal disposal of waste is prone to occur. We can witness this in Bratislava and Torreón.
  • A lot of reports occur in natural resources areas, e.g. watercourses or natural parks. We can see this in Bratislava, Mamuju, and Campbell River. This may be due to two factors: ease of disposal without being caught; people walking by those areas may be more environmentally aware and wanting to preserve more the places where they go to enjoy nature. Consequently, they create reports more often.
  • Waste that can be separated occurs simultaneously in reports: domestic and plastic, glass or metal. This might suggest that infrastructure is lacking (i.e. waste collection facilities), or the population is not aware of waste sorting. This is especially clear in Maputo city.
  • The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This applies to companies and individuals. Construction waste seems to be a problem for many cities.
  • TrashOut reports seem to be created in pulses, not on a regular basis. For all the cities examined, there are certain months when the number of reports created exceeds the average by far. Moreover, reports seem to be generally created in specific parts of a city (eg. London).


On a global scale

  • Plastic waste production is high across the globe, making it an international problem;
  • Among all the socio-economic factors, population plays a strong role in the production of waste in the world (legal and illegal);
  • The level of development and socio-economic factors (infrastructure, sanitation, education, among others) play an important role in the kind of waste produced by countries.
  • In particular:
  1. Small developed countries present a high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste
  2. High population developed countries: high amounts of glass and paper/cardboard and the highest production of special waste and total municipal solid waste. These countries are also associated with the lowest production of organic waste
  3. High population developing countries: high amounts of organic, rubber/leather and wood waste, and low amounts of glass;
  4. Small developing countries/low income: most of the waste produced in these countries is from food and organic sources.


Possible Factors to Avoid Illegal Dumping

  1. Organization of clean-up events in areas where many dumpsites are already existing can be arranged to clean up the targeted areas in a fun, interactive, and educational way. Collaboration with local authorities should be put in place to improve existing waste infrastructure and build new ones, if necessary.
  2. Those areas identified as high risk for becoming new potential dumpsites could be targeted with waste infrastructure development/enhancement programs. Additionally, learning events could be organized to raise awareness about dumpsites risks, and how to minimize, or avoid altogether, dumpsites by using properly the waste facility infrastructure existing.
  3. Examples of learning events consist of learning sessions on how to use waste infrastructure and recycling bins according to local/national authorities, the benefits of a sustainable way of living through the 4R cycle (Refuse, Reduce, Reuse, Recycle), and how to avoid single-use items (or a specific type of waste as can be highlighted by a city-level analysis).

How to Become a Data Engineer: The AI Plumber?

How to Become a Data Engineer: The AI Plumber?

By Natu Lauchande
21st Century roadmap to becoming a Data Engineer

What is a data engineer?

In broad strokes, a data engineer is responsible for engineering systems and tools that allow companies to collect raw data from a variety of sources, volume, and velocity into a format consumable by the broader organization. The most common downstream consumers of data engineering products are the AI/Machine Learning and Analytics functions of a company.

The best way to start talking and discussing this new and loosely defined role is the Data Science hierarchy of needs brilliantly depicted by Monica Rogatin in the pyramid below.







                                         Source: The Medium post “The AI Hierarchy of Needs”



A data engineer is the lead player on the first 3 foundational rows of the Pyramid: Collect, Move/Store and Explore and Transform. A plethora of roles from Data Analysts, Data Scientists, and Machine Learning Engineers are the heirs and lead role players on the higher phases of the value chain unlocking.

A Data Engineer is part of the functioning that provides the base to the highly critical job of the Data Scientists by hiding all the complexities involving the management, storage, and processing of the data assets of the company. He or she is a master of data ingestion, enrichment, and operations.



Source: Oreilly



With the deluge of data available within public and private companies, the ability to unlock this value is the critical factor in providing cheaper and better services to stakeholders and customers.


Skills of the trade

Data Engineers do come in different flavors and types. The core skills of the trade can be summarized below in order from essential to important:

  • Software Engineering: Data Engineering in its essence, is a discipline of Software Engineering where the same rhythms and methodologies of work are applied in order to execute the task at the end. The use of version control, unit testing, and agile techniques to ensure business alignment and quick delivery are paramount for success.
  • Relational Database/Data Warehouse Systems: Most of the data access in the data engineering space is democratized through access to ad-hoc querying into a relational database environment. Allowing expert users with basic knowledge of SQL to retrieve the data that they need in order to respond to a business query or decision.
  • Scalable Data Systems/Big Data: It’s central to the modern data engineer to understand data systems architectures. A good grasp of how distributed and parallel processing work is needed. The different types of indexing available in their environment to allow proper and efficient processing of the data at their disposal is a great skill to have.
  • Operating Systems / Command Line: Familiarity with your local environment of development being OS/*NiX/MIN is primal, particularly the command line where a lot of ad-hoc wrangling can happen.
  • Data Visualisation: A fundamental skill to effectively expose data products to a more general audience and quickly unlock data value through clear infographics, charts, and interactive analytics. Familiarity with a tool like Tableau, Superset, or Power BI is a must.
  • Data Science (Basics): An increasingly important user and stakeholder of a Data Engineering organization is the data science team. Understanding how data is used in the context of exploratory data analysis, machine learning, and predictive analytics ensures a virtuous cycle between critical data functions.

Data Engineers don’t need to be experts in all of the areas above. Having two core expertise in the above and a good understanding of the other areas go a long way in delivering value to a project.

A Data Engineer can come in different shapes and forms, so being very specific about your role is very important. As a nascent profession, it lacks standards and consistent job descriptions.

Typically transitions to successful data engineers are seen from the following backgrounds in the industry:

Software Developer/Engineer, Data Scientist, Database Administrator, Business Intelligence Developer, and, Data Analyst.


The path to mastery

To master data engineering I would start with the prerequisite of getting deep experience and expertise in two or more of the following areas.

  • Distributed Systems / Big Data
  • Database Systems / Data Warehousing
  • Software Development
  • Data Visualization

The most traditional path to mastery is a degree in a discipline with high Computing exposure (CS, EE, Info Sys., Applied Maths/Phys, Actuarial Science/Q) or a Quantitative degree followed by a couple of years in Software Development or Data Science with practical exposure to backend services and production systems. The data engineering field is loaded up with rockstar engineers from non-traditional backgrounds ( high school dropouts, literature majors, etc.).

A couple of top online courses and specialization available at the top websites ( Coursera, Udacity, Udemy, etc.) covering Big Data / Data Engineering tooling can give a good foundation to aspiring Data Engineers. The ones with the best reviews in your preferred learning platform will assist you in building a skill set for the role.

After this initial foundations I would recommend the following books for fundamentals in architecture:

  • Designing Data Data-Intensive Systems —Martin Klepmann
  • Data Engineering Cookbook—Andreas Kretz
  • Foundation of Architecting Data Solutions — Malaska et. AL,
  • Streaming Systems — Akidau et. al
  • The Data Warehouse Toolkit — Ralph Kimball

Nothing is more valuable at this stage than getting practical exposure in a real-world data engineer role. Keep practicing and growing the craft for the rest of your career.

Omdena as an organization that promotes AI challenges with volunteers across the world is the ideal place for anyone to sharpen their data engineering skills. In many of the Omdena challenges one of the most important skills needed is data engineering skills to prepare data, set up data pipelines, and operationalize pipelines.


Typical tools of the trade

With all the excitement in the field, a plethora of tools are popping up in the market, and knowing which one to use becomes a problem as there are many overlapping uses of them. A typical data engineer product/service does not differ much in terms of the complexity of a software system.

A typical data engineering pipeline will require expertise in at least one tool per function/category:


1. Function : Pipeline Creation / Management

Apache Airflow

  • End to end workflow authoring and management tool.
  • Provides a computing environment where your processes can run.

Alternatives: Azkaban, Luigi, AWS SWF


 2. Function: Data Processing

Apache Spark

  • A fundamental tool to process data in many formats at high scalability.
  • Allows facile enrichment and processing in SQL, Scala, and Python.

Alternatives: Apache Flink, Apache Beam, Faust


3. Function: Distributed Log/Queueing Systems

Apache Kafka — Scalable distributed queuing system that allows data to be processed and moved at a very high speed and large volumes.


4. Function: Stream Processing

Alternatives: Apache Flink


5. Function: Data/File Format

Apache Parquet — Very efficient data format geared for analytics and aggregations at scale on cloud or on-premises.

Alternatives: Arrow, CSV, etc.


6. Function: Data Warehousing /Querying


  • A cloud-based data warehouse system for structured and relational data storage and analytics.

Alternatives: AWS Redshift, Apache Hive, etc.


Keep in mind that tools go and come over the years. Focus on the picture and functional areas will keep you updated and ready to learn the new fancy tool.

Starting or joining an open-source that uses any data engineering tool is a good move from a growth perspective and longer-term mentorship by captains of the industry.


The future

In order to fulfill the promise of unlocking the value of data, more investment in the Data Engineering space is expected. There’ll be increasingly intelligent tooling available to handle the current and future challenges around data governance, privacy, and security.

I can see an increase in blending AI and ML techniques directly on the Data Engineering toolchain from an operations perspective and data quality assurance. Good examples of such tools are Deequ from AWS Labs that applies machine learning to data profiling. At the center of modern Data Engineering are areas like synthetic data generation to alleviate issues around data privacy when the cost of acquisition of data and compliance is too high Tools to watch out on the synthetic data space: Snorkel and the use of generative adversarial neural networks to generate everyday tabular data.

With the rise of Auto ML for prediction and data analytics, a central role will be given to the underpinning data infrastructure engineering of the datasets that drives the enterprise strategy. From here, we can only see an outlook of increasing relevance and opportunities to contribute positively to society.

I would like to acknowledge Laisha WadhwaJames Wanderi, and Michael Burkhardt for their input and suggestions on the article.

Stay in touch via our newsletter.

Be notified (a few times a month) about top-notch articles, new real-world projects, and events with our community of changemakers.

Sign up here