Adopting an Agile Navigation Approach in an AI Project


By Diana Roccaro

Applying Natural Language Processing in an agile AI project to investigate Online Violence against Children.

I had the honor of contributing to the 29th Omdena project, conducted in partnership with Save the Children and aimed at investigating Online Violence against Children (OVAC). I tried to follow our kick-off meeting on August 20, 2020, from a train on the Albula Line of the Rhaetian Railway, a very scenic route and a UNESCO World Heritage Site. A bird would have chosen a more direct path to cover the actual distance of 47 kilometers, but our train had to cross 55 bridges and 39 tunnels spread across 62 kilometers.

Setting Project Goals

Only when you know your goal can you select the most efficient path. The kick-off meeting, which I mostly missed due to a constantly interrupted internet connection, focused heavily on Child Sexual Abuse Material (CSAM), a term essentially synonymous with child pornography. The problem statement handed over to us, in contrast, was intentionally kept very broad and did not specify what kind of OVAC our project team was supposed to analyze. For this reason, every task team formed during the 8 weeks of our project set its focus somewhat differently.

 


My attempt to categorize Online Violence against Children into distinct subclasses and to allocate project tasks to these subclasses. / Source: omdena.com

 

The aim of our project was to apply Natural Language Processing (NLP) techniques to investigate OVAC. This article is restricted to my analysis of news articles. I am a research neuroscientist (and, later on, a medical writer and machine learning engineer) by training, and thus clearly an expert neither in the domain of OVAC nor in the field of journalism. But whatever we humans write, our choice of words will inevitably be influenced by our background knowledge, and hopefully also adapted to our target audience. Human language is very dynamic. Thus, whatever we try to analyze using NLP, it can't be a bad idea to first obtain some minimal level of domain knowledge from experts in the field. Understanding a domain and reading about definitions and human classification systems will help us not only to select the terms we want to look for but also to define the classes that our machine counterparts shall learn to recognize and predict.

 

Collecting Data

To retrieve news articles in digital form, we need to decide on a provider of digital news articles and search for specific terms. If the provider's search engine offered advanced search options, we could think about the most sophisticated and goal-directed query syntax. Although standard in scientific repositories, such advanced functionality is rarely built into news article search engines, which forces us to spend some time and effort selecting the best terms to search for.

 


Photo by Ray Hennessy on Unsplash.

 

So how do we select our search terms, being experts neither in the OVAC domain nor in journalistic terminology? Just as a bird flying towards its target carefully pays attention to environmental cues telling it to adapt its direction, we can adopt what software developers would call an agile methodology. As long as we lack the bird's perfect navigation strategy, we can simply use trial and error: we input a search term, analyze what we get back, and adjust our search accordingly. News agencies are unlikely to use scientific terms, so it may be wise to use different search terms for collecting news articles than for collecting scientific publications.

I decided to retrieve news articles from the Thai e-newspaper Bangkok Post and started out comparing the specificity of various search terms to our topic of interest, Online Violence against Children. Since our kick-off presentation focused on CSAM, I tried its non-scientific equivalent "child pornography" and found every article in the results list to in fact be related to OVAC, and thus relevant to our project. In contrast, most other search terms additionally returned many irrelevant articles. For example, only 6 out of the 52 (thus 12 %) news articles obtained from Bangkok Post by searching for "online grooming" proved to in fact be about OVAC.

 


Specificity of results retrieved using various search terms related to Online Violence against Children. The higher the OVAC-relevant fraction, the more useful the search term. / Source: omdena.com

 

If the goal is to include only OVAC-relevant articles in our analysis, three strategies are conceivable: 1) to use only search terms that produce results 100 % specific to the problem of OVAC, 2) to use various search terms and manually check article by article to include only the relevant ones, or 3) to find some method to (semi-)automate filtering out the relevant articles. Since 1) would restrict the analysis to very few subclasses of OVAC, since 3) has the potential to increase the efficiency of 2), and since I am a curious person who had never before built a machine learning classifier for text documents, I decided to build a news article classifier.

 

Automating the Data Collection Process

The search results I got back using my search terms broadly fell into three categories: 1) news articles on OVAC (our target class), 2) articles on violence against children, but in physical offline rather than online forms ("PVAC"), and 3) articles on online violence, but against adults rather than children ("OVAA"). I wondered whether an algorithm would be capable of picking only project-relevant OVAC-related articles out of a collection of news articles of all three classes. To find out, I trained a Support Vector Machine (SVM) classifier on 209 news articles from Bangkok Post (supervised ML; 131 OVAC + 43 PVAC + 35 OVAA articles).
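
As a rough sketch of such a supervised setup (not the project's actual code; the toy texts and labels below merely stand in for the 209 labeled articles), a TF-IDF representation feeding a linear SVM is a common baseline:

```python
# Minimal sketch: a three-class article classifier (OVAC/PVAC/OVAA).
# The toy data stands in for the 209 labeled Bangkok Post articles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "police arrest man for sharing child sexual abuse images online",
    "online predator groomed minors through a chat app, court hears",
    "teacher charged with physically assaulting a pupil at school",
    "report documents corporal punishment of children at home",
    "woman harassed by anonymous trolls on social media",
    "adult victim of cyberstalking speaks out about online abuse",
]
labels = ["OVAC", "OVAC", "PVAC", "PVAC", "OVAA", "OVAA"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=42)

# TF-IDF features feeding a linear SVM, a strong baseline for text
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```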

Whenever we train a machine learning model, we have to decide which metric we want to optimize. My goal was not to find every single article out there on OVAC, our class of interest, but rather to end up with a collection of news articles that were all relevant to our project (OVAC-related). In other words, my aim was to maximize precision: the fraction of true positives among all articles classified as positive for our target class OVAC. Already in the second training attempt, I reached a precision of 100 %: every article predicted by the SVM to be on OVAC was in fact on OVAC.

A classifier as reliable as this may spare us from battling our way through every single article to confirm its relevance to our project. On the other hand, such great performance can only be achieved on a test set of news articles whose class distribution is similar to that of the article set the classifier was trained on. Most notably, a classifier not trained to recognize entirely irrelevant articles will evidently always fail to detect them.

 


Classification metrics obtained for a Support Vector Machine (SVM) Classifier trained to recognize news articles related to our target class, OVAC. / Source: omdena.com

 

One advantage of news articles, compared to research publications, is that they are published with a very short delay. News articles thus have the power to provide insights about trends over time with a much shorter time lag than research publications.

Quantifying the Issue

One goal stated in our problem statement included capturing the severity of the situation. Will the severity of a problem be reflected in the number of news articles published on that problem? Rather not. A case of an actor convicted of possessing child pornography may be reported in 20 news articles and followed up closely, whereas an international ring of pedophiles acting over decades may receive much less media attention and be reported by only a single news agency.

Features can be extracted from news articles with the help of NLP techniques and then converted into numeric variables with the aim of quantifying the magnitude of a problem. They can surely serve as indicators pointing towards potentially underlying trends, but they need to be interpreted with a healthy portion of caution and skepticism. Too many biases may play their part, such as the reporting bias resulting in an over-representation of the case of the famous actor compared to a less attention-grabbing but potentially much more severe case.

Nevertheless, I decided to give it a try and investigate trends over time in the reporting of child pornography cases. Instead of using advanced NLP, I relied on basic NumPy functionality and created a dummy variable for each of 7 selected verbs often reported in the context of child pornography. For each article, represented by a separate row of a pandas data frame, the dummy variable would assume a value of 1 if the article contained a certain text string such as "possess", or a value of 0 if it did not.
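
A sketch of this flagging step (assuming a data frame with illustrative "year" and "text" columns; not the original notebook) could look like this:

```python
# Sketch: flag articles containing selected verbs, then count mentions per year.
import pandas as pd

df = pd.DataFrame({
    "year": [2018, 2018, 2019, 2020],
    "text": [
        "He was charged with possessing illegal material.",
        "The material was shared through a messaging app.",
        "Police say he possessed and spread illegal images.",
        "The videos were streamed to paying viewers.",
    ],
})

verbs = ["possess", "share", "spread", "stream", "record", "publish", "create"]
for verb in verbs:
    # 1 if the article text contains the string, 0 otherwise;
    # substring matching also catches "possessed" and "possessing"
    df[verb] = df["text"].str.contains(verb, case=False).astype(int)

# number of articles per year mentioning each verb
print(df.groupby("year")[verbs].sum())
```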

 


Appending dummy variables to a pandas dataframe to indicate the presence (1) or absence (0) of a given text string in each article. / Source: omdena.com

 

For every reporting year, I summed up these word occurrences to obtain the total number of articles per year in which a given string was mentioned together with child pornography. I used the streamgraph package for the R statistical language to visualize these yearly article frequencies over the past 10 years.

 


Yearly frequency of selected verbs reported together with child pornography. Blue: “stream”, green: “spread”, light green: “share”, yellow: “record”, light orange: “publish”, dark orange: “possess”, red: “create”. / Source: omdena.com

 

Different NLP techniques have different requirements with respect to the cleanness of a text string. To create a word cloud, it is beneficial to lowercase every word, so that the frequencies of the capitalized and non-capitalized versions of the same word are added together. To analyze how often the possession of pornography is mentioned in news articles, it is beneficial to use word reduction techniques such as stemming and/or lemmatization, so that the frequencies of differently inflected or derived word forms like "possess", "possessed", "possession", and "possessing" are added together. (I used an alternative approach to solve the same problem above.)
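
As a small illustration of the word reduction idea (a sketch, not the project's code), NLTK's Porter stemmer should collapse these inflected forms onto a common stem:

```python
# Sketch: stemming collapses inflected forms so their counts add together.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["possess", "possessed", "possession", "possessing"]:
    print(word, "->", stemmer.stem(word))
# all four forms should reduce to the common stem "possess"
```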

To avoid wasting time, it is wise to think about such requirements before jumping into the actual text cleaning. Some cleaning steps won't be required for a given NLP analysis, whereas others will make the analysis difficult or even impossible. A highly efficient, customizable function developed by my collaborator Sijuade allowed me to clean text strings in virtually no time.

 


Very efficient, customizable text cleaning function developed by my collaborator Sijuade. Booleans can be adjusted to specify which of the six standard steps shall be applied. / Source: omdena.com

 

I created n-grams, document term matrices, and word clouds to identify the most common words and word sequences in the news articles. These most common terms include "sexual", "social media" and "facebook", "law" and "government", "police", the "Philippines", "women", and "teacher". In analogy to the famous "garbage in, garbage out", these results of course reflect the principle of "search terms in, keywords out". Moreover, they illustrate that OVAC often occurs on social media platforms, that Facebook plays a major role in the Thai social media market, and that Bangkok Post often reports on the Philippines.

Besides identifying the most common words or word sequences (n-grams) in a collection of text documents, it is also relevant to investigate how these words relate to each other. I followed the instructions provided by Jason Brownlee in a Machine Learning Mastery article to train my own word embeddings on the 209 news articles from Bangkok Post, with the aim of investigating word similarities and possibly detecting previously unknown terms.
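
A minimal sketch of such an embedding workflow with Gensim (the toy sentences below stand in for the tokenized articles; this is not the original code) might look as follows:

```python
# Sketch: train Word2Vec embeddings, then query neighbors and similarities.
from gensim.models import Word2Vec

sentences = [  # stand-ins for the tokenized news articles
    ["online", "grooming", "of", "children", "is", "a", "crime"],
    ["police", "investigate", "child", "pornography", "case"],
    ["teenagers", "report", "sexting", "and", "bullying", "online"],
    ["victims", "of", "stalking", "and", "trolling", "speak", "out"],
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# nearest neighbors: words used in the most similar contexts
print(model.wv.most_similar("sexting", topn=5))

# cosine similarity between two selected word vectors
print(model.wv.similarity("sexting", "bullying"))
```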

 


Visualization of word vector embeddings trained on 209 Thai news articles of all three classes (OVAC/PVAC/OVAA). Depicted are the 9 closest neighbors (used in a most similar context) to 8 selected words: pornography, CSAM, cyberbullying, trolling, grooming, sexting, cyberharassment, and cybercrime (see bottom-right for color labels). / Source: omdena.com

 

I then calculated cosine similarities between selected pairs of word vector embeddings to quantify the similarity between specific words, or more precisely, the similarity between the contexts in which different words typically appear. I found "sexting" and "stalking", as well as "sexting" and "bullying", to occur in very similar contexts (cosine similarity scores of 0.86 and 0.83, respectively), whereas "pornography" and "bullying" occurred in rather different contexts (0.37).

Furthermore, this visualization of word embeddings allowed me to identify another term in the family of OVAC-related problems that I had not known before: "trolling". I could now go back to the Bangkok Post search engine to assess the specificity of news articles obtained using that search term for the problem of OVAC.

 


Photo by Matheo JBT on Unsplash.

Insights and Conclusion

So, did all of these NLP techniques help me to gain valuable insights about the problem of Online Violence against Children? I would say that a substantial portion of my personal insights rather came from applying human intelligence while reading about the problem and talking to subject matter experts.

I consider an initial exchange with domain experts, in our case from Save the Children, complemented by some literature review, invaluable for defining the scope and the search terms. Among others, I had the chance to interview a woman working for ECPAT International, a global network of organizations working to end the sexual exploitation and abuse of children worldwide.

I came to the conclusion that the prevalence and importance of the various subclasses of OVAC seem to differ by geographic location. Likewise, the barriers to overcome in the fight against OVAC likely differ by geographic location. Poverty, cultural norms, the available infrastructure, current legislation, and data protection regulations all critically determine which forms of OVAC are most present in a society, and to what extent these are considered normal or something to be prevented in the future.

In some countries, coming into contact with child pornography is a part of everyday life from early childhood on. As long as governments don't provide their populations with alternative opportunities to earn the minimum amount of money that would allow them to make an acceptable living, some of the problems around OVAC will be difficult to change. As long as teenagers enjoy being groomed by strangers online, making them aware of the associated risks may have little effect. Providers of social media platforms and chat forums have to respect the data privacy of their users, a goal typically in conflict with efforts to prevent potential online violence against minors.

I believe that the problem of Online Violence against Children can only be sustainably prevented if all stakeholders pull together. Parents can only truly care for the wellbeing of their children if they have enough to survive. Platform providers can only help to prevent OVAC if prevention methods can be aligned with the current, typically local, regulatory requirements. And Omdena collaborators can only select the best terms to retrieve news articles if they already understand the problem to some extent. Every new insight gained can help readjust the direction, and many tunnels and bridges will still need to be crossed on the long journey to the bright final destination: an internet that provides a safe place for every child on this planet.

Exploring Scientific Literature on Online Violence Against Children Using Natural Language Processing


The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.

This article is written by Wen Qing Lim, Maria Guerra-Arias, and Sijuade Oguntayo.

 

Textual Data – A Trove of Information

The amount of information available in the world is increasing exponentially year by year and shows no signs of slowing. This rapid increase is driven by expansions in physical storage and the rise of cloud technologies, allowing more data to be exchanged and preserved than ever before. This boom, while great for scientific knowledge, also has possible downsides. As the volume of data grows, so too does the complexity of managing it and extracting useful information from it.

More and more, organizations are turning to electronic storage to safeguard their data. Unstructured textual information like newspapers, scientific articles, and social media is now available in unprecedented volumes.

It is estimated that about 80% of enterprise data currently in existence is unstructured data, and this continues to increase at a rate of 55–65% per year.

Unstructured data, unlike structured data, does not have clearly defined types and isn't easily searchable. This also makes it relatively more complex to analyze.

Text mining processes utilize various analytics and AI technologies to analyze and generate meaningful insights from unstructured text data. Common text mining techniques include Text Analysis, Keyword Extraction, Entity Extraction/Recognition, Document Summarization, etc. A typical text mining pipeline includes data collection (from files, databases, APIs, etc.), data preprocessing (stemming, stopwords removal, etc.), and analytics to ascertain patterns and trends.
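
As a minimal sketch of the preprocessing stage (using a regex tokenizer and scikit-learn's built-in English stop word list; these choices are illustrative assumptions, not a prescribed pipeline):

```python
# Sketch: lowercase, tokenize, drop stopwords, and stem a text.
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())                 # lowercase + tokenize
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # remove stopwords
    return [stemmer.stem(t) for t in tokens]                     # reduce to stems

print(preprocess("Researchers are analyzing unstructured textual data."))
```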

Just as data mining in the traditional sense has proven to be invaluable in extracting insights and making predictions from large amounts of data, so too can text mining help in understanding and deriving useful insights from the ever-increasing availability of text data.

Natural Language Processing (NLP) can be thought of as a way for computers to understand and generate human natural language, by simulating the human ability to comprehend it. NLP's strength comes from the ability of computers to analyze large bodies of text without fatigue and in an unbiased manner (note: unbiased refers to the process; the data itself may still be biased).

 

Online Violence Against Children

As of July 2020, there are over 4.5 billion internet users globally, accounting for over half of the world's population. About one-third of these users are children under the age of 18. As these numbers rise, sadly, so too does the number of individuals looking to exploit children online. The FBI estimates that at any one time, there are about 750,000 predators going online with the intention of connecting with children.

For our project, we wanted to explore how text mining and NLP techniques could be applied to analyzing the scientific literature on online violence against children (OVAC). We picked scientific articles as our focus because they can provide a wealth of information: the different perspectives that have been used to study OVAC (e.g., criminology, psychology, medicine, sociology, law), the topics that researchers have chosen to focus on, and the regions of the world where they have dedicated their efforts. Text mining allowed us to collect, process, and analyze a large amount of published scientific data on this topic, capturing a meaningful snapshot of the state of scientific knowledge on OVAC.

 

Data Collection and Preprocessing

 

Our overall process flow from data collection to analysis

 

Our first step was to collect datasets of articles that we could find online. The idea was to scrape a variety of repositories for scientific articles related to OVAC, using a set of keywords as search terms. We built scrapers for each repository, making use of the BeautifulSoup and Selenium libraries. Each scraper was unique to its repository and collected information such as the article metadata (e.g., title, authors, publisher, date published), the article abstract, and the article full-text URL (where available). We also built a script to convert the full-text articles from PDF to text using Optical Character Recognition (OCR). Only one of the repositories, CORE, had an API that directly allowed us to scrape the full text of the articles.
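
The shape of one such scraper might look like the sketch below; the URL and CSS selectors are hypothetical placeholders, since each real scraper was tailored to its repository's HTML:

```python
# Illustrative sketch of a repository scraper (placeholder URL and selectors).
import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://example-repository.org/search"  # hypothetical

def scrape_results(query):
    response = requests.get(SEARCH_URL, params={"q": query}, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    for result in soup.select("div.result"):  # hypothetical selector
        articles.append({
            "title": result.select_one("h2").get_text(strip=True),
            "abstract": result.select_one("p.abstract").get_text(strip=True),
            "url": result.select_one("a")["href"],
        })
    return articles

print(scrape_results("online AND sexual AND violence AND children"))
```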

Having collected over 27,000 articles across 7 repositories, we quickly realized that many were not relevant to OVAC. For example, there were many scientific articles about physical sexual violence against children that also mentioned some sort of online survey. These articles matched the "online AND sexual AND violence AND children" search term but were irrelevant to OVAC. Hence, we had to manually filter the scientific articles for relevance, sieving out the 95% of articles that were not related to OVAC.

Faced with such a painfully manual task, some members of the team tried out semi-automated methods of filtering. One method used clustering to find groups of similar papers. The idea was that relevant papers would show up in the same groups, while irrelevant papers would show up in their own groups. We would then only need to sift through each cluster instead of each individual paper, reducing the effort by a factor of roughly 10 to 30. However, this assumed perfect clusters, which was often not the case. The clustering method was definitely faster and filtered out 41% of articles, but it also left more irrelevant articles undetected. An alternative to clustering would be to train classifiers to identify relevant articles based on a set of pre-labeled articles. This could potentially work better than clustering, but undetected articles would still remain a limitation.
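
A sketch of the clustering idea (TF-IDF vectors plus k-means; the toy abstracts stand in for the scraped data):

```python
# Sketch: cluster abstracts so reviewers skim clusters, not single papers.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "online grooming of children in chat rooms",
    "internet-facilitated child sexual exploitation",
    "an online survey of physical violence in schools",
    "survey methodology for studying offline child abuse",
]

X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for label, text in zip(km.labels_, abstracts):
    print(label, text)
# reviewers then sample each cluster to decide which are OVAC-relevant
```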

One of the perks of working with scientific articles (read: texts that have been rigorously reviewed) is that minimal data cleaning is required. Steps that we would otherwise have to take when dealing with free text (e.g., translating slang, abbreviations, and emojis, or accounting for typos) are not needed here. Of course, text pre-processing steps like stemming, stop-word removal, and punctuation removal are still required for some analyses, like clustering or keyword analysis.

 

Drawing insights from text analysis regarding online violence against children

Armed with a set of relevant articles, the team set off to discover various methods to extract insights from the dataset. We attempted a variety of methods (e.g., TF-IDF, Bag of Words, clustering, Market Basket Analysis) in search of answers to a set of questions that we aimed to explore with the dataset. Some analyses were limited by the nature of the data (e.g., in keyword analysis there is a lot of noise and many random words, so some trends and patterns emerge but are not very conclusive), while others showed great potential for picking out useful insights (e.g., clustering and Market Basket Analysis, as described below).

 

 

Keywords Analysis

Based on the title and abstract texts, we were able to generate a word cloud of the most frequent terms appearing in the OVAC scientific literature. We also used TF-IDF vector analysis to explore the most relevant words, bigrams, and trigrams appearing in the title and abstract texts of each publication year. This allowed us to chart the rise of certain research topics over time. For example, around 2015 and 2016, terms related to "travel" and "tourism" began to appear more often in the OVAC literature, suggesting that this problem received greater research attention in this period.
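
A sketch of that per-year TF-IDF analysis (toy title-plus-abstract strings keyed by year; illustrative only):

```python
# Sketch: top TF-IDF n-grams (unigrams to trigrams) per publication year.
from sklearn.feature_extraction.text import TfidfVectorizer

papers = {  # year -> combined title + abstract text (stand-in data)
    2015: "child sex tourism and travel facilitated offending",
    2016: "travel and tourism industry responses to exploitation",
    2019: "live streaming and online grooming of minors",
}

vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X = vec.fit_transform(papers.values())
terms = vec.get_feature_names_out()

for year, row in zip(papers, X.toarray()):
    top = row.argsort()[::-1][:3]  # three highest-scoring n-grams
    print(year, [terms[i] for i in top])
```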

 

Word cloud of title and abstract texts from over 1300 scientific articles on online violence against children. / Source: www.omdena.com

 

 

Geographical Market Basket Analysis

 

Heat map of the Lift between country pairs. A lift of more than 1 suggests that the presence of one country increases the probability that the other country will also appear in the article. The larger the lift, the more likely they would appear together.

 

We conducted a Market Basket analysis to find out which countries were likely to appear in the same article. This could potentially give insight into the networks of countries involved in OVAC. While we noticed that many countries appeared together because they are geographically close, there were also exceptions.

The heat map above shows such exceptions, including the country pairs Malaysia-US, Australia-Canada, Australia-Philippines, and Thailand-Germany. Upon investigation, we found that:

  • Most articles mention these pairs merely as examples.
  • Some pairs appear in breakdowns of the countries where surveys and studies were conducted. (E.g., Thailand and Germany were mentioned as part of a 6-country survey of adolescents.)
  • More interestingly, there were also articles that mentioned pairs of countries due to offender-victim relationships. (E.g. an article studying offenders in Australia mentioned that they preyed on child victims in the Philippines.)
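
The lift computation behind the heat map can be sketched directly from a boolean article-by-country matrix (illustrative data and column names), using lift(A, B) = P(A and B) / (P(A) × P(B)):

```python
# Sketch: lift between country pairs from article co-mentions.
import pandas as pd

# 1 = country mentioned in the article (illustrative data)
df = pd.DataFrame({
    "Australia":   [1, 1, 0, 1],
    "Philippines": [1, 1, 0, 0],
    "Thailand":    [0, 0, 1, 0],
    "Germany":     [0, 0, 1, 0],
})

def lift(a, b):
    p_a, p_b = df[a].mean(), df[b].mean()
    p_ab = (df[a] & df[b]).mean()
    return p_ab / (p_a * p_b)

print(lift("Australia", "Philippines"))  # > 1: co-occur more than chance
print(lift("Thailand", "Germany"))
```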

 

Topics Clustering Analysis

Another of our solutions used machine learning to separate the documents into different clusters defined by topics. A secondary motive was to explore the possibility that the documents form a network of communities based not only on their topics but also on how the documents relate to each other.

The Louvain method for community detection is a popular clustering algorithm used to understand the structure of large networks and to detect their communities. The TF-IDF representation of the words in the vocabulary was used to build a matrix containing the cosine similarities between each pair of documents. The clustering algorithm detected 5 distinct communities.
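
A sketch of this pipeline (toy documents; using NetworkX's built-in Louvain implementation rather than whichever library the team actually used):

```python
# Sketch: document similarity graph + Louvain community detection.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "legislation and policy responses to online child abuse",
    "legal frameworks for protecting children online",
    "profiling online offenders and their grooming tactics",
    "behavioural analysis of child sex offenders",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(X)

G = nx.Graph()
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] > 0.05:  # keep only meaningful edges
            G.add_edge(i, j, weight=sim[i, j])

print(nx.community.louvain_communities(G, weight="weight", seed=0))
```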

A manual inspection of the documents in each cluster suggested the following topics:

  • Institutional, Political (legislative) & Social Discourse
  • Online Child Protection — Vulnerabilities
  • Technology
  • Analysis of Offenders
  • Commercial Perspective & Trafficking

 

Bar chart of Frequency of Articles by Topic

 

The first two topics appear to be the most published, while Commercial Perspective & Trafficking is the least. We noticed that the clusters and structure detected by the clustering algorithm could be visualized as a graph network. The articles were represented as nodes; nodes of the same topic were grouped together and colored the same; and the strength of the relationship between nodes, as defined by the cosine similarity, was represented by links (edges). Below is a visual representation of the structure of the graph network:

 

Structure of the Graph Network: articles were labeled according to community detection clustering, with relationships defined by the cosine similarity between documents. Other information, like the text, publication date, and URL of the papers, was stored as properties of the nodes (vertices), and the links (edges) were weighted by the cosine similarity value between documents.

 

One advantage of restructuring the data in this manner is that it allows the data to be stored in a graph database. Traditional relational databases work exceptionally well at capturing repetitive, tabular data, but they don't do quite as well at storing and expressing relationships between the entities within the data elements. A database that embraces this structure can more efficiently store, process, and query connections. Complex analysis can be done on the data by specifying a search pattern and starting points. Graph databases can efficiently explore the data connected to those starting points, collecting and processing information from nodes and relationships while leaving out data outside the search pattern.

 

Challenges and Limitations

The major challenges we faced were related to compiling our dataset. Only one of the repositories we used, CORE, granted API access, which greatly sped up the process of obtaining data. For the rest, the need to build custom scraping scripts meant that we could only cover a limited number of repositories. Some open repositories, such as Semantic Scholar, resisted our scraping efforts, while others, such as Web of Science or EBSCOhost, are completely walled off to non-subscribers. The great white whale of scientific article repositories, Google Scholar, also eluded us. Here, search results are purposefully presented in such a way that it is not possible to extract the full abstract texts, although some other researchers have, with a lot of time and effort, had greater success scraping it.

As shown, we were able to conduct a range of interesting and meaningful analyses using just the abstracts of the scientific articles, but going further in our research would require overcoming the challenges related to obtaining the full texts. Even after developing a custom tool to extract text from PDFs, we still faced two challenges: first, many articles were paywalled and could not be accessed; second, the repositories we scraped did not systematically link to the PDF page of the article, so the tool could not be used across our whole dataset.

If a would-be data scientist is able to surmount all these hurdles, a final barrier to extracting information from scientific articles remains. The way scientific texts are structured, with sections such as "Introduction", "Methodology", "Findings", and "Discussion", varies greatly from one article to the next. This makes it especially difficult to answer specific questions, such as "What are the risk factors of being an offender of OVAC?", that require searching for information in a specific section of the text, although it is less of an issue for more general questions, such as "How has the number of research papers changed over time?". To overcome the difficulty of extracting specific information from unstructured text, we built a neural search engine powered by Haystack that uses a DistilBERT transformer to search for answers to specific questions in the dataset. However, it is currently a proof of concept and requires further refinement to reach its full potential of answering such questions accurately.

These challenges create some limitations for the findings of our analysis. We cannot say that our dataset captures the entirety of scientific research into online violence against children, but rather just that which was contained in the repositories that we were able to access — we do not know if this could bias our results in some way (for example, if these repositories were more likely to contain papers from certain fields, or from certain parts of the world). It is worth noting that we conducted all of our searches in English. Fortunately, scientific article abstracts are often translated, so we were able to analyze the text of these, even when the original language of the article was different.

Another limitation is that as scientific knowledge increases over time, more recent articles could be more relevant than historical ones if certain theories or assumptions are later found to be incorrect with further research. However, all articles were given equal weight in our analysis.

Given the challenges that we faced with accessing repositories and articles and the incalculable benefits of greater data openness in scientific research, it is interesting to discuss an initiative that is working toward that goal. A team at Jawaharlal Nehru University (JNU) in India is building a gigantic database of text and images extracted from over 70 million scientific articles. To overcome copyright restrictions, this text will not be able to be downloaded or read, but rather only queried through data mining techniques. This initiative has the potential to radically transform the way that scientific articles are used by researchers, opening them up for exploration using the entirety of the data science toolkit.

 

Conclusion

We have demonstrated in this case study how text mining and NLP techniques can aid the analysis of scientific literature every step of the way, from data collection and cleaning to gaining meaningful insights from text. While full texts helped us answer more specific questions, we found that using just abstracts was often sufficient to gain useful insights. This shows great potential for future abstract-only analyses in cases where access to full-text articles is limited.

Our work has helped Save the Children to understand OVAC and its research space better, and similar types of analysis can benefit other NGOs in many ways. These include:

  • understanding a topic
  • having an overview of the types of research efforts
  • understanding research gaps
  • identifying key resources (e.g. datasets often quoted in papers, most common citations, most active researchers/publishers, etc.)

 

There are also many other possibilities of NLP methods to extract insights from scientific papers that we have not tried. Here are some ideas for future exploration:

  • Extracting sections from scientific articles — articles are organized in sections, and if we can figure out a way to split articles up into sections, it would be a great first step towards a more structured dataset!
  • Named Entity Recognition — From figuring out which entities are being discussed to using these to answer specific questions, NER unlocks a ton of possible applications.
  • Network Analysis using Citations — This could be an alternative method to cluster the articles, or it could also help to identify the ‘influential’ articles or map out the progress of research.

Internet Safety for Children: Using NLP to Predict the Risk Level of Online Games, Websites, and Applications

The following work is part of the Omdena AI Challenge on improving internet safety for children, implemented in collaboration with John Zoltner at Save the Children US.

This blog was written by Sabrina Carlson and co-authored by Erum Afzal. Contributors include Anna Kolbasko, Juber Rahman, Erum Afzal, Mateus Broilo, Rahul Gopan, Rubens Carvalho, Vinod Rangayyan, Adele C, and Rosana de Oliveira Gomes.

 

The Problem

Save the Children is a humanitarian organization that aims to improve the lives of children across the globe. In line with the United Nations' Sustainable Development Goal 16.2 to "end abuse, exploitation, trafficking, and all forms of violence and torture against children," Save the Children and Omdena collaborated to use artificial intelligence to identify and prevent online violence against children. Utilizing numerous data sources and a combination of artificial intelligence techniques, such as natural language processing (NLP), the project's collaborators aimed to produce meaningful insights into online violence against children. One area of concern is online games, websites, and applications that are popular with children, and a number of collaborators targeted this space in hopes of guarding children against online predators.

 

What We Did

The Common Sense Media website provides expert advice, useful tools, and objective ratings for countless movies, television shows, games, websites, and applications to help parents make informed decisions about which content they want their children to consume. Particularly useful for this project, parents and children can review games, applications, and websites on the Common Sense Media site. A number of Omdena collaborators had the idea to build web scrapers to collect parent and child reviews of the games, applications, and websites that are popular with children and to use natural language processing to identify which platforms carry a high risk of online violence against children.

The first step was to scrape Common Sense Media for game, application, and website reviews from both parents and children. To do so, we used the ParseHub software to build web scrapers. ParseHub is a powerful, user-friendly tool that allows one to easily extract data from websites. Using ParseHub, we set up three different configurations to scrape parent and child reviews of all games, applications, and websites that Common Sense Media has determined to be popular among children.

The resulting dataset includes the following features:

  • 40,433 observations (reviews) from 995 different games/apps/websites
  • Platform type (game, application, website)
  • The risk level for online (sexual) violence against children
  • Indicators for each platform's content related to positive messages, positive role models/representations, ease of play, violence, sex, language, and consumerism. Common Sense Media provides objective ratings (on a scale of 0–5) for these indicators for the digital content included on the site. We focused on the sex indicator and re-labeled it as CSAM (child sexual abuse material). We determined a platform to be high risk for CSAM if its sex rating was greater than 2 and assigned a platform a low-risk CSAM label if its sex rating was lower than 2 (a labeling sketch follows this list).
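
A sketch of that labeling rule (illustrative column names; the threshold of 2 comes from the rule above):

```python
# Sketch: derive the CSAM risk label from the sex rating (0-5 scale).
import pandas as pd

platforms = pd.DataFrame({
    "title": ["Game A", "App B"],
    "sex_rating": [4, 1],
})
platforms["csam_risk"] = platforms["sex_rating"].apply(
    lambda r: "high" if r > 2 else "low")
print(platforms)
```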

Figure 1 plots the top 20 platforms in terms of the number of reviews.

 


Figure 1. 20 Most Popular Platforms by Number of Reviews / Source: Omdena

 

Figure 2 displays the number of reviews for high and low-risk games, applications, and websites. As illustrated in the figure, there are nearly 25,000 reviews for low-risk platforms, whereas there are close to 16,000 reviews for high-risk platforms.

 


Figure 2. Number of Reviews by CSAM Risk Level / Source: Omdena

 

Data Sampling

We randomly sampled 50% of the data in order to process it more efficiently.
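
A minimal sketch of that sampling step, assuming the scraped reviews live in a pandas DataFrame called reviews:

```python
# Sketch: randomly keep 50% of the rows, reproducibly.
import pandas as pd

reviews = pd.DataFrame({"review": [
    "fun game", "too violent", "great app", "not for kids",
]})

sample = reviews.sample(frac=0.5, random_state=42)
print(len(sample), "of", len(reviews), "reviews kept")
```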

 


 

Data Cleaning

It is necessary to clean the data in order to build a successful NLP model. To clean the review messages, we created a function called “clean_text” and used it to perform several transformations, including the following:

  • Converting the review text to all lower-case letters
  • Tokenizing the review text (i.e., splitting the text into words) and removing punctuation marks
  • Removing numbers and stopwords (e.g., a, an, the, this)
  • Using the WordNet lexical database to assign part-of-speech (POS) tags, which label each word as a noun, verb, etc.
  • Lemmatizing, i.e., transforming words to their roots (e.g., games → game, played → play)
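
A sketch of what such a clean_text function might look like with NLTK (the POS-tag mapping and helper are illustrative, not the project's actual code):

```python
# Sketch: lowercase, tokenize, drop punctuation/numbers/stopwords,
# POS-tag, and lemmatize a review.
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

def to_wordnet_pos(tag):
    # map Penn Treebank tags to WordNet POS constants
    return {"J": wordnet.ADJ, "V": wordnet.VERB,
            "R": wordnet.ADV}.get(tag[0], wordnet.NOUN)

def clean_text(text):
    tokens = nltk.word_tokenize(text.lower())     # lowercase + tokenize
    tokens = [t for t in tokens if t.isalpha()]   # drop punctuation and numbers
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop] # drop stopwords
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t, to_wordnet_pos(tag))
            for t, tag in nltk.pos_tag(tokens)]   # POS-aware lemmas

print(clean_text("Played these games 100 times; they were the BEST!"))
```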

Figure 3 provides an example of the reviews pre- and post-cleaning. In the "review" column, the text has not been cleaned, while the "review_clean" column contains text that has been tokenized, lemmatized, POS-tagged, etc.

 


Figure 3. Sample of Cleaned Text / Source: Omdena

 

Feature Engineering

Before applying the models, we performed some feature engineering, including sentiment analysis, vector extraction, and TF-IDF.

 

Sentiment Analysis

The first feature engineering step was sentiment analysis, performed to gain insight into how parents and children feel about the hundreds of games, applications, and websites that are popular with children. We used Vader, which is part of the NLTK module, for the sentiment analysis. Vader uses a lexicon of words to identify positive and negative sentiments in long sentences. It also takes into account the context of the sentences when determining the sentiment scores. For each text, Vader returns the following four values:

  • Negative count score
  • Positive count score
  • Neutral count score
  • The compound — an overall score that summarizes the previous scores
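
Scoring one review with Vader is a one-liner; a minimal sketch:

```python
# Sketch: Vader sentiment scores for a single review.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This game is fun and cute but has some violence."))
# returns a dict with 'neg', 'neu', 'pos', and 'compound' scores
```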

Figure 4 displays a sample of cleaned reviews containing negative, neutral, positive, and compound scores.

 


Figure 4. Sample of Sentiment Analysis Scores / Source: Omdena

 

Extracting Vectors

In the next step, we extracted vector representations for every review. Using the Gensim module, we created a numerical vector representation for every word in the corpus based on the contexts in which it appears (Word2Vec). This is done using shallow neural networks. Extracting vectors in this way is interesting and informative because similar words will have similar representation vectors.

All text can also be transformed into numerical vectors using word vectors (Doc2Vec). We can use these vectors as training features, because similar texts will have similar representations.

It was first necessary to train a Doc2Vec model by feeding in our text data. By applying this model to the review text, we were able to obtain the representation vectors. Finally, we added TF-IDF (Term Frequency – Inverse Document Frequency) values for every word and every document.
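
A sketch of the Doc2Vec step with Gensim (toy reviews stand in for the corpus):

```python
# Sketch: train Doc2Vec on reviews and infer a vector for a new one.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

reviews = [
    "fun game with positive messages",
    "too much violence and bad language",
    "cute puzzles my kids love",
]
corpus = [TaggedDocument(words=r.split(), tags=[i])
          for i, r in enumerate(reviews)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# reviews with similar wording should receive similar vectors
print(model.infer_vector("fun game for kids".split())[:5])
```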

But why not simply count the number of times each word appears in every document? The problem with this approach is that it does not take into account the relative importance of words in the text. For instance, a word that appears in nearly every review is unlikely to bring useful information for analysis. In contrast, rare words may be much more meaningful. The TF-IDF metric solves this problem:

 

TF-IDF

The Term Frequency (TF) counts the classic number of times a word appears in a text, while the Inverse Document Frequency (IDF) computes the relative importance of a word depending on the number of texts (reviews) in which it is found. We added TF-IDF columns for every word that appeared in at least 10 different texts. This step allowed us to filter out a number of words and, subsequently, reduce the size of the final output. Figure 5 provides the code used to apply TF-IDF and assign the resulting columns to the data frame, and Figure 6 displays the output of the sample code.
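
A sketch of that TF-IDF step (the project kept words found in at least 10 reviews, i.e. min_df=10; the toy corpus here uses min_df=2 so that some columns survive):

```python
# Sketch: append TF-IDF columns to the review DataFrame.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"review_clean": [
    "fun game kids", "fun violent game", "kids love game", "violent scary app",
]})

vec = TfidfVectorizer(min_df=2)  # keep words found in >= 2 reviews
tfidf = vec.fit_transform(df["review_clean"])

tfidf_df = pd.DataFrame(tfidf.toarray(), columns=vec.get_feature_names_out())
df = pd.concat([df, tfidf_df], axis=1)  # one extra column per kept word
print(df.head())
```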

 


Figure 5. TF-IDF Code Sample / Source: Omdena

 

Figure 6. TF-IDF Sample Code Output / Source: Omdena

 

Exploratory Data Analysis

The EDA produced a number of interesting insights. Figure 7 provides a sample of reviews that received high negative sentiment scores, and Figure 8 displays a sample of reviews with high positive sentiment scores. The sentiment analysis successfully assigned negative sentiments to reviews with text such as “violence, horror, dead.” The analysis also effectively assigned positive sentiments to reviews containing words such as “fun, cute, exciting.”

 


Figure 7. Sample of Reviews with High Negative Scores / Source: Omdena

 


Figure 8. Sample of Reviews with High Positive Scores / Source: Omdena

 

Figure 9 shows the distribution of compound sentiment scores for reviews of high and low-risk games. Vader assigns positive compound scores to the mostly positive reviews of low-risk platforms, whereas reviews of high-risk platforms tend to have lower compound scores. This suggests that the extracted sentiment features would prove helpful in modeling risk.

 


Figure 9. High_Low Risk Distribution over Compound Sentiments / Source: Omdena

 

Modeling High-Risk Games/Applications/Websites

After we successfully scraped the reviews, built the dataset, cleaned the data, and performed feature engineering, we were able to build an NLP model. We chose which features (reviews and clean reviews) to use to train our model.

Then, we split our data into two parts:

  • Training set for training purposes
  • The test set to assess the model performance

After selecting the features and splitting the data into training and test sets, we fit a Random Forest classification model and used the reviews to predict whether a platform is high risk for CSAM. Figure 10 displays the code used to fit the Random Forest classifier and obtain the metrics.
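
A sketch of that modeling step (a stand-in feature matrix and labels; the real X held the sentiment, Doc2Vec, and TF-IDF columns):

```python
# Sketch: train/test split, random forest fit, and ROC AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # stand-in features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
```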

 


Figure 10. Random Forest Classifier Code Sample / Source: Omdena

 

Figure 11 displays a sample of features and their respective importance. The most important features are indeed the ones obtained from the sentiment analysis. In addition, the vector representations of the texts were also important in our training. A number of individual words appear to be fairly important as well.

 

Figure 11. Feature Importance / Source: Omdena

 

The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) allow one to evaluate how well a model distinguishes between classes (high/low risk for CSAM in this context). The ROC curve, which plots the true positive rate against the false positive rate, is displayed in Figure 12. The AUC is 0.77, which indicates that the classifier performs at an acceptable level.

 


Figure 12. ROC Curve / Source: Omdena

 

The Precision-Recall (PR) curve is illustrated in Figure 13. The PR curve is graphed by simply plotting the recall score (x-axis) against the precision score (y-axis). Ideally, we would achieve both a high recall score and a high precision score; however, there is often a trade-off between the two in machine learning. The scikit-learn documentation states that the Average Precision (AP) "summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight." The AP here is 0.72, which is an acceptable score.

 


Figure 13. Precision-Recall Curve / Source: Omdena

 

It is evident in Figure 13 that precision decreases as recall increases. This indicates that we have to choose a prediction threshold based on our specific needs. For instance, if the end goal is high recall, we should set a low prediction threshold that allows us to detect most of the observations of the positive class, though precision will be low. On the contrary, if we want to be really confident about our predictions and are not set on identifying all the positive observations, we should set a high threshold that yields high precision and low recall.

In order to determine whether the model we built performs better than another classifier, we can simply use the AP metric. To assess the quality of our model, we can compare it to a simple baseline: a random classifier that assigns 0 half the time and 1 the other half of the time. Our AP of 0.72 is better than that of such a random classifier.

 

Conclusion and Observations

It is entirely possible to make predictions using just raw text as input. The most important aspect is being able to extract the relevant features from a raw data source. Such data can often complement data science projects, allowing one to extract more meaningful and useful features and increase the model's predictive power.

We were only able to predict a platform's risk through user reviews, and it is possible that the reviews are biased. To improve the precision of our predictive model, we could triangulate other features such as player sentiments, game titles, UX/UI features, and in-game chats. Used in combination, these features could provide a number of insightful recommendations. Our predictive model sheds light on CSAM risk in online games, applications, and websites that are popular with children by automatically detecting each platform's risk level. In the future, we hope that parents will be able to better select platforms for their children based on this use of AI.

 
 
 
A Chatbot Warning System Against Online Predators


Using Natural Language Processing to warn children against online predators.

The following work is part of the Omdena AI Challenge on preventing online violence against children, implemented in collaboration with John Zoltner at Save the Children US.

 

Protecting Children

Today, children face an evolving threat: online violence. Violence and harassment of children have been growing for more than 20 years, but due to the recent events that closed schools for over 1 billion children around the world, children are more vulnerable than ever. Online predators use Internet avenues popular with children and adolescents, such as game chat rooms, to lure them into sexual exploitation and even in-person assaults.

Protection against online sexual violence varies greatly from platform to platform. Some gaming platforms include a profanity filter that looks for problematic words and replaces them with a string of asterisks. Outside the gaming platforms, many chat platforms still do not have any safeguards in place to protect children from predatory adult conversations. However, chat logs can provide information on how a predator might attempt to lure children and young adults into risky situations, such as sharing photos, using web cameras, and sexting (sexual texting).
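
The asterisk-masking idea behind such a filter is simple; a minimal sketch (with an illustrative blocklist):

```python
# Sketch: replace blocklisted words with a same-length run of asterisks.
import re

BLOCKLIST = ["badword", "worseword"]  # illustrative entries
pattern = re.compile("|".join(map(re.escape, BLOCKLIST)), re.IGNORECASE)

def mask(message):
    return pattern.sub(lambda m: "*" * len(m.group()), message)

print(mask("you badword!"))  # -> "you *******!"
```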

Pattern recognition techniques can often automate the identification of these conversations for potential law enforcement intervention. Done manually, this work consumes many man-hours and spans many message categories, which makes it all the more difficult to identify these patterns. It is a challenging task, but one worth tackling, and we elaborate on our approach in the rest of the article.

 

Online Predators

Photo: Stephen Orsillo/Shutterstock

 

First, let us establish our working definitions.

  • The New Oxford American Dictionary defines a predator as a “person or group that ruthlessly exploits others.”
  • Daniel M. Filler, writing in the Virginia Journal of Social Policy & the Law (2003), expands this to a sexual predator: "a person seen as obtaining or trying to obtain sexual contact or favor with another person in a metaphorically 'predatory' manner."

 

The Solution — Data Engineers Unite!

 


The team focused on the solution’s Predator Analysis portion.

 

Our solution aimed to reduce man-hours and to develop a near real-time warning system that alerts the child when the conversation changes sentiment. The team used a semi-supervised approach to evaluate whether a conversation poses a low, medium, or high risk to the child. The system would evaluate each phrase or sentence and return a sentiment warning if warranted. The data for our chatbot (predator/pseudo-victim conversations) was collected from interactions between a predator and a law enforcement officer or volunteer posing as a child.

The chatbot was designed to learn from non-predatory and predatory conversations and to distinguish between them. Additionally, it would have the ability to recognize inappropriate messages, no matter whether they came from the predator's or the child's side. The corpus also contained adult-like conversations initiated from the child's side.

 

The Dataset

The team consolidated and cleaned nearly 500 chat log files containing exchanges between a predator and a pseudo-victim. The collection grew into a corpus of more than 807,000 messages, ranging from "hello" to explicit remarks. The dataset creation proved laborious; I voluntarily spent more than 630 hours just labeling data. The dataset received labels such as male or female (as participants identified themselves in the chats), predator or victim, and the level of risk of the conversation. Nearly half of the project time was dedicated to a properly built and parsed dataset.

This dataset was split into training, development, and test sets. The training set held 75 percent of all messages, for the chatbot to learn the contextual format and nuances of conversation. The development set, 10 percent of the data, was held away from the chatbot until after model selection to validate the model; the remaining 15 percent formed the test set.

The two-minute video below quickly discusses how the team assembled the chatbot's dataset.

 

 

Data Format and Storage

The data was housed in a relational database. It became large enough to serve as a nexus providing uniquely formatted datasets for the machine learning pipeline.

During the labeling process, a few issues arose around how to semantically define a conversation. With many different log formats, ranging from AOL Messenger to SMS and other online platforms, sentences would start and stop at different points. For conversations, I implemented a format similar to the widely used Cornell Movie-Dialogs Corpus from Cornell University, which provided a standard structure that made the data easy to parse. Additionally, the corpus contained chat slang, abbreviations, and number-for-word substitutions, like "l8r" for "later" and "b4" for "before", which required a team consensus on how to handle such tokens during stopword removal. The team did not focus on timestamps due to extremely varied formatting, missing values, and their lack of importance to the overall project.

 

The Model

Many models presented themselves as candidates for the chatbot's internal workings. The main goal for the team was to have a local, offline solution for now, in order to reduce privacy concerns and legal issues. Future iterations of this project would evaluate these features with appropriate development operations.

 


Basic sequence-to-sequence model diagram.

 

The selected model centered on a Long Short-Term Memory (LSTM) network, arranged in a sequence-to-sequence configuration. LSTMs have long proved well-suited to sequential data. Our application uses this ability to help the chatbot predict the next plausible word of a response.
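
The general shape of such a model, sketched with Keras (dimensions and names are illustrative, not the team's TensorFlow 1.0 implementation):

```python
# Sketch: an encoder-decoder (sequence-to-sequence) LSTM for responses.
from tensorflow.keras import Model, layers

VOCAB, DIM = 5000, 256  # vocabulary size and embedding/hidden size

# encoder: read the incoming message, keep only the final LSTM states
enc_in = layers.Input(shape=(None,), name="message_tokens")
enc_emb = layers.Embedding(VOCAB, DIM)(enc_in)
_, state_h, state_c = layers.LSTM(DIM, return_state=True)(enc_emb)

# decoder: generate the response token by token, seeded with those states
dec_in = layers.Input(shape=(None,), name="response_tokens")
dec_emb = layers.Embedding(VOCAB, DIM)(dec_in)
dec_out, _, _ = layers.LSTM(DIM, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(VOCAB, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```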

For the sentiment analysis portion, we focused our efforts on an ensemble learning model as well as a support vector machine to help predict when a conversation changed from benign to risky.

 

Conclusion

Our team successfully built a chatbot and a sentiment analysis model independently. The chatbot learned from its more than 807,000 messages how to parse sentences and structure a proper response. Its limited vernacular stemmed from the chatbot's limited training time and from framework limitations.

The greatest challenge to performance centered on the chosen platform, TensorFlow 1.0.0, which imposed limitations. The code did produce a conversation-capable entity, but the model needs more training data if we want to go beyond a proof of concept and deploy it in an application.

The project successfully employed message sentiment analysis and was able to warn the user of potentially risky conversations initiated by online predators. The sentiment analysis returned low, medium, or high levels of risk.

Future work will move this project into a fully functioning TensorFlow 2.1.0 environment, eliminating other frameworks, including PyTorch. The internal model will receive an updated LSTM structure, and performance will be improved with graphics processors, such as NVIDIA GPUs and the cuDNN framework.

 
 
