Applying Natural Language Processing in an agile AI project to investigate Online Violence against Children.
I had the honor to contribute to the 29th Omdena project, aimed at investigating Online Violence against Children (OVAC), conducted in partnership with Save The Children. I tried to follow our kick-off meeting on August 20, 2020, sitting in a train running the Albula Line of the Rhaetian Railway, a very scenic route belonging to the UNESCO World Heritage. A bird would have chosen a more direct path to cover the actual distance of 47 kilometers, but our train had to cross 55 bridges and 39 tunnels spread across 62 kilometers.
Setting Project Goals
Only when you know your goal, you can select the most efficient path. The kick-off meeting, which I mostly missed due to a constantly interrupted internet connection, heavily focused on Child Sexual Abuse Materials (CSAM), a term basically synonymous with child pornography. The problem statement handed over to us, in contrast, was intentionally kept very broad and did not specify at all, what kind of OVAC our project team was supposed to analyze. For this reason, every task team formed during the 8 weeks of our project set their focus somewhat differently.
The aim of our project was to apply Natural Language Processing (NLP) techniques to investigate OVAC. This article is restricted to my analysis of news articles. I am a research neuroscientist (and later on a medical writer and machine learning engineer) by education and thus clearly no expert neither in the domain of OVAC, nor in the field of journalism. But whatever we humans write, our choice of words will inevitably be influenced by our background knowledge, and hopefully also adapted to our target audience. Human language is very dynamic. Thus, whatever we try to analyze using NLP, it can’t be a bad idea to first try to obtain some minimal level of domain knowledge from experts on the field. Understanding a domain and reading about definitions and human classification systems will help us not only to select the terms we want to look for but also to define classes that our machine counterparts shall learn to recognize and predict.
To retrieve news articles in digital form, we need to — decide on a provider of digital news articles and search for specific terms. If that search engine even offered advanced search options, we could think about the most sophisticated and goal-directed query syntax. Although standard in scientific repositories, such advanced functionality is rarely built into news article search engines, which forces us to spend some time and effort selecting the best terms to search for.
So how will we select our search terms, being neither OVAC domain experts nor experts of journalistic terminology? Just as the bird trying to fly towards its target, carefully paying attention to environmental cues telling it to adapt direction, we can adopt an agile methodology, as software developers would call it. As long as we lack the bird’s perfect navigation strategy, we can simply use trial and error. We input a search term, analyze what we get back, and adjust our search accordingly. Scientific terms will unlikely be used by reporting agencies. It may thus be a wise idea to use different search terms to collect news articles than to collect scientific publications.
I decided to retrieve news articles from the Thai e-newspaper “Bangkok Post” and started out comparing the specificity of various search terms to our topic of interest, Online Violence against Children. Since our kick-off presentation focused on CSAM, I decided to try its non-scientific equivalent “child pornography” and found each article in the results list to in fact be related to OVAC, thus relevant to our project. In contrast, most other search terms returned many irrelevant articles, in addition. For example, only 6 out of the 52 (thus 12 %) news articles obtained from Bangkok Post by searching for “online grooming” proved to in fact be on OVAC.
If the goal is to include only OVAC-relevant articles into our analysis, 3 strategies are conceivable: 1) to use only search terms that produce results 100 % specific to the problem of OVAC, 2) to use various search terms and manually check article per article to include only the relevant ones, or 3) to find some method to (semi-)automate filtering out the relevant articles. Since 1) would restrict the analysis to very few subclasses of OVAC, and 3) has the potential to increase the efficiency of 2), but also since I’m a curious person who has never before built a classifier that relies on machine learning to categorize text documents, I decided to build a news article classifier.
Automating the Data Collection Process
The search results I got back using my search terms broadly fell into three categories: 1) news articles on OVAC (our target class), 2) articles on violence against children, but in physical offline instead of online forms (“PVAC”), and 3) articles on online violence, but against adults instead of children (“OVAA”). I wondered whether an algorithm will be capable of picking only project-relevant OVAC-related articles out of a collection of news articles of all three classes. To find this out, I trained a Support Vector Machine (SVM) Classifier based on 209 news articles from Bangkok Post (supervised ML; 131 OVAC + 43 PVAC + 35 OVAA articles).
Whenever we train a machine learning model, we have to decide which metric we want to optimize. My goal was not to find every single article out there, that would be on OVAC, our class of interest, but rather to end up with a collection of news articles, all relevant for our project (OVAC-related). In other words, my aim was to maximize recall — the fraction of true positives among all articles classified as positive for our target class OVAC. Already in the 2nd training attempt, I reached a recall of 100 %: every article predicted by the SVM to be on OVAC, was in fact on OVAC.
A classifier as reliable as this may prevent us from battling our way through every single article to confirm its relevance to our project. On the other hand, such great performance can only be achieved on a test set of news articles that shows a class distribution similar to the article set it was trained on. Most notably, a classifier not trained to recognize entirely irrelevant articles will evidently always fail to detect such.
One of the advantages of news articles, if compared to research publications, is that these are published with a very short delay. News articles thus have the power to provide insights about trends over time with a much shorter time lag bias than research publications.
Quantifying the Issue
One goal stated in our problem statement included capturing the severity of the situation. Will the severity of a problem be reflected by the number of news articles published on that problem? Rather not. A case of an actor condemned for possessing child pornography may be reported by 20 news articles and followed up closely, whereas an international ring of pedophiles acting over decades may receive much less media attention and be reported by only a specific reporting agency.
Features can be extracted from news articles with the help of NLP techniques, and then converted into numeric variables with the aim to quantify the magnitude of a problem. They can surely serve as indicators pointing towards potentially underlying trends, but need to be interpreted with a healthy portion of caution and skepticism. Too many biases may play their part as well, such as the reporting bias resulting in an over-representation of the case of the famous actor, as compared to a less attention-grabbing but potentially much more severe case.
Nevertheless, I decided to give it a try and investigate trends over time with respect to the reporting of child pornography cases. Instead of using advanced NLP, I decided to rely on basic NumPy functionality and created a dummy variable for 7 selected verbs, often reported in the context of child pornography. For each article, represented by separate rows of a pandas data frame, the dummy variable would assume a value of 1, if the article contains a certain text string such as “possess”, or alternatively a value of 0 if it does not.
For every reporting year, I summed up these word occurrences to obtain the total number of articles, in which a given string was mentioned together with child pornography, per year. I used the streamgraph package developed for the statistics package “R” to visualize these yearly article frequencies over the past 10 years.
Different NLP techniques have different requirements with respect to the cleanness of a text string. To create a word cloud, it’s beneficial to lowercase every word, so that the frequency of the capitalized and non-capitalized version of the same word will be added together. To analyze, how often the possession of pornography is mentioned in news articles, it is beneficial to use word reduction techniques such as stemming and/or lemmatization, so that the frequencies of differently inflected or derived word forms like “possess”, “possessed”, “possession”, “possessing” will be added together. (I used an alternative approach to solve the same problem, above.)
To avoid wasting time, it is wise to think about such requirements before jumping into actual text cleaning. Some cleaning steps won’t be required for a certain NLP analysis, whereas others will make the analysis difficult or even impossible. A highly efficient, customizable function developed by my collaborator Sij allowed me to clean text strings in virtually no time.
I created n-grams, document term matrices, and word clouds to identify the most common words and word sequences in news articles. These most common terms include “sexual”, “social media” and “facebook”, “law” and “government”, “police”, the “Philippines”, “women” and “teacher”. In analogy to the famous “garbage in, garbage out”, these results of course reflect the principle of “search terms in, keywords out”. Moreover, they illustrate that OVAC often occurs on social media platforms, that Facebook plays a major role in the Thai social media market, and that the Thai news agency “Bangkok Post” often reports on the Philippines.
Besides identifying the most common words or word-sequences (n-Grams) in a collection of text documents, it’s also relevant to investigate, how these words relate to each other. I followed the instructions provided by Jason Brownlee in a Machine Learning Mastery article to calculate my own word embeddings based on the 209 news articles from Bangkok Post with the aim to investigate word similarities and to possibly detect previously unknown words.
I then calculated cosine similarities between selected pairs of word vector embeddings to quantify the similarity between specific words, resp. more precisely: the similarity between the contexts, in which different words typically appear. I found (“sexting” and “stalking”) and (“sexting” and “bullying”) to occur in very similar contexts (cosine similarity scores of 0.86 and 0.83, respectively), whereas (“pornography” and “bullying”) occurred in rather different contexts (0.37).
Furthermore, such visualization of word embeddings has allowed me to identify another term in the family of OVAC related problems, I wouldn’t have known before: “trolling”. I could now go back to the search engine of Bangkok Post to assess the specificity of news articles obtained using that search term for the problem of OVAC.
Insights and Conclusion
So, did all of these NLP techniques help me to gain valuable insights about the problem of Online Violence against Children? I would say that a substantial portion of my personal insights rather came from applying human intelligence while reading about the problem and talking to subject matter experts.
I consider an initial exchange with domain experts, in our case from Save The Children, complemented by some literature review, as invaluable for defining the scope and search terms. Amongst others, I also had the chance to interview a lady working for ECPAT International, a global network of organizations working towards ending the sexual exploitation and abuse of children worldwide.
I came to the conclusion, that the prevalence and importance of the various subclasses of OVAC seem to differ by geographic location. Analogously, also the barriers to overcome in the fight against OVAC likely differ by geographic location. Poverty, cultural norms, the available infrastructure, current legislation, and data protection regulations all critically determine, which forms of OVAC are most present in a society, and to what extent these are considered as normal or as something to be prevented in the future.
In some countries, coming into touch with child pornography is a part of everyday life, already in early childhood. As long as governments don’t provide their population with alternative opportunities to earn the minimum amount of money that would allow them to make an acceptable living, some of the problems around OVAC will be difficult to change. As long as teenagers enjoy being groomed by strangers online, making them aware of the associated risks may have little effect. Providers of social media platforms or chat forums have to respect the data privacy of their users, a goal typically in conflict with efforts to prevent potential online violence against minors.
I believe that the problem of Online Violence against Children can only be sustainably prevented if all stakeholders pull together. Parents can only truly care for the wellbeing of their children if they have enough to survive. Platform providers can only help to prevent OVAC, if prevention methods can be aligned with the current, typically local, regulatory requirements. And Omdena collaborators can only select the best terms to retrieve news articles if they already understand the problem to some extent. Every new insight gained can help re-adjusting the direction, and many tunnels and bridges will still need to be crossed on the long journey to the bright final destination: an internet providing a safe place for every child on this planet.