Investigating Online Violence against Children (OVAC) with Natural Language Processing: collecting and analyzing news articles as part of an Omdena project conducted in partnership with Save The Children.

 

 

Setting Project Goals

 

Attempt to categorize Online Violence against Children into distinct subclasses and to allocate project tasks to these subclasses - Source: Omdena


 

Collecting Data – Trial and Error

So how will we select our search terms, being neither OVAC domain experts nor experts in journalistic terminology? Just like the bird flying towards its target, paying careful attention to environmental cues that tell it to adjust its direction, we can adopt what software developers would call an agile methodology. As long as we lack the bird’s perfect navigation strategy, we can simply use trial and error: we enter a search term, analyze what we get back, and adjust our search accordingly. Reporting agencies are unlikely to use scientific terms, so it is wise to use different search terms for collecting news articles than for collecting scientific publications.

I decided to retrieve news articles from the Thai e-newspaper “Bangkok Post” and started by comparing how specific various search terms are to our topic of interest, Online Violence against Children. Since our kick-off presentation focused on CSAM (child sexual abuse material), I tried its non-scientific equivalent, “child pornography”, and found that every article in the results list was in fact related to OVAC and thus relevant to our project. In contrast, most other search terms also returned many irrelevant articles. For example, only 6 of the 52 news articles (about 12 %) obtained from Bangkok Post by searching for “online grooming” were in fact about OVAC.

 

Specificity of results retrieved using various search terms related to Online Violence against Children. The higher the OVAC-relevant fraction, the more useful the search term - Source: Omdena
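The OVAC-relevant fraction is simply the number of relevant articles divided by the number of articles retrieved. As a minimal sketch, using only the “online grooming” counts reported above (the function name `relevant_fraction` is my own, chosen for illustration):

```python
def relevant_fraction(relevant: int, total: int) -> float:
    """Share of retrieved articles that are actually on topic."""
    if total <= 0:
        raise ValueError("total must be positive")
    return relevant / total


# Counts reported in the text for the search term "online grooming".
fraction = relevant_fraction(relevant=6, total=52)
print(f"online grooming: {fraction:.0%} OVAC-relevant")  # prints "online grooming: 12% OVAC-relevant"
```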


If the goal is to include only OVAC-relevant articles in our analysis, three strategies are conceivable:

1) to use only search terms that produce results 100 % specific to the problem of OVAC.

2) to use various search terms and manually check article by article to include only the relevant ones.

3) to find some method to (semi-)automate filtering out the relevant articles.

Since 1) would restrict the analysis to very few subclasses of OVAC, and 3) has the potential to make 2) more efficient, but also because I’m a curious person who had never before built a machine learning classifier for text documents, I decided to build a news article classifier.

 

Automating the Data Collection Process

Whenever we train a machine learning model, we have to decide which metric we want to optimize. My goal was not to find every single article on OVAC, our class of interest, but rather to end up with a collection of news articles that are all relevant to our project (OVAC-related). In other words, my aim was to maximize precision: the fraction of true positives among all articles classified as positive for our target class, OVAC. (Recall, by contrast, is the fraction of all truly OVAC-related articles that the classifier finds.) Already in the second training attempt, I reached a precision of 100 %: every article predicted by the SVM to be on OVAC was in fact on OVAC.

A classifier as reliable as this can spare us from working through every single article to confirm its relevance to our project. On the other hand, such performance can only be expected on a test set of news articles whose class distribution is similar to that of the training set. Most notably, a classifier that was never trained on entirely irrelevant articles cannot be expected to recognize them.
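The metric described above, the share of predicted-positive articles that are truly positive, is conventionally called precision, while recall is the share of truly positive articles that the classifier finds. The sketch below computes both by hand. The labels are made up for illustration and are not the project data, and `precision_recall` is a hypothetical helper, not part of the original pipeline.

```python
def precision_recall(y_true, y_pred, positive="OVAC"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for six articles (illustration only, not project data).
y_true = ["OVAC", "OVAC", "OVAC", "other", "other", "other"]
y_pred = ["OVAC", "OVAC", "other", "other", "other", "other"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # prints "precision=1.00, recall=0.67"
```

In this toy case every article flagged as OVAC really is on OVAC (precision 1.0), even though one relevant article is missed (recall 2/3). That is the trade-off accepted when the goal is a clean collection rather than a complete one.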

 

Obtained classification metrics for a Support Vector Machine (SVM) Classifier trained to recognize news articles related to our target class, OVAC - Source: Omdena


One advantage of news articles over research publications is that they are published with a very short delay. News articles can thus provide insights into trends over time with a much shorter time lag than research publications.

 

Quantifying the Issue

Features can be extracted from news articles with the help of NLP techniques and then converted into numeric variables to quantify the magnitude of a problem. They can serve as indicators pointing towards potential underlying trends, but they need to be interpreted with a healthy dose of caution and skepticism. Many biases may play a part as well, such as reporting bias, which over-represents the case of a famous actor compared to a less attention-grabbing but potentially much more severe case.

Nevertheless, I decided to give it a try and investigate trends over time in the reporting of child pornography cases. Instead of using advanced NLP, I relied on basic NumPy functionality and created a dummy variable for each of 7 selected verbs that are often reported in the context of child pornography. For each article, represented by a separate row of a pandas data frame, the dummy variable takes the value 1 if the article contains a certain text string such as “possess”, and 0 if it does not.

 

Appending dummy variables to a Pandas dataframe to indicate the presence (1) or absence (0) of a given text string in each article - Source: Omdena
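A minimal sketch of this idea, assuming the article text sits in a column I call `text` (a hypothetical name) and using the seven verbs listed in the streamgraph caption. The three toy articles are made up for illustration, and the exact calls the original notebook used are not known to me.

```python
import numpy as np
import pandas as pd

# The seven verbs named in the streamgraph caption.
VERBS = ["stream", "spread", "share", "record", "publish", "possess", "create"]

# Toy articles (made up for illustration, not project data).
articles = pd.DataFrame({
    "text": [
        "Police said the suspect possessed and shared illegal images.",
        "The new law targets people who create or publish such material.",
        "The ministry announced a school safety campaign.",
    ]
})

for verb in VERBS:
    # Plain substring match on lowercased text, so "possessed" also counts as "possess".
    contains = articles["text"].str.lower().str.contains(verb, regex=False)
    articles[verb] = np.where(contains, 1, 0)

print(articles[VERBS])
```

Because this is a plain substring match, it also fires on unrelated words that happen to contain the string (for example “shareholder” for “share”), which is worth keeping in mind when reading the counts.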


For every reporting year, I summed up these dummy variables to obtain the number of articles per year in which a given string was mentioned together with child pornography. I used the streamgraph package for the statistical programming language R to visualize these yearly article frequencies over the past 10 years.
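The yearly aggregation can be sketched in pandas as a group-by followed by a sum. The column name `year` and all values below are assumptions made up for illustration, not the project data.

```python
import pandas as pd

# Toy dummy-variable table (made-up values, not project data).
articles = pd.DataFrame({
    "year":    [2018, 2018, 2019, 2019, 2019],
    "possess": [1, 0, 1, 1, 0],
    "share":   [0, 1, 1, 0, 0],
})

# Summing the 0/1 indicators per year gives the number of articles
# that mention each string in that year.
yearly_counts = articles.groupby("year")[["possess", "share"]].sum()
print(yearly_counts)
```

The resulting table has one row per year and one column per verb, which is the shape a stacked or stream-style chart typically expects.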

 

Yearly frequency of selected verbs reported together with child pornography. Blue: “stream”, green: “spread”, light green: “share”, yellow: “record”, light orange: “publish”, dark orange: “possess”, red: “create” - Source: Omdena


Different NLP techniques have different requirements regarding how clean a text string must be. To create a word cloud, it is beneficial to lowercase every word, so that the frequencies of the capitalized and non-capitalized versions of the same word are added together. To analyze how often the possession of pornography is mentioned in news articles, it is beneficial to use word reduction techniques such as stemming and/or lemmatization, so that the frequencies of differently inflected or derived word forms like “possess”, “possessed”, “possession”, and “possessing” are added together. (Above, I used an alternative approach to solve the same problem: matching the text string “possess” directly.)

To avoid wasting time, it is wise to think about such requirements before jumping into actual text cleaning. Some cleaning steps won’t be required for a certain NLP analysis, whereas others will make the analysis difficult or even impossible. A highly efficient, customizable function developed by the team allowed me to clean text strings in virtually no time.

 

A customizable text cleaning function. Booleans can be adjusted to specify which of 6 standard steps are applied - Source: Omdena
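The team’s function itself is only shown as an image, so the following is a hypothetical reconstruction of the general pattern, not the original code. The six steps, the parameter names, and the tiny stopword list are all my own assumptions.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "on", "at"}

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, lowercase=True, strip_urls=True, strip_punctuation=True,
               strip_digits=True, strip_stopwords=True, collapse_whitespace=True):
    """Apply the selected cleaning steps in a fixed order (hypothetical steps)."""
    if lowercase:
        text = text.lower()
    if strip_urls:
        # Must run before punctuation removal, which would break URLs apart.
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    if strip_punctuation:
        text = text.translate(_PUNCT_TABLE)
    if strip_digits:
        text = re.sub(r"\d+", " ", text)
    if strip_stopwords:
        text = " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text


sample = "Police arrested 3 suspects in Bangkok; see https://example.com for details!"
print(clean_text(sample))
print(clean_text(sample, strip_digits=False, strip_stopwords=False))
```

Switching individual steps off, as in the second call, is what makes such a function reusable across analyses with different cleaning requirements.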


I created n-grams, document term matrices, and word clouds to identify the most common words and word sequences in news articles. These most common terms include “sexual”, “social media” and “Facebook”, “law” and “government”, “police”, the “Philippines”, “women” and “teacher”. In analogy to the famous “garbage in, garbage out”, these results of course reflect the principle of “search terms in, keywords out”. Moreover, they illustrate that OVAC often occurs on social media platforms, that Facebook plays a major role in the Thai social media market, and that the Thai news agency “Bangkok Post” often reports on the Philippines.
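Counting n-grams needs nothing beyond the standard library. The sketch below counts bigrams over two made-up headlines (not project data); `ngrams` is a helper name chosen for illustration, and whitespace splitting stands in for proper tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up, already cleaned example documents.
docs = [
    "police warn parents about social media risks",
    "social media platforms face new law",
]

bigram_counts = Counter()
for doc in docs:
    bigram_counts.update(ngrams(doc.split(), 2))

print(bigram_counts.most_common(1))  # prints [(('social', 'media'), 2)]
```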

Besides identifying the most common words or word sequences (n-grams) in a collection of text documents, it is also relevant to investigate how these words relate to each other. I followed the instructions provided in a Machine Learning Mastery article to calculate my own word embeddings based on the 209 news articles from Bangkok Post, with the aim of investigating word similarities and possibly detecting previously unknown words.

 


Visualization of word vector embeddings trained on 209 Thai news articles of all three classes (OVAC/PVAC/OVAA). Depicted are the 9 closest neighbors (used in most similar context) to 8 selected words: pornography, CSAM, cyberbullying, trolling, grooming, sexting, cyberharassment, and cybercrime (see bottom-right for color labels) – Source: Omdena

I then calculated cosine similarities between selected pairs of word vector embeddings to quantify the similarity between specific words or, more precisely, the similarity between the contexts in which different words typically appear. I found that “sexting” and “stalking”, as well as “sexting” and “bullying”, occur in very similar contexts (cosine similarity scores of 0.86 and 0.83, respectively), whereas “pornography” and “bullying” occur in rather different contexts (0.37).

Furthermore, this visualization of word embeddings allowed me to identify another term in the family of OVAC-related problems that I had not known before: “trolling”. I could now go back to the search engine of Bangkok Post to assess how specific the news articles obtained with that search term are to the problem of OVAC. To learn more about the data cleaning and modeling, read “Internet Safety for Children: Using NLP to Predict the Risk Level of Online Games, Websites, and Applications”.
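Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up 3-dimensional vectors follows. Real word embeddings have many more dimensions, and these numbers are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Made-up vectors (illustration only, not trained embeddings).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 5.0]   # orthogonal to vec_a

print(round(cosine_similarity(vec_a, vec_b), 2))  # prints 1.0
print(round(cosine_similarity(vec_a, vec_c), 2))  # prints 0.0
```

A score near 1 means two words appear in very similar contexts, while a score near 0 means their contexts have little in common. The measure ignores vector length and only compares direction.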

 

 

 

Insights and Conclusion

I consider an initial exchange with domain experts, in our case from Save The Children, complemented by some literature review, invaluable for defining the scope and the search terms. Among other things, I also had the chance to interview a woman working for ECPAT International, a global network of organizations working towards ending the sexual exploitation and abuse of children worldwide.

I came to the conclusion that the prevalence and importance of the various subclasses of OVAC seem to differ by geographic location. Likewise, the barriers to overcome in the fight against OVAC probably differ by geographic location. Poverty, cultural norms, the available infrastructure, current legislation, and data protection regulations all critically determine which forms of OVAC are most present in a society, and to what extent they are considered normal or something to be prevented in the future.

In some countries, coming into contact with child pornography is part of everyday life, even in early childhood. As long as governments don’t provide their population with alternative opportunities to earn the minimum amount of money that would allow them to make an acceptable living, some of the problems around OVAC will be difficult to change. As long as teenagers enjoy being groomed by strangers online, making them aware of the associated risks may have little effect. Providers of social media platforms or chat forums have to respect the data privacy of their users, a goal that typically conflicts with efforts to prevent potential online violence against minors.

I believe that the problem of Online Violence against Children can only be sustainably prevented if all stakeholders pull together. Parents can only truly care for the well-being of their children if they have enough to survive. Platform providers can only help to prevent OVAC if prevention methods can be aligned with the current, typically local, regulatory requirements. And Omdena collaborators can only select the best terms to retrieve news articles if they already understand the problem to some extent. Every new insight gained can help readjust the direction, and many tunnels and bridges will still need to be crossed on the long journey to the bright final destination: an internet providing a safe place for every child on this planet.
