Classifying Sexual Abuse in Chats through the Bag of Words NLP Model

 
This work is part of the Omdena AI project with the award-winning Zero Abuse Project.

 

The problem

Child sexual abuse is a particular problem in schools and universities. In the United States, “an estimated 10% of K–12 students will experience sexual misconduct by a school employee by the time they graduate from high school.” Shocking stories of sexual abuse at Penn State, the University of Michigan, Ohio State, and other universities are continuously being revealed. In some of these cases, it has taken decades of dedicated work by victims’ advocates and journalists for the abuse to come to light.

 

The solution

The project team built a Natural Language Processing (NLP) model to help classify predatory individuals in online chats and prevent sexual abuse.

 

The dataset

The dataset used in this report is PAN12, which contains chat texts between molesters and children.

 

Importing the NLP packages: Preprocessing

The first major library that needs to be imported is “re”, for Regular Expressions. The second is the Natural Language Toolkit (NLTK), a Python library that helps us work with natural language. It comes with more than 50 corpora and lexical resources such as WordNet, and provides tools for tokenizing, stemming, classification, parsing, tagging, and semantic analysis.

The last one is PorterStemmer from the NLTK library, which provides stemming. Stemming is a process of linguistic normalization: it reduces words to their word roots. For example, organization, organizes, organized, organizer, and organize would all be reduced to the same stem.

 

Importing NLP packages
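The original post showed this step as a screenshot, so here is a minimal reconstruction of the imports (the stopwords download is an assumption, needed for the cleaning step that follows):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')  # fetch the stopword list used during cleaning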

 

 

Now, we can apply all those changes.

 

NLP preprocessing
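The preprocessing code was also shown as an image; a hedged reconstruction of the loop, assuming the chats sit in a pandas DataFrame called dataset with a hypothetical 'text' column, could look like this:

corpus = []
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
for message in dataset['text']:
    # keep letters only, lowercase, and tokenize on whitespace
    words = re.sub('[^a-zA-Z]', ' ', str(message)).lower().split()
    # drop stopwords and stem the rest
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    corpus.append(' '.join(words))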

 

 

Also, a better approach, especially for large datasets, is to define a function and then call it through a lambda. Just be careful: in our previous approach the output was an array, whereas applying a lambda function as follows makes the output a column in the dataset.

 

Labeling the important strings
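A sketch of that function-plus-lambda approach, under the same assumptions as above:

def clean_text(message):
    words = re.sub('[^a-zA-Z]', ' ', str(message)).lower().split()
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    return ' '.join(words)

# the result is stored as a new column rather than an array
dataset['text_clean'] = dataset['text'].apply(lambda m: clean_text(m))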

 

 

This method works for datasets with a limited number of unique features. In the case of having a large number of features in a dataset, this method isn’t efficient. Future versions of this article will address this issue.

 

Finding most-used strings in the subsets

 

Labeling the important strings

 

Then we extracted the features of our dataset. Features are specific words used between the child and the potential molester. The model predicts whether the communicated words are sexual or not.

The bag of words model is a good choice for us. In this model, a text (a sentence) is treated as a bag (multiset) of its words: only multiplicity is considered, not grammar or word order.

To do so, CountVectorizer from Scikit-Learn is our module of choice; it converts a collection of text documents into a matrix of token counts. We end up with a dictionary of words in an array, and if any of these predefined words appears in a given bag of words, the corresponding entry in the matrix is 1, and 0 otherwise.

Then we fit and transform the vectorizer on the features part of our dataset.

 

Creating the NLP model
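A minimal sketch of the bag-of-words step with Scikit-Learn; the vocabulary cap and the 'label' column name are assumptions, not taken from the original code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1500)  # cap on vocabulary size (assumed)
X = vectorizer.fit_transform(corpus).toarray()   # one row per chat, one column per word
y = dataset['label'].values                      # hypothetical label column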

 

 

Then we separate the dataset into train_features, test_features, train_labels, and test_labels by randomly assigning 80% of the values to the training set and 20% to the test set. The X parameter represents the columns of the dataset holding the features, and the y parameter their corresponding labels.

 

Making training and test sets
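The split is a one-liner with Scikit-Learn (reconstructed; the random seed is an assumption):

from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80% train / 20% test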

 

 

Next, we introduce the Gaussian Naive Bayes module and fit it to our training features and labels.

 

Classifying training set by Gaussian method
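Assuming the “Gaussian module” refers to Scikit-Learn’s GaussianNB, fitting it looks like this:

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(train_features, train_labels)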

 

 

We follow with predicting our labels based on our test_features.

 

Making predictions on the test set
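That step is a single call (reconstructed):

predicted_labels = classifier.predict(test_features)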

In the end, we build a confusion_matrix to check our predicted labels against our test_labels.

Making the confusion matrix to qualify our model
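A reconstruction of that final check with Scikit-Learn:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_labels, predicted_labels)
print(cm)  # rows: true labels, columns: predicted labels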

 

Conclusion and summary

In this report, we provided an NLP model to classify potential child molesters and prevent sexual abuse early on. We used BeautifulSoup to import the dataset in XML format, derived the raw text, and cleaned the dataset. Next, we introduced various NLP libraries and packages and applied them to our dataset, used the bag of words model, and output the confusion matrix.

 

More about Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 

Using AI To Prevent Gang Violence via Analyzing Tweets

Applying machine learning to understand gang language and detect threatening tweets related to gang violence.

 

The problem

“Some believe that bolstering school security will deter violence, but this reactionary measure only addresses part of the problem. Instead, we must identify threats, mitigate risk, and protect children and staff before an act of violence occurs.” — Jeniffer Peters, Founder of Voice4Impact (Project Partner)

Chicago is considered the most gang-infested city in the United States, with over 100,000 active members from nearly 60 factions. Gang warfare and retaliation are common in Chicago, which in 2020 has so far seen a 43% rise in killings compared to 2019.

 

 

The solution

It was noticed that gangs often use Twitter to communicate with fellow gang members as well as to threaten rival gang members. Gang language is a mixture of icons and gang-specific terms.

 

Sample Gang Language

 

The project team split the work into two parts:

  • Implement a machine-learning algorithm to understand gang language and detect threatening tweets related to gang violence.
  • Find correlation between threatening tweets and actual gang violence.

 

Part 1: Detecting violent gang language and influential members

The goal was to classify tweets as threatening or non-threatening so that the threatening ones can be routed to intervention specialists who will then decide what action to take.

 

Step 1: Labeling tweets collaboratively

First, a tool was created to label tweets faster and train the machine learning model. We were only provided the raw tweets. Searching the web, we found LightTag, which is a product designed for exactly this but it is a paid product once you exceed the comically low number of free labels.

We needed a simpler solution that does everything we need, and nothing else. So, we turned to a trusted old friend: Google Spreadsheets. A custom Google Spreadsheet was made (the template is publicly available here). It features a scoreboard, so labelers get credit for their contributions, and a mechanism to have at least two people label each tweet to ensure the quality of labels.

 

 

 

To ensure the quality of our labels, we decided we needed at least two labels on every tweet; if they do not agree, a third label is required to break the tie. Row color-coding makes it easy to see which rows are finished. If a row has been labeled once, it is colored green. If it has been labeled twice and the two labels do not agree, it is colored red. The scoreboard page also shows, for each page, a count of how many tweets are labeled once, labeled twice with conflicting labels, and finished.

 

Step 2: Sentiment analysis (with probability value) of tweets being violent

The sentiment analysis team built a machine learning model to predict whether the tweets are threatening or non-threatening. But first, we needed to address the challenges of an imbalanced dataset, where over 90% of the tweet feed was non-threatening, and the small size of the labeled dataset. We tested multiple techniques, including loss functions specifically designed for imbalanced datasets, undersampling, transfer learning from existing word embedding algorithms, and ensemble models. We then combined the reservoir of violent signal words to come up with a probability value (the probability that a tweet is prone to using violent words) for each tweet.

 

Step 3: Detect influential members in the twitter gang network

Next, we wanted to identify the influential members of the network. A network analysis resulted in a directed graph; using the Girvan-Newman algorithm, the communities in the network could also be detected. Using the PageRank values of each node, the influential members were identified.

 

5 steps to build an effective network analysis of tweets

1. Using Python’s NetworkX, a graph was created from the mentions and authors of the tweets

Network Analysis Gang Violence

Network analysis

 

A detailed article on the Network analysis.

The nodes represent mentions in a tweet or the author of a tweet. Edge A → B means B was mentioned in a tweet posted by A.

2. Thousands of tweets were used to create a directed graph, and using the Girvan-Newman algorithm, the communities in the network were detected. Also, using the PageRank values of each node, the influential members in the network could be identified. This value is not crucial to the network analysis but can be useful if one tries to track a gang member who is influential in the network.

3. The members in the communities are either authors or mentions. So, the tweets were then tagged with the community number based on the mention or author names.

4. The total number of signal keywords in all the communities was calculated and so was the total number of signal words for individual communities.

5. The final result was a dataset of tweets that had the community tag and the probability of using violent words, based on usage of signal words within the community relative to all communities. For example, in the picture below, members from Community 1 who are authors or mentions in the tweets are more inclined towards using violent keywords. So, tweets which contain authors/mentions from this community are contextually more violent.

 

 

Also, the network analysis can give insight into which members are more influential within the community. One can get a notion by looking at the PageRank values of the members of the community: the greater the PageRank, the more influential the member. A minimal sketch of this pipeline follows below.
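This sketch covers the graph construction, community detection, and PageRank ranking with NetworkX; tweets is a hypothetical list of (author, mentioned_user) pairs extracted from the feed:

import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.DiGraph()
for author, mention in tweets:
    G.add_edge(author, mention)  # edge A -> B: A mentioned B in a tweet

# first level of the Girvan-Newman community hierarchy
communities = next(girvan_newman(G))

# PageRank as an influence score; higher means more influential
pagerank = nx.pagerank(G)
top_members = sorted(pagerank, key=pagerank.get, reverse=True)[:10]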

 

Page Rank vs Gang Member

 

Part 2: Correlation between actual violence and tweets

Next, we wanted to understand whether there is any correlation between actual crimes and the mention of ‘gun’ in threatening tweets.

Below is the correlation between the two metrics on the same day, 1-day, and 2-day shift.

 

Same day

 

1-day shift

 

2-day shift

 

Through this analysis, we can see that there is a correlation between the number of crimes and the use of ‘gun’ in threatening tweets with a 2-day shift. This can be very useful for authorities in preventing gang violence.
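The shift comparison above can be reproduced with a few lines of pandas; daily is a hypothetical DataFrame of per-day counts with assumed column names:

import pandas as pd

for shift in (0, 1, 2):
    # correlate today's crime count with gun-mention tweets from `shift` days earlier
    corr = daily['crimes'].corr(daily['gun_tweets'].shift(shift))
    print(f"{shift}-day shift: {corr:.2f}")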

 

More about Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 

An Attempt to Identify Cybersex Crimes through Artificial Intelligence

Classifying the online chats between two persons as sexual abuse or non-sexual abuse using text mining and deep learning.

 

The problem

The vast growth in technology and social media has revolutionized our lives by transforming the way we connect and communicate. Unfortunately, the darker side of this development has exposed many children and teenagers of various ages to online sexual abuse.

To help combat the severity of the problem, I joined an Omdena project together with the Zero Abuse Project. Among 45 Omdena collaborators from across 6 continents, the goal was to build AI models to identify patterns in the behavior of institutions when they cover-up incidents of sexual abuse.

The identification and analysis of sexual crimes helps assure public safety and has been made possible by leveraging AI. Natural Language Processing and various machine learning techniques have played a major role in the successful identification of online sexual abuse.

 

The solution

The main idea of this task was to classify online chats between two persons as sexual abuse or non-sexual abuse. We planned to implement this using text mining and deep learning techniques such as an LSTM-RNN. In the following example, our idea aimed at classifying the chats as predatory or non-predatory.

 

Classifying online chats 

We used the open-source PAN2012 dataset, provided in the context of the Sexual Predator Identification (SPI) Task initiated in 2012 by the PAN (Plagiarism analysis, authorship identification, and near-duplicate detection) lab. However, this realistic data has a high noise level, unbalanced training samples, and conversations of varying length.

A challenging part of this dataset was converting the chat abbreviations and cyber slang, such as “u” for “you”, “ur” for “your”, and “l8r” for “later”. Such words are necessary for feature selection and for improving the performance of the classification model.

 

Wait, are we stuck with preprocessing?

Initially, with the huge dataset and high noise levels, preprocessing did seem like a herculean task! Well, 80% of the time goes into preprocessing in order to achieve the best results. We managed to implement it using text mining techniques. We started off with a basic analysis: checking for null characters, finding the sentence length of each text message, and finding the words with the highest frequencies. We also implemented stopword removal, stemming, and lemmatization. The aim of both stemming and lemmatization is to reduce the corpus size and complexity by creating embeddings from simpler words, which is useful for sentiment analysis. Stopwords are omitted since they provide little value for the machine’s understanding of texts.
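A minimal sketch of those steps with NLTK (the library the corpora names suggest; the exact code is not in the original):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalize(text):
    tokens = [t for t in text.lower().split() if t not in stop_words]
    stems = [stemmer.stem(t) for t in tokens]            # stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization
    return stems, lemmas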

Furthermore, we realized our dataset contained loads of emojis, URLs, hashtags, misspelled words, and slang. To reduce the noise levels further, we had to remove the emoticons from the chats using regular expressions and fix the misspelled words via a dictionary. The tricky part involved converting the chat slang abbreviations, since they were necessary for feature selection. Unfortunately, it was difficult to find a library or database of words that does that, so we had to create a dictionary for that purpose.

slang_dict = {"aren't": "are not", "can't": "cannot", "couldn't": "could not","didn't": "did not","doesn't": "does not",
"don't": "do not","hadn't": "had not"......}
def process_data(data):
   data['text_clean']=data['text_clean'].str.lower()
   data['text_clean']=data['text_clean'].astype(str)
   data.replace(slang_dict,regex=True,inplace=True)
   display(data.head(2))
   return data

 

The Exploratory Data Analysis

We further analyzed the top 20 most frequent words in the chatlogs as unigrams and bigrams. A unigram is an n-gram consisting of a single word from a sequence, while a bigram contains two consecutive words from a sequence.

 

Top 20 Unigrams

 

From the analysis, we inferred that words such as “age”, “sex”, “hi”, etc. were used very frequently in the chatlogs.
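One way such frequency counts can be produced (a sketch; chats is a hypothetical list of cleaned chat lines):

from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(chats, ngram_range, n=20):
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    counts = vectorizer.fit_transform(chats).sum(axis=0).A1
    vocab = vectorizer.get_feature_names_out()
    return sorted(zip(vocab, counts), key=lambda pair: -pair[1])[:n]

print(top_ngrams(chats, (1, 1)))  # top 20 unigrams
print(top_ngrams(chats, (2, 2)))  # top 20 bigrams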

Moving into the language model and classification

The XML dataset provided by PAN2012 is unlabelled, and manual labeling is a pretty difficult task considering the number of samples present in the dataset. To work around this, sentiment analysis was carried out to identify the polarity and subjectivity of the chatlogs. Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 a negative statement. Subjectivity is also a float, in the range [0, 1]; subjective sentences generally refer to personal opinion, emotion, or judgment, whereas objective ones refer to factual information.
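The article does not name the sentiment library, but TextBlob exposes exactly the polarity and subjectivity ranges described, so the labeling step may have looked like this:

from textblob import TextBlob

sentiment = TextBlob("hi, how old are you").sentiment
print(sentiment.polarity)      # float in [-1, 1]
print(sentiment.subjectivity)  # float in [0, 1]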

Considering the varying number of sentences per conversation (from 1 to more than 500), the extra-long conversations were padded with zeros and then split into parts, each with an equal length of 100. This strategy helps prevent underfitting in the LSTM-RNN model when processing long conversations. The tokenized words were converted into word embeddings, to be fed into the LSTM-RNN classifier, using the pre-trained GloVe model.
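A hedged sketch of that padding-and-splitting step, assuming a Keras tokenizer has already turned each conversation into a list of word ids:

from tensorflow.keras.preprocessing.sequence import pad_sequences

CHUNK_LEN = 100  # equal-length parts, as described above

def split_and_pad(sequences):
    chunks = []
    for seq in sequences:
        # split each conversation into parts of at most CHUNK_LEN tokens
        for start in range(0, max(len(seq), 1), CHUNK_LEN):
            chunks.append(seq[start:start + CHUNK_LEN])
    # zero-pad every part to exactly CHUNK_LEN
    return pad_sequences(chunks, maxlen=CHUNK_LEN, padding='post', value=0)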

GloVe stands for global vectors for word representation. It is an unsupervised learning algorithm developed by Stanford for generating word embeddings by aggregating a global word-word co-occurrence matrix from a corpus.

# Build the embedding matrix from pre-trained GloVe vectors.
import numpy as np

# embeddings_index maps each word to its 100-d GloVe vector, parsed
# from a GloVe file such as glove.6B.100d.txt (word followed by values).
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

def fill_embedding_matrix(tokenizer):
    vocab_size = len(tokenizer.word_index)
    embedding_matrix = np.zeros((vocab_size + 1, 100))  # row 0 is the padding index
    for word, i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


The architecture of the LSTM-RNN classifier 

 

Each word embedding is fed into the binary LSTM-RNN classifier. It consists of one embedding layer, two LSTM-RNN layers with 200 units and 50 timesteps, and a sigmoid layer, implemented on the TensorFlow framework for binary classification. The results could have been improved if labeling the chatlogs were more efficient and if the persistent noise in the dataset could be reduced. However, this task of classifying sexual predators gave us a clearer perspective on efficient feature selection and on new approaches to the labeling problem in order to improve the accuracy of the LSTM-RNN classification.
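A hedged Keras sketch of that architecture, reusing the GloVe embedding_matrix built above; any hyperparameters beyond those stated (200 units, 50 timesteps, sigmoid output) are assumptions:

import tensorflow as tf

SEQ_LEN = 50  # timesteps, as described above

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                            # keep GloVe vectors frozen
    tf.keras.layers.LSTM(200, return_sequences=True),
    tf.keras.layers.LSTM(200),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # binary output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])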

 

More about Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 

Sources 

[1] https://arxiv.org/ftp/arxiv/papers/1712/1712.03903.pdf

[2] https://pan.webis.de/clef12/pan12-web/sexual-predator-identification.html

[3] https://towardsdatascience.com/fasttext-sentiment-analysis-for-tweets-a-straightforward-guide-9a8c070449a2

Analysing Sexual Abuse at the Workplace Using Supervised Learning

How oversampling and supervised learning yielded great results for classifying cases of sexual abuse.

 

By Omdena Collaborator Mertcan Coskun


 

 

 

Nowadays, the issue of sexual abuse is gaining more and more attention, not just in the USA but throughout the whole world.

To help combat the problem, I joined an Omdena project together with the Zero Abuse Project. Among 45 Omdena collaborators from across 6 continents, the goal was to build AI models to identify patterns in the behavior of institutions when they cover-up incidents of sexual abuse.

 

My task: Overcoming an imbalanced data set

When it comes to data science, sexual abuse is an imbalanced data problem, meaning there are few (known) instances of harassment in the entire dataset.

An imbalanced problem is defined as a dataset with disproportionate class counts. Oversampling is one way to combat this by creating synthetic minority samples.

Together with other collaborators, I worked on an AI tool that evaluates the risk factors that suggest potential predatory individuals within an organization and those associated with the cover-up.

Our data consists of sexual abuse instances at work and their features. The data is provided by UNICEF.

An instance of sexual harassment is a reported case of sexual harassment that has been concluded by law enforcement. The risk factors serve as our features in the dataset; features include the state, the number of relocations, and the institution the person is connected to.

Since the nature of the data is sensitive and unique, I predicted probabilities rather than classes as the prediction output. In such questions, predicting either 0 or 1 may be too controversial.

To cope with the data imbalance problem and sensitivity, I decided to apply oversampling and implement a random forest model (supervised learning) to analyze the sexual abuse patterns.

 

The power of oversampling

SMOTE (Synthetic Minority Over-sampling Technique) is a common oversampling method widely used in machine learning with imbalanced high-dimensional data. SMOTE generates new examples of the minority class by interpolating randomly along the line segments joining a minority sample to its nearest minority-class neighbors, thereby increasing the number of instances. SMOTE creates these synthetic minority samples using the popular K nearest neighbors algorithm.

K nearest neighbors draws a line between minority points and generates new points along that line. The original technique has since been built upon; nowadays one can find many different versions of SMOTE extending the classic formula. Let’s visualize how oversampling affects the data in general.
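As a quick illustration, the smote_variants package referenced below exposes the classic SMOTE directly; X and y are hypothetical feature and label arrays:

import numpy as np
import smote_variants as sv

oversampler = sv.SMOTE()
X_resampled, y_resampled = oversampler.sample(X, y)
print(np.bincount(y), '->', np.bincount(y_resampled))  # minority count now matches majority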

 

Visual representation of data without oversampling

 

 

Visual representation of data with oversampling

 

For visualization’s sake, two features are picked; from their distribution it is clearly seen that, after oversampling, the minority sample count matches the majority’s.

 

Impact on the predictions

Let’s compare the predictive power of oversampling vs. not oversampling. Random Forest is used as the predictor in both cases. The ProWSyn version of oversampling is selected as the highest performing oversampling method after all the methods are compared using this[1] Python package.
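A minimal sketch of this setup with the smote_variants package [1]; the split variables and Random Forest settings are assumptions:

import smote_variants as sv
from sklearn.ensemble import RandomForestClassifier

# oversample only the training split, never the test split
X_samp, y_samp = sv.ProWSyn().sample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_samp, y_samp)

# predict probabilities rather than hard classes, as discussed above
proba = model.predict_proba(X_test)[:, 1]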

Let’s check the performance of models pre and post oversampling.

 

ROCAUC without oversampling

 

 

ROCAUC with oversampling

 

With ProWSyn oversampling implemented, we can see a 13-percentage-point increase in the ROCAUC score, which is the Area Under the Receiver Operating Characteristic curve, from 84% to 97%. I was also able to decrease the Brier Score[3], which is a metric for probability predictions, by 5%.
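Both metrics are available in scikit-learn; a sketch using the hypothetical proba predictions from above:

from sklearn.metrics import roc_auc_score, brier_score_loss

print('ROC AUC:', roc_auc_score(y_test, proba))         # higher is better
print('Brier score:', brier_score_loss(y_test, proba))  # lower is better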

As you can see from the results, oversampling can significantly boost your model performance when you have to deal with an imbalanced dataset. In my case, the ProWSyn version of SMOTE performed the best, but this always depends on the data, and you should try different versions to see which one works best for you.

 

What is ProWSyn and why does it work so well?

Most oversampling methods lack a proper process for assigning correct weights to minority samples. This results in a poor distribution of generated synthetic samples. The Proximity Weighted Synthetic Oversampling Technique (ProWSyn) generates effective weight values for the minority data samples based on each sample’s proximity information, i.e., distance from the boundary, which results in a proper distribution of generated synthetic samples across the minority data set.[2]

 

What is the output?

 

x: number of instances; y: probability

 

After the prediction, the histogram of predicted probabilities looks like the image above. The distribution turned out to be the way I imagined. The model has learned from the many features, and it turns out there is a correlation within the feature space which in the end creates a distinct difference between classes 0 and 1. In simpler terms, there is a pattern within the features of the 0 and 1 classes.

More care has to be put into probabilities really close to 1 (100% probability). From the histogram plot above, we can see that the number of points near 100% probability is quite high. It is normal to dismiss someone as a non-predator but much harder to accuse someone; therefore, that number should be lower.

 

What’s next?

I shared a description of applying supervised learning to sexual abuse data.

I was able to identify the main problem, which was the class proportion in the target values. Since predicting probabilities on such a sensitive subject requires a well-functioning, thought-out model, I wanted to fix the biggest problem by creating synthetic instances of sexual harassment in the dataset and having the model learn from them. As a result, the predicted probabilities, or red flags, showed strong Brier Score and AUC values, which means higher probability prediction performance.

In plain English, these high scores mean much better predictive performance. But this is a double-edged sword, as the model would flag a large number of highly probable sexual harassment entities on future data.

Since this machine learning task is much more sensitive than, for example, predicting the price of second-hand cars, these high probabilities may lead to complications. Having more training data and using a very high threshold may overcome this problem.

 

About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

Want to become an Omdena Collaborator and join one of our tough projects? Apply here.

We are also on LinkedIn, Instagram, Facebook, and Twitter.

 

Sources

[1] https://github.com/analyticalmindsltd/smote_variants

[2] https://link.springer.com/chapter/10.1007/978-3-642-37456-2_27

[3] https://en.wikipedia.org/wiki/Brier_score
