By Omdena Collaborator Harshita Chopra

 
Data mining, topic modeling, document annotations, NLP, and stacking machine learning models: A complete journey.

Artificial Intelligence and its possibilities have always fascinated me. Making machines learn through data is nothing short of amazing. When I learned about Omdena and its wonderful initiative of using AI for social good through the power of global collaboration, I couldn’t stop myself from participating in its empowering challenges.

I felt truly content to be given the role of a Machine Learning Engineer in my first challenge. Connecting with a team of 50 fabulous collaborators from various countries around the world, including domain experts, data scientists, and AI experts, felt like a golden opportunity to gain knowledge in the best possible way.

It provided a space to create value out of my ideas and learn from the enhancements. The harmonizing environment gave me the experience of leading task groups and interacting with some really innovative minds.

In this blog post, I’ll walk you through a major part of the project I led and contributed to for the past month.

 

The Problem

There has been a surge in Domestic Violence (DV) and online harassment cases during COVID-19 lockdowns in India. Homes are no longer a safe place for victims trapped in abusive relationships with their family members.

Domestic violence involves a pattern of psychological, physical, sexual, financial, and emotional abuse. Acts of assault, threats, humiliation, and intimidation are also considered acts of violence.

Government data on Domestic Violence is available only in summary form. Incidents are largely reported via phone calls, which makes collecting the data and subsequently mapping it difficult.

The goal of the challenge was to collect and analyze data from different social media platforms or news sources so as to gain insights on the rise in DV incidents during the nation-wide lockdown.

 

The Solution

Social media platforms are a huge and largely untapped resource for social data and evidence. They generate a vast amount of data every day on a variety of topics. Consequently, they are a key source of information for anyone seeking to study various issues, even socially stigmatized and taboo topics like Domestic Violence.

Victims experiencing abuse need early access to specialized services such as health care, crisis support, legal guidance, and so on. Hence, social support groups play a leading role in raising awareness and in offering victims various dimensions of social support: emotional, instrumental, and informational.

Red Dot Foundation plans to deal with this challenge. When victims seek help, it is important to identify and analyze those critical posts and to respond to the need with more immediate impact.

Tasks were divided to mine data from different sources: Twitter, Reddit, YouTube, news articles, government reports, and Google Trends. After acquiring huge amounts of data, the next step was filtering for relevant posts through topic modeling and keywords. This was followed by annotating the data and then building an NLP-based machine learning classifier.

In this blog post, tweets will be in the spotlight!

 

 

Scraping data with the right queries

Tweets needed to be extracted from both the pre-lockdown and the lockdown periods so as to gauge the surge in domestic violence. Hence, we took a time frame of January 2020 to May 2020.

Tweepy (a Python wrapper for the official Twitter API) can search tweets only from the past seven days through the standard search endpoint, which was a frustrating limitation. Hence, we needed an alternative for mining old tweets along with their location.

GetOldTweets3 is a fantastic Python library for this task. Twitter’s advanced search can do wonders for generating your customized query. In order to extract harassment-related posts, here are a few examples of queries we used:

 

 

 

Using AND combinations of relationship words with actions and nouns yields good results. The until and since attributes hold the limits of the time frame.

The setNear() feature accepts a location name (like Delhi, Maharashtra, India, etc.) or the latitude and longitude of that region. The central point of India is at approximately (22, 80) degrees. The setWithin() feature accepts the radius around that point, and 1800 km generally covers India and nearby places.

After executing more such queries with different keywords, we had thousands of tweets in hand, some on relevant topics and some irrelevant.
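As a rough sketch of how such queries can be put together, the snippet below pairs relationship words with action words using AND and wraps each query in a GetOldTweets3 criteria object. The keyword lists, the exact dates, and the tweet limit are placeholders for illustration, not the values used in the project.

```python
from itertools import product

# Hypothetical keyword lists; the project's real lists are not shown in this post.
RELATIONSHIPS = ["wife", "husband"]
ACTIONS = ["beaten", "harassed"]


def build_queries(relationships, actions):
    """Pair every relationship word with every action word using AND."""
    return [f"{rel} AND {act}" for rel, act in product(relationships, actions)]


def make_criteria(query, since="2020-01-01", until="2020-05-31",
                  near="22, 80", within="1800km", max_tweets=1000):
    """Build a GetOldTweets3 criteria object for one query (dates and limit are assumed)."""
    import GetOldTweets3 as got  # imported lazily so build_queries works without the library

    return (got.manager.TweetCriteria()
            .setQuerySearch(query)
            .setSince(since)
            .setUntil(until)
            .setNear(near)
            .setWithin(within)
            .setMaxTweets(max_tweets))


queries = build_queries(RELATIONSHIPS, ACTIONS)
print(queries)
```

Each criteria object would then be passed to got.manager.TweetManager.getTweets() to fetch the matching tweets.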

 

Data needs to be classified  –  Would topic modeling work?

Since a considerable number of tweets in our huge datasets were not related to the kind of harassment we were looking for, we needed some filtering. Classifying tweets into broad topics was the goal. Topic modeling was the first thing that came to mind.

Topic modeling is an unsupervised learning process that automatically identifies the topics present in a collection of documents and uncovers hidden patterns in a text corpus, thereby assisting better decision making.

Latent Dirichlet Allocation (LDA) is the most popular technique to do so. It assumes that documents are produced from a mixture of topics. Those topics then generate words based on their probability distributions. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.

 

 

Topic 0 words generally appear in awareness posts or in #BanTiktok posts about the app’s inappropriate content.
Topic 1 words come from headlines or real victim stories.

 

Topic modeling works best when the topics are considerably distinct or not too related to each other.

The generated topics didn’t meet our goal of separating relevant tweets from irrelevant ones, because our dataset as a whole talks about kinds of harassment and the topics overlapped too much. Hence, we had to pick another approach.

Rule-based classification turned out to be a more precise approach for this task. I created three sets of keywords to look for: relationships, violence verbs, and not-relevant words. The following algorithm was implemented in Python to filter out some documents.

 

Relationships: [List of keywords like wife, husband, housemaid etc.]
Verbs        : [List of keywords like harass, beat, abuse etc.]
Not-relevant : [List of keywords like webinar, politics, movie etc.]
Iterating through document:
{ 
  R1: (any word in Relationships) AND (any word in Verbs) -> 'Keep'
  R2: (any word in Not-relevant) -> 'Discard'
}
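The rules above can be sketched in Python roughly as follows. The keyword sets are just the examples listed above, and three details are assumptions of this sketch rather than the project's actual behavior: keywords match as word prefixes (so "harass" also catches "harassed"), the discard rule wins when both rules match, and documents matching neither rule are reported as "Unmatched".

```python
import re

# Example keywords taken from the lists above; the full lists were longer.
RELATIONSHIPS = {"wife", "husband", "housemaid"}
VERBS = {"harass", "beat", "abuse"}
NOT_RELEVANT = {"webinar", "politics", "movie"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def has_keyword(tokens, keywords):
    """True if any token starts with any keyword (assumed prefix matching)."""
    return any(tok.startswith(kw) for tok in tokens for kw in keywords)


def classify(text):
    tokens = tokenize(text)
    # R2 is checked first here: an assumption, the pseudocode does not fix the order.
    if has_keyword(tokens, NOT_RELEVANT):
        return "Discard"
    # R1: a relationship word together with a violence verb.
    if has_keyword(tokens, RELATIONSHIPS) and has_keyword(tokens, VERBS):
        return "Keep"
    return "Unmatched"


for tweet in [
    "My neighbour's husband beats her every night",
    "Join our webinar on how husbands abuse power",
    "Lovely weather today",
]:
    print(classify(tweet), "<-", tweet)
```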

 

The dataset was now filtered largely in line with our needs. Next came the task of modeling, but we first needed annotations to train a supervised model.

 

Deciding Labels and Annotating Tweets

Eminent domain experts helped in coming up with the categories to classify tweets based on their context. Proper labeling guidelines were set up, and training sessions helped annotators label tweets properly, keeping the edge cases in mind. For document classification, a quick and awesome tool called Doccano was used. Several collaborators helped by taking up queues of data points and annotating them. The following labels were used:

  • DV_OPINION_ADVOCATE
    (advocating against domestic abuse)
  • DV_OPINION_DENIER
    (denying the existence of domestic abuse)
  • DV_OPINION_INFO_NEWS
    (stating factual information or news)
  • DV_STORY
    (describing an incident of domestic abuse)
  • NON_D_VIOLENCE_ABOUT
    (other kinds of harassment)
  • NON_D_VIOLENCE_DIRECTED
    (harassment directed at individual or community)
  • NO_VIOLENCE

 

Domestic Violence Twitter

Analytics derived from NER.

 

 

And data for modeling is ready!

After all the annotations and some fabulous work with collaborators, we’re ready with an incredible training dataset.

Tinkering with Natural Language Processing…

Once the texts were pre-processed by lowercasing, removing URLs, punctuation, and stopwords, and then lemmatizing, we were ready to play around with modeling techniques.

To convert words to vectors (machines learn through numbers), we first experimented with a TF-IDF vectorizer. It gave good results, but our training vocabulary was very limited, while the inference data would contain a greater variety of words. Therefore, we decided to use pre-trained word embeddings.

Our model used FastText English word vectors (wiki-news-300d-1M.vec) and IndicNLP word vectors (indicnlp.v1.hi.vec) for the Hindi and Hinglish text present in the documents.

Since tweets related to DV stories were few in number, data augmentation was applied to them by creating new sentences using synonyms of the original words.
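A minimal sketch of these cleaning steps using only the standard library is shown below. The stopword set and the lemma lookup are tiny hand-made stand-ins; a real pipeline would rely on an NLP library for both.

```python
import re
import string

# Tiny illustrative stand-ins, not the resources used in the project.
STOPWORDS = {"a", "an", "the", "is", "was", "by", "her", "she", "and", "to", "of", "in"}
LEMMAS = {"beaten": "beat", "beating": "beat", "husbands": "husband"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip URLs and punctuation, drop stopwords, then lemmatize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove links before punctuation is stripped
    text = text.translate(PUNCT_TABLE)    # remove ASCII punctuation, including '#'
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return [LEMMAS.get(tok, tok) for tok in tokens]


print(preprocess("She was BEATEN by her husband! https://t.co/abc #help"))
```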
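Both files use the plain-text .vec layout: a header line with the vocabulary size and dimension, followed by one word and its vector per line. The sketch below parses that layout and averages the word vectors of a tweet into one document vector. Averaging, and looking words up in the English table before the Hindi one, are assumptions made for illustration; the toy vectors are three-dimensional and invented.

```python
import io


def load_vec(handle, vocab=None):
    """Parse .vec text: a 'count dim' header, then 'word v1 v2 ...' per line."""
    dim = int(handle.readline().split()[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Skip malformed rows and, if a vocabulary is given, words outside it.
        if len(values) != dim or (vocab is not None and word not in vocab):
            continue
        vectors[word] = [float(v) for v in values]
    return vectors, dim


def embed(tokens, *tables, dim):
    """Average the vectors of known tokens, trying each table in order."""
    found = []
    for tok in tokens:
        for table in tables:
            if tok in table:
                found.append(table[tok])
                break
    if not found:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(column) / len(found) for column in zip(*found)]


# Toy stand-ins for wiki-news-300d-1M.vec and indicnlp.v1.hi.vec.
en_vectors, dim = load_vec(io.StringIO("2 3\nwife 1.0 0.0 0.0\nbeat 0.0 1.0 0.0\n"))
hi_vectors, _ = load_vec(io.StringIO("1 3\npati 0.0 0.0 1.0\n"))

doc_vec = embed(["wife", "beat", "unknown"], en_vectors, hi_vectors, dim=dim)
print(doc_vec)
```

With the real files, one would pass an opened file handle and restrict vocab to the words seen in the tweets, since loading a million vectors into a dict is slow.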

nlpaug is a library for textual augmentation in machine learning experiments. The goal is to improve model performance by generating augmented textual data. It can also generate adversarial examples to help guard against adversarial attacks.
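The idea behind synonym-based augmentation can be sketched without any library, as below. The synonym dictionary is a tiny hand-made stand-in; nlpaug's SynonymAug augmenter draws its candidates from WordNet instead, and the replacement probability here is an assumed parameter.

```python
import random

# Hand-made synonym table for illustration only.
SYNONYMS = {
    "hit": ["struck", "beat"],
    "scared": ["afraid", "frightened"],
}


def synonym_augment(sentence, synonyms, rng, p=0.5):
    """Replace each word that has synonyms with a random one, with probability p."""
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


rng = random.Random(0)  # seeded so runs are repeatable
original = "he hit me and i am scared"
augmented = [synonym_augment(original, SYNONYMS, rng) for _ in range(3)]
for sentence in augmented:
    print(sentence)
```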

 

 

Bringing into play  – Machine Learning Model(s)

A number of models including BERT, SVM, an XGBoost classifier, and many more were evaluated. Since the differences between some similar classes were really minute, we needed to combine two sets of labels.

After combining similar labels:

 

 

Limitations faced: the classes were not easily separable, because three of them plainly talked about Domestic Violence (story, opinion, news/info), which made it tough for the classifier to spot marginal variations in semantics.

Also, DV_STORY had the fewest samples, even though it was the most relevant class.

Hence, to deal with the imbalanced dataset, we applied undersampling with NeighbourhoodCleaningRule from the imbalanced-learn library. The resampled data was fed to stacked models.

Stacking is a way of combining the predictions of several different types of ML models through a meta-learner, a second-level model that is trained on the predictions of the first-level models.

 

 

Source: GeeksforGeeks

 

Level 0 learners:
– Random Forest Classifier
– Support Vector Classifier
– MLP Classifier
– LGBM Classifier

Level 1 meta-learner:
SVC with hyperparameter tuning and custom class weights.

 

Class Encoding — 0: DV_INCIDENT, 1: DV_OPINION, 2: DV_OPINION_INFO_NEWS, 3: NON_D_VIOLENCE_ABOUT, 4: NO_VIOLENCE
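In code, the merge and the encoding could look like the sketch below. The encoding is the one given above, but the merge mapping is only one plausible reading of "combining two sets of labels" (the two opinion labels and the two non-domestic labels, with DV_STORY renamed to DV_INCIDENT); the original table of combined labels is not reproduced in this post, so treat the mapping as an assumption.

```python
# Encoding as stated in the post.
CLASS_ENCODING = {
    "DV_INCIDENT": 0,
    "DV_OPINION": 1,
    "DV_OPINION_INFO_NEWS": 2,
    "NON_D_VIOLENCE_ABOUT": 3,
    "NO_VIOLENCE": 4,
}

# Assumed mapping from the seven annotation labels to the five classes.
MERGE = {
    "DV_STORY": "DV_INCIDENT",
    "DV_OPINION_ADVOCATE": "DV_OPINION",
    "DV_OPINION_DENIER": "DV_OPINION",
    "DV_OPINION_INFO_NEWS": "DV_OPINION_INFO_NEWS",
    "NON_D_VIOLENCE_ABOUT": "NON_D_VIOLENCE_ABOUT",
    "NON_D_VIOLENCE_DIRECTED": "NON_D_VIOLENCE_ABOUT",
    "NO_VIOLENCE": "NO_VIOLENCE",
}

# Reverse lookup for turning predicted integers back into class names.
DECODING = {code: name for name, code in CLASS_ENCODING.items()}


def encode(annotation_label):
    """Map an annotation label to its merged class and then to its integer code."""
    return CLASS_ENCODING[MERGE[annotation_label]]


print(encode("DV_OPINION_DENIER"), DECODING[encode("DV_OPINION_DENIER")])
```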

 

This pretty much sums up the modeling. The model was used to predict labels on a dataset of 8,000 tweets. The misclassifications in some crucial classes were skimmed through and corrected in order to deliver the best data.

I feel glad to be a part of this incredible community of change-makers. Making some great connections through this amazing journey is the icing on the cake.

So excited to collaborate in the upcoming mind-blowing projects, making the world a better place using AI for good!

 

 

 

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 
