AI Meets a 96-Year-Old NGO: Improving Case Management for Cross-Border Child Protection


How can AI and Natural Language Processing (NLP) help alleviate social workers’ administrative burden in case management?

 

By Shrey Grover and Jianna Park


 

The social service sector is increasingly showing interest in turning to data-driven practices which, until now, were predominantly utilized by its commercial counterparts.

Some of the key reforms that social organizations expect from leveraging data include:

  1. Harness the potential of the underlying gold mine of expert knowledge
  2. Relieve the limited staff of repetitive administrative and operational tasks
  3. Address missions on a shorter timeline with enhanced efficiency

Yet, according to IBM’s 2017 study, most of the sector seems to be in the early stages of the data journey, as shown in the visual below. Constrained budgets and limited access to technology and talent are cited as the major hurdles in utilizing analytical services.

 


67% of nonprofits are in the preliminary stages of using data (source)

 

We at Omdena had an unparalleled opportunity to work on one such nascent-stage project for International Social Service (ISS), a 96-year-old NGO that contributes massively to resolving child protection cases.

 

Why This Project?

ISS has a global network of expertise in providing essential social services to children across borders, mainly in the domains of child protection and migration. However, with over 70,000 open cases per year, ISS caseworkers were facing challenges in two aspects: managing time and managing data. The challenges were often exacerbated by administrative backlogs and a high turnover rate, not uncommon in the nonprofit sector.

If we could find a way to significantly reduce the time lost on repetitive administrative work, caseworkers could focus on the more direct, high-impact tasks, helping more children and families access better quality services. ISS saw an urgent need for a technological transformation in the way they managed cases, and this is where Omdena came into the picture.

 

Problem Statement

The overarching goal of this project was to improve the quality of case management and avoid unnecessary delays in service. Our main question remained: how can we help save caseworkers’ time and leverage their data in a meaningful way?

It was time to break down the problem into more specific targets. We identified factors that hinder ISS caseworkers from focusing on the client-facing activities, seen in the following picture.


Graphic by Omdena collaborator Bima Putra Pratama 

 

We saw that these subproblems could each be solved with different tools, which would help organizations like ISS understand the various ways that machine learning can be integrated with their existing system.

We also concluded that if we could manage data in a more streamlined manner, we could manage time more efficiently. Therefore, we decided that introducing a sample database system would prove to be beneficial.

 

Initial Challenge

The biggest challenge we faced as a team was the data shortage. ISS had a strict confidentiality agreement with their clients, which meant they couldn’t simply give us raw case files and expose private information.

Initially, ISS gave us five completed case files with names either masked or altered. As manually editing cases would have taken up caseworkers’ client-facing time, our team at Omdena had to find another way to augment data.

 

Solution

Our team collectively tackled the main problem from various angles, as follows:


Graphic by Bima Putra Pratama

 

As we had only five data points to work with and were not authorized to access ISS’s data pool, we clarified that our final product would be a proof-of-concept rather than a production-ready system.

Additionally, keeping in mind that our product was going to be used by caseworkers who may not have a technical background, we consolidated the final deliverables into a web application with a simple user interface.

 

Data Collection

Due to the limited number of cases available at the start of the project, the first task at hand was to collect more child-related cases from various sources. We concentrated mainly on child abuse and migration cases. We gathered case files and success stories that were publicly available on ISS partner websites, Malawi’s Child Protection Training Manual, Bolt Burdon Kemp, and Act For Kids. We also collected a catalog of child welfare court cases from the European Court of Human Rights (ECHR), WorldCourt, US Case Law, and LawCite. In the end, we managed to collate a dataset of about 230 cases and were ready to utilize these in our project pipeline.

 

Data Engineering

We relied on a supervised learning approach for our risk score prediction model. For this, we manually labeled risk scores for each of our cases. A risk score, a float value ranging from 0 to 1, is intended to highlight priority cases (i.e. cases that involve a threat to a child’s wellbeing or have tight time constraints) that require the immediate attention of caseworkers.

The scores were assigned by taking into consideration various factors, such as the presence and history of abuse within the child’s network, access to education, the caretaker’s willingness to care for the child, and so on. To reduce bias, three collaborators provided a risk score for each case, and the average of the three was taken as the final risk score for that case.

 


Manual risk score assignment process

 

Finally, we demarcated the risk scores into three categories, using the following thresholds.

Risk score category thresholds

 

Additionally, we augmented our data — which originally only contained case text — by adding extra information such as case open and close dates, type of service requested, and country where the service is requested. Using this, we created a seed file to populate our sample database. These parameters would later help caseworkers see how applying a simple database search and filter system can enable dynamic data retrieval.

 

Preprocessing

Next, we moved to data preprocessing, which is crucial in any data project pipeline. To generate clean, formatted data, we implemented the following steps (a minimal sketch follows the list):

  • Text cleaning: Since the case texts were pulled from different sources, we had different kinds of noise to remove, including special characters, unnecessary numbers, and section titles.
  • Lowercasing: We converted the text to lower case to avoid multiple copies of the same words.
  • Tokenization: Case text was further converted into tokens of sentences and words so we could access them individually.
  • Stop word removal: As stop words did not contribute to certain solutions we worked on, we considered it wise to remove them.
  • Lemmatization: For certain tasks like keyword and risk factor extraction, it was necessary to reduce words to their lemmatized form (e.g. “crying” to “cry,” “abused” to “abuse”), so that words with the same root are not counted multiple times.
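
To make the steps above concrete, here is a minimal preprocessing sketch in Python. It assumes spaCy for tokenization, stop word removal, and lemmatization; the exact libraries and cleaning rules the team used are not specified in the article, so the regex rule and the sample sentence are purely illustrative.

import re
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(case_text):
    # Text cleaning: keep letters only (illustrative rule, not the project's exact one)
    cleaned = re.sub(r"[^A-Za-z\s]", " ", case_text)
    # Lowercasing
    cleaned = cleaned.lower()
    # Tokenization, stop word removal, and lemmatization in one spaCy pass
    doc = nlp(cleaned)
    return [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space]

print(preprocess("The child was crying and had been abused for 2 years."))
# e.g. ['child', 'cry', 'abuse', 'year']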

 

Feature Extraction

We had to convert the case texts into compact numerical representations of fixed length to make them machine-readable. We considered four different embedding methods: Term Frequency-Inverse Document Frequency (TF-IDF), Doc2Vec, Universal Sentence Encoder (USE), and Bidirectional Encoder Representations from Transformers (BERT).

To choose the one that works best for our case, we embedded all cases using each embedding method. Next, we reduced the embedding vector size to 100 dimensions using Principal Component Analysis (PCA). Then, we used a hierarchical clustering method to group similar cases as clusters. To find the optimal number of clusters for our problem, we referred to the dendrogram plot. We finally evaluated the quality of the clusters using Silhouette scores.

After performing these steps for all four algorithms, we observed the highest Silhouette score for USE embeddings, which was then selected as our embedding model.
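
A rough sketch of this embedding comparison is below, assuming the TF-Hub release of the Universal Sentence Encoder; the cluster count and variable names are illustrative rather than the project's exact configuration. The same routine can be repeated with TF-IDF, Doc2Vec, and BERT vectors in place of USE to reproduce the comparison.

import tensorflow_hub as hub
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embedding_cluster_quality(case_texts, n_clusters=5):
    embeddings = use(case_texts).numpy()                        # 512-dim USE vectors
    reduced = PCA(n_components=100).fit_transform(embeddings)   # reduce to 100 dimensions
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(reduced)
    return silhouette_score(reduced, labels)                    # higher is better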

 

Models & Algorithms

Text Summarization

Multiple pre-trained extractive summarizers were tried, including BART, XLNet, BERT-SUM, and GPT-2, made available thanks to the HuggingFace Transformers library. As evaluation metrics such as ROUGE-N and BLEU required far more reference summaries than we had, we opted for a relative performance comparison and checked the quality and noise level of each model’s outputs. Inference speed then played a major role in determining the final model for our use case, which was XLNet.


Time each model took to produce a sample summary, in seconds
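
As a hedged illustration of running one of these summarizers, the snippet below uses the bert-extractive-summarizer package's TransformerSummarizer wrapper around HuggingFace models; whether this exact package was used in the project is an assumption, and the input file name is a placeholder.

from summarizer import TransformerSummarizer

xlnet_model = TransformerSummarizer(transformer_type="XLNet",
                                    transformer_model_key="xlnet-base-cased")

# Extract roughly the top 20% most salient sentences from a case file
case_file_text = open("sample_case.txt").read()   # hypothetical input file
summary = xlnet_model(case_file_text, ratio=0.2)
print(summary)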

 

Keyword & Entity Relation Extraction

Keywords were obtained from each case file using RAKE, a keyword extraction algorithm that determines high-importance phrases based on their frequencies in relation to other words in the text.

For entity relations, several techniques using OpenIE and AllenNLP were tried, but each had its own set of drawbacks, such as producing instances of repetitive information. So we implemented our own custom relation extractor using spaCy, which better identified subject and object nodes as well as their relationships based on root dependency.


Entity relation graph made via Plotly
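
A simplified version of such a dependency-based extractor is sketched below; the real extractor used more elaborate rules, so the dependency labels handled here and the sample sentence are illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_relations(text):
    triples = []
    for sent in nlp(text).sents:
        root = sent.root                                          # main verb of the sentence
        subjects = [t for t in root.lefts if t.dep_ in ("nsubj", "nsubjpass")]
        objects = [t for t in root.rights if t.dep_ in ("dobj", "pobj", "attr")]
        for subj in subjects:
            for obj in objects:
                triples.append((subj.text, root.lemma_, obj.text))
    return triples

print(extract_relations("The stepfather repeatedly beat the child."))
# e.g. [('stepfather', 'beat', 'child')]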

 

Similarity Clustering

The pairwise similarity was computed between a given case and the rest of the data based on USE embeddings. Among Euclidean distance, Manhattan distance, and cosine similarity, we chose cosine similarity as our distance metric for two reasons.

First, it works well with unnormalized data. Second, it takes into account the orientation (i.e. angle between the embedding vectors) rather than the magnitude of the distance between the vectors. This was favorable for our task as we had cases of various lengths, and needed to avoid missing out on cases with diluted yet similar embeddings.

After getting similarity scores for all cases in our database, we fetched the five cases with the highest similarity values to the input case.
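
In code, the lookup can be as small as the sketch below; case_embeddings and query_embedding are placeholder names for the stored USE vectors and the embedded input case.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_similar_cases(query_embedding, case_embeddings, k=5):
    # Cosine similarity between the input case and every stored case
    scores = cosine_similarity(query_embedding.reshape(1, -1), case_embeddings)[0]
    top_idx = np.argsort(scores)[::-1][:k]      # indices of the k most similar cases
    return list(zip(top_idx, scores[top_idx]))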

Risk Score Prediction

A number of regression models were trained using document embeddings as input and manually labeled risk scores as output. AutoKeras, Keras, and XGBoost were some of the libraries used. The best performing model, our custom Keras sequential neural network, was selected based on root mean square error (RMSE).


Comparison of risk prediction model accuracies
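
The project's exact architecture and hyperparameters are not published here, so the following Keras sketch is only an assumption of what such a sequential regression model could look like; the 512-dimensional input matches USE embeddings.

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(512,)),  # USE embedding size
    keras.layers.Dropout(0.2),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),     # risk score bounded to [0, 1]
])
model.compile(optimizer="adam", loss="mse",
              metrics=[keras.metrics.RootMeanSquaredError()])
# model.fit(X_train, y_train, epochs=100, validation_split=0.2)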

 

Abuse Type & Risk Factor Extraction

We created more domain-specific tools to generate another source of insight via algorithms to find primary abuse types and risk factors.

For the abuse type extraction, we defined eight abuse-related verb categories such as “beat,” “molest,” and “neglect.” spaCy’s pre-trained English model en_core_web_lg and part-of-speech (POS) tagging were used to extract verbs and transform them into word vectors. Using cosine similarity, we compared each verb against the eight categories to find which abuse types most accurately capture the topic of the case.

Risk factor extraction works in a similar way: text was also tokenized and preprocessed using spaCy. This algorithm, however, extended the abuse verbs with additional risk words, such as “trauma,” “sick,” “war,” and “lack.” Instead of only looking at verbs, we compared each word in the case text (excluding custom stop words) against the risk words. Words that had a similarity score of over 0.65 with any of the risk factors were presented in their original form. This addition aimed to provide more transparency over which words may have affected the risk score.
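
The sketch below shows the general idea of the verb-to-category matching; the three categories listed are only a subset of the eight used, and the 0.6 threshold for abuse types is an assumption (the article only states the 0.65 threshold for risk factors).

import spacy

nlp = spacy.load("en_core_web_lg")                  # model with word vectors
abuse_categories = ["beat", "molest", "neglect"]    # subset of the eight categories
category_tokens = [nlp(c)[0] for c in abuse_categories]

def detect_abuse_types(case_text, threshold=0.6):
    detected = set()
    for token in nlp(case_text):
        if token.pos_ == "VERB" and token.has_vector:
            for cat in category_tokens:
                if token.similarity(cat) >= threshold:
                    detected.add(cat.text)
    return detected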

 

Web Application

To put these models together in a way that ISS caseworkers could easily understand and use, a simple user interface was developed using Flask, a lightweight Python web application framework. We also created forms via WTForms and graphs via Plotly, and let Bootstrap handle the overall styling of the UI.

JavaScript code implementing the Google Translate API was incorporated into the HTML templates, enabling the translation of any page within the app into 108 languages.

For the database, we used PostgreSQL, a relational database management system (RDBMS), along with SQLAlchemy, an object-relational mapper (ORM) that allows us to communicate with our database in a programmatic way. Our dataset, excluding the five confidential case files initially provided by ISS, was seeded into the database, which was then hosted on Amazon RDS.
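
A minimal sketch of how such a Flask + SQLAlchemy setup fits together is shown below; the field names, route, and connection string are illustrative and not the project's actual schema.

from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://user:pass@localhost/iss_cases"
db = SQLAlchemy(app)

class Case(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    case_text = db.Column(db.Text)
    summary = db.Column(db.Text)
    risk_score = db.Column(db.Float)
    country = db.Column(db.String(64))

@app.route("/cases/<country>")
def cases_by_country(country):
    # Simple filter query, e.g. /cases/Malawi
    rows = Case.query.filter_by(country=country).all()
    return jsonify([{"id": c.id, "risk_score": c.risk_score} for c in rows])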

 


The seeded database also includes fields like summary, risk score, and other model outcomes

 


Running models on a new case

 


Querying the database

 

A public Tableau dashboard to visualize the case files was also added, should caseworkers wish to refer to external resources and gain further insight on case outcomes.

 


Dashboard showcasing additional child court cases as an external point of reference (source)

Conclusion

The aim of this project was to assist ISS in offering services in a timely manner, bearing in mind the organization’s available history of knowledge. Within eight weeks, we achieved this goal by providing an application prototype that helps caseworkers understand some of the various ways to leverage the power of data.

This tool, upon continued development, will be the first step toward ISS’s AI journey. And with enhanced capabilities, both experienced and less experienced caseworkers will be able to make better-informed decisions.

Limitations

Our models do come with some limitations, mainly stemming from limited data due to privacy reasons and time constraints.

As our dataset only accounted for two types of services (child abuse and migration), and came from a small number of geographical sources, the risk score prediction model may contain biases. Bias could also have been induced by the manual labeling of risk scores, which was done based on our assumptions.

The solutions provided are not meant to replace human involvement. As 100% accuracy of a machine learning model can be difficult to achieve, the tool works best in combination with the judgment of a caseworker.

Moving Forward: AI and case management

To bring this tool closer to a production level, a few improvements can be made.

Incorporating more of the official ISS data would allow fine-tuning of the models, which would yield better results. This can be done without breaching client confidentiality agreements, by further training the models within the organization’s secure server, or by introducing a differential privacy approach that shares only patterns in the data.

Furthermore, risk scores can be validated by ISS caseworkers. They can provide additional risk scores as they use our prediction model, enabling continuous learning.

The database fields can be more granular and include additional attributes as caseworkers see fit. For example, the current field “case_text” can be divided into “background_info,” “outcome,” and so on. This will create more flexibility in document search and flagging missing information.

Finally, once the app is productionized and deployed on a platform like AWS, ISS caseworkers across the globe will have access to these tools, as well as access to the entire network’s pool of resources — truly empowering caseworkers to do the work that matters.

Analyzing Mental Health and Youth Sentiment Through NLP and Social Media


By Mateus Broilo and Andrea Posada Cardenas

 

We are living in an era where life passes so quickly that mental illness has become a pivotal issue, and perhaps a gateway to other diseases.

As Ferris Bueller once said:

“Life moves pretty fast. If you don’t stop and look around once in a while, you could miss it.”

This fear of missing out has caused people of all ages to suffer from mental health issues like anxiety, depression, and even suicide ideation. Contemporary psychology tells us that this is expected — simply because we live on an emotional roller coaster every day.

The way our society functions in the modern day can present us with a range of external contributing factors that impact our mental health — often beyond our control. The message here is not that the odds are hopelessly stacked against us, but that our vulnerability to anxiety and depression is not our fault. — Students Against Depression

According to the WHO, good mental health is “a state of well-being in which every individual realizes his or her own potential, can cope with the normal stresses of life, can work productively and fruitfully, and is able to make a contribution to her or his community.” WordNet, meanwhile, defines it as “the psychological state of someone who is functioning at a satisfactory level of emotional and behavioral adjustment”. Neither is a perfect definition, but they hint at which indicators to look for, e.g. “emotional and behavioral adjustment”.

It is projected that this year (2020) around 1 in 4 people will experience mental health problems. Low-income countries have an estimated treatment gap of 85%, compared with 35% to 50% in high-income countries.

Every single day, tons of information is thrown into the wormhole that is the internet. Millions of young people absorb this information and see the world through the lens of online events and others’ opinions. Social media is a playground for all this information and has a deep impact on the way our youth interacts. Whether by contributing to a movement on Twitter or Facebook (#BlackLivesMatter), staying up to date with the latest news and discussions on Reddit (#COVID19), or engaging in campaigns simply for the greater good, the digital world is where the magic happens and makes worldwide interactions possible. The digital ecosystem, not always a friendly one, plays a crucial role and represents an excellent opportunity for analysts to understand what today’s youth think about their future.

Take a look at the article written by Fondation Botnar on young people’s aspirations.

 

The power of sentiment analysis

Sentiment analysis, a.k.a. opinion mining or emotional artificial intelligence (AI), uses text analysis and NLP to identify affective patterns in data. Therefore, a wise question could be: how do the polarities change?
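
The article does not name the sentiment library used; as an illustration, the sketch below uses TextBlob, whose polarity score also runs from -1 (negative) to +1 (positive), on two invented posts.

from textblob import TextBlob

posts = [
    "I feel hopeless and alone these days.",
    "Finally got help and I am feeling much better!",
]
for post in posts:
    # polarity in [-1, 1]; values near 0 are neutral
    print(round(TextBlob(post).sentiment.polarity, 2), post)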

 

Top Mental Health keywords from Reddit and Twitter

 

 

Violin plots

Considering a data set scraped from Reddit and Twitter between 2016 and 2020, these “dynamic” polarity distributions can be expressed using violin plots.

 

 

Sentiment Violin-Plot hued by Year

Sentiment violin plots by year. Positive values refer to positive sentiments, whereas negative values indicate negative sentiments. The thicker part of a violin means values in that section have a higher frequency, and the thinner part implies a lower frequency.

 

 

 

On one hand, we see that as the years go by, polarity tends to become more and more neutral. On the other hand, it is difficult to understand which sentiment falls into which category, and what the model categorizes as positive versus negative for each year. Also, text sentiment analysis is subjective and does not really spot complex emotions like sarcasm.

 

Violin plots according to label

So the next attempt was to see polarities according to labels: anxiety, depression, self-harm, suicide, exasperation, loneliness, and bullying.

 

 


Sentiment Violin Plots by label

 

 

Even if we try to see the polarities by label, we might end up with surface-level results instead of crisp insights. Look at Self-harm: what is the meaning of positive self-harm? Yet it is still there in the green plot.

We see that most of the polarities are distributed close to the limits of the neutral region, which is ambiguous, since it can be viewed as either a lack of positiveness or a lack of accurate sentiment categorization. The question is: how do we gain better insights?

So we tried plotting the mean (average) sentiment per year per label.

 

 

Mean Sentiment per Year hued by label

 

 

Notice that Depression was the only label whose mean sentiment decreased for two consecutive years, passing from positive (2017–2019) to neutral in 2020. Moreover, the Loneliness and Bullying classes are depicted with only one mark each, because they appear only in the data scraped from January to June 2020.

 

Depression-label word cloud

Before pressing on, let’s take a look at the Depression-label word cloud. Here we can detect a lot of “emotions” besides the huge “depression” in green, e.g. “low”, “hopeless”, “financial”, “relationship”.

 

Keywords relating to mental health

Source: Omdena

 

 

These are just the most frequent words associated with posts labeled as Depression, and they do not necessarily convey the feelings behind them. However, there is a huge “feel” there. Why? This is related to one of the most common words in the data set: “feel” is actually the 6th most common word overall. In a more in-depth analysis aiming to find interconnections among topics, “feel” would certainly be used as one of the most prominent edges.

 

 


“Feel” Knowledge Graph

 

 

This knowledge graph shows all the nodes where “feel” is used as the edge connector. Insightful, but not very readable.

In fact, there is a much better approach that performs text analysis across lexical categories. So now the question is: what other feelings related to mental health issues should we be looking for?

 

 

 

Empath analysis

The main objective of Empath analysis is to connect the text to a wide range of sentiments beyond just negative, neutral, and positive polarities. Now we are able to go far beyond trying to detect subjective feelings; for example, look at the second and third lexicons, “sadness” and “suffering”. Empath uses similarity comparisons to map the vocabulary of the text (our data set is composed of Reddit and Twitter posts) across Empath’s 200 categories.

 

 

Empath value vs. lexicon

 

 

 

The Empath value for a category is calculated by counting how many times its words appear, normalized by the total number of emotion terms spotted and the total number of words analyzed. Now we are able to go much deeper and truly connect the sentiment present in the text to a real emotion, rather than just counting the most frequent words and assuming whether they relate to something good or bad.
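
For illustration, the empath Python package can be called as below; the categories listed are a subset of Empath's built-in lexicons and the input text is invented.

from empath import Empath

lexicon = Empath()
text = "I feel so sad and ashamed, I cannot stop crying at night."
scores = lexicon.analyze(text, categories=["sadness", "suffering", "shame",
                                            "nervousness", "hate"], normalize=True)
print(scores)   # normalized score per category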

 

 

Empath value vs. year

Emotion trends hued by lexicon

 

 

We chose five lexicons most closely associated with mental health issues, shown in the plot: “nervousness”, “suffering”, “shame”, “sadness” and “hate”, and tracked these five emotions for each year analyzed. And guess what? Sadness skyrocketed in 2020.

 

 

Sentiment analysis in the post-COVID world

The year 2020 turned our lives upside down. From now on we will most likely have to rethink the way we eat, travel, have fun, work, and connect. In short, we will have to rethink our entire lives.

There is absolutely no question that the COVID-19 pandemic plays an essential role in mental health analysis. To take its impact into account, and since COVID-19 began to spread worldwide in January, we selected all the data from January to June 2020 for this analysis. Take a look at the word clouds related to the COVID-19 analysis from May and June.

 

 

 

COVID 19 Top keyword Analysis of Mental Health

 

 

Covid 19 Top Keyword Analysis of Mental Health

 

 

 

We can see words like help, anxiety, loneliness, health, depression, and isolation. We can consider that they reflect the emotional state of people on social media. As noted earlier, polarity-based sentiment analysis is not especially insightful, but we display the violin plots below for comparison.

 

 

Sentiment Violin-plot for COVID 19 Analysis by Months

 

 

Sentiment Violin-plot for COVID 19 Analysis by Label

 

 

Now we see a very different pattern from the previous one. Why is that? Because we are now filtering by the COVID-19 keywords, and indeed the sentiment distribution now seems to make sense. Looking more closely at the distribution of the data, the following is observed.

 

 

Graph of number of relatable words vs count of words

 

 

In the word count from the sample of texts from 2020, only 2.59% contain words related to COVID-19. The words we used are “corona”, “virus”, “COVID”, “covid19”, “COVID-19” and “coronavirus”. Furthermore, the frequency of occurrence decreases as the number of related words found increases, with related words appearing at most three times in the same text.
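
A small sketch of this keyword filter is below; the counting logic is illustrative, and posts stands in for the 2020 sample of texts.

import re

covid_terms = ["corona", "virus", "covid", "covid19", "covid-19", "coronavirus"]
pattern = re.compile("|".join(map(re.escape, covid_terms)), flags=re.IGNORECASE)

def covid_mention_stats(posts):
    counts = [len(pattern.findall(p)) for p in posts]
    share = sum(1 for c in counts if c > 0) / len(posts)
    return share, counts   # fraction of posts with a match, and matches per post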

So far, we have presented the distribution of sentiments for specific words related to COVID-19. Nonetheless, questions about how these words relate to the general sentiment during the time period under analysis have not been answered yet.

The general sentiment has been deteriorating, i.e. becoming more negative, since the beginning of 2020. In particular, June is the month with the most negative sentiment, which coincides with the month with the most new COVID-19 cases in the period considered. Comparing words related to COVID-19 with words that are completely unrelated, more negativity in sentiment is perceived in the former.

 

 

Graph between sentiment vs months in 2020

The sentiment by label is observed again, this time from January 2020 to June 2020 only.

 

 

Violin Plot by label 2020

 

 

Exasperation remains stable, with February being the month that attracts the most attention due to its negativity compared to the rest. Likewise, self-harm is quite stable; the months that stand out for their negativity in this category are March and June. Contrary to self-harm, for suicide, March is not a negative month. However, the rest of the months between February and June not only show a decline in sentiment, which worsens over time, but are also notably negative. June draws attention for having both really positive and really negative sentiments (high polarities), which does not happen in the other months. It has yet to be verified, but it could be that the number of suicides has been increasing in recent months. Regarding anxiety, a downward trend in sentiment is also observed between February and May. Finally, one should be careful with loneliness, given the highly negative perception in May and June. Given that there is only data from June 2020 for Bullying, this label is not analyzed.

The next figure presents the time series of sentiment between May 2019 and June 2020. A slight downward trend can be observed, meaning that the general sentiment has become more negative. Additionally, some days present greater negativity, indicated by the troughs. Most of the troughs in the present year are found in the months since April.

 

 

Sentiment Analysis from 2019-05

Incidents that moved the youth

 

 

 

There are other major incidents, besides COVID-19, that have influenced the youth to call for help and speak up in 2020. The murder of George Floyd was a turning point and ignited the #BlackLivesMatter movement. Have a look at the word cloud with the most frequent and insightful words.

The youth gathered to protest against racism and call for equality and freedom worldwide. The Empath values related to Racism and Mental Health are displayed below.

 


Normalized Empath analysis

 

 

The COVID-19 pandemic has led the world towards a global economic crisis: massive unemployment, lack of food, lack of medicines. Perhaps the big question is: how will the pandemic affect the younger generations and the generations to come? Unfortunately, there is no answer to this question, except that the economic crisis we are presently living through is definitely going to affect the younger generation, because they are the ones who will study, go to college, and find a job in the near future. The big picture tells us that unemployment is increasing on a daily basis and there are not enough resources for all of us. The word cloud at the opening of the article reflects some of the most frequent words related to the current economic crisis.

Analyzing Domestic Violence in Lockdowns: From No Data to Building an ML Classifier for Tweets


By Omdena Collaborator Harshita Chopra

 
Data mining, topic modeling, document annotations, NLP, and stacking machine learning models: A complete journey.

Artificial Intelligence and its possibilities have always fascinated me. Making machines learn through data is nothing short of amazing. When I learned about Omdena and its wonderful initiative of bringing AI to social good through the power of global collaboration, I could not stop myself from participating in its empowering challenges.

I felt truly content to be given the role of a Machine Learning Engineer in my first challenge. Connecting with a team of 50 fabulous collaborators from various countries around the world, including domain experts, data scientists, and AI experts — felt like a golden opportunity to gain knowledge in the best possible way.

It provided a space to create value out of my ideas and learn from the enhancements. The harmonizing environment gave me the experience of leading task groups and interacting with some really innovative minds.

In this blog post, I’ll walk you through a major part of the project I led and contributed to for the past month.

 

The Problem

There has been a surge in domestic violence (DV) and online harassment cases during the COVID-19 lockdowns in India. Homes are no longer a safe place for victims trapped in abusive relationships with their family members.

Domestic violence involves a pattern of psychological, physical, sexual, financial, and emotional abuse. Acts of assault, threats, humiliation, and intimidation are also considered acts of violence.

Data substantiating domestic violence from government resources is only available in summary form. Incidents are largely reported via phone calls, which makes data collection and subsequent mapping difficult.

The goal of the challenge was to collect and analyze data from different social media platforms and news sources so as to gain insights into the rise in DV incidents during the nationwide lockdown.

 

The Solution

Diverse social media platforms are a huge and largely untapped resource for social data and evidence. They generate a vast amount of data on a daily basis on a variety of topics. Consequently, they represent a key source of information for anyone seeking to study various issues, even socially stigmatized and tabooed topics like domestic violence.

Victims experiencing abuse need earlier access to specialized services such as health care, crisis support, and legal guidance. Hence, social support groups play a leading role in raising awareness and providing various dimensions of support, emotional, instrumental, and informational, to victims.

Red Dot Foundation plans to deal with this challenge. When victims seek help, it is important to identify and analyze those critical posts and acknowledge the help needed with more immediate impact.

Tasks were divided to mine data from different sources: Twitter, Reddit, YouTube, News articles, Government reports, and Google trends. After the acquisition of huge amounts of data, the next step was filtering out relevant posts through topic modeling and keywords. This was followed by annotation of data and then building an NLP based machine learning classifier.

In this blog post, Tweets would be in the spotlight!

 

 

Scraping data with the right queries

Tweets needed to be extracted from both the pre-lockdown and lockdown periods so as to judge the surge in domestic violence. Hence, we took a time frame of January 2020 to May 2020.

Tweepy, the Python library for Twitter’s official API, extracts tweets only from the past seven days, which was a bothersome limitation. Hence, we needed an alternative for mining older tweets with location information.

GetOldTweets3 is a fantastic Python library for this task. Twitter’s advanced search can do wonders for generating your customized query. In order to extract harassment-related posts, here are a few examples of queries we used:

 

 

 

Using AND combinations of relationship words with actions and nouns yields good results. The until and since attributes hold the limits of the time frame.

The setNear() feature accepts a location name (like Delhi, Maharashtra, India, etc.) or the latitude and longitude of that region. The central point of India is approximately (22, 80) degrees. The setWithin() feature accepts the radius around that point; 1800 km generally covers India and nearby places.
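
Putting these pieces together, a query might look like the sketch below; the search string and limits are examples, not the project's exact queries.

import GetOldTweets3 as got

criteria = (got.manager.TweetCriteria()
            .setQuerySearch("wife AND (beaten OR abused)")
            .setSince("2020-01-01")
            .setUntil("2020-05-31")
            .setNear("22, 80")        # approximate central point of India
            .setWithin("1800km")
            .setMaxTweets(500))

tweets = got.manager.TweetManager.getTweets(criteria)
texts = [t.text for t in tweets]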

After executing more such queries with different keywords, we had thousands of tweets in hand, some on relevant topics and some irrelevant.

 

Data needs to be classified  –  Would topic modeling work?

Since a considerable number of tweets in our huge datasets were not related to the kind of harassment we were looking for, we needed some filtering. Classifying tweets into broad topics was the goal. Topic modeling was the first thing that clicked.

Topic modeling is an unsupervised learning process to automatically identify topics present in a collection of documents and to derive hidden patterns exhibited by a text corpus. Thus, assisting better decision making.

Latent Dirichlet Allocation is the most popular technique to do so. It assumes that documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.

 

 

Topic 0 words are generally included in awareness posts or #BanTiktok posts due to their inappropriate content.
Topic 1 words are headlines or real victim stories.

 

Topic modeling works best when the topics are considerably distinct or not too related to each other.

The generated topics did not satisfy our target of classifying tweets as relevant or irrelevant, so we had to pick another approach, since our dataset in general talks about kinds of harassment.

Rule-based classification turned out to be a more precise approach for this task. I created three sets of keywords to look for: relationships, violence verbs, and not-relevant words. The following algorithm was implemented in Python to filter out documents.

 

relationships = {"wife", "husband", "housemaid"}   # relationship keywords
violence_verbs = {"harass", "beat", "abuse"}       # violence verbs
not_relevant = {"webinar", "politics", "movie"}    # irrelevant keywords

def filter_document(text):
    words = set(text.lower().split())
    # R1: keep if a relationship word AND a violence verb both appear
    if words & relationships and words & violence_verbs:
        return "Keep"
    # R2: discard if any not-relevant word appears
    if words & not_relevant:
        return "Discard"
    return "Discard"

labels = [filter_document(doc) for doc in documents]   # iterate through the scraped tweets

 

The dataset was filtered pretty much based on our partner’s needs. Next came the task of modeling, but first we needed annotations for training a supervised model.

 

Deciding Labels and Annotating Tweets

Eminent domain experts helped in coming up with the categories to classify tweets based on their context. Proper labeling guidelines were set up, and training sessions helped collaborators label tweets properly, keeping in mind the edge cases. For document classification, a quick and awesome tool called Doccano was used. Several collaborators helped by taking up queues of data points and annotating them. The following labels were used:

  • DV_OPINION_ADVOCATE
    (advocating against domestic abuse)
  • DV_OPINION_DENIER
    (denying the existence of domestic abuse)
  • DV_OPINION_INFO_NEWS
    (stating factual information or news)
  • DV_STORY
    (describing an incident of domestic abuse)
  • NON_D_VIOLENCE_ABOUT
    (other kinds of harassment)
  • NON_D_VIOLENCE_DIRECTED
    (harassment directed at individual or community)
  • NO_VIOLENCE

 


Analytics derived from NER.

 

 

And data for modeling is ready!

After all the annotations and some fabulous work with collaborators, we’re ready with an incredible training dataset.

Tinkering with Natural Language Processing…

Once pre-processing of the texts was done (lowercasing, removing URLs, punctuation, and stop words, followed by lemmatization), we were ready to play around with modeling techniques.

To convert words to vectors (machines learn through numbers), experimenting with the TF-IDF vectorizer gave good results, but we had a very limited vocabulary, while the inference data would have a greater variety of words. Therefore, the decision was made to use pre-trained word embeddings.

Our model used FastText English word vectors (wiki-news-300d-1M.vec) and IndicNLP word vectors (indicnlp.v1.hi.vec) for the Hindi and Hinglish text present in the documents.

Since tweets related to DV stories were quite few in number, data augmentation was applied to them, creating new sentences using synonyms of the original words.

nlpaug is a library for textual augmentation in machine learning experiments. The goal is to improve model performance by generating augmented textual data. It is also able to generate adversarial examples to prevent adversarial attacks.
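
As an illustration, a WordNet-based synonym augmenter from nlpaug can be used as below; the sample tweet is invented.

import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")
tweet = "My neighbour's husband beats her every night and she is scared to speak up"
print(aug.augment(tweet))   # e.g. a few words swapped for WordNet synonyms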

 

 

Bringing into play  – Machine Learning Model(s)

A number of models including BERT, SVM, XGBoostClassifier, and many more were evaluated. Since there were really minute differences between some similar classes, we needed to combine two sets of labels.

After combining similar labels:

 

 

Limitations faced: data under the various classes was not easily separable, because three classes plainly talked about domestic violence (story, opinion, news/info), which made it tough for the classifier to spot marginal variations in semantics.

Also, data under DV_STORY had the least number of samples given the fact that it was the most relevant class.

Hence, to deal with the imbalanced dataset, undersampling using NeighbourhoodCleaningRule from the imbalanced-learn library was applied. The resampled data was fed to stacked models.

Stacking is a way of combining predictions from multiple different types of ML models; it introduces the concept of a meta-learner. A sketch of the resampling and stacking setup follows the learner lists below.

 

 

Source: GeeksforGeeks

 

Level 0 learners:
– Random Forest Classifier
– Support Vector Classifier
– MLP Classifier
– LGBM Classifier

Level 1 meta-learner:
SVC with hyperparameter tuning and custom class weights.
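
As referenced above, here is a sketch of the resampling plus stacking setup; the hyperparameters and class weights are placeholders rather than the tuned values used in the project, and X_train / y_train stand for the embedded tweets and their labels.

from imblearn.under_sampling import NeighbourhoodCleaningRule
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from lightgbm import LGBMClassifier

X_res, y_res = NeighbourhoodCleaningRule().fit_resample(X_train, y_train)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("svc", SVC(probability=True)),
        ("mlp", MLPClassifier(max_iter=500)),
        ("lgbm", LGBMClassifier()),
    ],
    final_estimator=SVC(class_weight="balanced"),   # level-1 meta-learner
)
stack.fit(X_res, y_res)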

 

Class Encoding — 0: DV_INCIDENT, 1: DV_OPINION, 2: DV_OPINION_INFO_NEWS, 3: NON_D_VIOLENCE_ABOUT, 4: NO_VIOLENCE

 

This pretty much sums up the modeling. The model was used to predict labels on an 8,000-row dataset of tweets. The misclassifications were skimmed through and corrected for some crucial classes in order to deliver the best data.

I feel glad to be a part of this incredible community of change-makers. Making some great connections through this amazing journey is like icing on the cake.

So excited to collaborate in the upcoming mind-blowing projects, making the world a better place using AI for good!

 

 

 

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 
Matching Land Conflict Events to Government Policies via Machine Learning | World Resources Institute


By Laura Clark Murray, Nikhel Gupta, Joanne Burke, Rishika Rupam, Zaheeda Tshankie

 

Download the PDF version of this whitepaper here.

Project Overview

This project aimed to provide a proof-of-concept, machine-learning-based methodology to identify land conflict events in a geography and match those events to relevant government policies. The overall objective is to offer a platform where policymakers can be made aware of land conflicts as they unfold and identify existing policies relevant to the resolution of those conflicts.

Several Natural Language Processing (NLP) models were built to identify and categorize land conflict events in news articles and to match those land conflict events to relevant policies. A web-based tool that houses the models allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The geographic scope of the project was limited to India, which has the most environmental (land) conflicts of all countries on Earth.

 

Background

Degraded land is “land that has lost some degree of its productivity due to human-caused processes”, according to the World Resources Institute. Land degradation affects 3.2 billion people and costs the global economy about 10 percent of gross product each year. While dozens of countries have committed to restoring 350 million hectares of degraded land, land disputes are a major barrier to effective implementation. Without streamlined access to land use rights, landowners are not able to implement sustainable land-use practices. In India, where 21 million hectares of land have been committed to restoration, land conflicts affect more than 3 million people each year.

AI and machine learning offer tremendous potential not only to identify land-use conflict events but also to match suitable policies for their resolution.

 

Data Collection

All data used in this project is in the public domain.

News Article Corpus: Contained 65,000 candidate news articles from Indian and international newspapers from the years 2008, 2017, and 2018. The articles were obtained from the Global Database of Events Language and Tone Project (GDELT), “a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages.” All the text was either originally in English or translated to English by GDELT.

  • Annotated Corpus: Approximately 1,600 news articles from the full News Article Corpus were manually labeled and double-checked as Negative (no conflict news) and Positive (conflict news).
  • Gold Standard Corpus: An additional 200 annotated positive conflict news articles, provided by WRI.
  • Policy Database: Collection of 19 public policy documents related to land conflicts, provided by WRI.

 

Approach

 

Text Preparation

 

In this phase, the articles of the News Article Corpus and policy documents of the Policy Database were prepared for the natural language processing models.

The articles and policy documents were processed using spaCy, an open-source library for natural language processing, to achieve the following (a brief sketch follows the list):

  • Tokenization: Segmenting text into words, punctuation marks, and other elements.
  • Part-of-speech (POS) tagging: Assigning word types to tokens, such as “verb” or “noun”
  • Dependency parsing: Assigning syntactic dependency labels to describe the relations between individual tokens, such as “subject” or “object”
  • Lemmatization: Assigning the base forms of words, regardless of tense or plurality
  • Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
  • Named Entity Recognition (NER): Labelling named “real-world” objects, like persons, companies, or locations.
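
As referenced above, here is a minimal illustration of these spaCy steps; the model name and the sample sentence are illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Farmers protested in Odisha after their land was acquired in 2018.")

for sent in doc.sents:                    # sentence boundary detection
    for token in sent:
        # token text, part-of-speech tag, dependency label, and lemma
        print(token.text, token.pos_, token.dep_, token.lemma_)

for ent in doc.ents:                      # named entity recognition
    print(ent.text, ent.label_)           # e.g. "Odisha" GPE, "2018" DATE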

 

Coreference resolution was applied to the processed text data using Neuralcoref, which is based on an underlying neural net scoring model. With coreference resolution, all common expressions that refer to the same entity were located within the text. All pronominal words in the text, such as her, she, he, his, them, their, and us, were replaced with the nouns to which they referred.

 

For example, consider this sample text:

“Farmers were caught in a flood. They were tending to their field when a dam burst and swept them away.”

Neuralcoref recognizes “Farmers”, “they”, “their” and “them” as referring to the same entity. The processed sentence becomes:

“Farmers were caught in a flood. Farmers were tending to their field when a dam burst and swept farmers away.”

 

Coreference resolution of sample sentences
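
A small sketch of this step with Neuralcoref (which plugs into spaCy 2.x) is below, using the same sample text; the model name is illustrative.

import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("Farmers were caught in a flood. They were tending to their field "
          "when a dam burst and swept them away.")
print(doc._.coref_resolved)   # pronouns replaced by the entities they refer to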

 

 

Document Classification

 

The objective of this phase was to build a model to categorize the articles in the News Article Corpus as either “Negative”, meaning they were not about conflict events, or “Positive”, meaning they were about conflict events.

After preparation of the articles in the News Article Corpus, as described in the previous section, the texts were then prepared for classification.

First, an Annotated Corpus was formed to train the classification model. A 1,600 article subset of the News Article Corpus was manually labeled as “Negative” or “Positive”.

To prepare the articles in both the News Article Corpus and Annotated Corpus for classification, the previously pre-processed text data of the articles was represented as vectors using the Bag of Words approach. With this approach, the text is represented as a collection, or “bag”, of the words it contains along with the frequency with which each word appears. The order of words is ignored.

For example, consider a text article comprised of these two sentences:

Sentence 1: “Zahra is sick with a fever.”

Sentence 2: “Arun is happy he is not sick with a fever.”

This text contains a total of ten words: “Zahra”, “is”, “sick”, “happy”, “with”, “a”, “fever”, “not”, “Arun”, “he”. Each sentence in the text is represented as a vector, where each index in the vector indicates the frequency that one particular word appears in that sentence, as illustrated below.

 

 

 

With this technique, each sentence is represented by a vector, as follows:

“Zahra is sick with a fever.”

[1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

“Arun is happy he is not sick with a fever.”

[0, 2, 1, 1, 1, 1, 1, 1, 1, 1]

With the Annotated Corpus vectorized with this technique, the data was used to train a logistic regression classifier model. The trained model was then used with the vectorized data of the News Article Corpus, to classify each article into Positive and Negative conflict categories.
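
In scikit-learn terms, this stage can be sketched as below; train_texts, train_labels, and corpus_texts are placeholder names for the Annotated Corpus and the full News Article Corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()                        # bag-of-words counts
X_train = vectorizer.fit_transform(train_texts)       # labeled articles
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Classify the remaining corpus as Positive / Negative conflict news
X_all = vectorizer.transform(corpus_texts)
predictions = clf.predict(X_all)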

The accuracy of the classification model was measured by looking at the percentage of the following:

  • True Positive: Articles correctly classified as relating to land conflicts
  • False Positive: Articles incorrectly classified as relating to land conflicts
  • True Negative: Articles correctly classified as not being related to land conflicts
  • False Negative: Articles incorrectly classified as not being related to land conflicts

 

The “precision” of the model indicates how many of the articles classified as being about land conflict were actually about land conflict. The “recall” of the model indicates how many of the articles that were actually about land conflict were categorized correctly. An F1-score was calculated from the precision and recall scores.

The trained logistic regression model successfully classified the news articles with precision, recall, and F1-score of 98% or greater. This indicates that the model produced a low number of false positives and false negatives.

 

Classification report using a test dataset and logistic regression model

 

 

Categorize by Land Conflicts Events

The objective of this phase was to build a model to identify the set of conflict events referred to in the collection of positive conflict articles and then to classify each positive conflict article accordingly.

A word cloud of the articles in the Gold Standard Corpus gives a sense of the content covered in the articles.

A topic model was built to discover the set of conflict topics that occur in the positive conflict articles. We chose a semi-supervised approach to topic modeling to maximize the accuracy of the classification process. Specifically, we used CorEx (Correlation Explanation), a semi-supervised topic model that allows domain knowledge, specified as relevant keywords acting as “anchors”, to guide the topic analysis.

To align with the Land Conflicts Policies provided by WRI, seven relevant core land conflicts topics were specified. For each topic, correlated keywords were specified as “anchors” for the topic.
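
An anchored CorEx model can be set up roughly as below; the anchor lists shown are illustrative stand-ins for the seven topics agreed with WRI, and positive_articles is a placeholder for the positive conflict articles.

from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

vectorizer = CountVectorizer(stop_words="english", binary=True)
doc_word = vectorizer.fit_transform(positive_articles)
words = list(vectorizer.get_feature_names_out())

anchors = [
    ["land", "resettlement"], ["crops", "farm"], ["mining", "coal"],
    ["forest", "deforestation"], ["animal", "attacked"],
    ["drought", "rain"], ["water", "dams"],
]
topic_model = ct.Corex(n_hidden=7, seed=42)
topic_model.fit(doc_word, words=words, anchors=anchors, anchor_strength=3)

for i in range(7):
    print(i, topic_model.get_topics(topic=i, n_words=3))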

 

 

 

The trained topic model provided 3 words for each of the seven topics:

  • Topic #1: land, resettlement, degradation
  • Topic #2: crops, farm, agriculture
  • Topic #3: mining, coal, sand
  • Topic #4: forest, trees, deforestation
  • Topic #5: animal, attacked, tiger
  • Topic #6: drought, climate change, rain
  • Topic #7: water, drinking, dams

The resulting topic model is 93% accurate. This scatter plot uses word representations to provide a visualization of the model’s classification of the Gold Standard Corpus and hand-labeled positive conflict articles.

 

Visualization of the topic classification of the Gold Standard Corpus and Positive Conflict Articles

 

 

Identify the Actors, Actions, Scale, Locations, and Dates

The objective of this phase was to build a model to identify the actors, actions, scale, locations, and dates in each positive conflict article.

Typically, names, places, and famous landmarks are identified through Named Entity Recognition (NER). Recognition of such standard entities is built into spaCy’s NER package, with which our model detected the locations and dates in the positive conflict articles. The specialized content of the news articles required further training with “custom entities”, those particular to this context of land conflicts.

All the positive conflict articles in the Annotated Corpus were manually labeled for “custom entities”:

  • Actors: Such as “Government”, “Farmer”, “Police”, “Rains”, “Lion”
  • Actions: Such as “protest”, “attack”, “killed”
  • Numbers: Number of people affected by a conflict

This example shows how this labeling looks for some text in one article:

 

 

These labeled positive conflict articles were used to train our custom entity recognizer model. That model was then used to find and label the custom entities in the news articles in the News Article Corpus.
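
A highly simplified sketch of training such a custom recognizer with the spaCy 2.x API is below; the single training example and the number of iterations are invented for illustration.

import random
import spacy

TRAIN_DATA = [
    ("Farmers protested against the government.",
     {"entities": [(0, 7, "ACTOR"), (8, 17, "ACTION"), (30, 40, "ACTOR")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for label in ("ACTOR", "ACTION", "NUMBER"):
    ner.add_label(label)

optimizer = nlp.begin_training()
for _ in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # spaCy 2.x update signature: raw texts plus gold-standard dicts
        nlp.update([text], [annotations], drop=0.3, sgd=optimizer, losses=losses)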

 

Match Conflicts to Relevant Policies

The objective of this phase was to build a model to match each processed positive conflict article to any relevant policies.

The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.

 

Excerpt of a 2001 policy document related to agriculture

 

 

A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.

Cosine similarity calculates similarity by measuring the cosine of an angle between two vectors. Using the vectorized text of the articles and the policy documents that had been generated in the previous phases as described above, the model generated a collection of matches between articles and policies.
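
A sketch of this matching step is below; TF-IDF vectors stand in for whichever document representation was used, and the variable names are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(stop_words="english")
policy_vecs = vectorizer.fit_transform(policy_texts)    # 19 policy documents
article_vecs = vectorizer.transform(article_texts)      # positive conflict articles

similarity = cosine_similarity(article_vecs, policy_vecs)
best_policy_per_article = similarity.argmax(axis=1)     # index of the closest policy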

 

Visualization of Conflict Event and Policy Matching

The objective of this phase was to build a web-based tool for the visualization of the conflict event and policy matches.

An application was created using the Plotly Python Open Source Graphing Library. The web-based tool houses the models and allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The map displays land conflict events detected in the News Article Corpus for the selected years and regions of India.

Conflict events are displayed as color-coded dots on a map. The colors correspond to specific conflict categories, such as “Agriculture” and “Environmental”, and actors, such as “Government”, “Rebels”, and “Civilian”.

In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.

 

 

 

By selecting a particular category from the right column, only those conflicts related to that category are displayed on the map. Here only the Agriculture-related subset of the events shown in the previous example is displayed.

 

 

News articles from the select years and regions are displayed below the map. When a particular article is selected, the location of the event is shown on the map. The text of the article is displayed along with policies matched to the event by the underlying models, as seen in the example below of a 2018 agriculture-related conflict in the Andhra Pradesh region.

 

 

Here is a closer look at the article and matched policies in the example above.

 

 

 

Next Steps

This overview describes the results of a pilot project to use natural language processing techniques to identify land conflict events described in news articles and match them to relevant government policies. The project demonstrated that NLP techniques can be successfully deployed to meet this objective.

Potential improvements include refinement of the models and further development of the visualization tool. Opportunities to scale the project include building the library of news articles with those published from additional years and sources, adding to the database of policies, and expanding the geographic focus beyond India.

Opportunities to improve and scale the pilot project

 

Improvements
  • Refine models
  • Further development of visualization tool

 

Scale
  • Expand library of articles with content from additional years and sources
  • Expand the database of policies
  • Expand the geographic focus beyond India

 

 

About the Authors

  • Laura Clark Murray is the Chief Partnership & Strategy Officer at Omdena. Contact: laura@omdena.com
  • Nikhel Gupta is a physicist, a Postdoctoral Fellow at the University of Melbourne, and a machine learning engineer with Omdena.
  • Joanne Burke is a data scientist with MUFG and a machine learning engineer with Omdena.
  • Rishika Rupam is a Data and AI Researcher with Tilkal and a machine learning engineer with Omdena.
  • Zaheeda Tshankie is a Junior Data Scientist with Telkom and a machine learning engineer with Omdena.

 

Omdena Project Team

Kulsoom Abdullah, Joanne Burke, Antonia Calvi, Dennis Dondergoor, Tomasz Grzegorzek, Nikhel Gupta, Sai Tanya Kumbharageri, Michael Lerner, Irene Nanduttu, Kali Prasad, Jose Manuel Ramirez R., Rishika Rupam, Saurav Suresh, Shivam Swarnkar, Jyothsna sai Tagirisa, Elizabeth Tischenko, Carlos Arturo Pimentel Trujillo, Zaheeda Tshankie, Gabriela Urquieta

 

Partners

This project was done in collaboration with Kathleen Buckingham and John Brandt, our partners with the World Resources Institute (WRI).

 

 

About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through global bottom-up collaboration. Omdena is a partner of the United Nations AI for Good Global Summit 2020.

Neural Transfer Learning in NLP for Post-Traumatic-Stress-Disorder Assessment


The main goal of the project was to research and prototype technology and techniques suitable to create an intelligent chatbot to mitigate/assess PTSD in low resource settings.

 

The Problem Statement

“The challenge is to build a chatbot where a user can answer some questions and the system will guide the person with a number of therapy and advice options.”

We were allocated to the ML modeling team of the challenge. Our initial scope was narrowing the problem down to the most relevant specific use case. After some iterations and consultations within the team, we decided, among multiple possible avenues (e.g. conversational natural language algorithms, an expert system, etc.), to tackle the problem with a binary risk assessment classifier based on labeled DSM-5 criteria. The working hypothesis was that the classifier could be used as the backend of a chatbot on a low resource device that could detect risk and refer the user to more specialized information, or as a screening mechanism (in a refugee camp, in a resource-depleted health facility, etc.).

The frontend of the system would be a chatbot (potentially conversational, mixed with open-ended questions), and one of the classifiers would assess risk based on the conversation.

The tool is strictly informational and educational; under no circumstances is it intended to replace health practitioners.

Our team psychologist guided the annotation process. After a couple of iterations, we arrived at a streamlined workflow that allowed us to label roughly 50 transcripts (each containing a full session conversation).

 

The Baseline

Baseline implementations by different team members showed that, with traditional ML methods and without further data preprocessing, the accuracy rate was around 75%. Given that we had a serious class imbalance issue, plain accuracy is definitely not the metric to rely on. An article detailing the baseline infrastructure and the traditional ML techniques applied to this text classification problem is in the works.
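As a point of reference, a minimal baseline along these lines (TF-IDF features with logistic regression; variable names such as texts and labels are hypothetical, not our actual pipeline) could report imbalance-aware metrics like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# texts: list of transcript strings, labels: 0/1 risk labels (hypothetical names)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, test_size=0.3, random_state=12)

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
# Balanced accuracy and F1 are more informative than raw accuracy
# when one class dominates the dataset.
print(balanced_accuracy_score(y_test, preds), f1_score(y_test, preds))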

 

The Data

The annotation team ended up having access to 1,700 transcripts of sessions. After careful inspection, the team realized that only around 48 transcripts concerned actual PTSD cases.

Training examples: 48 PTSD transcripts, each averaging 2,000+ lines

An example excerpt of a transcript is available in [3]:

 

 

Target definition: no risk detected → 0, risk detected → 1
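As a rough, hypothetical sketch of how such a binary target could be derived (the column names and the criteria threshold below are illustrative, not the actual annotation schema):

import pandas as pd

# Hypothetical frame: one row per transcript, with the session text and the
# number of DSM-5 criteria flagged by the annotation team (illustrative only).
final_dataset = pd.DataFrame({
    "transcript": ["... session text ...", "... session text ..."],
    "dsm5_criteria_met": [4, 1],
})

# Illustrative rule: flag risk when several DSM-5 criteria are met
final_dataset["label"] = (final_dataset["dsm5_criteria_met"] >= 3).astype(int)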

 

 

From an NLP/ML problem taxonomy perspective, the amount of data is extremely limited, so this would be classified as a few-shot classification problem [4].

Prior art on handling limited data prompted the team to explore transfer learning in NLP, which has shown encouraging recent results for few-shot training, together with data augmentation through back-translation techniques.

The picture below shows the pandas DataFrame that resulted from an intensive data munging process, target calculations (based on DSM-5 manual recommendations), and the excellent work of our annotation team:

 

 

 

The Solution

 

 

ULMFiT

The ULMFiT algorithm was one of the first techniques to deliver effective neural transfer learning, with success on state-of-the-art NLP benchmarks [1].

The algorithm and the paper introduce a myriad of techniques to improve the efficiency of RNN training. We delve into the most fundamental ones below.

The underlying assumption of modern transfer learning in NLP is that all text inputs are transformed into numeric values based on word embeddings [8]. This way, we obtain a semantic representation and, at the same time, numeric inputs to feed the neural network architecture at hand.
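A minimal sketch of this idea in PyTorch (toy vocabulary and dimensions, not the embeddings ULMFiT actually uses):

import torch
import torch.nn as nn

# Toy vocabulary: every token is mapped to an integer id
vocab = {"i": 0, "could": 1, "not": 2, "sleep": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = torch.tensor([vocab[t] for t in ["i", "could", "not", "sleep"]])
vectors = embedding(token_ids)   # shape: (4, 8) — one dense vector per token
print(vectors.shape)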

For context: traditional ML relies solely on the data you have for the learning task, while transfer learning starts from the weights of neural networks pre-trained on a large corpus (examples: Wikipedia, public-domain books). Successes of transfer learning in NLP and computer vision have been widespread over the last decade.

 

Copied from [5]

 

 

Transfer learning is a good candidate when you have few training examples and can leverage powerful existing pre-trained networks.

ULMFiT works as shown in the diagram below:

 

Copied from [5]

 

 

  • A language model is pre-trained (for example, on Wikipedia data)
  • The language model is fine-tuned on your own corpus (no annotations needed)
  • A classifier layer is added to the end of the network

A simple narrative for our case is the following: the model learns the general mechanics of the English language from the Wikipedia corpus. We specialize the model with the available transcripts, both annotated and unannotated, and in the end we turn it into a classifier by replacing the final layer of the sequence model with a regular softmax-based classifier head.

 

LSTM & AWD Regularization Technique

At the core of the ULMFiT implementation are LSTMs combined with a regularization technique called AWD (ASGD Weight-Dropped, where ASGD stands for Averaged Stochastic Gradient Descent).

LSTM (Long Short-Term Memory) networks are the basic building block of state-of-the-art deep learning approaches to transfer learning in NLP and sequence-to-sequence prediction problems. A sequence prediction problem consists of predicting the next word given the previous text:

 

Copied from [6]

 

 

LSTM’s are ideal for language modeling and sequence prediction(increasingly being used in Time Series Forecasting as well ) because they maintain a memory of previous input elements. Each X element in our particular situation would be a token that would generate an output (sequence) and would be sent to the next block so it’s considered during the ht output calculation. Optimal weights will be backpropagated through the network-driven by the appropriate loss function.

One component of this regularization technique, the weight-dropping (WD) part, introduces dropout on the weights of the hidden-to-hidden connections, which is quite distinct from the usual dropout applied to activations.
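A toy illustration of the idea (not the fastai implementation): dropout is applied to the recurrent weight matrix itself rather than to the activations.

import torch
import torch.nn.functional as F

hidden_size = 4
W_hh = torch.randn(hidden_size, hidden_size)           # hidden-to-hidden weights
# Weight-dropping zeroes out individual recurrent connections for this forward pass
W_hh_dropped = F.dropout(W_hh, p=0.5, training=True)
print(W_hh_dropped)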

 

Copied from [7]

 

 

The other component of the regularization is Averaged Stochastic Gradient Descent (ASGD), which, instead of using only the weights from the current step, also takes previous steps into consideration and returns an average; more details about the implementation can be found in [7].
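PyTorch ships an ASGD optimizer that implements this weight averaging; a minimal, illustrative setup (not the exact AWD-LSTM training configuration) looks like:

import torch

model = torch.nn.LSTM(input_size=32, hidden_size=64)
# ASGD starts averaging the iterates after t0 optimization steps
optimizer = torch.optim.ASGD(model.parameters(), lr=1e-2, t0=1000)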

A more detailed ULMFiT diagram can be seen below, where the LSTM components are described alongside the different steps of the algorithm:

 

Copied from [5]

 

 

General-Domain LM (Language Model) Pretraining

This is the initial phase of the algorithm, where a language model is pre-trained on powerful machines with a large public corpus. The language modeling problem is very simple: given a phrase, predict the probabilities of the next word (probably one of the most unnoticed uses of deep learning in our daily lives):

 

 

For this problem in particular, we will use the available fastai implementation of ULMFiT to illustrate the process in practical terms.

To select the ULMFiT implementation in fastai, you specify the AWD_LSTM architecture for the language model, as mentioned before:

 

language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

 

This single line hides a lot of work. In the full snippet shown in the next section, sklearn is used to produce the train and validation split, and fastai is used to instantiate a ULMFiT language model learner.

 

Target Task LM Fine-Tuning

In the code below, we instantiate a pre-trained ULMFiT language model with the right configuration of the algorithm (there are other options for language models, such as Transformer-XL and QRNNs):

 

from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets, stratified by label
df_trn, df_val = train_test_split(
    final_dataset, stratify=final_dataset['label'],
    test_size=0.3, random_state=12)

# Build the language-model data bunch from the two DataFrames
data_lm = TextLMDataBunch.from_df(train_df=df_trn, valid_df=df_val, path="")

# Instantiate the ULMFiT (AWD-LSTM) language model learner
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

 

The code above retrieves our stratified training and validation datasets and creates a language_model_learner based on our own data. The important detail is that the language model does not need annotated data (perfect for our situation, with limited annotated data but a larger corpus of unannotated transcripts). Essentially, we are creating a language model for our own specialized domain on top of the huge, general Wikipedia-style language model.

 

learn.unfreeze()
learn.fit_one_cycle(1, 1e-2)

 

The code above unfreezes the pre-trained language model and executes one cycle of training on top of the new data with the specified learning rate.

Part of the ULMFiT process is applying discriminative learning rates across the different layers and cycles of learning:
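In fastai, discriminative learning rates are expressed with a slice, so earlier layers receive smaller updates than later ones; an illustrative call (the specific values are typical defaults, not tuned for our data) is:

# Smaller learning rate for the earliest layers, larger for the final ones
learn.fit_one_cycle(1, max_lr=slice(1e-2 / (2.6 ** 4), 1e-2))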

 

For a neural language model, an accuracy of around 30% at predicting the next word is considered acceptable given the size of the corpus and the number of possibilities [1].

 

After this point we are able to generate text from our very specific language model:

 

Excerpt from text generated from our language model.

 

At this point, we have a reasonable text generator for our specific context. The ultimate value of ULMFiT lies in its ability to transform a language model into a relatively powerful text classifier.
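As a quick, illustrative check (the seed phrase below is hypothetical), text can be sampled from the fine-tuned language model with fastai's predict:

# Generate 30 words following a seed phrase from the specialized language model
print(learn.predict("I have been having trouble sleeping since", n_words=30, temperature=0.75))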

learn.save('ft')              # save the fine-tuned language model
learn.save_encoder('ft_enc')  # save the encoder for reuse by the classifier

The code above saves the fine-tuned model and its encoder for later reuse.

 

Target Task Classifier

The last step of the ULMFiT algorithm is to replace the final component of the language model with a softmax classifier "head" and train it on the project's labeled data, i.e. the PTSD-annotated transcripts.
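The snippet that follows assumes a classification data bunch, data_clas, built from the labeled transcripts and sharing the language model's vocabulary; a minimal way to construct it (reusing the hypothetical DataFrames and column names from earlier) would be:

from fastai.text import TextClasDataBunch, text_classifier_learner

# Classification data bunch over the labeled transcripts, reusing the LM vocabulary
data_clas = TextClasDataBunch.from_df(
    path="", train_df=df_trn, valid_df=df_val,
    vocab=data_lm.train_ds.vocab,
    text_cols="transcript", label_cols="label", bs=32)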

 

classifier = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5).to_fp16()
classifier.load_encoder('ft_enc')   # load the fine-tuned language-model encoder
classifier.fit_one_cycle(1, 1e-2)
# Unfreeze and train a bit more with discriminative learning rates
classifier.unfreeze()
classifier.fit_one_cycle(3, slice(1e-4, 1e-2))

 

The same technique of discriminative learning rates was used above for the classifier, with much better accuracy rates. Detailed classifier results were not the main goal of this article; a subsequent article will delve into fine-tuning ULMFiT, a comparison of results, additional classifier-specific metrics, and the use of data augmentation techniques such as back-translation and different re-sampling techniques.
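As a sketch of the back-translation idea using the Hugging Face transformers library (the MarianMT models below are illustrative choices, not part of our current pipeline):

from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Translate a batch of sentences with a pre-trained MarianMT model
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

def back_translate(texts):
    # English -> French -> English round trip yields paraphrased training examples
    french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
    return translate(french, "Helsinki-NLP/opus-mt-fr-en")

augmented = back_translate(["I could not sleep for weeks after the incident."])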

 

Initial results of the ULMFiT-based classifier.

 
 
 
 
 
 
