A complete pipeline to identify and monitor human rights abuses through NLP models. In this two-month challenge, a group of collaborators with Human Rights First (HRF) prepared annotated datasets of social media posts, solved the related classification problems, and built a Streamlit dashboard app deployed on AWS to identify, visualize, and summarize war-crime-related human rights abuses by severity score and location.
Figure 1 describes the framework process flow for the Omdena – Human Rights First project. The steps are data collection, data labeling, classification modeling, deployment, and the dashboard; each is briefly explained in the following sections.
Data collection: Social media
We collected datasets using APIs and by scraping text data from two social media platforms: Twitter and Reddit. These datasets can be used to tackle diverse problems related to human rights abuses, for example, war crime detection and ranking war crimes by a severity score over subcategories of war crimes.
To make data extraction easier by identifying keywords for extraction, we developed a vocabulary set. As Figure 2 shows, keywords (adjectives, verbs, etc.) are scraped from the United Nations website using BeautifulSoup. To widen the vocabulary, a pre-trained NLP model (GloVe word embeddings) is used to find similar words and synonyms of the keywords, which are saved in CSV format.
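As a minimal sketch of the expansion step, the logic can be written against the gensim `most_similar` interface used with GloVe vectors; the `model` object here is generic (and hypothetical) so the behavior is clear without downloading embeddings:

```python
# Sketch of vocabulary expansion. `model.most_similar` mirrors the
# gensim KeyedVectors interface, returning [(word, similarity), ...].
import csv

def expand_vocabulary(model, keywords, topn=5):
    """Return the seed keywords plus their nearest neighbours in embedding space."""
    expanded = set(keywords)
    for word in keywords:
        for neighbour, _score in model.most_similar(word, topn=topn):
            expanded.add(neighbour)
    return sorted(expanded)

def save_vocabulary(words, path):
    """Persist the expanded vocabulary to CSV, one word per row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for word in words:
            writer.writerow([word])
```

With real GloVe vectors, one would pass in something like `gensim.downloader.load("glove-wiki-gigaword-100")` as the model.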
Methods used for data collection from Reddit are shown in Figure 3. The first method uses the Pushshift API. Its advantage is that it can extract more than 1,000 threads/posts of text data; however, it takes about 1-2 hours to complete. Therefore, the PRAW API is used as an alternative. Its limitation is a cap of 1,000 threads/posts, but extraction takes only 2-3 minutes. The credentials provided by HRF (app name, client ID, and secret key) are used to collect the Reddit data. Data is extracted from five subreddits: world news, news, war crimes, war, and war crime. Next, specific keywords (for example, weapons, civilians) are used to extract data on particular war crime topics. The data is then ready for labeling.
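A hedged sketch of the PRAW path (credential values and subreddit names are placeholders, and the keyword filter is our own small helper, not part of PRAW):

```python
def matches_keywords(text, keywords):
    """True if any extraction keyword appears in the text, case-insensitively."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)

def collect_reddit_posts(keywords, subreddits, limit=1000):
    """Fetch up to `limit` hot posts per subreddit with PRAW, keep keyword matches."""
    import praw  # third-party client; requires the HRF-provided credentials
    reddit = praw.Reddit(client_id="APP_ID", client_secret="SECRET",
                         user_agent="hrf-war-crime-collector")
    rows = []
    for name in subreddits:
        for post in reddit.subreddit(name).hot(limit=limit):
            text = f"{post.title} {post.selftext}".strip()
            if matches_keywords(text, keywords):
                rows.append({"subreddit": name, "text": text})
    return rows
```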
Methods used for data collection from Twitter are shown in Figure 4. The first method is the Tweepy API. Its disadvantage is that it only allows a limited number of tweets, with a very low quota. Therefore, an alternative that does not interface with the API, Twint, is used. Twint works as a scraper; its key advantage is that, since it does not interact with the API, there are practically no limits on the data we can obtain. The data is collected by setting a few main restrictions such as the date range, the language, and the keyword used to extract the data. The data is then ready for labeling.
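A sketch of the Twint setup under those restrictions (the query builder is our own helper; the exact search strings used in the project may differ):

```python
def build_query(keywords):
    """Combine extraction keywords into a single OR search query."""
    return " OR ".join(keywords)

def scrape_tweets(keywords, since, until, lang="en", out_csv="tweets.csv"):
    """Scrape tweets matching the keywords within a date range, no API quota."""
    import twint  # scraper; does not touch the official Twitter API
    c = twint.Config()
    c.Search = build_query(keywords)
    c.Since = since          # e.g. "2021-01-01"
    c.Until = until
    c.Lang = lang
    c.Store_csv = True
    c.Output = out_csv
    c.Hide_output = True
    twint.run.Search(c)
```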
We planned a generic approach to label datasets and used it to label war crime data for Twitter and Reddit. Labeling the datasets, as shown in the life cycle (Figure 1), requires an appropriate labeling tool. To this end, we explored different tools such as HumanFirst and Labelbox, and eventually selected Labelbox because it supports labeling subcategories.
The second step in the labeling life cycle is to prepare a set of guidelines. These guidelines are problem-specific and include the definition of the problem with examples, the exact labels to assign, and so on, with the ultimate goal of achieving label consistency across collaborators. Labeling includes multiclass and multilabel steps, as shown in Figure 5.
Here are the labels used for Reddit and Twitter data classification. We studied different sublabels suitable to fall under the war crime category, and the label definitions were verified by Human Rights First (the host organization) before we started labeling.
Multiclass war crime
Table 1: Label Definitions – Multiclass
| Label | Definition |
| --- | --- |
| War crime | The text contains contexts related to war crime information. |
| Non-war crime | The text does not contain contexts related to war crime information. |
| Under war crime | The text does not fall under any of the sublabel categories of war crime, but the context is related to other war crime information. |
Multilabel war crime
Table 2: Label Definitions – Multilabel
| Label | Definition |
| --- | --- |
| Involvement of Children in Armed Conflict | This applies whenever children under the age of 15 are involved in armed conflicts, that is, used as soldiers, armed and/or actively participating in the armed conflict. |
| Property Destruction | Destruction of property in an unjustified way, which includes civil, religious or cultural buildings and properties when said destruction is not justified by military necessity. |
| Intentionally directing attacks against the civilian population | Launching an attack in the knowledge that such attack will cause incidental loss of civilian life, injury to civilians or damage to civilian objects, including medical or religious personnel, as well as any person protected by international law. |
| Murder | Killing civilians or surrendered combatants. |
| Mutilation, cruel treatment, and torture | Subjecting people to any type of cruelty, mistreatment, or unjustified harm, including torture, slavery, degrading treatment, mutilation of dead bodies, and collective punishments. |
| Perfidy | Killing or wounding an adversary after promising to act in good faith with the intention of breaking that promise once the unsuspecting enemy is exposed. This category also includes using flags, insignias or emblems to make the enemy believe they can trust them, and using that trust to attack, wound or kill. |
| Pillaging/Looting | Seizing property not justified by military necessity. This includes stealing from the dead, injured or shipwrecked. |
| Sexual violence | Rape, sexual slavery, forced pregnancy or any other form of sexual violence, including forced sterilization. |
| Taking of hostages | Seizure or detention of a person (the hostage), combined with threatening to kill, to injure or to continue to detain the hostage, in order to compel a third party to do or to abstain from doing any act as an explicit or implicit condition for the release of the hostage. |
| Weaponising Civilians | Using human shields, seizing basic resources such as food, water, or other supplies from the civil population in order to coerce the adversary, and/or placing civilians in certain territories to avoid the adversary attacking them. |
| Rights Suspension | Suspending population rights, especially with regard to due process. This includes unlawful deportations, declaring abolished or suspended the rights of civilians in the adversary party, or sentencing without due judicial guarantees. |
Summary of datasets
In this section, we summarize the datasets. The outcome of the labeling distribution is shown in Figure 6.
The war crime classification is treated as two classification problems:
- Binary classification (war crime, non-war crime)
- The under war crime category is merged with the war crime category to give the war crime class a larger share of the distribution.
- Multilabel classification (war crime sublabels: Involvement of Children in Armed Conflict; Property Destruction; Intentionally directing attacks against the civilian population; Murder; Mutilation, cruel treatment, and torture; Perfidy; Pillaging/Looting; Sexual violence; Taking of hostages; Weaponising Civilians; Rights Suspension)
The original labeled dataset is imbalanced; therefore, we prepared two datasets from it: the overall dataset (unbalanced) and an undersampled but more balanced dataset. We built several classification models for each dataset separately, as shown in Figure 7.
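A minimal sketch of how an undersampled dataset like this can be produced (the exact sampling procedure used in the project may differ): randomly downsample every class to the size of the smallest class.

```python
import random

def undersample(examples, seed=42):
    """Balance a list of (text, label) pairs by downsampling each class
    to the size of the smallest class."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    n_min = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n_min))
    return balanced
```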
Based on Figure 8, on the undersampled dataset, the bert-large-uncased model provides the best F1 score among the binary models, followed by the distilbert-large-uncased model. The voting classifier (ensemble learning) provides a decent F1 score. For the multilabel models, as shown in Table 2, the bert-large-uncased model provides the best F1 score. Based on Figure 9, a roberta-base binary model achieves the best F1 score on the overall (Reddit) dataset, with another BERT model (bert-large-uncased) a very close second. Several traditional machine learning algorithms (Naive Bayes, Random Forest, Support Vector Machine) also show good F1 scores; however, they were trained on an earlier version of the Reddit dataset that contained less data and was much more unbalanced, so those models overfit. The passive-aggressive classifier (an online learning algorithm) and the distilbert-large-uncased model provide decent F1 scores. Finally, for multilabel classification, the bert-large-uncased model provides the best F1 score.
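For reference, multilabel F1 scores like those compared here are typically computed as a micro-average over all label slots; a dependency-free sketch (the project may have used a library implementation such as scikit-learn's):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multilabel indicator matrices (lists of 0/1 lists)."""
    tp = fp = fn = 0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p in zip(row_true, row_pred):
            tp += int(t == 1 and p == 1)  # label predicted and present
            fp += int(t == 0 and p == 1)  # label predicted but absent
            fn += int(t == 1 and p == 0)  # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```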
Figure 10 shows the binary and multilabel models trained on the Twitter dataset, and Figure 11 shows their scores. Among the binary models, the Multinomial Naive Bayes model provides the best F1 score, followed by the bert-large-uncased and XGBoost models; the Support Vector Machine model provides a decent F1 score. For multilabel classification, the bert-large-uncased model provides the best F1 score and the XGBoost model a decent one.
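A hedged sketch of a Multinomial Naive Bayes baseline of the kind reported for the Twitter binary task (the feature choices and hyperparameters here are assumptions, not the project's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_nb_baseline(texts, labels):
    """TF-IDF features (unigrams + bigrams) feeding Multinomial Naive Bayes."""
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(texts, labels)
    return clf
```

The same pipeline shape also works for the SVM and XGBoost baselines by swapping the final estimator.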
Summary of the model classification
The limitations of the models are summarized in Figure 12. For binary classification on both Reddit and Twitter, the main limitation is that the models overfit due to class imbalance, even though this gave better accuracy. Furthermore, the war crime examples share quite similar texts and therefore lack diversity; this is another likely reason our models perform decently despite the small dataset. A future direction for the binary models is finding more efficient ways to clean the texts. The multilabel models also overfit due to class imbalance: some categories had fewer than 100 tweets. Undersampling removed the overfitting but dropped accuracy significantly. Since there are 11 classes, balancing them is very important for multilabel classification; one option is manually adding tweets for the underrepresented classes. In general, modeling requires more training data for better accuracy and more test data for a more reliable evaluation. Further fine-tuning and hyperparameter tuning of the large models requires considerable time and effort, as each trial takes about 30 minutes. Future directions for building war crime classifiers include:
- Deeper model interpretation for a better understanding of the model's predictions
- Studying and avoiding model biases caused by some countries appearing frequently in war crime posts
We go one step further to understand which words (more precisely, tokens) in a text contribute to a war crime prediction from our model. For this, we use Captum, an open-source, extensible Python library for model interpretability built on PyTorch. Figure 13 shows a few results given by Captum on our model. We observe the following:
- Some expected words that contribute to war crime prediction: torture, war, massacre, genocide, kill, rape, shoot, weapon, bomb
- Some neutral words like civilian or government, in some texts, contribute to war crime prediction. For example:
- hughhefner nato targeting Libya civilians infra structure clearly specifically forbid
- bush adviser defends massacring civilians
- torture year British government know
The potential reason: such words occur quite often in the positive (war crime) examples in the collected dataset, while there may not be enough negative examples containing them to help the model recognize them as neutral.
- Some country names contribute to war crime prediction, which is very undesirable because it introduces bias. Examples:
- sign petition investigate prosecute Iraq torture scandal
- judge rebuke Irish Nobel laureate call Israel state Brandon sun a ~
- Israeli shelling kill Gazan wound <number> year old girl wb a ~
The reason is similar to the case of neutral words above, but the consequence is much more severe, and we need to prevent such bias from occurring.
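A token-attribution run like the ones in Figure 13 can be sketched with Captum's layer integrated gradients (the model/tokenizer wiring below assumes a Hugging Face BERT classifier; attribute names such as `model.bert.embeddings` are assumptions about that setup):

```python
def top_attributed_tokens(tokens, scores, k=5):
    """Rank tokens by attribution score, strongest contribution first."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked[:k]]

def attribute_war_crime_tokens(model, tokenizer, text):
    """Per-token attribution toward the 'war crime' logit via Captum."""
    import torch
    from captum.attr import LayerIntegratedGradients

    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"]

    def forward_fn(ids):
        return model(ids).logits[:, 1]  # logit of the positive (war crime) class

    lig = LayerIntegratedGradients(forward_fn, model.bert.embeddings)
    baseline = torch.full_like(input_ids, tokenizer.pad_token_id)
    attributions = lig.attribute(input_ids, baselines=baseline)
    scores = attributions.sum(dim=-1).squeeze(0).tolist()  # one score per token
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    return tokens, scores
```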
The above observations suggest that we need to improve data collection to avoid bias; one option is to include more negative examples that contain such words (neutral words, country names, etc.). However, these negative examples should be found among examples extractable with the same set of keywords used for our overall data collection. Otherwise, the negative examples would come from a different distribution, which would not help the model learn how such words appear in negative examples similar to our positive ones.
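One way to realise this suggestion is a filter that keeps labelled-negative texts mentioning the "watch" words (neutral words, country names), so the model sees them outside war-crime contexts. The record fields below are illustrative:

```python
def pick_hard_negatives(examples, watch_words):
    """Select negative examples (label == 0) containing any watch word,
    e.g. 'civilian', 'government', or a country name."""
    hard = []
    for ex in examples:
        text = ex["text"].lower()
        if ex["label"] == 0 and any(w.lower() in text for w in watch_words):
            hard.append(ex)
    return hard
```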
Deployment of the models
Deployment is the next crucial step after building the models. It is at this stage that we bring the model alive to fulfill its purposes for the clients and all other stakeholders. We integrate the model into an environment that receives social media messages (Reddit or Twitter posts) and predicts whether each one is a war crime or non-war crime. In the case of a war crime, it also determines subcategories such as murder, property destruction, and intentionally attacking civilians.
We use FastAPI to deploy the NLP models on HRF's AWS platform. FastAPI is a modern web framework for building APIs that is fast to implement; it also comes with an interactive API interface that is simple to navigate and intuitive to operate. On AWS, we use S3, PostgreSQL, and EC2.
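A minimal sketch of such a FastAPI wrapper (the endpoint name, response fields, and threshold are our assumptions, not necessarily those of the deployed service):

```python
def to_label(prob, threshold=0.5):
    """Map a war-crime probability to the binary label string."""
    return "war crime" if prob >= threshold else "non-war crime"

def create_app(model):
    """Wrap a fitted classifier (with predict_proba) in a small JSON API."""
    from fastapi import FastAPI
    from pydantic import BaseModel

    class Message(BaseModel):
        text: str

    app = FastAPI(title="HRF war-crime classifier")

    @app.post("/predict")
    def predict(msg: Message):
        prob = float(model.predict_proba([msg.text])[0][1])
        return {"label": to_label(prob), "probability": prob}

    return app
```

Serving the app with uvicorn then exposes FastAPI's interactive docs at the /docs path mentioned below.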
The diagram above illustrates the first 3 steps for the deployment phase.
- In Step 1, the final NLP models were each saved as a single artifact (a pickle file), like wrapping a product into a nice package box.
- In Step 2, we bring the model out of the box so that anyone can interact with it; this is where FastAPI comes in. First, we tested hosting the interactive API interface on a local machine, accessible at the local URL http://127.0.0.1:8000/docs. Second, we tried out various messages on the interface for model prediction. Examples:
- The model predicted “demonstration is ongoing in Pakistan for free and fair election” as non-war crime
- The model predicted “civilians are being attacked and killed in Gaza” as a war crime
- We moved to Step 3 once things were working on the local machine. Here, we uploaded all the files to the EC2 environment and installed the necessary requirements, such as Python packages. The API interface could then be accessed via the EC2 URL.
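The save/load round trip behind Step 1 and Step 3 boils down to Python's pickle module; a sketch:

```python
import pickle

def save_model(model, path):
    """Step 1: wrap the trained model into a single pickle artifact."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Loaded again inside the API process running on EC2."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.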
The final Dashboard MVP
In conclusion, the envisioned goals of this project were achieved: labeled war crime datasets were generated, and an MVP with backend models and a front-end interface was deployed to AWS, with visualization in a Streamlit dashboard. This work could be further extended; possible areas of future exploration include:
- Collect more, higher-quality data for Reddit and Twitter, and explore other open-source data sources.
- Explore methods to identify the trustworthiness of the data.