What do you do after scraping 240,000 news articles from the web? Labeling every sentence of them is a tedious and time-consuming job. This article outlines the steps to label that text and prepare several in-house data sets for training machine learning models to detect political bias and hate speech in news articles, as well as to determine whether an article’s title is written as clickbait.
The project partner: The use case in this case study stems from social impact tech startup The Newsroom, which hosted an Omdena Challenge as part of Omdena’s AI Incubator for impact startups.
When thinking about how to effectively solve real-world problems through machine learning (ML) and AI, many people tend to first steer toward the deployment of elaborate and complicated ML models and extensive feature engineering coupled with model fine-tuning to achieve the highest possible prediction performance. Consequently, many people spend a lot of time and effort learning about model development and deployment, while at the same time, unfortunately, neglecting a major part of the day-to-day work of an ML engineer:
The preparation of high-quality data sets which can have a tremendous impact on model performance.
“If 80% of our work [in data science] is preparing high-quality data, then I think preparing that data is the core part […] of the work of a machine learning engineer.” ~ Andrew Ng, founder of deeplearning.ai.
The internet is flooded with courses, videos, blog posts, and more about which kind of model to implement for which kind of problem. However, there is barely any emphasis on how to prepare real-world data sets to make them valuable instruments for improving model performance.
This article outlines the steps that we took in the Omdena Newsroom project to prepare several in-house data sets for training machine learning models to detect political bias and hate speech in a news article as well as determining if an article’s title is written as clickbait. It describes core points that need to be considered in the preparation of the data set and the approach we took to implement those.
The Omdena Newsroom project set out to build AI solutions to score online news articles with respect to their reliability and trustworthiness.
Phase 1: Determining data requirements
Before starting to work with the data, we spent a considerable amount of time determining appropriate requirements on the data posed by our modeling strategies. These had to be set in stone before any labeling process was able to start, otherwise, any labeling efforts could have easily been deemed useless later on.
1. What is the desired output of our ML models?
First, we determined what kind of output we expected our models to deliver, which in turn was guided by the main objective of the project.
In our case, we went with binary classifiers for both hate speech detection and clickbait classification. An alternative could have been multi-class categorization (e.g., a text contains no hate speech vs. contains only offensive language vs. contains hateful language).
Another example for a range of outputs is the detection of political bias in a text, which we categorized into politically biased toward the right vs. no political bias (neutral) vs. politically biased to the left.
2. How is the classification category defined?
Since classifications of text are most often not based on any of the text’s metrics, but rather on the interpretation of the text’s content, clear guidelines on how a certain text category is defined needed to be set up and agreed on. Obviously, such definitions have to be guided by the main objectives of the project.
In our project, we based our definitions of what hate speech, political bias, and clickbait are on previously established rules and guidelines. Those were available through publications of domain experts, journalistic associations, as well as publications reporting ML studies that tried to solve text classification problems similar to ours.
3. What is the granularity of the model input?
When working with text, many input formats are possible: single words, sentences, paragraphs, or even whole articles, or generated summaries of them. Obviously, a decision here needs to be again guided by the main objectives of the project, but input constraints of certain models need to be considered as well.
For example, transformer models impose a limit on the maximum input length (measured in tokens), which in many cases excludes the use of paragraphs or even whole articles.
In our case, the classification of whole news articles would have been preferable based on our main project goals. However, based on model constraints as described above, we instead decided to use single sentences/headlines as model input for classification.
Phase 2: Data selection
Labeling data manually is tedious and takes a lot of time. To save time and resources, we carefully selected the data sets to be labeled so that they would contain sufficient and similar amounts of examples of every classification category.
In our project, the raw data set consisted of around 240,000 news articles that were scraped from the internet. Labeling every sentence of those articles was therefore not feasible. Instead, we took different approaches to select a subset of articles for each classification problem and only labeled those individual subsets.
For hate speech classification, an initial data exploration had revealed that our data set did not contain many sentences with hate speech. To ensure our subset of data would contain enough samples of hate speech to enable proper training of our ML models, the prevalence of ‘hate terms’ as defined by Davidson et al. (2017) was determined for each news article, and the top 4000 of those were added to the final data subset. Then, 1000 randomly selected additional news articles from the entire raw data set were appended to balance the total subset of articles. Finally, duplicates and articles in foreign languages were removed, resulting in a total subset of 4989 articles.
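As a rough illustration, this shortlisting step can be sketched in Python as follows. The hate-term lexicon, the function names, and the order-preserving de-duplication are simplified assumptions for the sketch; the actual term list came from Davidson et al. (2017):

```python
import random

# Hypothetical hate-term lexicon; the project used the list published by
# Davidson et al. (2017).
HATE_TERMS = {"term_a", "term_b", "term_c"}

def hate_term_prevalence(article_text):
    """Fraction of an article's words that appear in the hate-term lexicon."""
    words = article_text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in HATE_TERMS) / len(words)

def select_articles(articles, top_n=4000, random_n=1000, seed=42):
    """Top-N articles by hate-term prevalence plus random extras, de-duplicated."""
    ranked = sorted(articles, key=hate_term_prevalence, reverse=True)
    subset = ranked[:top_n]
    subset += random.Random(seed).sample(articles, min(random_n, len(articles)))
    seen, unique = set(), []
    for article in subset:  # drop duplicates, keep first occurrence
        if article not in seen:
            seen.add(article)
            unique.append(article)
    return unique
```

In the real pipeline the ranking would of course run over all 240,000 scraped articles, with the foreign-language filter applied afterwards.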
To create a balanced subset of news articles with sentences that were either neutral, politically biased to the left, or politically biased to the right, a hierarchical two-step prediction approach was used (see also figure below):
First, the raw data set of news articles was grouped into “political”, “non-political”, and “undetermined topic” articles based on their respective URLs. The “non-political” articles were split into sentences, and 1000 of those sentences were randomly selected to contribute to the final data set.
Then, three different classifiers (a Universal Sentence Encoder (USE) model, Naive Bayes, and SVM) were trained on open-source data sets consisting of sentences that were labeled as politically “biased” or “unbiased”. These models were used to assign a “bias score” from 0 to 3 (0 meaning “no bias” and 3 meaning “highly biased”) to individual sentences of the “political” articles. From the sentences with a bias score of 0 (“not at all biased”), 2600 were randomly selected and added to the final data set. Furthermore, 400 of the sentences with a score of 1 were randomly selected and also added to the final data set.
All sentences with a bias score of 2 or higher were then further classified with models trained on open-source data sets comprised of sentences labeled as biased to the “right” or “left” by assigning them a score of 0 to 3 (0 meaning “biased to the far left” and 3 meaning “biased to the far right”). 2500 sentences each with a score of 0 and with a score of 3 were randomly selected and added to the final data set.
Finally, 500 sentences each with a score of 1 and with a score of 2 were also selected and added to the final data set. Thus, the final data set was composed of a total of 10,000 sentences that were hopefully balanced with respect to their political bias.
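The two-step routing described above can be sketched as follows. The `bias_score` and `direction_score` functions are hypothetical stand-ins for the trained USE, Naive Bayes, and SVM classifiers, and the quota dictionary mirrors the per-bucket counts mentioned above:

```python
import random

def build_bias_dataset(political, nonpolitical, bias_score, direction_score,
                       quotas=None, seed=42):
    """Assemble a balanced political-bias data set.

    bias_score(s)      -> 0..3: how biased a sentence is overall
    direction_score(s) -> 0..3: 0 = far left, 3 = far right
    Both are hypothetical stand-ins for the trained classifiers; `quotas`
    defaults to the per-bucket counts used in the project.
    """
    quotas = quotas or {"nonpolitical": 1000, "bias0": 2600, "bias1": 400,
                        "dir0": 2500, "dir1": 500, "dir2": 500, "dir3": 2500}
    rng = random.Random(seed)

    def draw(pool, n):
        return rng.sample(pool, min(n, len(pool)))

    # Step 1: route political sentences by their overall bias score;
    # everything scoring 2 or higher goes on to direction scoring.
    buckets = {0: [], 1: [], 2: []}
    for s in political:
        buckets[min(bias_score(s), 2)].append(s)

    # Step 2: score the clearly biased sentences for left/right direction.
    direction = {0: [], 1: [], 2: [], 3: []}
    for s in buckets[2]:
        direction[direction_score(s)].append(s)

    final = draw(nonpolitical, quotas["nonpolitical"])
    final += draw(buckets[0], quotas["bias0"])
    final += draw(buckets[1], quotas["bias1"])
    for score in (0, 3, 1, 2):
        final += draw(direction[score], quotas[f"dir{score}"])
    return final
```

With the default quotas the buckets sum to exactly 10,000 sentences, provided each bucket contains enough candidates to fill its quota.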
To create a balanced data set for clickbait classification, first, all headlines of the news articles were scraped. Then, a USE-based model was trained on independent open-source data sets and used to assign a clickbait score ranging from 0 to 1 (with 0 “no clickbait” and 1 “clickbait”) to each of the scraped headlines. Then, 4000 headlines with a score > 0.5, 4000 with a score ≤ 0.1, and 2000 headlines with a score ≤ 0.5 and > 0.1 were randomly selected and added together to form a balanced clickbait data set (see figure below).
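A minimal sketch of this threshold-based bucketing, assuming a `clickbait_score` function that stands in for the USE-based model:

```python
import random

def sample_clickbait_set(headlines, clickbait_score, seed=42,
                         n_high=4000, n_low=4000, n_mid=2000):
    """Bucket headlines by a 0-1 clickbait score and sample from each bucket.

    `clickbait_score` is a stand-in for the USE-based model and returns a
    float between 0 (no clickbait) and 1 (clickbait).
    """
    rng = random.Random(seed)
    high = [h for h in headlines if clickbait_score(h) > 0.5]        # likely clickbait
    low = [h for h in headlines if clickbait_score(h) <= 0.1]        # likely not clickbait
    mid = [h for h in headlines if 0.1 < clickbait_score(h) <= 0.5]  # uncertain middle band

    def draw(pool, n):
        return rng.sample(pool, min(n, len(pool)))

    return draw(high, n_high) + draw(low, n_low) + draw(mid, n_mid)
```

Keeping a smaller slice of the uncertain middle band (2000 vs. 4000) gives the labelers borderline cases without letting them dominate the data set.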
Phase 3: Data preprocessing and cleaning
In real-world problems, barely any data set is ready to use, especially when it comes to text. Missing values need to be dealt with, special characters have to be cleaned out, “nonsense” sentences need to be discarded; the list can be endless. A proper and concise exploratory data analysis (EDA) can shed light on the problems that need to be dealt with and guide the process of data preprocessing and cleaning.
In our raw data set, an exploratory analysis had revealed that, in addition to a small number of foreign-language articles, all articles contained a multitude of HTML tags as a result of scraping the data from the internet. Furthermore, the articles did not only include the article text but often supplementary information such as contacts of the article authors, promo codes, or copyright information, to name a few examples.
Our sampled news article subsets for hate speech classification were first split into sentences. Only articles that were between 3 and 49 sentences long were included. Furthermore, sentences with fewer than 3 words were discarded.
For the political bias data set, the news articles had already been split into sentences, so this step was unnecessary, and the clickbait data set also only consisted of individual headlines, i.e., sentences. Furthermore, headlines in a language other than English were discarded.
Apart from parsing out HTML tags, no further text processing, such as lowercasing, removal of stop words, lemmatization, or tokenization, was performed, because otherwise the sentences would have become impossible to read, interpret, and thus classify.
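The cleaning and filtering rules from this phase can be sketched as follows. The regex-based HTML stripping and the period-based sentence splitter are simplifications for illustration; a real pipeline would likely use a proper HTML parser and sentence tokenizer:

```python
import re

def strip_html(text):
    """Crude HTML tag removal; a real pipeline would likely use a parser
    such as BeautifulSoup, but a regex keeps this sketch dependency-free."""
    return re.sub(r"<[^>]+>", " ", text)

def article_to_sentences(article, min_sents=3, max_sents=49, min_words=3):
    """Split an article into sentences and apply the length filters above.

    Articles outside the 3-49 sentence range yield an empty list, and
    sentences with fewer than 3 words are dropped. The punctuation-based
    splitter is a naive stand-in for a proper sentence tokenizer.
    """
    text = strip_html(article)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not (min_sents <= len(sentences) <= max_sents):
        return []
    return [s for s in sentences if len(s.split()) >= min_words]
```

Note that, in line with the point above, the surviving sentences are left otherwise untouched so that labelers can still read them naturally.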
Phase 4: Data labeling
To ensure the highest possible prediction accuracy by the ML models later, the data that the models are trained with has to be labeled as accurately as possible. Therefore, it is very important that the people who label the data have a clear understanding of the classification categories and how to assign the right category to a sentence, i.e., how to label the sentence correctly.
For that, we set up clear definitions of what “hate speech”, “political bias to the left”, “right”, or “center”, and what “clickbait” exactly meant to us. These definitions were strongly guided by examples from scientific literature and domain expert information that was available on the internet. By providing clear guidelines, we also hoped to suppress the labeler’s own implicit biases to ensure accurate and consistent labeling.
There is a multitude of tools available for labeling text, such as Prodigy, DataTurks, MonkeyLearn, LabelStudio, and many more. They differ greatly in terms of labeling methodology (labeling whole sentences, annotating parts of sentences, etc.), helpful features, and of course pricing. After a thorough exploration by the team of which of the tools would work best for our needs, we decided to go with HumanFirst.
In HumanFirst, the sentences can either be displayed individually (see also figure below) or be reviewed in their article context, which was very helpful for interpreting the correct meaning of a sentence. Each sentence/headline could be individually selected to assign a label to it.
In addition to the labels described above, we also included an “NA” label that was assigned to any utterance that had been missed in the cleaning phase, such as sentences/headlines in foreign languages or anything else that was not a proper English sentence. All sentences/headlines with “NA” were later excluded from the data sets.
A helpful feature of HumanFirst was their suggestion tool: As soon as a number of sentences/headlines were classified into a certain category, HumanFirst started suggesting unlabeled sentences that might also fit into this category and allowed the user to label them all at once. Or, vice versa, categories were suggested for unlabeled sentences, as shown in the figure above.
Labeling text manually is tedious, exhausting, and boring, let’s just be frank about it. To make the task more feasible and to prevent inaccurate labeling due to inattentiveness building up over time, the data sets were split into smaller sets, each containing 2000-6000 headlines/sentences. Each of the three data subsets was assigned to two different labelers who went through the entire data set independently.
After both of these people had labeled the entire subset, it was exported from HumanFirst and the labels that were assigned to each sentence/headline by the two labelers were compared. If the labels matched, the respective label was accepted as the final label. If they didn’t match, two other individuals were asked to label the sentence/headline independently. If those two labels matched, the respective label was accepted as the final label. If even those didn’t match, one more person assigned a label, which was then, basically as a majority vote, accepted as the final label. How we implemented the models and the complete NLP pipeline for this project is described in Using NLP to Fight Misinformation And Detect Fake News.
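The label reconciliation logic can be sketched as a small helper. The sequential-annotator workflow is as described above; the label names in the test are hypothetical:

```python
from collections import Counter

def resolve_label(labels):
    """Resolve a sentence's final label from sequential annotator votes.

    `labels` holds the votes in the order they were collected: two initial
    annotators, then (only if needed) two more, then a fifth tie-breaker.
    Returns the agreed label, or None if another annotation is still needed.
    """
    if len(labels) >= 2 and labels[0] == labels[1]:
        return labels[0]  # first pair agrees
    if len(labels) >= 4 and labels[2] == labels[3]:
        return labels[2]  # second pair agrees
    if len(labels) == 5:
        # Fifth vote breaks the tie: take the most frequent label overall.
        return Counter(labels).most_common(1)[0][0]
    return None
```

Returning `None` signals that the sentence still needs to be routed to the next annotator, which matches the staged process described above.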
In this report, a detailed approach on how to prepare several data sets for training machine learning models to detect political bias, hate speech, and headline click-baiting in news articles was described. As it is crucial for optimal model performance to have adequate, accurate, and balanced pre-labeled training data available, the following points were considered:
- All requirements on the data posed by model restrictions and project goals were elaborated beforehand to prepare the data sets optimally.
- Articles, headlines, and sentences were shortlisted by training ML models on similar open-source data sets and using them to “pre-classify” the data samples. This way, we hoped to put together balanced subsets that contained similar amounts of samples for each classification category, so that each category was equally represented in the data sets.
- Clear and precise guidelines were set up for how each classification category was defined to make sure the data was adequately classified by the human labelers.
- Each sentence/headline was labeled by multiple people independently and a final label was determined by the majority opinion to eliminate potential individual biases and opinions as much as possible to ensure high label accuracy.
- Data sets were divided into small subsets to make the task more feasible and avoid inaccurate labeling due to exhaustion or monotony.
- Sentences/headlines in a foreign language or text snippets that didn’t make up a proper English sentence were manually filtered out in the labeling process by assigning them to an “NA” category to eliminate noise and produce a cleaner and more accurate final data set.