A complete pipeline using NLP to fight misinformation in news articles. In this two-month challenge, a group of 45+ collaborators prepared annotated news datasets, solved related classification problems, and built a browser extension to identify and summarize misinformation in news.
In our globalized, digitalized world, information from a variety of sources can be disseminated at unprecedented speeds to widespread audiences. While this has proven to be largely beneficial, the world has also experienced a sharp rise in the pervasiveness of Fake News. This has become a global phenomenon that not only undermines the integrity of mainstream news media but could also cause societal instability.
Using examples from the 2016 US election, H. Allcott and M. Gentzkow, in their article Social Media and Fake News in the 2016 Election, suggested that ‘one fake news article was about as persuasive as one TV campaign ad’ and that fake news has the potential to impact close political battles. The ubiquity of social media only makes the matter worse!
In ‘The science of fake news’, published in Science, David M. J. Lazer et al. define fake news as “fabricated information that mimics news media content in form but not in organizational process or intent. Fake-news outlets, in turn, lack the news media’s editorial norms and processes for ensuring the accuracy and credibility of information. Fake news overlaps with other information disorders, such as misinformation (false or misleading information) and disinformation (false information that is purposely spread to deceive people).”
In this project, 45+ collaborators from Omdena partnered with The Newsroom to take on a specific type of Fake News: misinformation. The overarching goal of The Newsroom is to identify news articles and claims most likely to contain false or highly biased information, assign a trust score to them, and summarize the results through a product, NewsScore, a browser extension. To assist us in this process, The Newsroom provided us with an unlabeled dataset containing ~240K scraped news articles and suggested several published labeled datasets.
Setting the Goals
Given a news article, our first goal is to assign it a trust score based on the extent and types of misinformation that it propagates. To ensure transparency of the scoring process, we decided to build a set of models, each addressing a specific attribute of news misinformation. After discussion, we shortlisted three attributes: Hate Speech, Clickbait, and Political bias. We further divided this goal into two parts:
- In-house dataset preparation: Using the unlabeled dataset provided by The Newsroom, prepare three labeled datasets, each focusing on a specific attribute.
- Transparency modeling: Prepare classification models for Hate Speech, Clickbait, and Political bias using open-source and In-house datasets.
Our second goal is to build models for claim detection and verification in news. Given our capacity in two months, we decided to only tackle the ‘claims detection’ problem.
Our final goal is to build a minimal viable product (MVP) that will tie everything together. For this, we decided to build a Google Chrome extension.
The workflow of this project is visualized below:
In-house dataset preparation
One of the primary goals of the project is to prepare in-house datasets from the unlabeled news articles provided by The Newsroom. The resulting datasets will be used to solve diverse misinformation-associated problems, for example, hate speech detection, political bias identification, clickbait detection, and claims detection and verification. The following subsections provide an overview of the in-house dataset generation process and a summary of the resulting datasets.
Dataset labeling process
We planned a generic approach to label datasets and used it to label the hate speech, clickbait, and political bias datasets. The first step in the dataset labeling life cycle (Figure 1) is the choice of an appropriate labeling tool. To this end, we explored several tools, including HumanFirst, Labelbox, and Label Studio, and eventually selected HumanFirst for its speed and ease of use.
The second step in the labeling life cycle is to prepare a set of guidelines. These guidelines are problem-specific and include the definition of the problem with examples and the exact labels to assign, with the ultimate goal of achieving consistent labels across collaborators.
The first two steps in the labeling process are data-agnostic, meaning we don’t consult the actual news articles to complete them. The remaining steps, however, are data-dependent, starting with the unlabeled data from The Newsroom and ending with the final labeled data.
The unlabeled dataset is far too large to label in full. The third step in the life cycle makes the task manageable by producing a shortlisted ‘Unlabeled dataset’. For each problem, we used a combination of supervised (based on published labeled datasets) and unsupervised techniques to subsample a small but representative dataset that could be labeled in 1–2 weeks. All of these datasets are generated at the sentence level.
The fourth step is crowdsourcing (actual labeling) to get independent labels from collaborators. We aimed for 3x labeling of each sentence; however, we were only able to get 2x labeling for most of the shortlisted datasets.
Even though we spent a considerable amount of time preparing consistent guidelines, we still found many labeling mismatches (conflicts). The final step is to resolve the conflicts; we assigned additional people to specifically label those examples to reach a consensus. This gives us our final in-house dataset(s).
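The last two steps of the life cycle can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function name `resolve_labels`, the label strings, and the rule that a strict majority settles a label are all assumptions made for the example. Items without a strict majority (for instance, a 2x-labeled sentence with two different labels) are returned as conflicts to be sent to an additional annotator.

```python
from collections import Counter


def resolve_labels(annotations):
    """Split items into agreed labels and conflicts.

    annotations: dict mapping a sentence id to the list of labels
    assigned by independent annotators (hypothetical input format).
    """
    resolved, conflicts = {}, []
    for item_id, labels in annotations.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        # A strict majority among at least two annotators settles the label.
        # A single label or a tie needs another annotator.
        if len(labels) >= 2 and top_count > len(labels) / 2:
            resolved[item_id] = top_label
        else:
            conflicts.append(item_id)
    return resolved, conflicts


annotations = {
    "s1": ["clickbait", "clickbait"],
    "s2": ["clickbait", "not_clickbait"],
    "s3": ["not_clickbait", "not_clickbait", "clickbait"],
}
resolved, conflicts = resolve_labels(annotations)
print(resolved)   # labels with a strict majority
print(conflicts)  # items to send to an additional annotator
```

Treating singly labeled items as conflicts is a design choice in this sketch; it keeps unreviewed labels out of the final dataset at the cost of more labeling work.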
Case Study: Clickbait
With HumanFirst already selected as the labeling tool, we started at step two: generating guidelines. We studied different articles to define ‘clickbait’ and used examples to illustrate the definition clearly. Table 1 shows different types of ‘clickbait’ with illustrative examples.
In the next step, we parsed out all the headlines from the NewsRoom articles. We then trained a Universal Sentence Encoder (USE) based model on an independent dataset and used that model to predict clickbait probability scores (0 to 1) for all the NewsRoom headlines. We randomly sampled 10,000 headlines encompassing different ranges of clickbait scores. This gives us a uniform representation of different types of headlines. Finally, we converted those to HumanFirst format, divided all of the sentences and articles into 5 different datasets, and uploaded them to HumanFirst for labeling. This is our shortlisted unlabeled dataset.
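One way to draw a sample that covers the different score ranges is to bin the headlines by predicted score and sample evenly from each bin. The sketch below assumes this binning strategy; the exact sampling procedure used in the project is not specified, and the function name, the number of bins, and the input format are hypothetical.

```python
import random


def sample_across_score_bins(scored_headlines, n_total, n_bins=10, seed=0):
    """Draw about n_total headlines spread evenly over score bins.

    scored_headlines: iterable of (headline, score) pairs, score in [0, 1].
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for headline, score in scored_headlines:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append(headline)
    per_bin = n_total // n_bins
    sample = []
    for bucket in bins:
        # Sparse bins contribute everything they have.
        sample.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return sample


# Synthetic scores standing in for the model's predictions.
scored = [(f"headline {i}", i / 999) for i in range(1000)]
sample = sample_across_score_bins(scored, n_total=100)
print(len(sample))
```

Sampling per bin rather than over the whole pool keeps rare score ranges (for example, very confident clickbait predictions) from being crowded out by the most common range.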
The datasets are then independently 2x labeled by different collaborators using HumanFirst. We then exported all the datasets from HumanFirst, resolved the conflicts, and prepared the final dataset. The final ‘in-house’ labeled clickbait dataset contains 9,954 article headlines.
Summary of in-house datasets
In this section, we summarize all three independent datasets we prepared using the dataset labeling lifecycle.
The hate speech dataset is the most imbalanced of our labeled datasets (Figure 2, 1% hate vs 99% no hate examples). One of the reasons behind this imbalance is that hate speech is very underrepresented in mainstream news articles. Another potential reason is that our approach to shortlisting sentences for hate speech is based on the presence of hate words and bi-grams from a previous study, which could be limited and outdated.
The clickbait dataset is probably our best in-house dataset in terms of quality and representation. This is partly because clickbait detection is a relatively easy problem. For this dataset, we were able to consistently ensure 2x labeling.
The political bias dataset is the last one we labeled. We spent a good amount of time finding a good candidate unlabeled dataset; however, most of the examples were labeled by only one collaborator. Therefore, the quality of this dataset is worse than that of the previous two. We were also unable to get good coverage, as only half of the 10,000 examples were labeled.
We also prepared an in-house labeled dataset (1,000 examples only) for claim detection. This dataset did not follow the labeling lifecycle, and due to limited capacity, only one collaborator was assigned to it. Exploring and extending it in more detail could be future work.
Claims detection modeling
We defined a claim as “A statement about the world that can be verified”. The Claims Detection models perform binary classification, grouping input sentences into Check-Worthy Factual Sentences (CFS) and Non-Factual Sentences (NFS). This labeling convention, as well as the model code we tested, stems from the open-source ClaimSpotter publication and GitHub repository.
The ClaimSpotter study proposes BiLSTM and SVM baseline models, along with the transformer models BERT, DistilBERT, and RoBERTa. Of the baselines, we tested only the BiLSTM model, which is also the model integrated into the MVP.
Initially, the BiLSTM model achieved an F1-score of approximately 70%. After fine-tuning and tweaks to the model parameters, this baseline model achieved an F1-score of ~74% for the positive class of Check-Worthy Factual Sentences (CFS).
We also tested transformer models based on the ClaimSpotter publication, which evaluates BERT, DistilBERT, and RoBERTa. In addition, the authors added adversarial perturbations to each of the transformers, which prevented overfitting and improved model accuracy on unseen data in their study. The models were published as open source on GitHub, and we tested all of them, both in the base version and with the added adversarial layers, within the scope of this project. Prioritizing a balance of detection accuracy and model training time, we found that the BERT-based model (without adversarial perturbations) outperformed all other models (F1-score of 0.8338 for CFS).
Transparency modeling
Transparency modeling includes the preparation of classifiers based on published datasets for hate speech, clickbait, and political bias classification. Collaborators built many independent models for these problems. We benchmarked the models and selected the best one(s) based on the F1-score of the positive class (hate, clickbait, or politically biased). Finally, we evaluated the models on the in-house datasets prepared earlier.
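For reference, the F1-score of the positive class is the harmonic mean of precision and recall computed for that class only. The short sketch below implements the standard definition in plain Python; the function name and the toy labels are illustrative, and in practice a library routine would normally be used instead.

```python
def positive_class_f1(y_true, y_pred, positive=1):
    """F1-score for a single class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 is the positive class (e.g. hate), 0 the negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
f1 = positive_class_f1(y_true, y_pred)
print(round(f1, 3))
```

Scoring only the positive class matters on imbalanced data: a model that always predicts the majority class can reach high accuracy while its positive-class F1-score is zero.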
Hate Speech Classification
Hate speech classification is a binary problem at the sentence level where each sentence is labeled as either ‘hate’ or ‘no-hate’. We used two openly available datasets for this classification problem: StormFront (based on a forum) and Crowdflower (based on tweets). Even though we prepared classification models on both datasets, only the StormFront dataset is binary by nature, and therefore we spent the majority of our time modeling StormFront data.
The full StormFront dataset is highly imbalanced. We therefore prepared two different datasets from it: one containing the full dataset and a second, balanced one obtained by subsampling. We built several classification models for these two datasets separately.
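A balanced subsample can be produced by randomly undersampling every class down to the size of the smallest one. The sketch below assumes this is how the balancing was done, which the text does not state; the function name and the synthetic data are hypothetical.

```python
import random
from collections import defaultdict


def undersample_to_balance(examples, seed=0):
    """Return a class-balanced subsample of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    # Every class is cut down to the size of the rarest class.
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Synthetic imbalanced data: 5 'hate' vs 95 'no-hate' sentences.
data = [(f"sentence {i}", "hate") for i in range(5)]
data += [(f"sentence {i}", "no-hate") for i in range(5, 100)]
balanced = undersample_to_balance(data)
print(len(balanced))
```

Undersampling discards most majority-class examples, so it trades training data for balance; this is one reason to model the full and the balanced dataset separately.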
On the balanced dataset, a BERT + CNN-based model achieved the best F1-score of 0.812. A USE-based model was a very close second. Several traditional machine learning algorithms (Naive Bayes, Random Forest, and Support Vector Classifier) performed comparably.
Clickbait Classification
Clickbait detection is also a binary classification problem: given an article headline, we try to predict whether the article is clickbait or not. For this problem, we used a dataset from Kaggle, which is in fact a combination of two datasets. We call it the ‘combined’ dataset.
We built several classification models on the combined dataset, including XGBoost, a BERT-based model, and a USE-based model. According to the F1-score comparison, XGBoost with a comprehensive set of features performed the best; however, it came at a cost in time and memory. Moreover, the gain over a comparable XGBoost model with a smaller set of features was minimal (F1-score of 0.905 vs 0.902). Therefore, we think a simpler XGBoost model would probably be the best practical solution.
Political bias Classification
Classification of political bias can be done as either a binary problem (biased or not biased) or a three-label problem (left/liberal, center/neutral, and right/conservative). We investigated both options, but because of poor performance on the three-label problem, we decided to treat political bias as a binary classification problem.
Additionally, the freely available datasets for political bias are labeled either at the full article level or at the individual sentence level. Examples of article-level datasets are the DeepBlue and Baly et al. datasets, and an example of a sentence-level dataset is the IBC dataset. Here we report performance on the Baly et al. and IBC datasets.
For article-level classification with the Baly et al. dataset, we built the tree-based classifiers Random Forest and XGBoost and the transformer-based classifiers RoBERTa and LongFormer; RoBERTa outperformed the other models (F1-score of 0.79). For sentence-level classification on the IBC dataset, we tried a Naive Bayes model and a USE-based model, and found the USE-based model to perform the best (F1-score of 0.90). Our study suggests that classification at the article level is considerably more challenging than classification at the sentence level.
In-house data modeling
Once our in-house labeled data were ready, we used those datasets for the three transparency modeling problems. We first applied some of the models built from freely available datasets off-the-shelf to the in-house datasets, but the performance was poor. Therefore, we decided to model the in-house datasets separately.
Figure 7 summarizes the results of modeling on the in-house datasets. For hate speech, we built models separately for the original imbalanced dataset and a subsampled balanced dataset. Using a USE-based approach, we found that balancing improves the F1-score from 0.31 to 0.51; however, the performance is still inferior to that on the StormFront dataset.
For the clickbait in-house dataset, we used an XGBoost model and a USE-based model (similar to the combined dataset modeling), and we found the USE-based model to outperform the XGBoost model. However, both of these approaches performed substantially worse, as our best F1-score dropped to 0.42 (from ~0.90 on the combined dataset).
As our political bias in-house dataset has three labels (left, center, and right), we first merged the left and center labels to prepare a binary dataset. We used a USE-based model on this dataset; however, we were only able to get an F1-score of 0.17.
Overall, models performed comparatively poorly on all the in-house datasets, which indicates room to improve these datasets.
Minimal Viable Product: NewsScore
To demonstrate how the delivered models can be used, and considering The Newsroom’s vision of developing a browser extension, we developed a basic version of such an extension, named NewsScore. Figure 8 shows how the extension works in practice. When a user visits a news article, the extension takes that article as input and prepares a report in the back-end. When the user clicks on the extension, it visualizes the report and provides additional options to interact with the extension.
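The back-end step can be pictured as a function that takes an article, runs each model on the relevant text, and returns a report for the front-end to display. The sketch below is an assumption-laden illustration, not the NewsScore implementation: the report layout, the `build_report` and `split_sentences` names, the 0.5 threshold, and the toy stand-in models are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical interface: a classifier maps text to a probability in [0, 1].
Classifier = Callable[[str], float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., ! and ?; a real pipeline would use an NLP library."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def build_report(headline: str, body: str,
                 headline_models: Dict[str, Classifier],
                 sentence_models: Dict[str, Classifier],
                 threshold: float = 0.5) -> dict:
    """Score the headline and list the body sentences each model flags."""
    sentences = split_sentences(body)
    return {
        "headline": {name: model(headline)
                     for name, model in headline_models.items()},
        "flagged": {
            name: [s for s in sentences if model(s) >= threshold]
            for name, model in sentence_models.items()
        },
    }


# Stand-in models for demonstration only; they are not the project's classifiers.
def toy_clickbait(text: str) -> float:
    return 1.0 if "you won't believe" in text.lower() else 0.0


def toy_claim(text: str) -> float:
    return 1.0 if any(ch.isdigit() for ch in text) else 0.0


report = build_report(
    headline="You won't believe what happened next",
    body="Unemployment fell to 4 percent last year. What a surprise! "
         "Officials said 200 jobs were added.",
    headline_models={"clickbait": toy_clickbait},
    sentence_models={"claims": toy_claim},
)
print(report)
```

Keeping the models behind a common callable interface means a classifier can be swapped (for example, a BiLSTM for a transformer) without changing the report-building code.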
Figure 9 shows the different components of the extension schematically and zooms in on the NewsScore report from Figure 8. NewsScore has the following features:
- An initial report about the whole article regarding the presence of clickbait, bias, or hate speech, along with more detailed information in each section about why the article received the given score. Currently, the reliability of information section only prints the detected CFS; in the future, these CFS will be the input for new features such as claim verification and reporting of a specific claim. (This module is also available as an option for the user to highlight a sentence from the article and apply any of the available tools above to it. Currently, this feature is only used to add text to the list of detected claims for demonstration purposes.)
- A section where the user can provide feedback on the score given in each section, so that the extension can improve over time with more sophisticated approaches like active learning.
- A (disabled) section for related articles that may be populated in the future to show similar articles, for example articles on the same topic with better scores.
Some features were handed to The Newsroom team as is, with a few steps left before they can be used directly by an end user. In the future, it would be useful to integrate different modeling approaches in the MVP back-end and enhance the front-end with useful data visualizations. Once completed, the Chrome extension will provide an article summary that recapitulates the news article by providing an overall news score, transparency scores for hate speech, clickbait, and political bias, and a score for claim verification (reliable information).
Conclusion and Future Directions
In conclusion, the envisioned goals for this project were successfully achieved, with in-house labeled datasets generated for Political Bias, Hate Speech, and Clickbait. Transparency Models were also trained for the detection of the aforementioned three attributes, as well as for claims detection. Finally, an MVP was produced for front-end model deployment and display of the news article trust score.
As a team of 45+ collaborators, we achieved a considerable result in an 8-week time span. Nonetheless, this work could be further extended and refined. Possible areas of future exploration are listed below:
- Prepare higher-quality datasets by ensuring 3x labeling for all of the in-house datasets. Extensively explore model training and evaluation on the in-house datasets.
- Use domain-specific approaches to model the data. One example is NewsBERT, a recent development, which we did not have time to explore.
- Explore different approaches to generate an aggregated news (trust) score from the results of transparency and claim detection models.
- Implement additional transparency models for other attributes of misinformation. For instance, detection of Machine-Generated Text was initially explored but eventually halted to prioritize the aforementioned classification models given the timeline of this project.
- Design storage strategies so each news article is efficiently processed for the users by reusing previous visits to it.
- Design a common feature representation so the models can reuse these features across each score generation.
- Explore different approaches to model deployment on the MVP, particularly those of Transformer modeling.
- Port the MVP extension code to more scalable and robust technologies like Vue for increased performance.
- Apply each prediction directly to the news article in the form of highlighted text so the user can spot each occurrence of a phenomenon (clickbait, bias, hate speech, and so on).
- Further calibrate the sentence segmentation process, which is common to all classification problems and could be optimized for news articles, for example by considering the role of social media citations within the text.
- Incorporate active learning practices so the feedback from the user can help modeling algorithms to improve their predictions.
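As an illustration of the storage idea in the list above, a report could be cached under a key derived from the article URL, so that repeat visits reuse the stored result instead of re-running the models. This is only a sketch under stated assumptions: the `ReportCache` class, the in-memory dictionary, and the minimal URL normalization are hypothetical, and a deployed extension would need persistent storage and an expiry policy.

```python
import hashlib


class ReportCache:
    """In-memory cache of article reports keyed by a hash of the URL."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    @staticmethod
    def _key(url):
        # Minimal normalization: drop the fragment and any trailing slash.
        normalized = url.split("#", 1)[0].rstrip("/")
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, url, compute):
        """Return the cached report, computing it only on the first visit."""
        key = self._key(url)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute(url)
        return self._store[key]


def analyze(url):
    # Placeholder for the expensive model pipeline.
    return {"url": url, "scores": {}}


cache = ReportCache()
first = cache.get_or_compute("https://example.com/story#comments", analyze)
second = cache.get_or_compute("https://example.com/story/", analyze)
print(cache.misses)
```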
Finally, we acknowledge all the collaborators for their hard work, our labeling partner HumanFirst for assistance in labeling, our client The Newsroom for their close cooperation and feedback, and Omdena for making this project possible!