A machine learning approach to understand the relation between the news media articles and the downfall of stock exchanges — Panic or Information?


By Nikhel Gupta


Covid-19 is undoubtedly a cruel virus and we have seen it ripping families apart around the globe. At the time of writing this article, about 470,000 people have been infected by this virus on every continent except Antarctica. So this coronavirus should be treated with caution and respect. However, if you compare the current daily statistics of Covid-19 infections to the population of the world, you will find that the probability that any one of us will catch the virus today is very small.

According to the official WHO data, about 60 out of every 1 million people have contracted the virus so far. This is a very small number, an outlier if you ask a statistician. So should we really take all the precautions that governments are asking us to take, like hand washing, sanitizing, keeping social distance, etc.? Certainly yes: even if the probability of getting sick is small, you are not special, it can infect you, and you can transmit it to others. Also, you do not want to cripple a health care system that is already overburdened, and you want to stay healthy (for a change). But do we really need to panic? No, right?

I believe that all of you reading this article know that we don’t need to panic but still, we’re seeing empty shelves in our supermarkets. So why are those bulk buyers panicking and how is it propagating?

Let me think, what do I see when I turn on any news channel or read a news report today? Firstly, I see news about coronavirus and …, well, there is no ‘and’, I only see news about the coronavirus!

Much of the news these days is disingenuous reporting with sensational claims and flashing, scaremongering headlines clearly meant to attract your attention and clicks. Several media outlets are capitalizing on our fear of losing our dear ones and neglecting to report all other news, which directly or indirectly propagates panic and hysteria. Such panic is not good for our psychological and economic state, and we can already see its effect on the crashing stock markets around the world.

Several studies in the past have shown that stock markets are directly affected by everyday news (e.g. Zhou et al. 2018; Hiransha et al. 2018). In this article, I'll show predictions of stock prices made from news articles scraped for the months of January, February, and March of 2020. The following are some of the tasks performed for these predictions:

  1. Pulling all news data for all countries and filtering articles related to Covid-19.
  2. Combining news data for January, February, and March and scraping the articles using the URLs in the data.
  3. Applying co-reference resolution to the text, manually labeling economic and non-economic articles, and training a random forest/logistic regression model to classify all articles.
  4. Downloading stock exchange data.
  5. Building a neural network to predict stock prices from news articles.

Task 1

Find the full code on GitHub.

Next, I searched for news headlines that contain words related to the coronavirus. For instance, I used the following keywords:

relevant_words = ['corona', 'coronavirus', 'wuhan', 'hubei', 'virus', 'quarantine']

The number of articles per day with these keywords ranges from 88,356 (08-03-2020) to 178,823 (04-03-2020) for the month of March. This number was just 12,317 on 01-02-2020.

That's a huge rise for English-only articles, right?

Note that this number is true only when the above keywords are mentioned in the headlines. There can be several more keywords (e.g. I missed Covid-19) and some articles may be talking about coronavirus in the text and not in the headline.
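The headline filter described above can be sketched as follows. This is a minimal illustration, not the code used for the counts: it assumes case-insensitive substring matching, and the function name `is_relevant` and the sample headlines are made up for the example.

```python
# Keywords from the article.
relevant_words = ['corona', 'coronavirus', 'wuhan', 'hubei', 'virus', 'quarantine']


def is_relevant(headline, keywords=relevant_words):
    """Return True if any keyword occurs in the lowercased headline.

    Assumption: plain substring matching, so 'corona' also matches
    'coronavirus' (and unrelated words such as 'coronation').
    """
    text = headline.lower()
    return any(keyword in text for keyword in keywords)


# Hypothetical headlines, invented for illustration only.
headlines = [
    "Wuhan lockdown enters second week",
    "Local team wins championship",
    "Markets slide as Coronavirus fears grow",
]

covid_headlines = [h for h in headlines if is_relevant(h)]
print(len(covid_headlines), "of", len(headlines), "headlines match")
```

Because the match is on substrings, a stricter variant could split the headline into words first, at the cost of missing compound forms.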

Task 2

Find the full code on GitHub.

Task 3

Modern Natural Language Processing (NLP) techniques such as neural networks make coreference resolution straightforward: we can train a model on a coreference-annotated dataset and use the trained model to resolve coreferences in all articles. Even better, there are tools already trained on such large datasets, and we can simply use them to resolve our news-article text. One such tool is NeuralCoref, a pipeline extension for spaCy which annotates and resolves coreferences using a neural network.

Here is the working code for co-referencing.

After some of these articles are labeled manually, classification algorithms such as random forest classifiers and logistic regression are used to categorize the articles as economic or non-economic.

First, the co-referenced text is cleaned using the following clean_text() function:

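import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# stop words (requires nltk.download('stopwords') once)
stopw = set(stopwords.words('english'))
snow = nltk.stem.SnowballStemmer('english')
# keep words like "not" and "very" by removing them from the stop-word set
reqd_words = set(['only', 'very', "doesn't", 'few', 'not'])
stopw = stopw - reqd_words

# text cleaning
def clean_text(article):
    cleaned_article = []
    cleaned_words_list = text_to_word_sequence(article)
    for word in cleaned_words_list:
        if word not in stopw and len(word) > 2:
            # the loop body is missing in the source; stemming with `snow` is assumed
            cleaned_article.append(snow.stem(word))
    return ' '.join(cleaned_article)

final_df['stemmed_articles'] = final_df.text_coref.apply(clean_text)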

The cleaned text is then converted to vectors using TF-IDF over unigrams and bigrams as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

# converting data into vectors using TF-IDF unigrams and bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_features=10000)
tfidf_xtrain_vect = tfidf.fit_transform(train_df.stemmed_articles)
tfidf_xtest_vect = tfidf.transform(test_df.stemmed_articles)
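To make the ngram_range=(1, 2) setting concrete, here is a small pure-Python sketch of the terms it produces: every single word plus every pair of adjacent words. The helper `unigrams_and_bigrams` is a made-up name, and it splits on whitespace only, which is a simplification of the tokenization that TfidfVectorizer applies.

```python
def unigrams_and_bigrams(text):
    """Return the unigrams followed by the bigrams of a whitespace-tokenized text."""
    tokens = text.split()
    # Pair each token with its successor to form the bigrams.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = unigrams_and_bigrams("stock market fall")
print(features)
# ['stock', 'market', 'fall', 'stock market', 'market fall']
```

Each of these terms becomes a candidate column in the TF-IDF matrix, subject to the min_df and max_features limits.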

The model is then selected and trained using a grid search:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# inverse_lambda (candidate values of C) is not defined in this excerpt;
# the values below are placeholders, not the ones used for the reported results
inverse_lambda = [0.01, 0.1, 1, 10, 100]

def best_model(x_train, y_train, x_test, y_test):
    pipe = Pipeline([('classifier', RandomForestClassifier())])
    param_grid = [
        {'classifier': [LogisticRegression()],
         'classifier__penalty': ['l1', 'l2'],
         'classifier__C': inverse_lambda,
         'classifier__class_weight': [None, 'balanced'],
         'classifier__solver': ['liblinear']},
        {'classifier': [RandomForestClassifier()],
         'classifier__n_estimators': list(range(10, 300, 10)),
         'classifier__max_features': list(range(6, 32, 5))},
    ]
    clf = GridSearchCV(pipe, param_grid=param_grid, cv=3, verbose=True, n_jobs=-1)
    clf.fit(x_train, y_train)
    print(f'best estimator is {clf.best_estimator_}')
    # best_estimator_ is already refit on the training data with the best parameters
    best_logreg_model = clf.best_estimator_
    unigram_predicts = best_logreg_model.predict(x_test)
    cv_cm = pd.crosstab(y_test, unigram_predicts, rownames=["True Label"], colnames=["predicted label"])
    print("confusion matrix on test data is:")
    print(cv_cm)
    print(" ")
    print("classification report on test data is")
    print(classification_report(y_true=y_test, y_pred=unigram_predicts))
    return best_logreg_model

The full code for this is in the following gist.

The model produces the following results on a test dataset:

confusion matrix on test data is:
predicted label  NEGATIVE  POSITIVE
True Label                         
NEGATIVE              405        12
POSITIVE               16       381

classification report on test data is
              precision    recall  f1-score   support

    NEGATIVE       0.96      0.97      0.97       417
    POSITIVE       0.97      0.96      0.96       397

    accuracy                           0.97       814
   macro avg       0.97      0.97      0.97       814
weighted avg       0.97      0.97      0.97       814
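As a sanity check, the numbers in the classification report can be recomputed from the confusion matrix. The sketch below uses only the four counts reported above; the variable and function names are chosen for this example.

```python
# Counts taken from the confusion matrix above (rows: true label, columns: predicted label).
tn, fp = 405, 12   # true NEGATIVE predicted as NEGATIVE / POSITIVE
fn, tp = 16, 381   # true POSITIVE predicted as NEGATIVE / POSITIVE


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Metrics for the POSITIVE class.
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = f1_score(precision_pos, recall_pos)

# Metrics for the NEGATIVE class (roles of the two classes swapped).
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = f1_score(precision_neg, recall_neg)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"NEGATIVE  {precision_neg:.2f}  {recall_neg:.2f}  {f1_neg:.2f}")
print(f"POSITIVE  {precision_pos:.2f}  {recall_pos:.2f}  {f1_pos:.2f}")
print(f"accuracy  {accuracy:.2f}")
```

Rounded to two decimals, these values agree with the report: 28 of the 814 test articles are misclassified.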

With this trained model, I find that approximately 20% of the news articles report economic news related to the coronavirus.

Task 4

The following plot shows the normalized closing prices of the stock exchanges.


Stock exchange normalized prices downloaded with Alpha Vantage
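The normalization used for the plot is not spelled out here. One common choice, shown below purely as an assumption, is to divide every closing price by the first one so that each exchange starts at 1.0 and the series become comparable. The function name `normalize_to_first` and the prices are hypothetical.

```python
def normalize_to_first(prices):
    """Scale a price series so that its first value becomes 1.0."""
    if not prices:
        raise ValueError("prices must not be empty")
    base = prices[0]
    if base == 0:
        raise ValueError("first price must be non-zero")
    return [price / base for price in prices]


# Hypothetical closing prices, invented for illustration only.
closing = [200.0, 190.0, 150.0, 160.0]
normalized = normalize_to_first(closing)
print(normalized)
```

Min-max scaling to the range 0 to 1 would be another reasonable option; the choice affects only the vertical scale of the curves, not their shape.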

Task 5

I will discuss the adapted version of the network in more detail in a future post and here I will present the model predictions for some of the stock exchanges.

I) New York Stock Exchange (NYSE) closing prices from 1 January 2020 to 20 March 2020. The green line shows the actual prices and the blue line shows the prices predicted from the news articles.


NYSE closing prices (green) and predicted prices (blue).


II) Same as above but for the Hong Kong Stock Exchange (HKSE).


HKSE closing prices (green) and predicted prices (blue).

III) For Australian Securities Exchange (ASX)


ASX closing prices (green) and predicted prices (blue)

IV) For Bombay Stock Exchange (BSE)


BSE closing prices (green) and predicted prices (blue)

All these stock predictions from the news article data show a correlation between the news and stock prices. The correlations are not very strong on a day-by-day basis, as stock exchange prices depend on several other factors. However, the downward trend of the stock prices predicted from news articles is similar to the actual trend.
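One simple way to put a number on such a relationship is the Pearson correlation coefficient between the actual and the predicted series. The sketch below is a plain-Python illustration with made-up values; it is not the evaluation used for the plots above, and the function name `pearson` is chosen for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical price series, invented for illustration only.
actual = [100.0, 98.0, 95.0, 90.0, 92.0]
predicted = [101.0, 99.0, 97.0, 93.0, 91.0]

r = pearson(actual, predicted)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 means the two series move together; computing it on daily changes rather than on price levels is a stricter test, since two series that merely share a downward trend will already correlate strongly.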

What should we do?

On a personal level, I think we need to calm down and keep working. Follow all the precautions. Think twice and cross-check before believing any news that spreads panic. There is no need to check the number of coronavirus cases every hour or to keep talking about it in every discussion. Possibly, we need to stop watching and reading the news about the coronavirus. To stay informed, we can always turn to the official platforms set up by the government of each country.

Remember, feelings like fear and panic are contagious, probably much more so than Covid-19.

This work was done in collaboration with the community members of Omdena AI. I thank Hoa Nguyen, Yash Mahesh Bangera, Linda and Sadhika Dua for all their important contributions.

In future work, I plan to look into the job crisis and the impact of Covid-19 on the informal sector with another Omdena AI challenge.

You can contact me on LinkedIn and follow my academic research on Orcid.


About Omdena

Building AI through global collaboration

Omdena is a global platform where changemakers build ethical and inclusive AI solutions to real-world problems through collaboration.
