A BI Tool for Collecting Online Financial Information using NLP (Use case: Amazon)
April 14, 2022
In this article, you will learn an end-to-end web scraping and NLP process for gathering financial business intelligence about a given company via sentiment analysis and keyword extraction. We will use Amazon as an example use case, scraping financial news to discover business events and the sentiment tone around the organization:
- This tool will mine the web via three APIs (NEWS API, FINVIZ, and GoogleNews) to extract and gather financial news about a target company on a specific day.
  - Tools used: urllib.request, BeautifulSoup, regex
- It will then do sentiment analysis on news headlines to predict the tone (percent of positive, negative, and neutral).
  - Tools used: nltk.sentiment.vader, gensim.parsing.preprocessing, nltk.stem
- Lastly, it will extract important company events from the news texts that could impact the stock price, e.g. new products to launch, M&A, stock buybacks or splits, increases or decreases in hiring.
  - Tools used: KeywordProcessor from the flashtext library
- This tool is deployed in Streamlit, and the link is provided below (the Streamlit code will not be discussed as part of this article).
- Streamlit app: https://share.streamlit.io/samfaar/bi-financial-app/main
- GitHub page: https://github.com/samfaar/BI-Financial-App
Web Scraping via FINVIZ, NEWS API, and GoogleNews
Google does not like being scraped, mainly because Google Search itself is literally a mighty web scraper. As a result, Google has mechanisms to "limit" scraping of its search results. For example, you might write Python code that scrapes Google search results today, but it will break whenever Google changes the CSS classes used on the search engine results pages. If it stops working, you'll need to view the source of the page, inspect the elements and tags you are trying to parse, and update the CSS identifiers accordingly. I was able to scrape news both via the GoogleNews API (https://pypi.org/project/GoogleNews/) and by sending requests to https://www.google.com/search?q={keyword}, but both methods eventually stopped working after a few days. For this reason, we will perform our news scraping with FINVIZ (http://finviz.com) and NEWS API (https://newsapi.org/docs/client-libraries/python) to ensure consistently successful results, and will also add GoogleNews with a try/except block to handle possible exceptions.
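The try/except fallback idea can be sketched as follows. This is a minimal sketch, not the article's actual code: `scrape_google_news` is a hypothetical stand-in for the flaky GoogleNews call, and the point is that a failure degrades to an empty data frame rather than crashing the pipeline.

```python
import pandas as pd

def scrape_google_news(query):
    # Hypothetical stand-in for the GoogleNews call, which may fail
    # or return nothing when Google throttles repeated requests.
    raise ConnectionError("blocked by Google")

def safe_google_news(query):
    """Return GoogleNews results, or an empty DataFrame on any failure."""
    try:
        return scrape_google_news(query)
    except Exception:
        # Fall back to an empty frame with the expected columns so a later
        # pd.concat with the FINVIZ and NEWS API frames still works.
        return pd.DataFrame(
            columns=['news_headline', 'source', 'datetime', 'description', 'url'])

df = safe_google_news('AMZN')
print(df.shape[0])  # 0 when the request fails
```

Because the fallback frame carries the same column names as a successful scrape, downstream concatenation and cleaning code does not need any special-casing.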
FINVIZ
Why news from FINVIZ? FINVIZ has a list of trusted websites, and headlines from these sites tend to be more consistent in their jargon than those from independent bloggers. Consistent textual patterns will improve the sentiment analysis scores.
The code below shows how we connect to the FINVIZ search URL using Request and extract the news for a given company ticker symbol (AMZN for Amazon in our example) into a data frame called "news".
```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd

# Let's pick a company ticker symbol (AMZN for Amazon)
company_ticker = 'AMZN'

# Add the ticker symbol to the "finviz" search box url
url = ("http://finviz.com/quote.ashx?t=" + company_ticker.lower())

# Most websites block requests without a User-Agent header (this one simulates a typical browser)
# Send a Request to the url and return an html file
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# open and read the request
webpage = urlopen(req).read()

# make a soup using BeautifulSoup from webpage
html = soup(webpage, "html.parser")

# Extract the 'class' = 'fullview-news-outer' table from our html, and create a dataframe from it
news = pd.read_html(str(html), attrs={'class': 'fullview-news-outer'})[0]

# extract the link for each news item by finding all "a" tags with 'class' = 'tab-link-news'
links = []
for a in html.find_all('a', class_="tab-link-news"):
    links.append(a['href'])

# Clean up our news dataframe
news.columns = ['Date', 'News_Headline']
news['Article_Link'] = links
news.head()
```
As you can see above, the "Date" column has two formats: date and time, or time only. This is because when the date is unchanged, FINVIZ shows only the time and does not repeat the date. We will clean this up using regular expressions (regex).
```python
import re

# extract time as a new column
news['time'] = news['Date'].apply(lambda x: ''.join(re.findall(r'[a-zA-Z]{1,9}-\d{1,2}-\d{1,2}\s(.+)', x)))

# fill empty cells with the times mentioned in the "Date" column
news.loc[news['time'] == '', 'time'] = news['Date']
news
```
Next, we extract the date from our “Date” column.
```python
import numpy as np

news['date'] = news['Date'].apply(lambda x: ''.join(re.findall(r'([a-zA-Z]{1,9}-\d{1,2}-\d{1,2})\s.+', x)))

# change empty cells to NaN type in the new "date" column
news.loc[news['date'] == '', 'date'] = np.nan

# fillna() by forward filling
news.fillna(method='ffill', inplace=True)
news
```
Finally, we combine the "date" and "time" columns, convert them to datetime type, and clean up our data frame.
```python
# combine "date" & "time" columns and convert to datetime type
news['datetime'] = pd.to_datetime(news['date'] + ' ' + news['time'])

# clean our dataframe
news.drop(['Date', 'time', 'date'], axis=1, inplace=True)
news.sort_values('datetime', inplace=True)
news.reset_index(drop=True, inplace=True)
news.columns = ['news_headline', 'url', 'datetime']
news
```
Note that we have datetimes older than the search date (2022-04-01); we will remove them at the end of our scraping (once we combine all of our scraping results) so that only relevant dates are included.
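The date filtering used at the end of the scraping can be sketched with a toy data frame (hypothetical headlines; the real tool applies this after concatenating all three sources). Comparing the index at daily granularity with to_period('D') keeps every row from the search day regardless of its time:

```python
import pandas as pd

# Toy headlines spanning several days (hypothetical data)
df = pd.DataFrame(
    {'news_headline': ['old news', 'relevant news', 'late news']},
    index=pd.to_datetime(['2022-03-30 09:00', '2022-04-01 10:30', '2022-04-01 16:45']))

search_date = '2022-04-01'

# Collapse each timestamp to its day, then compare against the search date
mask = df.index.to_period('D') == search_date
df_filtered = df[mask]
print(df_filtered['news_headline'].tolist())  # ['relevant news', 'late news']
```

This avoids manual string slicing of the timestamps and works no matter what time of day each article was published.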
NEWS API
The code below extracts news as a data frame called df_newsapi. We will also do some cleaning on the data frame.
```python
from newsapi.newsapi_client import NewsApiClient

company_ticker = 'AMZN'
search_date = '2022-04-01'

newsapi = NewsApiClient(api_key='3a2d0a55066041dc81e3acfbd665fc6e')

# extract "articles", which will be a dictionary
articles = newsapi.get_everything(q=company_ticker,
                                  from_param=search_date,
                                  language="en",
                                  sort_by="publishedAt",
                                  page_size=100)

# we want the "articles" key from our "articles" dictionary
df_newsapi = pd.DataFrame(articles['articles'])
df_newsapi.head()
```
```python
# do some cleaning of the df_newsapi
df_newsapi.drop(['author', 'urlToImage'], axis=1, inplace=True)
df_newsapi.rename({'publishedAt': 'datetime'}, axis=1, inplace=True)
df_newsapi.rename({'title': 'news_headline'}, axis=1, inplace=True)
df_newsapi['source'] = df_newsapi['source'].map(lambda x: x['name'])
df_newsapi.head()
```
GoogleNews
The Python code for the news extraction via GoogleNews is given below. We used Config because the newspaper package might sometimes fail to download an article due to restrictions on accessing it from a specified URL. To bypass that restriction, we set the user_agent variable so that those restricted articles can be parsed. Also, the connection may occasionally time out, since newspaper uses the Python requests module under the hood; to prevent that, we set config.request_timeout.
We are going to limit our news extraction to the first two pages of results from Google News. We could write a for loop to go through multiple pages of results, but those repetitive requests to Google would be automatically detected and blocked (returning a failed connection error).
```python
from GoogleNews import GoogleNews
from newspaper import Config
import re

company_ticker = 'AMZN'
search_date = '2022-04-02'

# GoogleNews sometimes returns an empty dataframe, so we add a try/except block to handle those exceptions
try:
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    config = Config()
    config.browser_user_agent = user_agent
    config.request_timeout = 10

    df_google = pd.DataFrame()

    # change the date string format from YYYY-MM-DD to MM/DD/YYYY so that it works with GoogleNews
    start_date = re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', r'\2/\3/\1', search_date)

    # Extract news with GoogleNews ---> gives only 10 results per request
    googlenews = GoogleNews(start=start_date)
    googlenews.search(company_ticker)

    # store the results of the first result page
    result1 = googlenews.result()
    df_google1 = pd.DataFrame(result1)

    # store the results of the 2nd result page
    googlenews.clear()
    googlenews.getpage(2)
    result2 = googlenews.result()
    df_google2 = pd.DataFrame(result2)

    df_google = pd.concat([df_google1, df_google2])

    # do some cleaning of the df_google DF
    if df_google.shape[0] != 0:
        df_google.drop(['img', 'date'], axis=1, inplace=True)
        df_google.columns = ['news_headline', 'source', 'datetime', 'description', 'url']
        display(df_google.head())
except:
    pass
```
Combining It All in One Custom Web-Scraping Function
We now write a custom function that scrapes the news with FINVIZ, NEWS API, and GoogleNews, and combines all results into a single data frame. Our custom function (get_news) takes two string inputs from the user: the company ticker symbol, and the date for collecting news, which must be in YYYY-MM-DD format. We will later use input boxes for these items in our Streamlit app.
```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import re
import numpy as np
from newsapi.newsapi_client import NewsApiClient
from GoogleNews import GoogleNews
from newspaper import Config

def get_news(company_ticker, search_date):
    ## newsapi
    newsapi = NewsApiClient(api_key='3a2d0a55066041dc81e3acfbd665fc6e')
    articles = newsapi.get_everything(q=company_ticker,
                                      from_param=search_date,
                                      language="en",
                                      sort_by="publishedAt",
                                      page_size=100)
    df_newsapi = pd.DataFrame(articles['articles'])

    # do some cleaning of the DF
    df_newsapi.drop(['author', 'urlToImage'], axis=1, inplace=True)
    df_newsapi.rename({'publishedAt': 'datetime'}, axis=1, inplace=True)
    df_newsapi.rename({'title': 'news_headline'}, axis=1, inplace=True)
    df_newsapi['source'] = df_newsapi['source'].map(lambda x: x['name'])

    ## finviz
    url = ("http://finviz.com/quote.ashx?t=" + company_ticker.lower())
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    html = soup(webpage, "html.parser")
    news = pd.read_html(str(html), attrs={'class': 'fullview-news-outer'})[0]
    links = []
    for a in html.find_all('a', class_="tab-link-news"):
        links.append(a['href'])

    # Clean up news dataframe
    news.columns = ['Date', 'News_Headline']
    news['Article_Link'] = links

    # >>> clean "Date" column and create a new "datetime" column
    # extract time
    news['time'] = news['Date'].apply(lambda x: ''.join(re.findall(r'[a-zA-Z]{1,9}-\d{1,2}-\d{1,2}\s(.+)', x)))
    news.loc[news['time'] == '', 'time'] = news['Date']

    # extract date
    news['date'] = news['Date'].apply(lambda x: ''.join(re.findall(r'([a-zA-Z]{1,9}-\d{1,2}-\d{1,2})\s.+', x)))
    news.loc[news['date'] == '', 'date'] = np.nan
    news.fillna(method='ffill', inplace=True)

    # convert to datetime type
    news['datetime'] = pd.to_datetime(news['date'] + ' ' + news['time'])
    news.drop(['Date', 'time', 'date'], axis=1, inplace=True)
    news.sort_values('datetime', inplace=True)
    news.reset_index(drop=True, inplace=True)
    news.columns = ['news_headline', 'url', 'datetime']
    df_finviz = news.copy()

    ## GoogleNews
    # GoogleNews sometimes returns an empty dataframe, so we add a try/except block to handle those exceptions
    try:
        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
        config = Config()
        config.browser_user_agent = user_agent
        config.request_timeout = 10
        df_google = pd.DataFrame()

        # change the date string format from YYYY-MM-DD to MM/DD/YYYY so that it works with GoogleNews
        start_date = re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', r'\2/\3/\1', search_date)

        # Extract news with GoogleNews ---> gives only 10 results per request
        googlenews = GoogleNews(start=start_date)
        googlenews.search(company_ticker)

        # store the results of the first result page
        result1 = googlenews.result()
        df_google1 = pd.DataFrame(result1)

        # store the results of the 2nd result page
        googlenews.clear()
        googlenews.getpage(2)
        result2 = googlenews.result()
        df_google2 = pd.DataFrame(result2)
        df_google = pd.concat([df_google1, df_google2])

        # do some cleaning of the df_google DF
        if df_google.shape[0] != 0:
            df_google.drop(['img', 'date'], axis=1, inplace=True)
            df_google.columns = ['news_headline', 'source', 'datetime', 'description', 'url']
    except:
        pass

    ## Add the 3 DFs together
    df_news = pd.concat([df_newsapi, df_finviz, df_google], ignore_index=True)
    df_news['datetime'] = pd.to_datetime(df_news['datetime'], format='%Y-%m-%d %H:%M:%S')
    df_news.set_index('datetime', inplace=True)

    # only return the rows that match our search_date
    df_news = df_news[df_news.index.to_period('D') == search_date]
    df_news.sort_index(inplace=True)

    # Get a clean source column from the urls using regex
    df_news['source'] = df_news['url'].map(lambda x: ''.join(re.findall(r"https?://(?:www\.)?([A-Za-z_0-9.-]+).*", x)))
    return df_news
```
Here is what we get when we run our custom function for “AMZN” on “2022-04-01”:
```python
df_news = get_news('AMZN', '2022-04-01')
df_news.shape
>>> (68, 5)
```
Sentiment Analysis on News Headlines via NLTK VADER
Sentiment Analysis Methods
There are two main methods for Sentiment Analysis (SA):
1. Rules-based SA (NLTK VADER, TextBlob)
- Attaches a positive or negative rating to certain words (e.g., "horrible" has a negative association), pays attention to negation where it exists, and returns values based on these words. This tends to work fine and has the advantage of being simple and extremely fast, but it has some weaknesses:
- As sentences get longer, more neutral words appear, and the overall score tends to normalize towards neutral.
- Sarcasm and jargon are often misinterpreted
2. Vector-based SA (Flair)
- Each word is represented inside a vector space. Words with similar vector representations are often used in the same context, which allows us to determine the sentiment of any given vector and, therefore, of any given sentence.
- Weaknesses:
- Flair tends to be much slower than its rule-based counterparts, but it has the advantage of being a trained NLP model rather than a rule-based one, which, if done well, brings added performance.
- To put the slowdown in perspective: on 1200 sentences, NLTK took 0.78 seconds, TextBlob took an impressive 0.55 seconds, and Flair took 49 seconds (50–100x longer), which raises the question of whether the added accuracy is truly worth the increased runtime.
The performance of each method depends on the type of text that is analyzed, and it is recommended to test them all before selecting a final SA method. You can also design your own sentiment analysis tool using supervised ML (https://python-bloggers.com/2020/10/how-to-run-sentiment-analysis-in-python-using-vader/).
For the purpose of developing our tool, NLTK VADER was used, as it showed the best SA results. The VADER library returns four values:
- pos: The probability of the sentiment to be positive
- neu: The probability of the sentiment to be neutral
- neg: The probability of the sentiment to be negative
- compound: The normalized compound score, computed from the sum of all lexicon ratings and scaled to between -1 and 1.
Notice that the pos, neu, and neg probabilities add up to 1. Here are the typical threshold values for the compound score:
- positive: compound score ≥ 0.05
- neutral: compound score between -0.05 and 0.05
- negative: compound score ≤ -0.05
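The thresholds above can be expressed as a small labeling helper (a minimal sketch; the function name label_compound is ours, not part of NLTK):

```python
def label_compound(score: float) -> str:
    """Map a VADER compound score to a sentiment label
    using the conventional +/- 0.05 thresholds."""
    if score >= 0.05:
        return 'positive'
    if score <= -0.05:
        return 'negative'
    return 'neutral'

print([label_compound(s) for s in (0.62, 0.01, -0.3)])  # ['positive', 'neutral', 'negative']
```

Note that boundary values land on the non-neutral side: a score of exactly 0.05 is labeled positive and exactly -0.05 negative, matching the "≥" and "≤" in the thresholds above.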
Obtaining VADER SA Scores on News Headlines
The Python code for VADER SA is given below; it stores the compound SA score of each news headline in a new column of our data frame.
```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

# download these 3 when running for the first time
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

def nltk_vader_score(text):
    sentiment_analyzer = SIA()
    # we take the "compound" score (from -1 to 1): the normalized score computed from the sum of all lexicon ratings
    sent_score = sentiment_analyzer.polarity_scores(text)['compound']
    return sent_score

df_news['sentiment_score_vader'] = df_news['news_headline'].map(nltk_vader_score)
df_news.head()
```
EDA of VADER Compound Scores
In this section, we will use the Plotly library to visualize the distribution of sentiment scores, as well as the percentage of each of the three sentiment types across all news headlines.
```python
import plotly.express as px

fig = px.histogram(df_news,
                   x='sentiment_score_vader',
                   color='source').update_xaxes(categoryorder="total descending")
fig.update_layout(xaxis_title='Sentiment Score (Compound from -1 to 1)',
                  yaxis_title='Count',
                  font=dict(size=16),
                  bargap=0.025,
                  width=790, height=520,
                  legend=dict(orientation="h", yanchor="top", y=1.23, xanchor="center", x=0.48))
fig.show('notebook')
```
For sentiment type, we define the following custom function that labels the sentiment scores accordingly.
```python
def sentiment_type(text):
    analyzer = SIA().polarity_scores(text)
    neg = analyzer['neg']
    neu = analyzer['neu']
    pos = analyzer['pos']
    comp = analyzer['compound']
    if neg > pos:
        return 'negative'
    elif pos > neg:
        return 'positive'
    elif pos == neg:
        return 'neutral'

df_news['sentiment_type'] = df_news['news_headline'].map(sentiment_type)
```
Now we can plot a pie chart from the newly created column ‘sentiment_type’, which will show the percentage of each sentiment type for Amazon.
```python
fig = px.pie(df_news,
             values=df_news['sentiment_type'].value_counts(normalize=True) * 100,
             names=df_news['sentiment_type'].unique(),
             color=df_news['sentiment_type'].unique(),
             hole=0.35,
             color_discrete_map={'neutral': 'silver',
                                 'positive': 'mediumspringgreen',
                                 'negative': 'orangered'})
fig.update_traces(textposition='inside',
                  textinfo='percent+label',
                  textfont_size=22,
                  hoverinfo='label+value',
                  texttemplate="%{label}<br>%{value:.0f}%")
fig.update_layout(font=dict(size=16), width=810, height=520)
fig.show('notebook')
```
News Headlines WordCloud
Lastly, we generate a WordCloud map on our News Headlines to provide a global look at the news scope.
```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def word_cloud(text):
    stopwords = set(STOPWORDS)
    allWords = ' '.join([nws for nws in text])
    wordCloud = WordCloud(background_color='white',  # or 'black'
                          width=1600,
                          height=800,
                          stopwords=stopwords,
                          min_font_size=20,
                          max_font_size=150).generate(allWords)
    fig, ax = plt.subplots(figsize=(20, 10), facecolor='w')  # facecolor='k' for a black frame
    plt.imshow(wordCloud, interpolation='bilinear')
    ax.axis("off")
    fig.tight_layout(pad=0)
    plt.show()

print('Wordcloud for ' + company_ticker)
word_cloud(df_news['news_headline'].values)
```
As can be seen in the WordCloud, Amazon workers voting to unionize is a main news topic that our tool has correctly picked up.
Company Events mentioned in the News
To extract certain company events from the news, we first need to get the text of each news article using requests and BeautifulSoup. Because we want to scrape the news from various websites, the challenge is to get only the content of the news body (and not all the text within a news web link). One way is to use .body as shown below; we still get some text that is not part of the news body, but the advantage of this method is that we get a clean HTML text that does NOT need any regex post-processing.
```python
soup = BeautifulSoup(html_text, 'lxml')
tag = soup.body
```
Another method is to look at several news web links individually and see what the HTML class is for the content of the news body. Since our web scraping is dynamic (we get news from well-known sources like Yahoo Finance or WSJ, but the news sources can be anything depending on the date and company the user selects), our class list will not be exhaustive. Another downside of this method is that we need to clean the HTML text with regex post-processing.
```python
soup = BeautifulSoup(html_text, 'lxml')
body_content = soup.findAll('div',
                            attrs={'class': ['caas-body',
                                             'article-content-body-only',
                                             'article__body',
                                             'body',
                                             'article-content rich-text']})
```
We select method 1 explained above, and create a custom function for our text extraction.
```python
def get_article_text(Article_Link):
    import requests
    from bs4 import BeautifulSoup

    # use the requests package to make a GET request for the website, i.e. fetch its data
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    html = requests.get(Article_Link, headers=header).content
    soup = BeautifulSoup(html)

    # Get the whole body tag
    tag = soup.body

    # Join each string recursively
    text = []
    for string in tag.strings:
        # ignore strings with fewer than 15 words
        if len(string.split()) > 15:
            text.append(string)
    return ' '.join(text)
```
```python
df_news['news_text'] = df_news['url'].map(get_article_text)

# clean news_text by transforming anything that is NOT a space, letter, or number to ''
df_news['news_text'] = df_news['news_text'].apply(lambda x: re.sub('[^ a-zA-Z0-9]', '', x))
```
Now that we have the text of each news article, we can proceed to extract important company events from the news, e.g. new products to launch, mergers, acquisitions, stock-related events (buyback, split, ...), hiring, or layoffs. We can do this in a number of ways, one of the most popular being regex. But there is a Python library called FlashText that does the job more quickly and is much easier to work with. We therefore define a custom function that uses the FlashText library.
```python
def keyword_extractor(text):
    from flashtext import KeywordProcessor
    kwp = KeywordProcessor()
    keyword_dict = {
        'new product': ['new product', 'new products'],
        'M&A': ['merger', 'acquisition'],
        'stock split/buyback': ['buyback', 'split'],
        'workforce change': ['hire', 'hiring', 'firing', 'lay off', 'laid off']
    }
    kwp.add_keywords_from_dict(keyword_dict)
    # we use set() to remove repeated keywords, and ', '.join() to get a string instead of a set:
    return ', '.join(set(kwp.extract_keywords(text)))
```
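For comparison, the same keyword-to-event mapping could be done with plain regular expressions, which illustrates why FlashText is the easier tool as the dictionary grows. This is a sketch using the same event dictionary; the function name keyword_extractor_regex is ours:

```python
import re

keyword_dict = {
    'new product': ['new product', 'new products'],
    'M&A': ['merger', 'acquisition'],
    'stock split/buyback': ['buyback', 'split'],
    'workforce change': ['hire', 'hiring', 'firing', 'lay off', 'laid off'],
}

def keyword_extractor_regex(text):
    """Regex equivalent of the FlashText extractor: build one alternation
    per event label and match it case-insensitively on word boundaries."""
    found = set()
    for label, terms in keyword_dict.items():
        pattern = r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(label)
    return ', '.join(sorted(found))

print(keyword_extractor_regex('Amazon announced a stock split and plans to hire'))
```

Unlike FlashText, which matches all keywords in a single pass over the text, this version scans the text once per event label, so it scales poorly as the dictionary grows.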
We then apply our function to create a new column containing our company event keywords.
```python
df_news['event_keywords'] = df_news['news_text'].map(keyword_extractor)
```
Now, we can visualize the number of news articles containing company-event keywords for Amazon.
```python
fig = px.histogram(df_news[df_news['event_keywords'] != ''],
                   x='event_keywords',
                   color='sentiment_type',
                   color_discrete_map={'neutral': 'silver',
                                       'positive': 'mediumspringgreen',
                                       'negative': 'orangered'}).update_xaxes(categoryorder="total descending")
fig.update_layout(yaxis_title='Count',
                  xaxis_title='',
                  width=810, height=620,
                  font=dict(size=16),
                  legend=dict(orientation="h", yanchor="top", y=1.16, xanchor="center", x=0.5))
fig.update_xaxes(tickangle=-45)
```
Disclaimer: The material in this article is purely educational and should not be taken as professional investment or any other advice. The information presented is just a snapshot.
References
- https://towardsdatascience.com/the-best-python-sentiment-analysis-package-1-huge-common-mistake-d6da9ad6cdeb
- https://pythoninvest.com/long-read/sentiment-analysis-of-financial-news
- https://www.kaggle.com/mmmarchetti/sentiment-analysis-on-financial-news
- https://medium.datadriveninvestor.com/scraping-live-stock-fundamental-ratios-news-and-more-with-python-a716329e0493
- https://tradewithpython.com/news-sentiment-analysis-using-python