For a visual representation of the findings, check out our Tableau dashboard.
The problem: how to identify and predict, from their social media posts, tweets, and videos, users at high risk of mental health issues.
To begin, we had to answer two questions before conducting our examination:
- What kind of data do we need?
- How can we access it?
We already knew that our problem concerned the mental health impact of COVID-19 in Singapore, which was a great start that helped us narrow our search. But what exactly should we collect?
Since we were predicting impacts based on individual users' tweets, posts, and videos on social platforms, we settled on three important data sources.
Our first aim is to assess whether there are significant impacts on people from the start of covid.
Our second aim is to gather meaningful data from these impacts to explore new ideas for how risk prediction ML models can be improved.
We decided to use a mixed-methods approach to collect both quantitative and qualitative data.
- Quantitative data is expressed in numbers and graphs and is analysed through statistical methods.
- Qualitative data is expressed in words and analysed through interpretations and categorizations.
Now for the second question: when a data analyst or scientist wants to gather data in a short period, the first stop is Kaggle. Right?
How great would it be if data were always as easily accessed as on Kaggle? Data collection? Just visit Kaggle, find a suitable dataset, and download it in less than 5 minutes.
- Hint: Using a Kaggle dataset is not sufficient.
After a lot of “Googling,” we decided to scrape the third-party platforms where we found the data we needed.
How we scraped in a nutshell:
- Understand and inspect the web page to find the HTML markers associated with the information we want.
- Use snscrape and twint, two Python libraries, to scrape Twitter tweets.
- Use the YouTube API to scrape comments.
- Manipulate the scraped data into the form we need (CSV format).
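The last step, reshaping scraped records into CSV, can be sketched as follows. This is a minimal illustration, not the project's actual pipeline, and the field names (`user`, `date`, `text`) are a hypothetical schema:

```python
import csv

def records_to_csv(records, path):
    """Write a list of scraped-record dicts (all with the same keys) to a CSV file."""
    if not records:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

# Toy records standing in for scraped tweets/comments
records = [
    {"user": "u1", "date": "2021-01-05", "text": "feeling anxious about covid"},
    {"user": "u2", "date": "2021-01-06", "text": "lockdown again"},
]
records_to_csv(records, "tweets.csv")
```

Each scraping team would feed its own source's records through a step like this so every downstream task receives the same CSV layout.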
Note: Data scraping and preparation were done by a separate scraping task team for each data source. The teams collaborated and provided cleaned data from the above-mentioned sources.
For user-level risk prediction of mental health impacts, based on different features from the social platforms used as data sources, we chose the following ML methodology for model building:
Supervised Learning – Classification model
Why do we use supervised learning classification models instead of an unsupervised method?
A supervised learning classification algorithm is used to predict the probability of a target variable. The target (dependent) variable in this case is dichotomous, meaning there are only two possible classes: whether the user is mentally impacted, “Yes” or “No”.
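The dichotomous target described above can be illustrated with a simplified labeling rule. This is an illustrative sketch only, not the project's exact labeling logic:

```python
def label_user(sentiment, has_covid_keyword, has_mental_health_keyword):
    """Toy labeling rule: flag a user as mentally impacted ("Yes") when their
    posts carry negative/very negative sentiment AND mention both covid and
    mental-health keywords; otherwise "No"."""
    negative = sentiment in ("negative", "very negative")
    if negative and has_covid_keyword and has_mental_health_keyword:
        return "Yes"
    return "No"

print(label_user("very negative", True, True))  # "Yes"
print(label_user("positive", True, True))       # "No"
```

A classifier is then trained to predict this "Yes"/"No" label from the features, rather than applying the rule directly.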
Tools & Programming Language Used:
- Anaconda Navigator
- Jupyter Notebook
- Google Colab
- Pycaret – Auto ML
- Logistic Regression
What is Pycaret?
Pycaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment.
Why Pycaret and how it helped us?
- Increased Productivity: Pycaret being a low-code library makes you more productive. With less time spent coding, our team can now focus more on business problems.
- Easy to Use: Pycaret is a simple and easy to use machine learning library that will help you to perform end-to-end ML experiments with fewer lines of code.
- Business Ready: Pycaret is a business ready solution. It allows us to do prototyping quickly and efficiently from our choice of notebook environment.
Logistic regression is one of the most common, simple, and efficient methods for binary and multiclass classification problems. (We also ran this algorithm alongside the Pycaret approach, both as the traditional baseline and to see whether the results differed.)
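As a reminder of the mechanics, logistic regression passes a weighted sum of the features through the sigmoid function to get a probability, which is then thresholded at 0.5. A minimal sketch with made-up weights (the feature names and weight values are hypothetical, not our fitted model):

```python
import math

def predict_proba(features, weights, bias):
    """P(mentally impacted = "Yes") via the logistic (sigmoid) function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for [negative_sentiment, covid_keyword, mental_health_keyword]
weights, bias = [1.5, 1.0, 2.0], -3.0

p = predict_proba([1, 1, 1], weights, bias)
print(round(p, 3), "Yes" if p >= 0.5 else "No")
```

In practice the weights are learned from the labeled training data; the point here is only the shape of the decision rule.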
Solution – Finally, we developed a risk prediction ML model to identify the users who showed mental health impact during the period, and trained it so that it can be reused in future to predict at-risk users and the associated mental health risk.
- We built a model to predict which users, across the different social platforms, show mental health impact because of the COVID-19 situation that began at the end of 2019.
- The model's target variable flags users whose posts carry negative or very negative sentiment and contain both covid and mental health keywords.
- The model was trained and tested on the pandemic-period data we gathered from the different sources, and it identified the targeted users.
- The aim is for the model to be applied to future, post-pandemic data to see whether users are impacted. This will help social organizations and NGOs identify and predict which people are mentally impacted, so that better support and help can be provided.
What is feature importance?
Feature importance is a method for assigning scores to a predictive model's input features, indicating each feature's relative importance when making a prediction. It helps us eliminate unimportant features (variables) and improves the overall accuracy and performance of the classification model.
How did we use the feature importance for our model?
Reviewing the important features helped us build accurate models: covid, mental health, relevance, and negative sentiment were the top features. Later, when using Pycaret, we evaluated the model and obtained good accuracy on unseen data.
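For a linear model such as logistic regression, one common way to score feature importance is to rank features by the absolute magnitude of their coefficients. The coefficient values below are invented for illustration, not our model's actual output:

```python
# Hypothetical fitted coefficients for some of our candidate features
coefficients = {
    "covid": 1.8,
    "mental_health": 1.5,
    "negative": -1.2,
    "relevance": 0.9,
    "post_length": 0.05,
}

# Rank features by absolute coefficient magnitude (sign only shows direction)
importance = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in importance:
    print(f"{name:15s} {abs(coef):.2f}")
```

A low-scoring feature like `post_length` here would be a candidate for removal before retraining.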
Summary & Inferences from the Model
- Total 80K+ users from Twitter, Reddit & Omdena.
- Results – Out of the 80K+ users, 4,720 were flagged as Has_Mental_Impact.
- The accuracy of a model is the proportion of true results (true positives + true negatives) among all results.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
For our model we received an accuracy of 100%. Theoretically, this signifies that the model predicts efficiently on test/unseen data; the perfect score arises because the model made no false positive or false negative predictions.
In contrast to that theory, an accuracy score of 1 (100%) on limited data raised questions about the data we trained on, and about whether we had used enough of the right predictors, ones that would give an accuracy that is not perfect but at least somewhat realistic.
Perhaps the data we used for training and testing was good but not varied enough in its predictors. For example, what if we had considered predictors such as the age, sex, and occupation of the users, or other important columns worth exploring? These could have made the feature importance step more effective, changed the counts of true positives and true negatives, and ultimately affected the model's accuracy.
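The formula above makes it easy to see how a perfect score comes about. The 4,720 / 75,280 split below is illustrative, derived from the reported totals rather than the exact confusion matrix:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

# A perfect score arises whenever FP = FN = 0, regardless of class balance
print(accuracy(tp=4720, tn=75280, fp=0, fn=0))  # 1.0

# Even a handful of misclassifications pulls the score below 1
print(accuracy(tp=50, tn=40, fp=5, fn=5))       # 0.9
```

With 80K users and only ~6% in the positive class, even a model that misses many impacted users could still score very high, which is why a perfect score deserves scrutiny.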
Issue of Data Leakage
In our model, the target we are trying to predict somehow already appears in the training data and is then included in the test/unseen data as well. In data science terms, this is known as data leakage.
This usually results in unreliable prediction outcomes in real-world scenarios after the model is deployed.
Probable reasons for the data leakage in the model
- Including the target variable as a feature defeats the purpose of a prediction model, as the target would be no different from the rest of the features.
- Giveaway features – features that expose information about the target variable but will not be available once the model is deployed to a real-world scenario. Two cases of giveaway features, explained with our model:
- In our model, we predict mental impact from various features. Suppose some feature (other than the target variable) already indicates that a user is showing signs of mental impact; we should never include it in the training data. If that feature alone tells us a user clearly has a mental health impact, we may not need a predictive model at all.
- Another example is training on current pandemic data. Once the model is deployed, we have to assess the impact on users in, say, the post-pandemic period; in that case, including features that expose information about the post-pandemic period will cause leakage. So we should only use features that will still be available after the model is deployed on future data.
How could we avoid the data leakage issue here?
- In our model, we split a single data set into a training set and held out a percentage of it as test/unseen data. It would have been better to also include a validation set, which acts like a real-life scenario for testing and finally validating the model.
- During EDA, features should be checked for high correlation, as it can introduce biases into the model.
- After training, the model's weights should be inspected; very high weights can be a sign of leakage.
- Preprocessing steps used to explore or clean the data, such as removing outliers or estimating missing values, should be applied to the training data only, not the entire data set. If applied to the entire data set, data leakage occurs, since the test data would no longer be fresh, unseen data for the model to predict on.
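The last point, fitting preprocessing statistics on the training split only, can be sketched with a toy mean-imputation example (illustrative only, not the project's actual preprocessing):

```python
def fit_imputer(values):
    """Learn the fill value (mean of observed entries) from the given data."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def transform(values, fill):
    """Replace missing entries (None) with the learned fill value."""
    return [fill if v is None else v for v in values]

train = [1.0, 2.0, None, 3.0]
test = [None, 4.0]

# Correct: statistics come from the training split alone
fill = fit_imputer(train)
print(transform(train, fill), transform(test, fill))

# Leaky: fitting on train + test lets the test data influence the fill value
leaky_fill = fit_imputer(train + test)
```

The same principle applies to outlier thresholds, scalers, and vocabulary building: fit on the training split, then apply to the test split.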
Deployment and Future Enhancements
- In the future, we can use more data from both the pre-pandemic and post-pandemic periods and run the model to predict the target users.
- We can explore more features (for e.g., text in post or any other columns with more health keywords) for a better model.
- Use more expertise from health professionals to understand the constituting factors for mental health and explore the data more.
- We will deploy our model to production as a Flask application.
- We explored the opportunity for AutoML using Pycaret.
- We explored the model on merged data vs. individual data sources.
- Feature importance played a crucial role in determining the best predictors for model building.
- We learned how data leakage arises and how it can be avoided from the very first stages of preprocessing and exploratory data analysis.
Source Code: GitHub
https://github.com/OmdenaAI/omdena-singapore-covid-health [Risk Predictor]