In this article, we will be addressing one of the most significant modern society problems, which is work stress. A lot of people have talked and are still talking about this, but you know we like to tackle things differently here. So we will be using Machine Learning models for stress recognition to help us understand how stress actually works physically and mentally. This helps gaining more insights on the problem and how AI can help us to solve it.
Author: Muhammad Ezzat
According to the European Agency for Safety and Health at work 2017 annual report, work stress affects more than half of the workers and is the second most common reason for work-related health issues, and it is also the main reason for the high employee turnover rate (EU-OSHA, 2017). Work stress has become one of the most impactful issues our society is facing. On an individual level, a highly stressed person usually suffers a significant decrease in their quality of life. This indeed causes a terrible impact on a person’s mental & physical state.
Approach and Methods
In this article, we will be addressing the effect of work stress over people’s physical state using ML models to try to understand different patterns & correlations in the SWELL dataset that we will be talking about in detail later. Our approach will be based on simple Machine Learning models rather than complex models & networks, so that we can interpret its inner parameters & get a good insight into the relationship between work stress & physical health. Keep in mind that we will be working with R 4.0.1.
The Institute of Computing and Information at Radboud University conducted experiments on 25 subjects during typical office work via Biosensors. Each subject went through work stressors such as receiving unexpected emails, interruptions & pressure to complete their work on an intense schedule. The experiment went on for 3 hours where Heart rate variability (HRV) measures were recorded for each subject of the 25.
The SWELL Dataset which was collected from the experiment above consists of 36 variables (we will showcase most of them later in the article). With the 36th variable being the state the subject is currently going through at a certain point in time. This state can be one of three:
- No stress; The subject is working on his task without any interruptions or being aware of any time limit for 45 minutes.
- Time Pressure; The subject is working on his task without interruptions but with a known time limit of 30 minutes.
- Interruptions; The subject receives 8 unexpected emails in the middle of the time-limited task. Some of the emails requested the subject to do certain actions, while others were irrelevant.
Let’s start by EDA
Alright, let’s kick off our analysis with a big-picture exploratory data analysis.
library(tidyverse) library(ggthemes) library(MASS) library(reshape2) library(clusterSim) library(rpart) library(rpart.plot) library(mlbench) library(caret) library(corrplot) library(randomForest) set.seed(123) theme_set(theme_classic()) data <- read_csv(‘data/train.csv’)
First, we quickly import the libraries we will be working with for the entire article, set our random seed to make sure our results are reproducible & import the training data.
target_class_distribution <- ggplot(data, aes(x = condition))+ geom_bar()+ labs(x = ‘Conditon’, y = ‘Case Count’, title = ‘Subject condition proportions’) target_class_distribution
Here we are constructing the following bar chart that showcases the target variable distribution. We would like to know whether there is a class imbalance or not to avoid unnecessary data bias. If any imbalance was found we would consider resampling our dataset if the dataset size is relatively small.
As you will observe in the next plot there is a significant difference between `no stress` condition & the other two conditions. However, since the dataset size is 369289 records, we don’t really need to resample as the dataset size is sufficient to avoid any class bias.
Now, one main step is to take a look over our explanatory variables. Luckily all of them are continuous numerical variables, which will make it easy for us to pre-process for feature selection & model fitting.
molten_data <- melt(data) scale_grids <- ggplot(data = molten_data, aes(x = value)) + stat_density() + facet_wrap(~variable, scales = “free”) scale_grids
Here we melted our dataset, which means we collected all of the values in one column & in an additional column we specify the column from which this value belongs in the original dataset. This trick allows us to visualize all of our distributions in one large plot.
Here we can observe that our variables follow different distributions, which means that we are probably going to normalize/standardize our data variables so our model can calculate clear variable importance without being tricked by irregular variances. Also, we will drop the `datasetId` variable.
target_variable <- dplyr::select(data,condition) explanatory_variables <- data %>% dplyr::select(-c(condition,datasetId)) normalized_data<data.Normalization(data.frame(explanatory_variables),type=”n1")
Now after applying “n1” type normalization, which is standardization (calculating z-score). We will plot our variables’ variances. Just to make sure our normalization went the way we wanted.
variances <- sapply(data,var) names <- colnames(data) variance_df <- data.frame(names,variances) variance_plot <- ggplot(variance_df,aes(reorder(names,variances),variances))+geom_col()+coord_flip()+ labs(title = ‘Variables` Variances’, x = ‘Variable Name’, y = ‘Variance’)
That seems right, every variable now has a mean of 0 & variance of 1. The kind of normalization we did is called standardization/z-score. Have a look on its formula here:
Keep in mind that the variable `condition` is our target variable (categorical) so it has no variance. Next, we will fit a quick decision tree & plot its structure to identify key variables. We will do this process regularly so you can observe the change after each phase of data pre-processing.
data <- cbind(normalized_data,target_variable) tree <- rpart(condition ~ ., data = data) tree_plot <- rpart.plot(tree, box.palette=”RdBu”, shadow.col=”gray”, nn=TRUE)
The following plot is the structure of the initial tree model we just fit, keep in mind that this is not the final model. You can observe that `Median_RR` & `Mean_RR` are the first & second roots of the tree respectively. We can observe that RR related variables appear in 7 nodes, while other variables did not appear yet. Thus, a variable importance analysis is needed to identify variables worth keeping & discard the useless ones.
Note: If you don’t understand the context of the variables, don’t worry we’ll provide more information later on in the article.
34 explanatory variables to predict one target don’t seem quite right. We saw that the initial decision tree used only 11 variables to fit our data. We need to discard variables that don’t convey enough relevant information to our problems.
correlation_matrix <- cor(normalized_data) correlation_plot <- corrplot(correlation_matrix,method = ‘square’, type = “full”, order = “hclust”, tl.col = “black”, tl.srt = 90) correlation_plot
We will start by plotting our correlation matrix. This matrix is a 2-dimensional nxn; n = number of our dataset’s features. Each row/column will contain the Pearson correlation coefficient between feature[i] & the rest of the dataset features. This index ranges from -1 to 1; ± 1 is perfect correlation, while 0 is no correlation. Normally we would keep only one of two strongly correlated features, as both of them probably provide the same quality & quantity of information to the dataset.
We can observe that (besides the diagonal) there are some large coefficients of correlation between variables, thus we would like to know which variables we can remove to get a better dataset, as tracing this matrix by the eye seems tedious. We can just do what is next.
correlated_columns_to_remove <- findCorrelation(correlation_matrix,cutoff = 0.8) # 1 18 17 2 30 16 5 10 8 26 29 28 23 24 3 11 13 20 non_correlated_data <- cbind(dplyr::select(normalized_data, -correlated_columns_to_remove),target_variable)
After removing the columns with indexes identified by `findCorrelation`. We can fit our tree model again & inspect its structure.
phase2_tree <- rpart(condition ~ ., data = non_correlated_data) phase2_tree_plot <- rpart.plot(phase2_tree, box.palette=”RdBu”, shadow.col=”gray”, nn=TRUE)
Instead of exploring the tree structure itself, we will explore another kind of plot. The variable importance plots. Variable importance is a decision tree’s means of expressing the impact of each variable over its structure thus its decisions, it is a useful piece of information that reveals a lot about your data.
varimp <- varImp(phase2_tree) varimp$Variable <- row.names(varimp) varimp <- dplyr::select(varimp,c(2,1)) rownames(varimp) <- c() importance_plot <- ggplot(varimp,aes(x = reorder(Variable,Overall), y = Overall))+geom_col()+ labs(title = ‘Variable Importance Plot 1’, x = ‘Variable’, y = ‘Overall Importance’) unimportant_columns <- varimp[varimp$Overall <= 10000,1]
We also determined the unimportant variables; the variables that scored less than 10k on the important metric. After removing “KURT”, “sampen”, “SDRR_RMSSD_REL_RR” “SKEW_REL_RR”, “MEAN_REL_RR” we will view the plot & finally explain what our variables exactly mean in Heart Rate Variability context.
final_data <- dplyr::select(non_correlated_data,-unimportant_columns) final_data$condition <- as.factor(final_data$condition)
Here is an explanation for our final variables.
|RMSSD||Square root of the mean of the sum of the squares of the difference between adjacent RR intervals|
|HR||Heart Rate (beats per minute)|
|HF||F High (0.15Hz – 0.4Hz) frequency band of the HRV power spectrum|
|MEDIAN_REL_RR||RR Mean of all relative RR intervals|
|HF_LF||A ratio of Low Frequency to High Frequency.|
|SDRR_RMSSD||R Standard deviation of all relative RR intervals|
*The time between beats is measured in milliseconds (ms) and is called an “R-Rinterval”
Finally, it is time to train, test & tune our model (if needed). First, we will load the test data & adjust it by dropping columns that we no longer use then normalize the rest of the variables.
test_data <- read_csv(‘data/test.csv’) columns_to_keep <- names(final_data) test_data <- dplyr::select(test_data, columns_to_keep) normalized_test_data <- cbind(data.Normalization (data.frame(test_data[,-12]),type=”n1"),test_data$condition)
Then we will train a RandomForest Classifier, which is a bunch of decision trees put together by majority voting. We are going to evaluate it using different metrics of classification quality. Then we will have a look on it’s variable importance plot. We will train the model with only 1 fifth of the data to save training time.
batch_size <- as.integer(nrow(final_data)/5) rf_classifier <- randomForest(condition ~ ., data=final_data[1:batch_size,], importance=TRUE) rf_predictions <- predict(rf_classifier,normalized_test_data[,-12]) confusion_matrix <- confusionMatrix(as.factor(rf_predictions), as.factor(normalized_test_data$`test_data$condition`))
Our model did beyond great on testing with more than 99% accuracy & amazing precision, recall & specificity. This means that our variable importance plot is going to be accurate & reliable.
Here is the variable importance plot. This will help us identify the best predictors of the mental condition each subject is going through depending on its physical HRV attributes. It will give us insight into what physical descriptors are correlated with the mental state the most. Based on such supreme model performance, the delivered insights are expected to be logical, accurate & worth denoting.
varImpPlot(rf_classifier,lcolor = ‘red’,main = ‘Variable Importance’,pch = 16)
To make it more clear, mean decrease accuracy is the average loss in model accuracy when a certain variable is dropped (higher = more important). While mean decrease Gini is the average decrease in tree node impurity (higher = more important). The more important the variable, the more it impacts the model & gives it more information about the subject condition.
Work stress is a major issue. It affects your physical condition negatively. The best predictor of your frame of mind from our experiment was the number of heartbeats per minute. The real benefit of such a model is to identify the factors that a person’s mental state affects regarding the physical condition. Such insights should raise awareness about one of the biggest social hazards as we believe. We believe it’s the role of stakeholders to deal with it. Employers are encouraged to create better working environments. For all the stressed workers out there, you have your reasons to feel stressed, accept that feeling, take a breath, get your stuff together & go on even if in baby steps, don’t hesitate to ask for help if you need it.