In this article, Omdena’s team applies Causal Inference, a powerful modeling tool for explanatory analysis, together with Machine Learning to multivariate observational datasets, in order to predict the “path” of daily actions a person can introduce into their life to slow aging down.

**Author:** Shubhangi Ranjan

## Problem Statement

Age-related diseases are **killing 150,000 people per day**. Humanity is a health tech organization that is now able to monitor people’s rates of aging, but that information only has an impact if people know what actions they should take to slow their aging down. This is complex because the impactful actions differ not only from person to person, but also from moment to moment in a person’s life, and for every combination of actions the person takes. A good analogy is map directions: they must be tailored very specifically to where the person is at that moment, their current mode of transportation, and their intended destination. When these elements change, the directions must change too.

The now mainstream way to measure a person’s probability of disease in the near and far future is Biological Age: The basic idea behind biological aging is that aging occurs as you gradually accumulate damage and lose function in various tissues and systems in the body. Biological age can vary quite a bit depending on your lifestyle (diet, exercise, sleep, attitude, stress, etc.). Depending on your genetics and your lifestyle actions, your biological age will be higher or lower than your chronological one (the time since your birthday). People with a younger biological age compared to their chronological age are at a lower risk of suffering age-related diseases and mortality.

So in this challenge, Humanity and the Omdena team compressed high throughput markers such as activity and other lifestyle action data from the user (e.g. diet, weight, socio-economic status) to develop weighted algorithms predictive of the biological age outcome.

## What is Causal Inference?

Causal Inference (aka “how to find cause and effect in observational datasets”)

“Association alone does not imply causation”

Machine learning algorithms use standard statistical analysis, typified by regression, estimation, and hypothesis testing techniques, to assess parameters of a distribution from samples drawn from that distribution. With the help of such parameters, one can infer associations among variables, estimate probabilities of past and future events, and update those probabilities in light of new evidence or new measurements.

Correlation and causality can seem deceptively similar. While causation and correlation can exist at the same time, correlation does not imply causation. Causation explicitly applies to cases where action A causes outcome B. On the other hand, **correlation is simply a relationship.** Action A relates to outcome B, but one event doesn’t necessarily cause the other to happen. So, correlations can lead to wrong assumptions. And, importantly, to know how to change the future, we need to know causation.

For correlations, the notation is P(x|y), i.e. the probability of x given y: for example, the probability of a disease given that the person consumes alcohol. In causal calculus, a small but important change is made. **Instead of P(x|y) it’s P(x|do(y))**, i.e. the probability of x **given that y is done:** for example, the probability of a lower biological age given that I start doing high-intensity activity. **The ‘do’ is very important: it represents the intervention, the actual doing of something that will cause the effect.**
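To make the difference between P(x|y) and P(x|do(y)) concrete, here is a minimal simulation sketch. Everything in it is invented for illustration and is not taken from the project data: a hidden "health-conscious" trait makes a person both more likely to exercise and more likely to have a low biological age, while exercise itself is assumed to have only a modest effect.

```python
import random

def simulate(n, intervene=None, seed=0):
    """Return (exercises, low_ba) pairs from a toy causal model.

    intervene=None  -> observational data: exercise depends on the confounder
    intervene=True  -> do(exercise): everyone is made to exercise
    intervene=False -> do(no exercise): nobody exercises
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        health_conscious = rng.random() < 0.5            # hidden confounder
        if intervene is None:
            exercises = rng.random() < (0.8 if health_conscious else 0.2)
        else:
            exercises = intervene                        # the "do" operator
        # Assumed mechanism: the confounder adds 0.4, exercise adds 0.1
        prob = 0.2 + 0.4 * health_conscious + 0.1 * exercises
        rows.append((exercises, rng.random() < prob))
    return rows

def p_low_ba(rows, exercises):
    """Estimate P(low BA | exercise status) by counting."""
    selected = [low for ex, low in rows if ex == exercises]
    return sum(selected) / len(selected)

obs = simulate(100_000)
observational_gap = p_low_ba(obs, True) - p_low_ba(obs, False)
causal_gap = (p_low_ba(simulate(100_000, intervene=True, seed=1), True)
              - p_low_ba(simulate(100_000, intervene=False, seed=2), False))

print(f"P(low BA | exercise) - P(low BA | no exercise):         {observational_gap:.2f}")
print(f"P(low BA | do(exercise)) - P(low BA | do(no exercise)): {causal_gap:.2f}")
```

Under these assumed probabilities the observational gap comes out at roughly 0.34, while the interventional gap is roughly 0.10, the effect that was actually built into the model. The difference is the part of the correlation that is due to the confounder rather than to exercise.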

The exciting news is that causal inference is a powerful modeling tool for explanatory analysis, which has started to enable current machine learning to make explainable predictions. **Put simply, it allows us not only to predict the future but also to understand how to change it.**

### Conceptual Framework (Structural Causal Graphs)

Directed acyclic graphs (DAGs) are used to represent causal relationships. A DAG displays assumptions about the relationships between variables. The assumptions we make take the form of arrows (directed edges) going from one node to another. A directed path is also called a chain, because each node has a causal effect on the next.

Here, both the X variables (such as chronological age or weight) and the Treatment variables have a causal impact on biological age. For example, people with a lower chronological age or weight may tend to have a lower biological age, and consequently a lower chance of disease. Likewise, people who exercise more, do not consume alcohol, and have healthy sleeping patterns may (or may not) have a lower biological age. The graph is drawn with the educated assumption that certain lifestyle actions can cause a change in biological age. We are trying to capture the relationship between the treatment, i.e. lifestyle actions, and the outcome variable, i.e. biological age.
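A causal graph of this kind can be written down as a plain mapping from each node to its direct causes. The sketch below uses hypothetical node names loosely based on the variables mentioned in this article (they are illustrative assumptions, not the project's actual graph) and relies on the standard library's `graphlib` to confirm that the graph contains no cycles.

```python
from graphlib import TopologicalSorter, CycleError

# Each key is a node; its value is the set of its direct causes (parents).
parents = {
    "chronological_age": set(),
    "weight": set(),
    "exercise": {"chronological_age", "weight"},
    "sleep": {"chronological_age"},
    "biological_age": {"chronological_age", "weight", "exercise", "sleep"},
}

def causal_order(graph):
    """Return the nodes ordered so that every cause precedes its effects.

    Raises CycleError if the graph is not a valid DAG.
    """
    return list(TopologicalSorter(graph).static_order())

def ancestors(graph, node):
    """Return all direct and indirect causes of `node`."""
    found = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph[current])
    return found

order = causal_order(parents)
print(order)
print(ancestors(parents, "exercise"))
```

Adding an edge that points back from an effect to one of its causes, for example making `chronological_age` depend on `biological_age`, would make `causal_order` raise `CycleError`, which is a quick check that the drawn assumptions still form a DAG.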

## Simulating randomized controlled trials

Randomized experiments are considered by many to be the “gold standard” for causal inference because the treatment assignment is random and physically manipulated: one group gets the treatment, one does not. The assumptions here are straightforward, securable by design, and easy to defend. When there is no control over treatment assignment, as with observational data, we attempt to model it. Modeling here is equivalent to saying “we assume that after adjusting for age, gender, weight, maximum heart rate, alcohol consumption, socioeconomic factors, and ethnicity, runners and non-runners are so similar to each other that it is as if they were randomly assigned to running.”

Controlled experiments seem simple: we can act upon a variable directly and see how the other variables in our causal diagram change. In a medical trial, this would mean taking two groups of people, giving group 1 the placebo and group 2 the actual medicine, and observing the results. Naturally, in medical trials we want these people to come from the same distribution.

In order to simulate an RCT experiment environment, the first step is to cluster similar users based on constant factors that do not change instantly. Hierarchical clustering allows users to move from one cluster to another as the clusters become more mature.

    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as shc
    from sklearn.cluster import AgglomerativeClustering


    def get_cluster_groups(df):
        '''
        Function to assign a cluster group to every individual using Agglomerative clustering
        Input -> df <dataframe> -> dataframe with demographic details of individuals
        Output -> df <dataframe> -> dataframe with cluster groups column added to the demographic details
        '''
        # scale_data is a project helper that is not shown in this article.
        df_scaled = scale_data(df = df, except_cols = ['Unnamed: 0', 'person_id',
            'Minutes Asleep', 'Number of Steps', 'Sleep Efficiency',
            'Very Active Minutes', 'composition_score', 'deep_sleep_in_minutes',
            'duration_score', 'glasses_of_fluid', 'lightly_active_minutes',
            'meal_count', 'moderately_active_minutes', 'mood', 'readiness',
            'revitalization_score', 'total_calories', 'total_distance', 'BA',
            'fatigue_inv', 'stress_inv', 'restlessness_inv'])

        # Use the same feature matrix (everything except the identifier)
        # for both the dendrogram and the clustering.
        features = df_scaled[df_scaled.columns.difference(['person_id'])].values

        plt.figure(figsize=(20, 10))
        plt.title('Dendrogram')
        dendo = shc.dendrogram(
            shc.linkage(features, method='ward', metric='euclidean'),
            truncate_mode='lastp', p=40, show_contracted=True)

        # Ward linkage always uses Euclidean distance, so no distance
        # argument is needed.
        cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
        groups = cluster.fit_predict(features)

        # The original snippet stops here. Following the docstring, the labels
        # are attached to the dataframe; the column name is an assumption.
        df['cluster_group'] = groups
        return df
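The `scale_data` helper used above is not shown in this article. As a purely hypothetical stand-in, assuming it standardizes every numeric column except those listed in `except_cols`, it could look something like the sketch below. The demo dataframe and its values are made up.

```python
import pandas as pd

def scale_data(df, except_cols):
    """Hypothetical stand-in for the project's scale_data helper.

    Standardizes every numeric column not listed in except_cols to zero
    mean and unit variance; all other columns are returned unchanged.
    """
    scaled = df.copy()
    numeric_cols = scaled.select_dtypes(include="number").columns
    for col in numeric_cols:
        if col in except_cols:
            continue
        std = df[col].std(ddof=0)
        if std == 0:
            std = 1.0                      # constant column: avoid dividing by zero
        scaled[col] = (df[col] - df[col].mean()) / std
    return scaled

# Made-up demo data: two demographic columns and one lifestyle column.
demo = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "age": [25, 35, 45, 55],
    "weight": [60.0, 70.0, 80.0, 90.0],
    "Number of Steps": [4000, 8000, 12000, 6000],
})
scaled_demo = scale_data(demo, except_cols=["person_id", "Number of Steps"])
print(scaled_demo)
```

Scaling matters here because Ward linkage works on Euclidean distances, so a column measured in large units would otherwise dominate the clustering.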

In randomized experiments, treatment is assigned by the flip of a coin, but in observational studies treatment (e.g., a person exercising) may be determined by many factors (e.g., the person likes exercising). If those factors also affect the outcome (e.g., a lower biological age), then their effects become entangled with the effect of the treatment.

So, the next step to mimic controlled experimentation and to minimize the limitations of observational data would be to minimize selection bias. For example, in a particular cluster, a certain user tends to exercise more to stay healthy while another user tends to meditate more and sleep for 7-8 hours every day. For this purpose, we have used the propensity score matching technique.

PyMatch is a Python library that features matching techniques for observational studies and is a Python implementation of R’s Matching package. PyMatch supports propensity score matching for both discrete and continuous variables, which we used during our project.

More details about PyMatch can be found in the following GitHub repository: https://github.com/benmiroglio/pymatch

### Fitting an Initial Propensity Score Model

In this step, we fitted an initial propensity score model. We observed that both dummy groups have very similar propensity score distributions. In our case, this is expected, since our BA is randomly simulated (to keep real user data private) and is not affected by any of the activity or health variables. For a real-world dataset, we would expect a more visible separation between the two groups, indicating that the X-variables affect the y-variable.
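PyMatch handles this step internally. As a library-independent illustration of what a propensity score model computes, the sketch below fits a scikit-learn logistic regression that predicts group membership from covariates. The data is synthetic, and the covariate names (age, active minutes) and coefficients are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic, already standardized covariates; names are illustrative only.
age = rng.normal(size=n)
active_minutes = rng.normal(size=n)
X = np.column_stack([age, active_minutes])

# Group membership depends on the covariates, so assignment is not random.
true_logit = 1.0 * active_minutes - 0.5 * age
group = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

# The propensity score is the predicted probability of belonging to
# group 1 given the covariates.
model = LogisticRegression(max_iter=1000).fit(X, group)
propensity = model.predict_proba(X)[:, 1]

print("mean score, group 1:", round(float(propensity[group == 1].mean()), 3))
print("mean score, group 0:", round(float(propensity[group == 0].mean()), 3))
```

When the covariates carry information about group membership, as in this synthetic data, the two groups have visibly different score distributions. When they carry none, as with a randomly simulated outcome, the distributions overlap almost completely.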

### Matching

Matching attempts to reduce the treatment assignment bias and mimic randomization by creating a sample of units that received the treatment that is comparable on all observed covariates to a sample of units that did not receive the treatment. In our case, matching pairs records in the decreased-BA group with records in the increased-BA group that have similar covariates, so that the remaining differences between the groups are not driven by those covariates. We use the function model.match() to perform the matching. This is done with replacement, meaning a single majority record (in our case, y=1) can be matched to multiple minority records (y=0). Matcher assigns a unique record_id to each record in the test and control groups so that repeated matches can be accounted for after matching. At the end of this step, the scores are printed out, and we can observe that each pair of matches has scores within 0.0001 of each other.
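The idea of matching with replacement can be sketched without PyMatch. The function below is a simplified illustration, not PyMatch's implementation: for each minority record it picks the majority record with the closest propensity score and keeps the pair only if the two scores differ by at most a threshold. The record ids and scores are made up.

```python
def match_with_replacement(minority, majority, threshold=0.0001):
    """Pair each minority record with its nearest majority record by score.

    minority, majority: dicts mapping record_id -> propensity score.
    Returns a list of (minority_id, majority_id) pairs. A majority record
    may appear in several pairs, which is what "with replacement" means.
    """
    pairs = []
    for min_id, min_score in minority.items():
        # Nearest majority record in terms of propensity score
        best_id = min(majority, key=lambda rid: abs(majority[rid] - min_score))
        if abs(majority[best_id] - min_score) <= threshold:
            pairs.append((min_id, best_id))
    return pairs

minority_scores = {"m1": 0.41230, "m2": 0.41235, "m3": 0.90000}
majority_scores = {"M1": 0.41232, "M2": 0.55000, "M3": 0.10000}

pairs = match_with_replacement(minority_scores, majority_scores)
print(pairs)  # [('m1', 'M1'), ('m2', 'M1')]
```

Here `M1` is reused for two minority records, and `m3` stays unmatched because no majority record lies within the threshold. This is why a unique record id is useful afterwards: it lets you see how often each majority record was reused.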

### Assessing Matches

Now, we assess the matches by plotting the histograms and ECDF plots of each X-variable. We can observe that, for most variables, matching has made the corresponding distributions more similar across the two dummy groups. This indicates that the matching algorithm worked as intended.
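Alongside the plots, balance can also be summarized numerically. The sketch below is an illustrative addition rather than part of the original analysis: it computes the standardized mean difference and the largest gap between the two empirical CDFs for a single variable, before and after matching. The step counts are invented.

```python
from math import sqrt
from statistics import mean, pvariance

def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((pvariance(a) + pvariance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd

def max_ecdf_gap(a, b):
    """Largest vertical distance between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

# Hypothetical daily step counts (in thousands) for the two groups.
before_group1 = [9, 10, 11, 12, 13, 14]
before_group0 = [3, 4, 5, 6, 7, 8]
after_group1 = [9, 10, 11, 12]
after_group0 = [9, 10, 12, 12]

for label, g1, g0 in [("before", before_group1, before_group0),
                      ("after", after_group1, after_group0)]:
    print(label,
          round(standardized_mean_difference(g1, g0), 2),
          round(max_ecdf_gap(g1, g0), 2))
```

Both numbers shrink toward zero when matching has made the two distributions more alike, which is the same thing the histograms and ECDF plots show visually.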

## Machine Learning for Recommendation

With the matched dataset (Matched_df), we can now train an ML model and infer causality. We used a logistic regression model and performed hyperparameter tuning on its regularization parameter C.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    # X_train and y_train are taken from Matched_df; Cols is the list of
    # feature names, in the same order as the columns of X_train.
    Log_model = LogisticRegression(max_iter = 1000)

    # C must be strictly positive, so 0 is not a valid grid value.
    grid_values = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l2']}
    cross_validation = StratifiedKFold(n_splits = 5)
    grid_log_model = GridSearchCV(Log_model, param_grid = grid_values,
                                  scoring = 'accuracy', cv = cross_validation)
    grid_log_model.fit(X_train, y_train)
    print(grid_log_model.best_params_)
    print(grid_log_model.best_score_)

    # GridSearchCV fits clones of Log_model, so the coefficients must be read
    # from the refitted best estimator rather than from Log_model itself.
    importance = grid_log_model.best_estimator_.coef_
    Feature_series = pd.Series(importance[0], index = Cols)

    # Rank features by the absolute value of their coefficient.
    order = Feature_series.abs().sort_values(ascending = False)
    print("Top 3 Features: ")
    for i in range(3):
        print('{}: '.format(i+1), order.index[i])

## Conclusion

We have built a system that takes in the user attributes and the lifestyle actions being monitored (activity rates, sleep, meditation, diet, etc.) and uses the ongoing increases or decreases in the user’s Biological Age measure to decide which actions were most effective, in what combinations, and when. The system also matches across users with similar attributes, so that the insights and weightings learned for one user inform the weightings given to actions and combinations of actions for another user. Causal inference has helped us identify which new actions or interventions to introduce into a person’s daily routine, because they have a large effect on the person’s rate of aging and biological age. Together, machine learning and causal inference allow the personalized and combinatorial nature of real life to be modeled.