Construction of disability risk prediction model for the elderly based on machine learning
Study design and participants
The data for this study came from the China Health and Retirement Longitudinal Study (CHARLS), a nationally representative longitudinal survey of people aged 45 years and over in China24. The national baseline survey began in 2011 and covered 150 counties and 450 village/neighborhood committees in 28 provinces of China. Participants were followed up in 2013, 2015, 2018, and 2020 using face-to-face computer-assisted personal interviewing, which gathered extensive information on socioeconomic status and health, together with anthropometric and laboratory measurements. The study received ethics approval from the Biomedical Ethics Review Committee of Peking University, Beijing (IRB00001052-11015), and all participants provided informed consent. All methods were carried out in accordance with the Declaration of Helsinki and the relevant guidelines and regulations.
This study used the 2018 wave as the baseline and the 2020 wave as the follow-up. Of the 19,816 participants in the baseline survey, we excluded those aged < 65 years (n = 12,133), those with disability at baseline (n = 2,298), and those with missing or abnormal key predictors (n = 2,213), leaving 3,172 participants for analysis.
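The exclusion counts above can be checked with a few lines of Python. This is only an arithmetic sketch; it assumes the three exclusions were applied sequentially and that no participant is counted in more than one exclusion group, which the text does not state explicitly.

```python
# Counts reported in the text for the 2018 baseline wave.
baseline_total = 19816
exclusions = {
    "aged < 65 years": 12133,
    "disability at baseline": 2298,
    "missing or abnormal key predictors": 2213,
}

# Assumption: exclusions are applied one after another and do not overlap.
remaining = baseline_total
for reason, n_excluded in exclusions.items():
    remaining -= n_excluded
    print(f"after excluding {reason}: {remaining}")

analytic_sample = remaining
```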
Disability assessment
The internationally accepted Katz scale was used to evaluate the activities of daily living (ADL) of older adults25. ADL refers to a person’s ability to independently complete necessary activities in daily life, reflecting the most basic ability of self-care26. The scale included six items: dressing, bathing, eating, getting in and out of bed, using the bathroom, and controlling urination and defecation. Each item had four response options: (1) no difficulty; (2) difficult but still possible to carry out; (3) difficult and needing help; and (4) unable to carry out. A participant who selected option (2), (3), or (4) for any one of the six items was judged to have disability14. Cronbach’s α coefficient of the scale in this study was 0.867.
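The disability rule described above can be sketched as a small function. The item keys below are hypothetical labels chosen for illustration and are not CHARLS variable names; responses are assumed to be coded 1 to 4 in the order of the options listed in the text.

```python
# Hypothetical item labels (not CHARLS variable names).
KATZ_ITEMS = (
    "dressing", "bathing", "eating",
    "bed_transfer", "toileting", "continence",
)


def has_disability(responses):
    """Return True if any Katz item is answered with option 2, 3, or 4.

    `responses` maps each item label to an option code from 1 to 4.
    """
    for item in KATZ_ITEMS:
        if item not in responses:
            raise ValueError(f"missing response for item: {item}")
        if responses[item] not in (1, 2, 3, 4):
            raise ValueError(f"invalid option for {item}: {responses[item]}")
    # Option 1 means "no difficulty"; anything above it counts as difficulty.
    return any(responses[item] >= 2 for item in KATZ_ITEMS)


independent = {item: 1 for item in KATZ_ITEMS}
needs_help_bathing = dict(independent, bathing=3)
print(has_disability(independent), has_disability(needs_help_bathing))
```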
Data collection and participant variables
According to the relevant literature in China and abroad27,28,29,30, the predictors in this study included sociodemographic factors, chronic diseases, health status indicators, and lifestyle factors.
Among sociodemographic factors, age was divided into three groups: 65–74 years, 75–84 years, and 85 years or above. Marital status comprised married and other (separated, divorced, widowed, or never married). Education level was categorized as below primary school, primary school, secondary school, or high school and above. The area of residence was divided into rural and urban areas. Among chronic diseases, hypertension, lung disease, arthritis, heart disease, stroke, and diabetes were each recorded as present or absent. Among health status indicators, sleep time was divided into more than 9 h, 6–9 h, and less than 6 h. Self-rated health status was divided into very good, good, average, poor, and very poor. Depression was categorized as depression or no depression. Among lifestyle factors, drinking was divided into non-drinking and drinking, smoking into non-smoking and smoking, and exercise into exercising or not exercising every week.
In this study, the 10-item Center for Epidemiologic Studies Depression Scale (CES-D-10) was used to evaluate depression status among participants. The scale comprises 10 items, with a maximum total score of 30. Participants with scores ≥ 10 were classified as having depression, and those with scores < 10 were classified as not having depression31.
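A minimal sketch of the cutoff rule follows. It assumes each of the 10 items has already been scored from 0 to 3 (consistent with the maximum of 30) and that any reverse coding of positively worded items was done beforehand; the function name is hypothetical.

```python
def cesd10_depression(item_scores, cutoff=10):
    """Return (total score, depression flag) for one participant.

    Assumption: `item_scores` holds 10 integers, each already scored 0-3,
    with any reverse coding applied before this function is called.
    """
    if len(item_scores) != 10:
        raise ValueError("CES-D-10 requires exactly 10 item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item score must be 0, 1, 2, or 3")
    total = sum(item_scores)
    return total, total >= cutoff


print(cesd10_depression([1] * 10))        # total exactly at the cutoff
print(cesd10_depression([0] * 9 + [3]))   # total below the cutoff
```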
Derivation and evaluation of prediction models
In this study, the dataset was partitioned using nested cross-validation, in which the hyperparameter search is kept separate from the estimation of the model's generalization error. Specifically, nested cross-validation comprises two loops, an outer loop and an inner loop. The inner loop uses grid search to obtain the optimal hyperparameters of each model. The outer loop passes 80% of the data to the inner loop as the training set and retains the remaining 20% as the test set for evaluating the model tuned in the inner loop. Because the test data never take part in hyperparameter selection, information leakage is prevented and the bias in the estimated model performance is kept relatively low.
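The nested procedure described above might look roughly like the following scikit-learn sketch. It is not the study's code: the data are synthetic, the hyperparameter grid and the number of inner folds are hypothetical, and five outer folds are assumed only because they reproduce the 80%/20% split mentioned in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the study data: 18 predictors, binary outcome.
X, y = make_classification(
    n_samples=300, n_features=18, n_informative=6, random_state=0
)

# Inner loop: grid search over a hypothetical grid of regularization strengths.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: 5 folds, so each split trains on 80% and tests on 20%.
# The grid search is refit inside every outer training fold, so the
# outer test fold never influences hyperparameter selection.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)

print("outer-fold AUC:", [round(float(s), 3) for s in outer_scores])
print(f"mean AUC: {outer_scores.mean():.3f}")
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: tuning happens again from scratch within each outer training set.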
In this study, we conducted an in-depth review of disability research in the fields of medicine and nursing and pre-screened 18 characteristic variables that are highly related to disability. We then used a least absolute shrinkage and selection operator (Lasso) model for comparison: 10 additional important characteristic variables were selected, and the resulting model was compared with the model containing all variables to verify its performance. Given the number of feature variables, their correlation, the size of the training dataset, and the availability of sufficient computational resources, feature selection was not necessary. Therefore, we did not make feature engineering the main focus in constructing the ML models, and we found no serious over-fitting in the models.
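As an illustration of Lasso-style selection for a binary outcome, the sketch below fits an L1-penalized logistic regression and keeps the predictors with non-zero coefficients. The synthetic data, the penalty strength `C=0.1`, and the choice of L1 logistic regression as the Lasso variant are all assumptions; the text does not specify how the Lasso model was configured.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 18 candidate predictors, binary outcome.
X, y = make_classification(
    n_samples=500, n_features=18, n_informative=6, n_redundant=2, random_state=0
)
# Standardize so the L1 penalty treats all predictors on the same scale.
X = StandardScaler().fit_transform(X)

# L1 penalty shrinks some coefficients exactly to zero.
# C is a hypothetical value; smaller C means stronger shrinkage.
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X, y)

coef = lasso_lr.coef_.ravel()
selected = np.flatnonzero(coef != 0)
print("selected predictor indices:", selected.tolist())
print("number selected:", len(selected), "of", coef.size)
```

In practice the penalty strength would itself be tuned, for example inside the inner loop of the nested cross-validation, rather than fixed in advance.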
In this study, nine metrics were used to evaluate the performance of the prediction models: accuracy, precision, recall, F1 score, Hamming loss, Jaccard score, Cohen’s kappa score, the confusion matrix, and the ROC curve with its area under the curve (AUC). In addition, decision curve analysis (DCA) was used to show the net benefit of each model across different threshold probabilities. In a decision curve, the "all positive" and "all negative" reference lines represent the two extreme cases of net benefit, in which all samples are treated as positive or as negative, respectively. The flowchart for model derivation and verification is shown in Fig. 1.
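Most of these metrics follow directly from the four cells of the binary confusion matrix, and the quantity plotted in a decision curve is the net benefit at a given threshold probability. The following is a self-contained sketch using the standard definitions; the toy labels are made up for illustration, and the function names are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix-based metrics for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    hamming_loss = (fp + fn) / n          # fraction of wrong labels
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy, "precision": precision, "recall": recall,
        "f1": f1, "hamming_loss": hamming_loss,
        "jaccard": jaccard, "kappa": kappa,
    }


def net_benefit(y_true, y_prob, threshold):
    """Net benefit at one threshold probability (0 < threshold < 1)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy example (made-up labels, not study data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)

# Reference lines of a decision curve at threshold 0.2:
all_positive = net_benefit(y_true, [1.0] * len(y_true), 0.2)
all_negative = net_benefit(y_true, [0.0] * len(y_true), 0.2)
print(all_positive, all_negative)
```

The ROC curve and AUC are omitted here because they require ranking the predicted probabilities across all thresholds rather than a single confusion matrix.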

Fig. 1. Flowchart of model derivation and validation.
ML interpretation
SHAP (SHapley Additive exPlanations) is a tool for interpreting and visualizing ML models. It combines the importance of each predictor with the direction of its effect to obtain a relative risk score for each predictor, and it estimates the contribution of each predictor to the final prediction, that is, to the predicted probability of a clinical event given the clinical data. Because SHAP quantifies the contribution of each related factor, visualizing the prediction model with SHAP makes its predictions easier to interpret, which is important for improving the accuracy of clinical diagnosis and treatment32.
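The idea of attributing a prediction to individual predictors can be illustrated by computing exact Shapley values for a tiny model. This sketch is a simplification and does not use the SHAP library: absent features are replaced by a single baseline vector rather than averaged over background data, and the toy risk function and its coefficients are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance, by enumerating all subsets.

    Features outside a subset take their `baseline` value (a simplification).
    Cost grows as 2**n, so this is only feasible for a handful of features.
    """
    n = len(x)
    features = range(n)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in features]
        return predict(z)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                # Marginal contribution of feature i when added to subset s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical risk score with one interaction term (z[0] * z[2])."""
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] + 0.1 * z[0] * z[2]


x = [1, 1, 1]         # instance to explain
baseline = [0, 0, 0]  # reference instance
phi = shapley_values(toy_risk, x, baseline)
print([round(v, 4) for v in phi])
print(round(sum(phi), 4), round(toy_risk(x) - toy_risk(baseline), 4))
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that SHAP plots rely on; the interaction term is split equally between the two features involved.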
Statistical analysis
Continuous variables are presented as the mean ± standard deviation, and categorical variables are presented as frequency and percentage. These descriptive analyses were conducted using IBM SPSS 27.0. Feature selection, model derivation, and model evaluation were performed with Python 3.7.6. A two-sided p-value of < 0.05 was considered statistically significant. Five machine learning techniques were chosen to construct models: logistic regression (LR), deep neural network (DNN), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost).
