PlaylistPro Churn Prediction: Predictive Analysis
A Comprehensive Analysis of Customer Churn Using Multiple Machine Learning Approaches
1 Introduction
This section initiates the predictive analysis phase, extending insights from the earlier descriptive exploration. The task at hand is a binary classification problem, where the response variable churned indicates whether a customer has discontinued their subscription (1) or remained active (0). Accurately predicting churn is essential for PlaylistPro, as it enables the company to identify high-risk users early, design targeted retention strategies, and reduce revenue loss associated with customer attrition.
To achieve this, a mix of parametric and non-parametric supervised learning models was developed and compared. The parametric models (Logistic Regression and Stepwise Logistic Regression) were chosen for their transparency and interpretability, particularly suited for relationships that exhibit linear or near-linear patterns. In contrast, the non-parametric, tree-based models (Random Forest and Extreme Gradient Boosting (XGBoost)) offer flexibility in capturing complex, non-linear interactions among predictors. Together, these models provide both explanatory insights and predictive strength, allowing PlaylistPro to balance business interpretability with data-driven precision.
2 Data Transformations
Following the initial exploratory data analysis, customer_id was excluded from the predictor set because a unique identifier contributes no predictive value. All other predictors were retained for model development. Categorical variables, including the response variable (churned), were transformed into dummy variables to ensure model compatibility.
A data integrity check was conducted across all numeric predictors to ensure the absence of invalid or inconsistent values prior to data splitting and model fitting. One correction was applied to the variable days_since_sign_up, which contained negative values in the raw dataset. These were converted to positive values to reflect accurate account age and maintain logical consistency for subsequent modeling stages.
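A minimal sketch of the correction, assuming the prepared data frame is named `playlist` (a hypothetical name):

```r
# Negative account ages are treated as sign-entry artifacts and flipped
playlist$days_since_sign_up <- abs(playlist$days_since_sign_up)

# Sanity check: no negative values should remain
stopifnot(all(playlist$days_since_sign_up >= 0))
```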
3 Feature Engineering
All predictors except customer_id were retained for modeling. During descriptive analysis, age appeared to have a non-linear relationship with churn: both younger (20-30 years) and older (60-80 years) users showed higher churn rates, while middle-aged users (30-60 years) tended to remain subscribed. This observation prompted a hypothesis that age follows a U-shaped relationship with churn.
To validate this, customers were grouped into five-year age bins, and the average churn rate per group was plotted. Two separate models were then tested (see the sketch after this list).
1. Linear model: churn ~ age
2. Quadratic model: churn ~ age + age²
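A sketch of this validation, reusing the `playlist` data frame from the earlier sketch and assuming churned is coded 0/1 (the bin boundaries are also assumptions):

```r
# Average churn rate per five-year age bin
playlist$age_bin <- cut(playlist$age, breaks = seq(15, 85, by = 5))
bin_rates <- aggregate(as.numeric(as.character(churned)) ~ age_bin,
                       data = playlist, FUN = mean)

# Competing specifications
fit_linear <- glm(churned ~ age, family = binomial, data = playlist)
fit_quad   <- glm(churned ~ age + I(age^2), family = binomial, data = playlist)

# Likelihood ratio test and information criteria for the comparison
anova(fit_linear, fit_quad, test = "Chisq")
AIC(fit_linear, fit_quad)  # lower is better
```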
The quadratic model improved R² from 0.0024 to 0.028, and the F-test confirmed the significance of the quadratic term. Graphically, the linear model underestimated churn at the extremes, while the quadratic curve captured the curvature more accurately, reflecting that churn increases again amongst the oldest users.
Including both age and age² allowed the model to better explain churn dynamics across demographic segments. This refinement highlights two distinct high-risk groups - younger users who may require better onboarding and engagement, and older users who might need simplified interfaces or personalized support.
The improvement was further supported by the Likelihood Ratio Test, which confirmed that the quadratic model fit the data significantly better than the linear model. Information criteria also favored the quadratic specification, with AIC decreasing from 172,904 to 169,648 and BIC from 172,924 to 169,677. These reductions indicate that the added complexity of including the quadratic term is justified by a substantial gain in model fit.
Additional hypothesis testing was conducted for other predictors, including weekly_listening_hours and interaction terms between subscription type and usage metrics. However, none of these variables met the usefulness standard, as their quadratic or interaction effects were statistically insignificant and did not improve model performance. Therefore, only the quadratic term for age was retained in the final model.
After running initial classification models, the categorical predictor location was also excluded from the final model: it created 19 dummy variables, hurting model interpretability, and its coefficients were statistically insignificant, with p-values well above 0.05.
4 Multicollinearity Check
The earlier descriptive analysis revealed minimal correlation among predictors, suggesting low risk of multicollinearity. To confirm this, a Variance Inflation Factor (VIF) analysis was conducted to assess potential linear dependencies among the predictors.
| Term | Df | VIF (Adj.) |
|---|---|---|
| weekly_hours | 1 | 1.12 |
| customer_service_inquiries | 2 | 1.08 |
| subscription_type | 3 | 1.05 |
| num_subscription_pauses | 1 | 1.05 |
| song_skip_rate | 1 | 1.04 |
| age | 1 | 1.00 |
| payment_plan | 1 | 1.00 |
| payment_method | 3 | 1.00 |
| signup_date | 1 | 1.00 |
| average_session_length | 1 | 1.00 |
| weekly_songs_played | 1 | 1.00 |
| weekly_unique_songs | 1 | 1.00 |
| num_favorite_artists | 1 | 1.00 |
| num_platform_friends | 1 | 1.00 |
| num_playlists_created | 1 | 1.00 |
| num_shared_playlists | 1 | 1.00 |
| notifications_clicked | 1 | 1.00 |
The VIF scores show no sign of problematic multicollinearity: even the highest value (around 1.12) is far below 5, the usual threshold for concern.
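A sketch of the check using the `car` package (an assumed choice), fit on the full prepared data; for models containing multi-level factors, `car::vif()` returns the generalized VIF, and the GVIF^(1/(2*Df)) column corresponds to the adjusted values above:

```r
library(car)

# Full logistic fit on all retained predictors (customer_id already dropped)
vif_fit <- glm(churned ~ ., family = binomial, data = playlist)
vif(vif_fit)  # GVIF, Df, and GVIF^(1/(2*Df)) per term
```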
5 Data Splitting
An 80-20 split was adopted for this predictive analysis, with 80% of the data used for model training and the remaining 20% reserved for validation. Although 5-fold and 10-fold cross-validation procedures were tested, the resulting predictive accuracy and evaluation metrics were consistent across methods. Cross-validation, while robust, is computationally expensive because it repeatedly trains and tests the model across multiple data folds; fitting the algorithm several times significantly increases processing time and resource usage compared to a single 80-20 split.
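A minimal sketch of the split; the `train_split` name matches the model calls shown later, while `valid_split` and the seed are assumptions:

```r
set.seed(42)  # assumed seed for reproducibility

# 80% training / 20% validation split
n           <- nrow(playlist)
train_idx   <- sample(seq_len(n), size = floor(0.8 * n))
train_split <- playlist[train_idx, ]
valid_split <- playlist[-train_idx, ]
```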
6 Model Development & Statistical Findings
Four supervised learning models were developed to predict customer churn for PlaylistPro: Logistic Regression, Stepwise Logistic Regression, Random Forest, and XGBoost. Each model represents a distinct analytical approach, offering a balance between interpretability, parsimony, and predictive strength. Notably, the quadratic term age² was included only in the logistic models, as tree-based models like Random Forest and XGBoost inherently capture non-linearities.
6.1 Model 1: Logistic Regression (Baseline)
The baseline Logistic Regression model was implemented first to establish a benchmark for performance. It included all available predictors along with the quadratic term age², introduced to capture the observed non-linear (U-shaped) relationship between age and churn identified earlier. Logistic regression was chosen for its interpretability and ability to quantify the direction and magnitude of predictor effects.
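A sketch of the fitting call, assuming the object name `logit_full` (hypothetical); the table below corresponds to the coefficient block of `summary()` output from a fit of this form:

```r
# Baseline: all predictors plus the quadratic age term
logit_full <- glm(churned ~ . + I(age^2), family = binomial,
                  data = train_split)
summary(logit_full)$coefficients
```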
| Term | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 1.3543 | 0.0903 | 14.9912 | 0.0000 |
| age | -0.2081 | 0.0033 | -63.9323 | 0.0000 |
| subscription_typeFree | 3.4845 | 0.0307 | 113.3730 | 0.0000 |
| subscription_typePremium | -0.0132 | 0.0250 | -0.5286 | 0.5971 |
| subscription_typeStudent | 1.7190 | 0.0258 | 66.6124 | 0.0000 |
| payment_planYearly | -0.0088 | 0.0180 | -0.4895 | 0.6245 |
| num_subscription_pauses | 0.5183 | 0.0068 | 75.8675 | 0.0000 |
| payment_methodCredit Card | -0.0868 | 0.0256 | -3.3943 | 0.0007 |
| payment_methodDebit Card | -0.0417 | 0.0255 | -1.6384 | 0.1013 |
| payment_methodPaypal | -0.0371 | 0.0255 | -1.4545 | 0.1458 |
| customer_service_inquiriesMedium | 1.7087 | 0.0231 | 73.8853 | 0.0000 |
| customer_service_inquiriesHigh | 3.4704 | 0.0274 | 126.7744 | 0.0000 |
| signup_date | 0.0000 | 0.0000 | -0.0522 | 0.9583 |
| weekly_hours | -0.0821 | 0.0007 | -111.1676 | 0.0000 |
| average_session_length | -0.0005 | 0.0003 | -1.7835 | 0.0745 |
| song_skip_rate | 2.2507 | 0.0330 | 68.1648 | 0.0000 |
| weekly_songs_played | 0.0000 | 0.0001 | -0.2220 | 0.8243 |
| weekly_unique_songs | 0.0008 | 0.0001 | 7.2984 | 0.0000 |
| num_favorite_artists | -0.0013 | 0.0006 | -2.0493 | 0.0404 |
| num_platform_friends | -0.0002 | 0.0002 | -1.1493 | 0.2504 |
| num_playlists_created | 0.0000 | 0.0003 | 0.1247 | 0.9007 |
| num_shared_playlists | 0.0007 | 0.0006 | 1.1552 | 0.2480 |
| notifications_clicked | -0.0125 | 0.0006 | -19.8476 | 0.0000 |
| I(age^2) | 0.0023 | 0.0000 | 67.7232 | 0.0000 |
The baseline logistic regression model reveals clear behavioral and demographic patterns influencing customer churn. Both age and age² are statistically significant, confirming a U-shaped relationship. Strong positive coefficients for free or student subscriptions, high customer service inquiries, and song skip rate indicate that disengaged or dissatisfied users drive most churn. Conversely, higher weekly listening hours significantly reduce churn probability, underscoring that consistent engagement is the strongest retention signal.
Predictors such as payment_method, signup_date, payment_plan, and average_session_length did not show statistical significance and contributed little additional explanatory power. The subsequent stepwise selection and ensemble models will automatically de-emphasize such weak predictors through internal feature selection or remove them progressively.
6.2 Model 2: Logistic Regression w/ Stepwise Subset Selection
The Stepwise Logistic Regression model refines the predictor set using the Akaike Information Criterion (AIC). Before settling on a final specification, stepwise subset selection was applied in forward, backward, and bidirectional modes to identify a leaner model that balances predictive accuracy and interpretability. Among these, backward elimination produced the most efficient model, with the lowest AIC and validation accuracy comparable to the full model. This approach systematically removed redundant or statistically insignificant variables, and the resulting subset model improved efficiency and interpretability, aligning with the goal of deriving actionable insights without unnecessary model complexity.
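A sketch of the selection step, reusing the baseline fit `logit_full` from the earlier sketch; `stats::step()` with `direction = "backward"` performs the AIC-based elimination traced in Appendix 10.3:

```r
# Backward elimination on AIC (k = 2, the default penalty)
logit_step <- step(logit_full, direction = "backward", trace = TRUE)
formula(logit_step)  # inspect the retained predictor set
```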
**Final Model Formula:**
churned ~ age + subscription_type + num_subscription_pauses +
payment_method + customer_service_inquiries + weekly_hours +
average_session_length + song_skip_rate + weekly_unique_songs +
num_favorite_artists + notifications_clicked + I(age^2)
| Term | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 1.3471 | 0.0830 | 16.2278 | 0.0000 |
| age | -0.2081 | 0.0033 | -63.9346 | 0.0000 |
| subscription_typeFree | 3.4845 | 0.0307 | 113.3819 | 0.0000 |
| subscription_typePremium | -0.0135 | 0.0250 | -0.5407 | 0.5887 |
| subscription_typeStudent | 1.7189 | 0.0258 | 66.6112 | 0.0000 |
| num_subscription_pauses | 0.5183 | 0.0068 | 75.8785 | 0.0000 |
| payment_methodCredit Card | -0.0868 | 0.0256 | -3.3948 | 0.0007 |
| payment_methodDebit Card | -0.0420 | 0.0255 | -1.6508 | 0.0988 |
| payment_methodPaypal | -0.0371 | 0.0255 | -1.4561 | 0.1454 |
| customer_service_inquiriesMedium | 1.7087 | 0.0231 | 73.8934 | 0.0000 |
| customer_service_inquiriesHigh | 3.4704 | 0.0274 | 126.7785 | 0.0000 |
| weekly_hours | -0.0821 | 0.0007 | -111.1695 | 0.0000 |
| average_session_length | -0.0005 | 0.0003 | -1.7812 | 0.0749 |
| song_skip_rate | 2.2507 | 0.0330 | 68.1677 | 0.0000 |
| weekly_unique_songs | 0.0008 | 0.0001 | 7.2998 | 0.0000 |
| num_favorite_artists | -0.0013 | 0.0006 | -2.0549 | 0.0399 |
| notifications_clicked | -0.0125 | 0.0006 | -19.8470 | 0.0000 |
| I(age^2) | 0.0023 | 0.0000 | 67.7247 | 0.0000 |
The stepwise logistic regression model refines the baseline by removing statistically redundant variables such as signup_date, num_playlists_created, payment_plan, and other low-impact engagement features. The process stabilized at an AIC of approximately 77,145, retaining only the most meaningful predictors, including age, age², subscription_type, weekly_hours, song_skip_rate, and customer_service_inquiries. The retained predictors highlight a consistent behavioral pattern: users on free or student subscriptions, those with frequent customer service inquiries, and those with a high song skip rate remain the strongest churn indicators. Engagement variables such as weekly listening hours and notifications clicked continue to show negative relationships with churn, reinforcing their protective effect. Both age and age² are statistically significant in this model as well.
6.3 Model 3: Random Forest
The Random Forest model applies a bagging (bootstrap aggregation) framework that combines the outputs of multiple decision trees. In this case, 100 trees were trained as part of the model design to capture complex, non-linear relationships in customer behavior. The ensemble strategy improves stability and accuracy by averaging results from many independently trained trees, each built on a random sample of the data.
**Random Forest Model Summary:**
Call:
randomForest(formula = churned ~ ., data = train_split, ntree = 100, mtry = sqrt(ncol(train_split) - 1), importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 4
OOB estimate of error rate: 15.56%
Confusion matrix:
0 1 class.error
0 40562 8099 0.1664372
1 7463 43877 0.1453642
The Out-of-Bag (OOB) error rate of 15.56% serves as an internal validation check, confirming strong generalization without overfitting.
6.4 Model 4: XGBoost
The XGBoost model is based on the boosting approach, where each tree is trained sequentially to correct the errors made by the previous ones. This design allows the model to learn patterns progressively, improving precision without overfitting. With 100 iterations and controlled depth, XGBoost fine-tunes its predictions by focusing more on customers the earlier trees struggled to classify. This means the model is better at identifying high-risk churners that may appear stable in simpler models, providing sharper insights for targeted retention strategies.
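One plausible way to assemble the inputs behind the call shown below; the design-matrix construction and 0/1 label coding are assumptions (the exact encoding used for the reported 17 features may differ), while the parameters mirror those in the model summary:

```r
library(xgboost)

# Numeric design matrix (factors expanded to dummies, intercept dropped)
X_train <- model.matrix(churned ~ . - 1, data = train_split)
y_train <- as.numeric(as.character(train_split$churned))
dtrain  <- xgb.DMatrix(data = X_train, label = y_train)

# Parameters as reported in the summary: depth 6, learning rate 0.1
params <- list(max_depth = 6, eta = 0.1,
               objective = "binary:logistic", eval_metric = "error")
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100,
                       watchlist = list(train = dtrain), verbose = 0)
```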
**XGBoost Model Summary:**
##### xgb.Booster
raw: 468 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max_depth = 6, eta = 0.1, objective = "binary:logistic",
eval_metric = "error")
params (as set within xgb.train):
max_depth = "6", eta = "0.1", objective = "binary:logistic", eval_metric = "error", validate_parameters = "TRUE"
xgb.attributes:
niter
callbacks:
cb.evaluation.log()
# of features: 17
niter: 100
nfeatures : 17
evaluation_log:
iter train_error
<num> <num>
1 0.2050079
2 0.2029980
--- ---
99 0.1282587
100 0.1277187
7 Model Evaluation
The performance of all four predictive models was evaluated using a combination of statistical and operational criteria. Each metric was chosen to assess the model’s accuracy, generalizability, and practical utility in predicting customer churn.
7.1 Evaluation Criteria and Rationale
To ensure an objective comparison, the following metrics were used:
| Metric | Purpose | Why It Matters |
|---|---|---|
| Training & Validation Accuracy | Measures the proportion of correct predictions. | Assesses model fit and generalization; large discrepancies between training and validation accuracy indicate overfitting. |
| AIC / BIC (for Logistic Models only) | Penalize excessive model complexity. | Ensures parsimony by selecting the simplest model with strong explanatory power. |
| Precision & Recall (Sensitivity) | Evaluate class-specific prediction quality. | Precision identifies how many predicted churns were correct, while recall measures how many actual churns were captured. |
| F1-Score | Harmonic mean of precision and recall. | Balances the trade-off between false positives and false negatives, providing a single performance summary. |
| ROC-AUC | Measures overall discriminatory power across thresholds. | Reflects how well the model distinguishes churned vs. active customers regardless of cutoff choice. |
| Likelihood Ratio Test (LRT) | Tests significance of model improvement. | Confirms whether additional predictors or transformations (e.g., age²) significantly enhance model performance. |
7.2 Training and Validation Accuracy
The comparison shows that Random Forest and XGBoost outperform both logistic models in raw predictive accuracy, with validation accuracies of 84.72% and 84.79% respectively. However, Random Forest attained 100% training accuracy, which signals overfitting and suggests that, instead of learning patterns, it memorized the training data. In terms of goodness of fit and predictive accuracy, XGBoost is best suited for automated churn prediction and targeting, while stepwise logistic regression provides clear policy guidance for understanding which behaviors most strongly influence customer retention.
| Model | Training Accuracy | Validation Accuracy |
|---|---|---|
| Logistic Regression | 0.8154 | 0.8177 |
| Stepwise Logistic | 0.8153 | 0.8179 |
| Random Forest | 1.0000 | 0.8472 |
| XGBoost | 0.8723 | 0.8479 |
7.3 Confusion Matrix
The confusion matrices present each model’s ability to correctly classify churned versus active users. While all models perform reasonably well, the logistic models tend to miss more actual churners (false negatives), meaning some at-risk customers would go undetected and not receive timely retention offers.
In contrast, the XGBoost and Random Forest models capture a larger share of true churners, translating to greater potential for targeted interventions and revenue retention. This makes them more suitable for operational use cases where catching likely churners early is more valuable than perfectly classifying all active users.
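A sketch of how such a matrix can be produced for the XGBoost model, reusing objects from the earlier sketches; the 0.5 probability cutoff is an assumption:

```r
# Score the validation split and tabulate predictions against actuals
X_valid   <- model.matrix(churned ~ . - 1, data = valid_split)
pred_prob <- predict(xgb_model, X_valid)    # churn probabilities
pred_cls  <- ifelse(pred_prob > 0.5, 1, 0)  # assumed 0.5 cutoff
table(Predicted = pred_cls, Actual = valid_split$churned)
```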
7.4 AIC and BIC (Information Criterion)
The AIC and BIC values were computed from the log-likelihood of each logistic model, which evaluates how well the predicted probabilities fit the observed outcomes. Since these information criteria apply only to probabilistic models, they were not calculated for tree-based methods.
Both measures penalize model complexity, and while the stepwise logistic model shows slightly lower AIC and BIC values, the improvement remains marginal, indicating that the simpler subset model performs almost as well as the full model while remaining more interpretable.
| Model | AIC | BIC |
|---|---|---|
| Logistic Regression | 77,154 | 77,382 |
| Stepwise Logistic | 77,145 | 77,316 |
7.5 F1-Score
The F1-scores reinforce the earlier insight from the confusion matrices that XGBoost and Random Forest provide the most balanced performance between correctly identifying churners and avoiding false alarms. Both models achieve F1-scores above 0.85, meaning they can reliably flag customers at risk of leaving while minimizing wasted retention efforts on loyal users.
A model with high precision and recall ensures that marketing resources are focused on the right customers (those most likely to churn) improving retention efficiency and ROI. Logistic and stepwise logistic models, while slightly less effective, still offer stable performance and are valuable for explaining the key behavioral factors driving churn.
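For reference, a sketch of how these metrics follow from the validation confusion matrix of the earlier sketch, treating churned (1) as the positive class:

```r
cm <- table(Predicted = pred_cls, Actual = valid_split$churned)
precision <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 4)
```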
| Model | Dataset | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | Training | 0.8202 | 0.8201 | 0.8202 |
| Logistic Regression | Validation | 0.8218 | 0.8234 | 0.8226 |
| Stepwise Logistic | Training | 0.8202 | 0.8200 | 0.8201 |
| Stepwise Logistic | Validation | 0.8219 | 0.8239 | 0.8229 |
| Random Forest | Training | 1.0000 | 1.0000 | 1.0000 |
| Random Forest | Validation | 0.8447 | 0.8605 | 0.8525 |
| XGBoost | Training | 0.8765 | 0.8745 | 0.8755 |
| XGBoost | Validation | 0.8543 | 0.8485 | 0.8514 |
7.6 ROC-AUC Graphs
The ROC-AUC results reinforce the earlier evaluation metrics. XGBoost achieves the highest AUC (0.94), with its curve hugging the top-left corner of the plot, a clear indicator of superior discrimination between churned and active users. The logistic and stepwise models follow closely with AUC values of 0.90, offering comparable accuracy with simpler interpretability, while Random Forest trails behind at roughly 0.85.
This means XGBoost can most reliably rank customers by churn risk, enabling targeted retention actions with minimal waste. The model’s strong AUC shows it can effectively separate likely churners from loyal users, providing decision-makers with a trustworthy basis for prioritizing customer outreach.
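A sketch of the AUC computation using the pROC package (an assumed choice) and the validation-set probabilities from the earlier sketch:

```r
library(pROC)

roc_xgb <- roc(response = valid_split$churned, predictor = pred_prob)
auc(roc_xgb)                       # area under the ROC curve
plot(roc_xgb, legacy.axes = TRUE)  # curve hugging top-left = better discrimination
```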
| Model | AUC Score |
|---|---|
| Logistic Regression | 0.9050 |
| Stepwise Logistic | 0.9050 |
| Random Forest | 0.8468 |
| XGBoost | 0.9418 |
7.7 Variable Importance
The variable importance charts provide deeper insight into how the models make predictions rather than just how well they perform. While accuracy and AUC measure predictive power, variable importance highlights the drivers of churn, helping stakeholders understand which customer behaviors most influence the outcome.
Both Random Forest and XGBoost independently identified subscription_type, weekly_hours, and customer_service_inquiries as the strongest contributors to churn likelihood. This cross-model consistency strengthens confidence in these features as reliable business levers.
Though the exact ranking order varies slightly between models, the recurring presence of these key predictors indicates that churn risk is primarily driven by subscription tier, engagement intensity, and customer support interactions. For PlaylistPro, this suggests prioritizing subscription-level retention strategies, personalized engagement campaigns, and reducing service friction as high-impact actions.
8 Model Selection and Justification
Among all four models tested, XGBoost emerged as the strongest predictive performer. The baseline and stepwise logistic regression models offered interpretability but struggled to capture the complex, non-linear relationships driving churn, especially among retained users, leading to higher opportunity costs if relied upon for targeting. While the stepwise model improved parsimony, it provided no real predictive gain beyond the baseline, as confirmed by nearly identical AUC scores. The Random Forest, though strong on training data, displayed signs of overfitting, with perfect training accuracy and a noticeably lower ROC curve on validation, indicating inflated confidence without true generalization. In contrast, XGBoost maintained the best balance between accuracy, F1-score, and AUC, leveraging gradient boosting to refine predictions iteratively. Its ability to correct prior misclassifications makes it the most reliable model for identifying at-risk customers and driving retention-focused business strategies.
9 Conclusions
The predictive analysis confirmed that churn at PlaylistPro is driven by low engagement and high service friction. Among all models, XGBoost emerged as the most reliable model, delivering the highest AUC and balanced F1-score without overfitting. Its precision in identifying high-risk customers makes it ideal for operational deployment, while logistic models remain valuable for explaining the underlying behavioral drivers. These insights lay the foundation for the next phase (prescriptive analytics) where PlaylistPro can simulate retention interventions, optimize incentive allocation, and quantify how targeted engagement strategies can maximize customer lifetime value.
10 Appendix
10.1 Predictions
This section demonstrates the predictive performance of the best-performing model (XGBoost) by showing actual vs predicted outcomes for a sample of validation set observations. This provides a practical view of how the model performs on unseen data and helps identify any systematic prediction patterns.
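A sketch of how such a sample can be drawn, reusing the validation-set predictions from the earlier sketches (the sample size of 20 and the seed are assumptions, mirroring the table below):

```r
set.seed(7)  # assumed seed
idx <- sample(nrow(valid_split), 20)
data.frame(
  Actual    = ifelse(valid_split$churned[idx] == 1, "Churned", "Active"),
  Predicted = ifelse(pred_cls[idx] == 1, "Churned", "Active")
)
```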
| Actual | Predicted Class |
|---|---|
| Active | Churned |
| Active | Churned |
| Active | Active |
| Active | Active |
| Churned | Churned |
| Active | Active |
| Churned | Active |
| Active | Churned |
| Churned | Active |
| Churned | Churned |
| Active | Active |
| Active | Active |
| Active | Active |
| Active | Active |
| Churned | Churned |
| Churned | Churned |
| Churned | Churned |
| Churned | Churned |
| Churned | Churned |
| Active | Active |
10.2 Model Output Comparison for age and age²
This section provides the complete statistical outputs that supported the decision to include the quadratic age term in the final models. The comparison between linear and quadratic specifications demonstrates the significant improvement in model fit achieved by capturing the U-shaped relationship between age and churn probability.
| Model | R² | AIC (Linear) | AIC (Logistic) | BIC (Linear) | BIC (Logistic) | F-Statistic | P-Value |
|---|---|---|---|---|---|---|---|
| Linear (Age only) | 0.0024 | 181067.3 | 172904.4 | 181096.5 | 172923.9 | NA | NA |
| Quadratic (Age + Age²) | 0.0280 | 177812.2 | 169648.0 | 177851.2 | 169677.2 | 3299.784 | 0 |
**ANOVA F-Test Results:**
Analysis of Variance Table
Model 1: as.numeric(churned) ~ age
Model 2: as.numeric(churned) ~ poly(age, 2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 124998 31154
2 124997 30352 1 801.27 3299.8 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
**Logistic Model Coefficients:**
Linear Model (Age only):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.210195163 0.0163366034 -12.86652 6.946583e-38
age 0.005450966 0.0003167996 17.20635 2.379488e-66
Quadratic Model (Age + Age²):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.110990704 4.441047e-02 47.53363 0
age -0.105655974 1.998564e-03 -52.86596 0
I(age^2) 0.001149224 2.044015e-05 56.22388 0
**Likelihood Ratio Test:**
Analysis of Deviance Table
Model 1: churned ~ age
Model 2: churned ~ age + I(age^2)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 124998 172900
2 124997 169642 1 3258.4 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10.3 Stepwise Model Selection Tracing Output
This section shows the complete backward elimination process that led to the final stepwise logistic regression model. The tracing output demonstrates how variables were systematically removed based on AIC criteria, revealing the step-by-step refinement that produced the most parsimonious model while maintaining predictive power.
**Backward Stepwise Selection Process:**
Starting with full model including all predictors and quadratic age term.
Start: AIC=77154.1
churned ~ age + num_subscription_pauses + signup_date + weekly_hours +
average_session_length + song_skip_rate + weekly_songs_played +
weekly_unique_songs + num_favorite_artists + num_platform_friends +
num_playlists_created + num_shared_playlists + notifications_clicked +
subscription_type + payment_plan + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- signup_date 1 77106 77152
- num_playlists_created 1 77106 77152
- weekly_songs_played 1 77106 77152
- payment_plan 1 77106 77152
- num_platform_friends 1 77107 77153
- num_shared_playlists 1 77107 77153
<none> 77106 77154
- average_session_length 1 77109 77155
- num_favorite_artists 1 77110 77156
- payment_method 3 77118 77160
- weekly_unique_songs 1 77159 77205
- notifications_clicked 1 77503 77549
- age 1 81554 81600
- I(age^2) 1 82159 82205
- song_skip_rate 1 82207 82253
- num_subscription_pauses 1 83571 83617
- weekly_hours 1 93368 93414
- customer_service_inquiries 2 100187 100231
- subscription_type 3 100734 100776
Step: AIC=77152.1
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_songs_played + weekly_unique_songs +
num_favorite_artists + num_platform_friends + num_playlists_created +
num_shared_playlists + notifications_clicked + subscription_type +
payment_plan + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- num_playlists_created 1 77106 77150
- weekly_songs_played 1 77106 77150
- payment_plan 1 77106 77150
- num_platform_friends 1 77107 77151
- num_shared_playlists 1 77107 77151
<none> 77106 77152
- average_session_length 1 77109 77153
- num_favorite_artists 1 77110 77154
- payment_method 3 77118 77158
- weekly_unique_songs 1 77159 77203
- notifications_clicked 1 77503 77547
- age 1 81554 81598
- I(age^2) 1 82159 82203
- song_skip_rate 1 82207 82251
- num_subscription_pauses 1 83571 83615
- weekly_hours 1 93368 93412
- customer_service_inquiries 2 100187 100229
- subscription_type 3 100735 100775
Step: AIC=77150.11
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_songs_played + weekly_unique_songs +
num_favorite_artists + num_platform_friends + num_shared_playlists +
notifications_clicked + subscription_type + payment_plan +
payment_method + customer_service_inquiries + I(age^2)
Df Deviance AIC
- weekly_songs_played 1 77106 77148
- payment_plan 1 77106 77148
- num_platform_friends 1 77107 77149
- num_shared_playlists 1 77107 77149
<none> 77106 77150
- average_session_length 1 77109 77151
- num_favorite_artists 1 77110 77152
- payment_method 3 77118 77156
- weekly_unique_songs 1 77159 77201
- notifications_clicked 1 77503 77545
- age 1 81554 81596
- I(age^2) 1 82160 82202
- song_skip_rate 1 82207 82249
- num_subscription_pauses 1 83571 83613
- weekly_hours 1 93368 93410
- customer_service_inquiries 2 100187 100227
- subscription_type 3 100736 100774
Step: AIC=77148.16
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
num_platform_friends + num_shared_playlists + notifications_clicked +
subscription_type + payment_plan + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- payment_plan 1 77106 77146
- num_platform_friends 1 77107 77147
- num_shared_playlists 1 77108 77148
<none> 77106 77148
- average_session_length 1 77109 77149
- num_favorite_artists 1 77110 77150
- payment_method 3 77118 77154
- weekly_unique_songs 1 77159 77199
- notifications_clicked 1 77503 77543
- age 1 81554 81594
- I(age^2) 1 82160 82200
- song_skip_rate 1 82207 82247
- num_subscription_pauses 1 83571 83611
- weekly_hours 1 93368 93408
- customer_service_inquiries 2 100187 100225
- subscription_type 3 100736 100772
Step: AIC=77146.4
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
num_platform_friends + num_shared_playlists + notifications_clicked +
subscription_type + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- num_platform_friends 1 77108 77146
- num_shared_playlists 1 77108 77146
<none> 77106 77146
- average_session_length 1 77110 77148
- num_favorite_artists 1 77111 77149
- payment_method 3 77118 77152
- weekly_unique_songs 1 77160 77198
- notifications_clicked 1 77503 77541
- age 1 81554 81592
- I(age^2) 1 82160 82198
- song_skip_rate 1 82207 82245
- num_subscription_pauses 1 83572 83610
- weekly_hours 1 93369 93407
- customer_service_inquiries 2 100187 100223
- subscription_type 3 100737 100771
Step: AIC=77145.72
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
num_shared_playlists + notifications_clicked + subscription_type +
payment_method + customer_service_inquiries + I(age^2)
Df Deviance AIC
- num_shared_playlists 1 77109 77145
<none> 77108 77146
- average_session_length 1 77111 77147
- num_favorite_artists 1 77112 77148
- payment_method 3 77119 77151
- weekly_unique_songs 1 77161 77197
- notifications_clicked 1 77505 77541
- age 1 81555 81591
- I(age^2) 1 82161 82197
- song_skip_rate 1 82209 82245
- num_subscription_pauses 1 83574 83610
- weekly_hours 1 93369 93405
- customer_service_inquiries 2 100192 100226
- subscription_type 3 100739 100771
Step: AIC=77145.06
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
notifications_clicked + subscription_type + payment_method +
customer_service_inquiries + I(age^2)
Df Deviance AIC
<none> 77109 77145
- average_session_length 1 77112 77146
- num_favorite_artists 1 77113 77147
- payment_method 3 77121 77151
- weekly_unique_songs 1 77162 77196
- notifications_clicked 1 77506 77540
- age 1 81557 81591
- I(age^2) 1 82163 82197
- song_skip_rate 1 82210 82244
- num_subscription_pauses 1 83576 83610
- weekly_hours 1 93370 93404
- customer_service_inquiries 2 100192 100224
- subscription_type 3 100741 100771
**Final Model Formula:**
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
notifications_clicked + subscription_type + payment_method +
customer_service_inquiries + I(age^2)
**Final Model Summary:**
Call:
glm(formula = churned ~ age + num_subscription_pauses + weekly_hours +
average_session_length + song_skip_rate + weekly_unique_songs +
num_favorite_artists + notifications_clicked + subscription_type +
payment_method + customer_service_inquiries + I(age^2), family = binomial,
data = train_split)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.347e+00 8.301e-02 16.228 < 2e-16 ***
age -2.081e-01 3.255e-03 -63.935 < 2e-16 ***
num_subscription_pauses 5.183e-01 6.831e-03 75.879 < 2e-16 ***
weekly_hours -8.210e-02 7.385e-04 -111.170 < 2e-16 ***
average_session_length -4.662e-04 2.618e-04 -1.781 0.074880 .
song_skip_rate 2.251e+00 3.302e-02 68.168 < 2e-16 ***
weekly_unique_songs 7.664e-04 1.050e-04 7.300 2.88e-13 ***
num_favorite_artists -1.280e-03 6.230e-04 -2.055 0.039893 *
notifications_clicked -1.247e-02 6.282e-04 -19.847 < 2e-16 ***
subscription_typeFree 3.484e+00 3.073e-02 113.382 < 2e-16 ***
subscription_typePremium -1.349e-02 2.496e-02 -0.541 0.588746
subscription_typeStudent 1.719e+00 2.580e-02 66.611 < 2e-16 ***
payment_methodCredit Card -8.678e-02 2.556e-02 -3.395 0.000687 ***
payment_methodDebit Card -4.203e-02 2.546e-02 -1.651 0.098784 .
payment_methodPaypal -3.714e-02 2.551e-02 -1.456 0.145353
customer_service_inquiriesMedium 1.709e+00 2.312e-02 73.893 < 2e-16 ***
customer_service_inquiriesHigh 3.470e+00 2.737e-02 126.779 < 2e-16 ***
I(age^2) 2.267e-03 3.347e-05 67.725 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138559 on 100000 degrees of freedom
Residual deviance: 77109 on 99983 degrees of freedom
AIC: 77145
Number of Fisher Scoring iterations: 6
**Model Comparison:**
| Model | AIC | BIC | DF | Deviance |
|---|---|---|---|---|
| Full Model | 77154.10 | 77382.41 | 99977 | 77106.10 |
| Stepwise Model | 77145.06 | 77316.29 | 99983 | 77109.06 |