PlaylistPro Churn Prediction: Predictive Analysis
A Comprehensive Analysis of Customer Churn Using Multiple Machine Learning Approaches
1 Introduction
This section initiates the predictive analysis phase, extending insights from the earlier descriptive exploration. The task at hand is a binary classification problem, where the response variable churned indicates whether a customer has discontinued their subscription (1) or remained active (0). Accurately predicting churn is essential for PlaylistPro, as it enables the company to identify high-risk users early, design targeted retention strategies, and reduce revenue loss associated with customer attrition.
To achieve this, a mix of parametric and non-parametric supervised learning models was developed and compared. The parametric models (Logistic Regression and Stepwise Logistic Regression) were chosen for their transparency and interpretability, particularly suited for relationships that exhibit linear or near-linear patterns. In contrast, the non-parametric, tree-based models (Random Forest and Extreme Gradient Boosting (XGBoost)) offer flexibility in capturing complex, non-linear interactions among predictors. Together, these models provide both explanatory insights and predictive strength, allowing PlaylistPro to balance business interpretability with data-driven precision.
2 Data Transformations
Following the initial exploratory data analysis, customer_id was excluded from the predictor set because a unique identifier contributes no predictive value. All other predictors were retained for model development. Categorical variables, including the response variable (churned), were transformed into dummy variables to ensure model compatibility.
A data integrity check was conducted across all numeric predictors to ensure the absence of invalid or inconsistent values prior to data splitting and model fitting. One correction was applied to the variable days_since_sign_up, which contained negative values in the raw dataset. These were converted to positive values to reflect accurate account age and maintain logical consistency for subsequent modeling stages.
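A minimal sketch of the correction, assuming the prepared data frame is named `playlist` (a hypothetical name):

```r
# Negative account ages are treated as sign-entry artifacts and flipped
playlist$days_since_sign_up <- abs(playlist$days_since_sign_up)

# Sanity check: no negative values should remain
stopifnot(all(playlist$days_since_sign_up >= 0))
```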
3 Feature Engineering
All predictors except customer_id were retained for modeling. During descriptive analysis, age appeared to have a non-linear relationship with churn: both younger (20-30 years) and older (60-80 years) users showed higher churn rates, while middle-aged users (30-60 years) tended to remain subscribed. This observation prompted a hypothesis that age follows a U-shaped relationship with churn.
To validate this, customers were grouped into five-year age bins, and the average churn rate per group was plotted. Two separate models were then tested (see the sketch after this list).
1. Linear model: churn ~ age
2. Quadratic model: churn ~ age + age²
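A sketch of this validation, reusing the `playlist` data frame from the earlier sketch and assuming churned is coded 0/1 (the bin boundaries are also assumptions):

```r
# Average churn rate per five-year age bin
playlist$age_bin <- cut(playlist$age, breaks = seq(15, 85, by = 5))
bin_rates <- aggregate(as.numeric(as.character(churned)) ~ age_bin,
                       data = playlist, FUN = mean)

# Competing specifications
fit_linear <- glm(churned ~ age, family = binomial, data = playlist)
fit_quad   <- glm(churned ~ age + I(age^2), family = binomial, data = playlist)

# Likelihood ratio test and information criteria for the comparison
anova(fit_linear, fit_quad, test = "Chisq")
AIC(fit_linear, fit_quad)  # lower is better
```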
The quadratic model improved R² from 0.0024 to 0.028, and the F-test confirmed the significance of the quadratic term. Graphically, the linear model underestimated churn at the extremes, while the quadratic curve captured the curvature more accurately, reflecting that churn increases again amongst the oldest users.
Including both age and age² allowed the model to better explain churn dynamics across demographic segments. This refinement highlights two distinct high-risk groups - younger users who may require better onboarding and engagement, and older users who might need simplified interfaces or personalized support.
The improvement was further supported by the Likelihood Ratio Test, which confirmed that the quadratic model fit the data significantly better than the linear model. Information criteria also favored the quadratic specification, with AIC decreasing from 172,904 to 169,648 and BIC from 172,924 to 169,677. These reductions indicate that the added complexity of including the quadratic term is justified by a substantial gain in model fit.
Additional hypothesis testing was conducted for other predictors, including weekly_listening_hours and interaction terms between subscription type and usage metrics. However, none of these variables met the usefulness standard, as their quadratic or interaction effects were statistically insignificant and did not improve model performance. Therefore, only the quadratic term for age was retained in the final model.
After running initial classification models, the categorical predictor location was also excluded from the final model: it created 19 dummy variables, hurting model interpretability, and its coefficients were statistically insignificant, with p-values well above 0.05.
4 Multicollinearity Check
The earlier descriptive analysis revealed minimal correlation among predictors, suggesting low risk of multicollinearity. To confirm this, a Variance Inflation Factor (VIF) analysis was conducted to assess potential linear dependencies among the predictors.
| Term | Df | VIF (Adj.) |
|---|---|---|
| weekly_hours | 1 | 1.12 |
| customer_service_inquiries | 2 | 1.08 |
| subscription_type | 3 | 1.05 |
| num_subscription_pauses | 1 | 1.05 |
| song_skip_rate | 1 | 1.04 |
| age | 1 | 1.00 |
| payment_plan | 1 | 1.00 |
| payment_method | 3 | 1.00 |
| signup_date | 1 | 1.00 |
| average_session_length | 1 | 1.00 |
| weekly_songs_played | 1 | 1.00 |
| weekly_unique_songs | 1 | 1.00 |
| num_favorite_artists | 1 | 1.00 |
| num_platform_friends | 1 | 1.00 |
| num_playlists_created | 1 | 1.00 |
| num_shared_playlists | 1 | 1.00 |
| notifications_clicked | 1 | 1.00 |
The VIF scores show no sign of problematic multicollinearity: even the highest value (around 1.12) is far below 5, the usual threshold for concern.
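A sketch of the check using the `car` package (an assumed choice), fit on the full prepared data; for models containing multi-level factors, `car::vif()` returns the generalized VIF, and the GVIF^(1/(2*Df)) column corresponds to the adjusted values above:

```r
library(car)

# Full logistic fit on all retained predictors (customer_id already dropped)
vif_fit <- glm(churned ~ ., family = binomial, data = playlist)
vif(vif_fit)  # GVIF, Df, and GVIF^(1/(2*Df)) per term
```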
5 Data Splitting
An 80-20 split was adopted for this predictive analysis, with 80% of the data used for model training and the remaining 20% reserved for validation. Although 5-fold and 10-fold cross-validation procedures were tested, the resulting predictive accuracy and evaluation metrics were consistent across methods. Cross-validation, while robust, is computationally expensive because it repeatedly trains and tests the model across multiple data folds; fitting the algorithm several times significantly increases processing time and resource usage compared to a single 80-20 split.
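A minimal sketch of the split; the `train_split` name matches the model calls shown later, while `valid_split` and the seed are assumptions:

```r
set.seed(42)  # assumed seed for reproducibility

# 80% training / 20% validation split
n           <- nrow(playlist)
train_idx   <- sample(seq_len(n), size = floor(0.8 * n))
train_split <- playlist[train_idx, ]
valid_split <- playlist[-train_idx, ]
```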
6 Model Development & Statistical Findings
Four supervised learning models were developed to predict customer churn for PlaylistPro: Logistic Regression, Stepwise Logistic Regression, Random Forest, and XGBoost. Each model represents a distinct analytical approach, offering a balance between interpretability, parsimony, and predictive strength. Notably, the quadratic term age² was included only in the logistic models, as tree-based models like Random Forest and XGBoost inherently capture non-linearities.
6.1 Model 1: Logistic Regression (Baseline)
The baseline Logistic Regression model was implemented first to establish a benchmark for performance. It included all available predictors along with the quadratic term age², introduced to capture the observed non-linear (U-shaped) relationship between age and churn identified earlier. Logistic regression was chosen for its interpretability and ability to quantify the direction and magnitude of predictor effects.
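A sketch of the fitting call, assuming the object name `logit_full` (hypothetical); the table below corresponds to the coefficient block of `summary()` output from a fit of this form:

```r
# Baseline: all predictors plus the quadratic age term
logit_full <- glm(churned ~ . + I(age^2), family = binomial,
                  data = train_split)
summary(logit_full)$coefficients
```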
| Term | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 1.3543 | 0.0903 | 14.9912 | 0.0000 |
| age | -0.2081 | 0.0033 | -63.9323 | 0.0000 |
| subscription_typeFree | 3.4845 | 0.0307 | 113.3730 | 0.0000 |
| subscription_typePremium | -0.0132 | 0.0250 | -0.5286 | 0.5971 |
| subscription_typeStudent | 1.7190 | 0.0258 | 66.6124 | 0.0000 |
| payment_planYearly | -0.0088 | 0.0180 | -0.4895 | 0.6245 |
| num_subscription_pauses | 0.5183 | 0.0068 | 75.8675 | 0.0000 |
| payment_methodCredit Card | -0.0868 | 0.0256 | -3.3943 | 0.0007 |
| payment_methodDebit Card | -0.0417 | 0.0255 | -1.6384 | 0.1013 |
| payment_methodPaypal | -0.0371 | 0.0255 | -1.4545 | 0.1458 |
| customer_service_inquiriesMedium | 1.7087 | 0.0231 | 73.8853 | 0.0000 |
| customer_service_inquiriesHigh | 3.4704 | 0.0274 | 126.7744 | 0.0000 |
| signup_date | 0.0000 | 0.0000 | -0.0522 | 0.9583 |
| weekly_hours | -0.0821 | 0.0007 | -111.1676 | 0.0000 |
| average_session_length | -0.0005 | 0.0003 | -1.7835 | 0.0745 |
| song_skip_rate | 2.2507 | 0.0330 | 68.1648 | 0.0000 |
| weekly_songs_played | 0.0000 | 0.0001 | -0.2220 | 0.8243 |
| weekly_unique_songs | 0.0008 | 0.0001 | 7.2984 | 0.0000 |
| num_favorite_artists | -0.0013 | 0.0006 | -2.0493 | 0.0404 |
| num_platform_friends | -0.0002 | 0.0002 | -1.1493 | 0.2504 |
| num_playlists_created | 0.0000 | 0.0003 | 0.1247 | 0.9007 |
| num_shared_playlists | 0.0007 | 0.0006 | 1.1552 | 0.2480 |
| notifications_clicked | -0.0125 | 0.0006 | -19.8476 | 0.0000 |
| I(age^2) | 0.0023 | 0.0000 | 67.7232 | 0.0000 |
The baseline logistic regression model reveals clear behavioral and demographic patterns influencing customer churn. Both age and age² are statistically significant, confirming a U-shaped relationship. Strong positive coefficients for free or student subscriptions, high customer service inquiries, and song skip rate indicate that disengaged or dissatisfied users drive most churn. Conversely, higher weekly listening hours significantly reduce churn probability, underscoring that consistent engagement is the strongest retention signal.
Predictors such as payment_method, signup_date, payment_plan, and average_session_length did not show statistical significance and contributed little additional explanatory power. The subsequent stepwise selection and ensemble models will automatically de-emphasize such weak predictors through internal feature selection or remove them progressively.
6.2 Model 2: Logistic Regression w/ Stepwise Subset Selection
The Stepwise Logistic Regression model refines the predictor set using the Akaike Information Criterion (AIC). Before settling on a final specification, stepwise subset selection was applied in forward, backward, and bidirectional modes to identify a leaner model that balances predictive accuracy and interpretability. Among these, backward elimination produced the most efficient model, with the lowest AIC and validation accuracy comparable to the full model. This approach systematically removed redundant or statistically insignificant variables, and the resulting subset model improved efficiency and interpretability, aligning with the goal of deriving actionable insights without unnecessary model complexity.
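A sketch of the selection step, reusing the baseline fit `logit_full` from the earlier sketch; `stats::step()` with `direction = "backward"` performs the AIC-based elimination traced in Appendix 10.3:

```r
# Backward elimination on AIC (k = 2, the default penalty)
logit_step <- step(logit_full, direction = "backward", trace = TRUE)
formula(logit_step)  # inspect the retained predictor set
```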
**Final Model Formula:**
churned ~ age + subscription_type + num_subscription_pauses +
payment_method + customer_service_inquiries + weekly_hours +
average_session_length + song_skip_rate + weekly_unique_songs +
num_favorite_artists + notifications_clicked + I(age^2)
| Term | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 1.3471 | 0.0830 | 16.2278 | 0.0000 |
| age | -0.2081 | 0.0033 | -63.9346 | 0.0000 |
| subscription_typeFree | 3.4845 | 0.0307 | 113.3819 | 0.0000 |
| subscription_typePremium | -0.0135 | 0.0250 | -0.5407 | 0.5887 |
| subscription_typeStudent | 1.7189 | 0.0258 | 66.6112 | 0.0000 |
| num_subscription_pauses | 0.5183 | 0.0068 | 75.8785 | 0.0000 |
| payment_methodCredit Card | -0.0868 | 0.0256 | -3.3948 | 0.0007 |
| payment_methodDebit Card | -0.0420 | 0.0255 | -1.6508 | 0.0988 |
| payment_methodPaypal | -0.0371 | 0.0255 | -1.4561 | 0.1454 |
| customer_service_inquiriesMedium | 1.7087 | 0.0231 | 73.8934 | 0.0000 |
| customer_service_inquiriesHigh | 3.4704 | 0.0274 | 126.7785 | 0.0000 |
| weekly_hours | -0.0821 | 0.0007 | -111.1695 | 0.0000 |
| average_session_length | -0.0005 | 0.0003 | -1.7812 | 0.0749 |
| song_skip_rate | 2.2507 | 0.0330 | 68.1677 | 0.0000 |
| weekly_unique_songs | 0.0008 | 0.0001 | 7.2998 | 0.0000 |
| num_favorite_artists | -0.0013 | 0.0006 | -2.0549 | 0.0399 |
| notifications_clicked | -0.0125 | 0.0006 | -19.8470 | 0.0000 |
| I(age^2) | 0.0023 | 0.0000 | 67.7247 | 0.0000 |
The stepwise logistic regression model refines the baseline by removing statistically redundant variables such as signup_date, num_playlists_created, payment_plan, and other low-impact engagement features. The process stabilized at an AIC of approximately 77,145, retaining only the most meaningful predictors, including age, age², subscription_type, weekly_hours, song_skip_rate, and customer_service_inquiries. The retained predictors highlight a consistent behavioral pattern: users on free or student subscriptions, those with frequent customer service inquiries, and those with a high song skip rate remain the strongest churn indicators. Engagement variables such as weekly listening hours and notifications clicked continue to show negative relationships with churn, reinforcing their protective effect. Both age and age² are statistically significant in this model as well.
6.3 Model 3: Random Forest
The Random Forest model applies a bagging (bootstrap aggregation) framework that combines the outputs of multiple decision trees. In this case, 100 trees were trained as part of the model design to capture complex, non-linear relationships in customer behavior. The ensemble strategy improves stability and accuracy by averaging results from many independently trained trees, each built on a random sample of the data.
**Random Forest Model Summary:**
Call:
randomForest(formula = churned ~ ., data = train_split, ntree = 100, mtry = sqrt(ncol(train_split) - 1), importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 4
OOB estimate of error rate: 15.56%
Confusion matrix:
0 1 class.error
0 40562 8099 0.1664372
1 7463 43877 0.1453642
The Out-of-Bag (OOB) error rate of 15.56% serves as an internal validation check, confirming strong generalization without overfitting.
6.4 Model 4: XGBoost
The XGBoost model is based on the boosting approach, where each tree is trained sequentially to correct the errors made by the previous ones. This design allows the model to learn patterns progressively, improving precision without overfitting. With 100 iterations and controlled depth, XGBoost fine-tunes its predictions by focusing more on customers the earlier trees struggled to classify. This means the model is better at identifying high-risk churners that may appear stable in simpler models, providing sharper insights for targeted retention strategies.
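One plausible way to assemble the inputs behind the call shown below; the design-matrix construction and 0/1 label coding are assumptions (the exact encoding used for the reported 17 features may differ), while the parameters mirror those in the model summary:

```r
library(xgboost)

# Numeric design matrix (factors expanded to dummies, intercept dropped)
X_train <- model.matrix(churned ~ . - 1, data = train_split)
y_train <- as.numeric(as.character(train_split$churned))
dtrain  <- xgb.DMatrix(data = X_train, label = y_train)

# Parameters as reported in the summary: depth 6, learning rate 0.1
params <- list(max_depth = 6, eta = 0.1,
               objective = "binary:logistic", eval_metric = "error")
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100,
                       watchlist = list(train = dtrain), verbose = 0)
```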
**XGBoost Model Summary:**
##### xgb.Booster
raw: 468 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max_depth = 6, eta = 0.1, objective = "binary:logistic",
eval_metric = "error")
params (as set within xgb.train):
max_depth = "6", eta = "0.1", objective = "binary:logistic", eval_metric = "error", validate_parameters = "TRUE"
xgb.attributes:
niter
callbacks:
cb.evaluation.log()
# of features: 17
niter: 100
nfeatures : 17
evaluation_log:
iter train_error
<num> <num>
1 0.2050079
2 0.2029980
--- ---
99 0.1282587
100 0.1277187
7 Model Evaluation
The performance of all four predictive models was evaluated using a combination of statistical and operational criteria. Each metric was chosen to assess the model’s accuracy, generalizability, and practical utility in predicting customer churn.
7.1 Evaluation Criteria and Rationale
To ensure an objective comparison, the following metrics were used:
| Metric | Purpose | Why It Matters |
|---|---|---|
| Training & Validation Accuracy | Measures the proportion of correct predictions. | Assesses model fit and generalization; large discrepancies between training and validation accuracy indicate overfitting. |
| AIC / BIC (for Logistic Models only) | Penalize excessive model complexity. | Ensures parsimony by selecting the simplest model with strong explanatory power. |
| Precision & Recall (Sensitivity) | Evaluate class-specific prediction quality. | Precision identifies how many predicted churns were correct, while recall measures how many actual churns were captured. |
| F1-Score | Harmonic mean of precision and recall. | Balances the trade-off between false positives and false negatives, providing a single performance summary. |
| ROC-AUC | Measures overall discriminatory power across thresholds. | Reflects how well the model distinguishes churned vs. active customers regardless of cutoff choice. |
| Likelihood Ratio Test (LRT) | Tests significance of model improvement. | Confirms whether additional predictors or transformations (e.g., age²) significantly enhance model performance. |
7.2 Training and Validation Accuracy
The comparison shows that Random Forest and XGBoost outperform both logistic models in raw predictive accuracy, with validation accuracies of 84.72% and 84.79% respectively. However, Random Forest attained 100% training accuracy, which signals overfitting and suggests that, instead of learning patterns, it memorized the training data. In terms of goodness of fit and predictive accuracy, XGBoost is best suited for automated churn prediction and targeting, while stepwise logistic regression provides clear policy guidance for understanding which behaviors most strongly influence customer retention.
| Model | Training Accuracy | Validation Accuracy |
|---|---|---|
| Logistic Regression | 0.8154 | 0.8177 |
| Stepwise Logistic | 0.8153 | 0.8179 |
| Random Forest | 1.0000 | 0.8472 |
| XGBoost | 0.8723 | 0.8479 |
7.3 Confusion Matrix
The confusion matrices present each model’s ability to correctly classify churned versus active users. While all models perform reasonably well, the logistic models tend to miss more actual churners (false negatives), meaning some at-risk customers would go undetected and not receive timely retention offers.
In contrast, the XGBoost and Random Forest models capture a larger share of true churners, translating to greater potential for targeted interventions and revenue retention. This makes them more suitable for operational use cases where catching likely churners early is more valuable than perfectly classifying all active users.
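A sketch of how such a matrix can be produced for the XGBoost model, reusing objects from the earlier sketches; the 0.5 probability cutoff is an assumption:

```r
# Score the validation split and tabulate predictions against actuals
X_valid   <- model.matrix(churned ~ . - 1, data = valid_split)
pred_prob <- predict(xgb_model, X_valid)    # churn probabilities
pred_cls  <- ifelse(pred_prob > 0.5, 1, 0)  # assumed 0.5 cutoff
table(Predicted = pred_cls, Actual = valid_split$churned)
```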
7.4 AIC and BIC (Information Criterion)
The AIC and BIC values were computed from the log-likelihood of each logistic model, which evaluates how well the predicted probabilities fit the observed outcomes. Since these information criteria apply only to probabilistic models, they were not calculated for tree-based methods.
Both measures penalize model complexity, and while the stepwise logistic model shows slightly lower AIC and BIC values, the improvement remains marginal, indicating that the simpler subset model performs almost as well as the full model while remaining more interpretable.
| Model | AIC | BIC |
|---|---|---|
| Logistic Regression | 77,154 | 77,382 |
| Stepwise Logistic | 77,145 | 77,316 |
7.5 F1-Score
The F1-scores reinforce the earlier insight from the confusion matrices that XGBoost and Random Forest provide the most balanced performance between correctly identifying churners and avoiding false alarms. Both models achieve F1-scores above 0.85, meaning they can reliably flag customers at risk of leaving while minimizing wasted retention efforts on loyal users.
A model with high precision and recall ensures that marketing resources are focused on the right customers (those most likely to churn) improving retention efficiency and ROI. Logistic and stepwise logistic models, while slightly less effective, still offer stable performance and are valuable for explaining the key behavioral factors driving churn.
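For reference, a sketch of how these metrics follow from the validation confusion matrix of the earlier sketch, treating churned (1) as the positive class:

```r
cm <- table(Predicted = pred_cls, Actual = valid_split$churned)
precision <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 4)
```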
| Model | Dataset | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | Training | 0.8202 | 0.8201 | 0.8202 |
| Logistic Regression | Validation | 0.8218 | 0.8234 | 0.8226 |
| Stepwise Logistic | Training | 0.8202 | 0.8200 | 0.8201 |
| Stepwise Logistic | Validation | 0.8219 | 0.8239 | 0.8229 |
| Random Forest | Training | 1.0000 | 1.0000 | 1.0000 |
| Random Forest | Validation | 0.8447 | 0.8605 | 0.8525 |
| XGBoost | Training | 0.8765 | 0.8745 | 0.8755 |
| XGBoost | Validation | 0.8543 | 0.8485 | 0.8514 |
7.6 ROC-AUC Graphs
The ROC-AUC results reinforce the earlier evaluation metrics. XGBoost achieves the highest AUC (0.94), with its curve hugging the top-left corner of the plot, a clear indicator of superior discrimination between churned and active users. The logistic and stepwise models follow closely with AUC values of 0.90, offering comparable accuracy with simpler interpretability, while Random Forest trails behind at roughly 0.85.
This means XGBoost can most reliably rank customers by churn risk, enabling targeted retention actions with minimal waste. The model’s strong AUC shows it can effectively separate likely churners from loyal users, providing decision-makers with a trustworthy basis for prioritizing customer outreach.
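A sketch of the AUC computation using the pROC package (an assumed choice) and the validation-set probabilities from the earlier sketch:

```r
library(pROC)

roc_xgb <- roc(response = valid_split$churned, predictor = pred_prob)
auc(roc_xgb)                       # area under the ROC curve
plot(roc_xgb, legacy.axes = TRUE)  # curve hugging top-left = better discrimination
```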
| Model | AUC Score |
|---|---|
| Logistic Regression | 0.9050 |
| Stepwise Logistic | 0.9050 |
| Random Forest | 0.8468 |
| XGBoost | 0.9418 |
7.7 Variable Importance
The variable importance charts provide deeper insight into how the models make predictions rather than just how well they perform. While accuracy and AUC measure predictive power, variable importance highlights the drivers of churn, helping stakeholders understand which customer behaviors most influence the outcome.
Both Random Forest and XGBoost independently identified subscription_type, weekly_hours, and customer_service_inquiries as the strongest contributors to churn likelihood. This cross-model consistency strengthens confidence in these features as reliable business levers.
Though the exact ranking order varies slightly between models, the recurring presence of these key predictors indicates that churn risk is primarily driven by subscription tier, engagement intensity, and customer support interactions. For PlaylistPro, this suggests prioritizing subscription-level retention strategies, personalized engagement campaigns, and reducing service friction as high-impact actions.
8 Model Selection and Justification
Among all four models tested, XGBoost emerged as the strongest predictive performer. The baseline and stepwise logistic regression models offered interpretability but struggled to capture the complex, non-linear relationships driving churn, especially among retained users, leading to higher opportunity costs if relied upon for targeting. While the stepwise model improved parsimony, it provided no real predictive gain beyond the baseline, as confirmed by nearly identical AUC scores. The Random Forest, though strong on training data, displayed signs of overfitting, with perfect training accuracy and a noticeably lower ROC curve on validation, indicating inflated confidence without true generalization. In contrast, XGBoost maintained the best balance between accuracy, F1-score, and AUC, leveraging gradient boosting to refine predictions iteratively. Its ability to correct prior misclassifications makes it the most reliable model for identifying at-risk customers and driving retention-focused business strategies.
9 Conclusions
The predictive analysis confirmed that churn at PlaylistPro is driven by low engagement and high service friction. Among all models, XGBoost emerged as the most reliable model, delivering the highest AUC and balanced F1-score without overfitting. Its precision in identifying high-risk customers makes it ideal for operational deployment, while logistic models remain valuable for explaining the underlying behavioral drivers. These insights lay the foundation for the next phase (prescriptive analytics) where PlaylistPro can simulate retention interventions, optimize incentive allocation, and quantify how targeted engagement strategies can maximize customer lifetime value.
10 Appendix
10.1 Predictions
This section demonstrates the predictive performance of the best-performing model (XGBoost) by showing actual vs predicted outcomes for a sample of validation set observations. This provides a practical view of how the model performs on unseen data and helps identify any systematic prediction patterns.
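A sketch of how such a sample can be drawn, reusing the validation-set predictions from the earlier sketches (the sample size of 20 and the seed are assumptions, mirroring the table below):

```r
set.seed(7)  # assumed seed
idx <- sample(nrow(valid_split), 20)
data.frame(
  Actual    = ifelse(valid_split$churned[idx] == 1, "Churned", "Active"),
  Predicted = ifelse(pred_cls[idx] == 1, "Churned", "Active")
)
```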
| Actual | Predicted Class |
|---|---|
| Active | Churned |
| Active | Churned |
| Active | Active |
| Active | Active |
| Churned | Churned |
| Active | Active |
| Churned | Active |
| Active | Churned |
| Churned | Active |
| Churned | Churned |
| Active | Active |
| Active | Active |
| Active | Active |
| Active | Active |
| Churned | Churned |
| Churned | Churned |
| Churned | Churned |
| Churned | Churned |
| Churned | Churned |
| Active | Active |
10.2 Model Output Comparison for age and age²
This section provides the complete statistical outputs that supported the decision to include the quadratic age term in the final models. The comparison between linear and quadratic specifications demonstrates the significant improvement in model fit achieved by capturing the U-shaped relationship between age and churn probability.
| Model | R² | AIC (Linear) | AIC (Logistic) | BIC (Linear) | BIC (Logistic) | F-Statistic | P-Value |
|---|---|---|---|---|---|---|---|
| Linear (Age only) | 0.0024 | 181067.3 | 172904.4 | 181096.5 | 172923.9 | NA | NA |
| Quadratic (Age + Age²) | 0.0280 | 177812.2 | 169648.0 | 177851.2 | 169677.2 | 3299.784 | 0 |
**ANOVA F-Test Results:**
Analysis of Variance Table
Model 1: as.numeric(churned) ~ age
Model 2: as.numeric(churned) ~ poly(age, 2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 124998 31154
2 124997 30352 1 801.27 3299.8 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
**Logistic Model Coefficients:**
Linear Model (Age only):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.210195163 0.0163366034 -12.86652 6.946583e-38
age 0.005450966 0.0003167996 17.20635 2.379488e-66
Quadratic Model (Age + Age²):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.110990704 4.441047e-02 47.53363 0
age -0.105655974 1.998564e-03 -52.86596 0
I(age^2) 0.001149224 2.044015e-05 56.22388 0
**Likelihood Ratio Test:**
Analysis of Deviance Table
Model 1: churned ~ age
Model 2: churned ~ age + I(age^2)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 124998 172900
2 124997 169642 1 3258.4 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10.3 Stepwise Model Selection Tracing Output
This section shows the complete backward elimination process that led to the final stepwise logistic regression model. The tracing output demonstrates how variables were systematically removed based on AIC criteria, revealing the step-by-step refinement that produced the most parsimonious model while maintaining predictive power.
**Backward Stepwise Selection Process:**
Starting with full model including all predictors and quadratic age term.
Start: AIC=77154.1
churned ~ age + num_subscription_pauses + signup_date + weekly_hours +
average_session_length + song_skip_rate + weekly_songs_played +
weekly_unique_songs + num_favorite_artists + num_platform_friends +
num_playlists_created + num_shared_playlists + notifications_clicked +
subscription_type + payment_plan + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- signup_date 1 77106 77152
- num_playlists_created 1 77106 77152
- weekly_songs_played 1 77106 77152
- payment_plan 1 77106 77152
- num_platform_friends 1 77107 77153
- num_shared_playlists 1 77107 77153
<none> 77106 77154
- average_session_length 1 77109 77155
- num_favorite_artists 1 77110 77156
- payment_method 3 77118 77160
- weekly_unique_songs 1 77159 77205
- notifications_clicked 1 77503 77549
- age 1 81554 81600
- I(age^2) 1 82159 82205
- song_skip_rate 1 82207 82253
- num_subscription_pauses 1 83571 83617
- weekly_hours 1 93368 93414
- customer_service_inquiries 2 100187 100231
- subscription_type 3 100734 100776
Step: AIC=77152.1
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_songs_played + weekly_unique_songs +
num_favorite_artists + num_platform_friends + num_playlists_created +
num_shared_playlists + notifications_clicked + subscription_type +
payment_plan + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- num_playlists_created 1 77106 77150
- weekly_songs_played 1 77106 77150
- payment_plan 1 77106 77150
- num_platform_friends 1 77107 77151
- num_shared_playlists 1 77107 77151
<none> 77106 77152
- average_session_length 1 77109 77153
- num_favorite_artists 1 77110 77154
- payment_method 3 77118 77158
- weekly_unique_songs 1 77159 77203
- notifications_clicked 1 77503 77547
- age 1 81554 81598
- I(age^2) 1 82159 82203
- song_skip_rate 1 82207 82251
- num_subscription_pauses 1 83571 83615
- weekly_hours 1 93368 93412
- customer_service_inquiries 2 100187 100229
- subscription_type 3 100735 100775
Step: AIC=77150.11
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_songs_played + weekly_unique_songs +
num_favorite_artists + num_platform_friends + num_shared_playlists +
notifications_clicked + subscription_type + payment_plan +
payment_method + customer_service_inquiries + I(age^2)
Df Deviance AIC
- weekly_songs_played 1 77106 77148
- payment_plan 1 77106 77148
- num_platform_friends 1 77107 77149
- num_shared_playlists 1 77107 77149
<none> 77106 77150
- average_session_length 1 77109 77151
- num_favorite_artists 1 77110 77152
- payment_method 3 77118 77156
- weekly_unique_songs 1 77159 77201
- notifications_clicked 1 77503 77545
- age 1 81554 81596
- I(age^2) 1 82160 82202
- song_skip_rate 1 82207 82249
- num_subscription_pauses 1 83571 83613
- weekly_hours 1 93368 93410
- customer_service_inquiries 2 100187 100227
- subscription_type 3 100736 100774
Step: AIC=77148.16
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
num_platform_friends + num_shared_playlists + notifications_clicked +
subscription_type + payment_plan + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- payment_plan 1 77106 77146
- num_platform_friends 1 77107 77147
- num_shared_playlists 1 77108 77148
<none> 77106 77148
- average_session_length 1 77109 77149
- num_favorite_artists 1 77110 77150
- payment_method 3 77118 77154
- weekly_unique_songs 1 77159 77199
- notifications_clicked 1 77503 77543
- age 1 81554 81594
- I(age^2) 1 82160 82200
- song_skip_rate 1 82207 82247
- num_subscription_pauses 1 83571 83611
- weekly_hours 1 93368 93408
- customer_service_inquiries 2 100187 100225
- subscription_type 3 100736 100772
Step: AIC=77146.4
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
num_platform_friends + num_shared_playlists + notifications_clicked +
subscription_type + payment_method + customer_service_inquiries +
I(age^2)
Df Deviance AIC
- num_platform_friends 1 77108 77146
- num_shared_playlists 1 77108 77146
<none> 77106 77146
- average_session_length 1 77110 77148
- num_favorite_artists 1 77111 77149
- payment_method 3 77118 77152
- weekly_unique_songs 1 77160 77198
- notifications_clicked 1 77503 77541
- age 1 81554 81592
- I(age^2) 1 82160 82198
- song_skip_rate 1 82207 82245
- num_subscription_pauses 1 83572 83610
- weekly_hours 1 93369 93407
- customer_service_inquiries 2 100187 100223
- subscription_type 3 100737 100771
Step: AIC=77145.72
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
num_shared_playlists + notifications_clicked + subscription_type +
payment_method + customer_service_inquiries + I(age^2)
Df Deviance AIC
- num_shared_playlists 1 77109 77145
<none> 77108 77146
- average_session_length 1 77111 77147
- num_favorite_artists 1 77112 77148
- payment_method 3 77119 77151
- weekly_unique_songs 1 77161 77197
- notifications_clicked 1 77505 77541
- age 1 81555 81591
- I(age^2) 1 82161 82197
- song_skip_rate 1 82209 82245
- num_subscription_pauses 1 83574 83610
- weekly_hours 1 93369 93405
- customer_service_inquiries 2 100192 100226
- subscription_type 3 100739 100771
Step: AIC=77145.06
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
notifications_clicked + subscription_type + payment_method +
customer_service_inquiries + I(age^2)
Df Deviance AIC
<none> 77109 77145
- average_session_length 1 77112 77146
- num_favorite_artists 1 77113 77147
- payment_method 3 77121 77151
- weekly_unique_songs 1 77162 77196
- notifications_clicked 1 77506 77540
- age 1 81557 81591
- I(age^2) 1 82163 82197
- song_skip_rate 1 82210 82244
- num_subscription_pauses 1 83576 83610
- weekly_hours 1 93370 93404
- customer_service_inquiries 2 100192 100224
- subscription_type 3 100741 100771
**Final Model Formula:**
churned ~ age + num_subscription_pauses + weekly_hours + average_session_length +
song_skip_rate + weekly_unique_songs + num_favorite_artists +
notifications_clicked + subscription_type + payment_method +
customer_service_inquiries + I(age^2)
**Final Model Summary:**
Call:
glm(formula = churned ~ age + num_subscription_pauses + weekly_hours +
average_session_length + song_skip_rate + weekly_unique_songs +
num_favorite_artists + notifications_clicked + subscription_type +
payment_method + customer_service_inquiries + I(age^2), family = binomial,
data = train_split)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.347e+00 8.301e-02 16.228 < 2e-16 ***
age -2.081e-01 3.255e-03 -63.935 < 2e-16 ***
num_subscription_pauses 5.183e-01 6.831e-03 75.879 < 2e-16 ***
weekly_hours -8.210e-02 7.385e-04 -111.170 < 2e-16 ***
average_session_length -4.662e-04 2.618e-04 -1.781 0.074880 .
song_skip_rate 2.251e+00 3.302e-02 68.168 < 2e-16 ***
weekly_unique_songs 7.664e-04 1.050e-04 7.300 2.88e-13 ***
num_favorite_artists -1.280e-03 6.230e-04 -2.055 0.039893 *
notifications_clicked -1.247e-02 6.282e-04 -19.847 < 2e-16 ***
subscription_typeFree 3.484e+00 3.073e-02 113.382 < 2e-16 ***
subscription_typePremium -1.349e-02 2.496e-02 -0.541 0.588746
subscription_typeStudent 1.719e+00 2.580e-02 66.611 < 2e-16 ***
payment_methodCredit Card -8.678e-02 2.556e-02 -3.395 0.000687 ***
payment_methodDebit Card -4.203e-02 2.546e-02 -1.651 0.098784 .
payment_methodPaypal -3.714e-02 2.551e-02 -1.456 0.145353
customer_service_inquiriesMedium 1.709e+00 2.312e-02 73.893 < 2e-16 ***
customer_service_inquiriesHigh 3.470e+00 2.737e-02 126.779 < 2e-16 ***
I(age^2) 2.267e-03 3.347e-05 67.725 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138559 on 100000 degrees of freedom
Residual deviance: 77109 on 99983 degrees of freedom
AIC: 77145
Number of Fisher Scoring iterations: 6
**Model Comparison:**
| Model | AIC | BIC | DF | Deviance |
|---|---|---|---|---|
| Full Model | 77154.10 | 77382.41 | 99977 | 77106.10 |
| Stepwise Model | 77145.06 | 77316.29 | 99983 | 77109.06 |