# How to Build an Employee Attrition Prediction Model
Employee turnover is expensive: Gallup's workplace research puts the cost of each departure at 50% to 200% of the employee's annual salary. Predicting which employees are likely to leave enables proactive retention interventions. This guide walks through building an attrition prediction model.
## Understanding the Problem
### What We're Predicting
**Target Variable Options**
| Definition | Pros | Cons |
|---|---|---|
| Left within 6 months | More actionable | Shorter data history |
| Left within 12 months | Balanced | Most common |
| Left within 24 months | More data | Less actionable |
| Resignation (vs. all departure) | Focused on preventable | Smaller sample |
**Recommended:** Voluntary resignation within 12 months. This definition balances actionability with statistical power.
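Constructing this label from a standard HR extract can be sketched as follows. This is a minimal example on toy data; the column names (`termination_date`, `termination_reason`) are assumptions that will vary by HRIS.

```python
import pandas as pd

# Hypothetical HR extract: one row per employee; column names are assumptions
hr = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'termination_date': [pd.Timestamp('2023-06-15'), pd.NaT,
                         pd.Timestamp('2024-08-01'), pd.Timestamp('2023-03-10')],
    'termination_reason': ['voluntary', None, 'voluntary', 'involuntary'],
})

snapshot_date = pd.Timestamp('2023-01-01')   # features are measured as of this date
horizon_end = snapshot_date + pd.DateOffset(months=12)

# Label = voluntary resignation within 12 months after the snapshot date
hr['left_within_12mo'] = (
    (hr['termination_reason'] == 'voluntary')
    & (hr['termination_date'] > snapshot_date)
    & (hr['termination_date'] <= horizon_end)
).astype(int)

print(hr[['employee_id', 'left_within_12mo']])
```

Note that still-employed people (`NaT` termination date) and involuntary departures both get a 0 label, matching the "resignation vs. all departure" distinction above.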
### Why This Is Hard
- **Class imbalance.** Annual turnover of 15% means 85% of employees don't leave, a significant imbalance.
- **Temporal dynamics.** What predicts departure changes over time with market conditions and company context.
- **Feature sensitivity.** Many predictive features raise privacy and ethical concerns.
## Data Requirements
### Core HR Data
**Employee Demographics**
- Tenure (employment duration)
- Age (sensitive; use carefully)
- Job level/grade
- Department/function
- Location
- Employment type (full-time, part-time)

**Compensation**
- Base salary
- Comp ratio (salary vs. market/range)
- Last raise date and amount
- Bonus eligibility and payout

**Job History**
- Time in current role
- Number of role changes
- Promotion history
- Lateral moves
### Performance and Engagement
**Performance Data**
- Performance ratings (current and trend)
- Calibration results
- Goal completion rates
- Recognition received

**Engagement Indicators**
- Survey responses (if available)
- eNPS scores
- Training participation
- Voluntary activity participation
### System Activity (Use Carefully)
**System Usage**
- Badge-in patterns (if available)
- System login patterns
- Communication patterns (anonymized/aggregated)

**Caution:** Activity monitoring data raises significant privacy and ethics concerns. Consider carefully before including it.
### Manager and Team
**Manager Factors**
- Manager tenure
- Manager's team turnover rate
- Time since manager change
- Manager performance rating

**Team Factors**
- Team size
- Team turnover rate
- Peer turnover (social network effects)
## Feature Engineering
### Tenure-Based Features

```python
# Key tenure features (illustrative; assumes scalar per-employee values)
tenure_months = employee['tenure_days'] / 30
features['tenure_months'] = tenure_months
features['tenure_risk_zone'] = 1 if 12 <= tenure_months <= 24 else 0  # high-risk period
features['anniversary_approaching'] = 1 if days_to_anniversary < 60 else 0
```
### Compensation Features

```python
# Compensation features
features['comp_ratio'] = salary / market_midpoint
features['time_since_raise_months'] = months_since_last_raise
features['raise_velocity'] = avg_annual_raise_pct
features['below_range'] = 1 if salary < range_min else 0
```
### Career Progression Features

```python
# Career features
features['time_in_role_months'] = months_in_current_role
features['promotions_last_3yr'] = count_promotions_last_3_years
features['stalled_career'] = (
    1 if months_in_current_role > 36 and count_promotions_last_3_years == 0 else 0
)
features['recent_lateral'] = 1 if had_lateral_move_last_12_months else 0
```
### Manager and Team Features

```python
# Manager features
features['manager_tenure_months'] = manager_tenure
features['manager_turnover_rate'] = manager_team_turnover_last_12mo
features['new_manager'] = 1 if time_with_manager < 6 else 0

# Team features
features['team_turnover_rate'] = team_turnover_last_12mo
features['peers_departed_recently'] = count_peer_departures_last_3mo
```
### Engagement Proxies

```python
# Engagement features (if available)
features['training_hours_last_12mo'] = training_hours
features['recognition_count'] = recognition_received_count
features['survey_response_rate'] = survey_responses / survey_opportunities
features['engagement_score'] = latest_engagement_survey_score
```
## Model Selection
### Algorithm Comparison
| Algorithm | Pros | Cons | When to Use |
|---|---|---|---|
| Logistic Regression | Interpretable, fast | Linear only | Baseline, explainability critical |
| Random Forest | Handles non-linear, robust | Less interpretable | Good default |
| XGBoost/LightGBM | Best performance typically | Black box | Performance priority |
| Neural Networks | Complex patterns | Data hungry, black box | Very large datasets |
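To ground the comparison, it is worth running a quick cross-validated AUC check before committing to an algorithm. A minimal sketch with scikit-learn, using synthetic data as a stand-in for real HR features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for real HR features: ~15% positive (attrition) class
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85],
                           random_state=42)

models = {
    'logistic_regression': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'random_forest': RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                            random_state=42),
}

# 5-fold cross-validated AUC for each candidate
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
    print(f"{name}: mean AUC = {scores[name]:.3f}")
```

If the interpretable baseline is within a point or two of the tree ensemble, the explainability benefit usually wins.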
### Handling Class Imbalance
**Options**
- SMOTE (Synthetic Minority Over-sampling Technique)
- Class weights in training
- Threshold adjustment
- Ensemble methods designed for imbalance

**Recommendation:** Start with class weights; add SMOTE if needed.
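The effect of class weighting is easy to verify directly. This sketch (synthetic data standing in for real features) compares precision and recall with and without scikit-learn's `class_weight='balanced'` option, which typically trades some precision for recall on imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~10% of employees leave
X, y = make_classification(n_samples=3000, n_features=15, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

results = {}
for weighting in (None, 'balanced'):
    clf = LogisticRegression(max_iter=1000, class_weight=weighting).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    results[weighting] = (precision_score(y_te, pred), recall_score(y_te, pred))
    print(f"class_weight={weighting}: precision={results[weighting][0]:.2f}, "
          f"recall={results[weighting][1]:.2f}")
```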
### Model Training Pipeline

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, roc_auc_score
import xgboost as xgb

# Split data (careful with time-based leakage; prefer a time-based split in production)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target
)

# Train model with class weights
model = xgb.XGBClassifier(
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")
```
## Evaluation Metrics
### Metric Selection
**For attrition prediction, focus on:**
| Metric | Why It Matters |
|---|---|
| Precision | Avoid false positives (unnecessary interventions) |
| Recall | Don't miss actual departures |
| AUC-ROC | Overall discrimination ability |
**Threshold Tuning**
- Higher threshold: higher precision, lower recall
- Lower threshold: higher recall, lower precision
- Choose based on intervention cost vs. departure cost
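The cost trade-off can be made explicit with a threshold sweep. The sketch below uses a toy cost model with assumed numbers (a departure costs $75K, an intervention $5K, and every caught departure is counted as saved, which is optimistic); synthetic scores stand in for held-out model probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic validation scores; in practice use held-out model probabilities
rng = np.random.default_rng(7)
y_true = rng.binomial(1, 0.15, size=1000)
y_prob = np.clip(0.2 + 0.4 * y_true + rng.normal(0, 0.15, size=1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Assumed costs: tune these to your organization
cost_departure = 75_000
cost_intervention = 5_000
n_pos = y_true.sum()

best_savings, best_t = -np.inf, None
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    if p == 0:
        continue
    n_flagged = (r * n_pos) / p  # TP / precision = TP + FP (total flagged)
    savings = r * n_pos * cost_departure - n_flagged * cost_intervention
    if savings > best_savings:
        best_savings, best_t = savings, t

print(f"cost-optimal threshold: {best_t:.2f}")
```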
### Business Metric: Catch Rate

```
Catch Rate = Employees who left that the model flagged / Total departures
```
At what threshold can you catch 50% of departures? 70%? What's the false positive rate?
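A simple way to answer these questions is to compute the catch rate in the top-k fraction of employees by risk score. A small self-contained helper (hypothetical toy data):

```python
import numpy as np

def catch_rate_at_top_k(y_true, y_prob, k=0.2):
    """Share of actual departures found in the top-k fraction by risk score."""
    n_flag = max(1, int(len(y_prob) * k))
    top_idx = np.argsort(y_prob)[::-1][:n_flag]  # highest-risk employees first
    return y_true[top_idx].sum() / max(1, y_true.sum())

# Toy example: 10 employees, 3 of whom left
y_true = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.2, 0.35, 0.15])
print(catch_rate_at_top_k(y_true, y_prob, k=0.2))  # 2 of 3 departures in the top 20%
```

Sweeping `k` (or the probability threshold) produces the curve you need to answer "what does catching 50% or 70% of departures cost in false positives?"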
## Deployment Considerations
### Ethical Guidelines
**Do**
- Use the model to trigger positive outreach (career conversations, retention offers)
- Be transparent with managers about the data used
- Provide paths for employees to discuss career concerns
- Run regular fairness audits across protected groups

**Don't**
- Use predictions punitively (denying opportunities or training)
- Over-rely on predictions without human judgment
- Ignore the impact of false positives on employees
- Use invasive surveillance data
### Integration Architecture

```
HR Data Sources → Feature Store → ML Model → Risk Scores
                                                  ↓
                    Manager Dashboard / HR Alerts / Retention Campaigns
```
### Risk Score Delivery
**To HR**
- Monthly risk reports by department
- High-risk employee lists
- Aggregated trends

**To Managers**
- Team risk overview (not individual scores initially)
- Discussion guides for career conversations
- Intervention recommendations

**Caution on Individual Scores**
Showing managers individual scores can backfire:
- Self-fulfilling prophecy
- Differential treatment
- Privacy concerns

Consider showing only aggregate team risk, with guidance on holding career conversations with everyone.
### Monitoring
**Model Performance**
- Monthly AUC tracking
- Quarterly recalibration check
- Annual retraining

**Bias Monitoring**
- Risk score distribution by demographics
- Intervention rate equity
- Outcome equity
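A first-pass bias check is just a group-by over risk scores and flag rates. A minimal sketch with pandas (toy data; `group` stands in for whichever protected attribute your fairness policy covers):

```python
import pandas as pd

# Hypothetical monitoring frame: risk scores plus a demographic attribute
df = pd.DataFrame({
    'risk_score': [0.8, 0.2, 0.6, 0.3, 0.7, 0.25],
    'group':      ['A', 'A', 'A', 'B', 'B', 'B'],
    'flagged':    [1, 0, 1, 0, 1, 0],
})

# Mean risk score and flag rate per group
summary = df.groupby('group').agg(
    mean_risk=('risk_score', 'mean'),
    flag_rate=('flagged', 'mean'),
)
print(summary)

# A large gap in flag rates between groups warrants investigation
gap = summary['flag_rate'].max() - summary['flag_rate'].min()
print(f"flag-rate gap: {gap:.2f}")
```

The same pattern applies to intervention rates and actual retention outcomes per group.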
## Common Pitfalls
### Pitfall 1: Data Leakage
Including features that reveal the outcome:
- ❌ "Submitted resignation" flag
- ❌ Exit interview data
- ❌ Future termination date
### Pitfall 2: Survivorship Bias
Training only on current employees misses those who left early. Include departed employees' historical data.
### Pitfall 3: Ignoring Time
Don't use current-state data to predict past departures. Always build point-in-time features as of the prediction date.
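A point-in-time lookup can be sketched against an event-history table. The schema here (`salary_history` with an `effective_date` per change) is an assumption; the point is that the lookup ignores any record dated after the prediction date:

```python
import pandas as pd

# Hypothetical salary history: one row per change, valid from 'effective_date'
salary_history = pd.DataFrame({
    'employee_id': [1, 1, 2],
    'effective_date': pd.to_datetime(['2021-01-01', '2022-06-01', '2021-03-01']),
    'salary': [60_000, 68_000, 55_000],
})

def salary_as_of(emp_id, as_of):
    """Return the salary in effect on `as_of`, ignoring later changes.

    Assumes at least one record exists on or before `as_of`.
    """
    rows = salary_history[
        (salary_history['employee_id'] == emp_id)
        & (salary_history['effective_date'] <= as_of)
    ]
    return rows.sort_values('effective_date')['salary'].iloc[-1]

# Predicting as of 2022-01-01 must use the 2021 salary, not the mid-2022 raise
print(salary_as_of(1, pd.Timestamp('2022-01-01')))  # 60000
```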
### Pitfall 4: Over-Reliance
The model is one input, not the answer. Human judgment, context, and individual conversations remain essential.
## Implementation Realities
No technology transformation is without challenges. Based on our experience, teams should be prepared for:
- Change management resistance — Technology is only half the battle. Getting teams to adopt new workflows requires sustained training and leadership buy-in.
- Data quality issues — AI models are only as good as the data they are trained on. Expect to spend significant time on data cleaning and standardization.
- Integration complexity — Legacy systems rarely have clean APIs. Budget for custom middleware and expect the integration timeline to be longer than estimated.
- Realistic timelines — Meaningful ROI typically takes 6-12 months, not the 90-day miracles some vendors promise.
The organizations that succeed are the ones that approach transformation as a multi-year journey, not a one-time project.
## Success Metrics
**Model Metrics**
- AUC > 0.75 (good); > 0.85 (excellent)
- Catch 50%+ of departures in the top 20% of risk scores

**Business Metrics**
- Retention rate improvement in the flagged population
- Retention program ROI
- Manager satisfaction with the tool
Contact APPIT's HR analytics team to discuss attrition prediction solutions.



