## Overview
This post documents an end-to-end supervised machine learning workflow built in Python during my graduate coursework at Boston University. The goal was to build a reproducible pipeline that handles data preprocessing, feature selection, model training, and evaluation — the kind of workflow that translates directly to real-world analytics problems.
## Problem & Dataset
The analysis used a structured tabular dataset with a mix of numeric and categorical features. The target variable was a binary classification outcome. The challenge was to:
- Handle missing values and skewed distributions thoughtfully
- Engineer features that improve signal without data leakage
- Select a modeling strategy that generalizes well
## Pipeline Design
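A minimal sketch of the kind of pipeline described, using scikit-learn. The column names, the choice of `k`, and the specific imputation strategies are illustrative assumptions, not the exact settings from the coursework:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column splits -- substitute your own numeric/categorical lists.
numeric_cols = ["age", "income"]
categorical_cols = ["region"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # median is robust to skew
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=3)),  # keep the k strongest features
    ("clf", GradientBoostingClassifier(random_state=42)),
])
```

Because imputation, scaling, encoding, and selection all live inside the `Pipeline`, they are refit on each training fold during cross-validation, which is what keeps the workflow leakage-free.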
## Cross-Validation & Model Comparison
Models were compared using Stratified K-Fold cross-validation (k=5) to account for class imbalance and reduce variance in performance estimates.
| Model | CV AUC (mean ± std) |
|---|---|
| Logistic Regression | 0.81 ± 0.03 |
| Random Forest | 0.87 ± 0.02 |
| Gradient Boosting | 0.89 ± 0.02 |
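The comparison above can be sketched as follows. This is a hedged reconstruction on synthetic data (the original dataset and hyperparameters are not reproduced here), so the printed scores will not match the table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the real dataset.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, est in models.items():
    scores = cross_val_score(est, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.2f} ± {scores.std():.2f}")
```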
## Feature Importance
After fitting the final Gradient Boosting model, feature importances were extracted and visualized using matplotlib. The top features aligned well with domain knowledge, providing a sanity check on the model.
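Extracting and ranking the importances looks roughly like this. The data and feature names below are placeholders, not the project's actual columns:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

gb = GradientBoostingClassifier(random_state=42).fit(X, y)

# Rank features from most to least important.
ranked = sorted(zip(feature_names, gb.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

# For the matplotlib visualization, a horizontal bar chart works well:
# import matplotlib.pyplot as plt
# names, imps = zip(*ranked)
# plt.barh(names[::-1], imps[::-1]); plt.xlabel("importance"); plt.show()
```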
## Key Takeaways
- Pipeline design matters: keeping preprocessing inside a `Pipeline` object prevents data leakage during cross-validation, a subtle but critical mistake to avoid.
- Model selection: Gradient Boosting outperformed simpler models, but the gain wasn't free; it required careful regularization to avoid overfitting on the training set.
- Feature selection: using `SelectKBest` inside the pipeline reduced noise features and slightly improved generalization.
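The leakage point above is worth seeing concretely. A minimal sketch (synthetic data, illustrative models) contrasting the leaky pattern with the pipeline-based one:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Leaky: the scaler is fit on ALL rows, so each CV test fold has
# already influenced the statistics used to transform it.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Safe: the scaler is refit on each training fold inside the pipeline,
# so test folds stay unseen until scoring.
safe_scores = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5
)
```

With a simple scaler the numeric difference is often small, but with supervised steps like `SelectKBest` the leaky version can meaningfully overstate performance.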
## Tools Used
Full code is available in the repository `BU_OMDS_SU25_DX699B_RW`.