Overview
Built a machine learning pipeline to predict whether an Airbnb listing would receive a perfect review score.
Problem
The goal was to classify each listing as perfect-rated or not, using a mix of structured listing data and natural-language text.
Data / Inputs
- Nearly 100,000 training records.
- Numerical, categorical, and text-based listing fields.
- Review-oriented and host-oriented listing metadata.
Approach
- Cleaned and standardized structured features.
- Engineered derived columns for business-relevant signals.
- Applied TF-IDF to text fields and reduced dimensionality where helpful.
- Trained and compared logistic regression, random forest, and XGBoost models.
- Tuned the probability thresholds used to turn predicted probabilities into class labels, so the predictions were more useful for business decisions.
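The threshold-tuning step can be illustrated with a minimal sketch. It uses only the Python standard library and made-up toy data, and it assumes F1 as the metric being optimized. The write-up does not state which metric or candidate thresholds were actually used, so the function names, the data, and the candidate grid here are all hypothetical.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are labeled positive."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return (threshold, f1) for the candidate with the highest F1.

    If several candidates tie, the first one in `candidates` is returned.
    """
    scored = ((t, f1_at_threshold(y_true, y_prob, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Toy validation data (hypothetical, not from the project):
# 1 = perfect review score, 0 = not perfect.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.35, 0.48, 0.55, 0.70, 0.90]

candidates = [0.2, 0.3, 0.4, 0.5, 0.6]
best_t, best_f1 = best_threshold(y_true, y_prob, candidates)
default_f1 = f1_at_threshold(y_true, y_prob, 0.5)

print(f"best threshold: {best_t}, F1: {best_f1:.3f}")
print(f"F1 at 0.5:      {default_f1:.3f}")
```

On this toy data the best candidate is below 0.5, which is the kind of gap threshold tuning is meant to find. In practice the threshold should be chosen on validation data, not on the test set, and the metric should reflect the actual cost of false positives versus false negatives.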
Results
- Reached a validation ROC-AUC of approximately 0.815.
- Improved classification performance through feature selection and threshold tuning.
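When reading these two results together, it helps to remember that ROC-AUC depends only on how the model ranks listings, not on any cutoff. Threshold tuning therefore changes thresholded metrics (precision, recall, F1) but leaves ROC-AUC unchanged. The sketch below shows this with a pairwise ROC-AUC computed in plain Python on made-up scores. The function name and data are hypothetical and are not taken from the project.

```python
def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is O(n_pos * n_neg), so it is for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy validation data (hypothetical): 1 = perfect review score.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.05, 0.15, 0.25, 0.35, 0.48, 0.55, 0.70, 0.90]

auc = roc_auc(y_true, y_score)

# Squaring preserves the ordering of these positive scores, so the
# ranking, and therefore the ROC-AUC, is unchanged.
auc_squared = roc_auc(y_true, [s ** 2 for s in y_score])

print(auc, auc_squared)
```

For a dataset of the size described here, a sort-based implementation such as scikit-learn's `roc_auc_score` would be the practical choice.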
What I Learned
This project sharpened my understanding of how feature engineering, threshold decisions, and text representation choices affect real-world classification outcomes.