My first submission
- No text cleaning (but filtering nltk stopwords for Catalan)
- Vectorization: played with bag-of-words & tf-idf
- Model: logistic regression & a linear SVM classifier (a minimal sketch of this baseline follows below)
- Score on public testing dataset:
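A rough sketch of that baseline, not the original code: the file name, column names, and train/validation split below are assumptions for illustration.

```python
# Minimal sketch of the first-submission pipeline (illustrative, not the original code).
# Assumes the training data sits in train.csv with "text" and "label" columns.
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("train.csv")
catalan_stopwords = stopwords.words("catalan")  # requires nltk.download("stopwords")

pipeline = Pipeline([
    # Swap TfidfVectorizer for CountVectorizer to get plain bag-of-words
    ("vect", TfidfVectorizer(stop_words=catalan_stopwords)),
    # LogisticRegression() was the other model tried at this stage
    ("clf", LinearSVC()),
])

X_train, X_valid, y_train, y_valid = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
print("validation accuracy:", pipeline.score(X_valid, y_valid))
```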
Kaggle competitions vs Business
- The rest of the time was spent trying to get to
- Many techniques were tried, sometimes running into serious overfitting
- Feature engineering
- Model fine-tuning
Feature Engineering
- Stopword removal
- Lemmatization
- Accent removal
- Spelling corrections
- Different vectorization methods (a sketch follows this list)
- Unigrams and bigrams
- fastText word embeddings in Catalan, dim. 300
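As an illustration, two of these variants can be wired up as below; the helper names are made up, lemmatization and spelling correction are omitted, and the fastText file is the publicly released pre-trained Catalan model (cc.ca.300.bin).

```python
# Sketch of two feature-engineering variants; helper and file names are illustrative.
import unicodedata

import fasttext
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def remove_accents(text: str) -> str:
    """Strip diacritics, e.g. 'adreça' -> 'adreca'."""
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))

# Sparse features: TF-IDF over unigrams and bigrams of the accent-stripped text
vectorizer = TfidfVectorizer(preprocessor=remove_accents, ngram_range=(1, 2), min_df=2)

# Dense features: 300-dimensional document vectors from pre-trained Catalan
# fastText embeddings (download cc.ca.300.bin from fasttext.cc first)
ft_model = fasttext.load_model("cc.ca.300.bin")

def embed_documents(texts):
    # get_sentence_vector averages word vectors; it rejects newlines, so strip them
    return np.vstack([ft_model.get_sentence_vector(t.replace("\n", " ")) for t in texts])
```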
Model selection
- Wide range of scikit-learn models (spoiler: no model overtook LinearSVC)
- Comprehensive grid searches for optimal parameters (sketched after this list)
- Neural networks on top of fastText embeddings
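A hedged sketch of such a grid search over the TF-IDF + LinearSVC pipeline; the parameter grid below is illustrative, not the one actually used.

```python
# Illustrative grid search over the TF-IDF + LinearSVC pipeline;
# the parameter grid is a placeholder, not the competition values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2, 5],
    "clf__C": [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
# search.fit(X_train, y_train)   # X_train, y_train as in the baseline sketch
# print(search.best_params_, search.best_score_)
```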
Chronology
- Day 1: Competition starts, first submission scores
- ~Day 3: Other competitors start making submissions
- Day 4: Toni Lozano challenges the leading position
- Day 5: Toni takes leadership
- Day 6: I overtake with the first submission over
- Day 7: Toni overtakes with his final submission
Result
- The winner was decided by a difference of score points
- Both scores went down on the private validation dataset due to overfitting
Differences from doing ML in Production
- It rarely pays to over-optimize the model. Prefer slightly less performant models that can be explained.
- Performance and dependency management will become issues.
- Keep tracking code, data & models:
- Introduce dataset versioning & a feature store (Data Catalogue)
- Track & store models (Model Registry); one possible setup is sketched after this list
- Introduce good practices for software development
- Stay away from notebooks (ML Engineering)
- Test your code profusely
- Define validation metrics (Continuous Monitoring)
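The talk does not prescribe any particular tooling; as one possible concrete setup for the model-tracking point, an MLflow-based sketch could look like this (the run name, parameters, metric value, and registry name are all made up).

```python
# Illustrative only: MLflow is one possible choice for experiment tracking and
# a model registry; run name, params, metric, and registry name are made up.
import mlflow
import mlflow.sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy fitted pipeline just so the example is self-contained
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
pipeline.fit(["bon dia", "mal dia"], [0, 1])

with mlflow.start_run(run_name="linearsvc-tfidf"):
    mlflow.log_param("ngram_range", "(1, 1)")
    mlflow.log_param("C", 1.0)
    mlflow.log_metric("valid_accuracy", 0.91)  # placeholder value
    # Store the fitted pipeline and register it so it can be promoted or rolled back later
    mlflow.sklearn.log_model(pipeline, "model", registered_model_name="catalan-text-classifier")
```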
Thanks!
A few links:
Show challenge overview
Datathon: https://www.kaggle.com/competitions/archivalDatathon/
Profiling: file:///home/ber2/archivalDatathon/training_profile.html
Exploration: look for the imbalance in the classes
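A quick way to run that check, assuming the same hypothetical file and column names as in the baseline sketch:

```python
# Quick class-balance check; "train.csv" and "label" are assumed names.
import pandas as pd

df = pd.read_csv("train.csv")
print(df["label"].value_counts(normalize=True))
```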
The point of versioning data and models is reproducibility
Do not pay attention to good engineering practices: testing is superseded by validation, and code duplication is faster than solving Python import paths
Winning public score: 0.93262
Winning private score: 0.93111
4/7 people went above 0.90
2/7 people went above 0.93
Feature engineering typically has an impact about an order of magnitude larger than model fine-tuning