dirty_cat

Statistical learning on non-curated categorical data

Website
GitHub repository

Introduction

dirty_cat is a machine learning project initiated by Patricio Cerda as part of his PhD thesis.
dirty_cat helps with machine learning on non-curated categories.
It provides encoders that are robust to morphological variants, such as typos, in the category strings.
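
To give an intuition for what "robust to morphological variants" means, here is a minimal sketch of a character n-gram similarity encoding, written in plain Python. It is not dirty_cat's implementation: the function names (ngrams, ngram_similarity, similarity_encode), the padding with spaces, and the use of Jaccard similarity over 3-grams are assumptions chosen for illustration. The idea is that a value containing a typo still shares most of its n-grams with the clean category, so its encoding stays close to that of the clean value.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string (hypothetical helper).

    The string is lowercased and padded with one space on each side so that
    word boundaries contribute n-grams too.
    """
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings with no n-grams at all are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, reference_categories, n=3):
    """Encode each value as its similarities to the reference categories.

    Unlike one-hot encoding, a misspelled value does not get an all-zero
    row: it gets a high score on the category it most resembles.
    """
    return [
        [ngram_similarity(value, ref, n) for ref in reference_categories]
        for value in values
    ]


if __name__ == "__main__":
    references = ["police officer", "teacher"]
    dirty_values = ["police officer", "police oficer", "Teacher"]
    for value, row in zip(dirty_values, similarity_encode(dirty_values, references)):
        print(value, [round(score, 2) for score in row])
```

In this sketch the misspelled "police oficer" scores much higher against "police officer" than against "teacher", whereas a one-hot encoder would treat it as an unrelated category.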

For a more detailed description of the problem of encoding dirty categorical data, see the papers "Similarity encoding for learning with dirty categorical variables" and "Encoding high-cardinality string categorical variables".

My contributions to the project

I have been working on the project since October 2020, as part of my job as a software engineering apprentice at Inria.