dirty_cat
Statistical learning on non-curated categorical data
Website | GitHub repository
Introduction
dirty_cat is a machine learning project initiated by Patricio Cerda as part of his PhD thesis. dirty_cat helps with machine learning on non-curated categories.
It provides encoders that are robust to morphological variants, such as typos, in the category strings.
For a more detailed description of the problem of encoding dirty categorical data, see the papers "Similarity encoding for learning with dirty categorical variables" and "Encoding high-cardinality string categorical variables".
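To make the idea of an encoder that tolerates typos more concrete, here is a minimal sketch in plain Python. It is not dirty_cat's implementation or API. The helper names (`char_ngrams`, `ngram_similarity`, `similarity_encode`), the use of character 3-grams, and the Jaccard measure are all assumptions chosen for illustration. Each input string is encoded as a vector of its similarities to a small set of reference categories, so a misspelled value still lands close to the category it was meant to be.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string (hypothetical helper).

    The string is lowercased and padded with one space on each side so that
    word boundaries contribute n-grams too.
    """
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are too short to produce any n-gram.
        return 0.0
    return len(grams_a & grams_b) / len(union)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its list of similarities to every prototype."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Illustrative data: three clean reference categories and dirty inputs.
prototypes = ["london", "paris", "berlin"]
encoded = similarity_encode(["london", "lodnon", "pariss"], prototypes)

for row in encoded:
    print([round(x, 2) for x in row])
```

In this sketch an exact match gets a similarity of 1 to its own category, while the misspellings "lodnon" and "pariss" keep a non-zero similarity to "london" and "paris" respectively and a similarity of 0 to the unrelated categories. A one-hot encoder, by contrast, would treat each misspelling as an entirely new category.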