Skrub

A library for prepping tables for machine learning

Skrub (formerly dirty_cat) is a library for prepping tabular data for machine learning.

It is closely linked to scikit-learn, since the two libraries share several core contributors (such as Vincent Maladiere, Guillaume Lemaître, Julien Jerphanion, and of course, Gaël Varoquaux).

I started working on the library as soon as I joined Inria, in late 2020, and have been contributing ever since.

At its inception, the library had one objective: encoding non-normalized tabular data. For a congress at École Normale Supérieure Paris-Saclay, I created a short video showcasing the use cases and tools provided by the library at the time.

You should check it out if you’re interested in what the library does!

Starting in 2023, we expanded its scope significantly, and I’m happy to have contributed to the decisions (both technical and philosophical), to the technology, and to the promotion at multiple conferences and events!

I'm (quite proudly) the #1 contributor to the code 😊

Although it’s not my full-time job anymore, I still contribute to the library on and off, most often by giving input on its direction and on technical decisions.

Thanks for reading! Check out the project on GitHub!

References

2023

  1. skrub: prepping tables for machine learning
    J. Stojanovic, L. Boulard, and G. Varoquaux
    In Proceedings of the 15th European Conference on Python in Science, 2023

2022

  1. dirty_cat: a library for machine learning on dirty categorical data
    L. Boulard, G. Varoquaux, and P. Cerda
    In Proceedings of the 14th European Conference on Python in Science, 2022
  2. dirty_cat: a Python package for Machine Learning on Dirty Categorical Data
    L. Boulard, G. Varoquaux, and P. Cerda
    In Proceedings of the 1st Paris-Saclay University Multidisciplinary Junior Congress, 2022