Easy fast .apply for pandas

Author

Kevin Jablonka

apply in pandas is slow. This is the case because it does not take advantage of vectorization. That means, in general, if you have something for which there is a built-in pandas (or numpy) function, you should use that instead of apply, because those functions will be optimized and typically vectorized.

The pandarallel package allows you to parallelize apply on a pandas DataFrame or Series object. It does this by using multiprocessing. However, since it uses multiple processes, it will use more memory than a simple apply.

If your data just barley fits in memory, you should not use pandarallel. However, if it does fit in memory, and you have a lot of cores, then pandarallel can speed up your code significantly with just changing one line of code.

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

# df.apply(func)
df.parallel_apply(func)
Guo, Jeff, and Philippe Schwaller. 2024. “It Takes Two to Tango: Directly Optimizing for Constrained Synthesizability in Generative Molecular Design.” arXiv. https://doi.org/10.48550/ARXIV.2410.11527.
Landrum, Gregory A., Jessica Braun, Paul Katzberger, Marc T. Lehner, and Sereina Riniker. 2024. “Lwreg: A Lightweight System for Chemical Registration and Data Storage.” Journal of Chemical Information and Modeling 64 (16): 6247–52. https://doi.org/10.1021/acs.jcim.4c01133.
Orsi, Markus, and Jean-Louis Reymond. 2024. “One Chiral Fingerprint to Find Them All.” Journal of Cheminformatics 16 (1): 53.