Easy fast .apply for pandas

Author

Kevin Jablonka

apply in pandas is slow because it calls a Python function once per row, column, or element and does not take advantage of vectorization. In general, if there is a built-in pandas (or NumPy) function that does what you need, you should use it instead of apply, because those functions are optimized and typically vectorized.
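To make the difference concrete, here is a minimal sketch that computes the same column sum twice: once with a row-wise apply and once with a vectorized expression. The DataFrame, its column names `a` and `b`, and the row count are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 10,000 rows with two numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(10_000), "b": rng.random(10_000)})

# Row-wise apply: the Python lambda is called once per row.
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized equivalent: one operation on whole columns.
fast = df["a"] + df["b"]

# Both approaches produce the same values.
assert np.allclose(slow.to_numpy(), fast.to_numpy())
```

Timing the two lines (for example with `%timeit` in a notebook) should show the vectorized version to be much faster, although the exact factor depends on your data and machine.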

The pandarallel package allows you to parallelize apply on a pandas DataFrame or Series object. It does this by using multiprocessing. However, since it uses multiple processes, it will use more memory than a simple apply.
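Before reaching for multiple processes, it can help to check how large the DataFrame actually is and how many cores are available. The sketch below uses only pandas and the standard library; the example data is hypothetical, and the factor of two is an assumed rule of thumb for the extra memory, not a guarantee.

```python
import os

import numpy as np
import pandas as pd

# Hypothetical example data; replace with your own DataFrame.
df = pd.DataFrame({"x": np.arange(1_000_000), "y": np.ones(1_000_000)})

# Bytes used by the DataFrame; deep=True also counts object (e.g. string) columns.
df_bytes = int(df.memory_usage(deep=True).sum())

# Number of logical cores visible to Python (os.cpu_count() can return None).
n_cores = os.cpu_count() or 1

# Assumption: budget for roughly one extra copy of the data when parallelizing.
rough_budget = 2 * df_bytes

print(f"DataFrame: {df_bytes / 1e6:.1f} MB, cores: {n_cores}")
print(f"Rough memory budget: {rough_budget / 1e6:.1f} MB")
```

If the rough budget is close to the memory your machine has free, a serial apply is the safer choice.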

If your data just barely fits in memory, you should not use pandarallel. However, if it fits comfortably in memory and you have a lot of cores, pandarallel can speed up your code significantly by changing just one line of code.

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

# Serial version: df.apply(func)
# Parallel version: only the method name changes
df.parallel_apply(func)
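For a version you can run end to end, here is a small sketch with a made-up DataFrame and a made-up `func`. It applies the function row-wise (`axis=1`), uses two workers as an arbitrary choice, and falls back to the serial apply if pandarallel is not installed.

```python
import pandas as pd

# Hypothetical example data and function, for illustration only.
df = pd.DataFrame({"a": range(1_000), "b": range(1_000)})


def func(row):
    # Any per-row Python logic; here, a simple sum of two columns.
    return row["a"] + row["b"]


try:
    from pandarallel import pandarallel

    # nb_workers=2 is an arbitrary choice for this sketch.
    pandarallel.initialize(nb_workers=2, progress_bar=False, verbose=0)
    result = df.parallel_apply(func, axis=1)
except ImportError:
    # Fall back to the serial version if pandarallel is not installed.
    result = df.apply(func, axis=1)

# The parallel result should match the serial one.
expected = df.apply(func, axis=1)
assert result.tolist() == expected.tolist()
```

On a DataFrame this small, the cost of starting the worker processes will likely outweigh any gain; the speedup only shows up when `func` is expensive or the data is large.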