from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
# df.apply(func)
df.parallel_apply(func)
Easy fast .apply for pandas
apply in pandas is slow because it does not take advantage of vectorization: it calls a Python function once per row or element. In general, if a built-in pandas (or numpy) function does what you need, use it instead of apply, because those functions are optimized and typically vectorized.
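As a minimal sketch of that difference, the snippet below computes the same column sum twice, once with a row-wise apply and once with a vectorized expression. The DataFrame and its column names (a, b) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"a": np.arange(1_000), "b": np.arange(1_000) * 2})

# Row-wise apply: the lambda is called once per row, in Python.
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized equivalent: a single operation over whole columns.
fast = df["a"] + df["b"]

# Both approaches produce the same values.
assert (slow == fast).all()
```

Timing the two lines (for example with timeit) on your own data is the simplest way to see how large the gap is in practice.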
The pandarallel package lets you parallelize apply on a pandas DataFrame or Series object. It does this using multiprocessing. Because it runs multiple processes, it uses more memory than a plain apply.
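To make the idea concrete, here is a rough sketch of chunked, multi-process apply using only the standard multiprocessing module. It is not pandarallel's actual implementation; the helper names (slow_func, apply_chunk, split_series) and the worker count are assumptions chosen for illustration.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def slow_func(x):
    # Hypothetical stand-in for an expensive per-element function.
    return x * x + 1


def apply_chunk(chunk):
    # Runs in a worker process: an ordinary apply on one slice of the data.
    return chunk.apply(slow_func)


def split_series(series, n_chunks):
    # Split by position so each chunk keeps its original index labels.
    bounds = np.linspace(0, len(series), n_chunks + 1, dtype=int)
    return [series.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


if __name__ == "__main__":
    series = pd.Series(range(100_000))
    chunks = split_series(series, n_chunks=4)
    # Each chunk is pickled and sent to a worker, so the data exists in
    # more than one process at once; this is where the extra memory goes.
    with mp.Pool(processes=4) as pool:
        parts = pool.map(apply_chunk, chunks)
    result = pd.concat(parts)
    assert result.equals(series.apply(slow_func))
```

The worker functions are defined at module level so they can be pickled, and the pool is created under the `__main__` guard so the script also works on platforms that start workers by re-importing the module.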
If your data just barely fits in memory, you should not use pandarallel. However, if it fits with room to spare and you have a lot of cores, pandarallel can speed up your code significantly by changing just one line of code.
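One way to get the numbers behind that decision is to measure the DataFrame's footprint and count the available cores. The sketch below uses a made-up DataFrame; how much headroom counts as enough is a judgment call that depends on your machine and workload.

```python
import os

import pandas as pd

# Hypothetical toy data; substitute your own DataFrame.
df = pd.DataFrame({"text": ["some string"] * 1_000, "value": range(1_000)})

# deep=True also counts the Python objects behind object-dtype columns.
footprint_bytes = int(df.memory_usage(deep=True).sum())

# os.cpu_count() can return None, so fall back to 1.
n_cores = os.cpu_count() or 1

print(f"DataFrame footprint: {footprint_bytes / 1e6:.2f} MB, cores: {n_cores}")
```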