A curated list of awesome Python frameworks, libraries, software and resources.
Table of Contents
- Pandas Shortcomings
- Almost Under Every Hood: Apache Arrow
- In-Memory DataFrame Libraries
- Distributed Computing Libraries
- GPU Libraries
- Other Libraries and Ports from R
We all love pandas
and most of us have learnt how to do data manipulation using this amazing library. However useful and great pandas
is, unfortunately it has some well-known shortcomings that developers have been trying to address in the past years. Here are the most common weak points:
- You might read that
pandas
generall requires as much RAM as 5-10 times the dataset you are working with, mostly because many operations generate under the hood a in-memory copy of the data; pandas
eagerly executes code, i.e. executes every statement sequentially. For example, if you read a.csv
file and then filter it on a specific column (say,col2 > 5
), the whole dataset will be first read into memory and then the subset you are interested in will be returned. Of course, you could manually writepandas
command sequentially to improve on efficiency, but one can only do so much. For this reason, many of the pandas alternatives implementlazy
evaluation - i.e. do not execute statements until a.collect()
or.compute()
method is called - and include a query execution engine to optimise the order of the operations (read more here).
This awesome-repo aims to gather the libraries meant to overcome pandas
weaknesses, as well as some resources to learn them. Everyone is encouraged to add new projects, or edit the descriptions to correct potential mistakes.
Most of these libraries leverage Apache Arrow, "a language-independent columnar memory format". In other words, unlike good old .csv
files that are stored by rows, Apache Arrow storage is (also) column-based. This allows partitioning the data in chunks with a lot of clever tricks to enable greater compression (like storing sequences of repeated values) and faster queries (because each chunk also stores metadata like the min or max value).
Arrow offers Python bindings with its Python API, named pyarrow
. This library has modules to read and write data with either Arrow formats (.parquet
most notably, but also .feather
) and other formats like .csv
and .json
, but also data from cloud storage services and in a streaming fashion, which means data is processed in batches and does not need to be read wholly into memory. The module pyarrow.compute
allows to perform basic data manipulation.
On its own, pyarrow
is rarely being used as a standalone library to perform data manipulation: usually more expressive and feature rich modules are built upon Arrow, especially on its fast C or Rust API interface. For this reason, most of the libraries listed here will display a general landing page and links to other languages APIs (mostly, Python and R). To be honest, the R {arrow}
interface has a backend for {dplyr}
(the equivalent of pandas
in R), which makes its use more straightforward. Development for the R {arrow}
package is quite active!
These libraries leverage Apache Arrow memory format to implement a parallelised and lazy execution engine. These are designed to take advantage of all the cores (and threads) of a machine, but are mainly geared towards dealing with in-memory data (i.e., with read_csv()
-like functions to read the data into the memory of your machine, instead of processing it on a cluster node).
polars
: Polars claims to be "a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model". It leverages (quasi-)lazy evaluation, uses all cores, has multithreaded operations and its query engine is written in Rust.polars
has an expressive and high-level API that enables complex operations.duckdb
: is another fairly recent DataFrame library. It offers both a SQL interface and a Python API: in other words, it can be used to query.csv
and Arrow's.parquet
files, but also in-memorypandas.DataFrame
s using both SQL and a syntax closer to Python orpandas
. It supports window functions.duckdb
has a very nice blog that explains its optimistations under the hood: for example, here you can find an overview of how Apache's.parquet
format works and the performance tricks used by the query engine to run several orders of magnitude faster thanpandas
.
Compared to the libraries above, the following are meant to orchestrate data manipulation over clusters, i.e. distribute computing across several nodes (multiple machines with multiple processors) via a query execution engine. Since they need to plan execution preemptively, they can be slower than pandas
on a single-core machine.
These are the "first generation" of query planners that - as of now - are not built around columnar storage. Nonetheless, they represent the industry standard for distributed computing, and offer much more than data manipulation: they can even implement machine learning libraries to train models on the cluster. The downside is that moving from pandas
to, say, Apache's spark
is not straightforward, as the API syntax can differ.
dask
is among the first distributed computing DataFrame libraries, alongsidespark
.dask
does not only offer a distributed computing equivalent ofpandas
, but also ofnumpy
andscikit-learn
, so that "you don't have to completely rewrite your code or retrain to scale up", i.e. to run code on distributed clusters. A Dask DataFrame "is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index".spark
: the de-facto leader of distributed computing, now it is faring more competition. It offersMLib
, a machine learning library, as well asGraphX
for "graphs and graph-parallel computation". Its Python API ispyspark
.
The "next generation" of distributed query planners, that leverage columnar-based storage.
arrow-datafusion
: this is Arrow's query engine, written in the Rust programming language. Much likeduckdb
,datafusion
offers both a SQL and Python-like API interface. Much likepyspark
,datafusion
"allows you to build a plan through SQL or a DataFrame API against in-memory data, parquet or CSV files, run it in a multi-threaded environment, and obtain the result back in Python".ballista
is the distributed query engine, written in Rust, built on top ofarrow-datafusion
. Here is a comparison betweenballista
andspark
.ballista
will offer a Python client API.
The following libraries leverage Ray as their default distributing engine. Ray is a Python API for building distributed applications (or, in technical jargon, "scaling your code") - i.e., it offers a series of functions to make your code run on multiple computing nodes. Using their own words, Ray is a "cloud-provider independent compute launcher/autoscaler", i.e. can be used seamlessly on cloud platforms such as GoogleCloud, Amazon AWS and the likes. Ray can be used to parallelise code - such as your own scripts - but also for every step of machine learning: hyperparameter tuning and model serving, for instance. The community has built many libraries to build a bridge between Ray and libraries such as XGBoost and scikit-learn, lightGBM, PyTorch Lightning, Ariflow, PyCaret and many more.
Ray itself has a Datasets "loading and pre-processing library" to load and exchange data in Ray libraries and applications. As explained in the blogpost of the official launch, however, "Datasets is not intended as a replacement for generic data processing systems like Spark. Datasets is meant to be the last-mile bridge between ETL pipelines and distributed applications running on Ray". Under the hood, though, Ray Datasets still leverages Apache Arrow to store in-memory tables in the distributed clusters.
modin
attempts to parallelize as much of thepandas
API as is possible. Its developers claim to "have worked through a significant portion of the DataFrame API" such much so thatmodin
"is intended to be used as a drop-in replacement forpandas
, such that even if the API is not yet parallelized, it is still defaulting topandas
". The library is currently under active development.- Under the hood,
modin
calls either todask
,ray
or OmiSciDB to automatically orchestrate the tasks across cores on your machine or cluster. A good overview of what happens is here. - Offers a spreadsheet-like API to modify dataframes and a whole lot of other experimental features, like using SQL queries on DataFrames and even fitting XGBoost models using distributed computing.
- Has excellent documentation, including explanatory articles on the differences between
modin
andpandas
ordask
(also here).
- Under the hood,
mars
is a "unified framework for large-scale data computation", likedask
, but is tensor-based. It offers modules to scale numpy, pandas, scikit-learn and many other libraries.
These are libraries that implement an abstract layer to make pandas
code easier to reuse across distributed frameworks, mainly dask
and spark
. It is to be noted that, on October 2021, pyspark
adopted a pandas
-like API.
fugue
is "a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites". Simply put,fugue
adds an abstract layer that makes code portable between across differing computing frameworks.koalas
is a wrapper aroundspark
that offerspandas
-like APIs. It is no longer developed, sincepyspark
adopted thepandas
API andkoalas
was merged intopyspark
.
Generally libraries work on CPUs and clusters are usually made up of CPUs. Apart from some notable exceptions, such as deep learning libraries like Tensorflow and PyTorch, usually regular libraries do not work on GPUs. This is due to major architectural differences across the two chips.
cuDF
is a GPU dataframe library, which is part of the RapidsAI framework, that enables "end-to-end data science and analytics pipelines entirely on GPUs". There are many other libraries, likecuML
for machine learning,cuspatial
for spatial data manipulations, and more.cuDF
is based on Apache Arrow, because the memory format is compatible with both CPU and GPU architecture.blazingSQL
is "is a GPU accelerated [distributed] SQL engine built on top of the RAPIDS ecosystem" and, as such, leverages Apache Arrow. Think of this as Apachespark
on GPU.
R has an amazing library called {dplyr}
that enables easy data manipulation. {dplyr}
is part of the so-called {tidyverse}
, a unified framework for data manipulation and visualisation.
pyjanitor
was originally conceived as apandas
extension of the well-known{janitor}
R package. The latter was a package to clean strings with ease in a R'data.frame
ortibble
objects, but later incorporated new methods to make it more similar to{dplyr}
. Adding and removing column, for example, is easier with the dedicated methodsdf.remove_column()
anddf.add_column()
, but also renaming column is easier withdf.rename_column()
. This enables to run smoother pipelines that exploit method chaining.pydatatable
is a Python port of the astounding{data.table}
library in R, that achieves impressive results thanks to parallelisation.pandasql
allows to querypandas.DataFrame
s using SQL syntax.