Big Data

Pandas for Big Data

I wondered what could be done with Python on what I consider to be Big Data. So I wrote a Python script that read 8,639,828 X and Y values (time series data) into a Pandas DataFrame and then calculated descriptive statistics on the Y channel. All of that took only 1.2 seconds. I repeated the test on time series data with 10 million values, and the data was read and the statistics calculated in less than 2 seconds. At 50 million values, NumPy was unable to allocate enough memory to read and analyze the data. I see a lot of potential.
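The experiment above can be approximated with a short sketch. This is not the original script: it assumes synthetic data generated in memory instead of values read from a file, and the function name `describe_y` and the column names `x` and `y` are made up for illustration. Timings will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd


def describe_y(n_rows: int, seed: int = 0) -> pd.Series:
    """Build a synthetic X/Y time series and return descriptive statistics for Y."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "x": np.arange(n_rows, dtype=np.float64),  # sample index standing in for time
            "y": rng.normal(size=n_rows),  # random values standing in for the Y channel
        }
    )
    # describe() returns count, mean, std, min, quartiles, and max
    return df["y"].describe()


if __name__ == "__main__":
    start = time.perf_counter()
    stats = describe_y(1_000_000)
    elapsed = time.perf_counter() - start
    print(stats)
    print(f"elapsed: {elapsed:.2f} s")
```

Raising the row count in steps is a simple way to find where a given machine starts to run out of memory.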

Dask

Dask lets you run pandas, NumPy, and ML operations on datasets that are too large to fit in memory by splitting the data into partitions and processing them in parallel. Learn how Dask works and how to use it.


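# Install Dask with DataFrame support (run in a shell, not in Python):
#   pip install "dask[dataframe]"

# Import dask.dataframe
import dask.dataframe as dd

# Lazily read a large CSV file into a Dask DataFrame named ddf
ddf = dd.read_csv(r"data.csv")

# See the number of partitions (try to keep each one under about 100 MB)
print(ddf.npartitions)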

Dask handles large arrays too.
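A minimal sketch of the array side, assuming Dask is installed with array support (for example via `pip install "dask[array]"`). The array size and chunk size here are arbitrary values picked for illustration, not recommendations.

```python
import dask.array as da

# One million integers split into ten chunks of 100,000 elements each
x = da.arange(1_000_000, chunks=100_000)

# Building the expression is lazy; nothing has been calculated yet
mean_expr = x.mean()

# compute() runs the calculation chunk by chunk and returns a concrete value
mean_value = mean_expr.compute()

print(x.numblocks)
print(mean_value)
```

Because each chunk is processed separately, the full array never has to sit in memory as a single NumPy array.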

Visualization

Datashader is an open-source Python library for visualizing big data (millions of values). It accepts tabular data in Pandas or Dask, multidimensional arrays in xarray or CuPy, columnar data in cuDF, and ragged arrays in SpatialPandas.

Tutorial: How to Render Huge Datasets in Python through Datashader

Conversion

How to filter large (GB) CSV files for import into Pandas
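One common way to do this, which is not necessarily the approach the linked article takes, is to stream the file one row at a time and write only the rows you need to a smaller CSV. The sketch below uses only the standard library. The function name `filter_csv`, the threshold rule, and the file names in the usage note are hypothetical.

```python
import csv


def filter_csv(src_path, dst_path, column, min_value):
    """Copy rows whose numeric value in `column` is at least min_value.

    Rows are streamed one at a time, so memory use stays small
    no matter how large the source file is.
    """
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row[column]) >= min_value:
                writer.writerow(row)
                kept += 1
    return kept
```

A call such as `filter_csv("big.csv", "small.csv", "y", 0.0)` would produce a reduced file that can then be loaded with `pandas.read_csv` as usual.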

How to use SQL instead of Pandas

Practical SQL for Data Analysis by Haki Benita is loaded with great examples of how to use SQL to perform many of the tasks you might otherwise use Pandas for, while using less memory and running faster.
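As a small illustration of the idea, the sketch below computes a few descriptive statistics in SQL instead of Pandas. It is not taken from that article: it assumes Python's built-in `sqlite3` module and an in-memory database, and the table name `samples` and function name `y_statistics` are made up for the example.

```python
import sqlite3


def y_statistics(rows):
    """Load (x, y) pairs into an in-memory SQLite table and summarize y in SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE samples (x REAL, y REAL)")
        conn.executemany("INSERT INTO samples (x, y) VALUES (?, ?)", rows)
        # The database does the aggregation and returns a single summary row
        count, mean, low, high = conn.execute(
            "SELECT COUNT(y), AVG(y), MIN(y), MAX(y) FROM samples"
        ).fetchone()
    finally:
        conn.close()
    return {"count": count, "mean": mean, "min": low, "max": high}


if __name__ == "__main__":
    print(y_statistics([(0, 1.0), (1, 2.0), (2, 6.0)]))
```

With a database file on disk in place of `:memory:`, only the summary row has to be pulled into Python, not the full dataset.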