Big Data Analysis

File Based Storage

TDMS

The NI TDMS format is an open file format that efficiently stores both data and metadata within a binary file.   NI provides several interface APIs for reading and writing this file format.   Details on the internal structure may be viewed here.  

A Python library named npTDMS permits reading and writing the TDMS file format.   Using npTDMS to read the file and NumPy to perform the statistical analysis, I was able to compute basic statistics on a TDMS file containing a channel of 10 million values; reading the file and performing the analysis took 2.5 seconds.   I tried a similarly structured file with 50E6 (50 million) values, but it failed to allocate sufficient memory to analyze the file.  

HDF5

Hierarchical Data Format 5 (HDF5) stores heterogeneous Big Data that you can then manipulate using NumPy (Python) and many other languages.   The HDF5 format supports embedding of metadata.   It is an open format that is free to use.   neonscience.org has a very useful description of the file format, including visuals.  
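A minimal sketch of the hierarchy-plus-metadata idea, using h5py (one of several Python bindings for HDF5); the group, dataset, and attribute names here are hypothetical.  

```python
import numpy as np
import h5py

# Create an HDF5 file with a hierarchical group/dataset layout and
# attach metadata as attributes (hypothetical names for illustration).
with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("sensors/temperature")
    dset = grp.create_dataset("readings", data=np.arange(100.0),
                              compression="gzip")
    dset.attrs["units"] = "degC"
    dset.attrs["sample_rate_hz"] = 10.0

# Reading back: slicing a dataset reads only the requested region
# from disk, so files larger than RAM remain usable.
with h5py.File("experiment.h5", "r") as f:
    dset = f["sensors/temperature/readings"]
    first = dset[:5]              # partial read: only 5 values loaded
    units = dset.attrs["units"]
print(first, units)
```

The partial-read behavior is a key contrast with loading an entire channel into memory, as in the TDMS example above.  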

 

Cloud Storage & Analysis

Apache Spark

Apache Spark is a multi-language engine for executing data science and machine learning workloads.   The open source data processing framework uses distributed and parallel computing to process truly Big Data.   The API hides most of the complexity of a distributed processing engine behind simple method calls.   Use languages such as Python, R, C#, SQL, Java, Scala, and F# to interface with it.   Machine learning is supported through Spark MLlib.  

In Apache Spark, two types of nodes exist: the master (exposed through the SparkContext in PySpark), which is the main computer of the cluster, and many workers.   The master organizes the work, distributes it among the workers, and then retrieves the results.  

A Python library for Spark is PySpark   |   Tutorial: PySpark and SparkSQL Basics   Tutorial: How to install PySpark locally   Best PySpark Tutorial for Beginners-Learn Spark with Python   Tutorial: Connect PySpark to Google Cloud SQL  

Top 8 Alternatives to Apache Spark

Hosted Apache Spark

Hosted solutions for Apache Spark include Databricks, Amazon EMR, and several others.  

Databricks

You can get completely free access to the Databricks cloud-hosted Apache Spark engine via the Databricks Community Edition.   It uses Amazon Web Services (AWS), but no AWS cost is incurred under the Databricks Community Edition.   Databricks 'notebooks' (compatible with IPython) created in the environment are made public to promote community development.   Databricks Connect allows you to connect your favorite IDE on your local PC (e.g. VS Code) to Databricks clusters.   Databricks Connect tutorial   |   This tutorial describes how to configure a FREE Databricks 'community edition' account with a driver node (2 cores), 15 GB RAM, but no worker node.   Databricks Community Edition tutorial  

 

Links

A Neanderthal’s Guide to Apache Spark in Python

What is Apache Spark? The big data platform that crushed Hadoop

Steps to Install Apache Spark

Introduction to Spark MLlib

How to Speed Up Your Python Code through PySpark

 

Visualization

Datashader is an open source Python library for visualizing big data (millions of values).   It supports tabular data in Pandas or Dask, multidimensional arrays in xarray or CuPy, columnar data in cuDF, and ragged arrays.   Tutorial: How to Render Huge Datasets in Python through Datashader