Big Data

In scientific computing, we are used to HDF5 or netCDF. Although it is possible to use HDF5 for big data, industry has built many tools for better performance.



Batch processing is for extremely large datasets and for computations that need access to the whole set of data.

  1. Hadoop

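Hadoop comes with three core components:

  1. HDFS: the distributed file system
  2. YARN: the cluster resource manager and job scheduler
  3. MapReduce: the batch processing engine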

Data processing with Hadoop MapReduce reads from and writes to the storage devices extensively, because intermediate results are persisted to disk between stages.
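To make the MapReduce model concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration. In a real cluster the grouped pairs would be written to disk and moved between machines, which is where the heavy storage traffic comes from.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, mimicking the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input standing in for a very large file split into lines.
lines = ["big data big tools", "data streams"]
counts = word_count(lines)
print(counts)
```

The map and reduce functions only ever see one line or one key at a time, which is what allows a framework to spread them over many machines.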


Stream processing is best for calculations on small pieces of data as they stream through the system.

  1. Storm: extremely low latency, close to real time
  2. Samza: designed to work together with Kafka
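The idea of computing on a small piece of data as it passes through can be sketched with a Python generator. This is not the Storm or Samza API; `sliding_mean` and `sensor_readings` are hypothetical names, and the short list of numbers stands in for an unbounded stream. Only the last few records are kept in memory at any time.

```python
from collections import deque


def sliding_mean(stream, window=3):
    """Yield the mean of the most recent `window` records as each one arrives."""
    buffer = deque(maxlen=window)  # old records fall out automatically
    for value in stream:
        buffer.append(value)
        yield sum(buffer) / len(buffer)


def sensor_readings():
    """Hypothetical source; a real stream would never end."""
    for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
        yield value


# Each result is available as soon as its record has been seen.
means = list(sliding_mean(sensor_readings(), window=3))
print(means)
```

In contrast to the batch case, no result here requires access to the whole dataset, so latency is bounded by the window and not by the data size.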


Hybrid frameworks handle both batch and stream processing.

  1. Spark
  2. Flink


© 2016-2018, Lei Ma