#cs451

Hadoop is great, but writing raw MapReduce jobs takes a ton of boilerplate and repetition. Higher-level abstractions built on top of it help:

  • Hive
  • Pig

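Spark

Generally more efficient than Hadoop MapReduce (it can keep intermediate data in memory instead of writing to disk between stages) and more usable (a higher-level API with far less boilerplate)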

RDDs (Resilient Distributed Datasets)

  • collections of objects spread across a cluster
  • built through parallel transformations
  • automatically rebuilt on failure
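
The three properties above can be sketched with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `parallelize`, `lose_partition`, and the round-robin partitioning are all hypothetical names and choices made up for illustration. The idea it tries to show is that each dataset remembers the recipe (lineage) that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class):
    a set of partitions plus the recipe needed to rebuild each one."""

    def __init__(self, compute: Callable[[int], List], num_partitions: int):
        self._compute = compute  # lineage: how to (re)build partition i
        self.num_partitions = num_partitions
        self._cache: List[Optional[List]] = [None] * num_partitions

    @classmethod
    def parallelize(cls, data: List, num_partitions: int) -> "ToyRDD":
        # Spread the collection across partitions (round-robin, an assumption).
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(lambda i: list(chunks[i]), num_partitions)

    def map(self, f: Callable) -> "ToyRDD":
        # Each partition is transformed independently of the others,
        # which is what would let a real cluster do this in parallel.
        return ToyRDD(lambda i: [f(x) for x in self.partition(i)],
                      self.num_partitions)

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda i: [x for x in self.partition(i) if pred(x)],
                      self.num_partitions)

    def partition(self, i: int) -> List:
        # A missing partition is rebuilt from its lineage on demand.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure by dropping the materialized data.
        self._cache[i] = None

    def collect(self) -> List:
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


numbers = ToyRDD.parallelize(list(range(10)), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_partition(1)   # pretend the node holding partition 1 died
after = even_squares.collect()   # partition 1 is recomputed from lineage
print(before == after)
```

Everything here runs sequentially in one process, so it only models the bookkeeping, not the distribution or the scheduling that a real cluster framework handles.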