Cluster Computing with Apache Spark
Apache Spark is a general engine for working with cluster-scale data. This talk will introduce core concepts such as Map/Reduce and Resilient Distributed Datasets (RDDs), give an overview of the Spark platform, and get into some code.
Spark offers high-level operators that make it easy to build cluster-scale applications in only a few lines of code. One of the core abstractions in Spark is the Resilient Distributed Dataset (RDD), “a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner”. Spark has language bindings for Scala, Java, and Python.
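To make the Map/Reduce idea concrete, here is a minimal plain-Python sketch of the pattern (not Spark's API; the sample data and names are illustrative). Spark's RDD operators such as `flatMap` and `reduceByKey` follow the same shape, but distribute the work across a cluster with fault tolerance:

```python
from functools import reduce

# Illustrative input; in Spark this would be an RDD of lines.
lines = ["spark makes cluster computing simple", "spark is fast"]

# Map phase: emit one (word, 1) pair per word.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts for each key.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # e.g. {'spark': 2, 'makes': 1, ...}
```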
Spark started as a research project at the UC Berkeley AMPLab in 2009, was open sourced in 2010, and became a top-level Apache project in 2014. The major Hadoop vendors began offering commercial support soon after.
Java, Python, Scala, Spark, Hadoop
This is a new talk. I haven't given it before. I've been working with Spark for the past year. I've spoken at Open Source Bridge the past two years.
Todd Lisonbee is a Technical Lead on the Big Data Analytics team at Intel. He is passionate about agile development, clean code, and automated testing. He presented previously at Open Source Bridge in 2013 and 2014.