Spark has well-known performance benefits over the classic MapReduce paradigm, but it also embraces the diversity you find in today's workflows. A typical developer nowadays knows a handful of programming languages well and another dozen or two so-so. Languages are the colors on your painting palette: you mix and match them to create the result you're aiming for.

Spark natively provides support for many popular development languages like Scala, Java, Python and R. For example, compare the following snippets.

Closures in Python with Spark:

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Closures in Scala with Spark:

val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Closures in Java with Spark:

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();

Spark natively provides a rich and ever-growing library of operators that simplify a developer's work: mapPartitions, reduce, reduceByKey, sortByKey, subtract, and so on. All of these should feel familiar to lambda-minded people, LINQ programmers, and the like, and they let you write compact code that multiple coding cultures can read and understand.
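
As a quick illustration, here is a minimal sketch in Scala of a few of these operators chained together; the input path and the knownNoise RDD are hypothetical stand-ins, not part of any real dataset:

// Hypothetical sketch: count ERROR lines per date and drop known-noisy entries.
val errorCounts = sc.textFile("hdfs://.../logs")           // placeholder input path
  .filter(line => line.contains("ERROR"))
  .map(line => (line.split(" ")(0), 1))                    // assume the first field is a date
  .reduceByKey(_ + _)                                      // sum the counts per date
  .sortByKey()                                             // order the results by date

errorCounts.subtract(knownNoise).take(10)                  // knownNoise: another (hypothetical) RDD of the same type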

An important note here is that while scripting frameworks like Apache Pig provide many high-level operators as well, Spark allows you to access these operators in the context of a full programming language—thus, you can use control statements, functions, and classes as you would in a typical programming environment.

rdd1.map(splitlines).filter(_.contains("ERROR"))
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)

This simple application expresses a complex flow of six stages. But the actual flow is completely hidden from the user — the system automatically determines the correct parallelization across stages and constructs the graph correctly. In contrast, alternate engines would require you to manually construct the entire graph as well as indicate the proper parallelization.
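
To make the "full programming language" point concrete, here is a hedged sketch in Scala; splitLines, onlyErrors, and the RDD names are made-up illustrations, not part of the original example:

// Ordinary Scala functions and control flow decide how the Spark pipeline is built.
def splitLines(line: String): Array[String] = line.split("\t")    // hypothetical record format

def interestingRecords(lines: org.apache.spark.rdd.RDD[String], onlyErrors: Boolean) = {
  val parsed = lines.map(splitLines)
  if (onlyErrors) parsed.filter(fields => fields.contains("ERROR")) else parsed
}

interestingRecords(rdd1, onlyErrors = true).take(10)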

Spark also lets you access your datasets through a simple yet specialized Spark shell for Scala and Python. With the Spark shell, developers and users can get started accessing their data and manipulating datasets without the full effort of writing an end-to-end application. Exploring terabytes of data without compiling a single line of code means you can understand your application flow by literally test-driving your program before you write it up.
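
For example, a short interactive session might look like the following (a hedged sketch: the shell is started with bin/spark-shell, and the HDFS path is just a placeholder):

// Inside the Scala Spark shell, the SparkContext is already available as `sc`.
val logs = sc.textFile("hdfs://.../access.log")             // placeholder dataset
logs.filter(line => line.contains("ERROR")).count()         // how many error lines are there?
logs.take(5).foreach(println)                               // peek at a few raw records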

While this post has focused on how Spark improves not only performance but also programmability, we shouldn't ignore one of the best ways to make developers more efficient: performance!

Developers often have to run an application many times over the course of development, working with subsets of data as well as full data sets as they repeat the develop/test/debug cycle. In a Big Data context, each of these iterations can be very onerous, with a single test cycle, for example, taking hours.

While there are various ways to alleviate this problem, one of the best is to simply run your program fast. Thanks to the performance benefits of Spark, the development lifecycle can be materially shortened simply because the test/debug cycles are much shorter.

Finally, Spark is just far simpler in terms of coding. Consider the typical word-count example written with MapReduce versus the Spark way:

public static class WordCountMapClass extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

And the same word count, the Spark way:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))   // split each line into words
                 .map(word => (word, 1))             // emit a (word, 1) pair per word
                 .reduceByKey(_ + _)                 // sum the counts for each word
counts.saveAsTextFile("hdfs://...")