Apache Spark is an open-source cluster computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Posts

Classic spam classification using Spark MLlib

Using MLlib naive Bayes for spam classification.

Spark GraphFrames basics

GraphFrames on Spark for the clueless.

Diverse Dataiku tricks

Assorted tips I collected while developing solutions on top of Dataiku.

Basic dataviz with Apache Zeppelin

About Zeppelin and the fun/great/useful things you can do with it.

The multi-cultural aspect of Spark

Spark speaks multiple languages, which lets developers use whichever is most appropriate to the task at hand.