Apache Zeppelin is a notebook-like system for data visualization which connects to Spark via diverse languages like Scala, Python, SQL and more. It allows you, much like Jupyter, to develop a story using markdown, angular/html/js, Python scripts, Scala…all mixed in one notebook. In addition, the Spark and SQL contexts let you crunch big data and do things which outside Zeppelin require SparklyR, pySpark and other hybrid solutions.

If you want to test-drive Zeppelin, it only takes a minute to spin up a Docker container, for example:

docker pull dylanmei/zeppelin
docker run --rm -p 8080:8080 dylanmei/zeppelin
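
There is also an official image on Docker Hub if you prefer to stay close to upstream (the tag below is just an example; check Docker Hub for the current ones):

docker run --rm -p 8080:8080 apache/zeppelin:0.8.0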

but there are many other containers out there, each with their own pros and cons. Once the container is running you can access the Zeppelin notebooks via http://192.168.99.100:8080, or whatever IP Docker uses as a default on your system:

[Screenshot: the Zeppelin home page]

How do you fetch data into the system? Simply use the shell interpreter:

%sh
wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The dataset is from the UCI data repo and contains wine measurements, see details here. If you want to see the files in the local dir, just use the usual ls in shell mode:

%sh
ls

Something Jupyter is missing is the Bootstrap-like organization Zeppelin has; you can place paragraphs side by side by adjusting their column widths, with 12 being a full-width row.

[Screenshot: Zeppelin's column layout]

How do you get the csv into Spark? Use the SparkContext sc and create a dataframe:

val wine = sc.textFile("file:///usr/zeppelin/wine.data") // RDD of raw csv lines
val df = wine.toDF()                                     // single-column dataframe
println(df.take(10).mkString("\n"))                      // show the first ten rows
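
Before moving on, a quick sanity check on the RDD doesn't hurt (the 178-row count assumes the UCI wine file came down intact):

// number of records; the UCI wine dataset has 178 rows
wine.count()
// the first raw csv line
wine.first()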

which prints the first ten rows of the dataset. At this point you can use all of Spark and Scala to do your thing. The multi-language (multi-cultural?) aspect of Zeppelin/Spark allows you, however, to use whatever suits you best. So, let's say you want to do some pre-analysis in SQL. Simply push the dataset as a table like so:

df.registerTempTable("wine")

and now you can do things like

%sql
select * from wine limit 10

You will notice that the output ain't pretty because we have not separated the csv fields properly. This can be done using a bit of Scala. Note that the first column in the file is the class label (the cultivar, 1 to 3) and the thirteen measurements follow it. First define a Wine type:

case class Wine(WineClass: Int,
                Alcohol: Float,
                MalicAcid: Float,
                Ash: Float,
                AlcalinityOfAsh: Float,
                Magnesium: Float,
                TotalPhenols: Float,
                Flavanoids: Float,
                NonflavanoidPhenols: Float,
                Proanthocyanins: Float,
                ColorIntensity: Float,
                Hue: Float,
                OD280: Float,
                Proline: Float)

The description of the fields is available here. Next, create a strongly typed dataframe:

val df = wine.map(_.split(","))
             .map(s => Wine(s(0).toInt,
                            s(1).toFloat, s(2).toFloat, s(3).toFloat, s(4).toFloat,
                            s(5).toFloat, s(6).toFloat, s(7).toFloat, s(8).toFloat,
                            s(9).toFloat, s(10).toFloat, s(11).toFloat, s(12).toFloat,
                            s(13).toFloat))
             .toDF()
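
One caveat: if the file ends with a blank line (UCI files sometimes do), the toFloat calls above will throw a NumberFormatException. A defensive tweak is to drop empty lines before parsing (cleaned is just an example name) and use it instead of wine in the map above:

// drop blank lines before parsing
val cleaned = wine.filter(_.trim.nonEmpty)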

and re-export the frame to SQL with

df.registerTempTable("wine")
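
A side note: on Spark 2.x and later registerTempTable is deprecated; the equivalent call is

df.createOrReplaceTempView("wine")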

Now you can fetch some data again and it will be properly presented. If you ask for aggregated data it can be visualized on the spot. For example:

%sql
select WineClass, avg(Hue) as AverageHue from wine group by WineClass

which you can render as a pie chart straight from the result toolbar:

[Screenshot: pie chart of average Hue per wine class]

Now, what if you want to use custom dataviz? It requires a bit of fiddling with Scala and marshaling data from the Scala context to the Angular context. Let's take the Hue column of the wine dataset and transfer it to Angular:

z.angularBind("wine", df.select("Hue").collect.map(s=>s(0)))
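
You can bind as many objects as you like this way, and remove them again when they are no longer needed (the name wineAlcohol is just an example):

// bind a second series for the client side
z.angularBind("wineAlcohol", df.select("Alcohol").collect.map(s => s(0)))
// remove a binding once you are done with it
z.angularUnbind("wineAlcohol")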

Now you can do the usual things with Angular and jQuery. Or you can use Kendo UI, like so:

%angular
<link rel="stylesheet" href="//kendo.cdn.telerik.com/2016.3.1028/styles/kendo.common-material.min.css" />
<link rel="stylesheet" href="//kendo.cdn.telerik.com/2016.3.1028/styles/kendo.material.min.css" />
<link rel="stylesheet" href="//kendo.cdn.telerik.com/2016.3.1028/styles/kendo.material.mobile.min.css" />

<script src="//kendo.cdn.telerik.com/2016.3.1028/js/jquery.min.js"></script>
<script src="//kendo.cdn.telerik.com/2016.3.1028/js/kendo.all.min.js"></script>

<div id="chart" data-hue="{{wine}}"></div>
<button onClick="makeChart()" class="btn btn-primary">Go</button>

<script>
function makeChart(){
   // if you want to access the data via the ng scope you can use eg.
   // var scope = angular.element($("#chart")).scope();
    var  hueData = $("#chart").data("hue");
    $("#chart").kendoChart({
                title: {
                    text: "Hue values of the wines"
                },
                legend: {
                    visible: false
                },
                series: [{
                    type: "line",
                    data:hueData,
                    style: "smooth",
                    markers: {
                        visible: false
                    }
                }],
                categoryAxis: {
                    title: {
                        text: "Wine"
                    },
                    majorGridLines: {
                        visible: false
                    },
                    majorTicks: {
                        visible: false
                    }
                }
            });
}
</script>

which results in something like this:

[Screenshot: Kendo UI line chart of the wine Hue values]

Note the following:

  • now you have wine objects in three contexts: SQL, Scala and Angular.
  • the Angular binding is effectively dynamic; if you alter the object using Scala it will automatically update in the Angular dataviz (see the sketch after this list).
  • there are various ways to marshal things between the contexts, and various ways you can use Angular as well.
  • things like Bootstrap and jQuery are all present in Zeppelin by default.
  • the series shown in the graph could be analyzed using R, for instance; there is a Zeppelin/R interpreter for this.
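
To see the dynamic binding from the second bullet in action, re-bind the object from Scala; the client side picks up the new data without touching the Angular paragraph. A minimal sketch, assuming the df from above is still in scope:

// re-bind "wine" with a filtered series; the Angular side updates automatically
z.angularBind("wine", df.filter("Hue < 1.0").select("Hue").collect.map(s => s(0)))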

So, the whole Zeppelin environment is really a place where multiple technologies live in harmony, and where the Spark backend can replace the usual NodeJS (or ASP.NET or Django or whatnot) for BI dashboards or predictive analytics. I would not throw away HortonWorks, Tableau, Dataiku, Shiny and other solutions just yet, but Zeppelin is definitely something to keep an eye on.