Dataiku: a great data science platform

There are various platforms with great features out there and it would not be fair to put them all in the same category. Whether one or the other is the right choice for you depends on your preferred technology, your budget, the size of your team and more. Shiny, for example, is really great for dashboarding on top of R, but if your codebase is Python it will not be the right match. Hortonworks/Apache Zeppelin packs an awful lot of goodies into one box, but it will push you through a learning curve and suits really big data sizes and teams.

Dataiku is an exceptional mix of things in many ways. Below I summarize the things which I think stand out and which turn it into a great match for a broad range of projects.

 R and Python under one roof

Dataiku integrates Jupyter with R and (duh) Python so you can mix and match notebooks. Even more, you can plug both languages into workflows wherever you like. Together with Spark, you can go from megabytes to petabytes without leaving the platform and consume Python modules or R packages in any way you like.

 Workflows or code

If you have worked for a while with Jupyter or RStudio you know what a code maze you can create: snippets and files everywhere. Parts deal with data transformation and cleaning, parts with ML, parts with dataviz. If others have to deal with your code it becomes even more confusing. The Dataiku approach is based on a workflow concept (see image) which allows you to keep an overview of how data is being manipulated. Admittedly, the visual designer is not unique, since apps like Orange or Rapidminer also take this approach. What makes Dataiku special is that it integrates custom R/Python code without having to program custom modules, and everything is somehow more accessible than in other products.

Dataiku analytics flow.
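
To make the workflow idea a bit more concrete, here is a minimal sketch in plain Python. It is not how Dataiku implements its flow; it only assumes that a flow can be seen as a graph in which each step lists the steps it depends on. The step names and the helper `run_order` are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each step maps to the set of steps it depends on.
# The names are illustrative and do not come from a real Dataiku project.
flow = {
    "raw_orders": set(),
    "clean_orders": {"raw_orders"},  # e.g. a visual preparation step
    "features": {"clean_orders"},    # e.g. a custom Python or R step
    "model": {"features"},
    "charts": {"clean_orders"},
}


def run_order(graph):
    """Return one valid execution order in which every step follows its inputs."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(flow)
print(order)
```

Seeing the dependencies as a graph is what lets a tool draw the overview and rebuild only what sits downstream of a change, whether a step is visual or hand-written code.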

 Web based

Dataiku is pretty much the only product that lives entirely inside the browser. That is, everything is web based. Unlike, for example, RStudio, which comes as a desktop and a server product, Dataiku is essentially a web app with all the advantages and disadvantages of HTML.

Dataiku full screen.

 Dashboarding integrated

The dashboarding experience is well thought out in several ways:

  • one can explore datasets by means of auto-charting widgets and transfer these automatically to dashboards
  • things like d3.js, Bootstrap and other frameworks are readily available, so if you are not happy with the default charting you can extend the dataviz without dealing with yet another JS framework. Even more, there are so many great widgets out there based on d3.js that you can plug into Dataiku
  • the datasets you have defined or transformed in your project are immediately available in the dashboard via AJAX requests and Python services

Dataiku is in a way a nicely streamlined consolidation of Python, R, JavaScript and other things (e.g. security and collaboration).
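
As a rough illustration of the last point in the list, the sketch below shows the kind of work a small Python service does before a chart can draw anything: aggregate rows and hand them over as JSON. It uses only the standard library and does not call any Dataiku API; the rows, the column names and the helper `chart_payload` are assumptions made up for the example.

```python
import json
from collections import defaultdict

# Hypothetical rows, standing in for a dataset defined in a project.
rows = [
    {"country": "BE", "revenue": 120.0},
    {"country": "FR", "revenue": 80.0},
    {"country": "BE", "revenue": 30.0},
]


def chart_payload(records, key, value):
    """Sum `value` per `key` and return a JSON string a chart could consume."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[value]
    series = [{"label": k, "total": v} for k, v in sorted(totals.items())]
    return json.dumps({"series": series})


payload = chart_payload(rows, "country", "revenue")
print(payload)
```

On the browser side, a d3.js widget would fetch such a payload with an AJAX request and bind the `series` entries to bars or points.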

 Machine learning the way you like it

Platforms like Rapidminer, RevolutionAnalytics (Microsoft R Server) or H2O have their own ML implementations. That is sometimes a good thing, but most of the time it means you have to learn yet another API and set of parameters. With Dataiku you can use the stuff you know and love: the sklearn API, the dplyr package, Spark and whatnot. Here again, it feels like good cooking: harmonizing ingredients and flavors into something tasty. Dataiku has not reinvented gastronomy, it has made cooking smooth and professional.

 Collaboration and security integrated

Describing what you are doing is vital if you work with others. Sharing comments, protecting things (e.g. making data read-only), tracing changes, exporting/importing projects, reproducing results… all are key if you want to do data science in a collaborative, reproducible and productive fashion. Here as well, Dataiku does a very good job of supporting all of this.

 Available (almost) everywhere

You can use (or try out) Dataiku on Azure, locally on Mac/Windows/Linux, on AWS, on VirtualBox and VMWare. The setup is straightforward. In comparison, setting up an RServer or a Shiny server is a burden. Deploying Rapidminer on Azure is near impossible.

 ML automation

One can launch a series of analyses which are automatically parametrized, and the output is very nicely (interactively) detailed. This is very valuable on its own, but Dataiku goes a step further by allowing you to export the model as a Jupyter notebook. So, you can use it as-is or fine-tune it further with Python. That’s a great jump-start and a powerful feature.
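
For readers who have never seen what "automatically parametrized" boils down to, here is a toy sketch of a parameter sweep in plain Python. It is not Dataiku's algorithm and not what an exported notebook looks like; the data, the one-rule classifier and the candidate grid are all invented for illustration.

```python
from itertools import product

# Toy one-dimensional data set of (feature, label) pairs. Purely illustrative.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]


def accuracy(threshold, positive_above):
    """Score a one-rule classifier that splits the feature at `threshold`."""
    hits = 0
    for x, label in data:
        predicted = int(x > threshold) if positive_above else int(x <= threshold)
        hits += predicted == label
    return hits / len(data)


# Candidate settings to try; an automated tool would choose these for you.
grid = {"threshold": [1.0, 2.0, 3.0], "positive_above": [True, False]}

# Evaluate every combination and keep the score next to its settings.
results = [
    (accuracy(t, p), {"threshold": t, "positive_above": p})
    for t, p in product(grid["threshold"], grid["positive_above"])
]
best_score, best_params = max(results, key=lambda item: item[0])
print(best_score, best_params)
```

The appeal of having this written out as code is exactly the point made above: once the sweep is in a notebook, you can change the grid or the scoring yourself.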

Having done all sorts of data science research on various platforms and various technologies (including custom cloud implementations) I have seen good and bad everywhere. Dataiku is not perfect and is for sure only the beginning of a bigger story, but as it stands now it’s a great consolidation of all one needs to do collaborative data science. It’s solid and well thought out in various areas. Worth a try if you haven’t done so.