In a related article we showed how to do predictive maintenance (PdM) with R. It's tedious. In this article we show how a platform (Dataiku) can simplify your life tremendously.

The whole project can be seen online and you can experiment with it as much as you like.

Business context: a car rental company that wants to predict part failures and the maintenance they require.

What you need:

  • as much data about the parts as possible
  • knowledge about what failed and when
  • anything you think might influence the predictions (does air humidity play a role?). This is where domain expertise matters and where you, as a data scientist, need input.
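
To make these three ingredients concrete, here is a minimal sketch of how part data, a failure log and an extra factor such as humidity could be combined into labelled training rows. Every name and value in it (`Reading`, `part_id`, `mileage_km`, `humidity`, the 30-day horizon, the sample records) is invented for illustration and is not taken from the project.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Reading:
    part_id: str
    day: date
    mileage_km: float
    humidity: float  # hypothetical external factor suggested by domain experts


# Hypothetical failure log: part_id -> date on which the part failed.
failures = {"brake-17": date(2020, 3, 20)}

# Hypothetical part data (made-up values).
readings = [
    Reading("brake-17", date(2020, 3, 1), 41200.0, 0.62),
    Reading("brake-17", date(2020, 1, 5), 35100.0, 0.40),
    Reading("brake-42", date(2020, 3, 1), 12800.0, 0.55),
]


def label(reading, failures, horizon_days=30):
    """Return 1 if the part failed within `horizon_days` after the reading, else 0."""
    failed_on = failures.get(reading.part_id)
    if failed_on is None:
        return 0
    gap = failed_on - reading.day
    return int(timedelta(0) <= gap <= timedelta(days=horizon_days))


# Each training row pairs a reading with its label.
labelled = [(r, label(r, failures)) for r in readings]
labels = [y for _, y in labelled]
```

The horizon is a modelling choice: it decides what "about to fail" means, and it is exactly the kind of parameter where input from domain experts helps.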

The process in Dataiku is no different from building a model by hand: data upload, data cleaning, learning a model and making predictions. In the picture above you can see the flow with the process steps highlighted.
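
For comparison, the same four steps done by hand might look roughly like the sketch below. It uses only the Python standard library and a deliberately trivial model (a single mileage threshold); the inline CSV, the column names and the choice of model are all assumptions made for illustration, not what the Dataiku project actually does.

```python
import csv
import io

# 1. Data upload: a tiny inline CSV stands in for an uploaded file (synthetic values).
RAW = """part_id,mileage_km,failed
a,12000,0
b,48000,1
c,,0
d,52000,1
e,15000,0
f,30000,0
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# 2. Data cleaning: drop rows with a missing mileage and convert the types.
clean = [
    (float(r["mileage_km"]), int(r["failed"]))
    for r in rows
    if r["mileage_km"].strip()
]


# 3. Learning: pick the mileage threshold that misclassifies the fewest rows
#    (a one-feature decision stump, chosen here only for brevity).
def errors(threshold):
    return sum((x >= threshold) != bool(y) for x, y in clean)


threshold = min(sorted(x for x, _ in clean), key=errors)


# 4. Prediction: flag a part once its mileage reaches the learned threshold.
def predict(mileage_km):
    return mileage_km >= threshold
```

Even this toy version needs explicit code for parsing, cleaning and type conversion, which is the part a visual flow takes off your hands.
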
Dataiku makes things easy in multiple ways:

  • the process flow, once created, can be run repeatedly with the same or different data. That is, you get reproducible data science.
  • uploading data or connecting to an external data source is a breeze
  • the tedious work of cleaning, transforming and/or partitioning data is super-easy. Lots of helpful tools do what would otherwise take deep API knowledge in R or Python.
  • the result can be immediately turned into a web service

Not always needed but still nice to have: you can use the integrated dashboarding environment (below), with HTML, d3.js and all that.

What is the catch? Well, you need to buy a license, but as far as I can see that's the only drawback. Just compare the amount of code you need in R or Python with the visual construction in Dataiku. On top of that you get the collaborative features and much more (Spark integration and such).