Setting up and connecting Spark

The easiest way to install Spark is via sparklyr. Of course, you can download and install it the classic way as well, but I find the R route very straightforward. Go to a Jupyter R notebook and install the library

install.packages("sparklyr")
library(sparklyr)

then install Spark using

spark_install(version = "1.6.2")

There is a useful function which then tells you where Spark is located on your system

spark_install_dir()

Start a terminal session or use SSH. If you have a demo version on VMware or VirtualBox, you need something like

ssh -p 2222 dataiku@localhost

Edit the bash config using

nano ~/.bashrc

(or VI or any other editor) and set

export SPARK_HOME=the-dir-where-spark-was-installed

where the directory is the one returned by the spark_install_dir function. You can then execute

source ~/.bashrc

and check that things are running by means of

$SPARK_HOME/bin/spark-shell

or pyspark or any other Spark-like command.
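
The same check can be scripted. Below is a minimal Python sketch, assuming the usual Spark layout in which the launchers live under the bin directory of SPARK_HOME; the helper name find_spark_launcher is made up for illustration.

```python
import os


def find_spark_launcher(name="spark-shell", environ=None):
    """Return the full path of a Spark launcher, or None if it cannot be found.

    Assumes the usual layout where launchers sit in $SPARK_HOME/bin.
    """
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return None  # SPARK_HOME is not exported in this session
    candidate = os.path.join(spark_home, "bin", name)
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    launcher = find_spark_launcher()
    print(launcher or "SPARK_HOME is missing or does not point at a Spark install")
```

If this prints a path, the export in the bash config was picked up; if not, the session probably has not sourced the file yet.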

To connect this setup to DSS you need to stop DSS, install the Spark integration and start DSS again. Go to the DSS directory (on the CentOS demo machine this is ‘/home/dataiku/dss’) and execute the commands as listed in the guide

./bin/dss stop
./bin/dssadmin install-spark-integration
./bin/dss start

If all went OK you will now see in the DSS config of Spark something like the image below

DSS Spark Config

A side-effect of all this is the fact that you can use Scala and other goodies in your flows. That’s a chapter on its own.

Existing processes can also be changed so they execute in a Spark context rather than a local one; go to the ‘advanced’ settings and simply change the engine.

DSS Engine

Duplicating a project using dsscli

If you are using macOS and you export a project, you will notice that the zip is automatically expanded. To import a project you need a zip, however, and a manually created zip does not seem to be accepted by DSS. You can import and export things via the CLI utility instead. Use your terminal or an SSH session (see above), go to the DSS directory and use

./bin/dsscli project-export PROJECT-KEY project.zip

where PROJECT-KEY is the key of the project you wish to export and project.zip is the archive to write. The key can be found in various ways; the easiest is via the ‘changes’ tab in the project’s home (see image).

To import (and thus duplicate) the project use

./bin/dsscli project-import --project-key NEW-KEY project.zip

where NEW-KEY is the new (not yet existing) key for the new project and project.zip is the exported archive.
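
If you duplicate projects regularly, the two steps can be wrapped in a small script. This is only a sketch: it assumes dsscli sits in the bin directory of the DSS directory and takes the archive path as its last argument, and the function names and the default archive name are my own.

```python
import subprocess


def duplicate_project_commands(source_key, new_key, archive="project-export.zip"):
    """Build the export and import command lines; nothing is executed here."""
    export_cmd = ["./bin/dsscli", "project-export", source_key, archive]
    import_cmd = ["./bin/dsscli", "project-import", "--project-key", new_key, archive]
    return [export_cmd, import_cmd]


def duplicate_project(dss_dir, source_key, new_key, run=subprocess.check_call):
    """Export source_key to a zip, then import that zip again under new_key.

    check_call raises on a non-zero exit code, so a failed export
    prevents the import from running.
    """
    for cmd in duplicate_project_commands(source_key, new_key):
        run(cmd, cwd=dss_dir)
```

A call would look like duplicate_project("/home/dataiku/dss", "PROJECT-KEY", "NEW-KEY"); the run parameter exists only so the commands can be inspected without touching a real instance.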

Stealing optimal Scikit-learn parameters

DSS hides a lot of magic which happens when you create models. The click-and-go interface allows you to turn on various ML algorithms and then presents you with the most appropriate one. If you have used sklearn you know that things are not that easy; you need to convert categorical features, clean things, use grid-search to optimize hyperparameters and all that. At the same time, the DSS machinery does not always give you the flexibility you have using plain Python.

So, sometimes I use DSS to do the cleanup via the usual flow and create the most appropriate model, but then refine things in a notebook inside or outside DSS. The cool thing is that you can fetch the parametrization from DSS and copy/paste it into Python. For example, in the image below you can see an XGBoost algorithm which came out as the best classifier. In the “grid-search optimization” section you can see the parameter combination.

You can copy/paste these parameters, replace the “:” with “=” and you are done.

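import xgboost
from sklearn.model_selection import cross_val_score

# train_X, train_y and columns are assumed to come from the earlier data preparation
xgboost_classifier = xgboost.XGBClassifier(reg_alpha=0, colsample_bytree=0.8, colsample_bylevel=1,
                                           learning_rate=0.2, max_delta_step=0, min_child_weight=0,
                                           subsample=1, reg_lambda=1, max_depth=10, gamma=0)
# fit on the full training set
xgboost_classifier.fit(train_X, train_y)
# mean cross-validated score (each fold refits a clone of the classifier)
xgboost_score = cross_val_score(xgboost_classifier, train_X, train_y, n_jobs=-1).mean()
print("{0} -> XGBoost: {1}".format(columns, xgboost_score))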
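
Instead of editing the pasted text by hand, the conversion can also be done in code. Here is a minimal sketch, assuming the parameters are copied from DSS as comma-separated name: value pairs with numeric values (the exact text DSS shows may differ); parse_dss_params is a hypothetical helper, not part of DSS or xgboost.

```python
import ast


def parse_dss_params(text):
    """Turn 'name: value, name: value' text into a dict of keyword arguments."""
    params = {}
    for pair in text.split(","):
        if not pair.strip():
            continue  # tolerate trailing commas and blank chunks
        name, _, value = pair.partition(":")
        # literal_eval turns "0.8" into a float and "10" into an int,
        # and raises if the value is not a plain Python literal
        params[name.strip()] = ast.literal_eval(value.strip())
    return params


pasted = "reg_alpha: 0, colsample_bytree: 0.8, learning_rate: 0.2, max_depth: 10"
params = parse_dss_params(pasted)
print(params)
```

The resulting dictionary can then be unpacked into the classifier with xgboost.XGBClassifier(**params).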

I’m sure this parametrization can be obtained from standard sklearn searching techniques but DSS makes things easier (and more fun too).

Using the public API

The public API allows one to use DSS programmatically: creating or listing projects, uploading data and whatnot. There is a Python wrapper which can be installed with

pip install dataiku-api-client

The documentation of the client can be found here.

For example, to add a project and list the existing projects use

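from dataikuapi.dssclient import DSSClient as client

# host is the address of your DSS instance, apiKey the key from the settings;
# both have to be defined before this point
api = client(host, apiKey)
# arguments: project key, project name, owner
p = api.create_project("API", "Created via API", "me")
dss_projects = api.list_project_keys()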

The API key can be found in the settings of the service and the address of the host is of course something which depends on your setup.
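
Since a project key has to be unique, a second create_project call with the same key is expected to fail. A small guard can avoid that. The sketch below only relies on the two client methods used above; ensure_project and the FakeClient stand-in are made up so the snippet runs without a DSS instance.

```python
def ensure_project(api, key, name, owner):
    """Create the project only when its key is not taken yet.

    `api` is expected to be a DSSClient, or anything else exposing
    list_project_keys and create_project. Returns True if a project was created.
    """
    if key in api.list_project_keys():
        return False
    api.create_project(key, name, owner)
    return True


class FakeClient:
    """Stand-in for DSSClient, for illustration only."""

    def __init__(self):
        self.keys = ["DEMO"]

    def list_project_keys(self):
        return list(self.keys)

    def create_project(self, key, name, owner):
        self.keys.append(key)


fake = FakeClient()
created_first = ensure_project(fake, "API", "Created via API", "me")
created_again = ensure_project(fake, "API", "Created via API", "me")
print(created_first, created_again, fake.list_project_keys())
```

With a real connection you would pass the api object from above instead of the stand-in.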


Editing/deleting custom meanings

Meanings or metadata can be added, but for some reason there isn’t a direct way to edit them from the same interface once added. I think there is a UX glitch here.

DSS Custom Meaning

There are two ways to edit meanings, one via the Catalog and one via the public API.

The Catalog contains all your assets and the meanings are in a separate tab.

The other way is via the API. The meanings are defined on a global level; note that the names are lower-cased even if you defined them differently in the UI:

api = client(host, apiKey)
# the whole list can be fetched with api.list_meanings()
meaning = api.get_meaning("binary")
definition = meaning.get_definition()
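
Once you have the definition, the edit itself is a matter of changing it and saving it back. A hedged sketch: I am assuming here that the definition is a dictionary with a "label" entry and that the meaning object offers set_definition as the counterpart of get_definition; relabel_meaning and the stand-in classes are hypothetical and only there so the snippet runs without a DSS instance.

```python
def relabel_meaning(api, meaning_id, new_label):
    """Change the label of an existing meaning and save it back."""
    meaning = api.get_meaning(meaning_id.lower())  # names are stored lower-cased
    definition = meaning.get_definition()
    definition["label"] = new_label  # assumption: the definition has a "label" key
    meaning.set_definition(definition)
    return definition


class FakeMeaning:
    """Stand-in for the object returned by get_meaning, for illustration only."""

    def __init__(self, definition):
        self._definition = dict(definition)

    def get_definition(self):
        return dict(self._definition)

    def set_definition(self, definition):
        self._definition = dict(definition)


class FakeClient:
    """Stand-in for DSSClient holding a single made-up meaning."""

    def __init__(self):
        self._meanings = {"binary": FakeMeaning({"id": "binary", "label": "Binary"})}

    def get_meaning(self, meaning_id):
        return self._meanings[meaning_id]


fake = FakeClient()
relabel_meaning(fake, "Binary", "Yes/No flag")
print(fake.get_meaning("binary").get_definition())
```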

Adding a meaning is like so

api.create_meaning("meaning_4", "Test meaning", "PATTERN", pattern="[A-Z]+")

Now, how do you delete a custom meaning? Well, you can’t. There is a hack via the shell: go to the directory where the meanings are stored and delete the file. However, DSS might end up in a corrupted state if you have not ensured that the meaning is not used anywhere. So, all in all, there is still a bit of work to do in this area.
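
Before trying that hack, it is worth checking whether any column still carries the meaning. The sketch below works on plain dictionaries and assumes that a dataset schema looks like {"columns": [{"name": ..., "meaning": ...}]}; that shape, the helper name and the sample data are assumptions of mine, so check what your instance actually returns.

```python
def columns_using_meaning(schemas, meaning_id):
    """Return (dataset, column) pairs whose column is tagged with meaning_id.

    `schemas` maps a dataset name to its schema dictionary.
    """
    wanted = meaning_id.lower()
    hits = []
    for dataset_name, schema in schemas.items():
        for column in schema.get("columns", []):
            if str(column.get("meaning", "")).lower() == wanted:
                hits.append((dataset_name, column["name"]))
    return hits


# Made-up schemas, for illustration only
schemas = {
    "customers": {"columns": [{"name": "code", "meaning": "meaning_4"},
                              {"name": "age"}]},
    "orders": {"columns": [{"name": "id"}]},
}
print(columns_using_meaning(schemas, "meaning_4"))
```

An empty result means none of the inspected schemas references the meaning; it says nothing about other places where a meaning might be used.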