Digital marketing optimization amounts to developing mechanisms which maximize the conversion of people using digital media into buyers of products or services. The data coming from web sessions can be augmented with CRM matchings, with geographic positioning (from mobile data, IP address or other sources), with historical data and whatnot. Every industry and company has its own private stash of data and channels, so it is in general difficult to describe a unique, works-for-all approach. There are however some ingredients in the recipe which you will likely find in any solution:
- a combination of multiple data sources: flat files, databases…
- a large amount of data (the more the merrier) which easily pushes solutions into Hadoop-like platforms
- machine learning (ML) algorithms; typically randomized decision-tree ensembles, with XGBoost doing just as well
- the need to predict new customers, or the incentives which convert them
- the classic factors: churn rate, propensity and others
- the need to deliver executive, high-level visualizations for marketers and the like
Technically this means a mixture of big-data techniques (Spark, MLLib, Hadoop…), R and Python coding, ML algorithms, dataviz through HTML/JS…and all of this together in a reusable, repeatable and collaborative workflow.
This sounds like a daunting challenge but the good news is that the Dataiku platform makes it all really easy. Mind, I did not say you don’t need to understand things or don’t need to code or that all is click-and-go. Just that Dataiku as a platform supports the whole process very well, the backend integration and the workflow in particular. So, in what follows I do not give you a magical solution for any marketing optimization problem but I sketch the steps I typically take to tackle such problems. If you need more details, just contact me.
Within Dataiku one can define a workflow which, once defined, represents a repeatable, traceable and collaborative pipeline:
- the flow can be triggered when new data is dumped or on the basis of various other events
- the parts can be developed independently by different people (i.e. collaboratively created)
- the ML part can be deployed as a stand-alone REST service and integrated into other legacy processes
Note that in this flow there is a combination of technologies which all work harmoniously together:
- Spark processing (MLLib, SparkR, PySpark and all that)
- standard Python and R processing; the integration of any (really, ANY) package/module can be plugged here and the connection to the flow context is via simple input/output of so-called managed datasets (in essence CSV files which are project-bound)
- SQL processing inside PostgreSQL/MSSQL/MySQL/Mongo, you name it
- Scala programming and ML
This should come across as paradise to any data scientist, and it is.
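To make the managed-dataset idea concrete: a Dataiku Python recipe boils down to reading input datasets, transforming them, and writing output datasets. Since managed datasets are in essence project-bound CSV files, a minimal stand-in with plain pandas (the column names here are hypothetical; inside Dataiku you would use the dataiku package instead of raw CSV handling) looks like this:

```python
import io

import pandas as pd

def recipe(input_csv: str) -> pd.DataFrame:
    """A stand-in for a Dataiku Python recipe: read an input managed
    dataset (essentially a CSV), transform it, return the output."""
    df = pd.read_csv(io.StringIO(input_csv))
    # the "transform" step: keep converted visitors only
    return df[df["converted"] == 1]

raw = "visitor_id,converted\na,1\nb,0\nc,1\n"
out = recipe(raw)
print(len(out))  # 2 converted visitors
```

The point is not the pandas code itself but the shape of it: every recipe in the flow is this same read/transform/write pattern, which is what makes the parts independently developable.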
Data aggregation and consolidation
Typically the marketing data consists of CRM and database data. You can integrate shell scripting in the flow which exports CRM data to, say, a flat file via the vendor's CLI. This is what you see in the lower part of the aggregation in the flow. The upper part is the extraction of data from an RDBMS; typically the recorded data from website visitors: what they click on, what they download and so on. The so-called touchpoints and attributes.
The aim is of course to have the ultimate, full 360-degree view of what a visitor or existing customer does. Existing customers define, through ML and time-series analysis, the patterns which are then used to recognize almost-converted visitors (i.e. not yet real customers).
Do you need Dataiku to do this? Strictly speaking, no, of course not. The reason why you want to is to make the iteration as automatic as possible in one single flow. If another process needs to do this it means you have to hook up processes or, even worse, perform some triggering manually. With the shell, SQL and Python recipes you can likely do 99% of anything out there, including REST calls, CLI and API consumption of any kind.
The join of datasets is based on common keys (primary key, mail address…) and can be done in memory or in a Spark cluster. It means that you can effectively experiment with such flows on your desktop and scale to infinity once you are satisfied.
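A minimal sketch of such a join, assuming a CRM extract and a web-session dump which share the e-mail address as key (the data is made up); the same pandas logic translates one-to-one to a Spark DataFrame join when you scale up:

```python
import pandas as pd

# CRM export (e.g. dumped via the vendor's CLI)
crm = pd.DataFrame({
    "email": ["a@x.com", "b@x.com"],
    "segment": ["gold", "silver"],
})

# web-session touchpoints extracted from the RDBMS
sessions = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "c@x.com"],
    "page": ["/home", "/pricing", "/home"],
})

# inner join on the common key; "c@x.com" has no CRM match and drops out
joined = crm.merge(sessions, on="email", how="inner")
print(joined)
```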
Data cleaning and transformation is a chapter on its own. Dataiku has a whole lot of time-saving features; see for example Building a Data Pipeline to Clean Dirty Data, which highlights the cool techniques well. The built-in techniques do not prevent you from fiddling with R/Python scripts in addition. Everything depends on how much and how deep the cleaning goes. Also, some of the cleaning can already happen within the export, the SQL processing or through the CLI.
My experience here is that Dataiku does a really good job. Plenty of time-saving features like
- geo-conversions (to and from geopoints)
- parsing of data types and dates in particular
- splitting/merging columns and all that
I think the time-saving also sits in the fact that you don't need to look up the latest Spark DataFrame API, the myriad Pandas methods or the confusing R logic. I mean, it's often a mundane task to parse datetimes and whatnot. It's not the most exciting part of machine learning, and Dataiku helps to keep your job fun and lets you focus on the creative part of data mining.
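For instance, the datetime parsing which Dataiku detects automatically looks like this in plain pandas (the sample timestamps are made up); not hard, but you do have to know or look up the format codes every time:

```python
import pandas as pd

# raw string timestamps as they would arrive from a weblog export
s = pd.Series(["2017-03-01 10:15", "2017-03-02 23:59"])

# explicit format string: the mundane part Dataiku infers for you
ts = pd.to_datetime(s, format="%Y-%m-%d %H:%M")
print(ts.dt.day.tolist())  # [1, 2]
```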
Modeling customer behavior through Markov chains
The behavior of website visitors is not erratic but it does show a lot of noise. One can look at it like the way tourists travel by car to some destination, say Madrid. The destination is the analog of a conversion (i.e. buying a service or product). If you look at all visitors who end up buying something, they almost all behave differently, but there are big highways and small roads. If you travel by car from Berlin to Madrid you will take some highways, deviate for lunch, stop for the night. Everybody has, however, a different rhythm and pace. If you look at millions and millions of people traveling from Berlin to Madrid you do see patterns appearing. These patterns are best described mathematically as Markov chains. A Markov chain is like a graph with some probabilities on top. For example, after one day of traveling from Berlin you have various probabilities of being around Paris. No certainty, but the highest probabilities are around Paris and not around Marseille. Similarly, when an arbitrary visitor visits Amazon for a particular book there are plenty of possible paths from the homepage to the actual book, each with its own probability. A Markov chain can capture these probabilities on the basis of existing data. Once you have this model you can foresee how people use your site and which places they should avoid in order to convert.
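A first-order Markov chain can be estimated from raw clickstreams by simply counting page-to-page transitions and normalizing per source page. A minimal sketch with hypothetical sessions, where "buy" plays the role of a conversion:

```python
from collections import Counter, defaultdict

# hypothetical clickstreams; each list is one visitor session
sessions = [
    ["home", "book", "buy"],
    ["home", "search", "book", "buy"],
    ["home", "search", "home"],
]

# count observed transitions between consecutive pages
counts = defaultdict(Counter)
for s in sessions:
    for a, b in zip(s, s[1:]):
        counts[a][b] += 1

# normalize counts into transition probabilities per source page
chain = {
    page: {nxt: n / sum(c.values()) for nxt, n in c.items()}
    for page, c in counts.items()
}
print(chain["home"])  # {'book': 0.333..., 'search': 0.666...}
```

A library such as Malakov wraps this kind of estimation (and more) for Scala/Spark, but the essence is exactly these normalized transition counts.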
Note that there are other ways to model website visitors: things like persona segmentation and clickstream analysis, among others.
Within Dataiku you can define all of this in R, Python or Scala. In the flow above Spark processed the cleaned data using Malakov, a Scala Markov library. You can experiment with this outside the flow in your preferred environment and then plug the code into the flow. Dataiku also has Jupyter integrated, so you can do the experimenting with notebooks even inside Dataiku. The notebooks are kept as part of the project and it’s another place where collaboration is possible.
Feature engineering means combining, reducing and augmenting features in any way which increases the accuracy of a predictive model (whether a classifier or a regression). When dealing with marketing data it's not uncommon to have more than 100K features. This on its own is not a problem, but the sparsity of the data often is. Typical metrics like the Euclidean metric do not function on sparse data, for example. So things like noise reduction, PCA and combinations thereof are used. In Dataiku there is a great utility which automatically does PCA for you (see image), including one-of-K (one-hot) encoding.
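A rough sketch of what such a utility does under the hood, using pandas and scikit-learn on a made-up dataset: one-of-K encode the categorical columns, then project the result onto a few principal components:

```python
import pandas as pd
from sklearn.decomposition import PCA

# made-up visitor features: two categoricals, one numeric
df = pd.DataFrame({
    "channel": ["ad", "organic", "ad", "email", "organic"],
    "country": ["de", "fr", "de", "es", "fr"],
    "visits": [3, 1, 7, 2, 5],
})

# one-of-K (one-hot) encoding of the categorical columns
encoded = pd.get_dummies(df, columns=["channel", "country"])

# reduce the expanded feature space to two principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(encoded)
print(reduced.shape)  # (5, 2)
```

At marketing scale the encoding step is what blows the feature count into the tens of thousands, and the PCA step is what brings it back down to something a model (and a distance metric) can digest.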
Feature engineering leads to hundreds rather than thousands of features. This is where the expertise of a data scientist shines (or not). Figuring out what works best is a combination of ML expertise, business understanding and patience. Working hard helps.
Machine learning and REST-ifying things
Here again, Dataiku gives you superpowers to quickly figure out which ML algorithm best suits your context. The environment also dictates what will happen: if you run atop Spark you will use MLlib, if you choose Python the sklearn lib will spin, and so on. Dataiku does it all. At the end of the experimenting and the crunching you end up with a best candidate in function of your predefined metric.
Often XGBoost or decision trees will come out on top when dealing with marketing optimization.
Note that in the flow there is an automatic splitting of the data for testing. Obviously, if you use plain Python code instead of (what are called in Dataiku) visual recipes you can touch the metal here: use grid or stochastic search to optimize hyper-parameters and all that.
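Touching the metal could look like this: an explicit train/test split plus a grid search over hyper-parameters, with scikit-learn's gradient boosting standing in for XGBoost on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the engineered marketing features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# the explicit version of the automatic split in the flow
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)

# grid search over a small hyper-parameter grid with 3-fold CV
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    cv=3,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```

Swap `GridSearchCV` for `RandomizedSearchCV` and you have the stochastic-search variant with the same three lines of wiring.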
When you are satisfied with the model and the predictions you can deploy the model so it runs standalone as a REST service. This means that you can predict how likely a non-customer is to convert. This can be augmented with propensity computations and more, but the essence remains the same.
The deployment of this REST service is a topic on its own and is more a data engineering subject than a data scientist one. Let’s just say that all of this is covered in Dataiku too.
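Conceptually, though, such a service is just an HTTP endpoint wrapping the trained model. A toy Flask sketch (the scoring rule below is a dummy stand-in for a real model, and this is not the code Dataiku generates, just the shape of it):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def score(features):
    # dummy stand-in for model.predict_proba:
    # more visits means a higher conversion propensity
    return min(1.0, features.get("visits", 0) / 10.0)

@app.route("/predict", methods=["POST"])
def predict():
    # accept a JSON feature vector, return a conversion probability
    return jsonify({"conversion_probability": score(request.get_json())})

# exercising the endpoint without running a server
client = app.test_client()
resp = client.post("/predict", json={"visits": 4})
print(resp.get_json())
```

Any legacy process that can POST JSON can then consume the model, which is exactly what makes the REST deployment an integration point rather than a data-science concern.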
In the same way that you can have automatic PCA, you can output a full-fledged forecasting model with Dataiku. Of course, forecasting (much like any ML) is an art on its own and don't expect that this academic discipline will become obsolete any time soon (GARCH and non-linear series, anyone?). The great thing is that you get a jump start and that time series analysis can be handled just as easily as any other data task.
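To give a flavor of the time-series side, here is forecasting in its simplest incarnation: single exponential smoothing written out by hand on made-up daily conversion counts (real flows would reach for ARIMA, GARCH or non-linear models, but the flow wiring stays the same):

```python
def exp_smooth_forecast(series, alpha=0.5):
    """One-step-ahead forecast via single exponential smoothing:
    the level is a weighted blend of each new observation and the
    previous level, with weight alpha on the new observation."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# hypothetical daily conversion counts
daily_conversions = [10, 12, 11, 15, 14]
print(exp_smooth_forecast(daily_conversions))  # 13.5
```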
Data visualization and dashboarding
There are many very good JS libraries out there which help you create stunning dataviz. My favorites are d3js and KendoUI. Sometimes I also create completely custom visualizations on top of SVG or Canvas. In Dataiku you can do anything you like and mix and match JS libraries in whatever way, since there is hardly anything pre-built here. Well, you can define charts and export them to the dashboard, but the interactivity and cohesion cannot really qualify as modern dashboarding. One thing that does feel great is the ease with which you can consume datasets and Python REST services. This is comparable to defining services in Flask, Bottle or Django. The whole combination feels very powerful.
In the case of marketing optimization I usually come up with a combination of Sankey diagrams, timelines and custom widgets (see the adjacent image). This type of HTML/JS development is not really where Dataiku shines, but it does feel great that this aspect is also neatly integrated in the whole project. In the end this is what Dataiku gives you: a grand kitchen and tools to cook great dishes and dinners. It's not for learning to cook, for fast food or for heating up pizzas; it's for serious, professional cooking.
- Dataiku has an overview of how to enrich weblogs which also explains how to extract colorful geo-representations of where visitors come from
- there is a cool way to model marketing optimization the same way one models stocks and shares through modern portfolio theory; some hints can be found here. One can easily integrate such an approach in the Dataiku flow as well, contact me if you are interested
- you can find here a collection of Dataiku tricks
- there is a dedicated marketing analytics page on the Dataiku site where you can find diverse success stories in this business direction