Markov Chain

The past year was filled with hefty research in the domain of digital marketing optimization. I never thought that marketing would bring me knee-deep into all sorts of intriguing mathematics (the polytomous Rasch model) and software domains (say, using Eclipse to develop Rcpp packages), but it was absolutely fascinating. It's clear that 2015 was a glorious year for machine learning (ML), artificial intelligence and (big) data science. The ascension (read: exponential growth) of R and Python as the de facto languages for statistics and ML is clear from the Stack Overflow questions, and the number of companies offering solutions on Spark, Hadoop et al. is exploding. My position in this market is (well, no surprise) a bit singular, in the sense that few people/companies really start from a blank canvas or try to come up with innovative solutions by looking at the customer's case first, instead of molding the customer into an existing framework/solution:

  • a large portion of the ML market is busy integrating the classic ML algorithms and using brute force (i.e. large Hadoop clusters and the like) to extract gold from lead
  • a serious portion of the industry is busy with the infrastructural aspects
  • startups and cloud providers deliver ML as SaaS, thus forcing customers to re-cast their needs into a digestible ML format
  • academia researches purely theoretical frameworks which are sometimes years away from application, resulting in algorithms which are often hard to convert into concrete code
  • small teams or individuals like me create intellectual property in response to concrete business needs. From visiting meetups and from small talk, I reckon that this is a small fraction of the industry.

So, what has been the aim all year long? Trying to see what people do on the way to buying something and figuring out a way to help them do this better, faster and more efficiently. For the clients this means a better buying experience, customized guidance and information, and less hassle. For the vendor it means a higher conversion rate and a lean, growing business. The typical win-win.

Customer Journey

How does it happen? In short, as a vendor you collect petabytes of data and try to see patterns which identify behaviors and roads towards conversion, where conversion means something like buying a product or, more generically, reaching a predefined point. The analogy here is the idea that heaps of people travel each year to their holiday destination and you try to see where they go and how they get there, without knowing in advance what they will do or how they will do it. In this sense, digital optimization is about figuring out the highways out of the infinitely many roads people take on their way to some destination.
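To make that concrete: if you log each visitor's journey as a sequence of page or channel states, a first-order Markov chain compresses the infinitely many roads into a transition matrix whose heavy entries are the highways. A minimal sketch in R (the states and journeys below are invented for illustration):

    # Invented states and journeys; each journey ends in "buy" or "exit".
    journeys <- list(
      c("home", "search", "product", "buy"),
      c("home", "product", "exit"),
      c("search", "product", "product", "buy"),
      c("home", "search", "exit")
    )
    states <- c("home", "search", "product", "buy", "exit")

    # Count transitions between consecutive states.
    counts <- matrix(0, length(states), length(states),
                     dimnames = list(states, states))
    for (j in journeys)
      for (i in seq_len(length(j) - 1))
        counts[j[i], j[i + 1]] <- counts[j[i], j[i + 1]] + 1

    # Row-normalize into transition probabilities; the heavy rows
    # are the highways. pmax avoids 0/0 for the absorbing states.
    P <- counts / pmax(rowSums(counts), 1)
    round(P, 2)

With real data the estimation is the same, just over millions of journeys instead of four.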

How does one do it? There are many aspects to the problem at hand:

  • the need to efficiently deal with large amounts of data. This is where many people act: setting up clusters, data-crunching pipelines and exotic open-source solutions. Cloud infrastructure is key here
  • transforming and merging data into amenable formats, reducing noise and data cleansing in general. The ML solution you have chosen plays a role here, in the sense that the language determines to some extent how you deal with the data
  • doing some actual analysis on the data
  • interpreting the results and transforming them into a format which lends itself to presentation. While this seems obvious, in practice it can be hard to interpret some ML outputs. For instance, clustering high-dimensional data is a common thing to do, but people have only one shoe size, not a vector of shoe sizes (see the sketch after this list)
  • finding ways to visualize results either for the vendor or the end-user or both
  • conveying research results in human-understandable ways, so that the intellectual property can be considered acquired, maintainable and future-proof.
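One trick that helps with the shoe-size problem above: cluster on standardized features, but report each centroid back in the original units, so a human can read it as a persona rather than a point in z-score space. A hedged sketch in R on made-up customer data:

    # Made-up customer features: visits, basket value, days since last visit.
    set.seed(42)
    customers <- data.frame(
      visits     = rpois(200, 5),
      basket_eur = rlnorm(200, 3, 0.5),
      recency    = rexp(200, 1 / 30)
    )

    # Cluster on standardized features so no single unit dominates.
    scaled <- scale(customers)
    km <- kmeans(scaled, centers = 3, nstart = 20)

    # Translate the centroids back to the original units: each row now
    # reads like a persona ("frequent, high-basket, recent") instead of
    # a vector in z-score space.
    centers_orig <- t(apply(km$centers, 1, function(z)
      z * attr(scaled, "scaled:scale") + attr(scaled, "scaled:center")))
    round(centers_orig, 1)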

I suppose that the combination of very different disciplines (data visualization in Photoshop vs. reading math research papers) is what makes the job interesting for me, and what makes it difficult for some companies to find the right data scientist.

Markov Chain

Like any other trend in the IT industry, ready-made solutions are becoming available and boxed algorithms are wrapped into consultancy and business deals. And, like any other software subject, if you need an out-of-band algorithm you need someone who articulates the domain expertise, the business context and pretty much the whole list above. I guess that's precisely where my role is.

On the level of storage and infrastructure I learned a tremendous amount about Azure, Linux, Cassandra, Spark, Hadoop and all that. Not typically my cup of tea, but I'm happy to understand the intricacies of those worlds. The past year has made me in many ways culture-agnostic: every technology and hyped product has good and bad sides, and I have become less and less a believer in any particular tool, language or technology. I enjoy MongoDB or TensorFlow for what they offer, there is good in Redis, Hive or Oracle, and there is equally good and bad in GitHub, Java or Clojure. I work happily nowadays on Windows, Mac and Ubuntu without noticing a personal threshold. I switch from R to JavaScript without preference, focusing rather on the result and the easiest way to get things done. Out of the lot, I found the following pieces to be of particular interest:

  • the acquisition of Revolution Analytics by Microsoft has seen a tighter integration of R into various products. AzureML integrates R and Python. SQL Server 2016 integrates the R language, thus creating a bridge to pretty much everything and anything ML. Ready-made VMs on Azure allow you to get up and running with scalable R solutions (DeployR). It's very likely that R will be a first-rank language in the next version of Visual Studio.
  • OpenCPU is an open-source solution which treats R packages much like Tomcat servlets. It allows you to create websites and web services from within RStudio (a call sketch follows this list)
  • using Azure workers as a commodity to scale .NET solutions, and the ease with which one can use Azure as plug-and-play infrastructure, together with the tight Visual Studio integration of it all. While many solutions are based on Spark/Apache these days, the Microsoft approach means you have, at all levels, a seamless integration of dev tools, infrastructure, software and billing.
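To give a flavor of the OpenCPU idea: every function of an installed package becomes an HTTP endpoint of the form /ocpu/library/{package}/R/{function}. A minimal sketch calling the public demo server at cloud.opencpu.org from R via the httr package (assuming that server is still reachable):

    library(httr)

    # POST the arguments; the /json postfix asks OpenCPU to return
    # the function's result directly as JSON.
    res <- POST(
      "https://cloud.opencpu.org/ocpu/library/stats/R/rnorm/json",
      body = list(n = 5, mean = 10),
      encode = "json"
    )
    content(res)  # five draws from N(10, 1), computed server-side

The same pattern applies to your own package: install it on the server and your R functions are instantly callable from any website or client.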

With respect to pure research, I cannot help but smile at various gems I enjoyed in 2015:

  • a large portion of my research time was, without doubt, spent applying (extended) Markov chains and turning them in all sorts of ways back and forth between marketing concepts and pure probability theory (first sketch after this list)
  • I went deep into time-series analysis and found the subject absolutely fascinating. I worked things out in R, in Python, in C# and even in JavaScript (NodeJS)
  • applying Markowitz portfolio theory: seeing analogies between stock portfolios and digital marketing is the kind of trans-domain linking exercise that I totally love (second sketch after this list)
  • the unbelievable flexibility of various domains: applying survival analysis to customer conversion, using non-negative matrix factorization to segment a customer audience, applying longitudinal analysis, clickstream analysis and all that to time-like data is at times revealing and at times totally disappointing. Research has its highs and lows, for sure
  • delving into research articles from psychology: Rasch models, Thurstonian models
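As a taste of the first item: mark "buy" and "exit" as absorbing states, and the textbook fundamental-matrix result B = (I − Q)⁻¹R immediately gives, per starting page, the probability of eventually converting. A sketch in R with an invented transition matrix:

    # Transition matrix over states home/search/product (transient)
    # and buy/exit (absorbing); every row sums to 1. Numbers invented.
    states <- c("home", "search", "product", "buy", "exit")
    P <- matrix(c(
      0.0, 0.5, 0.3, 0.0, 0.2,   # home
      0.1, 0.0, 0.6, 0.0, 0.3,   # search
      0.0, 0.1, 0.2, 0.3, 0.4,   # product
      0.0, 0.0, 0.0, 1.0, 0.0,   # buy  (absorbing)
      0.0, 0.0, 0.0, 0.0, 1.0    # exit (absorbing)
    ), 5, 5, byrow = TRUE, dimnames = list(states, states))

    Q <- P[1:3, 1:3]                 # transient -> transient
    R <- P[1:3, 4:5]                 # transient -> absorbing
    B <- solve(diag(3) - Q) %*% R    # absorption probabilities B = N R
    B[, "buy"]                       # P(eventual conversion | start page)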
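And the Markowitz analogy in a nutshell: the global minimum-variance portfolio w = Σ⁻¹1 / (1ᵀΣ⁻¹1) spreads weight inversely to (co)variance, which translates naturally into spreading a marketing budget over channels with noisy, correlated returns. A sketch with an invented covariance matrix:

    # Invented covariance of period-over-period returns per channel.
    chan <- c("search", "social", "display")
    Sigma <- matrix(c(
      0.040, 0.010, 0.002,
      0.010, 0.090, 0.005,
      0.002, 0.005, 0.160
    ), 3, 3, dimnames = list(chan, chan))

    # Global minimum-variance weights: w proportional to Sigma^{-1} * 1.
    w <- solve(Sigma, rep(1, 3))
    round(w / sum(w), 3)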

Finally, on the algorithmic level, I cannot count the hours I spent in the past year with R and the thousands of stats packages out there on CRAN. Much like JavaScript, the R language has its fair share of quirks and idiosyncrasies and, well… you get used to it. Is it made for what it's used for today? Probably not, but that can be said of any scripting language, really. The main reason, as far as I am concerned, to use R is the unmitigated wealth of packages on CRAN, embracing both settled subjects and razor-sharp statistical and ML research. There is literally everything. Is R a good programming language? Nah. Horrible syntax. Horrible documentation. Dealing with large bodies of code requires discipline. Refactoring and unit testing are like driving a classic car: fun for a while, but you miss the speed and the gears. On the upside, I found Rcpp (C++ for R) to be a solution in many ways:

  • it binds the wealth of CRAN packages to the boundless world of C++ (including Boost and whatnot)
  • OMG, it is blazingly fast. It's not unusual to see speedup factors of 10, 20, 100 (see the sketch after this list)
  • it allows you to use OOP and all the things you miss sometimes in R
  • it potentially allows you to move away from RStudio, even though I think RStudio is really fine as an IDE for coding and writing reports (R Markdown and knitr really deliver)
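A taste of where those speedup factors come from: any tight, sequential loop that R cannot vectorize. A minimal sketch using Rcpp::cppFunction (the smoothing function is a made-up example):

    library(Rcpp)

    # Exponential smoothing has a data-dependent carry from one
    # iteration to the next, so it cannot be vectorized in plain R.
    cppFunction('
      NumericVector expSmooth(NumericVector x, double alpha) {
        int n = x.size();
        NumericVector out(n);
        out[0] = x[0];
        for (int i = 1; i < n; ++i)
          out[i] = alpha * x[i] + (1 - alpha) * out[i - 1];
        return out;
      }
    ')

    x <- rnorm(1e6)
    system.time(expSmooth(x, 0.1))  # typically far faster than the
                                    # equivalent for-loop written in R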

Yes, C++ is hot again from my angle. Nowadays I would not be surprised to see myself developing pure C. Times change.