Reproducible research is the idea that data analyses, and more generally scientific claims, are published together with their data and software code so that others may verify the findings and build upon them. In the narrower context of data science consultancy it means you package things in such a way that the customer can re-run and retrace all the steps you have taken with the same data, or re-run things with other (congruent) data sets. This sounds easy but the reality is more complex:
- usually reproducing results means integrating the data analysis in a (production) pipeline
- data science has an explicitly statistical nature, which means that retracing things never gives exactly the same result but should land within some margin. That margin and its context can be difficult to explain on their own
- including data and recalculating the analysis on this data is usually not possible in realtime (because e.g. handling 250 million rows is not a web-ready thing)
- articulating everything so that it can be consumed by human beings (as analytics reports) or machines (as web services) can be a formidable integration challenge; combining SAP, SharePoint, R code, Python bits and Web API services in one coherent chain is no small feat.
In the past year or so I have spent a considerable amount of time doing time series analysis and digital marketing optimization for one of my favorite customers. The result of all this was a huge amount of R code and processes which had to be integrated in some way into a larger product. The thing with R (and this is true for pretty much every math package I have used) is that it works great within its own boundary but fails to deliver when it comes to playing nice with other technologies. It’s not only about marshaling data types, no, it goes much deeper:
- should it run in-process with the host technology, say ASP.NET, or in its own process space
- how can you load-balance things or have a grid so that the analytical results are returned in a timely manner
- how to parametrize function calls so they can be cached
and similar operational issues. Various options are around:
- use Azure ML, which hides the Python or R code and turns things magically into a web service. The problem with this approach is the rather tedious way in which your R code has to be turned into a workflow diagram, and the difficulty of debugging it. Also, much like any other visual programming paradigm, it works well for small things, but a diagram flow with hundreds of nodes is not convenient.
- use DeployR from Revolution Analytics (now part of Microsoft), which offers a serious (enterprise-level) amount of tooling to turn any R code into a service. While absolutely terrific in general, it has the drawback that it does not deploy things as packages but rather as R files, thus leaving out unit testing and other good-to-have things.
- use OpenCPU, based on Apache and one smart guy working overtime. It is tightly integrated with standard R packages and RStudio, but falls short on scalability and support. Let’s call it serious open source with serious issues once you realize there is no company behind it.
- use the Domino Cloud (and the like), which also supports other languages such as Julia and IPython. It is not really an on-premise solution, however, and for many customers it is out of the question to have their data somewhere out there.
There is quite some flux in this domain and it will take a few years before the dust settles. Like any other language, R and Python are being embraced these days on many levels: from IDEs deep into the backend (SQL Server 2016 integrating R at its core). Ultimately there will be stable and reliable ways to turn analytical code into web services, but right now it’s all very chaotic. In the meantime, I picked out the four options above and went through the process of installing and evaluating them, with the failures and successes that entails. Along the way I discovered that
- having R unit tests is really a great thing even if sometimes testing statistical stuff on big data can be a pain
- creating R packages is not that difficult and I don’t understand why it isn’t promoted more. Using the devtools::use_data() trick to include a cache of data in a package is one of my favorites. An R package really is a complete unit of work: it includes the tests, the data, the code, the documentation and more. So, when it comes to deploying things, it’s a bit odd to see that vendors do not enforce the creation of packages (can you imagine deploying C# code as loose source files rather than as a dll?). It seems to me that deploying R files instead of R packages inevitably leads to a spaghetti of snippets and overlapping functionality. Easier deployment sometimes means less structure and less rigor.
- writing documentation is fun, and the combination of LaTeX and markdown is a terrific way to expose mathematical ideas inside the docs.
- RStudio is a terrible IDE if you are used to WebStorm, Visual Studio and the like.
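The “complete unit of work” point is visible from the file layout alone. Below is a minimal, purely illustrative package skeleton built from the shell (all names are made up; in practice you would scaffold this with devtools::create() and run the tests with devtools::test()):

```shell
# Sketch of a minimal R package layout: code, tests and metadata together.
# All names are illustrative.
mkdir -p mypkg/R mypkg/tests/testthat mypkg/data

# DESCRIPTION is the package manifest
cat > mypkg/DESCRIPTION <<'EOF'
Package: mypkg
Title: Illustrative Analytics Package
Version: 0.1.0
EOF

# The actual code under R/
cat > mypkg/R/predict.R <<'EOF'
# roxygen2 comments would generate the documentation here
predict_sales <- function(x) 2 * x
EOF

# A unit test that ships with the package
cat > mypkg/tests/testthat/test-predict.R <<'EOF'
test_that("prediction doubles its input", {
  expect_equal(predict_sales(3), 6)
})
EOF
```

With R installed, `R CMD build mypkg` then produces the single .tar.gz artifact that can be handed to a deployment tool, rather than a pile of loose R files.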
Before detailing a few things about the deployments I’d like to add my findings related to static linking of R with .Net:
- the R.NET library seems cool but is really unstable and lacks maturity. I don’t understand why large companies (Microsoft in particular) have not yet released a solid way to plug R into mainstream development. Considering that R is a rising star in the TIOBE index, it’s odd that it isn’t more fully integrated. In any case, yes, you can have in-process linking with C#, but things are painful and far from enterprise-ready.
- the F# type provider from Blue Mountain Capital is mature but has its own disadvantages. If you wish to access even a simple R expression (say, SomeData$Genetics$Sequences) you need lengthy statements and data marshaling. I like the IntelliSense and what type providers do, but you can still feel a huge boundary. Also, even once you have wrapped your R stuff in F#, there is still quite a gap towards, say, ASP.NET or NodeJS.
Since there is little info out there about setting up OpenCPU and DeployR, let me highlight the things I discovered along the way. My particular aims were:
- setting up web services which would wrap R-code and make it accessible to arbitrary clients (mobile, web etc.)
- to have a scalable solution both in terms of data size and in terms of clients
- to have, if possible, a continuous integration loop to minimize the transition between package development and public availability
- a security system which to some extent protects the intellectual property contained in the R bits
Let me straight away tell you that there is currently no perfect solution and that, as I said earlier, things are very much in flux.
The steps below focus on deploying an OpenCPU service on Azure, but you can equally well deploy things on a non-virtual Ubuntu machine or use Docker to set things up in a container. The OpenCPU image can be installed via the Docker client (Kitematic) or via the console; more details can be found here. Of course, the Docker setup only installs the basics: you will still need to install the necessary R packages and go through the configuration as well.
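For the Docker route, the base setup is essentially a one-liner. The sketch below assumes the opencpu/base image from Docker Hub and a working Docker install, and it still leaves the R packages and configuration up to you:

```shell
# Pull and run the OpenCPU base image; the container serves the /ocpu API on port 80
docker run -d -p 80:80 --name ocpu opencpu/base
# Install additional R packages inside the running container, e.g.:
# docker exec -i ocpu R -e 'install.packages("forecast")'
```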
- Take an Ubuntu 14.04 image on Azure and make sure that you take at least an A2 level Standard since RStudio needs quite some memory. Make sure you add an RDP, HTTP, SSH and FTP endpoint with the defaults. The endpoints allow one to communicate with the server.
After the install, connect via SSH and immediately run
sudo passwd root
This assigns a password to root. By default the root account is disabled, and after installing things the user assigned in Azure sometimes no longer works.
FYI: on a Mac, use Terminal > Shell > New Remote Connection to SSH in. On Windows you need to install PuTTY.
- Do an update (apt-get is the package manager on Ubuntu)
sudo apt-get update
- Install a desktop, since RDP does not work without one and none is installed by default. On version 14.04 the only working one is xfce4; do not try to install another one.
sudo apt-get install xfce4
- Install the RDP package
sudo apt-get install xrdp
If after restarting the server (for whatever reason) the SSH key no longer works on your client, run the following
ssh-keygen -R analyticsservices.cloudapp.net
- To install R, OpenCPU and RStudio Server, do the following (check the latest versions here)
sudo apt-get update
sudo apt-get install r-base
sudo add-apt-repository -y ppa:opencpu/opencpu-1.5
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install -y opencpu
sudo apt-get install opencpu-cache
sudo apt-get install -y rstudio-server
- To add a user (say, someone) to RStudio you need to add it to Ubuntu
sudo adduser someone
At this point you should have a fully functional OpenCPU server. RStudio Server can be accessed via the “http://yourname.cloudapp.net/rstudio” address and the ocpu server sits at “http://yourname.cloudapp.net/ocpu”. The latter address will display a test-service interface. The actual packages can be accessed with the prototypical address “http://yourname.cloudapp.net/ocpu/library/packagename/www”.
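With the server up, any function of an installed package is reachable over plain HTTP. The snippet below only assembles the call URL for stats::rnorm against a placeholder host; the commented curl line shows the actual POST you would issue against a live server:

```shell
# Build the OpenCPU endpoint for stats::rnorm; the /json suffix returns
# the function result directly as JSON ("yourname" is a placeholder)
HOST="yourname.cloudapp.net"
URL="http://$HOST/ocpu/library/stats/R/rnorm/json"
echo "$URL"
# Against a live server a POST executes the call:
# curl -s -d n=5 "$URL"
```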
To capture the Ubuntu image (to use it as a VM template on Azure) have a look at this article
Adding packages to R can be done from an R session with install.packages() or via the RStudio Server interface.
When trying to add packages via the RStudio web interface, the library directory may not be writeable; use the -R option to set permissions recursively
sudo chmod -R 777 /usr/lib/R/
The same should be done for any other directory which RStudio cannot access.
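A blanket 777 works but is the sledgehammer option; group-writable 775 permissions are usually sufficient. The scratch-directory sketch below shows the recursive change without touching the real library:

```shell
# Demonstrate a recursive permission change on a scratch directory
# (on the server the target would be /usr/lib/R/ instead)
mkdir -p /tmp/Rlib-demo/sub
chmod -R 775 /tmp/Rlib-demo
stat -c '%a' /tmp/Rlib-demo/sub
```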
To download a file to the current location via the RStudio shell
curl -O https://cran.rstudio.com/src/contrib/evaluate_0.7.2.tar.gz
Installing sometimes fails and a lock directory stays behind; you can remove it with
rm -rf /usr/lib/R/library/00LOCK-git2r
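This cleanup generalizes: every interrupted install leaves a 00LOCK-<package> directory at the top of the library, and find can sweep them all at once. The sketch below runs against a scratch directory so it is safe to try:

```shell
# Simulate the stale lock an interrupted git2r install leaves behind
LIB=/tmp/demo-R-library
mkdir -p "$LIB/00LOCK-git2r"
# List, then remove, all stale lock directories in one sweep
find "$LIB" -maxdepth 1 -type d -name '00LOCK-*'
find "$LIB" -maxdepth 1 -type d -name '00LOCK-*' -exec rm -rf {} +
# On the server the library lives at /usr/lib/R/library instead
```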
Regarding DeployR and Azure, things are easier. You can use both Linux and Windows machines and the documentation is more explicit. In this case I took a Windows 2008 R2 SP1 machine and went through the prerequisites. The Java runtime seemed like an easy step but in the end caused a whole lot of confusion. The documentation states that you should install the Java “server” JRE, but you should actually install the normal Java JRE. There are three types of Oracle Java downloads: JDK, JRE and Server JRE. Take the standard JRE.
The tricky bit is to make things available outside the Azure box by configuring the Windows firewall, the Azure VM endpoints and the DeployR internal public address.
The DeployR server is hosted in Tomcat and communicates over port 7400. Eventually you want to tell DeployR what its public address is, but until that is functional you need to tell it to use the internal Azure IP address. Since it figures out the internal IP on its own by default, you should be able to see the administrative console inside a remote desktop session. Of course, you need to make sure in the VM configuration that the RDP endpoint is enabled.
Add two additional endpoints:
- a TCP endpoint with public port 80 going to private port 7400. This makes the default public address go to the Tomcat site.
- a TCP endpoint with public port 7406 and private port 7406. This allows the DeployR events to talk to the DeployR site.
These endpoints allow clients outside Azure to communicate with the VM. By default, however, the Windows firewall will block access. If you open the advanced firewall settings you will notice that the DeployR setup has added various entries. The important thing is to enable the private and public access rules, which are disabled by default. You should do the same with the other firewall rules.
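The same firewall rules can be scripted rather than clicked through. A sketch for an elevated PowerShell prompt on the Windows VM, using the stock netsh tool (the rule names are illustrative):

```shell
# Allow inbound traffic to Tomcat (7400) and the DeployR event port (7406);
# run in an elevated PowerShell prompt on the Windows VM
netsh advfirewall firewall add rule name="DeployR Tomcat" dir=in action=allow protocol=TCP localport=7400
netsh advfirewall firewall add rule name="DeployR Events" dir=in action=allow protocol=TCP localport=7406
```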
Side info: the security settings on Internet Explorer drove me crazy on Win2008 and I installed Chrome in order to have a reasonable browsing experience. IE remains the most annoying bit of software ever.
You need to change the default “change” password for the admin and testuser accounts. Once through all this you can experiment and upload things ad libitum. There are some oddities here and there, but overall the whole experience is really pleasant. The documentation around client development is fairly complete, while the documentation around R development is confusing:
- DeployR refers to R packages but really means R snippets or files. R packages are not deployed; you can only deploy R files.
- you don’t need to do anything special (beyond deployrInput/revoInput and the like) to turn your R code into a service. Unlike OpenCPU, there is no need to build things into a package.
- you should use the Repository Manager to deploy scripts. There is an “R scripts” section in the admin console, but for some reason it will tell you there are “no scripts” if you attempt to upload an R file there.
The ability to test-run scripts and see the JSON data and response/request details is really helpful. The Broker clients are also easy to use, so all in all I think Microsoft made a good move by acquiring Revolution Analytics.