Data Science DevOps and Docker
Data scientists sometimes have to (help) “productionize” their work, i.e. integrate data analysis, dashboards, and predictive models into a larger process or software pipeline. For example, imagine a system that (1) monitors for a data change, (2) triggers a data analysis process whenever a change happens, and (3) uses the output of the analysis to render a webpage and/or store output parameters in a database for other systems to use.
Data scientists typically work on part (2), prototyping a bunch of R or Python code. But when it’s time to build and deploy the system, integrating such data science code is not trivial. A big challenge is that a data scientist’s work environment (e.g. a MacBook with R and many, many, many R packages) is typically very different from the “deployment” environment (e.g. a Linux box on AWS EC2 or a corporate VM). Installing R and a bunch of dependency R libraries on that machine is frowned upon by ops and software engineers, since it’s usually a painful, fragile process.
In an ideal world, the R / Python code a data scientist developed on their laptop would “just work” when dropped on the deployment server(s). Too good to be true? Well, that ideal world is already here, thanks to a fantastic technology called “Docker”. Using Docker, a data science analysis or prototype can get very close to something that can be deployed fast and efficiently, just as DevOps helped developers productionize and operationalize their work. We can even call it “data science DevOps”.
Essentially, the first step toward data science DevOps consists of two practices:
- Make the R code into a command-line script that can be executed via Rscript, preferably with a proper option parser like R argparse. This has the added benefit of forcing reproducibility. It also pushes data scientists to think in terms of APIs and the “do one thing well” (UNIX philosophy) mentality, which leads to cleaner code structure (a minimal sketch follows this list).
- Dockerize the R application. Start with, e.g., rocker/verse and add to / modify the Dockerfile as needed (see the example Dockerfile after this list).
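To make the first practice concrete, here is a minimal sketch of such a command-line R script. The file name (run_analysis.R), the arguments, and the toy summarization logic are all hypothetical; the point is the Rscript + argparse pattern.

#!/usr/bin/env Rscript
# run_analysis.R -- hypothetical example: an R analysis wrapped as a command-line script.
# Usage: Rscript run_analysis.R --input data.csv --output summary.csv
suppressPackageStartupMessages(library(argparse))

parser <- ArgumentParser(description = "Toy pipeline: summarize numeric columns of a CSV")
parser$add_argument("--input", required = TRUE, help = "path to the input CSV file")
parser$add_argument("--output", required = TRUE, help = "path to write the summary CSV")
args <- parser$parse_args()

# Do one thing well: read, summarize, write.
df <- read.csv(args$input)
num <- df[, sapply(df, is.numeric), drop = FALSE]
out <- data.frame(column = names(num),
                  mean = sapply(num, mean, na.rm = TRUE),
                  sd = sapply(num, sd, na.rm = TRUE))
write.csv(out, args$output, row.names = FALSE)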
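And a minimal sketch of the second practice. The base image follows the rocker/verse suggestion above; the extra package and the entrypoint script are carried over from the hypothetical sketch, not part of any standard recipe.

# Dockerfile (hypothetical) -- packages the sketch script above into a runnable image.
FROM rocker/verse
# The R argparse package calls out to Python, so make sure python3 is present.
RUN apt-get update && apt-get install -y --no-install-recommends python3 && rm -rf /var/lib/apt/lists/*
# Install the extra R packages the script needs (just argparse in this sketch).
RUN install2.r --error argparse
# Copy the analysis script in and make it the entrypoint, so arguments
# passed to `docker run` go straight to the script.
WORKDIR /app
COPY run_analysis.R /app/run_analysis.R
ENTRYPOINT ["Rscript", "/app/run_analysis.R"]

Build and push the image once (the image name your-org/my-r-app is purely illustrative, matching the commands below):
$ docker build -t your-org/my-r-app .
$ docker push your-org/my-r-app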
With the above two practices, on any machine with Docker installed, the app could be "deployed" like:
$ docker pull your-org/my-r-app
and could be run like:
$ docker run your-org/my-r-app ARG1 ARG2 ...
The beauty is that it will run in ANY environment where Docker is installed: your Mac or Windows laptop, an EC2 Linux host, your corporate VM, and so on.
Is it actually easy? NO. You will probably need to spend a good ~100 hours before you feel at home writing your own Dockerfiles with confidence. Is learning how to Dockerize an R app helpful? YES, very much so. Once you make a habit of developing your R analysis pipelines in a Dockerized, reproducible setting, your code will be cleaner, more reproducible, and far easier to deploy. Your dev / ops coworkers will thank you, and your and your team’s productivity will improve (YMMV).
So, dear fellow data scientists: install Docker and start learning how to use it. Welcome to data science DevOps.