R on Kubernetes - serverless Shiny, R APIs and scheduled scripts

Why run R on Kubernetes?

Kubernetes is a free and open-source utility to run jobs within a computer cluster. It abstracts away the servers the jobs are running on so you need only worry about the code to run. It has features such as scheduling, auto-scaling, and auto-healing to replace nodes if they breakdown.

If you only need to run R on a single machine, then its probably a bit OTT to use Kubernetes, but if you are starting to work with multiple Docker containers and/or VMs it gets more and more attractive to have a way to easily orchestrate them.

Kubernetes works via Docker containers, so if you are already familiar with using Docker for abstracting away code environments, it should be a short step up to abstracting away the computers those Docker containers run upon.

R is a good mix with Docker since it provides a way to have stable production environments in a landscape of shifting dependencies given R’s fast changing open source foundation. There is a growing ecosystem to help R users with Docker including the Rocker Project, containerit and harbor, and Rhys Jackson has also crafted a Kubernetes client for R in googleKubernetesR.

If your R scripts are deployed in Docker containers, you can use them in Kubernetes. This includes applications such as R powered APIs using OpenCPU or plumber, Shiny apps, batch R jobs that can scale horizontally over many CPUs, or scheduled analysis.

What you gain is reliable, flexible, production ready R applications, that will scale, run on many cloud providers including Google, and once set up easy to deploy - in most cases pushing to GitHub can be the trigger to serve your new code.

Alternatives

This article deploys to Google Kubernetes Engine (GKE), but there are many other methods to cover some of Kubernetes functionality, that you may prefer.

  • Kubernetes itself can run on all the cloud providers aside Google Cloud Platform, as well as your own servers.
  • Google App Engine Flexible is a more managed platform that takes care of some details, but you lose control of others. I wrote about this previously for running R APIs and a demo is in this GitHub repo.
  • Docker Compose is an equivalent service that only runs on one host, Docker Swarm runs on multiple hosts and most comparible to Kubernetes.
  • Azure Container Service is a more managed Kubernetes platform than GKE

Regardless which one you use, the principles are similar in that they all build on Docker containers, and one of the advantages of Kubernetes is you can move to other providers a lot more easily, since Docker/Kubernetes is fast becoming the standard for cloud developments.

Setup background

The below is a setup I have used for the above applications, and may be able to help if you are looking for similar functionality.

I ran into several gotchas whilst developing it, so hopefully it will help you avoid some of those. Most vexing for me was finding a way to serve different R scripts on one kubernetes cluster on different URL paths (e.g. /r-script1, /shinyapp2, /r-api-3, etc) - many thanks to Rhys’ help with that. The answer was that (at the time of writing) the default Google Kubernetes Engine doesn’t support rewriting URLs, so instead its better to install another application on your cluster to take care of the above URL serving (termed ingress fanout), namely using nginx ingress instead of the default GKE ingress.

This article then demonstrates:

  1. Creating a Kubernetes cluster for your R applications
  2. Installing Helm (and Tiller) to aid with a nginx ingress installation
  3. Deploying R containers, with examples for Shiny, an R API and an R scheduled script
  4. Serving up those R containers on nice URLs

References

These were used to help develop the below:

Part 1 - Creating an R cluster

Follow the setup steps to authenticate with gcloud and kuberctl then create your cluster via:

gcloud container clusters create r-cluster --num-nodes=3

…or if using googleKubernetesR :

library(googleKubernetesR)

## create a 3 node cluster called r-cluster with defaults
createCluster(projectId = gcp_get_project(), zone = "europe-west3-b")

You’ll need to wait a bit as it provisions all the VMs.

Most of the below will use the terminal/shell for working rather than R, but in the future a lot of this may be possible via googleKubernetesR within an R session.

Set up your shell to get the credentials for Kuberctl:

gcloud container clusters get-credentials r-cluster

And we are all set to start setting up this cluster for R jobs!

Here is a screenshot from the web UI of what it should look like:

Part 2 - Install the nginx ingress controller

The first recommended setup is to enable an ingress controller.

Image from nginx

This is a pod that directs internet traffic to the right R container we set up later.

We install the ingress controller via Helm and its counterpart Tiller on the cluster.

Helm is a package manager for Kubernetes.

You can skip installing Helm/Tiller/ingress controller if you don’t need to have R apps on a URL, for instance if its just scheduled cron jobs you want on your cluster, but it is recommended if you want to serve up APIs and Shiny.

The below is mostly copied straight from Bitnami’s article.

Install Helm

The below assumes RBAC rules are enabled, which is true from Kubernetes version 1.8 onwards.

curl -o get_helm.sh https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get
chmod +x get_helm.sh
./get_helm.sh

Install Tiller on cluster

Step 1: Create the Tiller service account

Create a tiller-serviceaccount.yaml file using kubectl:

kubectl create serviceaccount tiller --namespace kube-system

Step 2: Bind the Tiller service account to the cluster-admin role

Create a tiller-clusterrolebinding.yaml file with the following contents:

      kind: ClusterRoleBinding
      apiVersion: rbac.authorization.k8s.io/v1beta1
      metadata:
        name: tiller-clusterrolebinding
      subjects:
      - kind: ServiceAccount
        name: tiller
        namespace: kube-system
      roleRef:
        kind: ClusterRole
        name: cluster-admin
        apiGroup: ""

Deploy the ClusterRoleBinding:

kubectl create -f tiller-clusterrolebinding.yaml

Step 3: Update the existing Tiller deployment

Update the existing tiller-deploy deployment with the Service Account you created earlier:

helm init --service-account tiller --upgrade

Wait a few seconds for the Tiller server to be redeployed.

Step 4: Test the new Helm

All being well, you should be able to execute this command without errors:

helm ls

And that’s it! You have configured Helm in your Kubernetes cluster.

Step 5: Deploy ngnix Ingress Controller

Now you can deploy the nginx ingress controller using Helm to take care of details:

helm install --name nginx-ingress stable/nginx-ingress --set rbac.create=true

You should see your IP when you issue:

kubectl get service nginx-ingress-controller
#> NAME                       CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
#> nginx-ingress-controller   10.7.253.89   <pending>     80:32713/TCP,443:31061/TCP   23s

Part 3 - Deploying R Workloads!

Now the cluster is all setup for you to deploy your R containers. You can install other containers in any other language too, such as normal HTML webapps or utilties.

We deploy them using the command kubectl run, and go through some examples below.

A: Shiny apps

Below we deploy two Shiny apps, in this case the googleAuthR demo app that is configured to run on port 3838, in a /shiny/ folder, and a wordcloud Shiny app by Flavio Barros that has had its Dockerfile configured to run on port 80 in the root folder, to demonstrate how to handle different Docker deployments.

kubectl run shiny1 --image gcr.io/gcer-public/shiny-googleauthrdemo:latest --port 3838
kubectl run shiny2 --image=flaviobarros/shiny-wordcloud --port=80

Expose containers as node ports

At the moment we are only selecting which services the cluster will run - we won’t be able to see them yet until we expose the apps and configure the ingress rules

These will be pointed to via the nginx ingress controller.

We first open up their ports as nodeports, so the ingress can see them:

kubectl expose deployment shiny1 --target-port=3838  --type=NodePort
kubectl expose deployment shiny2 --target-port=80  --type=NodePort

Create Ingress with ngninx

This lets you be able to call the different services from one ip address.

Save this as r-ingress-nginx.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: r-ingress-nginx
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/rewrite-target: / 
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  rules:
  - http:
      paths:
      - path: /wordcloud/
      # app deployed to /wordcloud/
        backend:
          serviceName: shiny2
          servicePort: 80
      - path: /gar/
      # app deployed to /gar/shiny/
        backend:
          serviceName: shiny1
          servicePort: 3838

Deploy via

kubectl apply -f r-ingress-nginx.yaml

Verify and keep refreshing until it has an ip address assigned in the ADDRESS field (5mins)

kubectl get ingress r-ingress-nginx

See your deployed containers

The Nginx ingress controller takes care of any URL path not covered in the r-ingress-nginx ingress. By default this should give a default backend - 404 message.

  • /gar/shiny/ is served by the googleAuthR demo app. /gar/ covers the standard Shiny welcome screen, and /shiny/ is the folder that is configured in the Dockerfile
  • /wordcloud/ holds the Wordcloud Shiny app. In this case, the Shiny app is configured to appear in the root directory of the /wordcloud/ URL.

Here are some screenshots:

googleAuthR app

Wordcloud app

Default webpage if not defined in ingress

Note it matters where you put the app in your Dockerfile. The nginx.ingress.kubernetes.io/rewrite-target: / line in the ingress lets you specify a URL that is relative to your specified URLs.

If you want to have your app served on the absolute URL specified in the Dockerfile, see below for a no-redirect version.

No redirects

Without redirects work too, but the Shiny app needs to have the correct URL path configured in its Docker that matches the URLs you are specifying in the Ingress e.g. for our examples as the googleAuthR demo is in path /shiny/ and the Wordcloud demo in the root /, the Ingress rules need to match those. Beware clashes where two Shiny apps have configured to the same URL path, e.g. root /

Save this as r-ingress-nginx-noredirect.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: r-ingress-nginx
  annotations:
    kubernetes.io/ingress.class: nginx
    #nginx.ingress.kubernetes.io/rewrite-target: / # not used in this example
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  rules:
  - http:
      paths:
      - path: /
        backend:
          serviceName: shiny2
          servicePort: 80
      - path: /shiny/
        backend:
          serviceName: shiny1
          servicePort: 3838

Deploy via

kubectl apply -f r-ingress-nginx-noredirect.yaml

Again, the Nginx ingress controller takes care of any URL path not covered in the r-ingress-nginx ingress.

However now:

  • /shiny/ is served by the googleAuthR demo app.
  • / holds the Wordcloud Shiny app.

B: R APIs - OpenCPU

Kubernetes is well suited to serving R APIs as they can auto-scale to demand, covering peaks and troughs in demand.

A demo API is shown below, using openCPU for predicting website clicks, but the same principle applies if using plumber or another framework by swapping out the Dockerfile to serve the API on a port.

This assumes you have configured the ingress controller above

kubectl run opencpu-demo --image gcr.io/gcer-public/opencpu-demo:latest --port 8004

Expose the port as before:

kubectl expose deployment opencpu-demo --target-port=8004  --type=NodePort

Now we do another Ingress deployment for the R-API, in addition to the Shiny apps above.

Save the below as r-ingress-opencpu.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: r-ingress-nginx-ocpu
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  rules:
  - http:
      paths:
      - path: /ocpu/
        backend:
          serviceName: opencpu-demo
          servicePort: 8004

…and deploy via:

kubectl apply -f r-ingress-opencpu.yaml

Your ip address will be as shown as your ADDRESS when you issue:

kubectl get ingress r-ingress-nginx-ocpu
#> NAME                   HOSTS     ADDRESS         PORTS     AGE
#> r-ingress-nginx-ocpu   *         35.205.249.34   80        10m

Note no rewriting this time as OpenCPU takes care of itself on /ocpu/ folder.

Use your OpenCPU installation

Assuming your IP is 1.2.3.4; we can see the OpenCPU is installed via 1.2.3.4/ocpu/info and the test page at 1.2.3.4/ocpu/test/

For our example, the custom package uploaded is available at 1.2.3.4/ocpu/library/predictClickOpenCPU

We use the API for this package by creating a POST hit to API endpoint

1.2.3.4/ocpu/library/predictClickOpenCPU/R/predictMarkov/json

…with the parameter pageview_names=["/example/96","/example/213","/example/107"]

Constructing the above in the terminal with curl, we test the API is functioning:

curl --data 'pageview_names=["/example/96","/example/213","/example/107"]' http://35.233.42.6/ocpu/library/predictClickOpenCPU/R/predictMarkov/json
#> {
#  "page": ["/example/251"],
#  "probability": [0.5657]
#> }

Web API endpoints can also be used as webhooks, for instance to react and do something when something is pushed to GitHub for continuous integration purposes.

C: Scheduled R - Tasks via cronJobs

Kuberenetes has a special option for deploying scheduled tasks, called cronJobs.

When configuring your Dockerfile, you can choose so it runs entirely self-contained, perhaps calling an API and writing the result to a database. For each new job you can make a new Dockerfile and build them on something like build triggers.

In that case the Dockerfile should run with all configuration files included in the Docker build. One advantage of running on GKE is that the job can be authenticated with the same project via googleAuthR::gar_gce_auth() so will have access to Google Cloud Storage, BigQuery etc.

An alternative is to configure the Dockerfile to rely on a config file, and have that config file mounted as a volume in the pod - that isn’t covered here yet.

Creating an executable R Dockerfile

This is an example Dockerfile that installs dependencies and R packages, loads a local configuration file then runs a custom function when it is called upon by the cronJob:

FROM rocker/verse
MAINTAINER Mark Edmondson (r@sunholo.com)

# install R package dependencies
RUN apt-get update && apt-get install -y \
    gcc gsl-bin libblas-dev \
    ## etc....
    ## clean up
    && apt-get clean \ 
    && rm -rf /var/lib/apt/lists/ \ 
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

# any necessary environment arguments
ENV SECRET_ARG hellomum

# Install packages from CRAN and GitHub
RUN install2.r --error \ 
    -r 'http://cran.rstudio.com' \
    googleAuthR googleAnalyticsR bigQueryR \
    && Rscript -e "devtools::install_github('MarkEdmondson1234/youtubeAnalyticsR')" \
    && Rscript -e "devtools::install_github('yourprivate/customPackage')" \
    ## clean up
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

# my_config.yaml is in same folder as Dockerfile
COPY my_config.yaml my_config.yaml

## this runs when this Docker container is called
CMD ["Rscript", "-e", "customPackage::do_something('my_config.yaml')"]

Running cronJobs

For cronJobs, specify the schedule argument, and we change the restart policy to only do so on failure.

kubectl run scheduled-r --schedule="23 4 * * *" --restart=OnFailure --image=gcr.io/xxxxxx/private_docker_image:latest

That is getting a bit complicated for the terminal , so I prefer to deploy via the file method:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: scheduledr
spec:
  schedule: "23 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: my_image
            image: gcr.io/xxxxxx/private_docker_image:latest
          restartPolicy: OnFailure

Save the file as say cron-r.yaml and deploy via:

kubectl create -f ./cron-r.yaml

You should then see the job has been created via

kubectl get cronjob scheduledr

And in the Web UI under Workloads

Part 4 - Going further

This article turned into a lot more words than I intended, so I’ll stop now, but see below on some more things I could have written about:

Updating your cluster

Its a bit of configuration to get there, but once deployed since the services are running off Docker, and those in turn are built using build triggers that are from GitHub pushes, it means that changes to scripts can all be controlled via GitHub pushes and branches, which you are probably using anyway for version control.

When you change a configuration file and want to deploy, use kubectl apply instead of kubectl create to change the settings.

Its best practice to specify a dev and production branch to your GitHub Docker images, so you can safely test containers before you break your production.

Secrets and https / SSL configuration

Kubernetes supports more robust methods to guard your secrets that including them in the Dockerfiles - see here for proper treatment.

Related to that is serving over https. In this case I found Bitnami the easiest method to follow, using Helm again to configure TLS with LetsEncrypt and Kube-lego.

Auto scaling

Launching the cluster with these flags enables auto-scaling

gcloud container clusters create r-cluster --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=3

On Google Kubernetes Engine this scales down to 0, which should be cheaper than running a few CPUs running 247, and you can configure when new nodes are launched such as when it hits 80% CPU.

Parallel tasks

For big parallel processing tasks, Kubernetes is an option that can launch as many CPUs you need.

However, you will need to configure your workload so that each service works on a different part of the data, so some form of central ledger is needed that keeps track of which rows on the database or files have been treated, and to remove them from the input files once completed.

A strategy to handle this is to use message brokers that specify which work needs to be done. An example of using a message broker is here.

And finally…

After all that, do remember to tear down your cluster and/or load balancers when finished with, as you are charged by minute of uptime per node. Most tutorials include this at the end - you can either delete in the web UI but to be complete here:

  1. Delete the Ingress: This deallocates the ephemeral external IP address and the load balancing resources associated to your application:
kubectl delete ingress your-ingress
  1. Delete the cluster: This deletes the compute nodes of your container cluster and other resources such as the Deployments in the cluster:
gcloud container clusters delete r-cluster

Summary

When working with Kubernetes documentation you can tell that it is a fast changing developing product, which can be a bit frustrating when looking for solutions, but I anticipate the APIs to settle down in the future.

This should hopefully serve as a good jumping off point for you to develop your own Kubernetes R cluster. Having got Kubernetes under my belt I now feel fully tooled up for deploying R (and other applications) at scale, and in an exciting and developing framework that should stand in good sted for a few years to come.

Share Comments
comments powered by Disqus