R at scale on the
Google Cloud Platform

Mark Edmondson (@HoloMarkeD)

Sep 13th, 2019 - GDG New York

code.markedmondson.me

Credentials

My R Timeline

  • Digital agencies since 2007
  • useR since 2012 - Motive: how to use all this web data?
  • Shiny enthusiast e.g. https://gallery.shinyapps.io/ga-effect/
  • Google Developer Expert - Google Analytics & Google Cloud
  • Several Google API themed packages on CRAN via googleAuthR
  • Part of cloudyr group (AWS/Azure/GCP R packages for the cloud) https://cloudyr.github.io/
  • Now: Data Engineer @ IIH Nordic

GA Effect

googleAuthRverse

  • searchConsoleR
  • googleAuthR
  • googleAnalyticsR
  • googleComputeEngineR (cloudyr)
  • bigQueryR (cloudyr)
  • googleCloudStorageR (cloudyr)
  • googleLanguageR (rOpenSci)

Slack group to discuss the packages: #googleAuthRverse

Scale (almost) always starts with Docker containers

Dockerfiles from The Rocker Project

https://www.rocker-project.org/

Maintains useful R images:

  • rocker/r-ver
  • rocker/rstudio
  • rocker/tidyverse
  • rocker/shiny
  • rocker/ml-gpu

Thanks to Rocker Team

Dockerfiles

Docker + R = R in Production

  • Flexible: no need to ask IT to install R in multiple places, just use docker run; works across cloud platforms; an ascendant technology

  • Version controlled: no worries that new package releases will break code

  • Scalable: run multiple Docker containers at once; fits into an event-driven, stateless, serverless future

Creating Docker images with Cloud Build

Continuous development with GitHub pushes

Images versioned in private repository

Scaling R scripts, Shiny apps and APIs

Strategies to scale R

  • Vertical scaling - increase the size and power of one machine
  • Horizontal scaling - split up your problem into lots of little machines
  • Serverless scaling - send your code + data to the cloud and let the provider work out how many machines are needed

Vertical scaling

Bigger boat

Bigger VMs

Good for one-off workloads

Pros

Can probably run the same code with no changes
Easy to set up

Cons

Expensive
May be better to keep the data in a database

Launching a monster VM in the cloud

3.75TB of RAM: $423 a day (compare ~$1 a day for standard tier VM)
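
As a rough sketch of how such a VM might be launched from R: the wrapper below calls `googleComputeEngineR::gce_vm()` with its `template`, `predefined_type`, `username` and `password` arguments. The wrapper name `launch_rstudio_vm` is hypothetical, and the machine type `n1-ultramem-160` in the commented call is an assumption about which type matches the 3.75TB figure. Nothing is launched until the function is called with a configured GCP project.

```r
# Hypothetical wrapper: defining it launches nothing and incurs no cost.
launch_rstudio_vm <- function(name, machine_type, username, password) {
  googleComputeEngineR::gce_vm(
    name = name,
    template = "rstudio",            # RStudio Server template image
    predefined_type = machine_type,  # GCE machine type, e.g. a high-memory one
    username = username,
    password = password
  )
}

# Example call, commented out because it needs GCP credentials and is billed:
# vm <- launch_rstudio_vm("big-rstudio", "n1-ultramem-160",
#                         username = "mark", password = "choose-a-password")
# googleComputeEngineR::gce_vm_stop(vm)  # stop it again to stop paying
```

Stopping the VM as soon as the job is done matters at this price point.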

RStudio Server

Standard VM serving Shiny

Cloud computing considerations

  • You are only charged for uptime and can configure lots of VMs, so…
  • Have lots of specialised VMs (Docker images), not one big workstation
  • Keep code and data separate, e.g. with googleCloudStorageR or bigQueryR
  • Think of VMs as functions of computing power
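
To make the "only charged for uptime" point concrete, here is a small arithmetic sketch using the two daily prices quoted on the slides. It assumes cost scales linearly with uptime, which is a simplification; the helper name `uptime_cost` is made up for illustration.

```r
# Daily prices quoted on the slides, in USD.
daily_price <- c(monster_vm = 423, standard_vm = 1)

# Assumption: cost is prorated linearly by the hours the VM is up.
uptime_cost <- function(price_per_day, hours) {
  price_per_day / 24 * hours
}

# A two-hour one-off job on the monster VM
two_hour_job <- uptime_cost(daily_price[["monster_vm"]], hours = 2)

# A standard VM left running for a 30-day month
always_on_month <- uptime_cost(daily_price[["standard_vm"]], hours = 24 * 30)

print(c(two_hour_job = two_hour_job, always_on_month = always_on_month))
```

Under that assumption, a short burst on the monster VM costs about the same as a month of a small always-on VM, which is why treating VMs as disposable is attractive.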

Horizontal scaling

Lots of little machines can accomplish great things

Parallelise your code

Good for parallelisable or scheduled data tasks

Pros

Fault redundancy
Forces repeatable/reproducible infrastructure
library(future) makes parallel processing very usable

Cons

Need to change your code to fit split-map-reduce
Need to write meta code to handle I/O of data and code
Not applicable to some problems

Adopt a split-map-reduce mindset

  • Break problems down into stateless lumps
  • Reusable bricks that can be applied to other tasks
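
A minimal sketch of the split-map-reduce pattern in base R, on made-up data. The map step is a pure function of its chunk, so the `lapply()` call is the piece that could later be swapped for a parallel equivalent. The data and the function name `summarise_chunk` are illustrative assumptions.

```r
# Toy data: daily hits for three hypothetical websites
set.seed(1)
visits <- data.frame(
  site = rep(c("a", "b", "c"), each = 10),
  hits = rpois(30, lambda = 100),
  stringsAsFactors = FALSE
)

# SPLIT: one independent chunk per site
chunks <- split(visits, visits$site)

# MAP: a stateless function that depends only on the chunk it is given
summarise_chunk <- function(chunk) {
  data.frame(site = chunk$site[1],
             mean_hits = mean(chunk$hits),
             n = nrow(chunk),
             stringsAsFactors = FALSE)
}
mapped <- lapply(chunks, summarise_chunk)

# REDUCE: combine the per-chunk results into one table
result <- do.call(rbind, mapped)
print(result)
```

Because each chunk is processed without reference to the others, the same brick can be reused on other tasks or run on separate machines.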

Set up a cluster

New in googleComputeEngineR v0.3 - a shortcut that launches the cluster and checks authentication for you

library(future)

googleComputeEngineR has a custom method for future::as.cluster
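
The sketch below shows the general shape of running stateless tasks on a cluster of workers. It swaps in the base `parallel` package with two local worker processes so that it runs without extra installs or cloud credentials; it is not the googleComputeEngineR workflow itself. With `future` installed, a cluster object like this one can also be handed to `future::plan(future::cluster, workers = cl)`.

```r
library(parallel)  # ships with base R

# Assumption: two local worker processes stand in for remote VMs.
cl <- makePSOCKcluster(2)

# Stateless task: the result depends only on the argument.
slow_square <- function(x) {
  Sys.sleep(0.1)  # pretend this is expensive
  x^2
}

# Map the task over the workers; always shut the cluster down afterwards.
squares <- tryCatch(
  parLapply(cl, 1:8, slow_square),
  finally = stopCluster(cl)
)

# Reduce the per-task results to a single value.
total <- Reduce(`+`, squares)
print(total)
```

The task code does not need to know whether its workers are local processes or VMs; only the cluster setup changes.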

Forecasting example

Multi-layer future loops

future loops can be nested in layers (use each CPU within each VM)

Thanks to Grant McDermott for figuring out the optimal method (Issue #129)

CPU utilization

3 VMs, 8 CPUs each = 24 threads (~$3 a day)

Serverless scaling

Previously we covered:

Clusters of VMs + Docker = Horizontal scaling

Kubernetes

Clusters of VMs + Docker + Task controller = Kubernetes

Kubernetes

Good for Shiny

Pros

Auto-scaling, task queues etc.
Scale to billions
Potentially cheaper
May already have cluster in your organisation

Cons

Needs stateless, idempotent workflows
Message broker?
Minimum 3 VMs
Can get complicated

Dockerfiles for Shiny apps

Built on Cloud Build upon GitHub push

Kubernetes deployments - Shiny

Expose your workloads via Ingress

Shiny apps waiting for service

IIH Nordic’s Shiny Apps

New! Cloud Run

Cloud Run

  • Built on top of Kubernetes via Knative
  • Managed Container-as-a-Service

Cloud Run Pros/Cons

Good for R APIs

Pros

Auto-scaling
Scale from 0
Simple to deploy
HTTPS and authentication built in

Cons

Needs stateless, idempotent workflows
No websockets yet (no Shiny :( )

plumber APIs

https://www.rplumber.io/

Make an API out of your script:
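
A minimal sketch of what such a file might look like. The `#*` comments are plumber annotations; the endpoint path `/forecast` and the naive `forecast_mean` function are hypothetical stand-ins, not the model used in the talk. Keeping the logic in an ordinary function means it can be tested without starting a web server.

```r
# api.R (hypothetical): plain R logic first, plumber endpoint second.

# Stand-in "model": repeat the mean of the inputs for `horizon` periods.
forecast_mean <- function(values, horizon = 3) {
  values <- as.numeric(values)
  horizon <- as.integer(horizon)
  rep(mean(values), horizon)
}

#* Naive forecast of the supplied values
#* @param values Comma-separated numbers, e.g. 1,2,3
#* @param horizon Number of periods to forecast
#* @get /forecast
function(values = "1,2,3", horizon = 3) {
  # Query parameters arrive as strings, so parse them before use.
  parsed <- strsplit(values, ",", fixed = TRUE)[[1]]
  list(forecast = forecast_mean(parsed, horizon))
}
```

When plumber reads the file, the annotated function becomes the handler for GET requests to `/forecast`.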

Adapt plumber API for the model

htmlwidgets

@serializer htmlwidget is great

http://gallery.htmlwidgets.org/

Adapt plumber API for the model - test in local R session

Creates a webserver to run the R code.

Cloud Run deployment

See cloudRunR

Cloud Run Dockerfile

Based on:

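FROM trestletech/plumber
LABEL maintainer="mark"

# Copy the API script (and anything else in the build context) into the image
COPY [".", "./"]

# commandArgs()[4] is the script name supplied by CMD;
# Cloud Run passes the port to listen on via the PORT environment variable.
# The exec-form string must stay on one line.
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb(commandArgs()[4]); pr$run(host='0.0.0.0', port=as.numeric(Sys.getenv('PORT')))"]
CMD ["api.R"]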
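
The entrypoint above reads the port from the `PORT` environment variable, which is unset in a local R session. As a small sketch, a helper like the hypothetical `resolve_port()` below falls back to a default when the variable is missing; 8080 is used here because it is Cloud Run's default container port.

```r
# Hypothetical helper: use PORT when set, otherwise a local default.
resolve_port <- function(default = 8080) {
  port <- suppressWarnings(as.integer(Sys.getenv("PORT", unset = "")))
  if (is.na(port)) default else port
}

# Simulate the hosted environment, where PORT is set
Sys.setenv(PORT = "9000")
hosted_port <- resolve_port()

# Simulate a local session, where PORT is unset
Sys.unsetenv("PORT")
local_port <- resolve_port()

print(c(hosted = hosted_port, local = local_port))
```

The same value could then be passed as the `port` argument of `pr$run()` both locally and in the container.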

Cloud Run Dockerfile - autogenerated

And add any packages needed by model.

Cloud Run deployment - server-side auth

  • Server-side - JSON credentials file for GA account in api.R

Cloud Run deployment - client-side auth

  • Client-side - use Cloud Run’s authenticated calls to restrict API calls

Continuous Development with Cloud Build

Set up a build trigger for the GitHub repo you commit the Dockerfile to:

Cloud Build successful

Deploy to Cloud Run

Deployed on Cloud Run

Can scale to a billion, and can be called from other languages.

I thought I knew a bit about R and Google Cloud but then…

GoogleNext19 - Data Science at Scale with R on GCP

A 40-minute talk at Google Next19 with lots of new things to try!

https://www.youtube.com/watch?v=XpNVixSN-Mg&feature=youtu.be

New concepts

A great video that goes further into Spark clusters, Jupyter notebooks, training using ML Engine and scaling using Seldon on Kubernetes, which I haven’t tried yet

Some shots from the video

Google Cloud Platform - Serverless Pyramid

Google Cloud Platform - R applications

Conclusions

Take-aways

  • Anything scales on Google Cloud Platform, including R
  • Docker docker docker
  • library(future)
  • Pick the scaling strategy most suitable for you

Gratitude

  • Thank you for listening
  • Thanks to Anna for inviting me
  • Thanks to RStudio for all their cool things. Support them by buying their stuff.
  • Thanks again to Rocker
  • Thanks to Google for the Developer Expert programme and for building cool stuff.

Say hello afterwards