Use Cases for R on the
Google Cloud Platform

Mark Edmondson (@HoloMarkeD)

October 14th, 2020

Credentials

My R Timeline

  • Digital agencies since 2007
  • useR since 2012 - Motive: how to use all this web data?
  • Shiny enthusiast e.g. https://app.iihnordic.dk/ga-effect/
  • Google Developer Expert - Google Analytics & Google Cloud
  • Several Google API themed packages on CRAN via googleAuthR
  • Part of cloudyr group (AWS/Azure/GCP R packages for the cloud) https://cloudyr.github.io/
  • Now: Data Engineer @ IIH Nordic

GA Effect

https://app.iihnordic.dk/ga-effect/


googleAuthRverse

  • searchConsoleR
  • googleAuthR
  • googleAnalyticsR
  • googleComputeEngineR (cloudyr)
  • bigQueryR (cloudyr)
  • googleCloudStorageR (cloudyr)
  • googleLanguageR (rOpenSci)
  • googleCloudRunner (NEW!)

Slack group for discussing the packages: #googleAuthRverse

Why R for digital marketing

Data Science programming

  • R has specialised tools for every stage of a data project - a sketch of the stages follows this list
  • Gathering data - standard data.frame objects
  • Cleaning data - tidyverse
  • Modelling data - many statistical packages
  • Presentation - R Markdown, Shiny, ggplot2, JavaScript viz
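
A minimal sketch of those four stages in one script, using the built-in mtcars data as a stand-in:

library(dplyr)
library(ggplot2)

# gather: a standard data.frame
raw <- mtcars

# clean: tidyverse verbs
clean <- raw %>%
  filter(!is.na(mpg)) %>%
  mutate(heavy = wt > 3)

# model: a statistical model from base R
fit <- lm(mpg ~ wt + hp, data = clean)
summary(fit)

# present: a ggplot2 visualisation
ggplot(clean, aes(wt, mpg, colour = heavy)) + geom_point()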

Where R sits

  • It's a data science language that changes the way you think about data
  • I love Python too, the 2nd best programming language for everything
  • SQL, Go and JavaScript round out 99% of data needs

Why R in the (Google) Cloud?

  • No need to migrate code from R to scale it in production
  • Use R’s UX to integrate with Cloud services
  • Level up R’s abilities
  • Share R micro-services with non-R users

Google Cloud Platform - Serverless Pyramid

Scale (almost) always starts with Docker containers

Dockerfiles from The Rocker Project

https://www.rocker-project.org/

They maintain useful R images (launching one from R is sketched after this list):

  • rocker/r-ver
  • rocker/rstudio
  • rocker/tidyverse
  • rocker/shiny
  • rocker/ml-gpu
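
One way to use them from R: googleComputeEngineR can launch a GCP VM from a Rocker-based template. A sketch following the package quickstart (assumes auth, project and zone are already configured):

library(googleComputeEngineR)

# launch an RStudio Server VM built on the rocker/rstudio image
vm <- gce_vm(template = "rstudio",
             name = "rstudio-server",
             username = "mark", password = "mark1234",
             predefined_type = "n1-standard-1")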

Thanks to Rocker Team


Dockerfiles

FROM rocker/tidyverse
LABEL maintainer="Mark Edmondson (r@sunholo.com)"

# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
    libssl-dev

## install packages from CRAN, then GitHub
RUN install2.r --error \
    -r 'https://cran.rstudio.com' \
    googleAuthR \
    googleComputeEngineR \
    googleAnalyticsR \
    searchConsoleR \
    googleCloudStorageR \
    bigQueryR \
    && installGithub.r MarkEdmondson1234/youtubeAnalyticsR

Docker + R = R in Production

  • Flexible - no need to ask IT to install R anywhere; docker run works across cloud platforms, and containers are the ascendant tech

  • Version controlled - no worries that new package releases will break your code

  • Scalable - run multiple Docker containers at once; fits the event-driven, stateless, serverless future

Scaling R scripts, Shiny apps and APIs

Strategies to scale R

  • Vertical scaling - increase the size and power of one machine (VMs)
  • Horizontal scaling - split your problem across lots of smaller machines (VM clusters) - see the sketch after this list
  • Serverless scaling - send your code + data to the cloud and let it work out how many machines to use
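
A hedged sketch of the horizontal route, using googleComputeEngineR's cluster helper with the future package (assumes configured auth; usage follows the package's parallel-processing vignette):

library(googleComputeEngineR)
library(future)
library(future.apply)

# start 3 VMs running an R container
vms <- gce_vm_cluster("r-cluster", cluster_size = 3)

# register the VMs as workers and farm out the work
plan(cluster, workers = as.cluster(vms))
res <- future_lapply(1:100, function(i) i^2)

# stop the VMs when finished
lapply(vms, gce_vm_stop)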

Google Cloud Platform - Serverless Pyramid

googleCloudRunner - serverless scaling

Cloud Run

  • Built on top of Kubernetes via Knative
  • Managed Container-as-a-Service for HTTP requests
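
googleCloudRunner wraps the deployment steps; a sketch using its cr_deploy_plumber(), assuming a folder containing an api.R plus Dockerfile, and a configured project and bucket:

library(googleCloudRunner)

# build the container and deploy the plumber API to Cloud Run
cr <- cr_deploy_plumber("my_api_folder/")

# the returned service object should carry the https endpoint (assumption)
cr$status$url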

Cloud Run Pros/Cons

Good for R APIs

Pros

  • Auto-scaling
  • Scale from 0
  • Simple to deploy
  • https / authentication embedded

Cons

  • Needs stateless, idempotent workflows
  • Limited support for Shiny

plumber APIs

https://www.rplumber.io/

Make an API out of your script:

#' @get /hello
#' @html
function(){
  "<html><h1>hello world</h1></html>"
}
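
To test the API locally before containerising it (plumber's standard workflow):

library(plumber)

# serve api.R on port 8000, then visit http://localhost:8000/hello
plumb("api.R")$run(host = "0.0.0.0", port = 8000)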

Adapt the plumber API to your R needs

#' Echo the parameter that was sent in
#' @param msg The message to echo back.
#' @get /echo
function(msg=""){
  list(msg = paste0("The message is: '", msg, "'"))
}

#' Plot out data from the iris dataset
#' @param spec If provided, filter the data to only this species (e.g. 'setosa')
#' @get /plot
#' @png
function(spec){
  myData <- iris
  title <- "All Species"

  # Filter if the species was specified
  if (!missing(spec)){
    title <- paste0("Only the '", spec, "' Species")
    myData <- subset(iris, Species == spec)
  }

  plot(myData$Sepal.Length, myData$Petal.Length,
       main=title, xlab="Sepal Length", ylab="Petal Length")
}
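
With the API served locally on port 8000 as above, these endpoints can be exercised directly:

http://localhost:8000/echo?msg=hello
http://localhost:8000/plot?spec=setosa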

Cloud Run Docker file

Based on:

FROM trestletech/plumber

COPY [".", "./"]

ENTRYPOINT ["R", "-e", "pr <- plumber::plumb(commandArgs()[4]); pr$run(host='0.0.0.0', port=as.numeric(Sys.getenv('PORT')))"]
CMD ["api.R"]

Demo Cloud Run R application

  • R API

It can scale to billions of requests and be called from other languages.

Cloud Run - R Use Cases

  • Data modelling as a service via API call
  • Parallel processing via multiple API calls - see the sketch after this list
  • Dynamic plots hosted in iframes for data-viz products like Data Studio/Tableau
  • JavaScript/HTML rendering of data
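
For the parallel-processing use case, a hedged sketch: fan concurrent requests out to a deployed endpoint (the URL is hypothetical) and let Cloud Run autoscale containers to serve them:

library(httr)

# hypothetical URL of a deployed Cloud Run plumber API
api <- "https://my-r-api-abcdef-ew.a.run.app/echo"

# fire off many concurrent calls (mclapply forks; runs sequentially on Windows)
res <- parallel::mclapply(1:20, function(i) {
  content(GET(api, query = list(msg = i)))
}, mc.cores = 4)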

Cloud Build

  • Cloud Build runs docker commands in sequence
  • Triggered via API call or git commit
  • Useful for batched services
  • As any code can run in a container, you can combine R with other languages
  • Cloud Build runs cloudbuild.yaml scripts that call Docker containers

Example cloudbuild.yaml

steps:
- name: 'gcr.io/cloud-builders/docker'
  id: Docker Version
  args: ["version"]
- name: 'alpine'
  id:  Hello Cloud Build
  args: ["echo", "Hello Cloud Build"]
- name: 'rocker/r-base'
  id: Hello R
  args: ["Rscript", "-e", "paste0('1 + 1 = ', 1+1)"]

Polyglot cloudbuild.yaml

steps:
- name: gcr.io/gcer-public/gago:master
  args:
  - reports
  - --view=81416156
  - --dims=ga:date,ga:medium
  - --mets=ga:sessions
  - --start=2014-01-01
  - --end=2019-11-30
  - --antisample
  - --max=-1
  - -o=google_analytics.csv
  id: download google analytics
  dir: build
  env:
  - GAGO_AUTH=/workspace/auth.json
- name: gcr.io/cloud-builders/gsutil
  args:
  - cp
  - gs://mark-edmondson-public-read/polygot.Rmd
  - /workspace/build/polygot.Rmd
  id: download Rmd template
- name: gcr.io/gcer-public/packagetools:master
  args:
  - Rscript
  - -e
  - |-
    lapply(list.files('.', pattern = '.Rmd', full.names = TRUE),
                 rmarkdown::render, output_format = 'html_document')
  id: render rmd
  dir: build

Continuous Development with Cloud Build

Set up a build trigger for the GitHub repo you commit the Dockerfile to:
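
The same can be scripted with googleCloudRunner's trigger helpers (a sketch; the repo name is a placeholder and the exact arguments may differ by package version):

library(googleCloudRunner)

# run the repo's cloudbuild.yaml on each commit to GitHub
repo <- cr_buildtrigger_repo("your-github-user/your-repo")
cr_buildtrigger("cloudbuild.yaml",
                name = "build-on-commit",
                trigger = repo)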

Cloud Build Use Cases

  • Scheduled R batch scripts - a sketch follows this list
  • Continuous development/integration
  • Docker image builds
  • Long-running processes (up to 24 hrs)
  • Language-neutral yaml format to share dataflows
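
For scheduled R batch scripts, googleCloudRunner's README shows a one-call deploy; a sketch (the script name is a placeholder):

library(googleCloudRunner)

# build and schedule an R script to run daily at 06:15 via Cloud Scheduler
cr_deploy_r("batch_script.R", schedule = "15 6 * * *")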

Cloud Build app

  • Pre-authenticated APIs for the IIH Team
  • Shiny App running on Google Kubernetes Engine
  • Share cloudbuild.yaml files with pre-made jobs like GA import into BigQuery

R and GCP Community

GoogleNext19 - Data Science at Scale with R on GCP

A 40-minute talk at Google Next19 with lots of new things to try!

https://www.youtube.com/watch?v=XpNVixSN-Mg&feature=youtu.be


New concepts

A great video that goes deeper into Spark clusters, Jupyter notebooks, training with ML Engine, and scaling with Seldon on Kubernetes - things I haven't tried yet


bigrquery integration with dplyr

Use dplyr R code across datasets including BigQuery (from https://rpubs.com/shivanandiyer/BigRQuery)

library(bigrquery) # R Interface to Google BigQuery API  
library(dplyr) # Grammar for data manipulation  
library(DBI) # Interface definition to connect to databases 

bq_conn <-  dbConnect(bigquery(), 
                      project = "project-id",
                      dataset = "dataset-id", 
                      use_legacy_sql = FALSE)
                      
bq_table <- dplyr::tbl(bq_conn, "my-table") 

Use standard dplyr code that translates to BigQuery SQL behind the scenes

top_10 <-
  bq_table %>% 
    group_by(my_column) %>% 
    summarise_all(sum) %>% 
    arrange(desc(offence)) %>% 
    top_n(10)
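
dbplyr builds the query lazily - nothing runs in BigQuery until the results are requested:

# inspect the SQL that dplyr/dbplyr generate
show_query(top_10)

# execute the query in BigQuery and pull the results into a local tibble
top_10_df <- collect(top_10)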

Conclusions

Take-aways

Gratitude

  • Thank you for listening
  • Thanks to Moe for inviting me
  • Thanks to RStudio for all their cool things. Support them by buying their stuff.
  • Thanks again to Rocker
  • Thanks to Google for the Developer Expert programme and for building cool stuff.

Say hello afterwards
