Four Ways to Schedule R scripts on Google Cloud Platform

A common question I come across is how to automate scheduling of R scripts downloading data. This post goes through some options that I have played around with, which I’ve mostly used for downloading API data such as Google Analytics using the Google Cloud platform, but the same principles could apply for AWS or Azure.

Scheduling scripts advice

But first, some notes on the scripts you are scheduling, that I’ve picked up.

Don’t save data to the scheduling server

I would suggest to not save or use data in the same place you are doing the scheduling. Use a service like BigQuery (bigQueryR) or googleCloudStorageR (googleCloudStorageR) to first load any necessary data, do your work then save it out again. This may be a bit more complicated to set up, but will save you tears if the VM or service goes down - you still have your data.

To help with this, on Google Cloud you can authenticate with the same details you used to launch a VM to authenticate with the storage services above (as all are covered under the http://www.googleapis.com/auth/cloud-services scope) - you can access this auth when on a GCE VM in R via googleAuthR::gar_gce_auth()

An example skeleton script is shown below that may be something you are scheduling.

It downloads authentication files, does an API call, then saves it up to the cloud again:

library(googleAuthR)
library(googleCloudStorageR)

gcs_global_bucket("my-bucket")
## auth on the VM
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")
gar_gce_auth()

## use the GCS auth to download the auth files for your API
auth_file <- "auth/my_auth_file.json"
gcs_get_object(auth_file, saveToDisk = TRUE)

## now auth with the file you just download
gar_auth_service(auth_file)

## do your work with APIs etc.
.....

## upload results back up to GCS (or BigQuery, etc.)
gcs_upload(my_results, name = "results/my_results.csv")

Set up a schedule for logs too

Logs are important for scheduled jobs, so you have some idea on whats happened when things go wrong. To help with scheduling debugging, most googleAuthR packages now have a timestamp on their output messages.

You can send the output of your scripts to log files, if using cron and RScript it looks something like this:

RScript /your-r-script.R > your-r-script.log

…where > sends the output to the new file.

Over time though, this can get big and (sigh) fill up your disk so you can’t log in to the VM (speaking from experience here!) so I now set up another scheduled job that every week takes the logs and uploads to GCS, then deletes the current ones.

Consider using Docker for environments

Several of the methods below use Docker.

The reasons for that is Docker provides a nice reprodueable way to define exactly what packages and dependencies you need for your script to run, which can run on top of any type of infrastructure as Docker has quickly become a cloud standard.

For instance, migrating from Google Cloud to AWS is much easier if both can be deployed using Docker, and below Docker is instrumental in allowing you to run on multiple solutions.

Bear in mind that when a Docker container relaunches it won’t save any data, so any non-saved state will be lost (you should make a new container if you need it to contain data), but you’re not saving your data to the docker container anyway, aren’t you?

Scheduling options - Pros and cons

Here is an overview of the pros and cons of the options presented in more detail below:

Method Pros Cons
1. cronR Addin on RStudio Server Simple and quick Not so robust, need to log into server to make changes, versioning packages.
2. gce_vm_scheduler and Dockerfiles Robust and can launch from local R session, support versioning Need to build Dockerfiles, all scripts on one VM
3. Master & Slave VM Tailor a fresh VM for each script, cheaper Need to build Dockerfiles, more complicated VM setup.
4. Google AppEngine with flexible containers Managed platform Need to turn script into web responses, more complicated setup

1 - cronR plus RStudio Server

This is the simplest and the one to start with.

  1. Start up an RStudio Server instance
  2. Install cronR
  3. Upload your R script
  4. Schedule your script using cronR RStudio addin

With googleComputeEngineR and the new gcer-public project containing public images that include one with cronR already installed, this is as simple as the few lines of code below:

library(googleComputeEngineR)

## get the tag for prebuilt Docker image with googleAuthRverse, cronR and tidyverse
tag <- gce_tag_container("google-auth-r-cron-tidy", project = "gcer-public")
# gcr.io/gcer-public/google-auth-r-cron-tidy

## start a custom Rstudio instance
vm <- gce_vm(name = "my-rstudio",
              predefined_type = "n1-highmem-8",
              template = "rstudio",
              dynamic_image = tag,
              username = "me", password = "mypassword")

Wait for it to launch and give you an IP, then log in, upload a script and configure the schedule via the cronR addin.

Some more detail about this workflow can be found at these custom RStudio example workflows on the googleComputeEngineR website.

2- gce_vm_scheduler and Dockerfiles

This method I prefer to the above since it lets you create the exact environment (e.g. package versions, dependencies) to run your script in, that you can trail dev and production versions with. It also works locally without needing to log into the server each time to deploy a script.

Handy tools for Docker - containerit and Build Triggers

Here we introduce Docker images, which may have been more a technical barrier for some before (but worth knowing, I think)

containerit

Things are much easier now though, as we have the magic new R package containerit which can generate these Docker files for you - just send containerit::dockerfile() around the script file.

Build Triggers

Along with auto-generating Dockerfiles, for Google Cloud in particular we now also have Build Triggers which automates building the Docker image for you.

Just make the Dockerfile, then set up a trigger for when you push that file up to GitHub - you can see the ones used to create the public R resources here in the googleComputeEngineR repo.

Recipe

Putting it all together then, documentation of this workflow for scheduling R scripts is found here.

  1. If you don’t already have one, start up a scheduler VM using gce_vm_scheduler
  2. Create a Dockerfile either manually or using containerit that will run your script upon execution
  3. Upload the Dockerfile to a git repo (private or public)
  4. Setup a build trigger for that Dockerfile
  5. Once built, set a script to schedule within that Dockerfile with gce_schedule_docker

This is still in beta at time of writing but should be stable by the time googlecomputeEngineR hits CRAN 0.2.0.

3 - Master and Slave VMs

Some scripts take more resources than others, and since you are using VMs already you can have more control over what specifications of VM to launch based on the script you want to run.

This means you can have a cheap scheduler server, that launch biggers VMs for the duration of the job. As GCP charges per minute, this can save you money over having a schedule server that is as big as what your most expensive script needs running 24/7.

This method is largely like the scheduled scripts above, except in this case the scheduled script is also launching VMs to run the job upon.

Using googleCloudStorageR::gcs_source you can run an R script straight from where it is hosted upon GCS, meaning all data, authentication files and scripts can be kept seperate from the computation. An example master script is shown below:

## intended to be run on a small instance via cron
## use this script to launch other VMs with more expensive tasks
library(googleComputeEngineR)
library(googleCloudStorageR)
gce_global_project("my-project")
gce_global_zone("europe-west1-b")
gcs_global_bucket("your-gcs-bucket")

## auth to same project we're on
googleAuthR::gar_gce_auth()

## launch the premade VM
vm <- gce_vm("slave-1")

## set SSH to use 'master' username as configured before
vm <- gce_ssh_setup(vm, username = "master", ssh_overwrite = TRUE)

## run the script on the VM that will source from GCS
runme <- "Rscript -e \"googleAuthR::gar_gce_auth();googleCloudStorageR::gcs_source('download.R', bucket = 'your-gcs-bucket')\""
out <- docker_cmd(vm, 
                  cmd = "exec", 
                  args = c("rstudio", runme), 
                  wait = TRUE)

## once finished, stop the VM
gce_vm_stop(vm)

More detail is again available at the googleComputeEngineR website.

4 - Google App Engine with flexible custom runtimes

Google App Engine has always had schedule options, but only for its supported languages of Python, Java, PHP etc. Now with the introduction of flexible containers, any Docker container running any language (including R) can also be run.

This is potentially the best solution since it runs upon a 100% managed platform, meaning you don’t need to worry about servers at all, and it takes care of things like server maintence, logging etc.

Setting up your script for App Engine

There are some requirements for the container that need configuring so it can run:

  • You can not use googleAuthR::gar_gce_auth() so will need to upload the auth token within the Dockerfile.
  • AppEngine expects a web service to be listening on port 8080, so your schedule script needs to be triggered via HTTP requests.

For authentication, I use the system environment arguments (i.e. those usually set in .Renviron) that googleAuthR packages use for auto-authentication. Put the auth file (such as JSON or a .httr-oauth file) into the deployment folder, then point to its location via specifying in the app.yaml. Details below.

To solve the need for being a webservice on port 8080 (which is then proxied to normal webports 80/443), plumber is a great service by Jeff Allen of RStudio, which already comes with its own Docker solution. You can then modify that Dockerfile slightly so that it works on App Engine.

Recipe

To then schedule your R script on app engine, follow the guide below, first making sure you have setup the gcloud CLI.

  1. Create a Google Appengine project in the US region (only region that supports flexible containers at the moment)
  2. Create a scheduled script e.g. schedule.R - you can use auth from environment files specified in app.yaml.
  3. Make an API out of the script by using plumber - example:
library(googleAuthR)         ## authentication
library(googleCloudStorageR)  ## google cloud storage
library(readr)                ## 
## gcs auto authenticated via environment file 
## pointed to via sys.env GCS_AUTH_FILE

#* @get /demoR
demoScheduleAPI <- function(){
  
  ## download or do something
  something <- tryCatch({
      gcs_get_object("schedule/test.csv", 
                     bucket = "mark-edmondson-public-files")
    }, error = function(ex) {
      NULL
    })
      
  something_else <- data.frame(X1 = 1,
                               time = Sys.time(), 
                               blah = paste(sample(letters, 10, replace = TRUE), collapse = ""))
  something <- rbind(something, something_else)
  
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  write.csv(something, file = tmp, row.names = FALSE)
  ## upload something
  gcs_upload(tmp, 
             bucket = "mark-edmondson-public-files", 
             name = "schedule/test.csv")
  
  message("Done", Sys.time())
}
  1. Create Dockerfile. If using containerit then replace FROM with trestletech/plumber and add the below lines to use correct AppEngine port:

Example:

library(containerit)

dockerfile <- dockerfile("schedule.R", copy = "script_dir", soft = TRUE)
write(dockerfile, file = "Dockerfile")

Then change/add these lines to the created Dockerfile:

EXPOSE 8080
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb(commandArgs()[4]); pr$run(host='0.0.0.0', port=8080)"]
CMD ["schedule.R"]

Example final Dockerfile below. This doesn’t need to be built in say a build trigger as its built upon app engine deployment.

FROM trestletech/plumber
LABEL maintainer="mark"
RUN export DEBIAN_FRONTEND=noninteractive; apt-get -y update \
 && apt-get install -y libcairo2-dev \
    libcurl4-openssl-dev \
    libgmp-dev \
    libpng-dev \
    libssl-dev \
    libxml2-dev \
    make \
    pandoc \
    pandoc-citeproc \
    zlib1g-dev
RUN ["install2.r", "-r 'https://cloud.r-project.org'", "readr", "googleCloudStorageR", "Rcpp", "digest", "crayon", "withr", "mime", "R6", "jsonlite", "xtable", "magrittr", "httr", "curl", "testthat", "devtools", "hms", "shiny", "httpuv", "memoise", "htmltools", "openssl", "tibble", "remotes"]
RUN ["installGithub.r", "MarkEdmondson1234/googleAuthR@7917351", "hadley/rlang@ff87439"]
WORKDIR /payload/
COPY [".", "./"]

EXPOSE 8080
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb(commandArgs()[4]); pr$run(host='0.0.0.0', port=8080)"]
CMD ["schedule.R"]
  1. Specify app.yaml for flexible containers as detailed here. Add any environment vars such as auth files, that will be included in same deployment folder.

Example:

runtime: custom
env: flex

env_variables:
  GCS_AUTH_FILE: auth.json
  1. Specify cron.yaml for the schedule needed:
cron:
- description: "test cron"
  url: /demoR
  schedule: every 1 hours
  1. You should now have these files in the deployment folder:

    • app.yaml - configuration of general app settings
    • auth.json - an authentication file specified in env arguments or app.yaml
    • cron.yaml - specification of when your scheduling is
    • Dockerfile - specification of the environment
    • schedule.R - the plumber version of your script containing your endpoints

Open the terminal in that folder, and deploy via gcloud app deploy --project your-project and the cron schedule via gcloud app deploy cron.yaml --project your-project.

It will take a while (up to 10 mins) the first time.

  1. The App Engine should then be deployed on https://your-project.appspot.com/ - every GET request to https://your-project.appspot.com/demoR (or other endpoints you have specified in R script) will run the R code. The cron example above will run every hour to this endpoint.

Logs for the instance are found here.

This approach is the most flexible, and offers a fully managed platform for your scripts. Scheduled scripts are only the beginning, since deploying such actually gives you a way to run R scripts in response to any HTTP request from any language - triggers could also include if someone updates a spreadsheet, adds a file to a folder, pushes to GitHub etc. which opens up a lot of exciting possibilities. You can also scale it up to become a fully functioning R API.

Summary

Hopefully this has given you an idea on your options for R on Google Cloud regarding scheduling. If you have some other easier workflows or suggestions for improvements please put them in the comments below!

Share Comments
comments powered by Disqus