googleAuthR
searchConsoleR
googleAuthR
googleAnalyticsR
googleComputeEngineR
(cloudyr) bigQueryR
(cloudyr) googleCloudStorageR
(cloudyr) googleLanguageR
(rOpenSci) Slack group to talk about the packages: #googleAuthRverse
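All of these packages build on googleAuthR, so authentication is shared between them. A minimal sketch of the two common auth flows (the scope and the JSON key filename are placeholders for your own):

```r
library(googleAuthR)

# set the scope(s) the session will need (placeholder scope)
options(googleAuthR.scopes.selected =
          "https://www.googleapis.com/auth/cloud-platform")

# interactive OAuth2 flow via the browser...
gar_auth()

# ...or non-interactive auth with a service account JSON key,
# which is what you want on a VM or in a scheduled script
gar_auth_service("my-service-key.json")
```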
https://www.rocker-project.org/
Maintain useful R images
rocker/r-ver
rocker/rstudio
rocker/tidyverse
rocker/shiny
rocker/ml-gpu
FROM rocker/tidyverse:3.6.0
MAINTAINER Mark Edmondson (r@sunholo.com)
# install R package dependencies
RUN apt-get update && apt-get install -y \
libssl-dev
## Install packages from CRAN
RUN install2.r --error \
-r 'http://cran.rstudio.com' \
googleAuthR \
googleComputeEngineR \
googleAnalyticsR \
searchConsoleR \
googleCloudStorageR \
bigQueryR \
## install Github packages
&& installGithub.r MarkEdmondson1234/youtubeAnalyticsR \
## clean up
&& rm -rf /tmp/downloaded_packages/ /tmp/*.rds
Flexible: no need to ask IT to install R everywhere, just docker run; works across cloud platforms; ascendant tech
Version controlled: no worries that new package releases will break your code
Scalable: run multiple Docker containers at once; fits into an event-driven, stateless, serverless future
Continuous development with GitHub pushes
Good for one-off workloads
Pros
You can probably run the same code with no changes needed
Easy to set up
Cons
Expensive
May be better to keep the data in a database
3.75TB of RAM: $423 a day (compare ~$1 a day for standard tier VM)
library(googleComputeEngineR)

# your customised Docker image built via Build Triggers
custom_image <- gce_tag_container("custom-shiny-app", "your-project")

## make a new Shiny template VM for your self-contained Shiny app
vm <- gce_vm("myapp",
             template = "shiny",
             predefined_type = "n1-standard-2",
             dynamic_image = custom_image)
Save your data out via googleCloudStorageR or bigQueryR
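For instance, results computed on the VM could be written to a Cloud Storage bucket with googleCloudStorageR (the bucket name and result object here are placeholders):

```r
library(googleCloudStorageR)

# a placeholder bucket you have write access to
gcs_global_bucket("my-results-bucket")

# save the result object locally, then upload it to the bucket
saveRDS(result, "result.rds")
gcs_upload("result.rds")

# later, fetch it back down from the bucket
gcs_get_object("result.rds", saveToDisk = "result.rds", overwrite = TRUE)
```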
Good for parallelisable or scheduled data tasks
Pros
Fault redundancy
Forces repeatable/reproducible infrastructure
library(future) makes parallel processing very usable
Cons
Changes to your code for split-map-reduce
Write meta code to handle I/O data and code
Not applicable to some problems
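The split-map-reduce reshaping mentioned above can be sketched in plain R (shown sequentially here; swapping lapply for future_lapply is what moves the map step onto a cluster):

```r
# split: break one big job into independent chunks
big_vector <- 1:1000
chunks <- split(big_vector, cut(seq_along(big_vector), 4, labels = FALSE))

# map: do the work per chunk - this is the part a cluster parallelises
partial_sums <- lapply(chunks, sum)

# reduce: combine the partial results back into one answer
total <- Reduce(`+`, partial_sums)
```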
New in googleComputeEngineR
v0.3: a shortcut that launches the cluster and checks authentication for you
library(googleComputeEngineR)
vms <- gce_vm_cluster()
#2019-03-29 23:24:54> # Creating cluster with these arguments: template = r-base, dynamic_image = rocker/r-parallel, wait = FALSE, predefined_type = n1-standard-1
#2019-03-29 23:25:10> Operation running...
...
#2019-03-29 23:25:25> r-cluster-1 VM running
#2019-03-29 23:25:27> r-cluster-2 VM running
#2019-03-29 23:25:29> r-cluster-3 VM running
...
#2019-03-29 23:25:53> # Testing cluster:
r-cluster-1 ssh working
r-cluster-2 ssh working
r-cluster-3 ssh working
googleComputeEngineR has a custom method for future::as.cluster()
library(googleComputeEngineR)
library(future)
library(future.apply)

# create a cluster of 3 VMs and register it as the future backend
vms <- gce_vm_cluster("r-vm", cluster_size = 3)
plan(cluster, workers = as.cluster(vms))

# get data
my_files <- list.files("myfolder")
my_data <- lapply(my_files, read.csv)

# forecast data in the cluster
library(forecast)
cluster_f <- function(my_data, args = 4){
  forecast(auto.arima(ts(my_data, frequency = args)))
}

result <- future_lapply(my_data, cluster_f, args = 4)
Future loops can be multi-layered (use each CPU within each VM)
Thanks to Grant McDermott for figuring out the optimal method (Issue #129)
3 VMs, 8 CPUs each = 24 threads (~$3 a day)
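A sketch of that multi-layer topology using the future package's nested plans: the outer layer spreads work across the VMs, the inner multisession layer spreads it across each VM's CPUs. The plan() list syntax is the future package's own; the exact tweak here is an assumption based on the issue discussion:

```r
library(googleComputeEngineR)
library(future)
library(future.apply)

vms <- gce_vm_cluster("r-vm", cluster_size = 3)

# outer layer: one worker per VM; inner layer: one R process per CPU on that VM
plan(list(
  tweak(cluster, workers = as.cluster(vms)),
  multisession
))

# the outer future_lapply lands on the VMs; any futures created inside
# cluster_f then fan out across that VM's cores
result <- future_lapply(my_data, cluster_f, args = 4)
```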
Clusters of VMs + Docker = Horizontal scaling
Clusters of VMs + Docker + Task controller = Kubernetes
Good for Shiny
Pros
Auto-scaling, task queues etc.
Scale to billions
Potentially cheaper
May already have cluster in your organisation
Cons
Needs stateless, idempotent workflows
Message broker?
Minimum 3 VMs
Can get complicated
Built on Cloud Build upon GitHub push
FROM rocker/shiny
MAINTAINER Mark Edmondson (r@sunholo.com)
# install R package dependencies
RUN apt-get update && apt-get install -y \
libssl-dev
## Install packages from CRAN needed for your app
RUN install2.r --error \
-r 'http://cran.rstudio.com' \
googleAuthR \
googleAnalyticsR
## assume shiny app is in build folder /shiny
COPY ./shiny/ /srv/shiny-server/myapp/
Good for R APIs
Pros
Auto-scaling
Scale from 0
Simple to deploy
https / authentication embedded
Cons
Needs stateless, idempotent workflows
No websockets yet (no Shiny :( )
Make an API out of your script:
library(googleAnalyticsR)
#' Return output data from the ga_time_normalised ga_model
#' @param viewID The viewID for Google Analytics
#' @get /data
function(viewID=""){
model <- ga_time_normalised(viewID)
model$output
}
#' Plot out data from the ga_time_normalised ga_model
#' @param viewID The viewID for Google Analytics
#' @get /plot
#' @serializer htmlwidget
function(viewID=""){
model <- ga_time_normalised(viewID)
model$plot
}
The @serializer htmlwidget annotation is great
Creates a webserver to run the R code.
curl http://localhost:8000/data?viewID=81416156
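The curl call above assumes the API is already running locally; with plumber that looks like this (assuming the two annotated endpoints live in a file called api.R):

```r
library(plumber)

# parse the annotated api.R file and start a webserver on port 8000
pr <- plumb("api.R")
pr$run(host = "0.0.0.0", port = 8000)
```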
See cloudRunR
Based on:
FROM trestletech/plumber
LABEL maintainer="mark"
COPY [".", "./"]
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb(commandArgs()[4]); pr$run(host='0.0.0.0', port=as.numeric(Sys.getenv('PORT')))"]
CMD ["api.R"]
And add any packages needed by the model.
Set up a build trigger for the GitHub repo you commit the Dockerfile to:
Can scale to a billion, and be available for other languages.
curl https://my-r-api-ewjogewawq-uc.a.run.app/data?viewID=81416156
A 40-minute talk at Google Next19 with lots of new things to try!
https://www.youtube.com/watch?v=XpNVixSN-Mg&feature=youtu.be
A great video that goes further into Spark clusters, Jupyter notebooks, training with ML Engine, and scaling with Seldon on Kubernetes, which I haven't tried yet