Advanced API building techniques

These go into more edge cases: no parsing content, batching and caching API responses.

Skip parsing

In some cases you may want to skip all parsing of API content, perhaps if it is not JSON or some other reason.

For this, you can use the option option("googleAuthR.rawResponse" = TRUE) to skip all tests and return the raw response.

Here is an example of this from the googleCloudStorageR library:

gcs_get_object <- function(bucket, 
                           object_name){
  ## skip JSON parsing on output as we epxect a CSV
  options(googleAuthR.rawResponse = TRUE)
  
  ## do the request
  ob <- googleAuthR::gar_api_generator("https://www.googleapis.com/storage/v1/",
                                       path_args = list(b = bucket,
                                                        o = object_name),
                                       pars_args = list(alt = "media"))
  req <- ob()
  
  ## set it back to FALSE for other API calls.
  options(googleAuthR.rawResponse = FALSE)
  req
}

Batching API requests

If you are doing many API calls, you can speed this up a lot by using the batch option. This takes the API functions you have created and wraps them in the gar_batch function to request them all in one POST call. You then recieve the responses in a list.

Note that this does not count as one call for API limits purposes, it just speeds up the processing.

Setting batch endpoint

From version googleAuthR 0.6.0 you also need to set an option of the batch endpoint. This is due to multi-batch endpoints being deprecated by Google. You are also no longer able to send batches for multiple APIs in one call.

The batch endpoint is usually of the form:

www.googleapis.com/batch/api-name/api-version

e.g. For BigQuery, the option is:

options(googleAuthR.batch_endpoint = "https://www.googleapis.com/batch/bigquery/v2")

Working with batching

Batching is done via the gar_batch() function. This transforms functions created by gar_api_generator() into one POST request following the batching syntax given by Google API docs. Here is an example for the Google Analytics management API, but its very standard for all the APIs that support batching. If however it doesn’t have a batching section in its docs, it is likely that the API does not support batching.

## usually set on package load
options(googleAuthR.batch_endpoint = "https://www.googleapis.com/batch/urlshortener/v1")

## from goo.gl API
shorten_url <- function(url){

  body = list(
    longUrl = url
  )
  
  f <- gar_api_generator("https://www.googleapis.com/urlshortener/v1/url",
                         "POST",
                         data_parse_function = function(x) x$id)
  
  f(the_body = body)
  
}

## from goo.gl API
user_history <- function(){
  f <- gar_api_generator("https://www.googleapis.com/urlshortener/v1/url/history",
                         "GET",
                         data_parse_function = function(x) x$items)
  
  f()
}
library(googleAuthR)

gar_auth()

ggg <- gar_batch(list(shorten_url("http://markedmondson.me"), user_history()))

Walking through batch requests

A common batch task is to walk through the same API call, modifying only one parameter. An example includes walking through Google Analytics API calls by date to avoid sampling. This is implemented at gar_batch_walk()

This modifies the API function parameters, so you need to supply it which parameters you will (and will not) vary, as well as the values you want to walk through. Some rules to help you get started are:

  • The f function needs to be a gar_api_generator() function that uses at least one of path_args, pars_args or body_args to construct the URL (rather than say using sprintf() to create the API URL)
  • You don’t need to set the headers as described in the Google docs for batching API functions - those are done for you.
  • The argument walk_vector needs to be a vector of the values of the arguments to walk over, which you indicate will walk over the pars/path or body arguments on the function via on of the *_walk arguments e.g. if walking over id=1, id=2, for a path argument then it would be path_walk="id" and walk_vector=c(1,2,3,4)
  • gar_batch_walk() only supports changing one value at a time, for one or multiple arguments (I think only changing the start-date, end-date example would be the case when you walk through more than one per call)
  • batch_size should be over 1 for batching to be of any benefit at all
  • The batch_function argument gives you a way to operate on the parsed output of each call, before they are merged.

An example is shown below - this gets a Google Analytics WebProperty object for each Account ID, in batches of 100 per API call:

# outputs a vector of accountIds
getAccountInfo <- gar_api_generator(
  "https://www.googleapis.com/analytics/v3/management/accounts",
  "GET", data_parse_function = function(x) unique(x$items$id))

# for each accountId, get a list of web properties
getWebpropertyInfo <- gar_api_generator(
  "https://www.googleapis.com/analytics/v3/management/", # don't use sprintf to construct this
  "GET",
  path_args = list(accounts = "default", webproperties = ""),
  data_parse_function = function(x) x$items)

# walks through a list of accountIds
walkWebpropertyInfo <- function(accs){
  
  gar_batch_walk(getWebpropertyInfo, 
                 walk_vector = accs,
                 gar_paths = list("webproperties" = ""),
                 path_walk = "accounts",
                 batch_size = 100, data_frame_output = FALSE)
}

##----- to use:

library(googleAuthR)
accs <- getAccountInfo()
walkWebpropertyInfo(accs)

Caching API calls

You can also set up local caching of API calls. This uses the memoise package to let you write API responses to memory or disk and to call them from there instead of the API, if its using the same parameters.

A demonstration is shown below:

#' Shortens a url using goo.gl
#'
#' @param url URl to shorten with goo.gl
#' 
#' @return a string of the short URL
shorten_url <- function(url){
  
  body = list(
    longUrl = url
  )
  
  f <- gar_api_generator("https://www.googleapis.com/urlshortener/v1/url",
                         "POST",
                         data_parse_function = function(x) x$id)
  
  f(the_body = body)
  
}

gar_auth()

## normal API fetch
shorten_url("http://markedmondson.me")

## By default this will save the API response to RAM
gar_cache_setup()

## first time no cache
shorten_url("http://markedmondson.me")

## second time cached - much quicker
shorten_url("http://markedmondson.me")

Caching is activated by using the gar_cache_setup() function.

The default uses memoise::cache_memory() which will cache the response to RAM, but you can change this to any of the memoise cache functions such as cache_s3() or cache_filesytem()

cache_filesystem() will write to a local folder, meaning you can save API responses between R sessions.

Other cache functions

You can see the current cache location function via gar_cache_get_loc and stop caching via gar_cache_empty

Invalidating cache

There are two hard things in Computer Science: cache invalidation and naming things.

In some cases, you may only want to cache the API responses under certain conditions. A common use case is if an API call is checking if a job is running or finished. You would only want to cache the finished state, otherwise the function will run indefinitely.

For those circumstances, you can supply a function that takes the API response as its only input, and outputs TRUE or FALSE whether to do the caching. This allows you to introduce a check e.g. for finished jobs.

The default will only cache for when a successful request 200 is found:

function(req){req$status_code == 200}

For more advanced use cases, examine the response of failed and successful API calls, and create the appropriate function. Pass that function when creating the cache:

# demo function to cache within
shorten_url_cache <- function(url){
  
  body = list(
    longUrl = url
  )
  
  f <- gar_api_generator("https://www.googleapis.com/urlshortener/v1/url",
                         "POST",
                         data_parse_function = function(x) x)
  
  f(the_body = body)
  
}

## only cache if this URL
gar_cache_setup(invalid_func = function(req){
    req$content$longUrl == "http://code.markedmondson.me/"
    })

# authentication
gar_auth()

## caches
shorten_url_cache("http://code.markedmondson.me")
  
## read cache
shorten_url_cache("http://code.markedmondson.me")
  
## ..but dont cache me
shorten_url_cache("http://blahblah.com")

Batching and caching

If you are caching a batched call, your cache invalidation function will need to take account that it will recieve a response which is a multipart/mixed; boundary=batch_{random_string} as its content-type header. This response will need to be parsed into JSON first, before applying your data parsing functions and/or deciding to cache. To get you started here is a cache function:

batched_caching <- function(req){

  ## if a batched response
  if(grepl("^multipart/mixed; boundary=batch_",req$headers$`content-type`)){

    ## find content that indicates a successful request ('kind:analytics#gaData')
    parsed <- httr::content(req, as = "text", encoding = "UTF-8")
    is_ga <- grepl('"kind":"analytics#gaData"', parsed)
    if(is_ga){
      ## is a response you want, cache me
      return(TRUE)
    }
  }
  
  FALSE
}

Using caching

Once set, if the function call name, arguments and body are the same, it will attempt to find a cache. If it exists it will read from there rather than making the API call. If it does not it will make the API call, and save the response to where you have specified.

Applications include saving large responses to RAM during paging calls, so that if the response fails retries are quickly moved to where the API left off, or to write API responses to disk for multi-session caching. You could also use this for unit testing, although its recommend to use httptest library for that as it has wider support for authentication obscuration etc.

Be careful to only use caching where you know the API request won’t change - if you want to reset the cache you can run the gar_cache_setup function again or delete the individual cache file in the directory.

Tests

All packages should ideally have tests to ensure that any changes do not break functionality.

If new to testing, first read this guide on using the tidyverse’s testthat, then read this for a beginner’s guide to travis-CI for R, which is a continuous integration system that will run and test your code everytime it is committed to GitHub.

This is this package’s current Travis badge: Travis-CI Build Status

However, testing APIs that need authentication is more complicated, as you need to deal with the authentication token to get a correct response.

One option is to encrypt and upload your token, for which you can read a guide by Jenny Bryan here - https://cran.r-project.org/web/packages/googlesheets/vignettes/managing-auth-tokens.html.

Mocking tests with httptest

However, I recommend using mocking to run tests online.

Mocking means you do not have to upload your authentication token - instead you first run your tests locally with your normal authentication token and the responses are saved to disk. When those tests are then run online, instead of calling the API the saved responses on disk are used to test the function.

In truth, you are most interested in if your functions call the API correctly, rather than if the API is working (the API provider has their own tests for that), so you don’t lose coverage testing against a mock file, and this has the added advantage of being much quicker so you can run tests more often.

When creating your tests, I suggest two types: unit tests that will run against your mocks; and integration tests that will run against the API but only when used locally.

To help with mocking, Neal Richardson has created a package called httptest which we use in the examples below:

Integration tests

The integration tests are against the API. Any authentication tokens should be either referenced via the environment arguments or if local JSON or .httr-oauth files ignored in your .gitignore

context("Integration tests setup")

## test google project ID
options(googleAuthR.scopes.selected = c("https://www.googleapis.com/auth/urlshortener"),
        googleAuthR.client_id = "XXXX",
        googleAuthR.client_secret = "XXXX")

## this is the name of the environment argument that points to your auth file, 
## and referenced in `gar_attach_auto_auth` so that it authenticated on package load.
auth_env <- "GAR_AUTH_FILE"

A helper function is available that will skip the test if the auth_env is unavailable, skip_if_no_env_auth()

The tests can then be ran, and will only call the API if authenticated.

test_that("My function works", {
  skip_on_cran() # don't run on CRAN at all
  skip_if_no_env_auth(auth_env)
  
  ## if not auto-auth
  ## gar_auth("location_of_auth_file")
  
  lw <- shorten_url("http://code.markedmondson.me")
  expect_type(lw, "character")
  
})

Unit tests

Running unit tests is similar, but you also need to record the API response first locally using httptest’s capture_requests(). I put this in a test block to check it worked ok.

These will save to files within your test folder, with the similar name to the API.

library(httptest)
context("API Mocking")

test_that("Record requests if online and authenticated", {
  skip_if_disconnected()
  skip_if_no_env_auth(auth_env)
  
  capture_requests(
    {
      shorten_url("http://code.markedmondson.me")
      # ...
      # any other API calls required for tests below
    })
  
})

You can then use httptests withMockAPI() function to wrap your tests:

with_mock_API({
  context("API generator - unit")
  
  test_that("A generated API function works", {
    skip_on_cran()
    
    gar_auth("googleAuthR_tests.httr-oauth")
    lw <- shorten_url("http://code.markedmondson.me")
    expect_type(lw, "character")
    
  })
  
  # ... any other tests you have captured requests from
  
})

Codecov

One of the nicest features of continuous integration / Travis is it can be used for code coverage. This uses your tests to find out how many lines of code are triggered within them. This lets you see how thorough yours tests are. Developers covert the 100% coverage badge as it shows all of your code is checked every time you push to GitHub and helps mitigate bugs.

This is this package’s current Codecov status: codecov

Using code coverage is a case of adding another metafile and a command to your travis file. This post by Eryk has some details

Offline code coverage

As mentioned above, some tests you may not be able to run online due to authentication issues. If you want, you can run these tests offline locally - go to your repositories settings on Codecov, select settings and get your Repository Upload Token. Place this in an environment var like this:

CODECOV_TOKEN=XXXXXX

…then run covr::codecov() in the package home directory. It will run your tests, and upload them to Codecov.