gago: Blazingly fast Google Analytics API downloads with Go

gago is a new Go library for working with the Google Analytics Reporting API v4.

I used it as a way to learn Go, transferring across some of the lessons I learned from working with the Google Analytics API in googleAnalyticsR. In particular how to get fast downloads and adding an anti-sample option, whilst taking advantage of Go’s natural multi-threaded nature.

The imagined use case is for when you need to download Google Analytics data but you don’t want to install an interpreted language such as Python or R to do so. Since Go compiles to binary files it should work on any system.

Why Go in data science/engineering?

The first question I get from people is why bother learning Go in the first place?

The primary reason was just to try something new, but as I worked with it I did find some advantages over my usual R, Python and SQL in my current position as a Data Engineer. These were:

  • Strongly typed language making code a lot more robust by default
  • Well supported by Google Cloud Platform
  • Looks to fill a niche that you may use Java for
  • Easy multi-threading
  • Open source library system based on GitHub
  • Portable - compiles to binary builds on Linux, Windows etc.
  • Favoured functional programming styles over OO-orientated (e.g. R like, but with Python clean syntax)

All that said I wouldn’t use it as a replacement for R for data science workloads, but perhaps some of the systems programming that I currently use Python for such as Cloud Functions it may work for, especially if response and speed is required. Perhaps also if there are bottlenecks in Python/R code then making a more efficient loop in Go may be an option, although the best way to do that in R at the moment is to use C++ and the Rcpp library.

The first application is creating a command line tool that can work on a server to download data, which can then be passed to another language for analysis or distributed to other users who need a quick way to download data but who don’t have an R/Python environment.

gago and gagocli

The first step was to import the official Google Analytics v4 library for Go, which is created using the discovery API.

As anyone who has worked with the v4 API can tell you, it is not trivial to turn the base functions into a working download of your data. It requires some knowledge about what Google Analytics data model is, and then you have to negotiate a complicated nested JSON structure and paging. gago is intended to take out some of that workload for Go developers, requiring you only need to supply a Go struct with the data you wish to get.

Here is an example on how to fetch all rows of data with the unsampled flag set that breaks up your API calls into batches under the sampling limits:

package main

import (
	"github.com/MarkEdmondson1234/gago/gago"
	"os"
)

// get auth file and authenticate
authFile := os.Getenv("GAGO_AUTH")
analyticsreportingService, _ := gago.Authenticate(authFile)

// make report request struct
var req = gago.GoogleAnalyticsRequest{
		Service:    analyticsreportingService,
		ViewID:     "106249469",
		Start:      "2016-07-01",
		End:        "2019-08-01",
		Dimensions: "ga:date,ga:sourceMedium,ga:landingPagePath,ga:source,ga:hour,ga:minute,ga:eventCategory",
		Metrics:    "ga:sessions,ga:users",
		MaxRows:    -1,
		AntiSample: true}

// API calls
report := gago.GoogleAnalytics(req)

// write data out as a CSV to console
gago.WriteCSV(report, os.Stdout)

gago is then intended for use by Go developers looking to integrate Google Analytics data into their own applications, whilst gagocli is a command line interface for all users to work with gago and the Google Analytics API.

An example of using gagocli lets you send in your request for data in a config file or direct within the command line via flags:

# in your terminal
gagocli reports -c config.yml -view 1234 -max 10

…where the config.yaml could look like:

gago:
  view: 1234567
  metrics: ga:sessions,ga:users
  dimensions: ga:date,ga:sourceMedium,ga:landingPagePath
  start: 2019-01-01
  end: 2019-03-01
  maxRows: 1000
  antisample: true

See the documentation on how to get set up with gagocli to see more examples.

‘Blazingly fast’ performance

Everything is ‘blazingly fast’ these days, so I took the opportunity too.

The API fetches for big reports using gago will I think be the fastest way available to download Google Analytics data at the moment.

This is due to gago using optimisations that take advantage of some of the Google Analytics Reporting API v4 as well as Go’s multi-threading.

To start with, the v4 feature of batching is used, meaning five API requests can be made at once. On top of this, you can fetch multiple batched requests at once, with a limit of ten per web property - Go make it easy to set off ten requests at once and monitor and patch them together again. This means 10 API requests * 5 batched = 50 API pages can be requested at once.

This is a bit hacky since its not using the intended paging functionality of the API, which traditionally lets you request one page at a time, the response of which includes a link/paging number to where the next API call should start from.

However, it seems to work. The requests are instead constructed ahead of time, and all sent at once using Go’s nice interface for asynchronous parallel fetches, the goroutine.

Benchmarks

To demonstrate, I ran the same API call across several methods using googleAnalyticsR as a faciltator.

I compared:

  • Traditional API fetching using the paging in each responses, available by setting slow_fetch=TRUE in the google_analytics() function
  • Batched v4 API requests that are the default for google_analytics()
  • A call out using system calls to the gagocli library

The API call to fetch was from a medium sized ecommerce site fetching 503,210 rows of session data containing each campaign, landing page and hour of the day of nearly two years.

The R code to create the benchmarks is below:

library(googleAnalyticsR)
library(microbenchmark)

# auth via a service account connected to GA 
googleAuthR::gar_auth_service(Sys.getenv("GAGO_AUTH"))

# use your own viewId
test_id <- 12345678	

slow_api <- function(){
  google_analytics(test_id, 
                   date_range = c("2018-01-01","2019-10-01"),
                   metrics = "sessions", 
                   dimensions = c("date","hour","campaign","landingPagePath"),
                   slow_fetch = TRUE,
                   max = -1)
} 

quicker_r <- function(){
  google_analytics(test_id, 
                   date_range = c("2018-01-01","2019-10-01"),
                   metrics = "sessions", 
                   dimensions = c("date","hour","campaign","landingPagePath"),
                   slow_fetch = FALSE, #the default
                   max = -1)
}

# a call out  of R to the terminal to an installed gagocli binary
gago_call <- function(){
  system("./bin/gagocli reports -dims 'ga:date,ga:hour,ga:campaign,ga:landingPagePath' \
  -end 2019-10-01 -max -1 -mets 'ga:sessions' -start 2018-01-01 -view 106249469 \
  -o gago.csv")
  # read the CSV created into R as well to make it a fairer comparison
  read.csv("gago.csv", stringsAsFactors = FALSE)
}

mbm <- microbenchmark(
  slow = slow_api(),
  quick_r = quicker_r(),
  gago_call = gago_call(),
  times = 3
)

The results were:

  • Slow traditional API paging: 921 seconds (15mins 21secs)
  • v4 batching: 233 seconds (3mins 53secs)
  • gago (v4 batching plus 10 API concurrency): 119 seconds (1min 59secs)

e.g. about 85% saving in time from the API paging to full optimisation. I think that does qualify for ‘blazingly fast’. It could be improved further by tailoring page size to your query, and if anti_sample is set since the more API calls made, the greater the gains should be available.

Summary

There are still some improvements to be made to gago at the time of writing such as supporting filters and segments, but as an introduction to Go it has helped me see Go’s strengths and where it may fit in my workflow. I look forward to working with Go again on a project, as overall it was nice to work with, using Visual Studio Code to help me along. I feel it has improved my horizons and I can already see how it has influenced how I code in other languages.

Share Comments
comments powered by Disqus