Creating a content recommendation engine using R, OpenCPU and GTM

Proof of concept for connecting Google Tag Manager to R

Connecting R and Google Tag Manager with OpenCPU

Enriching the dataLayer of Google Tag Manager with data from R functions opens up many possibilities. This is possible with OpenCPU, which turns R packages into a robust API.

The proof of concept below is a content recommendation engine that makes predictions based on what a user has viewed in their session.

Creating a content recommendation engine with R, OpenCPU and GTM

This presentation was given at MeasureCamp on the 5th March, 2016 and is available here.

The overview of the data architecture from the presentation is here:

data architecture

Example website

A live example with a GTM container calling OpenCPU for content predictions, as shown in the presentation, is available in this GitHub project, here.

R package called from OpenCPU

See the R code for the package and the model data here.

Using OpenCPU

Creating a webhook

Create a GitHub webhook pointing at OpenCPU; every push to GitHub will then be mirrored onto the OpenCPU server.

github webhook

Testing the model API works

You can check that the model works through the API by using the public test server.

For production use, though, you will want your own server, either running OpenCPU yourself or using a paid-for alternative such as yhat or Domino Data Lab.

This GitHub repository is available on OpenCPU at this URL: https://MarkEdmondson1234.ocpu.io/predictClickOpenCPU/

See details on how to use OpenCPU here: https://www.opencpu.org/api.html#api-ci

Test the R API package

Test at: https://public.opencpu.org/ocpu/test/

Send a POST request to this URL:

https://MarkEdmondson1234.ocpu.io/predictClickOpenCPU/R/predictMarkov/json

...with the parameters:

pageview_names

["/example/96","/example/213","/example/107"]

If successful you should see a 200 response code with the following JSON:

{
  "page": ["/example/251"],
  "probability": [0.5657]
}

good test opencpu
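The same test can be scripted rather than run through the test page. The sketch below is a minimal example, assuming OpenCPU accepts the function arguments as a JSON body (Content-Type application/json) and that a global fetch is available (a modern browser or Node 18+). The names buildPredictRequest and predictNextPage are illustrative and not part of the package.

```javascript
// URL of the predictMarkov function, with /json so the result comes back as JSON
const PREDICT_URL =
  "https://MarkEdmondson1234.ocpu.io/predictClickOpenCPU/R/predictMarkov/json";

// Build the fetch() arguments for a list of page paths (pure, no network call)
function buildPredictRequest(pageviewNames) {
  return {
    url: PREDICT_URL,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // the key must match the R function's argument name
      body: JSON.stringify({ pageview_names: pageviewNames })
    }
  };
}

// Send the request and return the parsed prediction
async function predictNextPage(pageviewNames) {
  const req = buildPredictRequest(pageviewNames);
  const res = await fetch(req.url, req.options);
  if (!res.ok) {
    throw new Error("OpenCPU returned HTTP " + res.status);
  }
  return res.json(); // expected shape: { page: [...], probability: [...] }
}
```

Calling predictNextPage(["/example/96", "/example/213", "/example/107"]) should resolve to an object of the shape shown above, provided the public server is reachable.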

R code to generate the model

The R code to create the model is shown below:

Fetching Google Analytics data

## from https://github.com/MarkEdmondson1234/googleAnalyticsR_public
library(googleAnalyticsR)
ga_auth()

gaId <- xxxx # GA ViewId

## dimension3 contains userId in format:
## u={cid}&t={timestamp}
raw <- google_analytics(gaId,
                        start = "2016-02-01",
                        end = "2016-02-01",
                        metrics = c("pageviews"),
                        dimensions = c("dimension3", "pagePath"),
                        max_results = 20000)

Transforming the data into a form suitable for the model

split_regex <- "u=(.+)&t=(.+)"
library(dplyr)
library(tidyr)
processed <- raw %>% tidyr::extract(dimension3,
                             c("cid","timestamp"),
                             split_regex)

## javascript to R timestamp
processed$timestamp <- as.POSIXct(as.numeric(processed$timestamp) / 1000,
                             origin = "1970-01-01")

## find users with session length > 1
nonbounce <- processed %>% group_by(cid) %>%
  summarise(session_length = n()) %>% filter(session_length > 1) %>% ungroup()

processed <- nonbounce %>% left_join(processed)

processed <- processed %>% arrange(cid, timestamp)

## for each cid, make a string of pagePath in timestamp order
sequence <- processed %>% group_by(cid) %>%
  summarise(sequence = paste(pagePath, collapse = ","))

sequence <- paste(sequence$cid, sequence$sequence, sep=",")
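To make the target format concrete, here is a small JavaScript sketch of the same transformation: one comma-separated line per user, starting with the cid and followed by the pages in timestamp order, with single-page users dropped. The function name buildClickstreams and the sample rows are made up for illustration; the real data comes from the Google Analytics fetch above.

```javascript
// Turn per-hit rows into one line per user: "cid,page1,page2,..."
function buildClickstreams(rows) {
  // group hits by client id
  const byCid = new Map();
  for (const row of rows) {
    if (!byCid.has(row.cid)) byCid.set(row.cid, []);
    byCid.get(row.cid).push(row);
  }
  const lines = [];
  for (const [cid, hits] of byCid) {
    if (hits.length < 2) continue; // keep session length > 1 only
    hits.sort((a, b) => a.timestamp - b.timestamp); // timestamp order
    lines.push([cid, ...hits.map(h => h.pagePath)].join(","));
  }
  return lines;
}

// hypothetical sample data
const rows = [
  { cid: "a", timestamp: 2000, pagePath: "/example/213" },
  { cid: "a", timestamp: 1000, pagePath: "/example/96" },
  { cid: "b", timestamp: 1500, pagePath: "/example/107" }
];
console.log(buildClickstreams(rows)); // [ 'a,/example/96,/example/213' ]
```

User "b" is dropped because a single pageview gives the model no transition to learn from.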

Creating the model

library(clickstream)

# fitting a simple Markov chain and predicting the next click
clickstreams <- sequence
csf <- tempfile()
writeLines(clickstreams, csf)
cls <- readClickstreams(csf, header = TRUE)

## Make the model:

## 1612 users - 285 seconds
model <- fitMarkovChain(cls, verbose=TRUE)

### Using the model:

## get the likely pages a user starts from
likely_start <- as.data.frame(model@start)
likely_start <- likely_start[order(likely_start$Freq, decreasing = TRUE),]

## List of pages in the model
states(model)

## Prediction:
startPattern <- new("Pattern", sequence = c("/example/96","/example/213"))
predict(model, startPattern)

## pages that absorb (e.g. are last in session)
last_pages <- absorbingStates(model)

## model is saved so it can be uploaded to the R package for the predictions:
save(model, file="./data/model.RData")
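For intuition about what a simple Markov chain prediction involves, the sketch below fits a first-order chain by counting page-to-page transitions and picks the most probable next page. It is a toy illustration in JavaScript, not the clickstream package's implementation; the function names and sample sessions are assumptions made for the example.

```javascript
// Count page-to-page transitions and convert each row to probabilities
function fitFirstOrderChain(sessions) {
  const counts = {};
  for (const pages of sessions) {
    for (let i = 0; i + 1 < pages.length; i++) {
      const from = pages[i];
      const to = pages[i + 1];
      counts[from] = counts[from] || {};
      counts[from][to] = (counts[from][to] || 0) + 1;
    }
  }
  const transitions = {};
  for (const from of Object.keys(counts)) {
    const total = Object.values(counts[from]).reduce((a, b) => a + b, 0);
    transitions[from] = {};
    for (const to of Object.keys(counts[from])) {
      transitions[from][to] = counts[from][to] / total;
    }
  }
  return transitions;
}

// Most likely next page given the last page viewed, or null if unseen
function predictNext(transitions, lastPage) {
  const row = transitions[lastPage];
  if (!row) return null;
  let best = null;
  for (const page of Object.keys(row)) {
    if (best === null || row[page] > best.probability) {
      best = { page: page, probability: row[page] };
    }
  }
  return best;
}

// hypothetical sessions
const sessions = [
  ["/example/96", "/example/213", "/example/107"],
  ["/example/96", "/example/213", "/example/251"],
  ["/example/213", "/example/251"]
];
const chain = fitFirstOrderChain(sessions);
console.log(predictNext(chain, "/example/213"));
// { page: '/example/251', probability: 0.6666666666666666 }
```

Two of the three visits to /example/213 were followed by /example/251, hence the probability of 2/3.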

Plotting the model

If you want to visualise the model, the code below plots it:

library(igraph)
example_graph <- igraph::graph.adjacency(t(as.matrix(model@transitions[[1]])),
                                         weighted = TRUE)

## most likely start pages, taken from the fitted model
likely_start <- as.data.frame(model@start)
likely_start <- likely_start[order(likely_start$Freq, decreasing = TRUE),]

likely_plot <- likely_start[order(likely_start$Freq),]
likely_plot <- likely_plot[likely_plot$Freq > 0.02,]

## make a sub graph as plotting them all is a big mess
sub_graph <- graph.neighborhood(example_graph,
                                order = 1,
                                nodes = V(example_graph)["/example/96" %in% as.character(likely_plot$Var1)]
)[[1]]


## the actual plot
edgeLabels <-  paste0(round(igraph::E(sub_graph)$weight, 2)*100,"%")
plot(sub_graph, edge.label = edgeLabels,
     vertex.size = 15,
     vertex.color="orange",
     # vertex.label = NA,
     vertex.label.dist = 1,
     vertex.label.cex=1.5,
     vertex.label.color = "orange",
     edge.label.cex=1.5,
     edge.width =1.5,
     edge.lty = 2,
     edge.color = "brown",
     edge.curved = TRUE,
     edge.arrow.size = 1,
     frame=TRUE,
     asp = 0.8,
     margin = 0.2,
     main = "Example Markov Chain for Website")

example markov chain

GTM code for calling OpenCPU

The custom HTML tag code is shown below:

Creating a cookie to record URL pages in the session

//Tag : Write SessionUrls to Cookie
<script>

  var str = {{Page Path}};

  // parsing URL to fit model: keep the path from "/predictClickOpenCPU/"
  // up to (not including) the file extension
  var newUrl = str.substring(str.lastIndexOf("/predictClickOpenCPU/"), str.lastIndexOf("."));

  // if existing cookie append pageURL, else record this one
  var sessionUrls;
  if({{sessionUrls}}){

    sessionUrls = {{sessionUrls}} + "," + newUrl;

  } else {

    sessionUrls = newUrl;
  }

  //write session cookie
  document.cookie = 'sessionUrls=' + sessionUrls;

</script>
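The tag relies on a {{sessionUrls}} GTM variable, which is assumed here to be a 1st Party Cookie variable that reads the sessionUrls cookie back. Outside GTM, the read-and-append step can be sketched in plain JavaScript as below; readCookie and appendSessionUrl are hypothetical helper names, and the cookie string is sample data standing in for document.cookie.

```javascript
// Read one cookie value out of a document.cookie-style string
function readCookie(cookieString, name) {
  const parts = cookieString.split(";");
  for (const part of parts) {
    const trimmed = part.trim();
    if (trimmed.indexOf(name + "=") === 0) {
      return trimmed.substring(name.length + 1);
    }
  }
  return undefined; // cookie not set yet
}

// Append the current page to the comma-separated session list
function appendSessionUrl(existing, newUrl) {
  return existing ? existing + "," + newUrl : newUrl;
}

// sample cookie string; in a browser this would be document.cookie
const cookies = "_ga=GA1.2.1.2; sessionUrls=/example/96,/example/213";
const updated = appendSessionUrl(readCookie(cookies, "sessionUrls"), "/example/107");
console.log(updated); // /example/96,/example/213,/example/107
```

Keeping the logic in small pure functions like these makes it easy to check outside the tag before pasting it into GTM.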

Calling the OpenCPU server and putting result in dataLayer

//Tag Read from OpenCPU
<script src="//code.jquery.com/jquery-1.10.2.min.js"></script>
<script src="//www.opencpu.org/js/archive/opencpu-0.4.js"></script>
<script>
//point the client at the "predictClickOpenCPU" package on the public server (cross-domain call via CORS)
ocpu.seturl("//MarkEdmondson1234.ocpu.io/predictClickOpenCPU/R");

//split character into array
var mydata = {{sessionUrls}}.split(",");

console.log(mydata);

//call R function: predictClickOpenCPU::predictMarkov(pageview_names=mydata)
var req = ocpu.rpc("predictMarkov", {
    pageview_names : mydata
  }, function(output){
    dataLayer.push({
              'event' : 'openCPUcall',
              'prediction': output.page[0],
              'probability': output.probability[0]
            });
    console.log("R returned: " + JSON.stringify(output));
});

//optional
req.fail(function(){
    console.log("R returned an error: " + req.responseText);
  });

</script>