6 March 2016
For our chosen model we need individual user session information, e.g. a userId for each hit.
```r
library(googleAnalyticsR)
ga_auth()

gaId <- xxxx # Your View ID

## In this case, dimension3 contains userId in format:
## u={cid}&t={hit-timestamp}
raw <- google_analytics(gaId,
                        start = "2016-02-01", end = "2016-02-01",
                        metrics = c("pageviews"),
                        dimensions = c("dimension3", "pagePath"))
```
For Adobe Analytics, use the RSiteCatalyst library.
dimension3 | pagePath | pageviews |
---|---|---|
u=100116318.1454322382&t=1454322382033 | /example/809 | 1 |
u=100116318.1454322382&t=1454322412130 | /example/1212 | 1 |
u=100116318.1454322382&t=1454322431492 | /example/339 | 1 |
u=100116318.1454322382&t=1454322441120 | /example/1494 | 1 |
u=100116318.1454322382&t=1454322450156 | /example/339 | 1 |
u=100116318.1454322382&t=1454322461871 | /example/1703 | 1 |
For each user, group the pages they visited in timestamp order.
```r
library(tidyr)
library(dplyr)

## split dimension3 (u={cid}&t={timestamp}) into its own "cid" and "timestamp" columns
processed <- raw %>%
  extract(col = dimension3, into = c("cid", "timestamp"),
          regex = "u=(.+)&t=(.+)")

## JavaScript (millisecond) to R timestamp
processed$timestamp <- as.POSIXct(as.numeric(processed$timestamp) / 1000,
                                  origin = "1970-01-01")

## find users with session length > 1, i.e. not a bounce visit
nonbounce <- processed %>%
  group_by(cid) %>%
  summarise(session_length = n()) %>%
  filter(session_length > 1)

processed <- nonbounce %>% left_join(processed)
```
cid | session_length | timestamp | pagePath | pageviews |
---|---|---|---|---|
1005103157.1454327958 | 2 | 2016-02-01 11:59:18 | /example/1 | 1 |
1005103157.1454327958 | 2 | 2016-02-01 12:02:42 | /example/155 | 1 |
1010303050.1454327644 | 2 | 2016-02-01 11:54:03 | /example/144 | 1 |
1010303050.1454327644 | 2 | 2016-02-01 12:00:03 | /example/80 | 1 |
1011007665.1454333263 | 2 | 2016-02-01 13:27:43 | /example/1359 | 1 |
Our model library, clickstream, needs a vector of sequential pageviews per userId. You may also need to aggregate the pages; that step is unique to each website, so it is not covered here.
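As a sketch of what such aggregation could look like (the `aggregatePages` helper and its regex are hypothetical, not part of the original workflow; adapt them to your own URL structure), individual article URLs could be collapsed into their section so the Markov chain has fewer, more meaningful states:

```r
## Hypothetical helper: collapse article-level URLs into their section.
## Assumes paths like /section/article-id - adapt the regex as needed.
aggregatePages <- function(paths) {
  sub("^(/[^/]+)/.*$", "\\1", paths)
}

aggregatePages(c("/example/809", "/example/1212", "/blog/2016/post"))
## [1] "/example" "/example" "/blog"
```

Fewer states also means the model fits faster and each transition probability is estimated from more observations.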
```r
## for each cid, make a comma-separated string of pagePath in timestamp order
sequence <- processed %>%
  arrange(cid, timestamp) %>%
  group_by(cid) %>%
  summarise(sequence = paste(pagePath, collapse = ","))

sequence <- paste(sequence$cid, sequence$sequence, sep = ",")
```
```r
## example entry of page sequence per user for clickstream
sequence[[1]]
## [1] "100116318.1454322382,/example/809,/example/1212,/example/339,/example/1494,/example/339,/example/1703,/example/1703,/example/1722,/example/1703"
```
Create a first-order Markov chain model.
```r
library(clickstream)

## fitting a simple Markov chain and predicting the next click
csf <- tempfile()
writeLines(sequence, csf)
cls <- readClickstreams(csf, header = TRUE)

## 1612 users - computing time: 285 seconds
model <- fitMarkovChain(cls, verbose = TRUE)

## save model for use on OpenCPU
save(model, file = "./data/model.rda")
```
Predictions are almost instant now that the model object is built.
```r
## see ?clickstream for details
## make predictions
predict(model, new("Pattern", sequence = c("/example/96", "/example/213", "/example/107")))

## prediction output:
## Sequence: /example/251
## Probability: 0.5657379
## Absorbing Probabilities:
##   None
## 1    0
```
OpenCPU supports GitHub webhooks: the model updates every time you push to GitHub!
Create a small custom package containing the model data and a function to predict pageviews.
```r
predictMarkov <- function(pageview_names){
  ## model is loaded on package load
  states <- invisible(clickstream::states(model))
  pv_n <- pageview_names[pageview_names %in% states]
  startPattern <- new("Pattern", sequence = pv_n)
  predict <- predict(model, startPattern)
  list(page = predict@sequence, probability = predict@probability)
}
```
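Once the package is deployed, the function above can be reached over HTTP via OpenCPU's JSON API. A minimal sketch, assuming the package is published as `markovmodel` (a placeholder name, use your own) on the public OpenCPU server:

```r
## Hypothetical call to the deployed predictMarkov function via the
## OpenCPU JSON API: POST /ocpu/library/{package}/R/{function}/json.
## "markovmodel" is a placeholder package name.
library(httr)

res <- POST(
  "https://public.opencpu.org/ocpu/library/markovmodel/R/predictMarkov/json",
  body = list(pageview_names = c("/example/96", "/example/213")),
  encode = "json"
)
content(res)
```

This is what lets a website's JavaScript request a next-page prediction in real time without running R in the browser.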
My GitHub package using this function is here.
Test your OpenCPU calls here: https://public.opencpu.org/ocpu/test/
Hopefully you can think of something better than a popup…
Questions?