Simple Queries

You can run queries whose results are under roughly 100,000 rows with bqr_query():

bqr_query("big-query-r","samples",
          "SELECT COUNT(repository.url) FROM [publicdata:samples.github_nested]")

Much beyond that row limit, the API may start to hang, and you are limited by your download bandwidth.

Asynchronous Queries

For bigger queries, asynchronous queries save the results to another BigQuery table. You can check the progress of the job via bqr_get_job().

library(bigQueryR)

## Auth with a project that has at least BigQuery and Google Cloud Storage scope
bqr_auth()

## make a big query
job <- bqr_query_asynch("your_project", 
                        "your_dataset",
                        "SELECT * FROM blah LIMIT 9999999", 
                        destinationTableId = "bigResultTable")
                        
## poll the job to check its status
## it's done when job$status$state == "DONE"
bqr_get_job("your_project", job$jobReference$jobId)

## once done, the query results are in "bigResultTable"
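
Rather than calling bqr_get_job() by hand, you can poll in a loop. A minimal sketch using only the fields shown above (the 5-second wait is an arbitrary choice):

## poll every 5 seconds until the job finishes
while(TRUE){
  job <- bqr_get_job("your_project", job$jobReference$jobId)
  if(job$status$state == "DONE") break
  Sys.sleep(5)
}
## if the job failed, details are in job$status$errorResult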

You may now want to download this data. For large datasets, this is best done by extracting the BigQuery result to Google Cloud Storage, then downloading the data from there.

You can create a bucket in Google Cloud Storage at https://console.cloud.google.com/storage/browser, or programmatically via library(googleCloudStorageR).
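
For example, a minimal bucket-creation sketch with googleCloudStorageR (this assumes that package is authenticated against the same project; the bucket name is a placeholder and must be globally unique):

library(googleCloudStorageR)

## create a bucket to hold the BigQuery extract
gcs_create_bucket("your_cloud_storage_bucket_name", "your_project")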

Once the bucket is created, you can extract your data as shown below:

## Create the data extract from BigQuery to Cloud Storage
job_extract <- bqr_extract_data("your_project",
                                "your_dataset",
                                "bigResultTable",
                                "your_cloud_storage_bucket_name")
                                
## poll the extract job to check its status
## it's done when the returned job's status$state == "DONE"
bqr_get_job("your_project", job_extract$jobReference$jobId)

## to download via a URL without logging in to the Google Cloud Storage interface:
## Use an email that is Google account enabled
## Requires scopes:
##  https://www.googleapis.com/auth/devstorage.full_control
##  https://www.googleapis.com/auth/cloud-platform
## set via options("bigQueryR.scopes") and reauthenticate if needed

download_url <- bqr_grant_extract_access(job_extract, "your@email.com")

## download_url may contain multiple URLs if the data is > 1 GB
[1] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000000.csv"
[2] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000001.csv"
[3] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000002.csv"

Copyright (c) 2016 Sunholo Ltd. Released under MIT license.