Turn audio into text

gl_speech(
  audio_source,
  encoding = c("LINEAR16", "FLAC", "MULAW", "AMR", "AMR_WB", "OGG_OPUS",
  sampleRateHertz = NULL,
  languageCode = "en-US",
  maxAlternatives = 1L,
  profanityFilter = FALSE,
  speechContexts = NULL,
  asynch = FALSE,
  customConfig = NULL
)



audio_source
File location of audio data, or Google Cloud Storage URI

encoding
Encoding of audio data sent

sampleRateHertz
Sample rate in Hertz of audio data. Valid values are 8000-48000. If left NULL, the default is 16000, which is also the optimal rate.

languageCode
Language of the supplied audio as a BCP-47 language tag

maxAlternatives
Maximum number of recognition hypotheses to be returned, from 0 to 30

profanityFilter
If TRUE, will attempt to filter out profanities

speechContexts
An optional character vector of context to assist the speech recognition

asynch
If your audio_source is longer than 60 seconds, set this to TRUE to make an asynchronous call

customConfig
[optional] A RecognitionConfig object, supplied as a list that will be converted to JSON via toJSON; see the RecognitionConfig documentation. Because languageCode is required, it will be taken from this function's arguments if it is not present in the custom config.
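
The argument rules above (sample rate range and default, alternatives range, and filling in languageCode when a custom config omits it) can be sketched in base R. This is only an illustration of the documented behaviour, not the package's internal code: `build_speech_config` is a hypothetical helper, and the exact fields the package sends are an assumption.

```r
# Hypothetical helper (not part of googleLanguageR): assemble a config list
# following the documented argument rules.
build_speech_config <- function(languageCode = "en-US",
                                sampleRateHertz = NULL,
                                maxAlternatives = 1L,
                                customConfig = NULL) {
  if (!is.null(customConfig)) {
    stopifnot(is.list(customConfig))
    # languageCode is required, so add it only when the custom config lacks it.
    # [[ ]] is used for exact name matching ($ would partially match on lists).
    if (is.null(customConfig[["languageCode"]])) {
      customConfig[["languageCode"]] <- languageCode
    }
    return(customConfig)
  }

  # Documented default when sampleRateHertz is left NULL
  if (is.null(sampleRateHertz)) {
    sampleRateHertz <- 16000L
  }
  # Documented valid ranges: 8000-48000 Hz and 0-30 alternatives
  stopifnot(sampleRateHertz >= 8000, sampleRateHertz <= 48000)
  stopifnot(maxAlternatives >= 0, maxAlternatives <= 30)

  list(
    languageCode = languageCode,
    sampleRateHertz = as.integer(sampleRateHertz),
    maxAlternatives = as.integer(maxAlternatives)
  )
}

# A custom config without languageCode gets the function argument added
cfg <- build_speech_config(customConfig = list(encoding = "LINEAR16"))
str(cfg)
```

A languageCode already present in the custom config is left untouched, so the function argument only acts as a fallback.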


A list of two tibbles: $transcript, a tibble of the transcript with a confidence; $timings, a tibble that contains startTime and endTime per word. If maxAlternatives is greater than 1, the transcript will contain near-duplicate rows with other interpretations of the text. If asynch is TRUE, an operation object is returned instead, which you will need to pass to gl_speech_op to get the finished result.
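
When maxAlternatives is greater than 1, one way to handle the near-duplicate rows is to keep the row with the highest confidence. The sketch below uses a mocked-up result with invented values rather than a real API response, and the column names `transcript` and `confidence` are assumptions based on the description above.

```r
# Mock of the documented return shape (illustrative values only).
# Column names `transcript` and `confidence` are assumed.
result <- list(
  transcript = data.frame(
    transcript = c("hello world", "hello word"),
    confidence = c(0.92, 0.78),
    stringsAsFactors = FALSE
  )
)

# Keep the alternative with the highest confidence
best_row <- which.max(result$transcript$confidence)
best <- result$transcript[best_row, ]
print(best$transcript)
```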


Google Cloud Speech API enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API. The API recognizes over 80 languages and variants, to support your global user base. You can transcribe the text of users dictating to an application’s microphone, enable command-and-control through voice, or transcribe audio files, among many other use cases. You can recognize audio uploaded in the request, and integrate with your audio storage on Google Cloud Storage, using the same technology Google uses to power its own products.


Audio encoding of the data sent in the audio message. All encodings support only 1 channel (mono) audio. Only FLAC and WAV include a header that describes the bytes of audio that follow the header. The other encodings are raw audio bytes with no header. For best results, the audio source should be captured and transmitted using a lossless encoding (FLAC or LINEAR16). Recognition accuracy may be reduced if lossy codecs, which include the other codecs listed in this section, are used to capture or transmit the audio, particularly if background noise is present.

Read more on audio encodings at https://cloud.google.com/speech/docs/encoding


startTime - Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word.

endTime - Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word.

word - The word corresponding to this set of information.
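
As a small illustration of working with these fields, the sketch below computes how long each word was spoken. It uses a mocked-up timings table with invented values, and it assumes the offsets arrive as character strings with a trailing "s" (for example "0.400s"); if your result already holds numeric seconds, the conversion step is unnecessary. `to_seconds` is a hypothetical helper, not a package function.

```r
# Mock of the $timings table (illustrative values only)
timings <- data.frame(
  startTime = c("0s", "0.400s", "0.900s"),
  endTime = c("0.400s", "0.900s", "1.500s"),
  word = c("turn", "audio", "text"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip the trailing "s" and convert to numeric seconds
to_seconds <- function(x) as.numeric(sub("s$", "", x))

# Duration of each spoken word in seconds
timings$duration <- to_seconds(timings$endTime) - to_seconds(timings$startTime)
print(timings)
```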

See also


if (FALSE) {
  test_audio <- system.file("woman1_wb.wav", package = "googleLanguageR")
  result <- gl_speech(test_audio)

  result$transcript
  result$timings

  result2 <- gl_speech(test_audio, maxAlternatives = 2L)
  result2$transcript

  result_brit <- gl_speech(test_audio, languageCode = "en-GB")

  ## make an asynchronous API request (mandatory for sound files over 60 seconds)
  asynch <- gl_speech(test_audio, asynch = TRUE)

  ## Send to gl_speech_op() for status or finished result
  gl_speech_op(asynch)

  ## Upload to GCS bucket for long files > 60 seconds
  test_gcs <- "gs://mark-edmondson-public-files/googleLanguageR/a-dream-mono.wav"
  gcs <- gl_speech(test_gcs, sampleRateHertz = 44100L, asynch = TRUE)
  gl_speech_op(gcs)

  ## Use a custom configuration
  my_config <- list(
    encoding = "LINEAR16",
    diarizationConfig = list(
      enableSpeakerDiarization = TRUE,
      minSpeakerCount = 2,
      maxSpeakerCount = 3
    )
  )

  # languageCode is required, so will be added if not in your custom config
  # (my_audio stands for the path to your own audio file)
  gl_speech(my_audio, languageCode = "en-US", customConfig = my_config)
}