
Making speech-to-text work with React Native and Expo

How we created a voice search with Google Cloud


In this post, I will outline our thought process behind the development of a voice search feature in a recent React Native project. I will also walk through how the final feature works with some annotated code examples. If you want to cut straight to the example repos, see the Expo React Native app with voice search example and the audio-to-text Google Cloud function example.


We Like Expo

We’ve developed React Native apps with Expo and without. Expo is a valuable toolset that removes frustrating layers from the development process and provides easy bridging to device system features. Instead of spending time combing through Xcode and Android Studio, we can spend time on user-facing features. Expo also simplifies a time-consuming and tedious build process. Once you use Expo, it is hard to go back to debugging things that should be simple (like loading fonts) and Googling esoteric errors (although that can’t be totally avoided).

Stand Clear of the Closing Doors Please

When you board the Expo train, you need to be all in. The Expo SDK provides tons of access to system functionality such as the camera, calendar, and accelerometer. But it does have limitations. If you need functionality that falls outside the SDK's features, you'll need to eject and rebuild those features with native code or with an existing package that does the bridging for you (think "link"). Halfway through a project, this can be a daunting undertaking, and all that retesting will likely blow your budget. (This is why we are excited about Unimodules and the possibility of using parts of the Expo API.)

To Expo or not to Expo

When we started a recent React Native project, we weighed whether or not to use Expo. We could accomplish most of the project requirements within Expo, but one gave us pause: voice search. If we ejected the app, we could probably use react-native-voice, but we didn't want to sacrifice the Expo gains for one feature. So we decided to find another way to build a voice search and keep Expo. (iOS caveat: this setup works specifically for iOS and .wav files.)

The Voice Search Pipeline

Expo has a text-to-speech API, but not a speech-to-text one. We decided to use Expo's Permissions and Audio APIs but find a different solution for speech-to-text. After looking into a number of possibilities, we decided to use Google's Cloud Speech-to-Text to transcribe audio files into text.

Here are the steps in our voice search:

  • The device asks for permission to use the microphone via Expo’s Permissions API.

  • Expo’s Audio API records an audio file of the user’s speech.

  • The audio file is sent to a Google Cloud function, which in turn sends it to the Google Speech API.

  • The Speech API returns a text transcription of the audio.

  • The audio file is deleted.

  • The text is fed into the Algolia search.

React Native Example App

In the example app, the audio file is recorded while you hold down the button. Here is the function that records the audio:

startRecording = async () => {
  const { status } = await Permissions.askAsync(Permissions.AUDIO_RECORDING);
  if (status !== 'granted') return;

  this.setState({ isRecording: true });

  // some of these are not applicable, but are required
  await Audio.setAudioModeAsync({
    allowsRecordingIOS: true,
    interruptionModeIOS: Audio.INTERRUPTION_MODE_IOS_DO_NOT_MIX,
    playsInSilentModeIOS: true,
    shouldDuckAndroid: true,
    interruptionModeAndroid: Audio.INTERRUPTION_MODE_ANDROID_DO_NOT_MIX,
    playThroughEarpieceAndroid: true,
  });

  const recording = new Audio.Recording();
  try {
    await recording.prepareToRecordAsync(recordingOptions);
    await recording.startAsync();
  } catch (error) {
    console.log(error);
    this.stopRecording();
  }
  this.recording = recording;
};
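
The stopRecording helper referenced above isn't shown in this post. Here is a minimal sketch of what it might look like (an assumption on my part, built on Expo's stopAndUnloadAsync, not the exact code from the repo):

stopRecording = async () => {
  this.setState({ isRecording: false });
  try {
    // stop the recorder and release the audio session
    await this.recording.stopAndUnloadAsync();
  } catch (error) {
    // stopAndUnloadAsync throws if the recording was never started
    console.log(error);
  }
};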

One of the tricky parts of this setup is getting audio recording options that work with the options passed to the Google Speech API. This setup is specific to iOS and .wav files, but the Android properties still need to be passed in. I found .wav files work best with LINEAR16 encoding at 44100 Hz. If you get the wrong combination of options, the Google Speech API will return an empty response for your audio file, which is very frustrating.

const recordingOptions = {
  // android not currently in use, but parameters are required
  android: {
    extension: '.m4a',
    outputFormat: Audio.RECORDING_OPTION_ANDROID_OUTPUT_FORMAT_MPEG_4,
    audioEncoder: Audio.RECORDING_OPTION_ANDROID_AUDIO_ENCODER_AAC,
    sampleRate: 44100,
    numberOfChannels: 2,
    bitRate: 128000,
  },
  ios: {
    extension: '.wav',
    audioQuality: Audio.RECORDING_OPTION_IOS_AUDIO_QUALITY_HIGH,
    sampleRate: 44100,
    numberOfChannels: 1,
    bitRate: 128000,
    linearPCMBitDepth: 16,
    linearPCMIsBigEndian: false,
    linearPCMIsFloat: false,
  },
};

Okay, now we have the audio file. We'll POST it as form data to the Google Cloud function:

getTranscription = async () => {
  this.setState({ isFetching: true });
  try {
    const info = await FileSystem.getInfoAsync(this.recording.getURI());
    console.log(`FILE INFO: ${JSON.stringify(info)}`);
    const uri = info.uri;
    const formData = new FormData();
    formData.append('file', {
      uri,
      type: 'audio/x-wav',
      // could be anything
      name: 'speech2text',
    });
    const response = await fetch(config.CLOUD_FUNCTION_URL, {
      method: 'POST',
      body: formData,
    });
    const data = await response.json();
    this.setState({ query: data.transcript });
  } catch (error) {
    console.log('There was an error', error);
    this.stopRecording();
    this.resetRecording();
  }
  this.setState({ isFetching: false });
};
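
The catch block calls resetRecording, and the "audio file is deleted" step from the pipeline happens around here as well. Neither helper is shown in the post, so here is a minimal sketch of one way they could look, assuming Expo's FileSystem API (the helper names are illustrative):

deleteRecordingFile = async () => {
  try {
    // look up the recording's URI and remove the file from disk
    const info = await FileSystem.getInfoAsync(this.recording.getURI());
    await FileSystem.deleteAsync(info.uri);
  } catch (error) {
    console.log('There was an error deleting the recording file', error);
  }
};

resetRecording = () => {
  this.deleteRecordingFile();
  this.recording = null;
};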

The Cloud function is written in Node 8 and uses the npm package busboy to temporarily buffer the uploaded file in memory. The file is then sent to the Google Speech-to-Text API with these hard-coded options:

const ENCODING = 'LINEAR16';
const SAMPLE_RATE_HERTZ = 44100; // must match the recording's sample rate
const LANGUAGE = 'en-US';

The function then sends the file to the Speech API, which hopefully returns the transcribed text 🤞.
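
For reference, here is a condensed sketch of what such a function could look like, using the busboy-in-memory approach and constants described above. The export name is an assumption, and error handling is simplified; the { transcript } response shape matches what the app reads as data.transcript. Check the example repo for the real thing:

const Busboy = require('busboy');
const speech = require('@google-cloud/speech');

const client = new speech.SpeechClient();

exports.speechToText = (req, res) => {
  const busboy = new Busboy({ headers: req.headers });
  const chunks = [];

  // buffer the uploaded file in memory as it streams in
  busboy.on('file', (fieldname, file) => {
    file.on('data', (data) => chunks.push(data));
  });

  busboy.on('finish', async () => {
    try {
      const audioBytes = Buffer.concat(chunks).toString('base64');
      const [response] = await client.recognize({
        config: {
          encoding: ENCODING,
          sampleRateHertz: SAMPLE_RATE_HERTZ,
          languageCode: LANGUAGE,
        },
        audio: { content: audioBytes },
      });
      const transcript = response.results
        .map((result) => result.alternatives[0].transcript)
        .join('\n');
      res.status(200).json({ transcript });
    } catch (error) {
      console.error(error);
      res.status(500).send(error.message);
    }
  });

  // Cloud Functions pre-buffers the HTTP body as req.rawBody
  busboy.end(req.rawBody);
};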

Algolia

In this setup, Algolia just ingests the text for the search. But a lot can be done behind the scenes to configure Algolia to translate natural spoken language into matching search results. That is a topic for another blog post.
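
For completeness, here is a minimal sketch of feeding the transcript into an Algolia query, assuming the algoliasearch client (v4 API). The app ID, API key, and index name are placeholders:

import algoliasearch from 'algoliasearch';

const client = algoliasearch('YOUR_APP_ID', 'YOUR_SEARCH_ONLY_API_KEY');
const index = client.initIndex('products');

// run a plain text search with the transcript returned by the Cloud function
const searchWithTranscript = async (transcript) => {
  const { hits } = await index.search(transcript);
  return hits;
};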

The End

Hopefully this example setup can help someone else. If you have any questions, hit me up on Twitter @kellypacker.

