Speech Recognition – Speech to Text in Python using Google API, Wit.AI, IBM, CMUSphinx


Speech Recognition is a part of Natural Language Processing, which is a subfield of Artificial Intelligence. In Speech Recognition, spoken words and sentences are translated into text by a computer. It is also known as Speech to Text (STT).

If you want to get started with building speech recognition or audio transcription in Python, this short tutorial covers the basics you need.

Why Python?

Python offers rich libraries that make developing complex applications much easier. Many of the things you need are already built, and you can build your own functionality on top of them.

For speech recognition too, Python has many libraries that make development easier and faster.

One more thing: if you are familiar with C/C++, PHP or any other common language, learning Python is pretty easy. It has a gentle learning curve.

Usage of Speech Recognition

Speech recognition is useful in a number of applications, especially personal assistant bots, dictation, voice-command-based control systems, audio transcription, quick notes with audio support, voice-based authentication, etc.

Let’s Get Started

Install Python

It helps if you are a little familiar with Python. If not, don’t worry: it will take a little longer, but with some extra effort you should be able to reach the end successfully.

For this tutorial, we’ll be using Python 3.x. Let’s start from level 0 by installing Python. If Python is already installed on your system, you can skip this step and jump to the next one.

There are multiple ways to install Python. You can install it standalone or install a distribution such as Anaconda, which comes with Python.

To install Python, run “sudo apt install python3.7” if you are on Ubuntu, or follow https://www.youtube.com/watch?v=dX2-V2BocqQ if you are on Windows.

If you choose Anaconda, follow the instructions at https://conda.io/docs/user-guide/install/index.html

For the tutorial below, we used Python 3.x on Ubuntu 18.04.

Install required libraries

There are some excellent libraries available for building speech recognition. For this tutorial, we are going to use:

  • SpeechRecognition
  • pyaudio

Install the packages with the following commands (if pip3 is not installed yet, first install it with “sudo apt install python3-pip”):

pip3 install SpeechRecognition

Before installing pyaudio for the audio input/output stream, make sure you install portaudio with the following command:

sudo apt-get install portaudio19-dev

“portaudio” is a C library that is independent of Python, so it can’t be installed using pip. If portaudio is missing, the pyaudio installation may fail with the following error:
ERROR: Failed building wheel for pyaudio.

After “portaudio” is installed successfully, run the command below to install the pyaudio Python library:

pip3 install pyaudio

Now, your general setup is ready.

The SpeechRecognition library provides different methods for recognizing speech from an audio source. Each method uses a different third-party service or engine to detect speech. We are going to explore the following methods of the SpeechRecognition library:

  • recognize_google() for the Google Web Speech API (works out of the box with a default API key, with limited functionality)
  • recognize_google_cloud() for the Google Cloud Speech API
  • recognize_sphinx() for CMU Sphinx (requires installing PocketSphinx)
  • recognize_wit() for the speech recognition service provided by wit.ai
  • IBM Speech to Text: SpeechRecognition’s recognize_ibm() method did not work for us because of a credential issue, as IBM has updated its credential system. So we used IBM’s own library instead.
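
Because each of these methods takes the recorded audio and returns text, one possible way to combine them is to try several engines in order and keep the first result. The sketch below is only an illustration of that idea: `transcribe_with_fallback` is a hypothetical helper and not part of SpeechRecognition, and it uses stand-in functions instead of real engines so that it runs without a microphone or network access.

```python
from typing import Callable, Iterable, Optional, Tuple


def transcribe_with_fallback(
    audio: object,
    engines: Iterable[Tuple[str, Callable[[object], str]]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, recognize) pair in order.

    Returns (name, text) for the first engine that succeeds,
    or (None, None) if every engine raises an exception.
    """
    for name, recognize in engines:
        try:
            return name, recognize(audio)
        except Exception as error:  # e.g. sr.UnknownValueError or sr.RequestError
            print("{0} failed: {1}".format(name, error))
    return None, None


# Stand-in engines (assumptions for this sketch) so the example is self-contained.
def always_fails(audio):
    raise RuntimeError("service unavailable")


def echo_engine(audio):
    return "hello world"


winner, text = transcribe_with_fallback(
    b"fake-audio", [("first", always_fails), ("second", echo_engine)]
)
print(winner, text)
```

With the real library you could pass pairs such as ("google", r.recognize_google) and ("sphinx", r.recognize_sphinx), where r is an sr.Recognizer() instance.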

1. GOOGLE WEB SPEECH API

If you want to use the Google Web Speech API, you don’t need to install any packages or libraries apart from the ones mentioned above.

Below is the code snippet for speech to text using the Google Web Speech API with audio input from the microphone:

import speech_recognition as sr

r = sr.Recognizer()
speech = sr.Microphone(device_index=0)
# for recognizing speech
with speech as source:
    print("say something!…")
    r.adjust_for_ambient_noise(source)  # calibrate for background noise; returns None
    audio = r.listen(source)
# Speech recognition using Google Speech Recognition
try:
    recog = r.recognize_google(audio, language = 'en-US')
    # for testing purposes, we're just using the default API key
    # to use another API key, use r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")
    # instead of r.recognize_google(audio)

    print("You said: " + recog)
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))

If you want to transcribe speech from an audio file, you can do so by loading the file and passing it to the Google Web Speech API.

Below is the code snippet for speech to text using the Google Web Speech API to transcribe an audio file:

import speech_recognition as sr
r = sr.Recognizer()
file = sr.AudioFile('FILE_NAME.wav')
# for transcribing audio
with file as source:
    audio = r.record(source)
#  Speech recognition using Google Speech Recognition
try:
    recog = r.recognize_google(audio, language = 'en-US')
    # for testing purposes, we're just using the default API key
    # to use another API key, use r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")
    # instead of r.recognize_google(audio)
    print("You said: " + recog)
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e)) 

It is as simple as that.

But remember that the audio file must be in one of the supported formats: WAV, AIFF, AIFF-C or FLAC.
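
A quick check before handing a file to sr.AudioFile can save a confusing error later. The following is a minimal sketch that only looks at the file extension; the helper name `is_supported_audio` and the exact set of extensions are assumptions made for this example, not part of the SpeechRecognition library.

```python
import os

# Extensions commonly used for the supported formats (WAV, AIFF, AIFF-C, FLAC).
# The exact set is an assumption for this sketch.
SUPPORTED_EXTENSIONS = {".wav", ".aif", ".aiff", ".aifc", ".flac"}


def is_supported_audio(path):
    """Return True if the file extension matches one of the supported formats."""
    extension = os.path.splitext(path)[1].lower()
    return extension in SUPPORTED_EXTENSIONS


for name in ("meeting.wav", "podcast.MP3", "interview.flac"):
    print(name, is_supported_audio(name))
```

Files in other formats, such as MP3, have to be converted to one of the supported formats first.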

A crucial advantage of the Google Web Speech API is its accuracy. Its limitation is that you can make a maximum of 50 requests per day with the default key. Also, the default access provided by Google can be revoked at any time. So it is not advisable to use it in a production-level project.
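
If you experiment with the default key, it may help to count your own requests so that you stop before reaching the daily limit. Below is a rough sketch of such a counter; `DailyQuota` is a hypothetical helper written for this tutorial, it keeps its count in memory only, and the limit you pass in is whatever limit applies to your key.

```python
import datetime


class DailyQuota:
    """Count requests per calendar day and refuse once the limit is reached."""

    def __init__(self, limit=50):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_acquire(self, today=None):
        """Return True and count one request if today's quota allows it."""
        today = today or datetime.date.today()
        if today != self.day:  # first call on a new day: reset the counter
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


# Small limit and fixed dates so the behaviour is easy to follow.
quota = DailyQuota(limit=2)
day_one = datetime.date(2020, 1, 1)
print(quota.try_acquire(day_one), quota.try_acquire(day_one), quota.try_acquire(day_one))
print(quota.try_acquire(datetime.date(2020, 1, 2)))
```

You would call try_acquire() right before r.recognize_google(audio) and skip the request when it returns False. Because the count is lost when the script exits, a real application would need to store it somewhere persistent.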

2. GOOGLE CLOUD SPEECH TO TEXT API

Another option provided by Google is the Cloud Speech To Text API service, which can be used for live projects. To use the Google Cloud Speech To Text API, we need to install the libraries required to make it work.

Install the libraries below:

pip3 install google-api-python-client
pip3 install google-cloud-speech
pip3 install oauth2client

After installing the required libraries, we need to create a project in Google Cloud and download the Service Account Key JSON file.

To download the Service Account JSON key file, open Google Cloud Platform:

  • Click on the hamburger menu at the top left
  • Select IAM & Admin
  • Select Service Accounts
  • Then download the JSON key by clicking on the three dots and then the Create Key button

After downloading the key, place it in the same directory as your code file.
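
Before making any API calls, it can be worth confirming that the key file is where the script expects it and that it looks like a service account key. The sketch below assumes the usual fields of a service account key file (type, project_id, private_key, client_email); the helper names `key_path_beside` and `check_service_account_key`, as well as the script name transcribe.py, are hypothetical.

```python
import json
import os

REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def key_path_beside(script_path, filename="SERVICE_ACCOUNT_KEY.JSON"):
    """Return the absolute path of a key file stored next to script_path."""
    return os.path.join(os.path.dirname(os.path.abspath(script_path)), filename)


def check_service_account_key(path):
    """Load the key file, validate its basic shape and return the client e-mail."""
    with open(path, "r", encoding="utf-8") as handle:
        key = json.load(handle)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))
    if key["type"] != "service_account":
        raise ValueError("expected type 'service_account', got " + repr(key["type"]))
    return key["client_email"]


# In a real script you would pass __file__ instead of a made-up script name.
print(key_path_beside("transcribe.py"))
```

The returned path can then be assigned to os.environ["GOOGLE_APPLICATION_CREDENTIALS"] so that the script works no matter which directory it is started from.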

Now, we are ready to make calls to Google Cloud Speech To Text API.

Below is the code snippet for Speech to text using Google Cloud Speech API with input of audio by Microphone:

import speech_recognition as sr
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="SERVICE_ACCOUNT_KEY.JSON"

r = sr.Recognizer()
microphone = sr.Microphone(device_index=0)

with microphone as source:
    print("say something!!.....")
    r.adjust_for_ambient_noise(source)  # calibrate for background noise; returns None
    audio = r.listen(source)
    
try:
    recog = r.recognize_google_cloud(audio, language = 'en-US')
    print("You said: " + recog)
except sr.UnknownValueError as u:
    print(u)
    print("Google Cloud Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Cloud Speech Recognition service; {0}".format(e))  

Below is the code snippet for Speech to text using Google Cloud Speech API to transcribe audio file:

import speech_recognition as sr
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="SERVICE_ACCOUNT_KEY.JSON"

r = sr.Recognizer()
file = sr.AudioFile('AUDIO_FILE.wav')

with file as source:
    audio = r.record(source)

try:
    recog = r.recognize_google_cloud(audio, language = 'en-US')
    print("You said: " + recog)
except sr.UnknownValueError as u:
    print(u)
    print("Google Cloud Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Cloud Speech Recognition service; {0}".format(e))  

The Google Cloud Speech API is free for the first 60 minutes of audio per month. Beyond that, your account is charged according to Google’s pricing model.
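
To get a feel for how much of a free allowance a recording would use, you can measure its length before sending it. The sketch below uses Python’s standard wave module and works for uncompressed WAV files only; `wav_duration_seconds` and `write_silence` are hypothetical helpers, and the generated file of silence exists only so the example has something to measure.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds (frames / frame rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / float(wav_file.getframerate())


def write_silence(path, seconds=2, rate=16000):
    """Create a mono 16-bit WAV file of silence for demonstration purposes."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)


demo_path = os.path.join(tempfile.gettempdir(), "silence_demo.wav")
write_silence(demo_path, seconds=2)
print(wav_duration_seconds(demo_path))
```

Summing the durations of the files you plan to transcribe gives a rough estimate of the minutes you will use.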

3. CMUSPHINX

Another alternative to the Google APIs is CMUSphinx. Its main selling point is that it works offline, but it is less accurate than the Google Cloud Speech API.

If you are using CMUSphinx, you need to install the following packages first; otherwise the installation fails with a wheel-building error caused by the missing swig tool. To avoid this, use the command below.

sudo apt-get install -y python3 python3-dev python3-pip build-essential swig git libpulse-dev

Then install pocketsphinx:

pip3 install pocketsphinx

pocketsphinx should now install successfully.

Below is the code snippet for Speech to text using PocketSphinx with input of audio by Microphone:

import speech_recognition as sr

r = sr.Recognizer()
speech = sr.Microphone(device_index=0)
# for speech recognition
with speech as source:
    print("say something!…")
    r.adjust_for_ambient_noise(source)  # calibrate for background noise; returns None
    audio = r.listen(source)
# recognize speech using Sphinx
try:
    recog = r.recognize_sphinx(audio)  
    print("Sphinx thinks you said '" + recog + "'")  
except sr.UnknownValueError:  
    print("Sphinx could not understand audio")  
except sr.RequestError as e:  
    print("Sphinx error; {0}".format(e))

Below is the code snippet for Speech to text using PocketSphinx to transcribe an audio file:

import speech_recognition as sr
r = sr.Recognizer()
file = sr.AudioFile('FILE_NAME.wav')
# for audio transcription
with file as source:
    audio = r.record(source)
# recognize audio using Sphinx
try:
    recog = r.recognize_sphinx(audio)  
    print("Sphinx thinks you said '" + recog + "'")  
except sr.UnknownValueError:  
    print("Sphinx could not understand audio")  
except sr.RequestError as e:  
    print("Sphinx error; {0}".format(e))

The major drawback of PocketSphinx is its lower accuracy.

4. IBM Watson Speech to Text

The IBM Watson Speech to Text API is another major speech recognition engine that can be incorporated into an application that requires speech recognition or audio transcription.

To begin with IBM’s API, you first need an IBM Cloud account. Once you have created your account, follow these steps.

  1. From the top left navigation menu on the dashboard, go to Resources list.
  2. Then, click on Create Resource.
  3. Then, in the categories section, select AI and select Speech to Text. Create your service without changing anything.
  4. After your service is created, click on the service.
  5. Select Manage from the navigation menu and click on Show Credentials.
  6. Copy the API key and URL generated.

Run the following command in the terminal.

pip3 install ibm-watson

Below is the code snippet for Speech to text using IBM Speech To Text service with input of audio by Microphone:

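import json

import speech_recognition as sr
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import SpeechToTextV1

r = sr.Recognizer()
speech = sr.Microphone()

# ibm-watson 4.0 and later authenticate through an authenticator object;
# older releases accepted iam_apikey and url directly in the constructor.
authenticator = IAMAuthenticator("YOUR_API_KEY")
speech_to_text = SpeechToTextV1(authenticator=authenticator)
speech_to_text.set_service_url("YOUR_URL")

with speech as source:
    print("say something!!…")
    r.adjust_for_ambient_noise(source)  # calibrate for background noise; returns None
    audio_file = r.listen(source)

# send the captured audio to IBM as WAV data
speech_recognition_results = speech_to_text.recognize(
    audio=audio_file.get_wav_data(),
    content_type='audio/wav'
).get_result()
print(json.dumps(speech_recognition_results, indent=2))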

Replace YOUR_API_KEY and YOUR_URL with the API key and URL you copied from your service.

Below is the code snippet for Speech to text using IBM Speech to Text service to transcribe audio file:

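import json

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import SpeechToTextV1

# ibm-watson 4.0 and later authenticate through an authenticator object;
# older releases accepted iam_apikey and url directly in the constructor.
authenticator = IAMAuthenticator("YOUR_API_KEY")
speech_to_text = SpeechToTextV1(authenticator=authenticator)
speech_to_text.set_service_url("YOUR_URL")

# send the WAV file to the service and print the JSON response
with open('FILE_NAME.wav', 'rb') as audio_file:
    speech_recognition_results = speech_to_text.recognize(
        audio=audio_file,
        content_type='audio/wav'
    ).get_result()
print(json.dumps(speech_recognition_results, indent=2))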
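
Unlike the SpeechRecognition methods, the IBM service returns a JSON structure rather than a plain string. The sketch below shows one way to pull the text out of it, assuming the usual layout in which each entry of results holds a list of alternatives with a transcript. `extract_transcript` is a hypothetical helper, and the sample dictionary is hand-written for illustration, not real service output.

```python
def extract_transcript(response):
    """Join the first alternative of every result into one string."""
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            pieces.append(alternatives[0].get("transcript", "").strip())
    return " ".join(piece for piece in pieces if piece)


# Hand-written sample response (an assumption for this sketch).
sample = {
    "results": [
        {"alternatives": [{"transcript": "hello world ", "confidence": 0.94}], "final": True},
        {"alternatives": [{"transcript": "how are you ", "confidence": 0.91}], "final": True},
    ],
    "result_index": 0,
}
print(extract_transcript(sample))
```

In the snippets above, you would call extract_transcript(speech_recognition_results) instead of printing the whole JSON.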

5. WIT.AI

Wit.ai is a natural language interface for applications capable of turning sentences into structured data. It is also quite accurate for speech recognition and audio transcription.

Follow the below steps to use wit.ai:

  1. Open wit.ai in your browser.
  2. Log in using your GitHub account. If you don’t have one, create one.
  3. Click on the MyFirstApp app, then go to Settings.
  4. There you will find your Server Access Token or Client Access Token. Copy it, as it will be required for authentication in the Python script.
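
Rather than pasting the token directly into your script, you may prefer to read it from an environment variable so that it does not end up in version control. This is a minimal sketch; the variable name WIT_AI_KEY and the helper `get_wit_key` are assumptions made for this example, and the demo token is set only so the snippet runs on its own.

```python
import os


def get_wit_key(variable="WIT_AI_KEY"):
    """Read the wit.ai access token from an environment variable."""
    key = os.environ.get(variable, "").strip()
    if not key:
        raise RuntimeError(
            "Set the {0} environment variable to your wit.ai token".format(variable)
        )
    return key


# Demo value for illustration only; normally you would export the variable
# in your shell before starting the script.
os.environ.setdefault("WIT_AI_KEY", "demo-token-for-illustration")
print(len(get_wit_key()) > 0)
```

The result can then be passed as the key argument, for example r.recognize_wit(audio, key=get_wit_key()).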

Run the following command to install the wit package.

pip3 install wit

wit should now be installed successfully.

Below is the code snippet for Speech to text using WIT.AI with input of audio by Microphone:

import speech_recognition as sr
r = sr.Recognizer()
speech = sr.Microphone()
with speech as source:
    print("say something!!….")
    r.adjust_for_ambient_noise(source)  # calibrate for background noise; returns None
    audio = r.listen(source)
try:
    recog = r.recognize_wit(audio, key = "your key")
    print("You said: " + recog)
except sr.UnknownValueError:
    print("could not understand audio")
except sr.RequestError as e:
    print("Could not request results ; {0}".format(e))

Below is the code snippet for Speech to text using WIT.AI to transcribe audio file:

import speech_recognition as sr
r = sr.Recognizer()
file = sr.AudioFile("FILE_NAME.wav")
with file as source:
    audio = r.record(source)
try:
    recog = r.recognize_wit(audio, key = "your key")
    print("You said: " + recog)
except sr.UnknownValueError:
    print("could not understand audio")
except sr.RequestError as e:
    print("Could not request results ; {0}".format(e))

Conclusion: This is a pretty basic level of speech recognition, far from production ready. We created this tutorial to get you started with Speech Recognition in Python. Many people find it daunting at first and drop it altogether. But as you can see, it’s not that difficult.

Using this basic knowledge, we can now think of better ways to make it production ready and use it in real-life applications. Stay tuned for more tutorials showing how we used this speech recognition in actual applications.

