Using Text To Speech and Speech Recognition in Windows Phone 8

With Windows Phone 8, Microsoft added APIs for speech recognition and speech synthesis (TTS).
The combination of these APIs allows you to create conversations with the user, asking them for input with TTS and listening for their replies.

To use this feature you need to add the speech recognition, microphone and networking capabilities to your app (in the WMAppManifest.xml file).
(Like most speech recognition solutions, the actual processing is done on the server side: the phone just streams the audio to the server and gets the text result back. That’s why networking must be enabled for the app.)
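
As a rough sketch, the Capabilities section of WMAppManifest.xml would contain entries along these lines. I'm assuming the standard Windows Phone 8 capability names for speech, microphone and network access here; the rest of the manifest is omitted, and any capabilities your app already declares stay as they are.

```xml
<Capabilities>
  <!-- speech recognition and synthesis APIs -->
  <Capability Name="ID_CAP_SPEECH_RECOGNITION" />
  <!-- microphone access for capturing the user's voice -->
  <Capability Name="ID_CAP_MICROPHONE" />
  <!-- network access, since audio is streamed to the recognition service -->
  <Capability Name="ID_CAP_NETWORKING" />
</Capabilities>
```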

Recognizing Speech

Here’s a very simple piece of code that will show the speech recognition UI to the user:

private SpeechRecognizerUI _recoWithUI;

private async void SimpleRecognition()
{
    //initialize the recognizer
    _recoWithUI = new SpeechRecognizerUI();

    //show the recognizer UI (and prompt the user for speech input)
    var recoResult = await _recoWithUI.RecognizeWithUIAsync();
}

And here’s what a result looks like when using the default grammar:

ResultStatus: Succeeded
RecognitionResult: {
    RuleName: ""
    Semantics: null
    Text: "Recognize this text."
    TextConfidence: High
    Details: {
        ConfidenceScore: 0.8237646
        RuleStack: COM Object
    }
}

(This is the object hierarchy; I’m showing it as a JSON-like object for clarity.)

TextConfidence can have the following values: Rejected, Low, Medium, and High. You use this value to figure out how close the returned text is to what the user actually said. If you want the actual score, you can use ConfidenceScore.

Text is the recognized text. It will be empty if the TextConfidence is Rejected.

RuleName is the name of the custom grammar used for this recognition (since I didn’t use one, the value is empty here).

Semantics are related to SRGS grammars (an XML file that defines a more complex grammar you want to use). I will not get into this more advanced option in this post. 
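
To make these fields a bit more concrete, here is a minimal sketch of how a result could be checked before using it. The class and method names (RecognitionResultHelper, GetTrustedText) and the decision to accept only Medium and High confidence are my own assumptions for illustration, not something the API prescribes.

```csharp
using System.Diagnostics;
using Windows.Phone.Speech.Recognition;

public static class RecognitionResultHelper
{
    // Hypothetical helper: returns the recognized text, or null when the
    // result should not be trusted.
    public static string GetTrustedText(SpeechRecognitionUIResult recoResult)
    {
        // Anything other than Succeeded (e.g. the user cancelled) has no usable text
        if (recoResult.ResultStatus != SpeechRecognitionUIStatus.Succeeded)
        {
            return null;
        }

        SpeechRecognitionResult result = recoResult.RecognitionResult;

        // Log the confidence bucket next to the raw score
        Debug.WriteLine(string.Format("{0} ({1})",
            result.TextConfidence, result.Details.ConfidenceScore));

        switch (result.TextConfidence)
        {
            case SpeechRecognitionConfidence.High:
            case SpeechRecognitionConfidence.Medium:
                return result.Text;
            default:
                // Low or Rejected: treated as "not understood" in this sketch
                return null;
        }
    }
}
```

You would call it right after RecognizeWithUIAsync, for example: var text = RecognitionResultHelper.GetTrustedText(recoResult);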

Prompting the user and adding a custom grammar

The longer the user’s input is, the harder it is to get accurate recognition. It makes sense to design your application so that the user only needs to give you short answers.

In the following code snippet I’m showing the user a text prompt asking a specific question and I’m showing the possible answers.
I’m also adding the possible answers as a programmatic list grammar. This is the simplest way of adding a grammar, and it increases accuracy by forcing the API to match only the words I added to the grammar.
This means that if the user said Lace or Ace it will still be recognized as Race.

I’m also disabling the Readout. If you’re creating a speech conversation with your user, the readout might get tedious and slow the user down. Repeating every recognition with “Heard you say:” and the recognized text takes too long and will get annoying. You might want to enable it in certain situations, but I would leave it off by default.
You can also hide the confirmation prompt which displays the same text (Heard you say…) by setting Settings.ShowConfirmation to false.

private SpeechRecognizerUI _recoWithUI;

private async void SpeechRecognition()
{
    //initialize the recognizer
    _recoWithUI = new SpeechRecognizerUI();

    //restrict recognition to the possible answers
    var huntTypes = new[]{"Race", "Explore"};
    _recoWithUI.Recognizer.Grammars.AddGrammarFromList("huntTypes", huntTypes);

    //prompt the user
    _recoWithUI.Settings.ListenText = "What type of hunt do you want to play?";

    //show the possible answers in the example text
    _recoWithUI.Settings.ExampleText = @"Say 'Race' or 'Explore'";

    //disable the readout of recognized text
    _recoWithUI.Settings.ReadoutEnabled = false;

    //show the recognizer UI (and prompt the user for speech input)
    var recoResult = await _recoWithUI.RecognizeWithUIAsync();
}


Adding a spoken prompt

To create a conversation, you might want to add a spoken prompt before asking the user for their response.
You can do that by simply adding the following two lines before the call to RecognizeWithUIAsync:

var synth = new SpeechSynthesizer();
await synth.SpeakTextAsync("What type of hunt do you want to play?");

Notice that I’m waiting for the text to be spoken before showing the recognizer UI. This means the current screen will still be visible while the text is spoken and as soon as it ends the recognizer UI will show up and the prompt sound will be played. If you don’t include the await on that line, the text will be playing while the recognizer is already listening.

A better solution would’ve been to include a TTS option for the prompt in the recognizer. I couldn’t find such an option.
Another way to solve this is to build your own UI and call RecognizeAsync on a SpeechRecognizer.

Here’s a quick code sample:

private async void RecognizeWithNoUI()
{
    var recognizer = new SpeechRecognizer();
    var huntTypes = new[]{"Race", "Explore"};
    recognizer.Grammars.AddGrammarFromList("huntTypes", huntTypes);

    //speak the prompt and wait for it to finish before listening
    var synth = new SpeechSynthesizer();
    await synth.SpeakTextAsync("Do you want to play a Race, or, Explore, hunt?");

    var result = await recognizer.RecognizeAsync();
}

The main difference is that there’s no automatic handling of failed recognitions. The RecognizeWithUIAsync call tells the user “Sorry, didn’t get that” and asks them to speak again. With the no-UI option, you need to handle that yourself using the TextConfidence value.
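
Here is one way that manual handling could look, as a sketch rather than a definitive implementation. The class name HuntTypePrompt, the limit of three attempts, the retry wording and the choice to accept only Medium and High confidence are all assumptions I made for the example.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Phone.Speech.Recognition;
using Windows.Phone.Speech.Synthesis;

public class HuntTypePrompt
{
    // Assumption for this sketch: give up after a few attempts
    private const int MaxAttempts = 3;

    // Returns "Race" or "Explore", or null if nothing usable was heard.
    public async Task<string> AskForHuntTypeAsync()
    {
        var recognizer = new SpeechRecognizer();
        recognizer.Grammars.AddGrammarFromList("huntTypes", new[] { "Race", "Explore" });

        var synth = new SpeechSynthesizer();
        await synth.SpeakTextAsync("Do you want to play a Race, or, Explore, hunt?");

        for (int attempt = 1; attempt <= MaxAttempts; attempt++)
        {
            var result = await recognizer.RecognizeAsync();

            // Accept the answer only when the recognizer is reasonably sure
            if (result.TextConfidence == SpeechRecognitionConfidence.High ||
                result.TextConfidence == SpeechRecognitionConfidence.Medium)
            {
                return result.Text;
            }

            // Low or Rejected: ask again, roughly like the built-in UI does,
            // unless this was the last attempt
            if (attempt < MaxAttempts)
            {
                await synth.SpeakTextAsync("Sorry, didn't get that. Say Race, or, Explore.");
            }
        }

        // Nothing usable after MaxAttempts; the caller can fall back to touch input
        return null;
    }
}
```

Returning null after the last attempt leaves it to the calling page to decide what happens next, for example showing two regular buttons instead.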


As you can see, it’s very easy and straightforward to add speech recognition and synthesis to your app. Combined with Voice Commands, you can create an experience that lets the user launch and control your app without touching their phone. If you’re using voice commands, you can start this experience only on the page the command targets: when the user launches the app with a voice command, you prompt them with TTS and get their replies through speech, and when they launch the app from the apps list or a tile, it shows a normal user interface.
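
A rough sketch of that last idea: the target page checks how it was launched and starts the spoken conversation only for a voice command launch. The page name HuntPage and the method StartSpeechConversation are hypothetical, and I'm assuming the app has already registered its voice commands and that the launch URI carries the "voiceCommandName" query string key.

```csharp
using System.Windows.Navigation;
using Microsoft.Phone.Controls;

// Hypothetical target page of a voice command (code-behind of HuntPage.xaml)
public partial class HuntPage : PhoneApplicationPage
{
    protected override void OnNavigatedTo(NavigationEventArgs e)
    {
        base.OnNavigatedTo(e);

        // Only react to a fresh launch, not when the user navigates back to the page
        if (e.NavigationMode != NavigationMode.New)
        {
            return;
        }

        // Assumption: the key is present only when a voice command launched the page
        if (NavigationContext.QueryString.ContainsKey("voiceCommandName"))
        {
            StartSpeechConversation();
        }
        // Otherwise the page just shows its normal touch UI
    }

    // Hypothetical: prompt with TTS and listen for the reply here
    private void StartSpeechConversation()
    {
    }
}
```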