Using Text To Speech and Speech Recognition in Windows Phone 8

With Windows Phone 8, Microsoft added APIs for speech recognition and synthesis (TTS). Combining these APIs lets you create conversations with the user: prompt them for input with TTS and listen for their replies. To use this feature you need to add the following capabilities to your app (in the WMAppManifest.xml file):

ID_CAP_SPEECH_RECOGNITION
ID_CAP_MICROPHONE
ID_CAP_NETWORKING

(Like most speech recognition solutions, the actual processing is done on the server side; the phone just streams the audio to the server and gets the text result back. That's why networking must be enabled for the app.)

Recognizing Speech

Here's a very simple piece of code that shows the speech recognition UI to the user:

private SpeechRecognizerUI _recoWithUI;

private async void SimpleRecognition()
{
    //initialize the recognizer
    _recoWithUI = new SpeechRecognizerUI();

    //show the recognizer UI (and prompt the user for speech input)
    var recoResult = await _recoWithUI.RecognizeWithUIAsync();
}

And here's what a result looks like when using the default grammar:

ResultStatus: Succeeded
RecognitionResult: {
    RuleName: ""
    Semantics: null
    Text: "Recognize this text."
    TextConfidence: High
    Details: {
        ConfidenceScore: 0.8237646
        RuleStack: COM Object
    }
}

(This is the object hierarchy; I'm showing it as a JSON object for clarity.)

TextConfidence can have the following values: Rejected, Low, Medium, and High. You use this value to figure out how close the returned text is to what the user actually said. If you want the actual score, you can use ConfidenceScore. Text is the recognized text; it will be empty if TextConfidence is Rejected. RuleName is the name of the custom grammar used for this recognition (since I didn't use one, the value is empty here). Semantics are related to SRGS grammars (an XML file that defines a more complex grammar you want to use). I will not get into this more advanced option in this post.

Prompting the user and adding a custom grammar

The longer the user's input, the harder it is to get accurate recognition, so it makes sense to design your application in such a way that the user only needs to give you short answers. In the following code snippet I'm showing the user a text prompt asking a specific question, and I'm showing the possible answers. I'm also adding the possible answers as a programmatic list grammar. This is the simplest way of adding a grammar; it increases accuracy and forces the API to match only the words I added to the grammar. This means that if the user said "Lace" or "Ace", it will still be recognized as "Race".

I'm also disabling the readout. If you're creating a speech conversation with your user, the readout can get tedious and slow the user down; repeating every recognition with "Heard you say:" and the recognized text takes too long and will get annoying. You might want to enable it in certain situations, but I would leave it off by default. You can also hide the confirmation prompt, which displays the same text ("Heard you say..."), by setting Settings.ShowConfirmation to false.

private SpeechRecognizerUI _recoWithUI;

private async void SpeechRecognition()
{
    //initialize the recognizer
    _recoWithUI = new SpeechRecognizerUI();

    var huntTypes = new[] { "Race", "Explore" };
    _recoWithUI.Recognizer.Grammars.AddGrammarFromList("huntTypes", huntTypes);

    //prompt the user
    _recoWithUI.Settings.ListenText = "What type of hunt do you want to play?";

    //show the possible answers in the example text
    _recoWithUI.Settings.ExampleText = @"Say 'Race' or 'Explore'";

    //disable the readout of recognized text
    _recoWithUI.Settings.ReadoutEnabled = false;

    //show the recognizer UI (and prompt the user for speech input)
    var recoResult = await _recoWithUI.RecognizeWithUIAsync();
}
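Once you have a result, you'll usually want to gate on TextConfidence before acting on it. Here's a minimal sketch of that check (the decision to only reject on Rejected is my choice; you may want a stricter threshold):

```csharp
var recoResult = await _recoWithUI.RecognizeWithUIAsync();

//only act on the result if the dialog succeeded and the recognition wasn't rejected
if (recoResult.ResultStatus == SpeechRecognitionUIStatus.Succeeded &&
    recoResult.RecognitionResult.TextConfidence != SpeechRecognitionConfidence.Rejected)
{
    var text = recoResult.RecognitionResult.Text; //e.g. "Race" or "Explore"
}
```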

Adding a spoken prompt

To create a conversation, you might want to add spoken text before prompting the user for their response. You can do that by simply adding the following two lines before the call to RecognizeWithUIAsync:

var synth = new SpeechSynthesizer();
await synth.SpeakTextAsync("What type of hunt do you want to play?");

Notice that I'm awaiting the text being spoken before showing the recognizer UI. This means the current screen will still be visible while the text is spoken, and as soon as it ends the recognizer UI will show up and the prompt sound will be played. If you don't include the await on that line, the text will still be playing while the recognizer is already listening.

A better solution would be a built-in TTS option for the recognizer's prompt, but I couldn't find one. Another way to solve this is to create your own UI and use SpeechRecognizer and RecognizeAsync. Here's a quick code sample:

private async void RecognizeWithNoUI()
{
    var recognizer = new SpeechRecognizer();

    var huntTypes = new[] { "Race", "Explore" };
    recognizer.Grammars.AddGrammarFromList("huntTypes", huntTypes);

    var synth = new SpeechSynthesizer();
    await synth.SpeakTextAsync("Do you want to play a Race, or, Explore, hunt?");

    var result = await recognizer.RecognizeAsync();
}
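Since RecognizeAsync gives you no built-in retry, one way to handle a failed recognition yourself is to check for Rejected and re-prompt. A minimal sketch, continuing the method above (the retry wording is mine):

```csharp
var result = await recognizer.RecognizeAsync();

if (result.TextConfidence == SpeechRecognitionConfidence.Rejected)
{
    //no automatic "Sorry, didn't get that" in the no-UI path, so say it ourselves
    await synth.SpeakTextAsync("Sorry, didn't get that. Race, or Explore?");
    result = await recognizer.RecognizeAsync();
}
```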

The main difference is that there's no automatic handling of failed recognitions. The RecognizeWithUIAsync call tells the user "Sorry, didn't get that" and asks them to speak again; with the no-UI option, you need to handle that yourself using the TextConfidence value.

As you can see, it's very easy and straightforward to add speech recognition and synthesis to your app. Combined with Voice Commands, you can create an experience that lets the user launch and control your app without touching their phone. If you're using voice commands, you can start this experience on the target page only: when the user launches the app with a voice command, you prompt them with TTS and get their replies with speech, and when they launch the app from the apps list or a tile, it shows a normal user interface.

Voice Commands in Windows Phone 8 - Updating Phrase Lists at Run Time

In the previous post, I created a simple voice commands file and registered it with the WP OS. My command was "Find {huntTypes} Hunts", where huntTypes is a phrase list that was hardcoded in the VCD file with the values "New", "Nearby", and "My". This means the user can say "Play The Hunt, Find New Hunts". I don't really want to maintain this phrase list in the VCD file; instead, I want to call a web service, get a list of types that my backend search can handle, and update the phrase list at run time.

Updating the phrase list is very simple: get the command set that includes the phrase list you want to update and call UpdatePhraseListAsync.

private async void AddToPhraseList()
{
    var vcs = VoiceCommandService.InstalledCommandSets["USEnglish"];
    await vcs.UpdatePhraseListAsync("huntTypes",
        new[] { "New", "Nearby", "My", "San Francisco", "New York" });
}

Now, when the user says "Play The Hunt, Find San Francisco Hunts", the query string will include the added phrase list item:

[0]: {[voiceCommandName, findHunts]}
[1]: {[reco, Play The Hunt Find San Francisco Hunts]}
[2]: {[huntTypes, San Francisco]}

Of course, in the real app I would get the list of phrases from some other source, instead of hardcoding it like I did in this sample.
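As a sketch of what that might look like, here's a version that downloads the phrases from a hypothetical backend endpoint (the URL and the one-phrase-per-line format are made up for illustration). Note that UpdatePhraseListAsync replaces the entire phrase list, so the response should contain the full set of phrases:

```csharp
private void UpdatePhraseListFromServer()
{
    var client = new WebClient();
    client.DownloadStringCompleted += async (s, e) =>
    {
        if (e.Error != null) return;

        //hypothetical format: one phrase per line
        var phrases = e.Result.Split(new[] { '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        var vcs = VoiceCommandService.InstalledCommandSets["USEnglish"];
        await vcs.UpdatePhraseListAsync("huntTypes", phrases);
    };
    client.DownloadStringAsync(new Uri("http://example.com/hunt-types"));
}
```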

Integrating your app with the Windows Phone 8 Voice Commands

One of the new APIs in Windows Phone 8 is the Voice Commands API. This API allows you to integrate your own app with the main voice command functionality, so that when the phone's Start button is held and the voice prompt comes up, your app can be launched from it.

How it works

To add support for voice commands in your app, you add a Voice Command Definition file to your project and register that file with the OS the first time your app launches. Once you do that, your app can be launched with the commands you defined. When a command is recognized, the target page associated with that command is opened directly (just like search extensions work in WP7).

To use this feature you need to add the following capabilities to your app (in the WMAppManifest.xml file):

ID_CAP_SPEECH_RECOGNITION
ID_CAP_MICROPHONE
ID_CAP_NETWORKING

(Like most speech recognition solutions, the actual processing is done on the server side; the phone just streams the audio to the server and gets the text result back. That's why networking must be enabled for the app.)

The Voice Command Definition File

The Voice Command Definition file, or VCD file, defines the commands your app will support. You can add a new VCD file from the Add New Item menu. Here's the VCD file I'm using in this sample:

<?xml version="1.0" encoding="utf-8"?>
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.0">
  <CommandSet xml:lang="en-us" Name="USEnglish">
    <CommandPrefix> Play The Hunt </CommandPrefix>
    <Example> Find Nearby Hunts </Example>

    <Command Name="findHunts">
      <Example> Find New Hunts </Example>
      <ListenFor> Find {huntTypes} [Hunts] </ListenFor>
      <Feedback> Searching For Hunts </Feedback>
      <Navigate Target="/findhunts.xaml"/>
    </Command>

    <PhraseList Label="huntTypes">
      <Item> New </Item>
      <Item> Nearby </Item>
      <Item> My </Item>
    </PhraseList>
  </CommandSet>
</VoiceCommands>

In this VCD I'm defining a single command set for English; you can define additional command sets for other languages as well. Let's dig into this simple definition.

Each command set can have a prefix. When a user holds the Start button and wants to launch a command in your app, they start the command with your prefix. The prefix for the command set is optional; if it's omitted, your app name is used as the prefix. You might want to use a prefix if your app name is too long or hard to pronounce. You can also use it if you have multiple command sets for different languages. Tip: if your app name is hard to pronounce, try using a prefix that phonetically corresponds to the app's name.

The example text will show up in the initial dialog (not always; from my tests, it seems to show up only after the user has already used your voice commands).

The next part of the XML file is a Command. You can have multiple commands; in this sample I only have one, called findHunts. The example text here is shown when the command wasn't recognized (the user has to tap the "Tap Here" link to see it).

Now we get to the most important parts of the command: the ListenFor and Navigate fields. ListenFor is the text you're expecting for the command. As you can see, I'm using brackets to signify special meaning. The square brackets mark optional words (so in my example the user can say "Find New Hunts" or just "Find New"). The curly braces mark a phrase list. In this case the list is included in the XML file, but you can also modify these lists at run time; if, for example, you want to use a list of products, you can get them from a web service and add them to the phrase list. Feedback is the text that will be displayed and spoken (using TTS) when this command is recognized.

The Navigate field defines the page you want opened, with the recognized parameters passed in its query string.

Register the VCD file

You're supposed to register the VCD file once, on your app's first launch. Here's a piece of code that does that:

private async void RegisterVoiceCommands()
{
    await VoiceCommandService.InstallCommandSetsFromFileAsync(
        new Uri("ms-appx:///VoiceCommands.xml", UriKind.RelativeOrAbsolute));
}

You can also check InstalledCommandSets to see if your commands were already registered.

Handle the command in the target page

When a voice command for your app is recognized, the target page is launched and the URI for that page includes the recognized parameters. The sample from Microsoft uses OnNavigatedTo to extract the query parameters from the NavigationContext. If the page was opened from a voice command, a parameter named voiceCommandName will be in the query string; testing for its existence is a good way to check whether you got voice recognition results. Here are the query string parameters you get when the user says "Play The Hunt, Find New Hunts":

[0]: {[voiceCommandName, findHunts]}
[1]: {[reco, open Play The Hunt Find New Hunts]}
[2]: {[huntTypes, New]}

If you have more than one command with the same target page, you can check voiceCommandName to figure out which command the user launched.

The reco parameter contains the whole recognized text, but it can only contain text that was an expected command; you can't get free-form text spoken by the user. This is a big limitation: it means you have to launch the app with a command first and then use TTS and speech recognition to carry on the conversation with your user. (I would prefer being able to say "AppName, Remind me to call Someone tomorrow at 10 AM"; instead it will have to be "AppName, Add Call Reminder" or "Remind Me To Call", with the name and time asked for afterwards.) This is the same way hands-free texting works in WP7, which works great but feels too verbose sometimes.

The huntTypes parameter contains the recognized word from my phrase list. I haven't tried multiple phrase lists in one command; if that works, you'll probably get more than one entry here. This means you don't have to check the reco string at all; you can just use the phrase list values to figure out what to do.

At this point, you can execute whatever code you need based on the command. I'm using Caliburn Micro for my projects, so I won't be overriding OnNavigatedTo in real projects. Instead, I'll just add these three query string parameters as properties on my view model, and Caliburn Micro will automagically parse the query string. If you're not using Caliburn Micro, you're writing more code than you have to...
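For completeness, here's a minimal OnNavigatedTo sketch for the target page, assuming the findHunts command from the VCD above (error and fallback handling omitted):

```csharp
protected override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);

    string commandName;
    if (NavigationContext.QueryString.TryGetValue("voiceCommandName", out commandName))
    {
        //the page was opened by a voice command
        string huntType;
        if (NavigationContext.QueryString.TryGetValue("huntTypes", out huntType))
        {
            //huntType is the matched phrase list item, e.g. "New" or "Nearby"
        }
    }
}
```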