Speech recognition for industrial applications

ARM Cortex M-based MCUs take industrial applications hands-free.

Takeaways

  • ARM® Cortex® M-based MCUs can recognize more than 100 multi-word commands.
  • Dual-chip architectures make it easy to add speech-recognition options to existing products.
  • Speech-enabled MCUs simplify adding voice capabilities, even for the uninitiated.

 

When it comes to industrial applications, resources are limited, time equals money, and errors can be costly. Finding economical, reliable methods to streamline employee tasks gives companies an opportunity to gain a competitive advantage. One option is to use speech recognition as an interface tool. The approach is increasingly common in consumer electronics and is now moving into the industrial sphere.

 

Utilities companies, for example, need to send inspectors to review an area prior to starting an excavation. These inspections can be laborious and detailed, and are further complicated by the need to enter data repeatedly into the computer while canvassing an area on foot. Increasingly, companies are building electronics into a vest worn by the employees. The addition of a speech-recognition engine would allow inspectors to log their data by just reciting a command or two, while minimizing equipment and keeping their hands free for tools, etc. It’s a promising solution, at least as long as the cost is acceptable.

 

Speech recognition tends to be associated with sophisticated processors and large amounts of memory. Today, ARM Cortex M-based MCUs running optimized software present an effective, economical alternative. It is possible to build a state-of-the-art automated speech recognition (ASR) system capable of recognizing a few hundred commands using an ARM Cortex M-based MCU with embedded flash memory and SRAM. These types of low-cost, low-power MCUs can be integrated in systems to simplify reporting, diagnostic, and even command functions.

 

After cost, the next factor to consider is performance, or more specifically speed and accuracy. In speech recognition the two are related: a longer processing time delivers better accuracy. Performance is also affected by MCU speed and the number of commands. An ARM® Cortex®-M4 processor running at 160 MHz is capable of recognizing a few hundred commands in real time. For a wearable application, a Cortex-M3 processor can handle around 40 commands, which typically satisfies that more modest use case. Accuracy is harder to quantify because it also depends on the background environment, the complexity of the recognition grammar, and the speakers themselves. For native (unaccented) US English speakers using a close-talking microphone, these MCU-based systems can achieve a sentence error rate of less than 5%.

 

The term speech recognition brings to mind sophisticated systems like Dragon NaturallySpeaking or Siri on an iPhone. These systems can be highly functional, but they can also be unstable or produce results that are wildly off the mark. Neither shortcoming is acceptable for industrial applications, which demand high reliability and are often safety critical.

 

Given the above, it can be hard to see how a speech-enabled MCU, or speech recognition at all, can deliver an effective industrial solution. Properly defining the problem is an important first step. Natural-language systems attempt to recognize spoken phrases drawn from a broad vocabulary, which is an intimidating task: by some estimates, the English language alone contains more than 1 million words. In contrast, an MCU-based system is designed to recognize around 100 multi-word commands, which it can do in real time and with good accuracy. With such a small subset, it is possible to perform a complete verification of the command list to ensure that it is robust. That is a much simpler task than the one facing natural-language systems, making it more than feasible for an MCU.

 

Hardware options

The system can be implemented in one of two ways: with a single MCU that supports both the speech recognition and the user application (see figure 1), or with two MCUs, one dedicated to the application and the other dedicated to speech recognition (see figure 2).

Figure 1: In a single-chip design, the MCU leverages a codec and additional ASR software to perform speech recognition in addition to the basic application.


The single MCU minimizes cost and footprint and can be a good fit for certain applications. The problem is that speech recognition consumes resources. Running the host application simultaneously with the voice recognition software can require multithreading, or rewriting the host application software to restrict calling the speech APIs to specified times. In either case, the integration process becomes more difficult, and the system may not have sufficient MIPS or memory to handle both speech and application.

 

In the dual-chip architecture, user applications run on a host processor and a separate speech-enabled MCU performs voice recognition. Communication between the host processor and the speech-enabled MCU takes place over an SPI link (see figure 2). This architecture provides modularity. With minimal effort, a system can be configured to run with or without a speech recognition interface. The approach also eliminates resource contention between the host application and the speech software, because the latter runs on a dedicated MCU. This flexibility allows the system cost to be optimized for both configurations.
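One way to realize that configurability on the host side is a simple build-time switch, as the minimal sketch below illustrates. The ENABLE_SPEECH_UI flag and the speech_client_init() call are hypothetical names used for illustration, not part of any specific SDK.

```c
/* Hypothetical build-time switch: define ENABLE_SPEECH_UI to build the
 * product variant that includes the speech-enabled MCU. */
#ifdef ENABLE_SPEECH_UI
extern void speech_client_init(void);   /* hypothetical client-module call */
#endif

void ui_init(void)
{
#ifdef ENABLE_SPEECH_UI
    /* Bring up the SPI link and the speech client module. */
    speech_client_init();
#endif
    /* Buttons, display, and other conventional controls initialize as usual. */
}
```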

 

Figure 2: A dual-chip architecture relegates all speech recognition operations to a dedicated MCU. This modular approach allows speech capability to be easily added and removed from a design platform.


In the dual-chip design, all speech-recognition tasks are segregated from the application. There is no contention over processing or memory, which improves performance and simplifies integration. This can be a particularly good approach for adding speech-recognition capabilities to an existing design, or for developing a product that lets the customer choose a version with or without speech capabilities. Integration is as easy as adding a set of APIs to the host software that allow it to communicate with the speech MCU. On the downside, the use of two MCUs increases cost, footprint, and power consumption.
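In practice, the host-side integration might reduce to a handful of calls along the lines of the sketch below. The function names and signatures are assumptions for illustration; the actual API would come from the speech MCU vendor's SDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side client API; a real SDK would supply equivalents.
 * Each call is translated into an SPI message by the speech client module. */
extern bool    speech_client_init(void);                           /* open SPI link, reset speech MCU */
extern bool    speech_client_start_listening(void);                /* begin a recognition session     */
extern int32_t speech_client_get_result(char *buf, uint32_t len);  /* returns command ID, or -1       */

extern void handle_command(const char *cmd);   /* existing application logic */

/* Example: let an inspector log an item by voice instead of typing it in. */
void log_inspection_item(void)
{
    char cmd[64];

    if (!speech_client_start_listening())
        return;

    /* Poll (or block) until the speech MCU reports a match. */
    if (speech_client_get_result(cmd, sizeof cmd) >= 0)
        handle_command(cmd);
}
```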

 

The power issue can be mitigated by using the power-management options provided by ARM Cortex M-based processors. In an ARM Cortex-M4 system, the whole chipset draws only 100 mA in active mode. With the dual-chip design, the speech-enabled MCU and its associated systems can be powered down when not in use, driving current draw below 1 mA. A user-controlled switch could bring the speech engine live in a matter of seconds.
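On the host side, that duty cycling could look something like the sketch below, where the speech MCU is held in its low-power state until the operator presses a push-to-talk switch. The GPIO and client calls are assumed names rather than a specific HAL or SDK.

```c
#include <stdbool.h>

/* Hypothetical HAL and client calls; real names depend on the chosen SDK. */
extern bool gpio_read_push_to_talk(void);   /* user-controlled switch           */
extern void speech_client_wake(void);       /* restore power to the speech MCU  */
extern bool speech_client_ready(void);      /* engine booted and listening      */
extern void speech_client_sleep(void);      /* drop the speech MCU below 1 mA   */

void speech_power_task(void)
{
    static bool speech_on = false;
    bool ptt = gpio_read_push_to_talk();

    if (ptt && !speech_on) {
        speech_client_wake();
        while (!speech_client_ready())
            ;                               /* live within seconds; yield to the RTOS here */
        speech_on = true;
    } else if (!ptt && speech_on) {
        speech_client_sleep();
        speech_on = false;
    }
}
```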

 

 

Of course, not every user wants to or can press a button before they talk. Another option is to keep the speech-enabled MCU in listening mode, awaiting a wake-up command. During this time, it would operate at a slower speed. It would draw more power than the shutdown option, but less than full-speed operation.

 

Embedded designers can also optimize power consumption for the single-chip architecture. Obviously, the processor can’t be shut down entirely but it can run more slowly. For example, if the host processor only requires 40 MIPS, the processor could run at that speed to save power and then increase to 100 MIPS for speech recognition.
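A minimal sketch of that idea is shown below. The clock-management call is a hypothetical stand-in; on a real part this would reprogram the PLL or system clock divider, and Cortex-M cores deliver very roughly one MIPS per MHz, so the 40 and 100 MIPS figures map to clock rates of the same order.

```c
#include <stdint.h>

/* Hypothetical clock-management hook. */
extern void system_set_core_clock_hz(uint32_t hz);

#define CLOCK_APP_ONLY_HZ    40000000u  /* enough for the host application       */
#define CLOCK_WITH_ASR_HZ   100000000u  /* extra headroom for speech recognition */

void enter_recognition_mode(void)
{
    system_set_core_clock_hz(CLOCK_WITH_ASR_HZ);
}

void leave_recognition_mode(void)
{
    system_set_core_clock_hz(CLOCK_APP_ONLY_HZ);
}
```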

 

A final hardware issue to consider is how to confirm that the intended command has been received. Displaying the recognized command is one option, but displays consume space, power, and processing capability. If a display is not already part of the product, adding one increases the BOM and requires the user to perform a visual check. An alternative is for the system to verify recognition by repeating the command audibly, which requires minimal processing power and minimal effort from the end user. As with most things in engineering, the best choice comes down to determining the optimal fit for the constraints of the application.
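As a sketch of the audible-confirmation path, the fragment below assumes a hypothetical prompt-playback routine and a prerecorded prompt in flash for each command ID; none of these names come from a specific product.

```c
#include <stdint.h>

/* Hypothetical calls: fetch the latest recognition result and stream a
 * prerecorded confirmation phrase out through the codec. */
extern int32_t speech_client_get_result_id(void);
extern void    audio_play_prompt(int32_t command_id);
extern void    execute_command(int32_t command_id);   /* existing app logic */

void confirm_and_execute(void)
{
    int32_t id = speech_client_get_result_id();
    if (id < 0)
        return;                  /* nothing recognized yet */

    audio_play_prompt(id);       /* e.g. repeats back "mark gas line" */
    execute_command(id);
}
```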

 

Software design

Let’s take a closer look at the software side of speech recognition, beginning with the case of a single MCU running both the speech software and the user application (see figure 3). We start with a set of speech objects (acoustic model, dictionary, and grammar) generated from a set of user-defined commands. The speech-recognition software (ASR engine) applies these speech objects during the recognition process. The task of compiling the user-defined commands into the speech objects takes place offline.
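Because the compilation happens offline, the firmware only has to reference the generated objects. The sketch below assumes hypothetical symbol names and a hypothetical engine-init call; a real SDK would define its own equivalents.

```c
#include <stdbool.h>

/* Speech objects produced offline from the user-defined command list and
 * linked into flash as const data (hypothetical symbol names). */
extern const unsigned char asr_grammar[];          /* compiled grammar                */
extern const unsigned char asr_dictionary[];       /* phonetic dictionary             */
extern const unsigned char asr_acoustic_model[];   /* acoustic model for the language */

extern bool asr_engine_init(const unsigned char *grammar,
                            const unsigned char *dictionary,
                            const unsigned char *acoustic_model);

bool speech_objects_load(void)
{
    /* No runtime compilation is needed: initialization is simply a matter of
     * handing the engine pointers to the objects already sitting in flash. */
    return asr_engine_init(asr_grammar, asr_dictionary, asr_acoustic_model);
}
```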

 

Figure 3: For a single-chip speech-enabled system, the codec sends data to the audio driver, where it can be accessed by the speech-recognition software (ASR engine). The ASR engine leverages the speech objects previously generated from user-defined commands to determine a match for the audio data, delivering the results to the application.


The right-hand side of figure 3 shows the software hierarchy that runs on the speech-enabled MCU. The audio driver receives data from the codec and stores it in memory. The ASR engine accesses this data and finds the user-defined command from the speech objects that best matches the audio data. This hypothesis is then passed to the application as the recognition result. The user application interacts with the ASR engine and the audio driver through a set of APIs.
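Put together, the single-chip flow might look like the loop sketched below. All of the driver and engine calls are hypothetical stand-ins for whatever the chosen SDK actually provides, and the frame size is an arbitrary example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical audio-driver and ASR-engine APIs. */
extern size_t  audio_driver_read(int16_t *samples, size_t max_samples);
extern int32_t asr_engine_process(const int16_t *samples, size_t n_samples);
extern void    app_handle_command(int32_t command_id);   /* user application */

void speech_task(void)
{
    int16_t frame[160];   /* e.g. 10 ms of 16 kHz audio */

    for (;;) {
        /* The audio driver hands over data captured from the codec. */
        size_t n = audio_driver_read(frame, 160);
        if (n == 0)
            continue;

        /* The engine returns a command ID once a complete phrase has been
         * matched against the speech objects, or a negative value otherwise. */
        int32_t result = asr_engine_process(frame, n);
        if (result >= 0)
            app_handle_command(result);
    }
}
```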

 

In the dual-chip configuration, the host uses the speech client module to send the ASR commands to the MCU, which performs the speech recognition and returns the results. The only software component required to be installed on the host processor is the speech client module (see figure 4).


Figure 4. In the dual-MCU architecture, a speech-enabled MCU supports a series of ASR software components that include the speech objects, a speech-server module, and an ASR engine. The host MCU supports the user application plus a speech client module used to interface with the speech-enabled MCU.

The command flow between the host and the MCU is straightforward (see figure 5). A speech client module runs on the host and a corresponding server module runs on the MCU. The client module converts function calls into messages that are sent over the SPI link. The server module converts these messages back to function calls.
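A minimal sketch of what that framing might look like is shown below. The message layout, opcodes, and function names are assumptions for illustration only; a real SDK defines its own protocol.

```c
#include <stdint.h>

/* Hypothetical message carried over the SPI link between the speech client
 * (host MCU) and the speech server (speech-enabled MCU). */
typedef struct {
    uint8_t opcode;        /* which API call this message represents  */
    uint8_t length;        /* number of valid payload bytes           */
    uint8_t payload[30];   /* command ID, status, or recognized text  */
} speech_msg_t;

enum { OP_START_LISTEN = 0x01, OP_GET_RESULT = 0x02, OP_SLEEP = 0x03 };

/* Low-level full-duplex SPI transfer, assumed to be provided by the host HAL. */
extern void spi_transfer(const uint8_t *tx, uint8_t *rx, uint32_t len);

/* Client side: turn a function call into a message, exchange it over SPI,
 * and hand the server's reply back as an ordinary return value. */
int32_t speech_client_get_result_id(void)
{
    speech_msg_t tx = { .opcode = OP_GET_RESULT, .length = 0 };
    speech_msg_t rx;

    spi_transfer((const uint8_t *)&tx, (uint8_t *)&rx, sizeof tx);

    /* By this sketch's convention, the server returns a command ID in the
     * first payload byte, or 0xFF when nothing has been recognized yet. */
    return (rx.length > 0 && rx.payload[0] != 0xFF) ? (int32_t)rx.payload[0] : -1;
}
```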

 

Figure 5: Communication between the speech-enabled MCU and the host MCU is straightforward, taking place over an SPI connection.


 

Inside the speech application

We’ll finish with a quick review of the steps required to build the speech application. In the case of the single-MCU architecture, the process starts with writing a list of commands (grammar) specific to the application and compiling them to generate the speech objects (grammar, dictionary, and acoustic model). The speech object grammar file can be based on the JSpeech Grammar Format (JSGF), for example. It is used during the search phase of the speech-recognition process to generate all possible sequences of words allowed by the grammar.
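To make the idea concrete, the fragment below embeds a small JSGF-style grammar as a C string. The rule names and commands are hypothetical, and in practice the grammar would live in its own text file and be compiled offline into a binary speech object.

```c
/* Illustrative command grammar in JSGF-style syntax (hypothetical content).
 * Three actions times three objects yields nine multi-word commands; a full
 * application grammar would simply list more alternatives. */
static const char inspection_grammar[] =
    "#JSGF V1.0;\n"
    "grammar inspection;\n"
    "public <command> = <action> <object>;\n"
    "<action> = mark | log | clear;\n"
    "<object> = gas line | water main | obstruction;\n";
```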

 

The speech object dictionary file is a phonetic representation of all the words used in the command set. It is generated using a variety of techniques, including lookup tables and grapheme-to-phoneme conversion algorithms. The dictionary file is also used during the search phase of the speech-recognition process to generate all possible sequences of phonemes (sound units) allowed by the grammar.
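An excerpt of what such a dictionary might contain is sketched below as a simple C table for readability; a real dictionary is generated offline and stored as a compact binary speech object. The phoneme symbols follow the common ARPAbet convention, and the word list is just an example.

```c
/* Illustrative dictionary entries mapping words to phoneme sequences. */
typedef struct {
    const char *word;
    const char *phonemes;
} dict_entry_t;

static const dict_entry_t dictionary[] = {
    { "mark",        "M AA R K" },
    { "gas",         "G AE S" },
    { "line",        "L AY N" },
    { "obstruction", "AH B S T R AH K SH AH N" },
};
```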

 

The acoustic model consists of a set of files that provide the mathematical descriptions of the sound units. These models are trained using hundreds of hours of audio recordings from many speakers in different settings. Each language has a unique acoustic model.

 

The search algorithm used by an ASR engine matches the spoken phrase against the speech objects. Its speed and accuracy can be traded off by adjusting the pruning thresholds at various points of the search. Relaxing the pruning thresholds allows more hypotheses to be evaluated, increasing recognition accuracy at the expense of more computation. Tightening the pruning thresholds, on the other hand, restricts the number of potential matches to be evaluated but speeds up the computation. Another way to control the search algorithm is to directly specify the number of hypotheses evaluated at each point.
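The tuning knobs an engine exposes might look something like the sketch below. The structure, field names, and values are illustrative assumptions; real engines define their own parameters, and the values would be tuned against recorded test utterances for the target application.

```c
#include <stdint.h>

/* Hypothetical search-tuning parameters. Wider beams evaluate more
 * hypotheses and improve accuracy at the cost of computation; narrower
 * beams prune harder and speed up the search. */
typedef struct {
    float    acoustic_beam;    /* pruning threshold on acoustic scores       */
    float    word_end_beam;    /* tighter pruning applied at word boundaries */
    uint32_t max_hypotheses;   /* direct cap on hypotheses kept per frame    */
} asr_search_config_t;

extern void asr_engine_set_search_config(const asr_search_config_t *cfg);

/* Example profiles (illustrative values only). */
static const asr_search_config_t fast_profile     = { 120.0f,  80.0f,  500 };
static const asr_search_config_t accurate_profile = { 200.0f, 140.0f, 3000 };
```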

 

Practical speech recognition

All of this may sound complex. Indeed, next to cost, the effort required to integrate a speech recognition interface with a new or existing product can be the biggest barrier to adoption. Given that most product designers and engineers are not speech recognition experts, designing the interface and integrating it into the product needs to be straightforward and intuitive. Speech-enabled MCUs can simplify the process.

 

Products now emerging in the marketplace integrate ASR engines and include compilers. In these cases, the job of porting the ASR software to the hardware platform is complete, the drivers are in place, and the memory system is embedded or packaged with the MCU. The integration effort should only involve learning the ASR APIs and connecting the MCU SPI port to the host. With an effective software development kit, integrating voice capabilities can be straightforward.

 

In an increasingly complex and demanding industrial market, speech-enabled interfaces give end users a useful alternative. ARM Cortex M-based MCUs can deliver speech-recognition capability either as stand-alone modules in a dual-chip architecture or as part of the functionality of a single-chip design. Embedded engineers have a choice of options that deliver the flexibility they require to achieve the best solution for their application. For a minimal investment of time and cost, they can serve customer needs and differentiate their products in the marketplace.
