Bus Analyzer uncovers root cause of failure in flash-enabled systems

Analysis of granular operational data helps developers more easily fix the flaws in their systems

Takeaways

  • Final production usage of flash memories can vary significantly from the intended usage. This variation can impact the expected data retention time and the data reliability.
  • Customer system tests can fail to identify out-of-spec usage of flash memories.
  • With the proper tools, like Cypress’s Bus Analyzer 2 (BA2), developers can see how the system is actually using the memories. This feedback allows them to retune their systems to get the best performance.

 

As the intelligence, connectivity and functionality of white goods increase, manufacturers find themselves challenged to differentiate their products while satisfying customer expectations. Consumers may be eager to swap out their smartphone every six months, but they expect that smart refrigerator, including its email function, to last for 10 years or more. They also demand low cost, even for high-end appliances.

 

Nearly all of those appliances, whether they are ovens that can download recipes or smart thermostats that adapt to your habits, require non-volatile storage, typically economical flash memory. Without deep experience in embedded design, however, developers may inadvertently operate flash memory out of spec, triggering early failure. To help appliance manufacturing teams uncover their errors, Cypress has developed a custom tool for exploring device operation and isolating the root cause of failure (see figure 1). The insights delivered by this instrument make it easier to build embedded systems that can satisfy cost and lifetime expectations, not just for white goods but across a broad range of applications.

 


 

Figure 1: Bus analyzer can monitor the performance of a range of memory types without altering system function, assisting customers in establishing root cause of failure. In addition to digital I/O, it features analog channels to track characteristics like temperature, voltage and current.

 

Flash memory basics

It is difficult to beat the cost of flash memory devices when it comes to non-volatile storage. Low cost, however, doesn’t mean they’re easy to use. First-time users may be surprised at how challenging they can be to control. In contrast to some memory technologies, you cannot simply write your data to a flash device. Rather, data is written into the flash memory through a series of commands, a process known as programming. Depending on the type of flash device selected, the program unit size (PUS) can be anywhere from one byte to a 4 KB page. The erase unit size (EUS) is typically a multiple of the PUS. The complexity doesn’t end there. Each erase unit can be erased only a limited number of times, typically from 10K to 1M, depending on the type of flash. We call this the erase cycle limit, or ECL.
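
To make the erase cycle limit concrete, the short calculation below estimates how long a single erase unit lasts at a steady erase rate. The function name and the example figures (a 100,000-cycle limit, one erase per minute versus one per hour) are illustrative assumptions, not values taken from any particular data sheet.

```python
def years_to_erase_limit(erase_cycle_limit, erases_per_day):
    """Years until one erase unit reaches its erase cycle limit (ECL)."""
    if erases_per_day <= 0:
        raise ValueError("erases_per_day must be positive")
    return erase_cycle_limit / erases_per_day / 365.0


# Assumed example: a 100,000-cycle part with one erase unit used as a log.
ECL = 100_000
per_minute = years_to_erase_limit(ECL, erases_per_day=24 * 60)  # erased every minute
per_hour = years_to_erase_limit(ECL, erases_per_day=24)         # erased every hour

print(f"once a minute: {per_minute:.2f} years, once an hour: {per_hour:.1f} years")
```

Under these assumptions, the erase unit that is cycled once a minute reaches its limit in roughly 69 days, while the one cycled once an hour lasts a little over 11 years. The difference between passing and missing a 10-year lifetime target can come down to how often the software erases a single location.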

 

There are important considerations to keep in mind if you’re working with data that will be frequently changing in your flash device. Many users rely on a technique called wear leveling (WL) to distribute this erase cycling over the entire device; however, this means that your data will be written to a different location every time—yet another complexity.
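
A minimal sketch of the idea, assuming a toy device in which each erase unit holds one copy of a record and every update goes to the least-erased unit. The class and method names are hypothetical. Production wear-leveling layers also deal with power loss, bad blocks and garbage collection, all of which are omitted here.

```python
class WearLeveledRecord:
    """Toy wear leveling: one logical record rotated across several erase units."""

    def __init__(self, num_units):
        self.erase_counts = [0] * num_units  # erase cycles seen by each unit
        self.data = [None] * num_units       # contents of each unit
        self.current = None                  # unit that holds the live copy

    def update(self, value):
        # Choose the least-erased unit; the lowest index wins ties.
        target = min(range(len(self.erase_counts)),
                     key=self.erase_counts.__getitem__)
        # The unit is erased before it is programmed, so count one cycle.
        self.erase_counts[target] += 1
        self.data[target] = value
        # The record now lives at a different physical location.
        self.current = target

    def read(self):
        return None if self.current is None else self.data[self.current]


store = WearLeveledRecord(num_units=10)
for i in range(1000):
    store.update(i)

print("erase counts:", store.erase_counts)
print("live copy is in unit", store.current, "with value", store.read())
```

In this sketch, 1,000 updates spread evenly over 10 erase units cost each unit 100 erase cycles instead of 1,000 on a single unit. The price is the bookkeeping in `current`: the software must always know where the live copy is.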

 

As technology continues to improve, flash die get smaller, densities increase and costs go down. With these changes, the ECL of the devices can be affected. In these cases, the management of the memory must adapt. Designs that do not accommodate the newer flash device specifications could experience unexpected failures at some point during their lifetime.

 

The problem is that once a program has been written and tested, it’s not uncommon for it to be used for a long time. In most cases, that's the right thing to do. Flash memory specifications may change, however, as new generations of flash become available. If the software is not updated to reflect the new flash specifications, and the system is tested with the original flash software and original system tests, the process may not be enough to uncover a mismatch between the usage and the technology. Unfortunately, failures related to operation outside of specifications can be difficult to catch. This can lead to products that test out fine in the factory but fail once they are out in the field.

 

Operating a flash memory device outside of its design specifications is never recommended, but it does not always result in an error or a failure during the production test. That said, earlier generations of products may have been more tolerant of running outside of spec than are present-day devices. In addition, the tolerance of a device for out-of-spec usage at the beginning of its life may be better than the tolerance level at the end of its life.

 

Finding and diagnosing the problem

There are a number of types of out-of-spec usage, but here are the more common ones:

 

  • Allowing an erase cycle rate for one or more erase units that causes those sectors to reach end of life (EOL) sooner than expected
  • Programming pages multiple times between erases
  • Operating at temperatures outside of device specifications
  • Improperly interpreting flash status

Any of these examples could slip through typical product testing, only to become a problem once in the field. It may take years for out-of-spec flash memory usage to contribute to a failure, but ownership timelines in the white-goods market make it more likely that this would be an issue. At the same time, finding the root cause of system failures for products that have been in the field for a long time can be very difficult. The problems occur under varying and uncontrolled conditions, which means that reproducing them in a lab can be problematic. Development boards and software may be hard to find and setting up the system can be a challenge. Development tools are often being used on current projects so they may be unavailable. Meanwhile, familiarity with a project decreases with every passing year.

 

These hurdles can be overcome, but then the developer must reproduce the failure and determine what component or components failed. If the flash memory is determined to be the failing component, it is often sent to the manufacturer for analysis.

 

Flash semiconductor companies are able to extract a lot of information out of a device, and the devices lend themselves to extensive forensic investigation. There is sufficient state transition data retained in the flash, along with voltage levels of the individual cells, to provide some insight. Additionally, the flash manufacturer may have encountered similar problems during product qualification (which involves overstressing the device) and would be familiar with the failure mode, which increases the likelihood of diagnosing the root cause.

 

The device analysis may reveal that there was, in fact, a failure, or it could return no fault found (NFF). It can be difficult to believe the findings in the case of NFF, but frequently the problem is not the part but the fact that it is being operated outside of usage parameters. The part meets specifications but it is being operated at the wrong voltage or wrong temperature, being cycled too frequently, etc. There are steps the engineer should take before submitting a device for analysis to help eliminate system issues as the cause of failure, such as:

 

  • Review the signal integrity.
  • Review the timing to make sure it meets the specifications.
  • Review the flash drivers.

The problem could also be caused by higher-level software, or a third-party block driver (BD) for which you have no source code. Additionally, the problem may be the result of how the end customer is using the system.

 

Some debugging methods may actually introduce other variables into your troubleshooting. Instrumenting the code will change the timing of the system. Logging additional information to flash for debugging purposes is likely to complicate the issue. Logic analyzers record high-resolution data but generally capture only hundreds of milliseconds at a time, so you’d need to perform multiple iterations with a logic analyzer to narrow down possible causes.

 

Searching out root cause

After considering the factors above, you now might be wondering what the best method would be for finding the root cause of your issue. At Cypress, we’ve had extensive experience in this area while working with our customers, but to truly serve them, we realized that we needed more effective test equipment. We determined that we needed a tool that would:

 

  • Capture all flash transactions over a long period of time.
  • Allow insertion into a system that reproduces the failure without affecting system operation or performance.
  • Have the ability to evaluate system behavior for out-of-spec usage of the flash.
  • Support parallel NOR flash, SPI NOR flash and NAND flash memory devices.
  • Attach to common packages.

When we created our specialized wish list, there were no companies that had this type of test equipment available. So, around 2009, Spansion (now part of Cypress) developed its first-generation bus analyzer for memory diagnostics. It was designed to support parallel NOR flash, SPI NOR flash and NAND flash up to around 40 MHz. This first attempt was not perfect but it worked well enough to let us know that we were on the right track (see figure 2).

 


 

Figure 2: Bus analyzer allows us to monitor read/write operations and export the data for analysis.

 

With the bus analyzer data, we could associate the results from the device analysis lab with the root cause, which was frequently out-of-spec usage. Troubleshooting is an iterative process. The log files and analysis from the bus analyzer allowed customers to narrow their focus from the memory to their own systems. At that point, it was generally easy to isolate and update the offending software.
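
As an illustration of this kind of log analysis, the sketch below scans a list of flash transactions for two of the out-of-spec patterns listed earlier: a page programmed more than once between erases, and an erase rate that would exceed the ECL within the target product lifetime. The event format, the sector and page sizes and the function name are all assumptions made for this example. They do not describe the bus analyzer's actual log format or analysis software, and some flash types do permit multiple programs per page, so the first check should follow the data sheet of the device in question.

```python
from collections import defaultdict

SECTOR_SIZE = 4096  # assumed erase unit size in bytes
PAGE_SIZE = 256     # assumed program unit size in bytes


def check_trace(events, ecl, lifetime_s):
    """Scan (time_s, op, address) events, where op is 'erase' or 'program'.

    Returns a list of findings:
      ('reprogram', page, time_s)         page programmed again before an erase
      ('erase_rate', sector, projected)   projected lifetime erases exceed ecl
    """
    erase_counts = defaultdict(int)
    programmed = defaultdict(set)  # sector -> pages programmed since last erase
    findings = []
    first_t = last_t = None

    for t, op, addr in events:
        if first_t is None:
            first_t = t
        last_t = t
        sector = addr // SECTOR_SIZE
        if op == "erase":
            erase_counts[sector] += 1
            programmed[sector].clear()
        elif op == "program":
            page = addr // PAGE_SIZE
            if page in programmed[sector]:
                findings.append(("reprogram", page, t))
            programmed[sector].add(page)

    # Extrapolate the observed erase rate over the whole product lifetime.
    duration = (last_t - first_t) if first_t is not None else 0
    if duration > 0:
        for sector, count in sorted(erase_counts.items()):
            projected = count * lifetime_s / duration
            if projected > ecl:
                findings.append(("erase_rate", sector, projected))
    return findings


TEN_YEARS_S = 10 * 365 * 24 * 3600
trace = [
    (0, "erase", 0x0000),
    (1, "program", 0x0000),
    (2, "program", 0x0000),  # same page programmed again before an erase
    (3, "erase", 0x0000),
    (4, "program", 0x0100),
]
for finding in check_trace(trace, ecl=100_000, lifetime_s=TEN_YEARS_S):
    print(finding)
```

A capture of a few seconds, as in this example, is far too short to project a realistic erase rate. The extrapolation only becomes meaningful when the capture covers a long, representative period of operation.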

 

Although the first-generation bus analyzer represented a significant step forward, it fell short in a couple of areas, so we got to work on the next generation. In addition to the original bus analyzer features, it needed:

 

  • Additional analog channels for temperature, voltage and current measurements.
  • The ability to support additional memories like embedded multi-media card (eMMC), Hyperflash and others, at their maximum bus speeds.

It has taken us a little over a year, but we now have the Bus Analyzer 2 (BA2; see figure 3). We have been using it for 12 months to perform failure testing for customers across a range of applications. We focused on performance of the analyzer and not so much on aesthetics or portability, because the tool was intended for internal use only. Like the original bus analyzer or any test equipment, it can be a bit challenging to connect to the target system, but the results are usually well worth the trouble.

 


 

Figure 3: Intended for in-house use to assist customers in diagnosing failures, the bus analyzer is designed to interface with the system without impacting performance or results.

 

One of the fundamentals of engineering is that you can’t solve the problem until you know what the problem is. Flash memory is an essential storage technology for a wide range of applications but it’s easy for users to run into trouble if they accidentally use it out of spec. Our bus analyzer enables developers to more quickly home in on the root cause of failure in embedded systems. Although this particular article focuses on white goods, the advice applies equally to embedded systems for markets like automotive, industrial, wearables, medical devices and more.

 

