Sunday, March 15, 2020

Why does the cycle life of EEPROM and Flash need to be considered in your design?

Most microcontrollers nowadays come with up to a few kilobytes of EEPROM, which is frequently used to store user settings and sometimes for basic logging. Flash-based microcontrollers include EEPROM because Flash memory must be erased in large blocks (pages) before it can be written. That is fine for program memory, but inconvenient for settings where only a few bytes change, since the whole page has to be buffered in RAM and rewritten.

Here are the parameters for the various types of memory in a popular microcontroller:

One parameter that is easy to overlook is the number of write/erase cycles. "100,000 cycles, that's a big number, great, don't need to worry about it."

That is a bad approach. Cycle life is important, especially if you don't want your company to gain a reputation for making "landfill electronics", "lagdroid phones", or "cars that frequently need expensive electrical repairs out of warranty..."

I'm not going to demonstrate this with a program that repeatedly writes the same location forever; I will focus on practical examples instead.

A practical demonstration

This is the Dashboard Display 1.0:

Using a GPS module, it measures the current speed, and the distance and time taken for each trip. I'll cover the project in more detail later. It's also the forerunner to the much more advanced project I did for my university dissertation, Global Positioning System and European On-Board Diagnostics-based Automotive Monitoring System.

The Dashboard Display was designed to be continuously powered. It connected to the vehicle via a short car radio extension cable, with spliced-in wires running to the multi-pin connector on the front. These connections allowed the backlight to come on and dim automatically when required. The trip distance and time were nevertheless stored in EEPROM, just in case the power was disconnected.

The power was disconnected after six months, and this is what came up when power was restored:

How did that happen? My first guess was that I had made a type casting error in C, but that error usually results in losing 1s off the left of a variable, not gaining them. Additionally, the problem came back after resetting the trip counter and cycling the power again.

I examined my code and found that it saved the time and distance to EEPROM once a second, even when the trip counter was not running (it paused itself whenever no movement was detected). As a result, the same eight EEPROM locations were written around 15,000,000 times, or 150 times the maximum rating. I wrote a little test program to check them, and these 8 locations were well and truly banjaxed, being unable to hold freshly written data for even the few microseconds it took for my program to read back the values.
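A read-back test of this kind can be sketched as follows. This is not the original test program: the EEPROM is simulated with an array, the function names are hypothetical, and the `ee_stuck_low` fault mask is a deliberately crude stand-in for worn cells, added only so that the test has something to find. On real hardware, `ee_write()` and `ee_read()` would be replaced by the vendor's EEPROM routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated EEPROM (assumption: stands in for the real part). */
#define EE_SIZE 16u
static uint8_t ee_cells[EE_SIZE];
/* Bits set here always read back as 0: a crude model of worn cells. */
static uint8_t ee_stuck_low[EE_SIZE];

void ee_write(size_t addr, uint8_t value)
{
    ee_cells[addr] = value;
}

uint8_t ee_read(size_t addr)
{
    return (uint8_t)(ee_cells[addr] & ~ee_stuck_low[addr]);
}

/* Writes each test pattern to every cell in [start, start + len) and
   reads it straight back. Returns the number of mismatches found. */
unsigned ee_readback_test(size_t start, size_t len)
{
    static const uint8_t patterns[] = { 0xFF, 0x00, 0xAA, 0x55 };
    unsigned failures = 0;

    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        for (size_t i = 0; i < len; i++) {
            ee_write(start + i, patterns[p]);
            if (ee_read(start + i) != patterns[p]) {
                failures++;
            }
        }
    }
    return failures;
}
```

Bear in mind that a test like this adds wear of its own, so it is something to run on a suspect part rather than at every power-up.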

All 8 bytes were affected, since the microcontroller cannot erase individual bits, and it will happily erase and rewrite unchanged bytes unless the program checks them first. Interestingly, the bytes still passed testing when a sequence of 11111111 was written, and only failed when patterns of alternating bits were written, suggesting that neighbouring bits influence failing bits. The failing bits were always read back as 0 when alternating bit sequences were written.
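Checking before writing is straightforward to add. The sketch below compares each byte with what is already stored and writes only the bytes that differ. The EEPROM is again simulated with an array, and all names are hypothetical. Some toolchains ship an equivalent (avr-libc has `eeprom_update_byte()` and `eeprom_update_block()`), so check yours before rolling your own.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated EEPROM with a write counter (assumption: stands in for
   the real part so the saving in writes can be observed). */
#define EE_SIZE 16u
static uint8_t ee_cells[EE_SIZE];
static unsigned long ee_write_count;

uint8_t ee_read(size_t addr)
{
    return ee_cells[addr];
}

void ee_write(size_t addr, uint8_t value)
{
    ee_cells[addr] = value;
    ee_write_count++;
}

/* Writes only the bytes that differ from what is already stored.
   Returns the number of cells actually written. */
size_t ee_update_block(size_t addr, const uint8_t *data, size_t len)
{
    size_t written = 0;

    for (size_t i = 0; i < len; i++) {
        if (ee_read(addr + i) != data[i]) {
            ee_write(addr + i, data[i]);
            written++;
        }
    }
    return written;
}
```

With this in place, a paused trip counter costs no write cycles at all, and a running one mostly wears the low-order bytes that actually change.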

Where was the mistake made?

It's easy to blame the piece of code that was supposed to skip writing the trip values while the trip counter was paused, but didn't. That's the wrong attitude. It's true that the EEPROM would have suffered much less wear without that bug, and problems would have taken much longer to surface, but the real mistake was the decision to write the trip data every second. That decision was made to protect the data should power be lost temporarily, with no consideration for longevity. Even if the code had been written correctly (according to specification), it would have taken under 28 hours of continuous driving (100,000 writes at one per second) to exceed the rated lifespan of the EEPROM. The rated lifespan is usually a worst-case figure (highest voltage and temperature), so real parts often last longer, but a design should not rely on that margin.
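The arithmetic is worth doing at design time. Here is a small sketch with hypothetical helper names, assuming the 100,000-cycle rating quoted earlier and one write per second to the same cells:

```c
/* Hours of operation before one cell reaches its rated cycle count. */
double hours_to_rated_life(double rated_cycles, double writes_per_hour)
{
    return rated_cycles / writes_per_hour;
}

/* Total writes to one cell over a period of continuous power. */
double writes_over_days(double days, double writes_per_hour)
{
    return days * 24.0 * writes_per_hour;
}
```

At 3,600 writes per hour, `hours_to_rated_life(100000.0, 3600.0)` comes to a little under 28 hours, and `writes_over_days(180.0, 3600.0)` (taking six months as 180 days) gives roughly 15.5 million writes. Saving once a minute instead stretches the rated life to around 1,667 hours of driving.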

The frequency of saving data to EEPROM must be carefully considered when designing a product. In this case, it's unlikely that power would be lost mid-trip, and it would probably be acceptable to save the data only when the trip is stopped or reset, which is how the Global Positioning System and European On-Board Diagnostics-based Automotive Monitoring System handles this problem.
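As a rough sketch of that approach (not the code from either project), the function below is called once a second but saves only on the transition from running to stopped, and only if the data differs from what is already stored. The `trip_t` layout and all function names are assumptions, and the non-volatile copy is simulated with a plain variable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical trip record; the real layout is not given in this post. */
typedef struct {
    uint32_t distance_m;
    uint32_t elapsed_s;
} trip_t;

static trip_t saved_trip;          /* simulated non-volatile copy */
static unsigned long save_count;   /* counts simulated EEPROM saves */
static bool was_running;

void trip_save(const trip_t *t)
{
    saved_trip = *t;
    save_count++;
}

/* Call once per second. Saves only when the trip has just stopped
   and the stored copy is out of date. */
void trip_tick(const trip_t *t, bool running)
{
    if (was_running && !running &&
        (t->distance_m != saved_trip.distance_m ||
         t->elapsed_s != saved_trip.elapsed_s)) {
        trip_save(t);
    }
    was_running = running;
}

/* Resetting the trip is the other event worth saving. */
void trip_reset(trip_t *t)
{
    t->distance_m = 0;
    t->elapsed_s = 0;
    trip_save(t);
}
```

An hour of driving followed by a stop now costs one save instead of 3,600. The trade-off is that a power cut mid-trip loses that trip's data.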

If you really need to write often

With ever increasing flash memory density and cost pressures, the cycle life of flash memory can be as low as 1000 cycles. Wear levelling is a technique used by modern flash-based mass storage and works well as long as there is a good amount of "unused" space. This technique can also be used on a microcontroller, but is not without its challenges.
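One simple form of wear levelling on a microcontroller is to rotate a small record through several slots, tagging each write with a sequence number so the newest can be found after power-up. The sketch below is a minimal illustration under stated assumptions: the non-volatile memory is simulated with an array, all names are hypothetical, and erased cells are assumed to read as all ones. It ignores the harder parts, such as a write interrupted by power loss (usually handled with a checksum per slot) and sequence number wrap-around.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 16u
#define SEQ_EMPTY  0xFFFFFFFFu   /* erased cells read as all ones */

typedef struct {
    uint32_t seq;     /* increases by one on every store */
    uint32_t value;   /* the data being preserved */
} slot_t;

static slot_t nv_slots[SLOT_COUNT];              /* simulated NV memory */
static unsigned long slot_writes[SLOT_COUNT];    /* wear per slot */

/* Puts the simulated memory into its erased state. */
void wl_format(void)
{
    memset(nv_slots, 0xFF, sizeof nv_slots);
    memset(slot_writes, 0, sizeof slot_writes);
}

/* Returns the index of the newest valid slot, or -1 if all are empty. */
int wl_find_latest(void)
{
    int latest = -1;

    for (size_t i = 0; i < SLOT_COUNT; i++) {
        if (nv_slots[i].seq == SEQ_EMPTY) {
            continue;
        }
        if (latest < 0 || nv_slots[i].seq > nv_slots[latest].seq) {
            latest = (int)i;
        }
    }
    return latest;
}

/* Stores a value in the slot after the newest one, wrapping around. */
void wl_store(uint32_t value)
{
    int latest = wl_find_latest();
    size_t next = (latest < 0) ? 0 : ((size_t)latest + 1u) % SLOT_COUNT;
    uint32_t seq = (latest < 0) ? 0 : nv_slots[latest].seq + 1u;

    nv_slots[next].seq = seq;
    nv_slots[next].value = value;
    slot_writes[next]++;
}

/* Loads the newest value. Returns 1 on success, 0 if nothing is stored. */
int wl_load(uint32_t *value)
{
    int latest = wl_find_latest();

    if (latest < 0) {
        return 0;
    }
    *value = nv_slots[latest].value;
    return 1;
}
```

With 16 slots, each cell sees one sixteenth of the writes, at the cost of 16 times the storage and a scan on every access. A real implementation would normally cache the latest index in RAM rather than rescanning.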

Other solutions include using FRAM, which is similar to EEPROM except with practically unlimited cycle life, and battery-backed SRAM.

But the best solution is to minimise how often you write data. And if you're writing logging data, do you really need to?

A commercial example

Here's an example of how such a simple oversight can affect even the biggest companies:

Photo by Tokumeigakarinoaoshima - licensed under CC BY-SA

The MCU (Media Control Unit) is the main computer in Tesla cars, and it uses eMMC Flash memory. The software writes a lot of logging data, and combined with the relatively small 8 GB Flash chip, this wears the Flash out in just a few years on heavily used cars. The only official repair is replacement of the MCU, costing £2500+ out of warranty.

At least in this case, this problem can be mitigated somewhat with software updates. Many embedded projects do not have that luxury.

Learn more about this issue here: https://teslaownersgroup.co.uk/kb/emmc-chip-failure-what-is-it-should-i-be-worried-what-can-i-do

Do you know of any other commercial examples? Let me know in the comments.
