Chipkill Advanced ECC - Overview of How It Works
by Bob Day (daybob@gmail.com)
Copyright (C) February, 2005 by Bob Day. All rights reserved.
Chipkill ECC as implemented on an Opteron CPU is a single symbol correction, double symbol detection (SSC/DSD) technique that operates on 4-bit symbols. Chipkill will detect and correct a single erroneous symbol, will detect that an error has occurred when two symbols are erroneous, and may detect that an error has occurred when more than two symbols are erroneous. Chipkill works well on memory modules built with 4-bit-wide (x4) device chips because each such chip is the source or destination for exactly 4 bits, one symbol, on every read or write. Modules built with wider device chips may also be used, but a failure in one of those chips can span more than one symbol, so only a subset of errors will be detected.
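The relationship between chip width and what Chipkill can do about a whole-chip failure can be sketched in a few lines of Python. This is only an illustration of the rules stated above. The function names are made up for the example, and it assumes that the bits of one chip map onto whole, aligned 4-bit symbols, which the text does not spell out for chips wider than x4.

```python
import math

SYMBOL_BITS = 4  # symbol size used by the Opteron's Chipkill, per the text


def symbols_per_chip(chip_width_bits):
    """Symbols one device chip contributes to a single access.

    Assumption for illustration: a chip's bits fill whole, aligned symbols.
    """
    return math.ceil(chip_width_bits / SYMBOL_BITS)


def outcome(bad_symbols):
    """Apply the SSC/DSD rules to a count of erroneous symbols."""
    if bad_symbols == 0:
        return "no error"
    if bad_symbols == 1:
        return "corrected"
    if bad_symbols == 2:
        return "detected, not corrected"
    return "may or may not be detected"


for width in (4, 8, 16):
    result = outcome(symbols_per_chip(width))
    print(f"x{width} chip, whole-chip failure: {result}")
```

Under that assumption, only the x4 case keeps a whole-chip failure inside a single symbol, which is why x4 modules are the ones that get the full benefit.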
The Opteron’s Chipkill algorithm operates on 128 data bits at a time. In addition, each 128 data bits require 16 error check bits; on a read, the check bits are used to compute a syndrome that identifies any error. Consequently, each read or write access of memory involves 144 bits and is performed over a pair of memory modules. The modules must be double-sided, single-bank x4 ECC modules with 9 device chips per side. (Each memory read or write accesses 4 bits from each of 36 device chips, for a total of 144 bits.) In the BIOS, the “interleaved memory” option must be set.
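The bit counts above can be checked with a little arithmetic. The Python sketch below does that and also splits a 144-bit word into its 36 symbols. The constant names and the symbol ordering (symbol 0 in the lowest 4 bits) are assumptions made for the example; the actual assignment of bits to chips is not described here.

```python
DATA_BITS = 128        # data bits per access
CHECK_BITS = 16        # error check bits per access
SYMBOL_BITS = 4        # Chipkill symbol size
MODULES = 2            # accesses span a pair of modules
SIDES_PER_MODULE = 2   # double-sided modules
CHIPS_PER_SIDE = 9     # x4 device chips per side

word_bits = DATA_BITS + CHECK_BITS                    # bits per access
chips = MODULES * SIDES_PER_MODULE * CHIPS_PER_SIDE   # chips per access
symbols = word_bits // SYMBOL_BITS                    # symbols per access
data_symbols = DATA_BITS // SYMBOL_BITS
check_symbols = CHECK_BITS // SYMBOL_BITS


def split_symbols(word):
    """Split a 144-bit integer into 4-bit symbols, one per x4 chip.

    The ordering (symbol 0 = lowest 4 bits) is hypothetical.
    """
    mask = (1 << SYMBOL_BITS) - 1
    return [(word >> (SYMBOL_BITS * i)) & mask for i in range(symbols)]


print(word_bits, "bits per access from", chips, "chips")
print(symbols, "symbols:", data_symbols, "data +", check_symbols, "check")
```

With x4 chips, the number of chips and the number of symbols per access are the same, so each chip supplies exactly one symbol.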
As stated above, the Opteron’s Chipkill algorithm operates on 4-bit symbols. If a single symbol is erroneous, the algorithm will correct any or all bit errors (including all 4 bits) within that symbol. If two symbols are erroneous, the algorithm will detect that an error occurred, but will not correct it. If more than two symbols are erroneous, an error may or may not be detected. Because each symbol on a chip is part of only one 144-bit data word that is read or written during a memory access, Chipkill can correct all errors on an x4 device chip. As a consequence, if one device chip that participates in a memory access is completely dead, it will effectively be bypassed. The memory will continue to operate, but with its correction capacity used up by the dead chip: an additional error in another chip will be detected but not corrected, much as if it were parity memory.
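To make the symbol-level behavior concrete, here is a small Python sketch of a toy SSC/DSD code over 4-bit symbols. It is not the Opteron's code: it is a shortened-to-fit illustration using a Reed-Solomon style code over GF(16) with 15 symbols per word (12 data, 3 check), whereas the Opteron's code covers 36 symbols with 16 check bits and its construction is not described here. All names in the block are invented for the example.

```python
# GF(16) log/antilog tables, primitive polynomial x^4 + x + 1 (0x13).
EXP = [0] * 30
LOG = [0] * 16
x = 1
for i in range(15):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x10:
        x ^= 0x13
for i in range(15, 30):
    EXP[i] = EXP[i - 15]


def gmul(a, b):
    """Multiply two GF(16) elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gdiv(a, b):
    """Divide a by a nonzero b in GF(16)."""
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 15]


# Generator polynomial (x + 1)(x + a)(x + a^2), lowest degree first.
GEN = [1]
for k in range(3):
    nxt = [0] * (len(GEN) + 1)
    for i, c in enumerate(GEN):
        nxt[i] ^= gmul(c, EXP[k])
        nxt[i + 1] ^= c
    GEN = nxt


def encode(data):
    """Return 15 symbols: 3 check symbols followed by 12 data symbols."""
    word = [0, 0, 0] + list(data)
    rem = list(word)
    for i in range(14, 2, -1):      # polynomial division by GEN
        coef = rem[i]
        if coef:
            for j in range(4):
                rem[i - 3 + j] ^= gmul(coef, GEN[j])
    word[0:3] = rem[0:3]            # remainder becomes the check symbols
    return word


def syndromes(word):
    """Evaluate the word at 1, a and a^2; all zero means no error seen."""
    s = [0, 0, 0]
    for i, sym in enumerate(word):
        for k in range(3):
            s[k] ^= gmul(sym, EXP[(k * i) % 15])
    return s


def decode(word):
    """Return (status, word) after single symbol correction."""
    s0, s1, s2 = syndromes(word)
    if s0 == 0 and s1 == 0 and s2 == 0:
        return "ok", list(word)
    # A single-symbol error e at position p gives s0=e, s1=e*a^p, s2=e*a^2p.
    if s0 and s1 and gmul(s1, s1) == gmul(s0, s2):
        fixed = list(word)
        fixed[LOG[gdiv(s1, s0)]] ^= s0
        return "corrected", fixed
    return "uncorrectable", list(word)


data = [3, 0, 15, 7, 1, 9, 12, 4, 8, 2, 6, 11]
word = encode(data)

one_bad = list(word)
one_bad[5] ^= 0xF                   # all 4 bits of one symbol flipped
print(decode(one_bad)[0])

two_bad = list(one_bad)
two_bad[9] ^= 0x1                   # a second symbol is also wrong
print(decode(two_bad)[0])
```

In this toy code, any error confined to one symbol is corrected, and any error confined to two symbols is flagged rather than corrected. Errors in three or more symbols can be flagged or can be mistaken for a single-symbol error, which mirrors the "may or may not be detected" case above.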
The Chipkill scheme corrects the most probable kinds of memory errors: single bit errors that occasionally occur due to electrical noise, and errors caused by cosmic rays, which may affect a string of bits within a device chip but are extremely unlikely to affect more than one device chip simultaneously.
I can personally vouch for the following mainboard, CPU, and memory combinations, which give you a Chipkill ECC memory system:
Opteron 142 CPU, ASUS SK8V mainboard, two Crucial CT6472Y335.18LFC4 memory modules.
Opteron 246HE CPU, Arima HDAMB mainboard, two Crucial CT6472Y40B.18LFG4 memory modules.