Redis crashes

jieforest · posted on 2012-12-5 08:52

How to test on crashes
===

My first idea was to test memory incrementally inside the allocator.
For example, every so often, when you allocate some memory, run a fast memory test on it and, if a problem is detected, report it in the Redis log and in the INFO output.

In theory it is nice, and I even implemented the idea. The problem is that it is completely unreliable. If the broken memory is allocated for something that is never deallocated, it will never be tested again. Worse, it takes a lot of time to test the whole memory incrementally, one small piece after another, and what about testing every single location? The allocator itself uses "internal" memory that is never exposed to the user, so we would miss all those pages.

Bad idea... and the implementation I wrote was not working at all: the CPU cache made it completely useless, since testing small pieces of memory incrementally results in constant cache hits.

The second try was definitely better, and was simply to test the whole space of allocated memory, but only when a crash happens.

At first this looks pretty hard: at the very least you need a lot more help from the memory allocator you are using. I don't think jemalloc has an out-of-the-box way to report the memory regions allocated so far. Also, if we are crashing, I'm not sure how reliable asking the allocator to report memory regions could be.
As a result of a single bit error, it is very easy to see the error propagating to random locations.

There are other problems. After a crash we want to be able to dump a core that is meaningful. If during the memory test we fill our memory with patterns, the core file will be completely useless. This means that the memory test needed to be designed so that, at the end of the test, the memory content is exactly what it was before.

jieforest · posted on 2012-12-5 08:52

The proc filesystem /proc/<pid>/maps
===

The Linux implementation of the proc filesystem makes Linux an exceptionally introspective operating system. A developer needs minimal effort to access information that is usually not exposed to user space. Actually, the effort required is as small as parsing a text file.

The "maps" file of the proc filesystem shows line after line all the memory mapped regions for a given process and their permissions (read, write, execute).
The sum of all the reported regions is the whole address space that can be accessed in some way by the specified process.

Some of these maps are "special" maps created by the kernel itself, like the process stack. Other maps are memory mapped files like dynamic libraries, and others are simply regions of memory allocated by malloc(), either via sbrk() (which relies on the brk() system call) or via an anonymous mmap().

jieforest · posted on 2012-12-5 08:53

The following two lines are examples of maps
(obtained with "cat /proc/self/maps"):

7fb1b699b000-7fb1b6b4e000 r-xp 00000000 08:05 15735004 /lib/x86_64-linux-gnu/libc-2.15.so
7fb1b6f5f000-7fb1b6f62000 rw-p 00000000 00:00 0


The first part of each line is the address range of the memory mapped area, followed by the permissions "rwxp" (read, write, execute, private). Next comes the offset, which is meaningful for memory mapped files, then the device id, which is 00:00 for anonymous maps, and finally the inode and file name for memory mapped files.
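To make the field layout concrete, here is a minimal sketch that splits one maps line into the fields just described and applies the "anonymous, readable and writable, not the stack" filter. The names map_entry, parse_maps_line and is_testable_region are hypothetical helpers written for this illustration; they are not taken from the Redis source, and the sketch assumes a 64-bit Linux where an address fits in an unsigned long.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical structure holding the fields of one /proc/<pid>/maps line. */
struct map_entry {
    unsigned long start, end;   /* address range [start, end) */
    char perms[5];              /* permissions, e.g. "rw-p" */
    unsigned long offset;       /* offset inside the mapped file */
    char dev[16];               /* device id, "00:00" for anonymous maps */
    unsigned long inode;        /* 0 for anonymous maps */
    char path[256];             /* file name, [heap], [stack], or empty */
};

/* Parse one line. Returns 1 if at least the six mandatory fields were
 * found, 0 otherwise. The path is optional and is cut at the first space. */
int parse_maps_line(const char *line, struct map_entry *e) {
    assert(line != NULL && e != NULL);
    e->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %15s %lu %255s",
                   &e->start, &e->end, e->perms, &e->offset,
                   e->dev, &e->inode, e->path);
    return n >= 6;
}

/* True for anonymous maps that are readable and writable and are not
 * the stack. */
int is_testable_region(const struct map_entry *e) {
    return strcmp(e->dev, "00:00") == 0 &&
           e->perms[0] == 'r' && e->perms[1] == 'w' &&
           strstr(e->path, "stack") == NULL;
}
```

Fed the two example lines above, parse_maps_line() accepts both, and is_testable_region() rejects the libc mapping (device 08:05, not writable) while accepting the anonymous rw-p one.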

We are interested in checking all the heap allocated memory that is readable and writable, so a simple grep will do the trick:

$ cat /proc/self/maps | grep 00:00 | grep rw
01cbb000-01cdc000 rw-p 00000000 00:00 0 [heap]
7f46859f9000-7f46859fe000 rw-p 00000000 00:00 0
7f4685c05000-7f4685c08000 rw-p 00000000 00:00 0
7f4685c1e000-7f4685c20000 rw-p 00000000 00:00 0
7fffe7048000-7fffe7069000 rw-p 00000000 00:00 0 [stack]


jieforest · posted on 2012-12-5 08:53

Here we can find both the heap allocated using the brk() system call (through sbrk()), and the heap allocated using anonymous maps. However there is one thing that we don't want to scan, and that is the stack: otherwise the function that is testing the memory would easily crash itself.

So thanks to the Linux proc filesystem the first problem is no longer a big issue, and we use some code like this:

int memtest_test_linux_anonymous_maps(void) {
    FILE *fp = fopen("/proc/self/maps","r");
    ... some more var declaration ...

    while(fgets(line,sizeof(line),fp) != NULL) {
        char *start, *end, *p = line;

        /* Split "start-end rest of the line" in place. */
        start = p;
        p = strchr(p,'-');
        if (!p) continue;
        *p++ = '\0';
        end = p;
        p = strchr(p,' ');
        if (!p) continue;
        *p++ = '\0';

        /* Skip the stack and kernel maps, then keep only anonymous
         * (device 00:00) readable and writable regions. */
        if (strstr(p,"stack") ||
            strstr(p,"vdso") ||
            strstr(p,"vsyscall")) continue;
        if (!strstr(p,"00:00")) continue;
        if (!strstr(p,"rw")) continue;

        start_addr = strtoul(start,NULL,16);
        end_addr = strtoul(end,NULL,16);
        size = end_addr-start_addr;

        start_vect[regions] = start_addr;
        size_vect[regions] = size;
        printf("Testing %lx %lu\n", start_vect[regions], size_vect[regions]);
        regions++;
    }

    ... code to actually test the found memory regions ...

    /* NOTE: It is very important to close the file descriptor only now
     * because closing it before may result into unmapping of some memory
     * region that we are testing. */
    fclose(fp);
}


jieforest · posted on 2012-12-6 15:14

CPU cache
===

The other problem we have is the CPU cache. If we try to write something to a given memory address, and read it back to check if it is ok, we are actually only stressing the CPU cache and never hitting the memory that is supposed to be tested.

Actually writing a memory test that bypasses the CPU cache, without the need to resort to CPU specific tricks (like memory type range registers), is easy:

1) Fill all the addressable memory from the first to the last with the pattern you are testing.
2) Check all the addressable memory, from the first to the last, to see if the pattern can be read back as expected.

Because we do it in two passes, as long as the size of the memory we are testing is larger than the CPU cache, we should be able to test the memory in a reliable way.
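The two passes can be sketched as follows. This is only an illustration of the idea, not the Redis implementation: memtest_fill_and_check is a hypothetical name, and the volatile qualifier is there so that the compiler does not optimize the reads away. It says nothing about the hardware cache, which is defeated only because the region is assumed to be larger than the cache.

```c
#include <assert.h>
#include <stddef.h>

/* Destructive two-pass test (hypothetical helper).
 * Pass 1 fills the whole region with the pattern, pass 2 reads everything
 * back. Returns the number of words that did not read back as written. */
size_t memtest_fill_and_check(volatile unsigned long *mem, size_t words,
                              unsigned long pattern) {
    size_t errors = 0;
    assert(mem != NULL);

    /* Pass 1: write the pattern from the first to the last location. */
    for (size_t i = 0; i < words; i++) mem[i] = pattern;

    /* Pass 2: only now read back, again from the first to the last. */
    for (size_t i = 0; i < words; i++)
        if (mem[i] != pattern) errors++;

    return errors;
}
```

Calling it once with a pattern and once with its bitwise complement exercises every bit in both states, at the price of overwriting whatever the region contained.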

But there is a problem: this test destroys the content of the memory, which is not acceptable in our case. Remember that we want to be able to provide a meaningful core dump, if needed, after the crash.

On the other hand, writing a memory test that does not destroy the memory content, but that is not able to bypass the cache, is also easy: for each location save its value on the stack, test the location by writing patterns and reading them back, and finally write the original value back to the tested location. However this test is completely useless as long as we are not able to disable the CPU cache.
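For comparison, this is roughly what such a location-by-location test looks like. It is a sketch with a hypothetical name (memtest_in_place), shown only to make the weakness visible: each value is read back immediately after being written, so the read is almost certainly served by the cache rather than by the memory chip.

```c
#include <assert.h>
#include <stddef.h>

/* Non-destructive test, one location at a time (hypothetical helper).
 * Preserves the content, but every check happens right after the write,
 * so in practice it exercises the CPU cache and not the RAM. */
size_t memtest_in_place(volatile unsigned long *mem, size_t words) {
    static const unsigned long patterns[] = { 0UL, ~0UL };
    size_t errors = 0;
    assert(mem != NULL);

    for (size_t i = 0; i < words; i++) {
        unsigned long saved = mem[i];       /* keep the original value */
        for (size_t p = 0; p < 2; p++) {
            mem[i] = patterns[p];           /* all zeros, then all ones */
            if (mem[i] != patterns[p]) errors++;
        }
        mem[i] = saved;                     /* restore the original value */
    }
    return errors;
}
```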

jieforest · posted on 2012-12-6 15:14

How to write a memory test that:

A) Is able to bypass the cache.
B) Does not destroy the memory content.
C) Is able, at least, to test every memory bit in the two possible states.

Well, that's what I asked myself during the past weekend, and I found a simple solution that works as expected (as tested on a computer with broken memory, thanks Kosma! See credits at the end of the post).

This is the algorithm:

1) Take a CRC64 checksum of the whole memory.
2) Invert the content of every location from the first to the last (by "invert" I mean bitwise complement, so that every 1 is turned into 0, and every 0 is turned into 1).
3) Swap the content of every pair of adjacent locations. So I swap the content at addresses 0 and 1, 2 and 3, ... and so forth.
4) Swap again (step 3).
5) Invert again (step 2).
6) Take a CRC64 checksum of the whole memory.
7) Swap again.
8) Swap again.
9) Take a CRC64 checksum of the whole memory again.

jieforest · posted on 2012-12-6 15:15

If the CRC64 values obtained at steps 1, 6, and 9 are not all the same, there is some memory error.
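The nine steps translate almost literally into code. The following is a sketch under stated assumptions, not the code used by Redis: the function names are hypothetical, locations are taken to be unsigned long words, and a 64-bit FNV-1a hash stands in for CRC64 simply to keep the example short and self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum: 64-bit FNV-1a over the bytes of every word.
 * The algorithm in the text uses CRC64; any good checksum fits here. */
static uint64_t checksum(const volatile unsigned long *m, size_t words) {
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < words; i++) {
        unsigned long v = m[i];
        for (size_t b = 0; b < sizeof(v); b++) {
            h ^= (v >> (8 * b)) & 0xff;
            h *= 1099511628211ULL;
        }
    }
    return h;
}

/* Bitwise complement of every location, first to last. */
static void invert(volatile unsigned long *m, size_t words) {
    for (size_t i = 0; i < words; i++) m[i] = ~m[i];
}

/* Swap locations 0 and 1, 2 and 3, and so forth. A trailing unpaired
 * location is left where it is. */
static void swap_adjacent(volatile unsigned long *m, size_t words) {
    for (size_t i = 0; i + 1 < words; i += 2) {
        unsigned long t = m[i];
        m[i] = m[i + 1];
        m[i + 1] = t;
    }
}

/* Returns 1 if the three checksums disagree (memory error), 0 otherwise.
 * On healthy memory the content is identical before and after the call. */
int memtest_preserving(volatile unsigned long *m, size_t words) {
    assert(m != NULL);
    uint64_t c1 = checksum(m, words);   /* step 1 */
    invert(m, words);                   /* step 2 */
    swap_adjacent(m, words);            /* step 3 */
    swap_adjacent(m, words);            /* step 4 */
    invert(m, words);                   /* step 5 */
    uint64_t c2 = checksum(m, words);   /* step 6 */
    swap_adjacent(m, words);            /* step 7 */
    swap_adjacent(m, words);            /* step 8 */
    uint64_t c3 = checksum(m, words);   /* step 9 */
    return c1 != c2 || c1 != c3;
}
```

Every pass walks the whole region from the first to the last location, so, as with the destructive test, the cache is only bypassed when the region is larger than the cache itself.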

Now let's check why this is supposed to work. It is trivial to see that, if memory is working as expected, after these steps I get back the original memory content, since I swap four times and invert twice. But what happens if there are memory errors?

Let's do a test considering memory locations of just two bits for simplicity. So I have something like:

01|11|11|00
      |
      +----- this bit is broken and is always set to 1.
(step 1: CRC64)
After step 2: 10|00|10|11 (note that the broken bit is still 1 instead of 0)
After step 3: 00|10|11|10 (adjacent locations swapped)
After step 4: 10|00|10|11 (swapped again)
After step 5: 01|11|11|00 (inverted again, so far, no errors detected)
(step 6: CRC64)
After step 7: 11|01|10|11
After step 8: 01|11|11|10 (error!)
(step 9: CRC64)


jieforest · posted on 2012-12-6 15:15

The CRC64 obtained at step 9 will differ.
Now let's check the case of a bit always set to 0.

01|11|01|00
      |
      +----- this bit is broken and is always set to 0.
(step 1: CRC64)
After step 2: 10|00|00|11 (invert)
After step 3: 00|10|01|00 (swap)
After step 4: 10|00|00|01 (swap)
After step 5: 01|11|01|10 (invert)
(step 6: CRC64)
After step 7: 11|01|00|01 (swap)
After step 8: 01|11|01|00 (swap)
(step 9: CRC64)


This time it is the CRC64 obtained at step 6 that will differ.
You can check what happens if you flip bits in the adjacent location, but at either step 6 or step 9 you should always be able to see a different checksum.
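If you prefer to let the computer do the checking, the two walkthroughs can be reproduced with a tiny simulation of four 2-bit locations where one bit is stuck. Everything here is a hypothetical model written for this illustration (struct sim_mem, sim_run and friends), and a plain packing of the four locations into one integer stands in for the CRC64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NLOC 4

/* Simulated memory: four 2-bit locations, one bit possibly stuck. */
struct sim_mem {
    uint8_t loc[NLOC];
    size_t bad_loc;     /* faulty location, or NLOC for healthy memory */
    uint8_t bad_mask;   /* faulty bit inside it: 0x2 (left) or 0x1 (right) */
    int stuck_value;    /* value the faulty bit is stuck at: 0 or 1 */
};

/* Every write goes through here, so the stuck bit always wins. */
static void sim_write(struct sim_mem *m, size_t i, uint8_t v) {
    v &= 0x3;
    if (i == m->bad_loc) {
        if (m->stuck_value) v |= m->bad_mask;
        else v &= (uint8_t)~m->bad_mask;
    }
    m->loc[i] = v;
}

static void sim_invert(struct sim_mem *m) {
    for (size_t i = 0; i < NLOC; i++) sim_write(m, i, (uint8_t)~m->loc[i]);
}

static void sim_swap(struct sim_mem *m) {
    for (size_t i = 0; i + 1 < NLOC; i += 2) {
        uint8_t a = m->loc[i], b = m->loc[i + 1];
        sim_write(m, i, b);
        sim_write(m, i + 1, a);
    }
}

/* Stand-in for the checksum: pack the four locations into one integer. */
static unsigned sim_sum(const struct sim_mem *m) {
    unsigned s = 0;
    for (size_t i = 0; i < NLOC; i++) s = (s << 2) | m->loc[i];
    return s;
}

/* Runs the steps and returns the first step (6 or 9) whose checksum
 * differs from the one taken at step 1, or 0 if no error shows up. */
int sim_run(struct sim_mem *m) {
    assert(m != NULL);
    unsigned c1 = sim_sum(m);                               /* step 1 */
    sim_invert(m); sim_swap(m); sim_swap(m); sim_invert(m); /* steps 2-5 */
    if (sim_sum(m) != c1) return 6;                         /* step 6 */
    sim_swap(m); sim_swap(m);                               /* steps 7-8 */
    if (sim_sum(m) != c1) return 9;                         /* step 9 */
    return 0;
}
```

With loc = {1, 3, 3, 0} (that is 01|11|11|00), bad_loc = 2, bad_mask = 0x2 and stuck_value = 1, sim_run() reports the mismatch at step 9. With loc = {1, 3, 1, 0} and stuck_value = 0 it reports it at step 6, matching the two tables above. Changing bad_loc, bad_mask and the initial content is an easy way to try the other cases.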

So basically this test does two things: first it amplifies the error using an adjacent location as a helper, then it uses checksums to detect the error. The steps are always performed as "read + write" operations acting sequentially from the first to the last memory location, to defeat the CPU cache as much as possible.

jieforest · posted on 2012-12-6 15:15

The kernel could do it better
===

After dealing with many crash reports that are actually due to memory errors, I'm starting to think that kernels are missing an incredible opportunity to make computers more reliable.

What Redis is doing could be done incrementally, a few pages per second, by the kernel with no impact on actual performance. And the kernel is in a particularly good position:

1) It could detect the error easily bypassing the cache.
2) It could perform more interesting value-retention tests, writing patterns to pages that will be reused much later in time, and checking that the pattern still matches before the page is reused.
3) The error could be logged in the system logs, making the user aware before a problem happens.
4) It could exclude the broken page from being used again, resulting in safer computing.

I hope to see something like that in the Linux kernel in the future.

jieforest · posted on 2012-12-7 14:02

The life of a lonely cosmic ray
===

A bit flipping at random is not a problem solely related to broken memory. Perfectly healthy memory is also subject, with a small probability, to bit flipping because of cosmic rays.

We are talking, of course, about non error correcting memory. The more costly ECC memory can correct single bit errors and can detect two-bit errors, halting the system. However many (most?) servers are currently using non-ECC memory, and it is not clear whether Amazon EC2 and other cloud providers use ECC memory or not (in my opinion it is not ok that this information is not clearly available, given the cost of services like EC2 and the possible implications of not using ECC memory).

According to a few sources, including IBM, Intel and Corsair, a computer with a few GB of non-ECC memory is likely to incur *several* memory errors every year.
Of course you can't detect errors with a memory test if the bit flip was caused by a cosmic ray hitting your memory, so to cut a long story short:

Reports of isolated crashes that are impossible to understand and reproduce, coming from users not using ECC memory, can't be taken as proof that there is a bug, even if the fast on-crash memory test passes, even if the more accurate redis --test-memory passes, and even if they run memtest86 for several days after the event.

Anyway, not every flipped bit is going to trigger a crash after all, because as somebody on Stack Overflow said:

"Most of the memory contains data, where the flip won't be that visiblp"

The bytes representing 'p' and 'e' are not just 1 bit apart, but I find the sentence funny anyway.