Friday, March 25, 2016

kzalloc Crash

<< Prev: Thin Air       Next: Linux Upgrade >>


So there I am, merrily testing incremental changes to the driver, when BOOM! The test system restarted.

OK, so it must have been something I just did.

Except, all I just did was rearrange some code.  I still had a bunch of C functions and structs in my C++ class, and I was in the process of eliminating them in favor of the default versions included in the driver code.  I could see compile errors, but how did this cause a runtime error?

Well, one thing I had done was strip out a lot of the debugging messages I had added to the firmware parsing logic. I didn't really think that had caused the failure, but it meant I didn't know how far the process got before it died. So I added all those debug messages back in, along with a little delay at each step so I could read the messages before the restart.

Built: fine. Sent to test machine: fine. Loaded kext: fine. Log messages started pouring out. Everything looked fine. Then it finished, all good.

It worked. And that was the worst, because how could adding log messages or delays fix the problem? If it was one of those completely random errors, I was going to be devastated, because I'm not good enough to track that kind of thing down yet. I couldn't even begin to guess what I might have done to introduce a timing-sensitive or threading-sensitive kind of error.

I unloaded the kext, made some trivial change to add still more debugging, and tried again.

This time it failed, with loads of errors in the firmware parsing. It was fortunate that I was viewing the log remotely, so when the machine died, I still had the text in my terminal:

...
AppleIntelWiFiMVM       0xbcx 19 IWL_UCODE_TLV_SEC_RT
AppleIntelWiFiMVM iwl_store_ucode_sec BAD COUNTER -1912832000
...
AppleIntelWiFiMVM    0xa7428x 20 IWL_UCODE_TLV_SEC_INIT
AppleIntelWiFiMVM iwl_store_ucode_sec BAD COUNTER -1810778608
...
AppleIntelWiFiMVM   0x19343cx 21 IWL_UCODE_TLV_SEC_WOWLAN
AppleIntelWiFiMVM iwl_store_ucode_sec BAD COUNTER 1449124394
...
AppleIntelWiFiMVM Allocate ucode buffers...
AppleIntelWiFiMVM  ERR failed to allocate pci memory
AppleIntelWiFiMVM Firmware parse failed :(

So I poked around a little, starting with these counters with completely bizarre values. Indeed, they were supposed to just increment by one each time a relevant block was found in the firmware. How do you add one by one with a meaningful size file and end up with -1912832000? Except another one was 1449124394. What the heck?

Oh. Oh yeah. How about, where is that counter initialized?

Nowhere, that's where.

So in Java, there are two possibilities for an int. Number one it's a variable automatically initialized to zero (such as an instance variable), or number two you're forced to manually initialize it (such as a method variable). Either way, the compiler has your back.

In C, I guess, not so much.

As best I can figure, the code was assuming the counter was initialized to zero. But if that wasn't done automatically... Then it must have been assuming the memory it received was all zeroes, such that whatever memory the counter int came from started out being zero.

I backed up farther, until I found where I had replaced kzalloc with IOMalloc. IOMalloc doesn't seem to guarantee that the memory will be zeroed. It turns out that kzalloc does. The "z" in kzalloc is not the zone allocator, but the fact that the memory will be zeroed.

Was this plausible? Well, when I booted and loaded the kext, it received clean memory and worked fine. When I then unloaded it and loaded it again, I already know that the memory I get is coming from a zone allocator. So let's imagine it allocates some page of memory, issues chunks from that page to my kext, and then I give them back on unload. Then I load again, asking for chunks of the same size again. There's the page still there, and now with several chunks marked available. My kext gets some of the same chunks back again, or maybe some that some other program used, but either way no longer clean -- they have whatever state they had when freed before. So on the second load, those counters are initialized to garbage contents, and my kext blows up. Pretty much seems like what happened, so yeah, it's plausible.

Then I had to poke around for a memory allocation routine that does zero the memory on OS X. It doesn't help that there are about seventeen options (including at least, kmem_alloc and friends, kalloc and friends, zalloc, OSMalloc and friends, MALLOC, MALLOC_ZONE, IOMalloc and friends, and in C++ code, new). I finally settled on using the MALLOC macro:
MALLOC(pieces, struct iwl_firmware_pieces *,
               sizeof (*pieces), M_TEMP, M_ZERO);

The M_TEMP is because the call requires a "type" and that's a sane default. The M_ZERO calls for zeroed memory. As a bonus, I don't need to pass the allocation size to the FREE macro, so I could whack the extra code I added just to track allocation sizes.

I tried with my new and improved memory allocation, and I could load and unload several times with no problem. It's hard to prove that was the problem and solution, but I haven't seen the same bizarre counter errors since then.

The only issue is that MALLOC requires a parameter of the pointer type, which kzalloc doesn't. So I can't just make the substitution with another macro, I have to put in an actual code change every time. It's possible I should consider using a different allocation and free call combined with memset or something to make this a macro fix instead of a code fix -- but that's a subject for another time.

Oh yeah, and, who thought it was a good idea for an operating system to have this many ways to allocate memory? Sheesh!

<< Prev: Thin Air       Next: Linux Upgrade >>

No comments:

Post a Comment