Wednesday, August 19, 2009

OS X panics

A recent kernel panic on my Macbook looks quite familiar - I think it's been happening now and then since I got the machine in late 2007. This time, and probably most of the other times, it was showing video on Youtube, which suggests there might be a bug in the video driver (or err... the video card :/), but there's not much information to base a confident guess on.

The crash report looks like:
Tue Aug 18 17:58:36 2009
panic(cpu 1 caller 0x00194B15): "pmap_flush_tlbs() timeout: "
"cpu(s) failing to respond to interrupts, pmap=0x46c7ae0 cpus_to_respond=0x1"
@/SourceCache/xnu/xnu-1228.12.14/osfmk/i386/pmap.c:4582
Backtrace (CPU 1), Frame : Return Address (4 potential args on stack)
0x343f3be8 : 0x12b4c6 (0x45ec20 0x343f3c1c 0x13355c 0x0)
0x343f3c38 : 0x194b15 (0x465018 0x46c7ae0 0x1 0x195234)
0x343f3ca8 : 0x197eb4 (0x46c7ae0 0x26b46000 0x0 0x0)
0x343f3d88 : 0x16087d (0x46c7ae0 0x26b46000 0x0 0x71f9)
0x343f3de8 : 0x163170 (0x1c67c40 0x46c7ae0 0x26b46000 0x0)
0x343f3f58 : 0x1ab39c (0x4f7983c 0x26b46000 0x0 0x3)
0x343f3fc8 : 0x1a15fd (0x68b8900 0x0 0x1a40b5 0x68b8900)
No mapping exists for frame pointer
Backtrace terminated-invalid frame pointer 0xbfff7e88

BSD process name corresponding to current thread: firefox-bin

Mac OS version:
9J61

Kernel version:
Darwin Kernel Version 9.7.0: Tue Mar 31 22:52:17 PDT 2009;
root:xnu-1228.12.14~1/RELEASE_I386
System model name: MacBook2,1 (Mac-F4208CAA)

System uptime in nanoseconds: 810215756731744


A quick Google turned up mostly useless assertions on forums like "it's probably a faulty logic board or memory stick - reboot into single user mode and run memtest". In my experience, when memory goes bad, it's much more impressive than an occasional (something like once every month or two) panic. Searching a bit more carefully, I turned up some kernel source code released by Apple (cool!):

xnu-1228/osfmk/i386/pmap.c -> pmap_flush_tlbs(...)

if (cpus_to_signal) {
deadline = mach_absolute_time() + LockTimeOut;
/*
* Wait for those other cpus to acknowledge
*/
for (cpu = 0, cpu_bit = 1; cpu < real_ncpus; cpu++, cpu_bit <<= 1) {
while ((cpus_to_signal & cpu_bit) != 0) {
if (!cpu_datap(cpu)->cpu_running ||
cpu_datap(cpu)->cpu_tlb_invalid == FALSE ||
!CPU_CR3_IS_ACTIVE(cpu)) {
cpus_to_signal &= ~cpu_bit;
break;
}
if (mach_absolute_time() > deadline) {
force_immediate_debugger_NMI = TRUE;
panic("pmap_flush_tlbs() timeout: "
"cpu %d failing to respond to interrupts, pmap=%p cpus_to_signal=%lx",
cpu, pmap, cpus_to_signal);
}
cpu_pause();
}
if (cpus_to_signal == 0)
break;
}
}


I won't make any pretense at understanding most of this, but it looks like, for that panic to occur, the condition (!cpu_datap(cpu)->cpu_running || cpu_datap(cpu)->cpu_tlb_invalid == FALSE || !CPU_CR3_IS_ACTIVE(cpu)) is never met, so we keep waiting and checking until our deadline passes.
The timeout appears to be 12500000 mach time units, which apparently are equivalent to nanoseconds on my laptop - so 0.0125s, or 1/80th of a second.

So if we revisit that condition, we can assume that we get to a stage where, for at least 1/80th of a second, the inverse of the test is always true (or happens to be, every time it's checked - could be a race condition I suppose?): (cpu_running && cpu_tlb_invalid && CPU_CR3_IS_ACTIVE(cpu))...

I'm too lazy to look up what CPU_CR3_IS_ACTIVE(cpu) (a macro that expands to ("(cpu_datap(cpu)->cpu_active_cr3 & 1) == 0)") signifies, but the whole CR3 mess is some kind of i386 control register for dealing with virtual memory paging. Beyond these most basic observations, I'm pretty clueless about the problem, so would love to hear from anyone if you have any ideas.

2 comments:

  1. The newer version of the kernel (xnu-1486.2.11) has a better explanation as to whats going on.

    From pmap.c:

    ...

    /*
    * Called with pmap locked, we:
    * - scan through per-cpu data to see which other cpus need to flush
    * - send an IPI to each non-idle cpu to be flushed
    * - wait for all to signal back that they are inactive or we see that
    * they are in an interrupt handler or at a safe point
    * - flush the local tlb is active for this pmap
    * - return ... the caller will unlock the pmap
    */
    void
    pmap_flush_tlbs(pmap_t pmap)
    {

    ...
    // signal cpu to flush tlb
    if (CPU_CR3_IS_ACTIVE(cpu)) {
    cpus_to_signal |= cpu_bit;
    i386_signal_cpu(cpu, MP_TLB_FLUSH, ASYNC);
    }

    ...

    // check again to see if CR3 is active (as in the code illustrated in your post)

    Info here:
    http://en.wikipedia.org/wiki/Inter-processor_interrupt
    http://en.wikipedia.org/wiki/Translation_lookaside_buffer
    http://en.wikipedia.org/wiki/Control_register#CR3

    So what seems to be happening that during a tlb flush, either the cpu being signaled is not receiving or losing the interrupt being sent to it.

    Read what Linus Torvalds has to say about xAPIC and edge-triggered interrupts:
    http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/5347967.html

    I think IPIs are edge triggered. If this is true then it could explain the non-determinism of the KP.

    Anyway all this is just from source code and google... I was researching this for a friend who was getting a similar sort of error on boot on an OSX86 setup. Hope it is of some interest.

    Regards,
    echoetch

    ReplyDelete
  2. @echoetch:

    Thanks, that's some interesting info! I particularly like this quote in Torvalds' post:

    "Now, edge-triggered interrupts are a _lot_ harder to mask, because the
    Intel APIC is an unbelievable piece of sh*t, and has the edge-detect logic
    _before_ the mask logic, so if a edge happens _while_ the device is
    masked, you'll never ever see the edge ever again (unmasking will not
    cause a new edge, so you simply lost the interrupt)."

    Maybe something like that is going on.
    That said, it don't think it's happened since my hard disk crashed a little while ago and I clean installed Leopard onto a new one - maybe it was fixed somewhere along the line.

    ReplyDelete