Saturday, April 9, 2011

Limiting page faults on the HTC Desire HD

After several happy weeks with my HTC Desire HD, I made a rather shocking discovery this week:

HTC put a hack into the kernel that kills a process once it has triggered more than 10 page faults that end in a SIGSEGV!

It was really hard to believe until I discovered this in arch/arm/mm/fault.c:

void
__do_user_fault(struct task_struct *tsk, unsigned long addr,
                unsigned int fsr, unsigned int sig, int code,
                struct pt_regs *regs)
{
        struct siginfo si;
        struct task_struct *g, *p, *selected = NULL;

#ifdef CONFIG_DEBUG_USER
        if (user_debug & UDBG_SEGV) {
                printk(KERN_DEBUG "%s: unhandled page fault (%d) at 0x%08lx, code 0x%03x\n",
                       tsk->comm, sig, addr, fsr);
                show_pte(tsk->mm, addr);
                show_regs(regs);
        }
#endif
        if (sig == SIGSEGV)
                tsk->segfault_count++;

        if (tsk->segfault_count > 10) {
                tsk->segfault_count = 0;
                printk(KERN_ERR "unhandled page fault at 0x%08lx, code 0x%03x\n",
                        addr, fsr);
                show_pte(tsk->mm, addr);
                show_regs(regs);

                do_each_thread(g, p) {
                        task_lock(p);
                        if (p == tsk)
                                selected = g;
                        task_unlock(p);
                } while_each_thread(g, p);

                if (selected) {
                        printk(KERN_ERR "%s: triggered too many segfaults, force killing parent: %s\n",
                                tsk->comm, selected->comm);
                        force_sig(SIGKILL, selected);
                        return;
                }
        }

        tsk->thread.address = addr;
        tsk->thread.error_code = fsr;
        tsk->thread.trap_no = 14;
        si.si_signo = sig;
        si.si_errno = 0;
        si.si_code = code;
        si.si_addr = (void __user *)addr;
        force_sig_info(sig, &si, tsk);
}
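
To make the threshold concrete, here is a small user-space model of just the counting logic above. It is only a sketch: `task_model`, `fault_kills_process` and `SEGFAULT_LIMIT` are names made up for illustration, and it assumes the lookup of `selected` always succeeds.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical name for the hard-coded "10" in the kernel snippet. */
#define SEGFAULT_LIMIT 10

/* Hypothetical stand-in for the one task_struct field that matters here. */
struct task_model {
    int segfault_count;
};

/*
 * Mirrors the counting in __do_user_fault: only SIGSEGV increments the
 * counter, and once it exceeds the limit the counter is reset and the
 * process is force-killed. Returns true when that kill would happen.
 */
bool fault_kills_process(struct task_model *t, int sig)
{
    if (sig == SIGSEGV)
        t->segfault_count++;

    if (t->segfault_count > SEGFAULT_LIMIT) {
        t->segfault_count = 0;
        return true;   /* the kernel would send SIGKILL here */
    }
    return false;      /* normal delivery of the signal to the task */
}
```

With this model, the first ten SIGSEGVs are delivered normally and the eleventh is the one that gets the process killed, no matter whether the application has installed a handler for it.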

However, looking at the kernel sources wasn't as easy as it may sound - first, you have to actually get them. One of my coworkers pointed me at this interesting article:
http://www.freedom-to-tinker.com/blog/sjs/htc-willfully-violates-gpl-t-mobiles-new-g2-android-phone - and there's a link in it to a download location.

My kernel is 2.6.32.21-g1e30168, but for some reason, that download doesn't work from Germany.

The entire story started about a month ago, when I discovered some very weird crashes while debugging on my device that nobody else on the team had. It was extremely frustrating and I kept thinking: why me, what am I doing wrong here?

After some investigation, it looked very much like something during variable evaluation was killing the app. We could insert a breakpoint and run to it, but once it stopped at the breakpoint and I tried to evaluate some variables, the app crashed ... silently, without anything in adb logcat.

I wrote a simple test app and a soft debugger client application (both can be found in the martins-playground module on GitHub) and soon discovered that we were only crashing during single-threaded invokes.

So what's the difference here? Well, there's this bug in SDB - single-stepping isn't always disabled during single-threaded invokes. However, this should only impact performance and never cause the app to actually crash. But it gave me the idea ...

As a next step, I wrote a small native C application called NativeTest which installs a SIGSEGV handler and then creates 100 page faults in a loop.
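
A test along those lines could look roughly like the following. This is a minimal sketch and not the actual NativeTest source: the names `trigger_faults`, `on_segv` and `null_ptr` are made up here, and recovering from the fault via `sigsetjmp`/`siglongjmp` is just one assumed way to keep the loop going.

```c
#define _POSIX_C_SOURCE 200809L

#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t faults_handled;

/* Volatile pointer so the compiler cannot optimize the faulting store away. */
static volatile int *volatile null_ptr = NULL;

/* SIGSEGV handler: count the fault and jump back into the loop. */
static void on_segv(int sig)
{
    (void)sig;
    faults_handled++;
    siglongjmp(recover_point, 1);
}

/*
 * Installs the handler, then writes through a NULL pointer `count` times.
 * Returns the number of faults the handler saw, or -1 if the handler
 * could not be installed.
 */
int trigger_faults(int count)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    faults_handled = 0;
    for (int i = 0; i < count; i++) {
        /* savemask = 1 so SIGSEGV is unblocked again after the jump */
        if (sigsetjmp(recover_point, 1) == 0)
            *null_ptr = i;   /* page fault, delivered as SIGSEGV */
    }
    return (int)faults_handled;
}
```

Calling `trigger_faults(100)` from a `main` function should return 100 on a stock Linux kernel. With the counter shown in fault.c above, I would expect the process to be killed on the 11th fault instead.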

The same thing can also be accomplished with something like this:


    for (int i = 0; i < 1000; i++) {
        try {
            object o = null;
            o.GetType ();
        } catch {
        }
    }

Compile something like that with Mono for Android and execute it on the device - it should not crash.

Luckily, Mono's JIT already has a feature called "explicit null checks" and I also have a patch for the Soft Debugger to check some variable instead of using page faults for single-step and breakpoint events.
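
The idea behind an explicit null check can be sketched in C like this. It is not the code Mono's JIT actually emits; `struct object`, `get_type_checked` and the status codes are hypothetical names used only to show the principle: test the pointer before dereferencing it, so the kernel never sees a page fault.

```c
#include <stddef.h>

/* Hypothetical stand-in for a managed object with a type field. */
struct object {
    int type_id;
};

/* Hypothetical status codes; a real runtime would throw an exception. */
enum {
    STATUS_OK = 0,
    STATUS_NULL_REFERENCE = 1
};

/*
 * Explicit null check: compare the pointer against NULL up front and
 * report the error, instead of dereferencing it and relying on a
 * SIGSEGV handler to turn the page fault into an exception.
 */
int get_type_checked(const struct object *o, int *type_out)
{
    if (o == NULL)
        return STATUS_NULL_REFERENCE;

    *type_out = o->type_id;
    return STATUS_OK;
}
```

The price is an extra compare and branch on every access, but nothing here ever increments the fault counter in the kernel.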

However, I'm not entirely sure whether this really covers all possible scenarios where a page fault would normally be handled gracefully. And seeing something like this in their kernel also makes me a bit worried that there might be other surprises.