Monday, August 8, 2011

Ndk-gdb and service processes don't (currently) mix

We got a bug report from one of our partners: they had two almost identical Android projects, but one of them was debuggable and the other one wasn't. After checking all the usual culprits, we finally narrowed it down to a single line in the manifest of the undebuggable project (names have been changed, obviously):

<service android:name="com.somemiddleware.SomeUsefulService" android:process=":com.somemiddleware.useful.process" />

It was, of course, very tempting to blame the SomeMiddleware company and its purportedly Useful service, but it wasn't their fault at all. Instead, the culprit turned out to be a very small error in one of the NDK's awk scripts. The script that finds the pid of the process you're supposed to be debugging (extract-pid.awk) was, for various reasons, doing a substring match (an unanchored pattern match) on the process name. This almost always works, because there's almost always only one process whose name contains your package name at any one time. But it doesn't always work.

In this case we were bitten by a quirk of the Android service architecture. The manifest is allowed to specify a process name under which to run the service by using the android:process attribute. Normally this just specifies the exact name of the process that the service should run under. But there's a special case: if the process name so specified begins with a colon, then the service process is considered "private" and its name is appended to that of the main process. So instead of running as "com.somemiddleware.useful.process", the service runs as "com.mycompany.mygame:com.somemiddleware.useful.process".
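To make the naming rule concrete, here is a minimal Python sketch of it. The function name effective_process_name is made up for illustration, and the logic is a simplification of the rule described above, not Android's actual implementation.

```python
from typing import Optional


def effective_process_name(package: str, process_attr: Optional[str]) -> str:
    """Approximate the process name a service ends up running under.

    Hypothetical helper: a simplified model of the android:process rule,
    not code taken from Android itself.
    """
    if not process_attr:
        # No android:process attribute: the service runs in the main process.
        return package
    if process_attr.startswith(":"):
        # Leading colon: private process, name is appended to the package name.
        return package + process_attr
    # Otherwise the attribute is used as the exact process name.
    return process_attr


private_name = effective_process_name(
    "com.mycompany.mygame", ":com.somemiddleware.useful.process"
)
print(private_name)
```

With the manifest line from our partner's project, this yields the long colon-joined name, which still contains the full package name of the game.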

See the problem? Now we have two processes that match our package name: our main game process, which runs as "com.mycompany.mygame", and the middleware service, which runs as "com.mycompany.mygame:com.somemiddleware.useful.process". Guess which one gets matched by the ndk-gdb awk script?
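The ambiguity is easy to reproduce outside of awk. The following is a rough Python sketch, not the NDK script itself: the process listing is invented sample data, and it assumes the process name is the ninth whitespace-separated field (as the $9 in the script suggests) and the pid is the second.

```python
import re

# Invented sample listing; the user, pid, and size values are made up.
PS_OUTPUT = """\
USER     PID   PPID  VSIZE  RSS   WCHAN    PC         S NAME
app_42   1201  68    150000 30000 ffffffff 00000000   S com.mycompany.mygame
app_42   1234  68    120000 20000 ffffffff 00000000   S com.mycompany.mygame:com.somemiddleware.useful.process
"""

PACKAGE = "com.mycompany.mygame"


def pids_by_pattern(ps_output, package):
    """Collect pids whose name field contains the package name (unanchored)."""
    pattern = re.compile(re.escape(package))
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) >= 9 and pattern.search(fields[8]):
            pids.append(fields[1])
    return pids


def pids_by_equality(ps_output, package):
    """Collect pids whose name field is exactly the package name."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[8] == package:
            pids.append(fields[1])
    return pids


print(pids_by_pattern(PS_OUTPUT, PACKAGE))   # ['1201', '1234']
print(pids_by_equality(PS_OUTPUT, PACKAGE))  # ['1201']
```

The pattern match finds two candidates, so which pid the debugger attaches to depends on how the script picks among them. The exact comparison leaves only the main game process.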

The fix to extract-pid.awk is simple, and I've already submitted a patch, so hopefully that'll get accepted and shipped in the next NDK. Meanwhile, if you happen to run into this bug, you can either patch the script yourself, or you can just eliminate the colon from the beginning of your service process name while you're debugging.

If you're interested, here's a GNU diff version of the patch:

< # NOTE: For some reason, simply using $9 == PACKAGE does not work
< #       with this script, so use pattern matching instead.
< #
>     RS = "\r\n"
<     gsub("\\.","\\.",PACKAGE)
>     #gsub("\\.","\\.",PACKAGE)
< $9 ~ PACKAGE {
> $9 == PACKAGE {
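The patch doesn't say why it also sets RS. One plausible reading, and it is only an assumption, is that the process listing arrives with "\r\n" line endings, so with the default record separator the last field keeps a trailing carriage return and an exact comparison fails. The Python sketch below imitates that situation; the helper awk_fields and the sample line are made up for illustration.

```python
import re

PACKAGE = "com.mycompany.mygame"

# Invented sample record, assumed to end with a carriage return plus newline.
RAW = "app_42 1201 68 150000 30000 ffffffff 00000000 S com.mycompany.mygame\r\n"


def awk_fields(record):
    """Split on runs of spaces, tabs, and newlines, roughly like awk's default FS.

    A carriage return is deliberately NOT treated as a separator here.
    """
    return [f for f in re.split(r"[ \t\n]+", record) if f]


# Records split on "\n" only: the "\r" stays glued to the last field.
name_lf = awk_fields(RAW.split("\n")[0])[8]

# Records split on "\r\n", as RS = "\r\n" would do: the field is clean.
name_crlf = awk_fields(RAW.split("\r\n")[0])[8]

print(repr(name_lf), name_lf == PACKAGE)
print(repr(name_crlf), name_crlf == PACKAGE)
```

If that reading is right, it would also explain the old NOTE in the script: a pattern match tolerates the stray carriage return, while an equality test does not until the record separator is fixed.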