Monday, September 19, 2011

Bug in LockFreePipe

Dmitry Vyukov, one of my fellow engineers at Google, pointed out a bug in the implementation of LockFreePipe that you can find here and there on the internet, for example here. That reminded me that Bruce Dawson (who coauthored the original with me) had discovered this bug a while ago and tried to explain it to me, but at the time we didn't have the code in front of us and I didn't completely grok what he was trying to tell me.

Anyway, this particular bug is simple: in Read(), after the call to memcpy(), there should be a barrier before the read pointer is updated. Otherwise the read pointer update can become visible before the memcpy is finished, giving the writer an opportunity to overwrite the memory before it's read.

        unsigned long cbTailBytes = min( bytesLeft, c_cbBufferSize - actualReadOffset );
#ifdef _XBOX_VER
        XMemCpy( pbDest, m_pbBuffer + actualReadOffset, cbTailBytes );
        memcpy( pbDest, m_pbBuffer + actualReadOffset, cbTailBytes );
        bytesLeft -= cbTailBytes;

        if( bytesLeft )
#ifdef _XBOX_VER
            XMemCpy( pbDest + cbTailBytes, m_pbBuffer, bytesLeft );
            memcpy( pbDest + cbTailBytes, m_pbBuffer, bytesLeft );

// OOPS: no barrier here. These operations could be reordered
// by the compiler or by the CPU. Should have lwsync or _ReadWriteBarrier here.
        readOffset += cbDest;
        m_readOffset = readOffset;

Dmitry also pointed out that a store-release is meaningless without a load-acquire, so there should also be a barrier in Write() after m_readOffset is read:

   bool __forceinline          Write( const void* pvSrc, unsigned long cbSrc )
        // Reading the read offset here has the same caveats as reading
        // the write offset had in the Read() function above. 
        DWORD readOffset = m_readOffset;
// OOPS: need lwsync or _ReadWriteBarrier here if we want it to work
        DWORD writeOffset = m_writeOffset;

Needless to say, I'm humiliated :-) and curious to know why this worked so well in the past. My first guess was that MSVC did not reorder because the read pointer was marked volatile. Of course, that raises the question of whether any of our barriers were necessary, if volatile is so magic. Well, yes, they were; MSVC volatile semantics at the time prevented compiler reordering, but didn't insert any barriers at the CPU level. The write could still have become visible any time before the read was issued. So why didn't it?

I'd say we got lucky. Just because the CPU can reorder stores before reads, doesn't mean that it does. There's a good reason for the PowerPC to delay stores--it's something that can usually be deferred until later, because in general nobody's really waiting for a store to complete. (If the data is getting reused, it's probably in one of the PPC's copious number of registers. And if it's not... see "load-hit-store.") There's seldom a good reason for a read to get delayed, because the read is very frequently a dependency of the next instruction in the queue. I'm speaking very generally and in a fashion that involves waving my hands quite a lot, but the point is that the window for failure on this is pretty small. Not, however, small enough to be nonexistent, which is what you need if you're doing lock-free work. Bummer.

The sad thing is that I really like this class. It's small and lightweight, doesn't do any allocation, and is basically perfect for things like passing data in and out of audio routines where blocking would cause glitches. I originally wrote it to dump sample data out of custom DSP routines on the Xbox 360, and it worked awesomely well for that purpose. But after talking to Dmitry, I realize that it really only worked  as well as it did because of quirks in the 360's CPU architecture. It's not a generic piece of code.

In any case, if you're using this class, please fix it by inserting memory barriers as described in the code snippet above. And stay tuned--there may be other bugs that I haven't found yet.

Tuesday, September 6, 2011

Checking In...

I haven't been good about posting to this blog lately, probably because I originally started it to talk about the Android NDK, and I haven't been doing much of that lately. I've been jumping around to different parts of Android, trying to learn more about the entire system. But I've made what Tim Bray says is a fundamental blogging mistake, which is to only write about what you think your audience wants to hear. Oops. I guess I should really start writing about what I've been doing, whether or not it's NDK-related (or even Android related).

What I've mostly been working on, besides the inevitable "partner support" email queue, is a Google Tasks app. Yeah, I know, not groundbreaking. But all of the apps I've found so far do it wrong. They don't authenticate with my Google account token--instead, they ask me for my actual Google credentials, which I'm not going to give them. And they don't sync correctly. I want a Tasks app that's as solid as the Gmail app, and since I have the code to the Gmail app, I think I ought to be able to do it. I've got most of the account management stuff worked out, and I have access to the GTasks API (which is, by the way, kind of incomplete), and now I'm writing a ContentProvider so that I can hook it to a SyncAdapter and make background sync work. That's taken me down a path of inventing a lightweight ORM so that I can write a ContentProvider without having to do a whole lot of redundant keyboarding. (I almost said "typing," but that's an overloaded term in this context.)

So that's probably what I'll be writing about for the next little while: my quixotic attempt to do a simple Android app The Right Way. I want it to be a shining example in every possible way. I've already found a few places where that's going to be impossible, because no Right Way exists. But that's a subject for another post...