Doing this optimization so late was looking increasingly brittle, since it has to interact with AtomicExpandPass, which runs earlier at the IR level. I reimplemented it at that level in D5422, and it turned out a lot cleaner. http://reviews.llvm.org/D5091