gberry added a comment.

One other question: have you explored vectorizing this recurrence as a shuffle+insertelement instead? That would avoid the need for any extra memory dependency checking, and would avoid introducing more loads in the loop.

http://reviews.llvm.org/D16197
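
As a rough illustration of the shuffle+insertelement idea (a sketch only, not code from the patch): assume the loop computes `b[i] = a[i] + a[i-1]`, with a scalar `init` standing in for `a[-1]`. The loop shape, the vector width `VF`, and the helper names `insert_last` and `splice` are all hypothetical; they model in plain C++ what `insertelement` and `shufflevector` would do in IR. The point is that the vector body issues one load per iteration and carries the previous vector in a register, so no second load of `a[i-1]` is needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4;  // assumed vector width, for illustration only
using Vec = std::array<int, VF>;

// Scalar reference: b[i] = a[i] + a[i-1], where `init` plays the role of a[-1].
std::vector<int> scalar_recurrence(const std::vector<int>& a, int init) {
    std::vector<int> b(a.size());
    int prev = init;
    for (std::size_t i = 0; i < a.size(); ++i) {
        b[i] = a[i] + prev;
        prev = a[i];
    }
    return b;
}

// Models insertelement: place the scalar in the last lane; other lanes are unused.
Vec insert_last(int x) {
    Vec v{};
    v[VF - 1] = x;
    return v;
}

// Models shufflevector with mask <VF-1, VF, ..., 2*VF-2>:
// the last lane of `prev` followed by the first VF-1 lanes of `cur`.
Vec splice(const Vec& prev, const Vec& cur) {
    Vec out{};
    out[0] = prev[VF - 1];
    for (std::size_t lane = 1; lane < VF; ++lane) {
        out[lane] = cur[lane - 1];
    }
    return out;
}

// Vector-style loop: one load of a[] per iteration; the recurrence is carried
// in `carried` rather than re-loaded from memory.
std::vector<int> vector_recurrence(const std::vector<int>& a, int init) {
    assert(a.size() % VF == 0);  // sketch ignores the scalar remainder loop
    std::vector<int> b(a.size());
    Vec carried = insert_last(init);  // preheader: seed the recurrence
    for (std::size_t i = 0; i < a.size(); i += VF) {
        Vec cur{};
        for (std::size_t lane = 0; lane < VF; ++lane) {
            cur[lane] = a[i + lane];  // the only load in the loop body
        }
        Vec shifted = splice(carried, cur);  // a[i-1 .. i+VF-2]
        for (std::size_t lane = 0; lane < VF; ++lane) {
            b[i + lane] = cur[lane] + shifted[lane];
        }
        carried = cur;  // becomes the "previous" vector next iteration
    }
    return b;
}
```

Because the shifted operand is built from values already in registers, this form should not require reasoning about whether the extra `a[i-1]` load aliases any store in the loop.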