[llvm] r263982 - AMDGPU: Add SIWholeQuadMode pass

Tom Stellard via llvm-commits <llvm-commits at lists.llvm.org>
Tue Mar 22 10:58:24 PDT 2016


On Mon, Mar 21, 2016 at 06:01:11PM -0500, Nicolai Hähnle wrote:
> Hi Tom,
> 
> I have committed r264000 without review to fix this issue after 
> re-running all tests. It fixes Valgrind issues that mirror the sanitizer 
> reports.
> 
> I hope this is okay with you.
> 

Yeah, that's no problem.

> Thanks,
> Nicolai
> 
> On 21.03.2016 17:39, Nicolai Hähnle wrote:
> > I'm looking into it now.
> >
> > On 21.03.2016 17:26, Vitaly Buka wrote:
> >> Probably caused by this CL:
> >> http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/11271
> >>
> >>
> >> http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/11271/steps/check-llvm%20msan/logs/stdio
> >>
> >> http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/11271/steps/check-llvm%20asan/logs/stdio
> >>
> >>
> >> On Mon, Mar 21, 2016 at 1:33 PM Nicolai Haehnle via llvm-commits
> >> <llvm-commits at lists.llvm.org> wrote:
> >>
> >>     Author: nha
> >>     Date: Mon Mar 21 15:28:33 2016
> >>     New Revision: 263982
> >>
> >>     URL: http://llvm.org/viewvc/llvm-project?rev=263982&view=rev
> >>     Log:
> >>     AMDGPU: Add SIWholeQuadMode pass
> >>
> >>     Summary:
> >>     Whole quad mode is already enabled for pixel shaders that compute
> >>     derivatives, but it must be suspended for instructions that cause a
> >>     shader to have side effects (i.e. stores and atomics).
> >>
> >>     This pass addresses the issue by storing the real (initial) live mask
> >>     in a register, masking EXEC before instructions that require exact
> >>     execution and (re-)enabling WQM where required.
> >>
> >>     This pass is run before register coalescing so that we can use
> >>     machine SSA for analysis.
> >>
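> >>     As an illustration (a sketch that simply mirrors the BuildMI calls
> >>     in SIWholeQuadMode::runOnMachineFunction further down, with the
> >>     surrounding pass boilerplate omitted), saving the live mask and
> >>     entering WQM looks like:
> >>
> >>       // Save the initial EXEC as the real live mask; exact sections are
> >>       // later carved out of it with S_AND_SAVEEXEC_B64.
> >>       unsigned LiveMaskReg = MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);
> >>       BuildMI(Entry, EntryMI, DebugLoc(), TII->get(AMDGPU::COPY), LiveMaskReg)
> >>           .addReg(AMDGPU::EXEC);
> >>       // When the whole shader needs WQM, a single S_WQM_B64 suffices:
> >>       BuildMI(Entry, EntryMI, DebugLoc(), TII->get(AMDGPU::S_WQM_B64),
> >>               AMDGPU::EXEC)
> >>           .addReg(AMDGPU::EXEC);
> >>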
> >>     The changes in this patch expose a problem with the second machine
> >>     scheduling pass: target-independent instructions like COPY implicitly
> >>     use EXEC when they operate on VGPRs, but this fact is not encoded in
> >>     the MIR. This can lead to miscompilation because instructions are
> >>     moved past changes to EXEC.
> >>
> >>     This patch fixes the problem by adding use-implicit operands to
> >>     target-independent instructions. Some general codegen passes are
> >>     relaxed to work with such implicit use operands.
> >>
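> >>     For illustration, a hypothetical sketch (all helper names and the
> >>     surrounding context assumed; the diff below additionally guards the
> >>     scheduler via SIInstrInfo::isSchedulingBoundary) of attaching such
> >>     an operand to a COPY that writes a VGPR:
> >>
> >>       // Hypothetical sketch: make the EXEC dependency of a VGPR COPY
> >>       // explicit so passes cannot move it across EXEC changes.
> >>       unsigned DstReg = MI.getOperand(0).getReg();
> >>       if (MI.isCopy() && TargetRegisterInfo::isVirtualRegister(DstReg) &&
> >>           TRI->hasVGPRs(MRI.getRegClass(DstReg)))
> >>         MI.addOperand(MF, MachineOperand::CreateReg(AMDGPU::EXEC,
> >>                                                     /*isDef=*/false,
> >>                                                     /*isImp=*/true));
> >>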
> >>     Reviewers: arsenm, tstellarAMD, mareko
> >>
> >>     Subscribers: MatzeB, arsenm, llvm-commits
> >>
> >>     Differential Revision: http://reviews.llvm.org/D18162
> >>
> >>     Added:
> >>          llvm/trunk/lib/Target/AMDGPU/SIWholeQuadMode.cpp
> >>          llvm/trunk/test/CodeGen/AMDGPU/wqm.ll
> >>     Modified:
> >>          llvm/trunk/lib/Target/AMDGPU/AMDGPU.h
> >>          llvm/trunk/lib/Target/AMDGPU/AMDGPUInstrInfo.h
> >>          llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
> >>          llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt
> >>          llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp
> >>          llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h
> >>          llvm/trunk/lib/Target/AMDGPU/SILowerControlFlow.cpp
> >>          llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.h
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/AMDGPU.h
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/AMDGPU.h?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/AMDGPU.h (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/AMDGPU.h Mon Mar 21 15:28:33 2016
> >>     @@ -44,6 +44,7 @@ FunctionPass *createSIFoldOperandsPass()
> >>       FunctionPass *createSILowerI1CopiesPass();
> >>       FunctionPass *createSIShrinkInstructionsPass();
> >>       FunctionPass *createSILoadStoreOptimizerPass(TargetMachine &tm);
> >>     +FunctionPass *createSIWholeQuadModePass();
> >>       FunctionPass *createSILowerControlFlowPass();
> >>       FunctionPass *createSIFixControlFlowLiveIntervalsPass();
> >>       FunctionPass *createSIFixSGPRCopiesPass();
> >>     @@ -70,6 +71,9 @@ extern char &SILowerI1CopiesID;
> >>       void initializeSILoadStoreOptimizerPass(PassRegistry &);
> >>       extern char &SILoadStoreOptimizerID;
> >>
> >>     +void initializeSIWholeQuadModePass(PassRegistry &);
> >>     +extern char &SIWholeQuadModeID;
> >>     +
> >>       void initializeSILowerControlFlowPass(PassRegistry &);
> >>       extern char &SILowerControlFlowPassID;
> >>
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/AMDGPUInstrInfo.h
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/AMDGPUInstrInfo.h?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/AMDGPUInstrInfo.h (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/AMDGPUInstrInfo.h Mon Mar 21 15:28:33 2016
> >>     @@ -62,7 +62,6 @@ public:
> >>                                      int64_t Offset1, int64_t Offset2,
> >>                                      unsigned NumLoads) const override;
> >>
> >>     -
> >>         /// \brief Return a target-specific opcode if Opcode is a pseudo instruction.
> >>         /// Return -1 if the target-specific opcode for the pseudo instruction does
> >>         /// not exist. If Opcode is not a pseudo instruction, this is identity.
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp Mon Mar 21 15:28:33 2016
> >>     @@ -57,6 +57,7 @@ extern "C" void LLVMInitializeAMDGPUTarg
> >>         initializeSIAnnotateControlFlowPass(*PR);
> >>         initializeSIInsertNopsPass(*PR);
> >>         initializeSIInsertWaitsPass(*PR);
> >>     +  initializeSIWholeQuadModePass(*PR);
> >>         initializeSILowerControlFlowPass(*PR);
> >>       }
> >>
> >>     @@ -346,6 +347,7 @@ void GCNPassConfig::addPreRegAlloc() {
> >>           insertPass(&MachineSchedulerID, &RegisterCoalescerID);
> >>         }
> >>         addPass(createSIShrinkInstructionsPass(), false);
> >>     +  addPass(createSIWholeQuadModePass());
> >>       }
> >>
> >>       void GCNPassConfig::addFastRegAlloc(FunctionPass *RegAllocPass) {
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt Mon Mar 21 15:28:33 2016
> >>     @@ -63,6 +63,7 @@ add_llvm_target(AMDGPUCodeGen
> >>         SIRegisterInfo.cpp
> >>         SIShrinkInstructions.cpp
> >>         SITypeRewriter.cpp
> >>     +  SIWholeQuadMode.cpp
> >>         )
> >>
> >>       add_subdirectory(AsmParser)
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp Mon Mar 21 15:28:33 2016
> >>     @@ -1248,6 +1248,19 @@ MachineInstr *SIInstrInfo::convertToThre
> >>                        .addImm(0); // omod
> >>       }
> >>
> >>     +bool SIInstrInfo::isSchedulingBoundary(const MachineInstr *MI,
> >>     +                                       const MachineBasicBlock *MBB,
> >>     +                                       const MachineFunction &MF) const {
> >>     +  // Target-independent instructions do not have an implicit-use of EXEC, even
> >>     +  // when they operate on VGPRs. Treating EXEC modifications as scheduling
> >>     +  // boundaries prevents incorrect movements of such instructions.
> >>     +  const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
> >>     +  if (MI->modifiesRegister(AMDGPU::EXEC, TRI))
> >>     +    return true;
> >>     +
> >>     +  return AMDGPUInstrInfo::isSchedulingBoundary(MI, MBB, MF);
> >>     +}
> >>     +
> >>       bool SIInstrInfo::isInlineConstant(const APInt &Imm) const {
> >>         int64_t SVal = Imm.getSExtValue();
> >>         if (SVal >= -16 && SVal <= 64)
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h Mon Mar 21 15:28:33 2016
> >>     @@ -149,6 +149,10 @@ public:
> >>                                               MachineBasicBlock::iterator &MI,
> >>                                               LiveVariables *LV) const override;
> >>
> >>     +  bool isSchedulingBoundary(const MachineInstr *MI,
> >>     +                            const MachineBasicBlock *MBB,
> >>     +                            const MachineFunction &MF) const override;
> >>     +
> >>         static bool isSALU(const MachineInstr &MI) {
> >>           return MI.getDesc().TSFlags & SIInstrFlags::SALU;
> >>         }
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/SILowerControlFlow.cpp
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/SILowerControlFlow.cpp?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/SILowerControlFlow.cpp (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/SILowerControlFlow.cpp Mon Mar 21 15:28:33 2016
> >>     @@ -78,7 +78,7 @@ private:
> >>         void SkipIfDead(MachineInstr &MI);
> >>
> >>         void If(MachineInstr &MI);
> >>     -  void Else(MachineInstr &MI);
> >>     +  void Else(MachineInstr &MI, bool ExecModified);
> >>         void Break(MachineInstr &MI);
> >>         void IfBreak(MachineInstr &MI);
> >>         void ElseBreak(MachineInstr &MI);
> >>     @@ -215,7 +215,7 @@ void SILowerControlFlow::If(MachineInstr
> >>         MI.eraseFromParent();
> >>       }
> >>
> >>     -void SILowerControlFlow::Else(MachineInstr &MI) {
> >>     +void SILowerControlFlow::Else(MachineInstr &MI, bool ExecModified) {
> >>         MachineBasicBlock &MBB = *MI.getParent();
> >>         DebugLoc DL = MI.getDebugLoc();
> >>         unsigned Dst = MI.getOperand(0).getReg();
> >>     @@ -225,6 +225,15 @@ void SILowerControlFlow::Else(MachineIns
> >>                 TII->get(AMDGPU::S_OR_SAVEEXEC_B64), Dst)
> >>                 .addReg(Src); // Saved EXEC
> >>
> >>     +  if (ExecModified) {
> >>     +    // Adjust the saved exec to account for the modifications during the flow
> >>     +    // block that contains the ELSE. This can happen when WQM mode is switched
> >>     +    // off.
> >>     +    BuildMI(MBB, &MI, DL, TII->get(AMDGPU::S_AND_B64), Dst)
> >>     +            .addReg(AMDGPU::EXEC)
> >>     +            .addReg(Dst);
> >>     +  }
> >>     +
> >>         BuildMI(MBB, &MI, DL, TII->get(AMDGPU::S_XOR_B64), AMDGPU::EXEC)
> >>                 .addReg(AMDGPU::EXEC)
> >>                 .addReg(Dst);
> >>     @@ -488,7 +497,6 @@ bool SILowerControlFlow::runOnMachineFun
> >>         SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
> >>
> >>         bool HaveKill = false;
> >>     -  bool NeedWQM = false;
> >>         bool NeedFlat = false;
> >>         unsigned Depth = 0;
> >>
> >>     @@ -498,17 +506,24 @@ bool SILowerControlFlow::runOnMachineFun
> >>           MachineBasicBlock *EmptyMBBAtEnd = NULL;
> >>           MachineBasicBlock &MBB = *BI;
> >>           MachineBasicBlock::iterator I, Next;
> >>     +    bool ExecModified = false;
> >>     +
> >>           for (I = MBB.begin(); I != MBB.end(); I = Next) {
> >>             Next = std::next(I);
> >>
> >>             MachineInstr &MI = *I;
> >>     -      if (TII->isWQM(MI) || TII->isDS(MI))
> >>     -        NeedWQM = true;
> >>
> >>             // Flat uses m0 in case it needs to access LDS.
> >>             if (TII->isFLAT(MI))
> >>               NeedFlat = true;
> >>
> >>     +      for (const auto &Def : I->defs()) {
> >>     +        if (Def.isReg() && Def.isDef() && Def.getReg() == AMDGPU::EXEC) {
> >>     +          ExecModified = true;
> >>     +          break;
> >>     +        }
> >>     +      }
> >>     +
> >>             switch (MI.getOpcode()) {
> >>               default: break;
> >>               case AMDGPU::SI_IF:
> >>     @@ -517,7 +532,7 @@ bool SILowerControlFlow::runOnMachineFun
> >>                 break;
> >>
> >>               case AMDGPU::SI_ELSE:
> >>     -          Else(MI);
> >>     +          Else(MI, ExecModified);
> >>                 break;
> >>
> >>               case AMDGPU::SI_BREAK:
> >>     @@ -599,12 +614,6 @@ bool SILowerControlFlow::runOnMachineFun
> >>           }
> >>         }
> >>
> >>     -  if (NeedWQM && MFI->getShaderType() == ShaderType::PIXEL) {
> >>     -    MachineBasicBlock &MBB = MF.front();
> >>     -    BuildMI(MBB, MBB.getFirstNonPHI(), DebugLoc(), TII->get(AMDGPU::S_WQM_B64),
> >>     -            AMDGPU::EXEC).addReg(AMDGPU::EXEC);
> >>     -  }
> >>     -
> >>         if (NeedFlat && MFI->IsKernel) {
> >>           // TODO: What to use with function calls?
> >>           // We will need to Initialize the flat scratch register pair.
> >>
> >>     Modified: llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.h
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.h?rev=263982&r1=263981&r2=263982&view=diff
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.h (original)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.h Mon Mar 21 15:28:33 2016
> >>     @@ -72,9 +72,12 @@ public:
> >>         }
> >>
> >>         bool isSGPRReg(const MachineRegisterInfo &MRI, unsigned Reg) const {
> >>     +    const TargetRegisterClass *RC;
> >>           if (TargetRegisterInfo::isVirtualRegister(Reg))
> >>     -      return isSGPRClass(MRI.getRegClass(Reg));
> >>     -    return getPhysRegClass(Reg);
> >>     +      RC = MRI.getRegClass(Reg);
> >>     +    else
> >>     +      RC = getPhysRegClass(Reg);
> >>     +    return isSGPRClass(RC);
> >>         }
> >>
> >>         /// \returns true if this class contains VGPR registers.
> >>
> >>     Added: llvm/trunk/lib/Target/AMDGPU/SIWholeQuadMode.cpp
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AMDGPU/SIWholeQuadMode.cpp?rev=263982&view=auto
> >> ==============================================================================
> >>     --- llvm/trunk/lib/Target/AMDGPU/SIWholeQuadMode.cpp (added)
> >>     +++ llvm/trunk/lib/Target/AMDGPU/SIWholeQuadMode.cpp Mon Mar 21 15:28:33 2016
> >>     @@ -0,0 +1,465 @@
> >>     +//===-- SIWholeQuadMode.cpp - enter and suspend whole quad mode -----------===//
> >>     +//
> >>     +//                     The LLVM Compiler Infrastructure
> >>     +//
> >>     +// This file is distributed under the University of Illinois Open Source
> >>     +// License. See LICENSE.TXT for details.
> >>     +//
> >>     +//===----------------------------------------------------------------------===//
> >>
> >>     +//
> >>     +/// \file
> >>     +/// \brief This pass adds instructions to enable whole quad mode for pixel
> >>     +/// shaders.
> >>     +///
> >>     +/// Whole quad mode is required for derivative computations, but it interferes
> >>     +/// with shader side effects (stores and atomics). This pass is run on the
> >>     +/// scheduled machine IR but before register coalescing, so that machine SSA is
> >>     +/// available for analysis. It ensures that WQM is enabled when necessary, but
> >>     +/// disabled around stores and atomics.
> >>     +///
> >>     +/// When necessary, this pass creates a function prolog
> >>     +///
> >>     +///   S_MOV_B64 LiveMask, EXEC
> >>     +///   S_WQM_B64 EXEC, EXEC
> >>     +///
> >>     +/// to enter WQM at the top of the function and surrounds blocks of Exact
> >>     +/// instructions by
> >>     +///
> >>     +///   S_AND_SAVEEXEC_B64 Tmp, LiveMask
> >>     +///   ...
> >>     +///   S_MOV_B64 EXEC, Tmp
> >>     +///
> >>     +/// In order to avoid excessive switching during sequences of Exact
> >>     +/// instructions, the pass first analyzes which instructions must be run in WQM
> >>     +/// (aka which instructions produce values that lead to derivative
> >>     +/// computations).
> >>     +///
> >>     +/// Basic blocks are always exited in WQM as long as some successor needs WQM.
> >>     +///
> >>     +/// There is room for improvement given better control flow analysis:
> >>     +///
> >>     +///  (1) at the top level (outside of control flow statements, and as long as
> >>     +///      kill hasn't been used), one SGPR can be saved by recovering WQM from
> >>     +///      the LiveMask (this is implemented for the entry block).
> >>     +///
> >>     +///  (2) when entire regions (e.g. if-else blocks or entire loops) only
> >>     +///      consist of exact and don't-care instructions, the switch only has to
> >>     +///      be done at the entry and exit points rather than potentially in each
> >>     +///      block of the region.
> >>     +///
> >>     +//===----------------------------------------------------------------------===//
> >>
> >>     +
> >>     +#include "AMDGPU.h"
> >>     +#include "AMDGPUSubtarget.h"
> >>     +#include "SIInstrInfo.h"
> >>     +#include "SIMachineFunctionInfo.h"
> >>     +#include "llvm/CodeGen/MachineDominanceFrontier.h"
> >>     +#include "llvm/CodeGen/MachineDominators.h"
> >>     +#include "llvm/CodeGen/MachineFunction.h"
> >>     +#include "llvm/CodeGen/MachineFunctionPass.h"
> >>     +#include "llvm/CodeGen/MachineInstrBuilder.h"
> >>     +#include "llvm/CodeGen/MachineRegisterInfo.h"
> >>     +#include "llvm/IR/Constants.h"
> >>     +
> >>     +using namespace llvm;
> >>     +
> >>     +#define DEBUG_TYPE "si-wqm"
> >>     +
> >>     +namespace {
> >>     +
> >>     +enum {
> >>     +  StateWQM = 0x1,
> >>     +  StateExact = 0x2,
> >>     +};
> >>     +
> >>     +struct InstrInfo {
> >>     +  char Needs = 0;
> >>     +  char OutNeeds = 0;
> >>     +};
> >>     +
> >>     +struct BlockInfo {
> >>     +  char Needs = 0;
> >>     +  char InNeeds = 0;
> >>     +  char OutNeeds = 0;
> >>     +};
> >>     +
> >>     +struct WorkItem {
> >>     +  const MachineBasicBlock *MBB = nullptr;
> >>     +  const MachineInstr *MI = nullptr;
> >>     +
> >>     +  WorkItem() {}
> >>     +  WorkItem(const MachineBasicBlock *MBB) : MBB(MBB) {}
> >>     +  WorkItem(const MachineInstr *MI) : MI(MI) {}
> >>     +};
> >>     +
> >>     +class SIWholeQuadMode : public MachineFunctionPass {
> >>     +private:
> >>     +  const SIInstrInfo *TII;
> >>     +  const SIRegisterInfo *TRI;
> >>     +  MachineRegisterInfo *MRI;
> >>     +
> >>     +  DenseMap<const MachineInstr *, InstrInfo> Instructions;
> >>     +  DenseMap<const MachineBasicBlock *, BlockInfo> Blocks;
> >>     +  SmallVector<const MachineInstr *, 2> ExecExports;
> >>     +
> >>     +  char scanInstructions(const MachineFunction &MF, std::vector<WorkItem>& Worklist);
> >>     +  void propagateInstruction(const MachineInstr &MI, std::vector<WorkItem>& Worklist);
> >>     +  void propagateBlock(const MachineBasicBlock &MBB, std::vector<WorkItem>& Worklist);
> >>     +  char analyzeFunction(const MachineFunction &MF);
> >>     +
> >>     +  void toExact(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
> >>     +               unsigned SaveWQM, unsigned LiveMaskReg);
> >>     +  void toWQM(MachineBasicBlock &MBB, MachineBasicBlock::iterator Before,
> >>     +             unsigned SavedWQM);
> >>     +  void processBlock(MachineBasicBlock &MBB, unsigned LiveMaskReg, bool isEntry);
> >>     +
> >>     +public:
> >>     +  static char ID;
> >>     +
> >>     +  SIWholeQuadMode() :
> >>     +    MachineFunctionPass(ID) { }
> >>     +
> >>     +  bool runOnMachineFunction(MachineFunction &MF) override;
> >>     +
> >>     +  const char *getPassName() const override {
> >>     +    return "SI Whole Quad Mode";
> >>     +  }
> >>     +
> >>     +  void getAnalysisUsage(AnalysisUsage &AU) const override {
> >>     +    AU.setPreservesCFG();
> >>     +    MachineFunctionPass::getAnalysisUsage(AU);
> >>     +  }
> >>     +};
> >>     +
> >>     +} // End anonymous namespace
> >>     +
> >>     +char SIWholeQuadMode::ID = 0;
> >>     +
> >>     +INITIALIZE_PASS_BEGIN(SIWholeQuadMode, DEBUG_TYPE,
> >>     +                      "SI Whole Quad Mode", false, false)
> >>     +INITIALIZE_PASS_END(SIWholeQuadMode, DEBUG_TYPE,
> >>     +                      "SI Whole Quad Mode", false, false)
> >>     +
> >>     +char &llvm::SIWholeQuadModeID = SIWholeQuadMode::ID;
> >>     +
> >>     +FunctionPass *llvm::createSIWholeQuadModePass() {
> >>     +  return new SIWholeQuadMode;
> >>     +}
> >>     +
> >>     +// Scan instructions to determine which ones require an Exact execmask and
> >>     +// which ones seed WQM requirements.
> >>     +char SIWholeQuadMode::scanInstructions(const MachineFunction &MF,
> >>     +                                       std::vector<WorkItem> &Worklist) {
> >>     +  char GlobalFlags = 0;
> >>     +
> >>     +  for (auto BI = MF.begin(), BE = MF.end(); BI != BE; ++BI) {
> >>     +    const MachineBasicBlock &MBB = *BI;
> >>     +
> >>     +    for (auto II = MBB.begin(), IE = MBB.end(); II != IE; ++II) {
> >>     +      const MachineInstr &MI = *II;
> >>     +      unsigned Opcode = MI.getOpcode();
> >>     +      char Flags;
> >>     +
> >>     +      if (TII->isWQM(Opcode) || TII->isDS(Opcode)) {
> >>     +        Flags = StateWQM;
> >>     +      } else if (TII->get(Opcode).mayStore() &&
> >>     +                 (MI.getDesc().TSFlags & SIInstrFlags::VM_CNT)) {
> >>     +        Flags = StateExact;
> >>     +      } else {
> >>     +        // Handle export instructions with the exec mask valid flag set
> >>     +        if (Opcode == AMDGPU::EXP && MI.getOperand(4).getImm() != 0)
> >>     +          ExecExports.push_back(&MI);
> >>     +        continue;
> >>     +      }
> >>     +
> >>     +      Instructions[&MI].Needs = Flags;
> >>     +      Worklist.push_back(&MI);
> >>     +      GlobalFlags |= Flags;
> >>     +    }
> >>     +  }
> >>     +
> >>     +  return GlobalFlags;
> >>     +}
> >>     +
> >>     +void SIWholeQuadMode::propagateInstruction(const MachineInstr &MI,
> >>     +                                           std::vector<WorkItem>& Worklist) {
> >>     +  const MachineBasicBlock &MBB = *MI.getParent();
> >>     +  InstrInfo &II = Instructions[&MI];
> >>     +  BlockInfo &BI = Blocks[&MBB];
> >>     +
> >>     +  // Control flow-type instructions that are followed by WQM computations
> >>     +  // must themselves be in WQM.
> >>     +  if ((II.OutNeeds & StateWQM) && !(II.Needs & StateWQM) &&
> >>     +      (MI.isBranch() || MI.isTerminator() || MI.getOpcode() == AMDGPU::SI_KILL))
> >>     +    II.Needs = StateWQM;
> >>     +
> >>     +  // Propagate to block level
> >>     +  BI.Needs |= II.Needs;
> >>     +  if ((BI.InNeeds | II.Needs) != BI.InNeeds) {
> >>     +    BI.InNeeds |= II.Needs;
> >>     +    Worklist.push_back(&MBB);
> >>     +  }
> >>     +
> >>     +  // Propagate backwards within block
> >>     +  if (const MachineInstr *PrevMI = MI.getPrevNode()) {
> >>     +    char InNeeds = II.Needs | II.OutNeeds;
> >>     +    if (!PrevMI->isPHI()) {
> >>     +      InstrInfo &PrevII = Instructions[PrevMI];
> >>     +      if ((PrevII.OutNeeds | InNeeds) != PrevII.OutNeeds) {
> >>     +        PrevII.OutNeeds |= InNeeds;
> >>     +        Worklist.push_back(PrevMI);
> >>     +      }
> >>     +    }
> >>     +  }
> >>     +
> >>     +  // Propagate WQM flag to instruction inputs
> >>     +  assert(II.Needs != (StateWQM | StateExact));
> >>     +  if (II.Needs != StateWQM)
> >>     +    return;
> >>     +
> >>     +  for (const MachineOperand &Use : MI.uses()) {
> >>     +    if (!Use.isReg() || !Use.isUse())
> >>     +      continue;
> >>     +
> >>     +    // At this point, physical registers appear as inputs or outputs
> >>     +    // and following them makes no sense (and would in fact be incorrect
> >>     +    // when the same VGPR is used as both an output and an input that leads
> >>     +    // to a NeedsWQM instruction).
> >>     +    //
> >>     +    // Note: VCC appears e.g. in 64-bit addition with carry - theoretically we
> >>     +    // have to trace this, in practice it happens for 64-bit computations like
> >>     +    // pointers where both dwords are followed already anyway.
> >>     +    if (!TargetRegisterInfo::isVirtualRegister(Use.getReg()))
> >>     +      continue;
> >>     +
> >>     +    for (const MachineOperand &Def : MRI->def_operands(Use.getReg())) {
> >>     +      const MachineInstr *DefMI = Def.getParent();
> >>     +      InstrInfo &DefII = Instructions[DefMI];
> >>     +
> >>     +      // Obviously skip if DefMI is already flagged as NeedWQM.
> >>     +      //
> >>     +      // The instruction might also be flagged as NeedExact. This happens when
> >>     +      // the result of an atomic is used in a WQM computation. In this case,
> >>     +      // the atomic must not run for helper pixels and the WQM result is
> >>     +      // undefined.
> >>     +      if (DefII.Needs != 0)
> >>     +        continue;
> >>     +
> >>     +      DefII.Needs = StateWQM;
> >>     +      Worklist.push_back(DefMI);
> >>     +    }
> >>     +  }
> >>     +}
> >>     +
> >>     +void SIWholeQuadMode::propagateBlock(const MachineBasicBlock &MBB,
> >>     +                                     std::vector<WorkItem>& Worklist) {
> >>     +  BlockInfo &BI = Blocks[&MBB];
> >>     +
> >>     +  // Propagate through instructions
> >>     +  if (!MBB.empty()) {
> >>     +    const MachineInstr *LastMI = &*MBB.rbegin();
> >>     +    InstrInfo &LastII = Instructions[LastMI];
> >>     +    if ((LastII.OutNeeds | BI.OutNeeds) != LastII.OutNeeds) {
> >>     +      LastII.OutNeeds |= BI.OutNeeds;
> >>     +      Worklist.push_back(LastMI);
> >>     +    }
> >>     +  }
> >>     +
> >>     +  // Predecessor blocks must provide for our WQM/Exact needs.
> >>     +  for (const MachineBasicBlock *Pred : MBB.predecessors()) {
> >>     +    BlockInfo &PredBI = Blocks[Pred];
> >>     +    if ((PredBI.OutNeeds | BI.InNeeds) == PredBI.OutNeeds)
> >>     +      continue;
> >>     +
> >>     +    PredBI.OutNeeds |= BI.InNeeds;
> >>     +    PredBI.InNeeds |= BI.InNeeds;
> >>     +    Worklist.push_back(Pred);
> >>     +  }
> >>     +
> >>     +  // All successors must be prepared to accept the same set of WQM/Exact
> >>     +  // data.
> >>     +  for (const MachineBasicBlock *Succ : MBB.successors()) {
> >>     +    BlockInfo &SuccBI = Blocks[Succ];
> >>     +    if ((SuccBI.InNeeds | BI.OutNeeds) == SuccBI.InNeeds)
> >>     +      continue;
> >>     +
> >>     +    SuccBI.InNeeds |= BI.OutNeeds;
> >>     +    Worklist.push_back(Succ);
> >>     +  }
> >>     +}
> >>     +
> >>     +char SIWholeQuadMode::analyzeFunction(const MachineFunction &MF) {
> >>     +  std::vector<WorkItem> Worklist;
> >>     +  char GlobalFlags = scanInstructions(MF, Worklist);
> >>     +
> >>     +  while (!Worklist.empty()) {
> >>     +    WorkItem WI = Worklist.back();
> >>     +    Worklist.pop_back();
> >>     +
> >>     +    if (WI.MI)
> >>     +      propagateInstruction(*WI.MI, Worklist);
> >>     +    else
> >>     +      propagateBlock(*WI.MBB, Worklist);
> >>     +  }
> >>     +
> >>     +  return GlobalFlags;
> >>     +}
> >>     +
> >>     +void SIWholeQuadMode::toExact(MachineBasicBlock &MBB,
> >>     +                              MachineBasicBlock::iterator Before,
> >>     +                              unsigned SaveWQM, unsigned LiveMaskReg)
> >>     +{
> >>     +  if (SaveWQM) {
> >>     +    BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::S_AND_SAVEEXEC_B64),
> >>     +            SaveWQM)
> >>     +        .addReg(LiveMaskReg);
> >>     +  } else {
> >>     +    BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::S_AND_B64),
> >>     +            AMDGPU::EXEC)
> >>     +        .addReg(AMDGPU::EXEC)
> >>     +        .addReg(LiveMaskReg);
> >>     +  }
> >>     +}
> >>     +
> >>     +void SIWholeQuadMode::toWQM(MachineBasicBlock &MBB,
> >>     +                            MachineBasicBlock::iterator Before,
> >>     +                            unsigned SavedWQM)
> >>     +{
> >>     +  if (SavedWQM) {
> >>     +    BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::COPY), AMDGPU::EXEC)
> >>     +        .addReg(SavedWQM);
> >>     +  } else {
> >>     +    BuildMI(MBB, Before, DebugLoc(), TII->get(AMDGPU::S_WQM_B64),
> >>     +            AMDGPU::EXEC)
> >>     +        .addReg(AMDGPU::EXEC);
> >>     +  }
> >>     +}
> >>     +
> >>     +void SIWholeQuadMode::processBlock(MachineBasicBlock &MBB, unsigned LiveMaskReg,
> >>     +                                   bool isEntry) {
> >>     +  auto BII = Blocks.find(&MBB);
> >>     +  if (BII == Blocks.end())
> >>     +    return;
> >>     +
> >>     +  const BlockInfo &BI = BII->second;
> >>     +
> >>     +  if (!(BI.InNeeds & StateWQM))
> >>     +    return;
> >>     +
> >>     +  // This is a non-entry block that is WQM throughout, so no need to do
> >>     +  // anything.
> >>     +  if (!isEntry && !(BI.Needs & StateExact) && BI.OutNeeds != StateExact)
> >>     +    return;
> >>     +
> >>     +  unsigned SavedWQMReg = 0;
> >>     +  bool WQMFromExec = isEntry;
> >>     +  char State = isEntry ? StateExact : StateWQM;
> >>     +
> >>     +  auto II = MBB.getFirstNonPHI(), IE = MBB.end();
> >>     +  while (II != IE) {
> >>     +    MachineInstr &MI = *II;
> >>     +    ++II;
> >>     +
> >>     +    // Skip instructions that are not affected by EXEC
> >>     +    if (MI.getDesc().TSFlags & (SIInstrFlags::SALU | SIInstrFlags::SMRD) &&
> >>     +        !MI.isBranch() && !MI.isTerminator())
> >>     +      continue;
> >>     +
> >>     +    // Generic instructions such as COPY will either disappear by register
> >>     +    // coalescing or be lowered to SALU or VALU instructions.
> >>     +    if (TargetInstrInfo::isGenericOpcode(MI.getOpcode())) {
> >>     +      if (MI.getNumExplicitOperands() >= 1) {
> >>     +        const MachineOperand &Op = MI.getOperand(0);
> >>     +        if (Op.isReg()) {
> >>     +          if (TRI->isSGPRReg(*MRI, Op.getReg())) {
> >>     +            // SGPR instructions are not affected by EXEC
> >>     +            continue;
> >>     +          }
> >>     +        }
> >>     +      }
> >>     +    }
> >>     +
> >>     +    char Needs = 0;
> >>     +    char OutNeeds = 0;
> >>     +    auto InstrInfoIt = Instructions.find(&MI);
> >>     +    if (InstrInfoIt != Instructions.end()) {
> >>     +      Needs = InstrInfoIt->second.Needs;
> >>     +      OutNeeds = InstrInfoIt->second.OutNeeds;
> >>     +
> >>     +      // Make sure to switch to Exact mode before the end of the block when
> >>     +      // Exact and only Exact is needed further downstream.
> >>     +      if (OutNeeds == StateExact && (MI.isBranch() || MI.isTerminator())) {
> >>     +        assert(Needs == 0);
> >>     +        Needs = StateExact;
> >>     +      }
> >>     +    }
> >>     +
> >>     +    // State switching
> >>     +    if (Needs && State != Needs) {
> >>     +      if (Needs == StateExact) {
> >>     +        assert(!SavedWQMReg);
> >>     +
> >>     +        if (!WQMFromExec && (OutNeeds & StateWQM))
> >>     +          SavedWQMReg = MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);
> >>     +
> >>     +        toExact(MBB, &MI, SavedWQMReg, LiveMaskReg);
> >>     +      } else {
> >>     +        assert(WQMFromExec == (SavedWQMReg == 0));
> >>     +        toWQM(MBB, &MI, SavedWQMReg);
> >>     +        SavedWQMReg = 0;
> >>     +      }
> >>     +
> >>     +      State = Needs;
> >>     +    }
> >>     +
> >>     +    if (MI.getOpcode() == AMDGPU::SI_KILL)
> >>     +      WQMFromExec = false;
> >>     +  }
> >>     +
> >>     +  if ((BI.OutNeeds & StateWQM) && State != StateWQM) {
> >>     +    assert(WQMFromExec == (SavedWQMReg == 0));
> >>     +    toWQM(MBB, MBB.end(), SavedWQMReg);
> >>     +  } else if (BI.OutNeeds == StateExact && State != StateExact) {
> >>     +    toExact(MBB, MBB.end(), 0, LiveMaskReg);
> >>     +  }
> >>     +}
> >>     +
> >>     +bool SIWholeQuadMode::runOnMachineFunction(MachineFunction &MF) {
> >>     +  SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
> >>     +
> >>     +  if (MFI->getShaderType() != ShaderType::PIXEL)
> >>     +    return false;
> >>     +
> >>     +  Instructions.clear();
> >>     +  Blocks.clear();
> >>     +  ExecExports.clear();
> >>     +
> >>     +  TII = static_cast<const SIInstrInfo *>(MF.getSubtarget().getInstrInfo());
> >>     +  TRI = static_cast<const SIRegisterInfo *>(MF.getSubtarget().getRegisterInfo());
> >>     +  MRI = &MF.getRegInfo();
> >>     +
> >>     +  char GlobalFlags = analyzeFunction(MF);
> >>     +  if (!(GlobalFlags & StateWQM))
> >>     +    return false;
> >>     +
> >>     +  MachineBasicBlock &Entry = MF.front();
> >>     +  MachineInstr *EntryMI = Entry.getFirstNonPHI();
> >>     +
> >>     +  if (GlobalFlags == StateWQM) {
> >>     +    // For a shader that needs only WQM, we can just set it once.
> >>     +    BuildMI(Entry, EntryMI, DebugLoc(), TII->get(AMDGPU::S_WQM_B64),
> >>     +            AMDGPU::EXEC).addReg(AMDGPU::EXEC);
> >>     +    return true;
> >>     +  }
> >>     +
> >>     +  // Handle the general case
> >>     +  unsigned LiveMaskReg = MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);
> >>     +  BuildMI(Entry, EntryMI, DebugLoc(), TII->get(AMDGPU::COPY), LiveMaskReg)
> >>     +      .addReg(AMDGPU::EXEC);
> >>     +
> >>     +  for (const auto &BII : Blocks)
> >>     +    processBlock(const_cast<MachineBasicBlock &>(*BII.first), LiveMaskReg,
> >>     +                 BII.first == &*MF.begin());
> >>     +
> >>     +  return true;
> >>     +}
> >>
> >>     Added: llvm/trunk/test/CodeGen/AMDGPU/wqm.ll
> >>     URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/AMDGPU/wqm.ll?rev=263982&view=auto
> >> ==============================================================================
> >>     --- llvm/trunk/test/CodeGen/AMDGPU/wqm.ll (added)
> >>     +++ llvm/trunk/test/CodeGen/AMDGPU/wqm.ll Mon Mar 21 15:28:33 2016
> >>     @@ -0,0 +1,348 @@
> >>     +;RUN: llc < %s -march=amdgcn -mcpu=verde -verify-machineinstrs | FileCheck %s --check-prefix=CHECK --check-prefix=SI
> >>     +;RUN: llc < %s -march=amdgcn -mcpu=tonga -verify-machineinstrs | FileCheck %s --check-prefix=CHECK --check-prefix=VI
> >>     +
> >>     +; Check that WQM isn't triggered by image load/store intrinsics.
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test1:
> >>     +;CHECK-NOT: s_wqm
> >>     +define <4 x float> @test1(<8 x i32> inreg %rsrc, <4 x i32> %c) #0 {
> >>     +main_body:
> >>     +  %tex = call <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
> >>     +  call void @llvm.amdgcn.image.store.v4i32(<4 x float> %tex, <4 x i32> %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
> >>     +  ret <4 x float> %tex
> >>     +}
> >>     +
> >>     +; Check that WQM is triggered by image samples and left untouched for loads...
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test2:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: image_sample
> >>     +;CHECK-NOT: exec
> >>     +;CHECK: _load_dword v0,
> >>     +define float @test2(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, <4 x i32> %c) #0 {
> >>     +main_body:
> >>     +  %c.1 = call <4 x float> @llvm.SI.image.sample.v4i32(<4 x i32> %c, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  %c.2 = bitcast <4 x float> %c.1 to <4 x i32>
> >>     +  %c.3 = extractelement <4 x i32> %c.2, i32 0
> >>     +  %gep = getelementptr float, float addrspace(1)* %ptr, i32 %c.3
> >>     +  %data = load float, float addrspace(1)* %gep
> >>     +  ret float %data
> >>     +}
> >>     +
> >>     +; ... but disabled for stores (and, in this simple case, not re-enabled).
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test3:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: image_sample
> >>     +;CHECK: s_and_b64 exec, exec, [[ORIG]]
> >>     +;CHECK: store
> >>     +;CHECK-NOT: exec
> >>     +define <4 x float> @test3(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, <4 x i32> %c) #0 {
> >>     +main_body:
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.v4i32(<4 x i32> %c, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  %tex.1 = bitcast <4 x float> %tex to <4 x i32>
> >>     +  %tex.2 = extractelement <4 x i32> %tex.1, i32 0
> >>     +  %gep = getelementptr float, float addrspace(1)* %ptr, i32 %tex.2
> >>     +  %wr = extractelement <4 x float> %tex, i32 1
> >>     +  store float %wr, float addrspace(1)* %gep
> >>     +  ret <4 x float> %tex
> >>     +}
> >>     +
> >>     +; Check that WQM is re-enabled when required.
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test4:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: v_mul_lo_i32 [[MUL:v[0-9]+]], v0, v1
> >>     +;CHECK: s_and_b64 exec, exec, [[ORIG]]
> >>     +;CHECK: store
> >>     +;CHECK: s_wqm_b64 exec, exec
> >>     +;CHECK: image_sample v[0:3], [[MUL]], s[0:7], s[8:11] dmask:0xf
> >>     +define <4 x float> @test4(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, i32 %c, i32 %d, float %data) #0 {
> >>     +main_body:
> >>     +  %c.1 = mul i32 %c, %d
> >>     +  %gep = getelementptr float, float addrspace(1)* %ptr, i32 %c.1
> >>     +  store float %data, float addrspace(1)* %gep
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %c.1, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  ret <4 x float> %tex
> >>     +}
> >>     +
> >>     +; Check a case of one branch of an if-else requiring WQM, the other requiring
> >>     +; exact.
> >>     +;
> >>     +; Note: In this particular case, the save-and-restore could be avoided if the
> >>     +; analysis understood that the two branches of the if-else are mutually
> >>     +; exclusive.
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test_control_flow_0:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: %ELSE
> >>     +;CHECK: s_and_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]], [[ORIG]]
> >>     +;CHECK: store
> >>     +;CHECK: s_mov_b64 exec, [[SAVED]]
> >>     +;CHECK: %IF
> >>     +;CHECK: image_sample
> >>     +define float @test_control_flow_0(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, i32 %c, i32 %z, float %data) #0 {
> >>     +main_body:
> >>     +  %cmp = icmp eq i32 %z, 0
> >>     +  br i1 %cmp, label %IF, label %ELSE
> >>     +
> >>     +IF:
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %c, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  %data.if = extractelement <4 x float> %tex, i32 0
> >>     +  br label %END
> >>     +
> >>     +ELSE:
> >>     +  %gep = getelementptr float, float addrspace(1)* %ptr, i32 %c
> >>     +  store float %data, float addrspace(1)* %gep
> >>     +  br label %END
> >>     +
> >>     +END:
> >>     +  %r = phi float [ %data.if, %IF ], [ %data, %ELSE ]
> >>     +  ret float %r
> >>     +}
> >>     +
> >>     +; Reverse branch order compared to the previous test.
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test_control_flow_1:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: %IF
> >>     +;CHECK: image_sample
> >>     +;CHECK: %Flow
> >>     +;CHECK-NEXT: s_or_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]],
> >>     +;CHECK-NEXT: s_and_b64 exec, exec, [[ORIG]]
> >>     +;CHECK-NEXT: s_and_b64 [[SAVED]], exec, [[SAVED]]
> >>     +;CHECK-NEXT: s_xor_b64 exec, exec, [[SAVED]]
> >>     +;CHECK-NEXT: %ELSE
> >>     +;CHECK: store
> >>     +;CHECK: %END
> >>     +define float @test_control_flow_1(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, i32 %c, i32 %z, float %data) #0 {
> >>     +main_body:
> >>     +  %cmp = icmp eq i32 %z, 0
> >>     +  br i1 %cmp, label %ELSE, label %IF
> >>     +
> >>     +IF:
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %c, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  %data.if = extractelement <4 x float> %tex, i32 0
> >>     +  br label %END
> >>     +
> >>     +ELSE:
> >>     +  %gep = getelementptr float, float addrspace(1)* %ptr, i32 %c
> >>     +  store float %data, float addrspace(1)* %gep
> >>     +  br label %END
> >>     +
> >>     +END:
> >>     +  %r = phi float [ %data.if, %IF ], [ %data, %ELSE ]
> >>     +  ret float %r
> >>     +}
> >>     +
> >>     +; Check that branch conditions are properly marked as needing WQM...
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test_control_flow_2:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: s_and_b64 exec, exec, [[ORIG]]
> >>     +;CHECK: store
> >>     +;CHECK: s_wqm_b64 exec, exec
> >>     +;CHECK: load
> >>     +;CHECK: s_and_b64 exec, exec, [[ORIG]]
> >>     +;CHECK: store
> >>     +;CHECK: s_wqm_b64 exec, exec
> >>     +;CHECK: v_cmp
> >>     +define <4 x float> @test_control_flow_2(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, <3 x i32> %idx, <2 x float> %data, i32 %coord) #0 {
> >>     +main_body:
> >>     +  %idx.1 = extractelement <3 x i32> %idx, i32 0
> >>     +  %gep.1 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.1
> >>     +  %data.1 = extractelement <2 x float> %data, i32 0
> >>     +  store float %data.1, float addrspace(1)* %gep.1
> >>     +
> >>     +  ; The load that determines the branch (and should therefore be WQM) is
> >>     +  ; surrounded by stores that require disabled WQM.
> >>     +  %idx.2 = extractelement <3 x i32> %idx, i32 1
> >>     +  %gep.2 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.2
> >>     +  %z = load float, float addrspace(1)* %gep.2
> >>     +
> >>     +  %idx.3 = extractelement <3 x i32> %idx, i32 2
> >>     +  %gep.3 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.3
> >>     +  %data.3 = extractelement <2 x float> %data, i32 1
> >>     +  store float %data.3, float addrspace(1)* %gep.3
> >>     +
> >>     +  %cc = fcmp ogt float %z, 0.0
> >>     +  br i1 %cc, label %IF, label %ELSE
> >>     +
> >>     +IF:
> >>     +  %coord.IF = mul i32 %coord, 3
> >>     +  br label %END
> >>     +
> >>     +ELSE:
> >>     +  %coord.ELSE = mul i32 %coord, 4
> >>     +  br label %END
> >>     +
> >>     +END:
> >>     +  %coord.END = phi i32 [ %coord.IF, %IF ], [ %coord.ELSE, %ELSE ]
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %coord.END, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  ret <4 x float> %tex
> >>     +}
> >>     +
> >>     +; ... but only if they really do need it.
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test_control_flow_3:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: image_sample
> >>     +;CHECK: s_and_b64 exec, exec, [[ORIG]]
> >>     +;CHECK: store
> >>     +;CHECK: load
> >>     +;CHECK: store
> >>     +;CHECK: v_cmp
> >>     +define float @test_control_flow_3(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, <3 x i32> %idx, <2 x float> %data, i32 %coord) #0 {
> >>     +main_body:
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %coord, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  %tex.1 = extractelement <4 x float> %tex, i32 0
> >>     +
> >>     +  %idx.1 = extractelement <3 x i32> %idx, i32 0
> >>     +  %gep.1 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.1
> >>     +  %data.1 = extractelement <2 x float> %data, i32 0
> >>     +  store float %data.1, float addrspace(1)* %gep.1
> >>     +
> >>     +  %idx.2 = extractelement <3 x i32> %idx, i32 1
> >>     +  %gep.2 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.2
> >>     +  %z = load float, float addrspace(1)* %gep.2
> >>     +
> >>     +  %idx.3 = extractelement <3 x i32> %idx, i32 2
> >>     +  %gep.3 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.3
> >>     +  %data.3 = extractelement <2 x float> %data, i32 1
> >>     +  store float %data.3, float addrspace(1)* %gep.3
> >>     +
> >>     +  %cc = fcmp ogt float %z, 0.0
> >>     +  br i1 %cc, label %IF, label %ELSE
> >>     +
> >>     +IF:
> >>     +  %tex.IF = fmul float %tex.1, 3.0
> >>     +  br label %END
> >>     +
> >>     +ELSE:
> >>     +  %tex.ELSE = fmul float %tex.1, 4.0
> >>     +  br label %END
> >>     +
> >>     +END:
> >>     +  %tex.END = phi float [ %tex.IF, %IF ], [ %tex.ELSE, %ELSE ]
> >>     +  ret float %tex.END
> >>     +}
> >>     +
> >>     +; Another test that failed at some point because of terminator handling.
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test_control_flow_4:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: %IF
> >>     +;CHECK: load
> >>     +;CHECK: s_and_saveexec_b64 [[SAVE:s\[[0-9]+:[0-9]+\]]],  [[ORIG]]
> >>     +;CHECK: store
> >>     +;CHECK: s_mov_b64 exec, [[SAVE]]
> >>     +;CHECK: %END
> >>     +;CHECK: image_sample
> >>     +define <4 x float> @test_control_flow_4(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, i32 %coord, i32 %y, float %z) #0 {
> >>     +main_body:
> >>     +  %cond = icmp eq i32 %y, 0
> >>     +  br i1 %cond, label %IF, label %END
> >>     +
> >>     +IF:
> >>     +  %data = load float, float addrspace(1)* %ptr
> >>     +  %gep = getelementptr float, float addrspace(1)* %ptr, i32 1
> >>     +  store float %data, float addrspace(1)* %gep
> >>     +  br label %END
> >>     +
> >>     +END:
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %coord, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  ret <4 x float> %tex
> >>     +}
> >>     +
> >>     +; Kill is performed in WQM mode so that uniform kill behaves correctly ...
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test_kill_0:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: image_sample
> >>     +;CHECK: s_and_b64 exec, exec, [[ORIG]]
> >>     +;SI: buffer_store_dword
> >>     +;VI: flat_store_dword
> >>     +;CHECK: s_wqm_b64 exec, exec
> >>     +;CHECK: v_cmpx_
> >>     +;CHECK: s_and_saveexec_b64 [[SAVE:s\[[0-9]+:[0-9]+\]]], [[ORIG]]
> >>     +;SI: buffer_store_dword
> >>     +;VI: flat_store_dword
> >>     +;CHECK: s_mov_b64 exec, [[SAVE]]
> >>     +;CHECK: image_sample
> >>     +define <4 x float> @test_kill_0(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, <2 x i32> %idx, <2 x float> %data, i32 %coord, i32 %coord2, float %z) #0 {
> >>     +main_body:
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %coord, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +
> >>     +  %idx.0 = extractelement <2 x i32> %idx, i32 0
> >>     +  %gep.0 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.0
> >>     +  %data.0 = extractelement <2 x float> %data, i32 0
> >>     +  store float %data.0, float addrspace(1)* %gep.0
> >>     +
> >>     +  call void @llvm.AMDGPU.kill(float %z)
> >>     +
> >>     +  %idx.1 = extractelement <2 x i32> %idx, i32 1
> >>     +  %gep.1 = getelementptr float, float addrspace(1)* %ptr, i32 %idx.1
> >>     +  %data.1 = extractelement <2 x float> %data, i32 1
> >>     +  store float %data.1, float addrspace(1)* %gep.1
> >>     +
> >>     +  %tex2 = call <4 x float> @llvm.SI.image.sample.i32(i32 %coord2, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +  %out = fadd <4 x float> %tex, %tex2
> >>     +
> >>     +  ret <4 x float> %out
> >>     +}
> >>     +
> >>     +; ... but only if WQM is necessary.
> >>     +;
> >>     +;CHECK-LABEL: {{^}}test_kill_1:
> >>     +;CHECK-NEXT: ; %main_body
> >>     +;CHECK-NEXT: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
> >>     +;CHECK-NEXT: s_wqm_b64 exec, exec
> >>     +;CHECK: image_sample
> >>     +;CHECK: s_and_b64 exec, exec, [[ORIG]]
> >>     +;SI: buffer_store_dword
> >>     +;VI: flat_store_dword
> >>     +;CHECK-NOT: wqm
> >>     +;CHECK: v_cmpx_
> >>     +define <4 x float> @test_kill_1(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, float addrspace(1)* inreg %ptr, i32 %idx, float %data, i32 %coord, i32 %coord2, float %z) #0 {
> >>     +main_body:
> >>     +  %tex = call <4 x float> @llvm.SI.image.sample.i32(i32 %coord, <8 x i32> %rsrc, <4 x i32> %sampler, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0)
> >>     +
> >>     +  %gep = getelementptr float, float addrspace(1)* %ptr, i32 %idx
> >>     +  store float %data, float addrspace(1)* %gep
> >>     +
> >>     +  call void @llvm.AMDGPU.kill(float %z)
> >>     +
> >>     +  ret <4 x float> %tex
> >>     +}
> >>     +
> >>     +declare void @llvm.amdgcn.image.store.v4i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1
> >>     +
> >>     +declare <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #2
> >>     +
> >>     +declare <4 x float> @llvm.SI.image.sample.i32(i32, <8 x i32>, <4 x i32>, i32, i32, i32, i32, i32, i32, i32, i32) #3
> >>     +declare <4 x float> @llvm.SI.image.sample.v4i32(<4 x i32>, <8 x i32>, <4 x i32>, i32, i32, i32, i32, i32, i32, i32, i32) #3
> >>     +
> >>     +declare void @llvm.AMDGPU.kill(float)
> >>     +declare void @llvm.SI.export(i32, i32, i32, i32, i32, float, float, float, float)
> >>     +
> >>     +attributes #0 = { "ShaderType"="0" }
> >>     +attributes #1 = { nounwind }
> >>     +attributes #2 = { nounwind readonly }
> >>     +attributes #3 = { nounwind readnone }
> >>
> >>
> >>     _______________________________________________
> >>     llvm-commits mailing list
> >>     llvm-commits at lists.llvm.org
> >>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> >>

