<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/54256>54256</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            MemorySSA in SLP Vectorizer

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

            preames

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          preames

      </td>

    </tr>

</table>

<pre>

    This issue exists to motivate and discuss the use of memory SSA inside SLP.  I have an initial patch up for review (https://reviews.llvm.org/D117926), but there were motivation questions raised there.  I'm filing this for the purpose of providing the motivation and discussion of relevant issues for follow through.

I have a motivating example which spends a large amount of time computing memory dependencies in SLP.  Unfortunately, I can not share the example, and my attempts to produce a synthetic reproducer have not fared well.  The key observations from profiling the example are:

* We make a huge number of calls to calculateDependencies.  Here's the output of some additional statics I added downstream.

```

704357 SLP                          - Number of calcDeps actions

         699021 SLP                          - Number of schedule calls

           5598 SLP                          - Number of ReSchedule actions

             59 SLP                          - Number of ReScheduleOnFail actions

          10084 SLP                          - Number of schedule resets

           8523 SLP                          - Number of vector instructions generated

```

* Looking at the profile (via perf), we are bottlenecked by a combination of accessing the AliasCache (the hash for that map is *red* hot), and performing alias queries.

* This makes sense as SLP recomputes all aliasing dependencies for scheduling purposes every time the "scheduling window" changes.  The "scheduling window" changes many, many times on this example due to a combination of a) many vectorizeable trees, and b) trees which straddle long chunks of a basic block.

Taking a close look at the memory dependence handling, I observe that this is largely duplicating the computation of memoryssa through the optimized walker, but invalidated on every change to the scheduling window.

I believe the right thing to do here is to use memory SSA here.  This is a non-trivial task, both because I'm personally fairly new to mssa, and because I keep tripping across additional bits of unexpected complexity (e.g. bugs in the SLP code or bugs in my mental model of memoryssa)

The reasonable objection is that constructing memoryssa for an entire function eagerly when we might not be able to vectorize at all is undesirable.  I agree, but feel the relevant question is at what point does the inherent inefficiency of the current code outweigh the concern about the cost of computing memory SSA.  On this example, the cost of computing memory ssa is approximately 15 ms - as opposed to several seconds on the current code.  I don't yet have a full implementation, so I can not yet do a broader evaluation.  

Now, let me switch focus and argue for incrementalism in the implementation here.  As noted on the original review, the code posted is *slow*.  Very, very slow.  If enabled, simply preserving MSSA takes roughly 10x longer than the old code.

Despite this, I very much want to land this patch, and then work incrementally.  Why?  Because I've investigated the causes of the slowness, and they're sufficiently complex that I really want a) a working baseline, and b) targeted review for each change.

Here's an overview of my findings so far:

* For the insertion of new loads and stores, the renaming bit is the expensive bit.  However, this is not generic insertion. This is specifically insertion of a new instruction which is known to be clobbered by the set of instructions we're vectorizing (and thus about to remove.)  We can exploit that, but doing so locally revealed a silent bug in SLP where the new instruction is not in fact inserted after the current bundle (despite what all the comments say.)

* For the actual scheduling, we need to update the MSSA graph to account for the moving of instructions.  At the moment, this is the hugely slow part of the patch.  However, this update is actually quite limited.  Assuming a correct schedule, this movement can not change the order of any two clobbering defs and reads.  As such, this is not a general move, but instead a highly restricted (and thus optimizeable) one.  It is however, also a very delicate bit of code, which needs careful consideration, tests, and targeted review.  

Now, let me take a moment and address an alternative.  Why can't we simply restructure the existing code to preserve the memory dependencies across schedule window changes?

The short answer is: we can, at the cost of a bunch of complexity.  

The first thing to note is that preserving the memory dependencies requires the *exact same* updates in the exact same places as memoryssa.  This is fundamental.  We might be able to get away with simpler updates (since the data structure is slightly less complicated), but that's a strong "maybe".  

There's two approaches here:

* First, we could construct the memory dependencies for the block up front, and then preserve them through transformation.  This is the simplest of the two options, but has the effect of largely duplicating the functionality MemorySSA provides just specialized for working a block at a time.  

* Second, we could try to preserve the schedule window isolation (e.g. only compute dependencies for things in the schedule window).  This adds a *ton* of complexity.  As in, I haven't yet figured out how to implement it.  I view this option as needing a lot of motivation to justify the complexity, and at the moment, I don't see that motivation at all.  

Personally, I think going with Memory SSA is clearly the best answer.  We might find out it doesn't work for some reason, but we always have that possibility.  If so, we revert a couple changes, and try something else.  I don't see that risk as being a reason not to pursue Memory SSA for SLP in tree.  

</pre>

<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJydWV2T2zYS_DW7LyirtFzv18M-bOxynaviJHX2Jc8gCYqIQIIhQMm6X3_dA1CitHbOFZdrPygQmOmZ6enBlr4-PH9pbVA2hMko89WGGFT0qvPR7nQ0Sve1qm2opoDnrVFTMMo3qjOdHw_q8-cXZftga6M-__zbSqmPqtU7voXHNlrt1KBj1appUI0f1Wh21uzVVfHYxjiEq9uXq-ID_qfnYeXcrlv5cYNH729uHp6K-6vi6ap4p8op8vjRqD2_ZPOs79Vfkwn8IahR22DqtIymXBUPnWqss_0GD-EkLaAPwzQOPvkxjH5n67TibNuF3_wVS0fjzE73MWGVdmu8c36Pd0c_bdrV1fr91folfZ2ROG6KM8xX3Q0OPrQWmITB9HXACqfHDVZ2fsLuOCnazqjKd8Mkb2Wsa8P1pq8sDrd9Bvw_PeyIU49YuQOR-qgqoN_7qEKrARX9yufyY_rVHZSO0XRDijUwqKeKpoZDj-XRVnA2Px2TG9yvwXY18HcO537BtltzUL4MZkyYAZLRd9zuCPrxaIV3Ge0EUPGi_gDaestD2wnO91NX4iw4X2nnxCz8UE0Ofr1fOI6T_4XwIrQpHf0UgRLfCx6g6bq2NAV5FyKMqgLwwEPYXft9H-JodDeH6X6d_8uvD-u3t3cPRFV9998b9cvSzgqWIX6VOJ92OS69f3paFzc_vl2oWlNPAEr8v9hMqbu7p8cf3-zf5vO83betky2f_smGv_YftHXf2_ZmvX58-w-cHk0w8bWRj3fF7Y_vtjNVRE1ahnlK9qmN6c2IHKq_GXMm4s_eb5msOiZukOw1pKid1WowY5MpaC9JrEofo8Ou1RZJVaKSWKml7RNtwAxdVQakkfP_xVkd3mk4yi35pNWhzVSEMzs9gFDw2QuKi_a0PuYDWao834-dGMidSHcj6-DkgPA3aymoYHrQGlYRs9EkCsFzpFR6nfuc8QjtyFHgZ5kZgzI7A8oRIopierFYtbc9qgnPVNXqfiNV-eX_r4KRvVAUv8veQQEyoeaZJmp0IdT-a1CBSXovRdn-1-gS61HQJsxglVwkT2aGjSOKH8uch0FVO_XbILupElhUqnS-2p6x9hedkkFVjh3CITnmzLikYUayr-lqot3EhCaFNaammqjd4a1pcLZKXYCbpdAc3Ut7h6DnVpLIbQBI8BSkq93WjHMjtP0O0ayZ1gQwxSqBTPD46qs4XPSm0jhrdim4o920YjFt8yBKJY3WCguz3y-afe6tWTMAp973b-Jod2z1UYet2OhjixMqzZdTG0YeB9IyoGi0HfGthw6g0IDTx_jNr6CzmAGRtMMg0ahGD_mxIPfSRgnk1JuvAxICQBBRBwETDyw0s9qsgNVGOiWdZEVUHjoFCT8_Rx_sTB-xX4dP3FkgWIPLvCBORtMHpp0v_zTCMIISA175mXaOHZvhZH2hHeMUC0ibqU9vGb0xBGHfmp7E0kkI2GRL1K8ktj9lOjOQJYyzJqResCOXiNrSG6T7nBeNgRMS0VmqzOJIYhVxHL4M3uKT2pvUQW3PmFLW9KZpLFihrw6iQpim0yifJeSmuDc25yb8hTiATip9kmZ4EqQRvxIuyBvY-ut5pdPmv32L6NHqAYz81Xaib9TNHRIGpA-K8wOZqiZQgRXAlg_Go6Ly_SvjBaza90jGqA4mzuKsmYgrDZJMIFa0LPiFjuLympxUjl7X6DXA1k2yFtsus-QX8t075fACiDPsLaVv4yEiJb_BBaC3RnpUNaYTnQ3dnKPndsy19hJoRSp1YQXUq2URJNF8AhIhAiRcmRpKcDSH2P8OguA6IQo-JhwNspJ5VIvDPPuA7mdIYgzDJ5Z7lLYijET011-FSY30rmyNqxPASxzemzDYaCTgiRzl5G4CHHvmJYLmiIhkhEwIMwdEqQg_bpcYuQMM_qM9XN1-UOqnBbHsmL875vhG2FCA4KdhzmB626MfL_YHFA8oxjDldI9wLbNHquWPLHVSldgqrUeLTcQFnQPc2ZvzpkOapwF5wGGMDZp-ZuUzcI76FQj6HdHGC6QeUCOoGmcE5h_U9plg_pCHF8gbM859gyTqkJQpvwLoInXDRAK9FuEAskw0RTWO5hUsYMNDamm_Z-2kVxKlM-NFNaE_Hs9aHRkfQ0tlgZvgc2aLFmsW6is3Yby07REEBh3shsZaQrAl8SQRMkIAZ7Jtb1KMZgqkG2D1FEBWU2IdDyc7YLhiDDhSsGTho_M2SihnZqw9dwCqaPhiOeKEEMMIDD2QewgzmkKeqsjKeXC69Cjjg3UNFHD2n7s00YxnpFOCqZOMrHMxCPmSxnP_Z2oDT31YnXrNIs7YfyKnHRt51qC9SaQ3DRQAslRKdTPqoRXtVFUySM7TLgCi8xcQk1iyrvG0ZJkColIn0S0sHxToGOd6kmL9RuZkc2zIhuPdvya67SBhAJEQWZi6rK48UAKA8wBw3IbR7IS0M_nOskaIr05CX-Tj3s-plERtk2oAlVuHxJphSrSyzGydJwInJ50EFWhTMxlaK0yHMoL6EF2xTLtZkJE2mXK-l8Yi5dWeANEusF0I50FXUPdJwaU-V8uxqTYYzABfR4NGJBrC1pxXchuCdo8n5jrnmL9tPTGN1im0qffUNZwS0tEOyUppvTOJVgm2NEakV24EAgBSZTreINgg3VnajFwamFnuvpbGnCyyaDuOeEmGzrMAmPxSXoXWj7Q17BFly7sh2iOmvZs1-KwWNOsL-GXhkHTfBSLcs7FjWEhb9tGjYlv0uu_5MBrk8JiFEooT2oVJqzvDSk0pf1SYpw_V4HRFCMJJCS5EM1RgrVNjWwlrJfW3UH6Is9J7jQZkoaQlJMBkPg8ZiUGuStDjkVanWJGhHbdDDB3jLfBIBtZnV2nkRvYgvuuFXYtOH0o4VrzGcb5wQc2JHuM4G0SfnLcoop1pChwkyiBTznchnllKRjG5JoQ58UwNLFOtOw1II1KFs_EsxL4s2CtBFo6sRctZvLyyyBBgDE9haxoyERZ-b1abVTu0GoaLT-IEGTfdHcKHPyecJI0RSziu0alZL-jsGslfht4TvIDss0jWM8wiB--LArusIhu8SyJxnnV8n1UMBv5vISyyImfqxW7Iixk-sARzApZFMtDLqwJ74SZJ0lFBn_R0YzcTmzr7cutltDvKWSVaAyKQSkfYOMWCBUIGTDA5L1FY3MFiD0Jrm8NxZk6GzOmhLzvYSeMHkyfx5Z2u9N-L9P7tOJmmDQjVVm18Gp1Rfp8WV92oJ2c0JzfJWWZYIqxlIVPFCQ42jVmZW6lp5bqFF5VplpxzkVdLDvUe0liS6AnsaUvrEu4fecGZ04TahUzJfOGdyUypc81Q4-OMxHrGhfPh5wjMaMOWEShNwj-ZJF2S6TeN_IvAwnnaTnXEJMLQeYTxun6-rZ9un_R1tNGZ51OBZDn1-zzHjtfT6J7Pb_43gHgqV4gtfuH1f_72BtXFGRu_pvt2_HD3tri7v26fm_Xbyuj75h7peqfvH28fCl2uC3PbmLfmob67drqE389Xdz9d3b2_ts_FuijWt-uHm8fisXhYlY-Pj03ZVA-FKaqb9f3V27XptHXHPz9cj89iAy8K8KHjX0VOH2oEZoO8lf2LYuCdMu0reJieIvrYc354LbY_i-H_A8dzwVA">