[PATCH] D100993: [CSSPGO] Fix incorrect prorating indirect call distribution factor that leads to target count loss.

Wenlei He via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 22 10:14:00 PDT 2021


wenlei added inline comments.


================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:835
+  // excluding promoted targets.
+  if (FullSum)
+    if (Optional<PseudoProbe> Probe = extractProbe(Inst))
----------------
hoy wrote:
> hoy wrote:
> > wenlei wrote:
> > > hoy wrote:
> > > > wenlei wrote:
> > > > > hoy wrote:
> > > > > > wenlei wrote:
> > > > > > > This makes sure that the count for the indirect call instruction accurately reflects the counts of the remaining indirect call targets. However, I thought the later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, later ICP should not be affected. Is that correct?
> > > > > > > 
> > > > > > > Is the problem here more about the pre-link sample loader ICP screwing up the distribution factor, so the post-link sample loader ICP sees a smaller count for the icall instruction, which blocks post-link sample loader ICP? In this case, if we look at call site target counts from the profile, with promoted targets excluded, we would arrive at the correct sum even without the distribution factor? Relying on the distribution factor could get us the same result, but using prorating to mimic subtraction is a bit weird...
> > > > > > > 
> > > > > > > I can see how an accurate distribution factor helps count quality in general, though - it would make counts for the icall more consistent before running inference. On that note, how do we make sure promoted calls get correct counts as inference input?
> > > > > > Yes, this is supposed to make sure that the indirect call and its enclosing block will have a correct count during sample annotation, but it looks like doing it here is too late. I just realized this is after `computeAndPropagateWeights`. Ideally, the distribution factor should be updated before `computeAndPropagateWeights` and the call target samples should be adjusted with the pre-updated factor. That means `updateIDTMetaData` should be done before `computeAndPropagateWeights`.
> > > > > > 
> > > > > > The prelink problem doesn't need this to be fixed. We have the pseudo probe update pass that updates distribution factor based on BFI to handle that. 
> > > > > > Ideally, the distribution factor should be updated before computeAndPropagateWeights and the call target samples should be adjusted with the pre-updated factor. 
> > > > > 
> > > > > Agreed, that ensures good input for inference. 
> > > > > 
> > > > > > That means updateIDTMetaData should be done before computeAndPropagateWeights. 
> > > > > 
> > > > > I think that update is meant for value profile metadata, which is orthogonal to computeAndPropagateWeights. In fact, we always set metadata after inference is done. Is it possible to decouple the distribution factor setting from metadata update?
> > > > > 
> > > > > > The prelink problem doesn't need this to be fixed. We have the pseudo probe update pass that updates distribution factor based on BFI to handle that. 
> > > > > 
> > > > > Ok, then does this cause an immediate problem for ICP (does the fix generate good results for xalancbmk without setting the icp threshold to 2)?
> > > > > 
> > > > > > On that note, how do we make sure promoted calls get correct counts as inference input?
> > > > > 
> > > > > Taking another look, line 891 of SampleProfile.cpp takes care of this?
> > > > > 
> > > > > 
> > > > > Is it possible to decouple the distribution factor setting from metadata update?
> > > > 
> > > > They are coupled in this patch because the target counts from the profile need to be prorated based on a pre-updated factor (say from prelink or inlining, but not ICP). The ICP-updated factor is the final one and is needed before `computeAndPropagateWeights`. Decoupling the ICP factor update from the metadata update means we need to save the original factor somewhere, which could possibly be done using an intermediate table.
> > > > 
> > > > On the other hand, the indirect call metadata update may not rely on `computeAndPropagateWeights`, unlike the branch metadata update. But separating them seems like a bigger change and would be less understandable.
> > > > 
> > > > > Ok, then does this cause an immediate problem for ICP (does the fix generate good results for xalancbmk without setting the icp threshold to 2)?
> > > > 
> > > > It causes less promotion with later PGO ICP, since the early prorating eats into the target counts of non-promoted targets. I'm seeing that this helps some benchmarks like perlbench and gcc, but regresses others like h264. I guess that's due to the heuristics of PGO ICP. Overall, no change in geomean.
> > > > 
> > > > > Taking another look, line 891 of SampleProfile.cpp takes care of this?
> > > > 
> > > > Yes.
> > > > Decoupling the ICP factor update from the metadata update means we need to save the original factor somewhere, which could possibly be done using an intermediate table.
> > > 
> > > My understanding: `final_factor = original_factor * (remaining_counts / total_counts)`, and the original factor is simply `Probe->Factor`. Why do we need to save the original factor? Did I miss anything?
> > > 
> > > > On the other hand, the indirect call metadata update may not rely on computeAndPropagateWeights, unlike the branch metadata update. But separating them seems like a bigger change and would be less understandable.
> > > 
> > > If the icall metadata update doesn't rely on inference, it's ok to set icall metadata before inference. But that "breaks" the high-level structure of doing inference first, followed by a metadata update to persist the result onto metadata, hence a bit inconsistent.
> > > 
> > > > It causes less promotion with later PGO ICP, since the early prorating eats into the target counts of non-promoted targets.
> > > 
> > > Hmm... this goes back to my original question in an earlier comment: "However, I thought the later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, later ICP should not be affected. Is that correct?"
> > > 
> > > I now see this: `T = SampleRecord::adjustCallTargets(T.get(), Probe->Factor);`, which explains why a smaller factor can screw up target counts and value profile metadata, in addition to the block count.
> > > 
> > > This part still looks fishy. It's reasonable to scale down all target counts when a call site is duplicated. But if the factor is also used to get the correct remaining count for the icall itself, scaling down all target counts using such a factor is not right. I understand you now set the factor later to avoid scaling down targets in the same pass, but a previous pass can still set a factor even when there's no call site duplication.
> > > 
> > > As I understand it, in the ICP case the original motivation for the distribution factor is simply to account for call site duplication. If we stick to that, scaling down all targets is fine. Problems arise when we use the distribution factor to mimic the effect of subtraction (of promoted target counts). The fix now works around the issue by setting the distribution factor late, but I still feel using the distribution factor to mimic subtraction is a bit hacky, and gives the distribution factor an inconsistent meaning (one that is incompatible with using it to scale down all targets). I hope we can find a better solution; let's discuss tomorrow.
> > > My understanding: final_factor = original_factor * (remaining_counts / total_counts), and the original factor is simply Probe->Factor. Why do we need to save the original factor? Did I miss anything?
> > 
> > The original factor, instead of the final factor, is used to scale down target counts, i.e., in `SampleRecord::adjustCallTargets`. This is to make sure later PGO ICP sees the right target counts in the presence of code duplication. The final factor is used to scale down the indirect call site count, which is used in `computeAndPropagateWeights`. Since `adjustCallTargets` currently runs after `computeAndPropagateWeights`, `original_factor` needs to be saved somewhere.
> > 
> > > Hmm... this goes back to my original question in an earlier comment: "However, I thought the later PGO ICP pass relies on the value profiles we put onto metadata, so even if the count for that icall instruction is a bit off, later ICP should not be affected. Is that correct?"
> > 
> > Right, callsite counts only affect BFI, not PGO ICP. `SampleRecord::adjustCallTargets` affects ICP, and they use different factors, i.e., the final factor and the original factor, respectively.
> > 
> > > This part still looks fishy. It's reasonable to scale down all target counts when a call site is duplicated. But if the factor is also used to get the correct remaining count for the icall itself, scaling down all target counts using such a factor is not right. I understand you now set the factor later to avoid scaling down targets in the same pass, but a previous pass can still set a factor even when there's no call site duplication.
> > 
> > It was wrong because we used the final factor to scale down all target counts. Now we are using the original factor, which could come from code duplication or from prelink sample ICP. When it comes from prelink sample ICP and we are at postlink, the promoted targets being scaled down here (whose counts are then wrong) will be ignored, since they are known to be promoted via the magic number NOMORE_ICP_MAGICNUM in the callsite metadata.
> > 
> Hmm, there still seems to be a problem with this approach. If the original factor comes from prelink sample ICP, using it to scale down target counts in postlink doesn't make sense. Looks like we need two factors: one for code duplication, the other for ICP.
> Hmm, there still seems to be a problem with this approach. If the original factor comes from prelink sample ICP, using it to scale down target counts in postlink doesn't make sense. Looks like we need two factors: one for code duplication, the other for ICP.

Yeah, this is exactly what I meant in my last reply. Fundamentally, the problem is trying to use the factor that is supposed to handle duplication to also adjust the remaining total count for the icall after ICP. The two usages are incompatible; moving the order only works around the issue for things that happen in the same pass. Two factors would work, but that adds complexity. How does AutoFDO handle this?
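
To make the incompatibility concrete, here is a minimal standalone sketch of the arithmetic (made-up counts, not the actual SampleProfile.cpp logic; DupFactor/FinalFactor are hypothetical names):

  #include <cstdint>
  #include <iostream>

  int main() {
    // One indirect call with profiled targets A:70, B:20, C:10 (total 100).
    // The call site was duplicated once (duplication factor 0.5), and
    // sample loader ICP then promoted target A (count 70).
    const double DupFactor = 0.5;
    const uint64_t Total = 100;        // A + B + C
    const uint64_t PromotedCount = 70; // target A

    // Folding the promotion into the same factor:
    const double FinalFactor =
        DupFactor * double(Total - PromotedCount) / Total; // 0.15

    // FinalFactor is right for the icall's own remaining count ...
    std::cout << "icall remaining: " << Total * FinalFactor << "\n"; // 15

    // ... but prorating the *remaining* targets with it understates them:
    std::cout << "B prorated: " << 20 * FinalFactor << "\n"; // 3,   want 10
    std::cout << "C prorated: " << 10 * FinalFactor << "\n"; // 1.5, want 5

    // Duplication-only scaling gives what later PGO ICP should see:
    std::cout << "B scaled:   " << 20 * DupFactor << "\n"; // 10
    std::cout << "C scaled:   " << 10 * DupFactor << "\n"; // 5
    return 0;
  }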

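And a rough sketch of what the two-factor direction could look like (hypothetical CallSiteProfile type, not an existing LLVM interface): the duplication factor scales every target uniformly, while the icall's remaining count comes from subtracting promoted targets rather than from prorating.

  #include <cstdint>
  #include <map>
  #include <set>
  #include <string>

  // Hypothetical helper, for illustration only.
  struct CallSiteProfile {
    std::map<std::string, uint64_t> TargetCounts; // raw counts from profile
    double DupFactor = 1.0; // accounts for call site duplication only

    // Per-target count for value profile metadata: duplication scaling
    // only, so every target is scaled uniformly and consistently.
    uint64_t targetCount(const std::string &Name) const {
      auto It = TargetCounts.find(Name);
      return It == TargetCounts.end()
                 ? 0
                 : static_cast<uint64_t>(It->second * DupFactor);
    }

    // Remaining count for the indirect call after ICP: subtract promoted
    // targets instead of folding them into the factor.
    uint64_t remainingCount(const std::set<std::string> &Promoted) const {
      uint64_t Sum = 0;
      for (const auto &[Name, Count] : TargetCounts)
        if (!Promoted.count(Name))
          Sum += static_cast<uint64_t>(Count * DupFactor);
      return Sum;
    }
  };

With that split, `SampleRecord::adjustCallTargets` and the block count computation would each consume a factor whose meaning matches its use.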

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100993/new/

https://reviews.llvm.org/D100993


