[clang] [analyzer] Enforce not making overly complicated symbols (PR #144327)
Balázs Benics via cfe-commits
cfe-commits at lists.llvm.org
Tue Jul 8 00:38:53 PDT 2025
balazs-benics-sonarsource wrote:
> > > So, I think if we go with the evalbinop approach, then it should work as efficiently as my original proposal, while sacrificing the special cases that fold away the binop. I'm fine with either of the approaches.
> > > I scheduled a measurement for the evalbinop approach, and I expect the results by tomorrow at the latest.
> >
> >
> > I'm looking forward to it :) I think this evalbinop approach could be a good compromise that eliminates the outliers without messing up the code.
>
> I can already share that the assertion enforcing that we honor Max complexity would fire on about half of the projects in the test pool so far. This is still acceptable for us, but lessens the guarantees of the original patch where this assertion would have held.
It turns out that `evalCast` is another API, besides `evalBinOp`, that creates overly complicated symbols. To assert that we never create such symbols, `evalCast` would also need an early return comparing the complexity against the threshold.
I've not done that, though.
> I'll do another measurement now with no such assert in place, to see if the performance would look good; in other words, whether we still make overly complicated symbols, but rarely enough that we don't care. In this case we would accept this patch.
It took me a while to gather the data. I had to migrate the uses of the internal `evalBinOpXX` APIs in checkers and other recursive places to the top-level `evalBinOp`, to ensure that the check within `evalBinOp` is honored.
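To illustrate the shape of the change, here is a minimal Python sketch (not the actual clang code; all names, the threshold value, and the "conjure a fresh symbol" fallback are illustrative). The point is that only the top-level entry point enforces the complexity cap, so callers that reach for an internal variant silently bypass it:

```python
MAX_COMPLEXITY = 35  # illustrative stand-in for a MaxSymbolComplexity threshold

def complexity(sym):
    """Number of nodes in the symbolic expression tree; atoms count as 1."""
    if isinstance(sym, tuple):  # ("op", lhs, rhs)
        return 1 + complexity(sym[1]) + complexity(sym[2])
    return 1

_conjured = 0

def conjure_symbol():
    """Stand-in for conjuring a fresh opaque symbol."""
    global _conjured
    _conjured += 1
    return f"conj_{_conjured}"

def eval_bin_op(op, lhs, rhs):
    """Top-level API: refuses to build an overly complicated symbol."""
    if 1 + complexity(lhs) + complexity(rhs) > MAX_COMPLEXITY:
        return conjure_symbol()  # early return: the cap is honored
    return eval_bin_op_internal(op, lhs, rhs)

def eval_bin_op_internal(op, lhs, rhs):
    """Internal evalBinOpXX-like variant: builds the symbol unconditionally,
    so calling it directly bypasses the complexity check above."""
    return (op, lhs, rhs)

# Folding a long chain through the top-level API keeps complexity bounded.
s = "x"
for _ in range(1000):
    s = eval_bin_op("+", s, "y")
assert complexity(s) <= MAX_COMPLEXITY
```

Routing every caller through the guarded entry point is what the migration above achieves; a direct `eval_bin_op_internal` chain would grow without limit.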
Once this was done, the data looks more convincing.
It's not a clear win though, as you will see.

This is a log-log scatter plot of the running times of the individual entry points. The plot only charts entry points where both the baseline and the new run took at least 50 milliseconds, leaving 758786 entry points on the plot.
`'abs dRT' := RT_new - RT_base`
`'rel dRT%' := 'abs dRT' / RT_base * 100`
The "outliers" are the entry points where `abs('rel dRT%' - mean('rel dRT%')) > 10 * std('rel dRT%')`, which accounts for 295 of the visualized entry points, i.e. 0.04%. These are circled to make them stand out.
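For concreteness, the two derived metrics and the outlier rule can be reproduced on toy data (a minimal sketch; the column names follow the post, the timings are made-up milliseconds, and since a 10-sigma cut can never fire on a handful of points, the demo uses `k = 1` instead of the `k = 10` used in the real analysis):

```python
import pandas as pd

# Toy per-entry-point timings in milliseconds (made-up numbers).
df = pd.DataFrame({
    "RT_base": [100, 200, 400, 800, 100],
    "RT_new":  [110, 190, 420, 760, 900],  # last row: a large regression
})
df["abs dRT"] = df["RT_new"] - df["RT_base"]
df["rel dRT%"] = df["abs dRT"] / df["RT_base"] * 100

# Outlier rule: |rel dRT% - mean| > k * std.
k = 1
mean = df["rel dRT%"].mean()
std = df["rel dRT%"].std()
outliers = df[(df["rel dRT%"] - mean).abs() > k * std]
```

With these numbers only the last row (a +800% regression) is flagged; the other rows sit within one standard deviation of the mean relative delta.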
Basically, if the dot is on the diagonal, that means it runs roughly as in the baseline. This is the majority of the entry points.
If the dot is on the bottom right, then it ran faster after the change.
Similarly, on the upper left side the entry point ran slower than on the baseline.
The further the dot is from the diagonal, the slower/faster it was relative to the baseline.
The dots in the bottom left corner are the "fast-ish" entry points, as they ran within the expected 10**3 ms (i.e. 1 second) time budget.
If we are looking at the 1 second+ entry points, we can see that this patch greatly improves about 6 cases (marked with 5-10 on the chart).
However, it regresses one entry point, (1) `retdec:parameters.cpp serialize(...)`, from 43 seconds to 93 seconds (+116%).
It also regresses quicker entry points, like (2) `LibreOffice:documen5.cxx` from 0.487 seconds to 4.2 seconds (+768%).
And a bunch of others, like (3) `libuv:signal.c:uv__signal_event` from 3.7 s to 9.4 s (+156%) and (4) `libuv:stream.c:uv_write` from 3.3 s to 9.1 s (+172%).
| **#** | **RT_base** | **RT_new** | **rel dRT%** | **EntryPoint** |
|---|---|---|---|---|
| 1 | 43136 | 93100 | 115.83% | `retdec/src/config/parameters.cpp c:@N@retdec@N@config@S@Parameters@F@serialize<#$@N@rapidjson@S@PrettyWriter>#$@N@rapidjson@S@GenericStringBuffer>#$@N@rapidjson@S@UTF8>#C#$@N@rapidjson@S@CrtAllocator#S2_#S2_#S3_#Vi0>#&S0_#1` |
| 2 | 487 | 4226 | 767.76% | `LibreOffice/core/data/documen5.cxx c:@S@ScDocument@F@GetOldChartParameters#&1$@N@rtl@S@OUString#&$@S@ScRangeList#&b#S4_#` |
| 3 | 3688 | 9423 | 155.50% | `libuv/orkspce/src/unix/signal.c c:signal.c@F@uv__signal_event` |
| 4 | 3352 | 9124 | 172.20% | `libuv/src/unix/stream.c c:@F@uv_write` |
| 5 | 3126 | 120 | -96.16% | `GCC/gcc/sel-sched.cc c:sel-sched.cc@F@moveup_set_expr#**$@S@_list_node#*$@S@rtx_insn#b#` |
| 6 | 33322 | 1292 | -96.12% | `DelugeFirmware/src/OSLikeStuff/fault_handler/fault_handler.c c:@F@printPointers` |
| 7 | 3939 | 207 | -94.74% | `LibreOffice/basic/source/basmgr/basmgr.cxx c:@S@LibraryContainer_Impl@F@getByName#&1$@N@rtl@S@OUString#` |
| 8 | 4422 | 505 | -88.58% | `tensorflow/compiler/mlir/lite/flatbuffer_export.cc c:flatbuffer_export.cc@aN@S@Translator@F@Translator#$@N@mlir@S@ModuleOp#b#b#b#&1$@N@std@S@unordered_set>#$@N@std@N@__cxx11@S@basic_string>#C#$@N@std@S@char_traits>#C#$@N@std@S@allocator>#C#$@N@std@S@hash>#S3_#$@N@std@S@equal_to>#S3_#$@N@std@S@allocator>#S3_#S1_#*$@N@tensorflow@S@OpOrArgNameMapper#` |
| 9 | 2774 | 417 | -84.97% | `LibreOffice/cppuhelper/source/tdmgr.cxx c:tdmgr.cxx@N@cppu@F@createCTD#&1$@N@com@N@sun@N@star@N@uno@S@Reference>#$@N@com@N@sun@N@star@N@container@S@XHierarchicalNameAccess#&1$@N@com@N@sun@N@star@N@uno@S@Reference>#$@N@com@N@sun@N@star@N@reflection@S@XTypeDescription#` |
| 10 | 6525 | 1050 | -83.91% | `tensorflow/compiler/mlir/lite/flatbuffer_export.cc c:flatbuffer_export.cc@aN@S@Translator@F@GetOpcodeIndex#&1$@N@std@N@__cxx11@S@basic_string>#C#$@N@std@S@char_traits>#C#$@N@std@S@allocator>#C#$@N@tflite@E@BuiltinOperator#` |
If we only look at the entry points that ran for at least 1 second in at least one of the measurements, we have these numbers:
```
long.describe(percentiles=[0.0001, 0.001, .01, .1, .25, .5, .75, .9, .99, .999, .9999])
RT_base RT_new abs dRT rel dRT%
count 400184.000000 400184.000000 400184.000000 400184.000000
mean 3469.711195 3468.452514 -1.258681 0.426680
std 1413.859913 1425.623877 286.598363 117.717681
min 7.000000 5.000000 -32030.000000 -99.876695
0.01% 742.018300 685.036600 -4081.414400 -59.523150
0.1% 926.000000 913.000000 -1595.000000 -38.294830
1% 1029.000000 1028.000000 -950.000000 -21.638762
10% 1699.000000 1692.000000 -230.000000 -6.317705
25% 2534.000000 2535.000000 -89.000000 -2.810499
50% 3464.000000 3455.000000 4.000000 0.134168
75% 4289.000000 4301.000000 92.000000 2.890952
90% 5069.000000 5090.000000 238.000000 7.072818
99% 6879.000000 6856.000000 711.000000 18.337760
99.9% 11427.634000 11341.000000 1560.817000 63.733926
99.99% 28564.700700 28336.414400 2705.981700 80.511678
max 116043.000000 138428.000000 49964.000000 69714.285714
```
I can share the anonymized data, where the entry point names are SHA-hashed, if you want to play with it.
<details>
<summary>Here is the script that can visualize the data.</summary>
<pre>
# pip install pandas matplotlib PyQt5  (hashlib is in the standard library)
import hashlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
b = pd.read_csv('baseline/merged_metrics.csv')
n = pd.read_csv('proposed/merged_metrics.csv')
max_orig_num_rows = max(len(b), len(n))
# keep only the columns we need
b = b[['EntryPoint', 'PathRunningTime']]
n = n[['EntryPoint', 'PathRunningTime']]
# Drop the rows where the EntryPoint appears multiple times.
b = b.drop_duplicates(subset=['EntryPoint'])
n = n.drop_duplicates(subset=['EntryPoint'])
# Rename the timing columns to RT_base and RT_new
b.rename(columns={'PathRunningTime': 'RT_base'}, inplace=True)
n.rename(columns={'PathRunningTime': 'RT_new'}, inplace=True)
# Join the three dataframes on the EntryPoint column
joined = pd.merge(b, n, on=['EntryPoint'], how='inner')
joined = joined.sort_values(by='RT_base')
joined.reset_index(inplace=True)
print(f'Dropped about {max_orig_num_rows - len(joined)} ({(max_orig_num_rows - len(joined)) / max_orig_num_rows * 100:.3f}%) rows')
# Create an anonymized EntryPoint name in case I'd want to share this data publicly.
joined['Hash'] = joined['EntryPoint'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
# Save this so that other (more specialized) scripts can start from here.
joined.to_csv('joined.csv', index=False)
# joined = pd.read_csv('joined.csv')
# Calculate the absolute and relative difference
joined['abs dRT'] = joined['RT_new'] - joined['RT_base']
joined['rel dRT%'] = joined['abs dRT'] / joined['RT_base'] * 100
long = joined[(joined['RT_base'] > 1000) | (joined['RT_new'] > 1000)]
long.reset_index(inplace=True)
long.drop(columns=['index'], inplace=True)
print(f'{len(long)} ({len(long) / len(joined) * 100:.3f}%) EntryPoints ran longer than 1 second at least once')
long.describe(percentiles=[0.0001, 0.001, .01, .1, .25, .5, .75, .9, .99, .999, .9999])
relevant = joined[(joined['RT_base'] > 50) & (joined['RT_new'] > 50)]
relevant.reset_index(inplace=True)
# Identify outliers based on relative difference standard deviation
rel_drt_std = relevant['rel dRT%'].std()
rel_drt_mean = relevant['rel dRT%'].mean()
threshold = rel_drt_std * 10
outliers = relevant[abs(relevant['rel dRT%'] - rel_drt_mean) > threshold]
outliers.reset_index(inplace=True)
print(f'Found {len(outliers)} outlier datapoints (|rel dRT% - mean| > {threshold:.2f})')
print(f'Mean rel dRT%: {rel_drt_mean:.2f}%, Std dev: {rel_drt_std:.2f}%')
# Create a side-by-side layout with plot and table
fig, (ax, ax_table) = plt.subplots(1, 2, figsize=(16, 8), width_ratios=[2, 1])
# Create scatter plots with hover functionality
scatter_outliers = ax.scatter(outliers['RT_base'], outliers['RT_new'], s=30, alpha=0.5,
facecolors='none', edgecolors='black', linewidth=1,
label=f'Outliers ({len(outliers)} points, {len(outliers) / len(relevant)*100:.2f}%)')
scatter_relevant = ax.scatter(relevant['RT_base'], relevant['RT_new'], s=0.1, color='red', alpha=0.2)
# Add diagonal lines in semi-transparent colors
max_val = max(relevant['RT_base'].max(), relevant['RT_new'].max())
ax.plot([0, max_val], [0, max_val], color='skyblue', alpha=0.5, linestyle='-', linewidth=1, label='RT_new=RT_base')
RT_new_vs_base_ratio = relevant['RT_new'] / relevant['RT_base']
best_case = relevant.iloc[RT_new_vs_base_ratio.idxmin()]
worst_case = relevant.iloc[RT_new_vs_base_ratio.idxmax()]
num_guides = 5
for nth in range(1, num_guides+1):
curr = round(nth * (worst_case['RT_new'] / worst_case['RT_base']) / num_guides, 1)
ax.plot([0, max_val/curr], [0, max_val], color='orange', alpha=0.5, linestyle='--', linewidth=1, label=f'RT_new=RT_base*{curr}')
#ax.text(max_val/curr, max_val, f'RT_new=RT_base/{curr}', color='orange', alpha=0.7, fontsize=8, ha='left', va='center')
max_ratio_inverse = math.floor(1/(best_case['RT_new'] / best_case['RT_base']))
for nth in range(1, num_guides+1):
curr = round(nth * max_ratio_inverse / num_guides, 1)
ax.plot([0, max_val], [0, max_val/curr], color='green', alpha=0.5, linestyle='--', linewidth=1, label=f'RT_new=RT_base/{curr}')
#ax.text(max_val, max_val/curr, f'RT_new=RT_base*{curr}', color='green', alpha=0.7, fontsize=8, ha='left', va='center')
annotations = []
clicked_points = []
# Clear on right click, show info on left click
def onclick(event):
global annotations, clicked_points
# Only handle clicks on the main plot (not the table)
if event.inaxes != ax:
return
# Right click - clear everything
if event.button == 3:
for annot in annotations:
annot.set_visible(False)
annotations.clear()
clicked_points.clear()
update_table()
return
# Left click - add numbered annotation
if event.button == 1:
distances = (event.xdata - outliers['RT_base'])**2 + (event.ydata - outliers['RT_new'])**2
closest_point = outliers.iloc[distances.idxmin()]
# Create numbered annotation
number = len(annotations) + 1
annot = ax.annotate(str(number),
xy=(closest_point['RT_base'], closest_point['RT_new']),
bbox=dict(boxstyle="circle", edgecolor='none', facecolor='none', alpha=0.7),
fontsize=10, fontweight='bold', color='blue')
annotations.append(annot)
clicked_points.append(closest_point)
update_table()
def update_table():
ax_table.clear()
ax_table.set_xticks([])
ax_table.set_yticks([])
if clicked_points:
# Create table data with more columns for better readability
table_data = []
for i, point in enumerate(clicked_points):
entry_point = point['EntryPoint']
# Truncate long entry points for display
if len(entry_point) > 50:
entry_point = entry_point[:47] + "..."
table_data.append([
f"{i+1}",
f"{point['RT_base']:.2f}",
f"{point['RT_new']:.2f}",
f"{point['rel dRT%']:.2f}%",
entry_point
])
table = ax_table.table(cellText=table_data,
colLabels=['#', 'RT_base', 'RT_new', 'rel dRT%', 'EntryPoint'],
cellLoc='left',
loc='center',
bbox=[0, 0, 1, 1])
# Style the table
table.auto_set_font_size(False)
table.set_fontsize(8)
table.scale(1, 1.5)
# Color header row
for j in range(len(table_data[0])):
table[(0, j)].set_facecolor('#E6E6E6')
table[(0, j)].set_text_props(weight='bold')
# Color alternating rows for better readability
for i in range(len(table_data)):
for j in range(len(table_data[0])):
if i % 2 == 0:
table[(i+1, j)].set_facecolor('#F8F8F8')
fig.canvas.draw()
fig.canvas.mpl_connect("button_press_event", onclick)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Baseline Running Time')
ax.set_ylabel('New Running Time')
ax.legend()
# Configure table subplot
ax_table.set_title('Selected EntryPoints', fontsize=12, fontweight='bold', pad=20)
ax_table.set_xticks([])
ax_table.set_yticks([])
# Add scrollable text widget for full entry point details
from matplotlib.widgets import TextBox
import tkinter as tk
from tkinter import ttk
# Create a separate window for detailed view
def show_details_window():
if not clicked_points:
return
root = tk.Tk()
root.title("EntryPoint Details")
root.geometry("800x600")
# Create frame for buttons
button_frame = tk.Frame(root)
button_frame.pack(fill='x', padx=5, pady=5)
# Create treeview for spreadsheet-like interface
tree = ttk.Treeview(root, columns=('Number', 'RT_base', 'RT_new', 'rel_dRT', 'EntryPoint'), show='headings')
# Define columns
tree.heading('Number', text='#')
tree.heading('RT_base', text='RT_base')
tree.heading('RT_new', text='RT_new')
tree.heading('rel_dRT', text='rel dRT%')
tree.heading('EntryPoint', text='EntryPoint')
# Set column widths
tree.column('Number', width=50)
tree.column('RT_base', width=100)
tree.column('RT_new', width=100)
tree.column('rel_dRT', width=100)
tree.column('EntryPoint', width=400)
# Add scrollbars
vsb = ttk.Scrollbar(root, orient="vertical", command=tree.yview)
hsb = ttk.Scrollbar(root, orient="horizontal", command=tree.xview)
tree.configure(yscrollcommand=vsb.set, xscrollcommand=hsb.set)
# Pack layout
tree.pack(side='left', fill='both', expand=True)
vsb.pack(side='right', fill='y')
hsb.pack(side='bottom', fill='x')
# Populate data
for i, point in enumerate(clicked_points):
tree.insert('', 'end', values=(
i+1,
f"{point['RT_base']:.2f}",
f"{point['RT_new']:.2f}",
f"{point['rel dRT%']:.2f}%",
point['EntryPoint']
))
# Add copy to clipboard functionality
def copy_selected():
selected_items = tree.selection()
if not selected_items:
return
clipboard_text = ""
for item in selected_items:
values = tree.item(item)['values']
clipboard_text += "\t".join(str(v) for v in values) + "\n"
root.clipboard_clear()
root.clipboard_append(clipboard_text.strip())
def copy_all():
all_items = tree.get_children()
if not all_items:
return
clipboard_text = "#\tRT_base\tRT_new\trel dRT%\tEntryPoint\n"
for item in all_items:
values = tree.item(item)['values']
clipboard_text += "\t".join(str(v) for v in values) + "\n"
root.clipboard_clear()
root.clipboard_append(clipboard_text.strip())
# Add buttons
copy_selected_btn = tk.Button(button_frame, text="Copy Selected", command=copy_selected)
copy_selected_btn.pack(side='left', padx=5)
copy_all_btn = tk.Button(button_frame, text="Copy All", command=copy_all)
copy_all_btn.pack(side='left', padx=5)
# Add keyboard shortcuts
def on_key(event):
if event.state & 4: # Ctrl key
if event.keysym == 'c':
copy_selected()
elif event.keysym == 'a':
tree.selection_set(tree.get_children())
tree.bind('<Key>', on_key)
root.mainloop()
# Add button to open details window
from matplotlib.widgets import Button
ax_button = plt.axes([0.83, 0.02, 0.15, 0.05])
button = Button(ax_button, 'View Details')
button.on_clicked(lambda x: show_details_window())
plt.tight_layout()
plt.show()
</pre>
</details>
In conclusion, compared to the originally proposed variant of this patch, this evolved version is more controversial and the benefits are not clear-cut. I don't think it's simpler than the original variant either, so I think we've reached a stalemate.
Unfortunately, I can't invest more time into this PR right now, so I'll close it.
https://github.com/llvm/llvm-project/pull/144327