[clang] [analyzer] Enforce not making overly complicated symbols (PR #144327)
Balázs Benics via cfe-commits
cfe-commits at lists.llvm.org
Tue Jul 8 00:38:53 PDT 2025
balazs-benics-sonarsource wrote:
> > > So, I think if we go with the evalbinop approach, then it should work as efficiently as my original proposal, while sacrificing the special cases that fold away the binop. I'm fine with either of the approaches.
> > > I scheduled a measurement for the evalbinop approach, and I expect the results by tomorrow at the latest.
> >
> >
> > I'm looking forward to it :) I think this evalbinop approach could be a good compromise that eliminates the outliers without messing up the code.
>
> I can already share that the assertion enforcing that we honor Max complexity would fire on about half of the projects in the test pool so far. This is still acceptable for us, but lessens the guarantees of the original patch where this assertion would have held.
It turns out that `evalCast` is another API, besides `evalBinOp`, that creates overly complicated symbols. To assert that we never create such symbols, `evalCast` would also need an early return comparing the complexity against the threshold.
I've not done that, though.
> I'll do another measurement now with no such assert in place, to see if the performance would look good; in other words, whether we still make overly complicated symbols, but rarely enough that we don't care. In this case we would accept this patch.
It took me a while to gather the data. I had to migrate the uses of the internal `evalBinOpXX` APIs in checkers and other recursive places to the top-level `evalBinOp`, to ensure that the check within `evalBinOp` is honored.
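To illustrate the shape of the change, here is a minimal Python sketch (not the actual clang code; all names, the threshold value, and the "conjure a fresh symbol" fallback are illustrative). The point is that only the top-level entry point enforces the complexity cap, so callers that reach for an internal variant silently bypass it:

```python
MAX_COMPLEXITY = 35  # illustrative stand-in for a MaxSymbolComplexity threshold

def complexity(sym):
    """Number of nodes in the symbolic expression tree; atoms count as 1."""
    if isinstance(sym, tuple):  # ("op", lhs, rhs)
        return 1 + complexity(sym[1]) + complexity(sym[2])
    return 1

_conjured = 0

def conjure_symbol():
    """Stand-in for conjuring a fresh opaque symbol."""
    global _conjured
    _conjured += 1
    return f"conj_{_conjured}"

def eval_bin_op(op, lhs, rhs):
    """Top-level API: refuses to build an overly complicated symbol."""
    if 1 + complexity(lhs) + complexity(rhs) > MAX_COMPLEXITY:
        return conjure_symbol()  # early return: the cap is honored
    return eval_bin_op_internal(op, lhs, rhs)

def eval_bin_op_internal(op, lhs, rhs):
    """Internal evalBinOpXX-like variant: builds the symbol unconditionally,
    so calling it directly bypasses the complexity check above."""
    return (op, lhs, rhs)

# Folding a long chain through the top-level API keeps complexity bounded.
s = "x"
for _ in range(1000):
    s = eval_bin_op("+", s, "y")
assert complexity(s) <= MAX_COMPLEXITY
```

Routing every caller through the guarded entry point is what the migration above achieves; a direct `eval_bin_op_internal` chain would grow without limit.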
Once this was done, the data looks more convincing.
It's not a clear win though, as you will see.

This is a log-log scatter plot of the running times of the individual entry points. The plot only charts entry points where both the baseline and the new run took at least 50 milliseconds, leaving 758786 entry points on the plot.
`'abs dRT' := RT_new - RT_base`
`'rel dRT%' := 'abs dRT' / RT_base * 100`
The "outliers" are the entry points where `abs('rel dRT%' - mean('rel dRT%')) > 10 * std('rel dRT%')`, which accounts for 295 of the visualized entry points, i.e. 0.04%. These are circled to make them stand out.
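For concreteness, the two derived metrics and the outlier rule can be reproduced on toy data (a minimal sketch; the column names follow the post, the timings are made-up milliseconds, and since a 10-sigma cut can never fire on a handful of points, the demo uses `k = 1` instead of the `k = 10` used in the real analysis):

```python
import pandas as pd

# Toy per-entry-point timings in milliseconds (made-up numbers).
df = pd.DataFrame({
    "RT_base": [100, 200, 400, 800, 100],
    "RT_new":  [110, 190, 420, 760, 900],  # last row: a large regression
})
df["abs dRT"] = df["RT_new"] - df["RT_base"]
df["rel dRT%"] = df["abs dRT"] / df["RT_base"] * 100

# Outlier rule: |rel dRT% - mean| > k * std.
k = 1
mean = df["rel dRT%"].mean()
std = df["rel dRT%"].std()
outliers = df[(df["rel dRT%"] - mean).abs() > k * std]
```

With these numbers only the last row (a +800% regression) is flagged; the other rows sit within one standard deviation of the mean relative delta.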
Basically, if the dot is on the diagonal, that means it runs roughly as in the baseline. This is the majority of the entry points.
If the dot is on the bottom right, then it ran faster after the change.
Similarly, on the upper left side the entry point ran slower than on the baseline.
The further the dot is from the diagonal, the slower/faster it was relative to the baseline.
The dots in the bottom left corner are the "fast-ish" entry points, as they ran within the expected 10**3 ms (i.e. 1 second) time budget.
If we are looking at the 1 second+ entry points, we can see that this patch greatly improves about 6 cases (marked with 5-10 on the chart).
However, it regresses one entry point, (1) `retdec:parameters.cpp serialize(...)`, from 43 seconds to 93 seconds (+116%).
It also regresses quicker entry points, like (2) `LibreOffice:documen5.cxx` from 0.487 seconds to 4.2 seconds (+768%).
And a bunch of others, like (3) `libuv:signal.c:uv__signal_event` from 3.7 s to 9.4 s (+156%) and (4) `libuv:stream.c:uv_write` from 3.3 s to 9.1 s (+172%).
| **#** | **RT_base** | **RT_new** | **rel dRT%** | **EntryPoint** |
|---|---|---|---|---|
| 1 | 43136 | 93100 | 115.83% | `retdec/src/config/parameters.cpp c:@N@retdec@N@config@S@Parameters@F@serialize<#$@N@rapidjson@S@PrettyWriter>#$@N@rapidjson@S@GenericStringBuffer>#$@N@rapidjson@S@UTF8>#C#$@N@rapidjson@S@CrtAllocator#S2_#S2_#S3_#Vi0>#&S0_#1` |
| 2 | 487 | 4226 | 767.76% | `LibreOffice/core/data/documen5.cxx c:@S@ScDocument@F@GetOldChartParameters#&1$@N@rtl@S@OUString#&$@S@ScRangeList#&b#S4_#` |
| 3 | 3688 | 9423 | 155.50% | `libuv/orkspce/src/unix/signal.c c:signal.c@F@uv__signal_event` |
| 4 | 3352 | 9124 | 172.20% | `libuv/src/unix/stream.c c:@F@uv_write` |
| 5 | 3126 | 120 | -96.16% | `GCC/gcc/sel-sched.cc c:sel-sched.cc@F@moveup_set_expr#**$@S@_list_node#*$@S@rtx_insn#b#` |
| 6 | 33322 | 1292 | -96.12% | `DelugeFirmware/src/OSLikeStuff/fault_handler/fault_handler.c c:@F@printPointers` |
| 7 | 3939 | 207 | -94.74% | `LibreOffice/basic/source/basmgr/basmgr.cxx c:@S@LibraryContainer_Impl@F@getByName#&1$@N@rtl@S@OUString#` |
| 8 | 4422 | 505 | -88.58% | `tensorflow/compiler/mlir/lite/flatbuffer_export.cc c:flatbuffer_export.cc@aN@S@Translator@F@Translator#$@N@mlir@S@ModuleOp#b#b#b#&1$@N@std@S@unordered_set>#$@N@std@N@__cxx11@S@basic_string>#C#$@N@std@S@char_traits>#C#$@N@std@S@allocator>#C#$@N@std@S@hash>#S3_#$@N@std@S@equal_to>#S3_#$@N@std@S@allocator>#S3_#S1_#*$@N@tensorflow@S@OpOrArgNameMapper#` |
| 9 | 2774 | 417 | -84.97% | `LibreOffice/cppuhelper/source/tdmgr.cxx c:tdmgr.cxx@N@cppu@F@createCTD#&1$@N@com@N@sun@N@star@N@uno@S@Reference>#$@N@com@N@sun@N@star@N@container@S@XHierarchicalNameAccess#&1$@N@com@N@sun@N@star@N@uno@S@Reference>#$@N@com@N@sun@N@star@N@reflection@S@XTypeDescription#` |
| 10 | 6525 | 1050 | -83.91% | `tensorflow/compiler/mlir/lite/flatbuffer_export.cc c:flatbuffer_export.cc@aN@S@Translator@F@GetOpcodeIndex#&1$@N@std@N@__cxx11@S@basic_string>#C#$@N@std@S@char_traits>#C#$@N@std@S@allocator>#C#$@N@tflite@E@BuiltinOperator#` |
If we only look at the entry points that ran for at least 1 second in at least one of the measurements, we have these numbers:
```
long.describe(percentiles=[0.0001, 0.001, .01, .1, .25, .5, .75, .9, .99, .999, .9999])
RT_base RT_new abs dRT rel dRT%
count 400184.000000 400184.000000 400184.000000 400184.000000
mean 3469.711195 3468.452514 -1.258681 0.426680
std 1413.859913 1425.623877 286.598363 117.717681
min 7.000000 5.000000 -32030.000000 -99.876695
0.01% 742.018300 685.036600 -4081.414400 -59.523150
0.1% 926.000000 913.000000 -1595.000000 -38.294830
1% 1029.000000 1028.000000 -950.000000 -21.638762
10% 1699.000000 1692.000000 -230.000000 -6.317705
25% 2534.000000 2535.000000 -89.000000 -2.810499
50% 3464.000000 3455.000000 4.000000 0.134168
75% 4289.000000 4301.000000 92.000000 2.890952
90% 5069.000000 5090.000000 238.000000 7.072818
99% 6879.000000 6856.000000 711.000000 18.337760
99.9% 11427.634000 11341.000000 1560.817000 63.733926
99.99% 28564.700700 28336.414400 2705.981700 80.511678
max 116043.000000 138428.000000 49964.000000 69714.285714
```
I can share the anonymized data, where the entry point names are SHA-hashed, if you want to play with it.
<details>
<summary>Here is the script that can visualize the data.</summary>
<pre>
# pip install pandas matplotlib PyQt5  (hashlib is in the standard library)
import hashlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
b = pd.read_csv('baseline/merged_metrics.csv')
n = pd.read_csv('proposed/merged_metrics.csv')
max_orig_num_rows = max(len(b), len(n))
# keep only the columns we need
b = b[['EntryPoint', 'PathRunningTime']]
n = n[['EntryPoint', 'PathRunningTime']]
# Drop the rows where the EntryPoint appears multiple times.
b = b.drop_duplicates(subset=['EntryPoint'])
n = n.drop_duplicates(subset=['EntryPoint'])
# Rename the timing columns to RT_base and RT_new
b.rename(columns={'PathRunningTime': 'RT_base'}, inplace=True)
n.rename(columns={'PathRunningTime': 'RT_new'}, inplace=True)
# Join the three dataframes on the EntryPoint column
joined = pd.merge(b, n, on=['EntryPoint'], how='inner')
joined = joined.sort_values(by='RT_base')
joined.reset_index(inplace=True)
print(f'Dropped about {max_orig_num_rows - len(joined)} ({(max_orig_num_rows - len(joined)) / max_orig_num_rows * 100:.3f}%) rows')
# Create an anonymized EntryPoint name in case I'd want to share this data publicly.
joined['Hash'] = joined['EntryPoint'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
# Save this so that other (more specialized) scripts can start from here.
joined.to_csv('joined.csv', index=False)
# joined = pd.read_csv('joined.csv')
# Calculate the absolute and relative difference
joined['abs dRT'] = joined['RT_new'] - joined['RT_base']
joined['rel dRT%'] = joined['abs dRT'] / joined['RT_base'] * 100
long = joined[(joined['RT_base'] > 1000) | (joined['RT_new'] > 1000)]
long.reset_index(inplace=True)
long.drop(columns=['index'], inplace=True)
print(f'{len(long)} ({len(long) / len(joined) * 100:.3f}%) EntryPoints ran longer than 1 second at least once')
long.describe(percentiles=[0.0001, 0.001, .01, .1, .25, .5, .75, .9, .99, .999, .9999])
relevant = joined[(joined['RT_base'] > 50) & (joined['RT_new'] > 50)]
relevant.reset_index(inplace=True)
# Identify outliers based on relative difference standard deviation
rel_drt_std = relevant['rel dRT%'].std()
rel_drt_mean = relevant['rel dRT%'].mean()
threshold = rel_drt_std * 10
outliers = relevant[abs(relevant['rel dRT%'] - rel_drt_mean) > threshold]
outliers.reset_index(inplace=True)
print(f'Found {len(outliers)} outlier datapoints (|rel dRT% - mean| > {threshold:.2f})')
print(f'Mean rel dRT%: {rel_drt_mean:.2f}%, Std dev: {rel_drt_std:.2f}%')
# Create a side-by-side layout with plot and table
fig, (ax, ax_table) = plt.subplots(1, 2, figsize=(16, 8), width_ratios=[2, 1])
# Create scatter plots with hover functionality
scatter_outliers = ax.scatter(outliers['RT_base'], outliers['RT_new'], s=30, alpha=0.5,
facecolors='none', edgecolors='black', linewidth=1,
label=f'Outliers ({len(outliers)} points, {len(outliers) / len(relevant)*100:.2f}%)')
scatter_relevant = ax.scatter(relevant['RT_base'], relevant['RT_new'], s=0.1, color='red', alpha=0.2)
# Add diagonal lines in semi-transparent colors
max_val = max(relevant['RT_base'].max(), relevant['RT_new'].max())
ax.plot([0, max_val], [0, max_val], color='skyblue', alpha=0.5, linestyle='-', linewidth=1, label='RT_new=RT_base')
RT_new_vs_base_ratio = relevant['RT_new'] / relevant['RT_base']
best_case = relevant.iloc[RT_new_vs_base_ratio.idxmin()]
worst_case = relevant.iloc[RT_new_vs_base_ratio.idxmax()]
num_guides = 5
for nth in range(1, num_guides+1):
curr = round(nth * (worst_case['RT_new'] / worst_case['RT_base']) / num_guides, 1)
ax.plot([0, max_val/curr], [0, max_val], color='orange', alpha=0.5, linestyle='--', linewidth=1, label=f'RT_new=RT_base*{curr}')
#ax.text(max_val/curr, max_val, f'RT_new=RT_base/{curr}', color='orange', alpha=0.7, fontsize=8, ha='left', va='center')
max_ratio_inverse = math.floor(1/(best_case['RT_new'] / best_case['RT_base']))
for nth in range(1, num_guides+1):
curr = round(nth * max_ratio_inverse / num_guides, 1)
ax.plot([0, max_val], [0, max_val/curr], color='green', alpha=0.5, linestyle='--', linewidth=1, label=f'RT_new=RT_base/{curr}')
#ax.text(max_val, max_val/curr, f'RT_new=RT_base*{curr}', color='green', alpha=0.7, fontsize=8, ha='left', va='center')
annotations = []
clicked_points = []
# Clear on right click, show info on left click
def onclick(event):
global annotations, clicked_points
# Only handle clicks on the main plot (not the table)
if event.inaxes != ax:
return
# Right click - clear everything
if event.button == 3:
for annot in annotations:
annot.set_visible(False)
annotations.clear()
clicked_points.clear()
update_table()
return
# Left click - add numbered annotation
if event.button == 1:
distances = (event.xdata - outliers['RT_base'])**2 + (event.ydata - outliers['RT_new'])**2
closest_point = outliers.iloc[distances.idxmin()]
# Create numbered annotation
number = len(annotations) + 1
annot = ax.annotate(str(number),
xy=(closest_point['RT_base'], closest_point['RT_new']),
bbox=dict(boxstyle="circle", edgecolor='none', facecolor='none', alpha=0.7),
fontsize=10, fontweight='bold', color='blue')
annotations.append(annot)
clicked_points.append(closest_point)
update_table()
def update_table():
ax_table.clear()
ax_table.set_xticks([])
ax_table.set_yticks([])
if clicked_points:
# Create table data with more columns for better readability
table_data = []
for i, point in enumerate(clicked_points):
entry_point = point['EntryPoint']
# Truncate long entry points for display
if len(entry_point) > 50:
entry_point = entry_point[:47] + "..."
table_data.append([
f"{i+1}",
f"{point['RT_base']:.2f}",
f"{point['RT_new']:.2f}",
f"{point['rel dRT%']:.2f}%",
entry_point
])
table = ax_table.table(cellText=table_data,
colLabels=['#', 'RT_base', 'RT_new', 'rel dRT%', 'EntryPoint'],
cellLoc='left',
loc='center',
bbox=[0, 0, 1, 1])
# Style the table
table.auto_set_font_size(False)
table.set_fontsize(8)
table.scale(1, 1.5)
# Color header row
for j in range(len(table_data[0])):
table[(0, j)].set_facecolor('#E6E6E6')
table[(0, j)].set_text_props(weight='bold')
# Color alternating rows for better readability
for i in range(len(table_data)):
for j in range(len(table_data[0])):
if i % 2 == 0:
table[(i+1, j)].set_facecolor('#F8F8F8')
fig.canvas.draw()
fig.canvas.mpl_connect("button_press_event", onclick)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Baseline Running Time')
ax.set_ylabel('New Running Time')
ax.legend()
# Configure table subplot
ax_table.set_title('Selected EntryPoints', fontsize=12, fontweight='bold', pad=20)
ax_table.set_xticks([])
ax_table.set_yticks([])
# Add scrollable text widget for full entry point details
from matplotlib.widgets import TextBox
import tkinter as tk
from tkinter import ttk
# Create a separate window for detailed view
def show_details_window():
if not clicked_points:
return
root = tk.Tk()
root.title("EntryPoint Details")
root.geometry("800x600")
# Create frame for buttons
button_frame = tk.Frame(root)
button_frame.pack(fill='x', padx=5, pady=5)
# Create treeview for spreadsheet-like interface
tree = ttk.Treeview(root, columns=('Number', 'RT_base', 'RT_new', 'rel_dRT', 'EntryPoint'), show='headings')
# Define columns
tree.heading('Number', text='#')
tree.heading('RT_base', text='RT_base')
tree.heading('RT_new', text='RT_new')
tree.heading('rel_dRT', text='rel dRT%')
tree.heading('EntryPoint', text='EntryPoint')
# Set column widths
tree.column('Number', width=50)
tree.column('RT_base', width=100)
tree.column('RT_new', width=100)
tree.column('rel_dRT', width=100)
tree.column('EntryPoint', width=400)
# Add scrollbars
vsb = ttk.Scrollbar(root, orient="vertical", command=tree.yview)
hsb = ttk.Scrollbar(root, orient="horizontal", command=tree.xview)
tree.configure(yscrollcommand=vsb.set, xscrollcommand=hsb.set)
# Pack layout
tree.pack(side='left', fill='both', expand=True)
vsb.pack(side='right', fill='y')
hsb.pack(side='bottom', fill='x')
# Populate data
for i, point in enumerate(clicked_points):
tree.insert('', 'end', values=(
i+1,
f"{point['RT_base']:.2f}",
f"{point['RT_new']:.2f}",
f"{point['rel dRT%']:.2f}%",
point['EntryPoint']
))
# Add copy to clipboard functionality
def copy_selected():
selected_items = tree.selection()
if not selected_items:
return
clipboard_text = ""
for item in selected_items:
values = tree.item(item)['values']
clipboard_text += "\t".join(str(v) for v in values) + "\n"
root.clipboard_clear()
root.clipboard_append(clipboard_text.strip())
def copy_all():
all_items = tree.get_children()
if not all_items:
return
clipboard_text = "#\tRT_base\tRT_new\trel dRT%\tEntryPoint\n"
for item in all_items:
values = tree.item(item)['values']
clipboard_text += "\t".join(str(v) for v in values) + "\n"
root.clipboard_clear()
root.clipboard_append(clipboard_text.strip())
# Add buttons
copy_selected_btn = tk.Button(button_frame, text="Copy Selected", command=copy_selected)
copy_selected_btn.pack(side='left', padx=5)
copy_all_btn = tk.Button(button_frame, text="Copy All", command=copy_all)
copy_all_btn.pack(side='left', padx=5)
# Add keyboard shortcuts
def on_key(event):
if event.state & 4: # Ctrl key
if event.keysym == 'c':
copy_selected()
elif event.keysym == 'a':
tree.selection_set(tree.get_children())
tree.bind('<Key>', on_key)
root.mainloop()
# Add button to open details window
from matplotlib.widgets import Button
ax_button = plt.axes([0.83, 0.02, 0.15, 0.05])
button = Button(ax_button, 'View Details')
button.on_clicked(lambda x: show_details_window())
plt.tight_layout()
plt.show()
</pre>
</details>
In conclusion, compared to the originally proposed variant of this patch, this evolved version is more controversial and the benefits are not clear-cut. I don't think it's simpler than the original variant either, so I think we've reached a stalemate.
Unfortunately, I can't invest more time into this PR right now, so I'll close it.
https://github.com/llvm/llvm-project/pull/144327