<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/106199>106199</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[clang][analyzer] Potential Improvement for deduplication strategy in Clang StaticAnalyzer and Clang-Tidy
</td>
</tr>
<tr>
<th>Labels</th>
<td>
clang,
clang-tidy
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
shenjunjiekoda
</td>
</tr>
</table>
<pre>
I hope this message finds you well. I'd like to bring up a discussion about the current deduplication strategy used in both Clang Static Analyzer and Clang-Tidy. I've noticed that *PathDiagnosticConsumers* typically use the warning or error information from a Diagnostic as the basis for deduplication. While this approach works well in many cases, I believe there might be room for improvement in certain scenarios. Allow me to illustrate with an example:
```cpp
#include <stdlib.h> // malloc
#include <string.h> // memset

inline void XXX_memset(void *ptr, int c, size_t n) {
  if (ptr != NULL && (n != (size_t)0)) {
    memset(ptr, c, n); // sink: both overflows are reported at this call
  }
}

void caller1() {
  int arr[100];
  XXX_memset((void *)arr, 0, 101 * sizeof(int)); // overflow1 root cause
}

void caller2() {
  int *arr = (int *)malloc(10 * sizeof(int));
  XXX_memset((void *)arr, 0, 11 * sizeof(int)); // overflow2 root cause
}
```
In this code, we have two potential buffer overflow issues. However, because the sink point for both is within the `XXX_memset` function (specifically, the `memset` call), the current strategy might deduplicate these as a single issue due to identical warning locations and diagnostic messages. This is despite the fact that the notes and triggering paths differ, and the root causes actually lie in the calling functions rather than in `XXX_memset`.
This observation leads me to pose two main questions:
1. Is deduplication based solely on the analyzer's sink point information (or ASTNode location for Clang-Tidy) always appropriate? Should diagnostics with different notes be considered distinct and not deduplicated?
2. Could the current mechanism potentially lead to underreporting in a single invocation of Clang Static Analyzer tools? If we agree that the diagnostics in the example should be treated separately, it's because the root cause of the errors is in the parameter construction at the call sites, not at the sink point or its containing function.
For the first question, I have a preliminary suggestion: we could introduce an option for PathDiagnostic's `Profile` that lets the consumer choose between the current `Profile` and a `FullProfile` (which considers all path information) when generating the `FoldingSetNodeID`. If this approach seems promising, I'd be happy to submit a pull request to implement this feature.
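To make the difference between the two keys concrete, here is a minimal sketch using a toy model. It does not use the real Clang API: `ToyReport`, `sinkKey`, `fullKey`, `countDistinct`, and `exampleReports` are hypothetical names, and the message, location, and note strings are made up for illustration. The assumption is only that the current key is derived from the sink location and message, while a `FullProfile`-style key would also include the path notes.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a path-sensitive report. This is NOT clang's
// PathDiagnostic; every name and string below is hypothetical.
struct ToyReport {
  std::string message;                // warning text at the sink
  std::string sinkLocation;           // "file:line:col" of the sink
  std::vector<std::string> pathNotes; // notes along the triggering path
};

// Key built from the sink only (location + message).
std::string sinkKey(const ToyReport &r) {
  return r.sinkLocation + '\n' + r.message;
}

// Key that also folds in every path note, in the spirit of a FullProfile.
std::string fullKey(const ToyReport &r) {
  std::string key = sinkKey(r);
  for (const std::string &note : r.pathNotes)
    key += '\n' + note;
  return key;
}

// Number of reports that survive deduplication under the given key.
std::size_t countDistinct(const std::vector<ToyReport> &reports,
                          std::string (*keyOf)(const ToyReport &)) {
  std::set<std::string> seen;
  for (const ToyReport &r : reports)
    seen.insert(keyOf(r));
  return seen.size();
}

// The two overflows from the example: same sink, different call sites.
std::vector<ToyReport> exampleReports() {
  const std::string msg = "memset writes past the end of the buffer";
  const std::string sink = "util.h:3:5";
  return {
      {msg, sink, {"Calling 'XXX_memset' from 'caller1'"}},
      {msg, sink, {"Calling 'XXX_memset' from 'caller2'"}},
  };
}

// Checks the behaviour described in the text for the toy model.
void checkExample() {
  const std::vector<ToyReport> reports = exampleReports();
  assert(countDistinct(reports, sinkKey) == 1); // collapsed into one report
  assert(countDistinct(reports, fullKey) == 2); // both reports kept
}
```

In this toy model the sink-only key merges the `caller1` and `caller2` reports, while the key that includes path notes keeps them apart. The real implementation would hash into a `FoldingSetNodeID` rather than concatenate strings; the sketch only illustrates which fields take part in the key.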
Regarding the second question, if we consider all paths in the static analyzer, finding the root cause for appropriate deduplication becomes challenging. Simply considering all notes might not effectively deduplicate based on root causes. Perhaps we could explore a method that only uses notes from the bug visitor and relevant value slice information for deduplication?
I'm very interested in hearing the LLVM/Clang community's thoughts on these points. Thank you for your time and consideration.
</pre>