[PATCH] D134130: [clangd] Add doxygen parsing for Hover [1/3]

Tue Dec 13 08:19:12 PST 2022

kadircet added a comment.

Hi!

Sorry for letting this series of patches sit around without any comments. We were having some discussions internally to understand both the value proposition and its implications for the infrastructure.
So it'd help a lot if you could provide some more information about use cases. That way we can find a nice scope for this functionality, one that provides most of the value without introducing unnecessary complexity into the rest of the infrastructure. We should also try not to regress indexing of projects that don't have doxygen comments.

So first of all, what are the exact use cases you're planning to address/improve with support for doxygen parsing of comments? A few that come to mind:

- obtaining docs about params & return value
- stripping doxygen commands
- treating brief/detail/warning/note differently
- formatting text within comments (bold etc)
- getting linebreaks/indent right (clangd#1040 <https://github.com/clangd/clangd/issues/1040>)

any other use cases that you believe are important?

as you might've noticed, this list already talks about dealing with certain doxygen commands (but not all).
that list is gathered by counting occurrences of those commands in a codebase with lots of open-source third_party code. findings and some initial ideas look like this:

- \brief: ~70k occurrences
  - common but usefulness in practice is unclear
  - can infer for non-doxy too (e.g. first sentence of a regular documentation)
  - maybe just strip (or merge into regular documentation)?
- \return[s]: 30k occurrences
  - unclear if it's worth separating in hover, because it might be tied to the rest of the documentation (re-ordering concerns)
  - can infer for non-doxy maybe?
  - probably just strip the command and keep the rest as-is.
- \param: 28k occurrences
  - useful for signature help. maybe hover on func calls
  - probably worth storing in a structured manner.
- \detail[s]: 2k
- \p: 20k
- \code: 1k
- \warning: 2k
- \note: 9k
  - (for all of the above) just introduce as formatted text?
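
For illustration only, a minimal Python sketch of this kind of counting. This is not the script that produced the numbers above; the regex, the `count_commands` name and the choice to accept both `\` and `@` prefixes are assumptions made for the example.

```python
import re
from collections import Counter

# Matches the doxygen commands discussed above, written as \cmd or @cmd.
# "param" is listed before "p", and \b keeps \p from matching \pre etc.
COMMAND_RE = re.compile(
    r"[\\@](brief|returns?|param|details?|p|code|warning|note)\b"
)


def count_commands(comments):
    """Count command occurrences across an iterable of comment strings."""
    counts = Counter()
    for text in comments:
        for match in COMMAND_RE.finditer(text):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "\\brief Adds two numbers.\n\\param a first\n\\param b second",
        "Uses \\p a as the base. \\returns the sum",
    ]
    for name, n in count_commands(sample).most_common():
        print(name, n)
```

A real run would first have to extract the comment text from the sources, which this sketch leaves out.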

what do you think about those conclusions? any other commands that you think are worth giving a thought?
One important concern we've noticed around this part is that re-ordering comment components might actually hinder readability. comments are usually written with the assumption that they'll be read from top to bottom, so if we re-order them during presentation (e.g. hover has its own layout) we might start referring to concepts/entities in documentation before they're introduced. so we believe it's important to avoid any sort of re-ordering. this is one of the big downsides of preserving parameter comments in a structured way.
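
To make the "no re-ordering" idea concrete, here is a rough Python sketch that rewrites command markers in place and never moves a line. The `render_in_order` name, the label texts and the set of handled commands are assumptions for illustration; it also ignores forms such as `\param[in]`.

```python
import re

# Markers that are dropped while their text stays where it was written.
STRIP_RE = re.compile(r"[\\@](?:brief|details?)\b\s*")
# Markers that are turned into a plain-text label at the same position.
RETURN_RE = re.compile(r"[\\@]returns?\b\s*")
PARAM_RE = re.compile(r"[\\@]param\b\s*(\w+)\s*")


def render_in_order(comment):
    """Strip or relabel commands line by line, preserving the source order."""
    out = []
    for line in comment.splitlines():
        line = STRIP_RE.sub("", line)
        line = RETURN_RE.sub("Returns: ", line)
        line = PARAM_RE.sub(lambda m: f"{m.group(1)}: ", line)
        out.append(line)
    return "\n".join(out)
```

Because every line keeps its position, a sentence that refers back to a parameter still appears after the parameter's own description.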

another important thing to consider is trying to heuristically infer some of these fields for non-doxygen code bases as well. that way we can provide a similar experience for both.
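
As one possible heuristic for the non-doxygen case, a brief could be taken from the first sentence of the comment. The Python sketch below is an assumption about how that might look (the `infer_brief` name and the sentence-boundary rule are made up here), and it will misfire on abbreviations such as "e.g." followed by a space.

```python
import re


def infer_brief(comment):
    """Return the first sentence of a plain comment as a stand-in for \\brief."""
    # Collapse line breaks and runs of whitespace so a wrapped sentence is one unit.
    text = " ".join(comment.split())
    if not text:
        return ""
    # A sentence ends at '.', '!' or '?' followed by whitespace or end of text,
    # so the dot in something like "1.2" is not treated as a boundary.
    match = re.search(r"[.!?](?=\s|$)", text)
    return text[: match.end()] if match else text
```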

some other things to discuss about the design overall:

- How to store the extra information?
  - Proposal from our side would be to introduce structured storage for the pieces we want (limited), and keep the rest as part of main documentation text while doing stripping/reformatting.
- What to use as a parser?
  - Clang's doxygen parser actually looks like a great piece of code to re-use. It's unfortunate that it can issue diagnostics, requires an AST etc. It'd be great to refactor that into a state where we can use it without any AST or diagnostics, and with a minimal SourceManager (this seems feasible to achieve at first glance, as most of these inputs seem to be optional or used in 1 or 2 places).
  - we still need to make sure performance and behaviour on non-doxygen is reasonable though. do you have any numbers here?
- How to store in the index?
  - If we can strip the parser of its dependencies on an ASTContext, diagnostics etc., the best option would be to just store the raw text and run the whole pipeline on demand (e.g. do the doxygen parsing and markdown-ization afterwards). This is the simplest approach as it keeps index interfaces the same.
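
A toy Python sketch of the "store raw, parse on demand" shape, combined with limited structured storage for params. Nothing here is clangd's API: `ParsedDoc`, `parse_on_demand`, `hover` and the dict standing in for the index are all hypothetical, and the regexes are a stand-in for a real doxygen parser.

```python
import re
from dataclasses import dataclass, field

PARAM_RE = re.compile(r"^\s*[\\@]param\b\s*(\w+)\s*(.*)$")
MARKER_RE = re.compile(r"[\\@](?:brief|details?|returns?)\b\s*")


@dataclass
class ParsedDoc:
    text: str                                   # main documentation, markers stripped
    params: dict = field(default_factory=dict)  # the only structured piece


def parse_on_demand(raw):
    """Split a raw comment into param docs and remaining text, in source order."""
    params, kept = {}, []
    for line in raw.splitlines():
        m = PARAM_RE.match(line)
        if m:
            params[m.group(1)] = m.group(2).strip()
        else:
            kept.append(MARKER_RE.sub("", line))
    return ParsedDoc(text="\n".join(kept), params=params)


# The stand-in index keeps only raw comment text, so its interface is unchanged.
index = {
    "open_file": "\\brief Opens a file.\n\\param path file to open\n\\return a handle"
}


def hover(symbol):
    # Parsing happens at request time, not at indexing time.
    return parse_on_demand(index[symbol])
```

With this shape the cost of parsing is paid only for symbols that are actually hovered, and projects without doxygen comments pay nothing extra at indexing time.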

Happy to move the discussion to some other medium as well, if you would like to have it on discourse/github etc.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134130/new/

https://reviews.llvm.org/D134130