Skip to content

[clang-doc] Improve clang-doc performance through memoization #96809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 25, 2024

Conversation

PeterChou1
Copy link
Contributor

@PeterChou1 PeterChou1 commented Jun 26, 2024

This patch adds a short circuit which address performance problem referenced in
issue: #96570

We memoize the results after visiting a USR definition. Since the definition is always parsed last so we can safely exit earlier next time we visit the USR because everything relevant to the doc generator would've been parsed already.

From my local testing it cut the runtime of generating LLVM docs from 10 hours to around 1 hour

The only differences between this output and the last are changes due to the racy nature of clang-docs doc generation. Because it is multithreaded the generated infos may be reordered in a different way, but the results remain the same

@llvmbot
Copy link
Member

llvmbot commented Jun 26, 2024

@llvm/pr-subscribers-clang-tools-extra

Author: None (PeterChou1)

Changes

This patch adds a short circuit which address performance problem referenced in
issue: #96570

From my local testing it cut the runtime of generating LLVM docs from 10 hours to 30 mins


Full diff: https://github.com/llvm/llvm-project/pull/96809.diff

1 Files Affected:

  • (modified) clang-tools-extra/clang-doc/Mapper.cpp (+19-1)
diff --git a/clang-tools-extra/clang-doc/Mapper.cpp b/clang-tools-extra/clang-doc/Mapper.cpp
index bb8b7952980ac..b9ca6edd03b8c 100644
--- a/clang-tools-extra/clang-doc/Mapper.cpp
+++ b/clang-tools-extra/clang-doc/Mapper.cpp
@@ -12,11 +12,17 @@
 #include "clang/AST/Comment.h"
 #include "clang/Index/USRGeneration.h"
 #include "llvm/ADT/StringExtras.h"
-#include "llvm/Support/Error.h"
+#include "llvm/Support/Mutex.h"
+#include <unordered_set>
 
 namespace clang {
 namespace doc {
 
+
+static std::unordered_set<std::string> USRVisited;
+
+static llvm::sys::Mutex USRVisitedGuard;
+
 void MapASTVisitor::HandleTranslationUnit(ASTContext &Context) {
   TraverseDecl(Context.getTranslationUnitDecl());
 }
@@ -34,6 +40,17 @@ template <typename T> bool MapASTVisitor::mapDecl(const T *D) {
   // If there is an error generating a USR for the decl, skip this decl.
   if (index::generateUSRForDecl(D, USR))
     return true;
+
+  // Prevent Visiting USR twice
+  {
+    std::lock_guard<llvm::sys::Mutex> Guard(USRVisitedGuard);
+    std::string Visited = USR.str().str();
+    if (USRVisited.count(Visited)) {
+      return true;
+    }
+    USRVisited.insert(Visited);
+  }
+
   bool IsFileInRootDir;
   llvm::SmallString<128> File =
       getFile(D, D->getASTContext(), CDCtx.SourceRoot, IsFileInRootDir);
@@ -41,6 +58,7 @@ template <typename T> bool MapASTVisitor::mapDecl(const T *D) {
                                getLine(D, D->getASTContext()), File,
                                IsFileInRootDir, CDCtx.PublicOnly);
 
+
   // A null in place of I indicates that the serializer is skipping this decl
   // for some reason (e.g. we're only reporting public decls).
   if (I.first)

@PeterChou1 PeterChou1 marked this pull request as draft June 26, 2024 18:51
@PeterChou1 PeterChou1 marked this pull request as ready for review June 26, 2024 20:33
Copy link
Contributor

@ilovepi ilovepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is likely a step in the right direction, I'm a bit skeptical that this patch is entirely correct.

First, I think we need to understand why we only want to cache/memoize the USR strings from definitions.

Second, this requires some more significant testing than is currently present for clang-doc. Additionally, we'd need to evaluate such an invasive change on multiple projects, and ensure that the documentation output has not changed due to this patch. Even then, I'd wouldn't be confident that we'd gotten it correct until tests were in place, and we could examine the differences with, and without this patch.

Lastly, I think its worth discussing the memoization/caching strategy used here, and if its appropriate/scalable. Other clang tools use different mechanisms, and it would be good to avoid reinventing the wheel if we can just adopt one of those that also happens to be well tested, such as clangd's indexing and the search/caching mechanisms used in scandeps.

Comment on lines +53 to +57
// We considered a USR to be visited only when its defined
if (IsDefinition)
USRVisited.insert(Visited);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this logic makes sense. Why is it unreasonable to expect a USR to be valid only when its a definition? Is there some logic that prevents things that aren't definitions from being able to hold documentation? What I'm worried about, is the case where we have several USRs that contain different documentation bits that would have been merged in the past, but now wont.

If there is a concrete reason, please document that here in the comments, and in the commit message.

Copy link
Contributor Author

@PeterChou1 PeterChou1 Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My logic was that definition of the USR is parsed last, so when the ASTVisitor visits the definition it would have already parsed every other USR that points to the same declaration. So we can safely short circuit, since every other fragments of USR would've been parsed already

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My logic was that definition of the USR is parsed last, so when the ASTVisitor visits the definition it would have already parsed every other USR that points to the same declaration. So we can safely short circuit, since every other fragments of USR would've been parsed already

Is it really parsed last in all cases? Isn't it possible to have multiple of these definitions, depending on the scope of the AST construct and the target options. For instance, what if some code is compiled w/ a particular #define enabled and that provides different documentation than was found previously, but is compiled elsewhere without that define? I can easily imagine that affecting header code, where documentation is likely to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I hadn't considered that this was my clumsy attempt at trying to remove the redundant work done by the ASTVisitor. Since we are undeniably doing some redundant work. if you take a look at the Shape class from the e2e test you'll see that we visit the declaration 3 times once for parsing the initial file and the twice more for each subclass. Is there any other mechanism that prevents this type of behaviour? This essentially what this patch is trying to accomplish

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is why we need more tests that exercise a diverse set of behavior. Some of the different targets should control the documentation bits w/ defines, and we should also make sure that we're covering all the ways a USR could be brought in. For instance, if there was only one base class that brought in Shape, I'm not sure how much better you could do, since I'd assume it would only be included once, and then parsed in full the one time.

For more complicated arrangements, we'd need to be sure the definitions, and the documentation would be resolved the same way in all cases. I'm not aware of a mechanism that does that off the top of my head, but I would expect the clangd indexing mechanic probably has a way to handle this on some level. I'd also expect that the logic in scan-deps must have some mechanism for not processing files multiple times.

I'm not confident enough to say what you're proposing is outright wrong, but I'm also not confident that its correct, either. What I'm saying is that we need to be sure, that when we see a definition in clang-doc that it won't change from out beneath us. That said, given the merge logic in the reduction phase, perhaps if the USR is complete(i.e. has no missing fields), the memoization is sufficient.

// Prevent Visiting USR twice
{
std::lock_guard<llvm::sys::Mutex> Guard(USRVisitedGuard);
std::string Visited = USR.str().str();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the logic here. We're using SmallString<128> above, but now you're memoizing a copy of the USR string contents in a Set, using an std::string. Is there a reason we need 2 string types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, can we just construct a single owned string, and use a ‘StringRef’ to handle the rest of the operations? We may need to refactor some of the code slightly, but I think that should be possible.

@ilovepi ilovepi requested a review from petrhosek June 27, 2024 01:04
@PeterChou1
Copy link
Contributor Author

PeterChou1 commented Jun 27, 2024

Second, this requires some more significant testing than is currently present for clang-doc. Additionally, we'd need to evaluate such an invasive change on multiple projects, and ensure that the documentation output has not changed due to this patch. Even then, I'd wouldn't be confident that we'd gotten it correct until tests were in place, and we could examine the differences with, and without this patch.

I agreed I'm working on a patch adding more e2e test to better evaluate the output, I'm also working on comparing the current output of the llvm doc with the previous

I'll update the thread for more updates

@PeterChou1
Copy link
Contributor Author

Here's the diff for the doc output for the different branches
https://drive.google.com/file/d/1QS0YgNaB5gkJqM1HPqm2RziGXA88nqMM/view?usp=sharing

@PeterChou1
Copy link
Contributor Author

CC: @sam-mccall
I was told by @ilovepi and @petrhosek that you might have same insights with libtooling/clang

We are currently facing some performance issues with clang-doc (a documentation generator built on top of libtooling). The problem we're having is that it can take up to 10 hours from clang-doc to run through LLVM compilation database, whereas other tools similar tools can do this much faster.
Currently clang-doc uses RecursiveASTVisitor to visit every relevant decl and serializes it into a custom bitcode which it then deserializes later to generate the docs. The problem is that there is a lot of redundant computation when the ASTVisitor does the traversal, as the same declaration can be visited multiple times. Sometimes clang-doc wants to visit a decl multiple times so we can merge the comments in the declaration and definition. But other times we don't this behaviour such as when ASTVisitor visits a parent class from a derived class and we've already visited the parent already.
I added the above patch to try to cut down on redundant computation. This patch works, and cuts down on the computation of clang-doc significantly however this their is a slight difference between the generated output from this patch, and the main branch. Which is why we're hesitant to land this patch as, we don't fully understand what the effect this patch will have.

Is there a better way to approach this?

Copy link

github-actions bot commented Jul 11, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@PeterChou1
Copy link
Contributor Author

I did some more investigating and ran this patch against several different projects like llvm itself, and other well know libraries like zlib. The only functional difference I found was differences in the USR generated and generation order in the table genned files

The one exceptions to this is we accidently overlooked anonymous typedef like

typedef struct {} Foo; 

This causes the recursiveASTVisitor to visit the declaration once as a typedef and another time as a record. Originally we had the following code which guards against emitting duplicate code when visiting an anonmymous typedef.
This patch breaks from the previous behaviour since once we have visit the anonymous typedef we add the typedef to the memoization list but the guard prevents it from getting serialized. And next time we visit it as a record the USR is already in the list which causes clang-doc to skip it again, leaving nothing to be emitted.

To prevent this I added a check for if a decl is an anonymous typedef and cause it to always process the decl regardless or not if its in the memoization list

@PeterChou1
Copy link
Contributor Author

PeterChou1 commented Jul 11, 2024

Ok nevermind, disregard the above comment I was wrong about the mechanism of the bug.
the source of this bug comes from the way clang-doc handles C code, particularly anonymous typedef in C. When clang-doc encounters an anonymous typedef in C it incorrectly serializes its a name. As a simple example running clang-doc on this file

test.c

/**
 * Foo anon typedef
 */
typedef struct {} Foo;

with cause it to interpret Foo as @nonymous_record_XXXXXXX.

The reason why that is because when checking anonymous typedef name we're explicitly casting it to a CXXDecl. (see code here) Since the above is c code it won't be cast to CXXRecordDecl and hences clang-doc will treat it as an anonymous record.

This becomes a problem when we run the LLVM compilation database. As an example, a record that we were not properly serializing before was LLVMOrcCDependenceMapPair. When clang-doc runs the compilation database of LLVM it first comes across the file OrcV2CBindingsRemovableCode.c under the directory llvm/examples/OrcV2Examples/OrcV2CBindingsRemovableCode. This file causes recursiveASTVisitor to visit LLVMOrcCDependenceMapPair but because its c code its name is not found and it is incorrectly serialized into @nonymous_record_XXXXX. Normally this is not an issue for clang-doc, since we'll visit the record multiple times and the next time we visit a cpp file using LLVMOrcCDependenceMapPair , clang-doc will correctly fill in the name of the declaration.
This also means that the order of the compilation also matters. If clang-doc had parsed a cpp file with LLVMOrcCDependenceMapPair than it will correctly serialized the record. This was why I had so much trouble reproducing the bug when I was using clang-doc --filter option to try and isolate the bug.

Since this patch changes the behaviour to only visiting the a typedef declaration at most once,
we now incorrectly serialized what was LLVMOrcCDependenceMapPair to @nonymous_record_XXXXX since on my machine it encounters LLVMOrcCDependenceMapPair first when its parsing OrcV2CBindingsRemovableCode.c.

The workaround I added still works since when it comes to anonymous typedefs we skip out on memomization. It would be better if we just correctly handle the case of parsing anonymous typedef in C. But this approach is more conservative and safer since it reverts to old behaviour of clang-doc instead of introducing some other bug with parsing typedefs.


namespace clang {
namespace doc {

static llvm::StringSet USRVisited;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static llvm::StringSet USRVisited;
static llvm::StringSet<> USRVisited;

There's a warning about type deduction (-Wctad-maybe-unsupported)

@@ -53,30 +80,34 @@ template <typename T> bool MapASTVisitor::mapDecl(const T *D) {
}

bool MapASTVisitor::VisitNamespaceDecl(const NamespaceDecl *D) {
return mapDecl(D);
return mapDecl(D, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for constant params you should have a comment, per our style guide.

Suggested change
return mapDecl(D, true);
return mapDecl(D, /*isDefinition=*/true);

Copy link
Contributor

@ilovepi ilovepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there's a few clang-format errors, so lets address those, too.

}

bool MapASTVisitor::VisitTypedefDecl(const TypedefDecl *D) {
return mapDecl(D);
return mapDecl(D, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here and below.

Comment on lines 27 to 29
if (const TypedefNameDecl *TD = C->getTypedefNameForAnonDecl()) {
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (const TypedefNameDecl *TD = C->getTypedefNameForAnonDecl()) {
return true;
}
return C->getTypedefNameForAnonDecl();

Isn't this just the same. You still need the outer return, but this doesn't need a branch and temporary.

@ilovepi
Copy link
Contributor

ilovepi commented Jul 11, 2024

Ok nevermind, disregard the above comment I was wrong about the mechanism of the bug. the source of this bug comes from the way clang-doc handles C code, particularly anonymous typedef in C. When clang-doc encounters an anonymous typedef in C it incorrectly serializes its a name. As a simple example running clang-doc on this file

test.c

/**
 * Foo anon typedef
 */
typedef struct {} Foo;

with cause it to interpret Foo as @nonymous_record_XXXXXXX.

The reason why that is because when checking anonymous typedef name we're explicitly casting it to a CXXDecl. (see code here) Since the above is c code it won't be cast to CXXRecordDecl and hences clang-doc will treat it as an anonymous record.

This becomes a problem when we run the LLVM compilation database. As an example, a record that we were not properly serializing before was LLVMOrcCDependenceMapPair. When clang-doc runs the compilation database of LLVM it first comes across the file OrcV2CBindingsRemovableCode.c under the directory llvm/examples/OrcV2Examples/OrcV2CBindingsRemovableCode. This file causes recursiveASTVisitor to visit LLVMOrcCDependenceMapPair but because its c code its name is not found and it is incorrectly serialized into @nonymous_record_XXXXX. Normally this is not an issue for clang-doc, since we'll visit the record multiple times and the next time we visit a cpp file using LLVMOrcCDependenceMapPair , clang-doc will correctly fill in the name of the declaration. This also means that the order of the compilation also matters. If clang-doc had parsed a cpp file with LLVMOrcCDependenceMapPair than it will correctly serialized the record. This was why I had so much trouble reproducing the bug when I was using clang-doc --filter option to try and isolate the bug.

Since this patch changes the behaviour to only visiting the a typedef declaration at most once, we now incorrectly serialized what was LLVMOrcCDependenceMapPair to @nonymous_record_XXXXX since on my machine it encounters LLVMOrcCDependenceMapPair first when its parsing OrcV2CBindingsRemovableCode.c.

The workaround I added still works since when it comes to anonymous typedefs we skip out on memomization. It would be better if we just correctly handle the case of parsing anonymous typedef in C. But this approach is more conservative and safer since it reverts to old behaviour of clang-doc instead of introducing some other bug with parsing typedefs.

Thanks for the detailed analysis. With the current version of the patch, are there any difference between what we get on main and what we get w/ this patch?

@PeterChou1
Copy link
Contributor Author

Ok nevermind, disregard the above comment I was wrong about the mechanism of the bug. the source of this bug comes from the way clang-doc handles C code, particularly anonymous typedef in C. When clang-doc encounters an anonymous typedef in C it incorrectly serializes its a name. As a simple example running clang-doc on this file
test.c

/**
 * Foo anon typedef
 */
typedef struct {} Foo;

with cause it to interpret Foo as @nonymous_record_XXXXXXX.
The reason why that is because when checking anonymous typedef name we're explicitly casting it to a CXXDecl. (see code here) Since the above is c code it won't be cast to CXXRecordDecl and hences clang-doc will treat it as an anonymous record.
This becomes a problem when we run the LLVM compilation database. As an example, a record that we were not properly serializing before was LLVMOrcCDependenceMapPair. When clang-doc runs the compilation database of LLVM it first comes across the file OrcV2CBindingsRemovableCode.c under the directory llvm/examples/OrcV2Examples/OrcV2CBindingsRemovableCode. This file causes recursiveASTVisitor to visit LLVMOrcCDependenceMapPair but because its c code its name is not found and it is incorrectly serialized into @nonymous_record_XXXXX. Normally this is not an issue for clang-doc, since we'll visit the record multiple times and the next time we visit a cpp file using LLVMOrcCDependenceMapPair , clang-doc will correctly fill in the name of the declaration. This also means that the order of the compilation also matters. If clang-doc had parsed a cpp file with LLVMOrcCDependenceMapPair than it will correctly serialized the record. This was why I had so much trouble reproducing the bug when I was using clang-doc --filter option to try and isolate the bug.
Since this patch changes the behaviour to only visiting the a typedef declaration at most once, we now incorrectly serialized what was LLVMOrcCDependenceMapPair to @nonymous_record_XXXXX since on my machine it encounters LLVMOrcCDependenceMapPair first when its parsing OrcV2CBindingsRemovableCode.c.
The workaround I added still works since when it comes to anonymous typedefs we skip out on memomization. It would be better if we just correctly handle the case of parsing anonymous typedef in C. But this approach is more conservative and safer since it reverts to old behaviour of clang-doc instead of introducing some other bug with parsing typedefs.

Thanks for the detailed analysis. With the current version of the patch, are there any difference between what we get on main and what we get w/ this patch?

When sorted by USR, there are no differences between the index between this patch and what's on main,

@ilovepi
Copy link
Contributor

ilovepi commented Jul 16, 2024

When sorted by USR, there are no differences between the index between this patch and what's on main,

I'm not sure I completely follow. are you saying that when you sort the contents of all the files you no longer have any differences? So you're sure the only differences in output are due to ordering? I'd suppose that is rather expected from a racy framework, like clang-doc.

Once the tests PRs are landed, can you rebase this patch, so we can see the affects on the larger tests? Once that's done, we should be able to move forward with this, assuming we don't find any material differences in generated documentation other than ordering.

Copy link
Contributor

@ilovepi ilovepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can confirm that the only differences in documentation generation are due to ordering, then I think it makes sense to land this patch.

@@ -243,6 +243,7 @@ Example usage for a project using a compile commands database:

// Fail early if an invalid format was provided.
std::string Format = getFormatString();
llvm::outs() << "Unoptimized\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: leftover from debugging?

@ilovepi
Copy link
Contributor

ilovepi commented Jul 24, 2024

Seems to be a formatting issue. git clang-format should fix it.

@ilovepi
Copy link
Contributor

ilovepi commented Jul 24, 2024

Oh, can you also update the title and commit message? You're not really adding a short circuit, you're just memoizing visited items, and avoiding reprocessing them.

A bit more context, and a summary of your findings on the changes to output would also be good to have in the commit log.

@PeterChou1 PeterChou1 changed the title [clang-doc] Improve performance by adding a short circuit [clang-doc] Improve clang-doc performance through memoization Jul 24, 2024
@PeterChou1 PeterChou1 merged commit 1b7631a into llvm:main Jul 25, 2024
7 checks passed
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary: 

Test Plan: 

Reviewers: 

Subscribers: 

Tasks: 

Tags: 


Differential Revision: https://phabricator.intern.facebook.com/D60250689
PeterChou1 added a commit that referenced this pull request Aug 8, 2024
Fixes #97507

#96809 introduce non-determinstic behaviour in clang-doc which caused
the output to be randomly ordered which lead to randomly failing test.
This patch modify clang-doc to behave deterministically again by sorting
the output by location or USR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants