Symbol indexer #521

jakebailey · 2019-01-05T00:16:13Z

For #433.
Closes #483.
Closes #484.
Closes #485.

General overview:

Indexing happens on each parse, so with the current LS, every file in the workspace gets indexed. This will need to be reworked.
All symbol queries go to the index, I have not yet approached Query symbol index for document symbols if analysis on document is not complete #486.
I do some extra heuristics to highlight some of the more interesting symbols, namely:
- __init__ in a class will be marked as a constructor.
- Functions annotated with a property decorator are given the kind Property.
- Functions that fit the pattern __.*__ are given the kind Operator.
- Other functions in a class are given the kind Method.
- Variables that look like UPPERCASE_UNDERSCORE_123 at the top level or directly inside a class are given the kind Constant.
Any parse updates the index. I don't do any version management other than to look up the AST on the parse complete event.
VS-specific "functionKind" is preserved.

Things to do in future PRs:

Fallback to analysis when it's present. I left all of the functions previously used there.
Using the import system to answer the question "is this thing being imported a module or not", probably augmented/modified outside the indexer.
Parse version management.
Asynchronous behavior (switching the dictionary to be a map from Uri to TaskCompletionSource<HierarchicalSymbol>).

AlexanderSher · 2019-01-05T02:32:02Z

src/Analysis/Engine/Impl/Indexing/ISymbolIndex.cs

+
+namespace Microsoft.PythonTools.Analysis.Indexing {
+    internal interface ISymbolIndex {
+        Task<IEnumerable<HierarchicalSymbol>> HierarchicalDocumentSymbolsAsync(Uri uri, CancellationToken token = default);


Task<IEnumerable<T>> has weird semantic meaning, something like Lazy<Lazy<T>>. It is better to have Task<IReadOnlyList<T>>.

Yeah, I know. I currently can't do a task which gives back a read only list because I'm using (perhaps abusing) yield return to lazily traverse the index, then a Take at the use of the enumerable to limit how far it goes based on the limits provided by the LS. IAsyncEnumerable hasn't left the proposal phase yet, so I'll have to come up with something else.

Do you actually use Task?

Likely I'll just end up passing limits down to the index when doing lookups, it just feels a lot less clean and inflexible. In any case, everything here in the first commit is just prototype code that I wanted to push somewhere so I wasn't sitting on it anymore. 😃

Missed your question, sorry.

I don't actually use Task, but Mikhail had suggested sticking async on things (and it made sense, especially given how async the new analyzer is turning out). The future would be to have another implementation of this interface that instead goes to something like a sqlite database on disk or similar, so I wasn't sure if it was a good idea to make this completely synchronous.

The other thing that I was considering was cancellation; If a user is searching for a symbol, I believe the editor will cancel the request and send a new one when the input changes, so I think I'd need to be passing the cancellation token down so I can allow it to throw when it's cancelled to stop the query. Unfortunately, that has the unfortunate effect of my IEnumerable being tied to that token, so if a user of this interface were to give the query result to something else, that something else may end up being cancelled unexpectedly where it doesn't make sense.

Perhaps I drop the async on this and make it fully synchronous instead, and make a new interface if we need something that's asynchronous (and have the service provide both?).

I think indexer should be working similar to what async parser and analysis do. I.e.

Get notification on text changes (probably push rather than listening on events to be less dependent on the source),

Cancel current indexing on the changed document

Start internal indexing task asynchronously.

Fire event 'indexing complete' on every document and on overall end of indexing.

It can probably run indexing on several files in parallel. Indexing results on each document should be an immutable snapshot. Snapshot is getting replaced when indexing on the file is done.

Then I probably don't understand what this method does. Is it a method that awaits for the most recent snapshot or it runs indexing over single document?

This method is just the query over the index's state. It does no indexing itself. This is somewhat similar to GetModuleVariables which the indexer will partially replace.

src/Analysis/Engine/Impl/Indexing/IndexerWalker.cs

src/Analysis/Engine/Impl/Indexing/SymbolExtensions.cs

src/Analysis/Engine/Impl/Indexing/SymbolIndex.cs

src/LanguageServer/Impl/Implementation/Server.WorkspaceSymbols.cs

AlexanderSher · 2019-01-07T18:56:37Z

Design question: is it possible that indexing of one file can trigger indexing of another file?

src/LanguageServer/Impl/Implementation/Server.WorkspaceSymbols.cs

jakebailey · 2019-01-07T18:57:52Z

Design question: is it possible that indexing of one file can trigger indexing of another file?

No, I don't think so. Symbols are not analysis dependent, at least not in the way I'm trying to implement it. It's a pure function from file content to symbols.

MikhailArkhipov · 2019-01-07T18:58:35Z

@AlexanderSher - no, it shouldn't as indexing only consumes changes.

AlexanderSher · 2019-01-07T19:06:49Z

no, it shouldn't as indexing only consumes changes.

Then I don't understand how indexing can be async, unless we plan to run on different parts of the document in parallel.
Public wrapper methods that allow some code to wait until indexing is complete will be async of course, but I'm not sure that those methods should be part of indexer, cause they will be parser-dependent.

jakebailey · 2019-01-07T20:15:53Z

The end game here is that the indexer does its own parsing instead of relying on some event from the language server to give it the trees. When the new analyzer comes online, there will no longer be a parse event for every file in the project, so the indexer will need to do its own thing in lieu of being given everything. Technically speaking, that doesn't need to be async either, since the walker is fast and the process would generally be "read file, parse, update index", which is just a synchronous action.

There are two sides; the API for querying (ISymbolIndex), and then the other side which provides ASTs, the parsing, etc, which will need to be changed.

AlexanderSher · 2019-01-07T20:24:38Z

Well, "read file" is an I/O operation, which should be async.

The end game here is that the indexer does its own parsing instead of relying on some event from the language server to give it the trees

Meaning we will be doing the same parsing work twice?

MikhailArkhipov · 2019-01-07T20:34:19Z

It should be async as in indexing in its own task/thread upon receiving document change. When done, set result on the task. Like in https://github.com/MikhailArkhipov/python-language-server/blob/analysis10/src/Analysis/Ast/Impl/Modules/PythonModule.cs#L171 and https://github.com/MikhailArkhipov/python-language-server/blob/analysis10/src/Analysis/Ast/Impl/Modules/PythonModule.cs#L244

MikhailArkhipov · 2019-01-07T21:07:20Z

@AlexanderSher "Meaning we will be doing the same parsing work twice?"

Possibly, but not always. Indexer processes all files in the workspace. New analyzer is only loading open file and dependencies. So there can be AST Provider service where indexer can ask for an AST given URI and if it is not available, create its own. When file does become opened, it can switch then to use live AST and flip back to private when file is closed. Technically analyzer does not need to preserve the AST.

AlexanderSher · 2019-01-07T22:06:11Z

Like in https://github.com/MikhailArkhipov/python-language-server/blob/analysis10/src/Analysis/Ast/Impl/Modules/PythonModule.cs#L171 and https://github.com/MikhailArkhipov/python-language-server/blob/analysis10/src/Analysis/Ast/Impl/Modules/PythonModule.cs#L244

Ok, so Parse(CancellationToken cancellationToken) runs parser.ParseFile synchronously, so async isn't for the parsing, but for the callers to wait for the newest AST. And here it needs to know more about those callers. For example:

If Update(IEnumerable<DocumentChangeSet> changes) is called two times, it is possible that two parser.ParseFile calls are executed concurrently and second one completes before the first one. As a result, GetAstAsync may await on the wrong task and will never get the right one - _parsingTask will be set to null.
If Update is called multiple times, we can get several parsers running concurrently, and I'm not sure if we can solve this with throttling.
If we have changes in several files (e.g., user has undone a rename refactoring), we shouldn't start analysis immediately after the first parse is completed.

And so on. That's why I thought that we need some kind of a service that takes care of all parsing requests and provides awaiters for the code that needs to wait for the most recent parse or index.

MikhailArkhipov · 2019-01-07T23:02:48Z

Next update cancels prev parse. May be helpful to sprinkle parser with cancellation. But in RL parsing takes like 10-20ms tops so cancellation mostly means 'ignore the result'. Throttle is done in standard way in editors, i.e. parse on idle + ~200 ms (not yet implemented). Changes 99% come in a single active file, unless someone stats flipping across files and changing them. So # of parallel parsers depends more on # of imports that can be going in parallel. I haven't seen more ~2-3 so far.

jakebailey · 2019-01-15T00:56:11Z

Dropping the WIP, since I think it's looking alright. Let me know what you think.

All of the old IMember code is still there in the LS symbols file; I didn't delete any of it even though it's no longer referenced.

src/Analysis/Engine/Impl/Indexing/SymbolExtensions.cs

AlexanderSher · 2019-01-15T15:49:00Z

src/Analysis/Engine/Impl/Indexing/SymbolIndex.cs

+
+        public SymbolIndex() { }
+
+        public IEnumerable<HierarchicalSymbol> HierarchicalDocumentSymbols(Uri uri) {


Looks like you can easily return IReadOnlyList<HierarchicalSymbol> here.

That's true, though my intention was that a future PR allows this data to be persisted in a local database, where not needing to collect the entire result beforehand. Though, I guess it's a bit pointless when the HierarchicalSymbol itself contains an IList<HierarchicalSymbol>.

src/Analysis/Engine/Impl/Indexing/SymbolIndex.cs

src/Analysis/Engine/Impl/Indexing/SymbolIndexWalker.cs

…thand names, use queue instead of linked list

src/Analysis/Engine/Impl/Indexing/SymbolIndexWalker.cs

jakebailey · 2019-01-15T23:40:39Z

We figured out that I forgot to specialize for-loops and generators, so I'm doing that now.

jakebailey · 2019-01-16T00:40:39Z

Added for loops and comprehensions. For example:

This to me is much better than other outlines where s would get dumped in the wrong scope.

src/Analysis/Engine/Impl/Indexing/SymbolIndexWalker.cs

jakebailey · 2019-01-16T23:20:51Z

Any last opinions on this?

MikhailArkhipov · 2019-01-16T23:48:53Z

Ship it.

first working indexer prototype

e89df57

AlexanderSher reviewed Jan 5, 2019

View reviewed changes

src/Analysis/Engine/Impl/Indexing/IndexerWalker.cs Outdated Show resolved Hide resolved