Skip to content

Dedup imports_granularity = "Item" #4737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions src/formatting/imports.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@ use std::borrow::Cow;
use std::cmp::Ordering;
use std::fmt;

use itertools::Itertools;

use rustc_ast::ast::{self, UseTreeKind};
use rustc_span::{
symbol::{self, sym},
Expand Down Expand Up @@ -183,6 +185,8 @@ pub(crate) fn merge_use_trees(use_trees: Vec<UseTree>, merge_by: SharedPrefix) -
}

pub(crate) fn flatten_use_trees(use_trees: Vec<UseTree>) -> Vec<UseTree> {
// Return non-sorted single occurance of the use-trees text string;
// order is by first occurance of the use-tree.
use_trees
.into_iter()
.flat_map(UseTree::flatten)
Expand All @@ -197,6 +201,7 @@ pub(crate) fn flatten_use_trees(use_trees: Vec<UseTree>) -> Vec<UseTree> {
}
tree
})
.unique_by(|ut| format!("{}", ut))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather implement Hash on UseTree so that we can just use unique() directly instead of taking the hit on calling format.

We're really just dealing with the path so I think this would suffice (along with the default Hash derive on UseSegment):

impl Hash for UseTree {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.path.hash(state);
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I try to implement this, I would like to make sure fully understand the implementation. My main question is what is default Hash derive on UseSegment? How and were it is implemented?

I am asking because my first approach was to implement Hasher for both UseTree and UseSegment. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree format and realized that using the format is about the same from performance point of view.

One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works. Taking as an example the hash function you suggested above. First, it doesn't have a return value, so where is the hash value stored and how it is used? Second, through the iterations I assume that several numeric hash valuses are calculated for the non-List UseSegments. If this is the case, it is not clear to me if and how these hash values are "combined" to form the hash of the initial UseTree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main question is what is default Hash derive on UseSegment? How and were it is implemented?

I was referring to the Hash trait, and that we should able to leverage the derive attribute for UseSegment instead of having to manually implement the Hash trait ourselves. I'm not sure we'll be able to leverage derive on UseTree because that only works when all the fields themselves have a Hash impl, and I know that not all of the AST nodes do (feel free to check/test though!). Once there's a Hash impl for both UseTree and UseSegment then you can use unique() directly, instead of formatting to a temporary String.

I am asking because my first approach was to implement Hasher for both UseTree and UseSegment. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree format and realized that using the format is about the same from performance point of view.

I don't think there will be a need to implement the Hasher trait on any of our custom types to resolve this issue. My assumption here was that the itertools functions we were hoping to leverage utilized some hash-based data structure for the deduplication feature we needed, and that in turn would require that the items (UseTree in our case) implemented the Hash trait (side note, looks like they do leverage a hash set internally 😉).

This is indeed the case, as highlighted in the the itertools docs for the unique/unique_by functions. If you were to try to just use unique() on the current code, you'd presumably get errors complaining about unbound constraints, because our UseTree does not have a Hash implementation.

Your initial proposal works because you're allocating an intermediate String representation for each UseTree (which we have to do again later for the actual emitting of the formatted code) and that satisfies the itertools constraints because String does have a Hash impl. We don't want to do this though because it is unnecessary, and we should simply ensure that we have a Hash impl.

One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works

Sorry, but that's a little too off topic for a code review within rustfmt 😄 Hopefully the above info helps somewhat, but otherwise I'd refer you to materials like the Book, the stdlib docs (https://doc.rust-lang.org/std/index.html), or even the rustc source if you really want to drill into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed response! The reference to stdlib also help as I realized I missed the finish method that emits the hash calculation result.

I would much rather implement Hash on UseTree so that we can just use unique() directly instead of taking the hit on calling format.

I have now change unique_by to unique using your suggested code.

.collect()
}

Expand Down
File renamed without changes.
6 changes: 6 additions & 0 deletions tests/source/imports/imports_granularity_default-with-dups.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
use crate::lexer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some test files that have multiple groups, and perhaps some that have comments too? Maybe something like this:

use crate::lexer::{tokens::TokenData /* foo */ };
use crate::lexer::{tokens::TokenData};

It might also be useful to have some duplicates but with varying visibility levels (we may not handle this well with the other variants either FWIW, but would be good to check while we're working on this one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some test files that have multiple groups, and perhaps some that have comments too?

I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.

For example, the for the following input:

use crate::foo;
use crate::foo::self;
use crate::foo::self /* foo 1 */;
use crate::foo::{self /* foo 2 */ };
use crate::foo::{tokens::TokenData};
use crate::foo::{tokens::TokenData /* foo 3 */ };
use crate::foo::{self, tokens::TokenData /* foo 4 */};
use crate::bar::self /* bar 1 */;
use crate::bar::self;
use crate::foobar::self; /* foobar 1 */
use crate::foobar::self;

"Crate" granularity produce the following output:

use crate::bar; /* bar 1 */;
use crate::foo; /* foo 1 */;
use crate::foobar; /* foobar 1 */
use crate::{
    bar, foo,
    foo::{self /* foo 2 */, tokens::TokenData},
    foobar,
};

Comments foo 3 and foo 4 disappeared and there is an extra ; after comments bar 1 and foo 1.

"Module" granularity produce the following output:

use crate::bar; /* bar 1 */;
use crate::foo; /* foo 1 */;
use crate::foo::tokens::TokenData;
use crate::foo::{self /* foo 2 */, self};
use crate::foobar; /* foobar 1 */
use crate::{bar, foo, foobar};

In addition to the same issues as for "Crate", foo::{...} includes self twice: one with foo 2 comment and one without a comment.

"Item" granularity produce the following output:

use crate::bar; /* bar 1 */;
use crate::foo;
use crate::foo::tokens::TokenData;
use crate::foo::{self /* foo 2 */};
use crate::foobar; /* foobar 1 */

Comment foo 1 is missing because in the original code crate::foo::self is before crate::foo::self /* foo 1 */. The bar 1 comment does appear, since crate::bar::self is before crate::foobar::self; /* foobar 1 */. As above, foo 3 and foo 4 comments disappeared and there is an extra ; after comments bar 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as there are some issues, especially related to comments

wow there sure is 🙃

We've always had comment issues with this option (which was named merge_imports prior to a recent rename/restructure) so I'm not too surprised by some of those, but the emission of invalid code in this type of case is particularly concerning:

use rustc_ast::ast /* abc */;

I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.

Good call 👍

Copy link
Contributor Author

@davidBar-On davidBar-On Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitted PR #4759 to fix the extra semicolon issue.

I will also try to resolve the issue where in some cases if the same import module appears twice - once with a comment and once without a comment, it appears twice in the formatted output. E.g. in the above output for Crate, 'create::barandcrate:::foo` appear twice.

Update:
The reason some import items appear twice for Crate and Module granularity, is because merge_use_trees() don't try to merge items with comments and just push them to the output:

if use_tree.has_comment() || use_tree.attrs.is_some() {
result.push(use_tree);
continue;

This is probably a patch done because imports comments are not handled well, but it ignores duplicates.

There is a relatively simple way to remove these duplicates by ignoring all non-commented flattened items that have a duplicate with comment. However, because duplicates with and without comments are not expected to be common in practice, and because no issue was opened by users against this problem, it seems that it is not worth trying to add this enhancement.

use crate::lexer::tokens::TokenData;
use crate::lexer::{tokens::TokenData};
use crate::lexer::self;
use crate::lexer::{self};
use crate::lexer::{self, tokens::TokenData};
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
// rustfmt-imports_granularity: Item
// rustfmt-reorder_imports: false
// rustfmt-group_imports: StdExternalCrate

use crate::lexer;
use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer::{tokens::TokenData};
use crate::lexer::self;
use crate::lexer;
use crate::lexer;
use crate::lexer::{self};
use crate::lexer::{self, tokens::TokenData};
11 changes: 11 additions & 0 deletions tests/source/imports/imports_granularity_item-with-dups.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
// rustfmt-imports_granularity: Item

use crate::lexer;
use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer::{tokens::TokenData};
use crate::lexer::self;
use crate::lexer;
use crate::lexer;
use crate::lexer::{self};
use crate::lexer::{self, tokens::TokenData};
File renamed without changes.
6 changes: 6 additions & 0 deletions tests/target/imports/imports_granularity_default-with-dups.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
use crate::lexer;
use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer::tokens::TokenData;
use crate::lexer::{self};
use crate::lexer::{self, tokens::TokenData};
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
// rustfmt-imports_granularity: Item
// rustfmt-reorder_imports: false
// rustfmt-group_imports: StdExternalCrate

use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer::{self};
5 changes: 5 additions & 0 deletions tests/target/imports/imports_granularity_item-with-dups.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
// rustfmt-imports_granularity: Item

use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer::{self};