
Search with Elasticlunr, updated #604

Merged: 8 commits into rust-lang:master from search-eljs-rebase on Mar 7, 2018

Conversation

Contributor

@mattico mattico commented Feb 2, 2018

@Phaiax's #472, manually rebased, plus a few improvements.

Major Changes from #472:

  • Breadcrumbs are made by adding Chapter.parent_names, rather than modifying the BookItems iterator
  • Markdown renderer renders footnotes, and also inserts spaces for certain tags to ensure words don't get merged together in the search results
  • SearchDocument removed, render_item renders directly into elasticlunr::Index
  • CSS modifications moved into stylus files
  • Removed JQuery from JS
  • Added integration test
  • Added aria attributes to search elements
  • Added search cargo feature:
    • Search index creation moved into module
    • Search JS moved into theme module for override/disable
  • HTML textification uses ammonia sanitizer

Closes #472
Fixes #51

@Michael-F-Bryan Michael-F-Bryan self-requested a review February 2, 2018 12:14
@mattico mattico force-pushed the search-eljs-rebase branch 5 times, most recently from bed285a to d17976c Compare February 5, 2018 00:39
@mattico
Contributor Author

mattico commented Feb 15, 2018

@Michael-F-Bryan

If there's anything I can do to make this easier to review, let me know. I'd be happy to break it into smaller commits.

No pressure, though. Wouldn't want a good maintainer to get burned out.

Contributor

@Michael-F-Bryan Michael-F-Bryan left a comment

I've done an initial skim through the code. Overall I can't see any glaring issues and it looks pretty good.

We could probably do with a couple more tests, particularly around rendering. That way we'll be able to prevent accidental regressions in the future and it also helps act as "live examples" of how the pieces work.

Another question I had is about how we distribute the elasticlunr JavaScript dependency. We currently vendor minified JavaScript files in the repo, but that makes things a bit annoying when people want to package for platforms like Debian and it's not easy to verify authenticity or whether something is open-source.

src/config.rs Outdated
#[serde(default, rename_all = "kebab-case")]
pub struct Search {
/// Enable in browser searching. Default: `true` (if `search` feature is enabled).
pub enable: bool,
Contributor

Is it necessary to have an enabled flag? I would have thought we can tell whether search is enabled by whether an output.html.search table is present or not.

So an alternative to the enabled flag would be to change the search field in HtmlConfig to be Option<Search>... Thoughts?

Contributor Author

Agreed. Especially now that the JS is conditionally included, the flag is not very useful.

Contributor Author

Hm, that would change search to be disabled by default. Is that your intention?

Contributor

I dunno, to be honest it's one of those features you'll always want for your document.

It should probably be always turned on unless you manually deactivate the search feature flag when installing mdbook (e.g. when packaging for Debian).

src/config.rs Outdated
pub enable: bool,
/// The path to the searcher to use. If not specified, will use the searcher code included in
/// MDBook. Default: `""`
pub searcher: PathBuf,
Contributor

Hmm, I think we may end up having a similar issue to themes here. By adding this to the config file we're theoretically allowing people to swap in their own implementations, but because there is no well defined (and documented) interface to our searcher, people probably won't be able to make use of this knob in practice.

What would one gain by being able to use their own searcher implementations?

Contributor Author

I agree that the way it's set up is not ideal; I just copied the setup used for playpen_editor.

While we give no guarantees about the stability or interfaces for the searcher, the current system can be useful in practice for making small changes to search behavior (remove highlighting, change keyboard shortcuts, etc.). I'll agree, however, that this feature is unlikely to be used much if at all. In general I don't think many users will make significant theme changes out-of-tree until we provide a better way to do that.

So what change can we make to this PR? Here's what I'm thinking:

  1. Remove the editor and searcher settings and the corresponding file override behavior.
  2. Add copy-js settings for both. If set to false, the JS code for the feature will not be copied to the output directory. This is more flexible since users can then include whatever JS they want with additional-js, and it reduces the complexity of this little-used feature to almost zero.

Contributor

I think that sounds quite reasonable 👍

Besides, it's not hard to put back in if we find people actually want the feature.

src/config.rs Outdated
fn default() -> Search {
// Please update the documentation of `Search` when changing values!
Search {
enable: cfg!(feature = "search"),
Contributor

Nice use of the cfg!() macro! 👍

if cfg!(feature = "search") {
data.insert("search_enabled".to_owned(), json!(true));
} else {
warn!("mdBook compiled without search support, ignoring `output.html.search.enable`");
Contributor

Good idea with adding a warning! Should we also say something like "please reinstall with cargo install mdbook --force --features search to use the search feature"?
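
A minimal sketch of how the warning and the suggested reinstall hint could fit together. The names `search_status`, `compiled_with_search` and `table_present` are hypothetical and not taken from the PR; in mdBook itself the first argument would presumably come from `cfg!(feature = "search")`.

```rust
/// Decide whether search should be rendered, and produce an optional warning.
///
/// Hypothetical helper for illustration only: `compiled_with_search` stands in
/// for `cfg!(feature = "search")`, and `table_present` for "the user wrote an
/// `[output.html.search]` table in book.toml".
fn search_status(compiled_with_search: bool, table_present: bool) -> (bool, Option<String>) {
    match (compiled_with_search, table_present) {
        // Search support is compiled in: enabled whether or not it is configured.
        (true, _) => (true, None),
        // Not compiled in, but the user configured it: warn and say how to fix it.
        (false, true) => (
            false,
            Some(
                "mdBook compiled without search support, ignoring `output.html.search`; \
                 reinstall with `cargo install mdbook --force --features search` to use it"
                    .to_owned(),
            ),
        ),
        // Not compiled in and not configured: stay quiet.
        (false, false) => (false, None),
    }
}

fn main() {
    let (enabled, warning) = search_status(false, true);
    println!("search enabled: {}", enabled);
    if let Some(msg) = warning {
        eprintln!("warning: {}", msg);
    }
}
```

Keeping the decision in a plain function with boolean inputs makes both feature configurations testable from a single build.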

@@ -690,9 +647,9 @@ mod tests {

#[test]
fn anchor_generation() {
-assert_eq!(id_from_content("## `--passes`: add more rustdoc passes"),
+assert_eq!(utils::id_from_content("## `--passes`: add more rustdoc passes"),
Contributor

Now that the id_from_content() function has been moved to the utils module it may make sense to relocate these tests as well. I'm also curious to see how the ID generation will go when your header includes emojis, bold, or bits of code.

}

/// Write the given data to a file, creating it first if necessary
pub fn write_file<P: AsRef<Path>>(
Contributor

Thank you for removing this from the HTML renderer! It actually started life as a helper method on MDBook and got moved out because it coupled the entire HTML renderer to MDBook's internals. At the time I was refactoring a different part of the system, so didn't want to go off on a tangent by also fixing write_file().

src/utils/mod.rs Outdated
use std::borrow::Cow;

pub use self::string::{RangeArgument, take_lines};

pub fn remove_html_tags<'a>(text: &'a str) -> Cow<'a, str> {
let regex = Regex::new(r"<[^>]*?>").unwrap();
Contributor

Is it a good idea to use a regex here instead of something like an HTML sanitizer? I'm not 100% familiar with how the pulldown-cmark markdown renderer deals with angle brackets, but if we're not careful when calling it, we could accidentally clobber the wrong stuff (e.g. an innocent paragraph like "if `x < 0` and `x > -5`, blah blah blah").

Of course, that approach could also be overkill and 95% of the time the regex will work just fine...

Contributor Author

@mattico mattico Feb 18, 2018

Using a Regex on HTML does feel bad, but I didn't want to pull in a whole sanitizer. Angle brackets already have to be escaped in HTML blocks so it could clobber some data, but I blame the user 😄 . Worst-case scenario is that some pathological text inside of an HTML tag isn't in the search index.
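
To make the concern concrete, here is a small sketch of the failure mode. It hand-rolls the effect of a replace-all with `<[^>]*?>` so that it needs no external crate; `strip_tags_naive` is a hypothetical name, not the function from the PR. Whether the second input can actually reach this code depends on how the markdown renderer escapes angle brackets, which this sketch does not model.

```rust
/// Remove everything between a `<` and the next `>`, mimicking what a
/// replace-all with the pattern `<[^>]*?>` does. Hypothetical helper, written
/// without the `regex` crate so the sketch stays self-contained.
fn strip_tags_naive(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(open) = rest.find('<') {
        match rest[open..].find('>') {
            Some(close) => {
                // Keep the text before `<`, drop `<...>` itself.
                out.push_str(&rest[..open]);
                rest = &rest[open + close + 1..];
            }
            // No closing `>` anywhere in the remainder: nothing more can match.
            None => break,
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    // Real markup is stripped as intended.
    assert_eq!(strip_tags_naive("<p>Hello <b>world</b></p>"), "Hello world");
    // Unescaped comparison operators are mistaken for a tag and text is lost.
    assert_eq!(strip_tags_naive("if x < 0 and x > -5, blah"), "if x  -5, blah");
}
```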

}

/// Renders markdown into flat unformatted text and adds it to the search index.
fn render_item(index: &mut Index, search_config: &Search, item: &BookItem) -> Result<()> {
Contributor

Is it worth creating a bunch of tests for this so we can be sure the text gets stripped as intended?

}

/// Converts the index and search options to a JSON string
fn write_to_json(index: Index, search_config: &Search) -> Result<String> {
Contributor

Do you have access to any "known good" versions of the generated JSON? It's probably a good idea to add a test or two to make sure we're able to generate valid JSON, that way we can also prevent accidental regressions in the future if someone needs to come back and change things.

Contributor Author

It's difficult to have any significant test for the entire search index. Tiny changes in the source material change the index drastically, and it's so large that it's difficult to determine if any change matters at all. If you have suggestions, I'm all ears.

Contributor

@Michael-F-Bryan Michael-F-Bryan Feb 19, 2018

Hmm, you make a good point... What I usually do is generate a small example and store that as a file, giving it a quick look through to make sure it seems alright. Then every time after that you can have a test which uses the same input and check the rendered document is the same as our known good copy.

I have a similar issue at work where it's not really possible to manually construct an object and check each of its elements are the same every time. For that I've got a create_fixture() function and a constant which I manually toggle to make it regenerate the fixture.

/// Regenerates the test fixture (pre-nested list of cutlines).
///
/// # Note
///
/// You'll need to run `cargo test` manually here. If tests are being run via
/// `cargo watch` and you try to regenerate the test fixture, we'll get midway
/// through writing out the fixture before `cargo watch` restarts the process,
/// leaving a mangled fixture file.
const GENERATE_FIXTURE: bool = false;


// loads of tests


fn get_fixture() -> Vec<Cutline> {
    if cfg!(windows) && GENERATE_FIXTURE {
        let inputs = create_known_dummy_inputs();
        let thing = generate_output(&inputs).unwrap();

        let fixture_path =
            Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/data/fixture.json");
        let mut f = ::std::fs::File::create(&fixture_path).unwrap();
        serde_json::to_writer_pretty(&mut f, &thing).unwrap();

        // Return the freshly generated value that was just written to disk.
        thing
    } else {
        let jason = include_str!("data/fixture.json");
        serde_json::from_str(jason).expect("Unable to deserialize the fixture")
    }
}

Contributor Author

Ah that's a good solution! One of the issues I was running into is that rendering dummy_book is different when you're in a test (due to env vars?) so it's annoying to make fixtures.

}

impl Searcher {
pub fn new(src: &Path) -> Self {
Contributor

Should we create a test for this to make sure the file loading logic is sound and people are able to override the scripts?

@mattico
Contributor Author

mattico commented Feb 20, 2018

I think that addresses everything, except for the JS dependency question. I would like to address that, I'm working out a solution right now. I think we should leave that for another PR, though.

@Phaiax

Phaiax commented Feb 23, 2018

I just wanted to say thank you for taking this over. 👍

(I wished I had enough time for this but nope, got a new job and more...)

Contributor

@Michael-F-Bryan Michael-F-Bryan left a comment

I did another review of the search code. Other than my question around #[serde(default)] and the Option<Search> item in the HTML settings, I think it's pretty much ready to merge. I'm just going to try it out on my machine first to make sure there aren't any UI issues.

This is really really cool by the way. Thank you for all the hard work!

@@ -152,15 +152,20 @@ pub struct Chapter {
pub sub_items: Vec<BookItem>,
/// The chapter's location, relative to the `SUMMARY.md` file.
pub path: PathBuf,
/// An ordered list of the names of each chapter above this one, in the hierarchy.
pub parent_names: Vec<String>,
Contributor

Seeing that we need to manually store the parent names and keep them in sync makes me question whether it may be better to represent a Book as containing a "doubly-linked" tree of chapters, or even turn it into a graph structure. That way a Chapter can contain a pointer back to its parent and you can use that to walk up and down the levels of the tree.

It would also hopefully simplify tracking section numbers and parsing. I might play around with this idea on a private branch and see how it goes.
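
One way to picture the idea is a tree kept in a flat arena, where each node stores the index of its parent, so the ancestor names can be derived on demand instead of being stored and kept in sync. This is only a sketch: `Node`, `Tree`, `add` and the chapter names are hypothetical and do not mirror mdBook's `Book` or `Chapter` types.

```rust
/// A chapter stored in a flat arena; `parent` is an index into the same arena.
/// All names here are hypothetical and do not mirror mdBook's `Chapter` type.
struct Node {
    name: String,
    parent: Option<usize>,
}

#[derive(Default)]
struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Add a chapter under `parent` and return its index.
    fn add(&mut self, name: &str, parent: Option<usize>) -> usize {
        self.nodes.push(Node { name: name.to_owned(), parent });
        self.nodes.len() - 1
    }

    /// Walk up the parent links and return ancestor names, outermost first.
    fn parent_names(&self, idx: usize) -> Vec<String> {
        let mut names = Vec::new();
        let mut current = self.nodes[idx].parent;
        while let Some(p) = current {
            names.push(self.nodes[p].name.clone());
            current = self.nodes[p].parent;
        }
        names.reverse();
        names
    }
}

fn main() {
    let mut tree = Tree::default();
    let guide = tree.add("User Guide", None);
    let cli = tree.add("Command Line Tool", Some(guide));
    let init = tree.add("init", Some(cli));
    println!("{}", tree.parent_names(init).join(" > "));
}
```

Indices are used instead of `Rc`/`Weak` pointers here purely to keep the sketch short; a pointer-based tree would express the same parent link.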

@@ -425,29 +425,79 @@ pub struct HtmlConfig {
pub livereload_url: Option<String>,
/// Should section labels be rendered?
pub no_section_label: bool,
/// Search settings. If `None`, the default will be used.
Contributor

If None, the default will be used.

I believe this is already sorted out by the #[serde(default)] annotation on HtmlConfig. That annotation tells serde to use Search::default() if it can't deserialize the settings.

To me, when I see Option<Search> I'd think this indicates that searching is optional and when no Search is found in my book.toml this means my book won't have a search bar... Am I interpreting it wrong, or is this comment out of sync with the implementation?

Contributor Author

I believe this is already sorted out by the #[serde(default)] annotation on HtmlConfig. That annotation tells serde to use Search::default() if it can't deserialize the settings.

The default it'll use is None, since that's how Default is implemented for Option: https://doc.rust-lang.org/src/core/option.rs.html#900-906, and we're using #[derive(Default)]. The html renderer will then manually use unwrap_or_default to get the default configuration. The reason to do this is so we can print the warning message if the user is using mdbook compiled without feature = "search", but does have an [output.html.search] table in their book.toml.
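
The behaviour described here can be checked with the standard library alone. In the sketch below, `Search` and `HtmlConfig` are cut-down stand-ins, and the `limit_results` field with its value of 30 is an assumption made purely for illustration.

```rust
/// Stand-in for the search settings; the field and its default value are
/// assumptions made for this sketch.
#[derive(Debug, PartialEq)]
struct Search {
    limit_results: u32,
}

impl Default for Search {
    fn default() -> Self {
        Search { limit_results: 30 }
    }
}

/// Stand-in for `HtmlConfig`, using `#[derive(Default)]` as described above.
#[derive(Debug, Default)]
struct HtmlConfig {
    search: Option<Search>,
}

fn main() {
    // The derived default for an `Option` field is `None`,
    // not `Some(Search::default())`.
    let config = HtmlConfig::default();
    assert!(config.search.is_none());

    // The renderer can still tell "not configured" apart from "configured",
    // and only afterwards fall back to the default settings.
    let user_configured = config.search.is_some();
    let search = config.search.unwrap_or_default();
    assert!(!user_configured);
    assert_eq!(search, Search::default());
}
```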

Contributor Author

It may be less confusing overall to just bring back the [output.html.search.enabled] field.

pub editable: bool,
/// Copy JavaScript files for the editor to the output directory?
/// Default: `true`.
pub copy_js: bool,
Contributor

This probably belongs in its own PR. Changing how the playpen works and whether we copy across JavaScript files is a non-trivial change.

Contributor Author

I suppose. Should I revert the changes to editor, but leave the output.html.search.copy-js setting?

Contributor

Mmm, don't worry about it for now, it's probably not worth the effort. As long as it's noted in the PR description it should be fine.

for (i, item) in book.iter().enumerate() {
let mut depthfirstiterator = book.iter();
let mut is_index = true;
while let Some(item) = depthfirstiterator.next() {
Contributor

Shouldn't this just be a plain old for loop?

Contributor Author

Yeah, I'm not sure why this was changed. This is from the original PR.

//
// If you're pretty sure you haven't broken anything, change `GENERATE_FIXTURE`
// above to `true`, and run `cargo test` to generate a new fixture. Then
// change it back to `false`.
Contributor

I got a nice laugh out of this block of comments, yet it's still surprisingly useful for people who see it because of a failing test. Thank you!

@Michael-F-Bryan
Contributor

Michael-F-Bryan commented Feb 24, 2018

I just tried this out on my laptop. One thing I've noticed is that the search index is only available when running mdbook with mdbook serve, which is a bit of a pain when testing with mdbook build --open locally.

The browser console shows the following error.

searcher.js:260 Failed to load file:///home/michael/Documents/mdBook/book-example/book/searchindex.json: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
init @ searcher.js:260

After a brief search it seems like this is a Chrome-specific issue and there isn't really any way around it other than invoking Chrome from the command line with the --allow-file-access-from-files flag. @sorin-davidoi, have you encountered this before?


Other than the CORS issue, the search functionality is really, really nice 👍

@sorin-davidoi
Contributor

Don't think the issue is Chrome-specific - you just can't read arbitrary files from the filesystem. A (nasty) workaround could be to make a searchindex.js file which contains the entire JSON and assigns it to window - you can then add the file as a script tag, wait for it to load and execute, and proceed with reading the JSON from window.
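
On the build side, the workaround described above could look roughly like the following sketch. The function name `wrap_index_as_js` and the global `window.search_index` are hypothetical; this is not necessarily how the PR ended up fixing it.

```rust
/// Wrap an already-serialized JSON index in a script that assigns it to a
/// global. `window.search_index` is a hypothetical name chosen for this sketch.
fn wrap_index_as_js(index_json: &str) -> String {
    format!("window.search_index = {};", index_json)
}

fn main() {
    // The resulting text would be written to `searchindex.js` and loaded with
    // a plain <script> tag, which works from file:// URLs as well.
    let js = wrap_index_as_js(r#"{"docs":{}}"#);
    println!("{}", js);
}
```

Because a script tag is not subject to the same restriction as a request for a local JSON file, the searcher can then read the index from the global once the script has loaded.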

@mattico
Contributor Author

mattico commented Feb 27, 2018

I looked into HTML sanitization. ammonia isn't that big of a dependency, and it could work with a small change.

@mattico
Contributor Author

mattico commented Feb 28, 2018

@Michael-F-Bryan
Contributor

@mattico, I just looked through the PR again and I'm pretty happy with it 👍

On a side note, it looks like html5ever emits a bunch of unimplemented warnings when generating the search index, do you know what may be causing them? The stop_parsing operation sounds like more of an optimisation to stop parsing HTML early so I doubt it's actually going to have any negative impact on mdbook, but the user experience isn't the best.

We can always hack around this by updating the init_logging() function in src/bin/mdbook.rs to use builder.filter(Some("mdbook"), LevelFilter::Info) when the RUST_LOG environment variable isn't set, although I'd be interested to see if there's a way to fix the underlying html5ever issue.

For posterity's sake, this is what I see when running mdbook build locally:

$ cargo run -- build --open book-example                                     
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/mdbook build --open book-example`
2018-03-05 16:15:36 [INFO] (mdbook::book): Book building has started
2018-03-05 16:15:36 [INFO] (mdbook::book): Running the html backend
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
Created new window in existing browser session.

Other than the annoying html5ever warnings, I think this is about ready to merge.

@Michael-F-Bryan Michael-F-Bryan mentioned this pull request Mar 5, 2018
6 tasks
@mattico
Contributor Author

mattico commented Mar 5, 2018

Ah, should've run more than cargo test. I'll look into it.

@mattico mattico force-pushed the search-eljs-rebase branch 2 times, most recently from 35875e2 to 5720c1a Compare March 5, 2018 17:58
@mattico mattico force-pushed the search-eljs-rebase branch from 5720c1a to 1251101 Compare March 5, 2018 17:58
@mattico mattico mentioned this pull request Mar 5, 2018
@mattico
Contributor Author

mattico commented Mar 6, 2018

Fixed.

@Michael-F-Bryan
Contributor

And merged 🎉

@Michael-F-Bryan Michael-F-Bryan merged commit b2ad669 into rust-lang:master Mar 7, 2018
@mattico mattico deleted the search-eljs-rebase branch March 7, 2018 15:11
@mattico
Contributor Author

mattico commented Mar 7, 2018

4 months ago I looked at the search issue and thought "how hard can it be?" 🎉🎊🎖️📯

@Michael-F-Bryan
Contributor

I've been using master locally for a couple of days now and can't see any issues with search, so I'll probably do another release either tomorrow or the day after.

Should I also make a post to /r/rust? Is there anything in particular you'd like me to say about mdbook's new search functionality?

@mattico
Contributor Author

mattico commented Mar 11, 2018

Should I also make a post to /r/rust?

That's a great idea! People should know to rebuild their docs.

Is there anything in particular you'd like me to say about mdbook's new search functionality?

Hmm. I'd mention that if anyone notices something that doesn't show up in search results, they should file an issue on here. There are possibly still some bugs in the index preprocessing or elasticlunr-rs.

If you're going to have a "thanks to" section make sure to mention phaiax for writing most of this ^^ and notriddle for maintaining ammonia.

There are a lot of questions on /r/rust like "what's a project I can contribute to to learn Rust". I think mdbook works well for that: the code is easy to understand and it's fun to see the results in the browser. On the other hand, sometimes working on it is more HTML than Rust, and a lot of the stuff that needs to be worked on is more architecture than self-contained improvements.

@Michael-F-Bryan
Contributor

Michael-F-Bryan commented Mar 15, 2018

@mattico I think we have a small CSS issue with the search results 😜

[image: screenshot of the search results CSS issue]

Do you know what may have caused this? I'm using Firefox on Windows, but I remember testing the search feature before our 0.1.4 release (Chrome on Arch Linux) and the search results would get inserted above the page contents just fine.

@sorin-davidoi, this should be a pretty trivial CSS fix, shouldn't it?

@oberien oberien mentioned this pull request Mar 16, 2018
Ruin0x11 pushed a commit to Ruin0x11/mdBook that referenced this pull request Aug 30, 2020
* Add search with elasticlunr.js

This commit adds search functionality to mdBook, based on work done by @Phaiax. The in-browser search code uses elasticlunr.js to execute the search, using an index generated at book build time by elasticlunr-rs.

* Add generator comment
Someone on Reddit was wondering how the rust book was generated and said they checked the source. Thought I'd put this here. Might be a good idea to have a little footer "made with mdBook", but this'll do for now.

* Remove search/editor file override behavior

* Use for loop for book iterator

* Improve HTML regex

* Fix search CORS in file URIs

* Use ammonia to sanitize HTML

* Filter html5ever log messages