Search with Elasticlunr, updated #604

mattico · 2018-02-02T03:05:24Z

@Phaiax's #472, manually rebased, plus a few improvements.

Major Changes from #472:

Breadcrumbs are made by adding Chapter.parent_names, rather than modifying the BookItems iterator
Markdown renderer renders footnotes, and also inserts spaces for certain tags to ensure words don't get merged together in the search results
SearchDocument removed, render_item renders directly into elasticlunr::Index
CSS modifications moved into stylus files
Removed JQuery from JS
Added integration test
Added aria attributes to search elements
Added search cargo feature:
- Search index creation moved into module
- Search JS moved into theme module for override/disable
HTML textification uses ammonia sanitizer

Closes #472
Fixes #51

mattico · 2018-02-15T17:33:49Z

@Michael-F-Bryan

If there's anything I can do to make this easier to review, let me know. I'd be happy to break it into smaller commits.

No pressure, though. Wouldn't want a good maintainer to get burned out.

Michael-F-Bryan

I've done an initial skim through the code. Overall I can't see any glaring issues and it looks pretty good.

We could probably do with a couple more tests, particularly around rendering. That way we'll be able to prevent accidental regressions in the future and it also helps act as "live examples" of how the pieces work.

Another question I had is about how we distribute the elasticlunr JavaScript dependency. We currently vendor minified JavaScript files in the repo, but that makes things a bit annoying when people want to package for platforms like Debian and it's not easy to verify authenticity or whether something is open-source.

Michael-F-Bryan · 2018-02-18T07:22:17Z

src/config.rs

+#[serde(default, rename_all = "kebab-case")]
+pub struct Search {
+    /// Enable in browser searching. Default: `true` (if `search` feature is enabled).
+    pub enable: bool,


Is it necessary to have an enabled flag? I would have thought we can see search is enabled because a output.html.search is either present or not.

So an alternative to the enabled flag would be to change the search field in HtmlConfig to be Option<Search>... Thoughts?

Agreed. Especially now that the JS is conditionally included, the flag is not very useful.

Hm, that would change search to be disabled by default. Is that your intention?

I dunno, to be honest it's one of those features you'll always want for your document.

It should probably be always turned on unless you manually deactivate the search feature flag when installing mdbook (e.g. when packaging for Debian).

Michael-F-Bryan · 2018-02-18T07:24:15Z

src/config.rs

+    pub enable: bool,
+    /// The path to the searcher to use. If not specified, will use the searcher code included in
+    /// MDBook. Default: `""`
+    pub searcher: PathBuf,


Hmm, I think we may end up having a similar issue to themes here. By adding this to the config file we're theoretically allowing people to swap in their own implementations, but because there is no well defined (and documented) interface to our searcher, people probably won't be able to make use of this knob in practice.

What would one gain by being able to use their own searcher implementations?

I agree that the way it's setup is not ideal, I just copied the setup used for playpen_editor.

While we give no guarantees about the stability or interfaces for the searcher, the current system can be useful in practice for making small changes to search behavior (remove highlighting, change keyboard shortcuts, etc.). I'll agree, however, that this feature is unlikely to be used much if at all. In general I don't think many users will make significant theme changes out-of-tree until we provide a better way to do that.

So what change can we make to this PR? Here's what I'm thinking:

Remove the editor and searcher settings and the corresponding file override behavior.

Add copy-js settings for both. If set to false, the JS code for the feature will not be copied to the output directory. This is more flexible since users can then include whatever JS they want with additional-js, and it reduces the complexity of this little-used feature to almost zero.

I think that sounds quite reasonable 👍

Besides, it's not hard to put back in if we find people actually want the feature.

Michael-F-Bryan · 2018-02-18T07:25:04Z

src/config.rs

+    fn default() -> Search {
+        // Please update the documentation of `Search` when changing values!
+        Search {
+            enable: cfg!(feature = "search"),


Nice use of the cfg!() macro! 👍

Michael-F-Bryan · 2018-02-18T07:29:20Z

src/renderer/html_handlebars/hbs_renderer.rs

+        if cfg!(feature = "search") {
+            data.insert("search_enabled".to_owned(), json!(true));
+        } else {
+            warn!("mdBook compiled without search support, ignoring `output.html.search.enable`");


Good idea with adding a warning! Should we also say something like "please reinstall with cargo install mdbook --force --features search to use the search feature"?

Michael-F-Bryan · 2018-02-18T07:31:30Z

src/renderer/html_handlebars/hbs_renderer.rs

@@ -690,9 +647,9 @@ mod tests {

    #[test]
    fn anchor_generation() {
-        assert_eq!(id_from_content("## `--passes`: add more rustdoc passes"),
+        assert_eq!(utils::id_from_content("## `--passes`: add more rustdoc passes"),


Now that the id_from_content() function has been moved to the utils module it may make sense to relocate these tests as well. I'm also curious to see how the ID generation will go when your header includes emojis, bold, or bits of code.

Michael-F-Bryan · 2018-02-18T07:45:59Z

src/utils/fs.rs

+}
+
+/// Write the given data to a file, creating it first if necessary
+pub fn write_file<P: AsRef<Path>>(


Thank you for removing this from the HTML renderer! It actually started life as a helper method on MDBook and got moved out because it coupled the entire HTML renderer to MDBook's internals. At the time I was refactoring a different part of the system, so didn't want to go off on a tangent by also fixing write_file().

Michael-F-Bryan · 2018-02-18T07:50:50Z

src/utils/mod.rs

 use std::borrow::Cow;

 pub use self::string::{RangeArgument, take_lines};

+pub fn remove_html_tags<'a>(text: &'a str) -> Cow<'a, str> {
+    let regex = Regex::new(r"<[^>]*?>").unwrap();


Is it a good idea to use a regex here instead of something like a HTML sanitizer? I'm not 100% familiar with how the pulldown-cmark markdown renderer deals with angle brackets, but if we're not careful when calling it, we could accidentally clobber the wrong stuff (e.g. an innocent paragraph like "if `x < 0` and `x > -5`, blah blah blah").

Of course, that approach could also be overkill and 95% of the time the regex will work just fine...

Using a Regex on HTML does feel bad, but I didn't want to pull in a whole sanitizer. Angle brackets already have to be escaped in HTML blocks so it could clobber some data, but I blame the user 😄 . Worst-case scenario is that some pathological text inside of an HTML tag isn't in the search index.

Michael-F-Bryan · 2018-02-18T07:55:07Z

src/renderer/html_handlebars/search.rs

+}
+
+/// Renders markdown into flat unformatted text and adds it to the search index.
+fn render_item(index: &mut Index, search_config: &Search, item: &BookItem) -> Result<()> {


Is it worth creating a bunch of tests for this so we can be sure the text gets stripped as intended?

Michael-F-Bryan · 2018-02-18T07:56:13Z

src/renderer/html_handlebars/search.rs

+}
+
+/// Converts the index and search options to a JSON string
+fn write_to_json(index: Index, search_config: &Search) -> Result<String> {


Do you have access to any "known good" versions of the generated JSON? It's probably a good idea to add a test or two to make sure we're able to generate valid JSON, that way we can also prevent accidental regressions in the future if someone needs to come back and change things.

It's difficult to have any significant test for the entire search index. Tiny changes in the source material change the index drastically, and it's so large that it's difficult to determine if any change matters at all. If you have suggestions, I'm all ears.

Hmm, you make a good point... What I usually do is generate a small example and store that as a file, giving it a quick look through to make sure it seems alright. Then every time after that you can have a test which uses the same input and check the rendered document is the same as our known good copy.

I have a similar issue at work where it's not really possible to manually construct an object and check each of its elements are the same every time. For that I've got a create_fixture() function and a constant which I manually toggle to make it regenerate the fixture.

/// Regenerates the test fixture (pre-nested list of cutlines). /// /// # Note /// /// You'll need to run `cargo test` manually here, if tests are being run via /// `cargo watch` and you try to regenerate the test fixture we'll get midway /// through writing out the fixture before `cargo watch` restarts the process. /// Leaving a mangled fixture file. const GENERATE_FIXTURE: bool = false; // loads of tests fn get_fixture() -> Vec<Cutline> { if cfg!(windows) && GENERATE_FIXTURE { let inputs = create_known_dummy_inputs(); let thing = generate_output(&inputs).unwrap(); let fixture_path = Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/data/fixture.json"); let mut f = ::std::fs::File::create(&fixture_path).unwrap(); serde_json::to_writer_pretty(&mut f, &thing).unwrap(); nested } else { let jason = include_str!("data/fixture.json"); serde_json::from_str(jason).expect("Unable to deserialize the fixture") } }

Ah that's a good solution! One of the issues I was running into is that rendering dummy_book is different when you're in a test (due to env vars?) so it's annoying to make fixtures.

Michael-F-Bryan · 2018-02-18T07:57:41Z

src/theme/searcher/mod.rs

+}
+
+impl Searcher {
+    pub fn new(src: &Path) -> Self {


Should we create a test for this to make sure the file loading logic is sound and people are able to override the scripts?

mattico · 2018-02-20T17:42:26Z

I think that addresses everything, except for the JS dependency question. I would like to address that, I'm working out a solution right now. I think we should leave that for another PR, though.

Phaiax · 2018-02-23T16:49:47Z

I just wanted to say thank you for taking this over. 👍

(I wished I had enough time for this but nope, got a new job and more...)

Michael-F-Bryan

I did another review of the search code. Other than my question around #[serde(default)] and the Option<Search> item in the HTML settings, I think it's pretty much ready to merge. I'm just going to try it out on my machine first to make sure there aren't any UI issues.

This is really really cool by the way. Thank you for all the hard work!

Michael-F-Bryan · 2018-02-24T09:31:38Z

src/book/book.rs

@@ -152,15 +152,20 @@ pub struct Chapter {
    pub sub_items: Vec<BookItem>,
    /// The chapter's location, relative to the `SUMMARY.md` file.
    pub path: PathBuf,
+    /// An ordered list of the names of each chapter above this one, in the hierarchy.
+    pub parent_names: Vec<String>,


Seeing that we need to manually store the parent names and keep them in sync makes me question whether it may be better to represent a Book as containing a "doubly-linked" tree of chapters, or even turn it into a graph structure. That way a Chapter can contain a pointer back to its parent and you can use that to walk up and down the levels of the tree.

It would also hopefully simplify tracking section numbers and parsing. I might play around with this idea on a private branch and see how it goes.

Michael-F-Bryan · 2018-02-24T09:35:29Z

src/config.rs

@@ -425,29 +425,79 @@ pub struct HtmlConfig {
    pub livereload_url: Option<String>,
    /// Should section labels be rendered?
    pub no_section_label: bool,
+    /// Search settings. If `None`, the default will be used.


If None, the default will be used.

I believe this is already sorted out by the #[serde(default)] annotation on HtmlConfig. That annotation tells serde to use Search::default() if it can't deserialize the settings.

To me, when I see Option<Search> I'd think this indicates that searching is optional and when no Search is found in my book.toml this means my book won't have a search bar... Am I interpreting it wrong, or is this comment out of sync with the implementation?

I believe this is already sorted out by the #[serde(default)] annotation on HtmlConfig. That annotation tells serde to use Search::default() if it can't deserialize the settings.

The default it'll use is None, since that's how Default::default::<Option>() is implemented: https://doc.rust-lang.org/src/core/option.rs.html#900-906, and we're using #[derive(Default)]. The html renderer will then manually use unwrap_or_default to get the default configuration. The reason to do this is so we can print the warning message if the user is using mdbook compiled without feature = "search", but does have an [output.html.search] table in their book.toml.

It may be less confusing overall to just bring back the [output.html.search.enabled] field.

Michael-F-Bryan · 2018-02-24T09:39:27Z

src/config.rs

    pub editable: bool,
+    /// Copy JavaScript files for the editor to the output directory?
+    /// Default: `true`.
+    pub copy_js: bool,


This probably belongs in its own PR. Changing how the playpen works and whether we copy across JavaScript files is a non-trivial change.

I suppose. Should I revert the changes to editor, but leave the output.html.search.copy-js setting?

Mmm, don't worry about it for now, it's probably not worth the effort. As long as it's noted in the PR description it should be fine.

Michael-F-Bryan · 2018-02-24T09:41:24Z

src/renderer/html_handlebars/hbs_renderer.rs

-        for (i, item) in book.iter().enumerate() {
+        let mut depthfirstiterator = book.iter();
+        let mut is_index = true;
+        while let Some(item) = depthfirstiterator.next() {


Shouldn't this just be a plain old for loop?

Yeah, I'm not sure why this was changed. This is from the original PR.

Michael-F-Bryan · 2018-02-24T09:50:06Z

tests/rendered_output.rs

+    //
+    // If you're pretty sure you haven't broken anything, change `GENERATE_FIXTURE`
+    // above to `true`, and run `cargo test` to generate a new fixture. Then
+    // change it back to `false`.


I got a nice laugh out of this block of comments, yet it's still surprisingly useful for people who see it because of a failing test. Thank you!

Michael-F-Bryan · 2018-02-24T10:12:06Z

I just tried this out on my laptop. One thing I've noticed is the search index is only available when running mdbook with mdbook serve which is a bit of a pain when testing with mdbook build --open locally.

The browser console shows the following error.

searcher.js:260 Failed to load file:///home/michael/Documents/mdBook/book-example/book/searchindex.json: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
init @ searcher.js:260

After a brief search it seems like this is a Chrome-specific issue and there isn't really any way around it other than invoking chrome from the command-line with the --allow-file-access-from-files flag. @sorin-davidoi, have you encountered this before?

Other than the CORS issue, the search functionality is really, really nice 👍

sorin-davidoi · 2018-02-24T17:33:00Z

Don't think the issue is Chrome-specific - you just can't read arbitrary files from the filesystem. A (nasty) workaround could be to make a searchindex.js file which contains the entire JSON and assigns it to window - you can then add the file as a script tag, wait for it to load and execute, and proceed with reading the JSON from window.

mattico · 2018-02-27T19:48:20Z

I looked into HTML sanitization. ammonia isn't that big of a dependency, and it could work with a small change.

@Phaiax

This commit adds search functionality to mdBook, based on work done by @Phaiax. The in-browser search code uses elasticlunr.js to execute the search, using an index generated at book build time by elasticlunr-rs.

Someone on Reddit was wondering how the rust book was generated and said they checked the source. Thought I'd put this here. Might be a good idea to have a little footer "made with mdBook", but this'll do for now.

mattico · 2018-02-28T18:37:59Z

Probably wait for https://github.com/notriddle/ammonia/pull/90 now.

Michael-F-Bryan · 2018-03-05T08:26:08Z

@mattico, I just looked through the PR again and I'm pretty happy with it 👍

On a side note, it looks like html5ever emits a bunch of unimplemented warnings when generating the search index, do you know what may be causing them? The stop_parsing operation sounds like more of an optimisation to stop parsing HTML early so I doubt it's actually going to have any negative impact on mdbook, but the user experience isn't the best.

We can always hack around this by updating the init_logging() function in src/bin/mdbook.rs to use builder.filter(Some("mdbook"), LevelFilter::Info) when the RUST_LOG environment variable isn't set, although I'd be interested to see if there's a way to fix the underlying html5ever issue.

For posterity's sake, this is what I see when running mdbook build locally:

$ cargo run -- build --open book-example                                     
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/mdbook build --open book-example`
2018-03-05 16:15:36 [INFO] (mdbook::book): Book building has started
2018-03-05 16:15:36 [INFO] (mdbook::book): Running the html backend
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
2018-03-05 16:15:37 [WARN] (html5ever::tree_builder): stop_parsing not implemented, full speed ahead!
Created new window in existing browser session.

Other than the annoying html5ever warnings, I think this is about ready to merge.

mattico · 2018-03-05T16:46:42Z

Ah, should've run more than cargo test. I'll look into it.

mattico · 2018-03-06T02:55:35Z

Fixed.

Michael-F-Bryan · 2018-03-07T13:01:59Z

And merged 🎉

mattico · 2018-03-07T15:14:16Z

4 months ago I looked at the search issue and thought "how hard can it be"?. 🎉🎊🎖️📯

Michael-F-Bryan · 2018-03-11T14:26:45Z

I've been using master locally for a couple days now and can't see any issues with search, so I'll probably to do another release either tomorrow or the day after.

Should I also make a post to /r/rust? Is there anything in particular you'd like me to say about mdbook's new search functionality?

mattico · 2018-03-11T16:12:55Z

Should I also make a post to /r/rust?

That's a great idea! People should know to rebuild their docs.

Is there anything in particular you'd like me to say about mdbook's new search functionality?

Hmm. I'd mention that if anyone notices something which won't show up in search results that they should file an issue on here. There's possibly still some bugs in the index preprocessing or elasticlunr-rs.

If you're going to have a "thanks to" section make sure to mention phaiax for writing most of this ^^ and notriddle for maintaining ammonia.

There are a lot of questions on /r/rust like "what's a project I can contribute to to learn Rust". I think mdbook works well for that, the code is easy to understand and its fun to see the results in the browser. On the other hand sometimes working on it is more HTML than Rust, and a lot of the stuff that needs to be worked on is more architecture than self-contained improvements.

Michael-F-Bryan · 2018-03-15T06:19:40Z

@mattico I think we have a small CSS issue with the search results 😜

Do you know what may have caused this? I'm using Firefox on Windows, but I remember testing the search feature before our 0.1.4 release (Chrome on Arch Linux) and the search results would get inserted above the page contents just fine.

@sorin-davidoi, this should be a pretty trivial CSS fix, shouldn't it?

@Phaiax

* Add search with elasticlunr.js This commit adds search functionality to mdBook, based on work done by @Phaiax. The in-browser search code uses elasticlunr.js to execute the search, using an index generated at book build time by elasticlunr-rs. * Add generator comment Someone on Reddit was wondering how the rust book was generated and said they checked the source. Thought I'd put this here. Might be a good idea to have a little footer "made with mdBook", but this'll do for now. * Remove search/editor file override behavior * Use for loop for book iterator * Improve HTML regex * Fix search CORS in file URIs * Use ammonia to sanitize HTML * Filter html5ever log messages

Michael-F-Bryan self-requested a review February 2, 2018 12:14

mattico force-pushed the search-eljs-rebase branch 5 times, most recently from bed285a to d17976c Compare February 5, 2018 00:39

Michael-F-Bryan suggested changes Feb 18, 2018

View reviewed changes

Michael-F-Bryan suggested changes Feb 24, 2018

View reviewed changes

mattico force-pushed the search-eljs-rebase branch from aae5676 to 515712c Compare February 28, 2018 16:49

mattico added 6 commits February 28, 2018 10:53

Add search with elasticlunr.js

556c6d1

This commit adds search functionality to mdBook, based on work done by @Phaiax. The in-browser search code uses elasticlunr.js to execute the search, using an index generated at book build time by elasticlunr-rs.

Remove search/editor file override behavior

04cd273

Add generator comment

83ff10e

Someone on Reddit was wondering how the rust book was generated and said they checked the source. Thought I'd put this here. Might be a good idea to have a little footer "made with mdBook", but this'll do for now.

Use for loop for book iterator

ef28672

Improve HTML regex

94e2b62

Fix search CORS in file URIs

c37e9a9

mattico force-pushed the search-eljs-rebase branch from 515712c to c37e9a9 Compare February 28, 2018 16:53

mattico mentioned this pull request Mar 3, 2018

Please release 1.0.2 on crates.io rust-ammonia/ammonia#93

Closed

Michael-F-Bryan approved these changes Mar 5, 2018

View reviewed changes

Michael-F-Bryan mentioned this pull request Mar 5, 2018

Search with Elasticlunr #472

Closed

6 tasks

mattico force-pushed the search-eljs-rebase branch 2 times, most recently from 35875e2 to 5720c1a Compare March 5, 2018 17:58

mattico added 2 commits March 5, 2018 11:58

Use ammonia to sanitize HTML

f6592d8

Filter html5ever log messages

1251101

mattico force-pushed the search-eljs-rebase branch from 5720c1a to 1251101 Compare March 5, 2018 17:58

mattico mentioned this pull request Mar 5, 2018

Switch from Stylus to CSS #636

Closed

Michael-F-Bryan merged commit b2ad669 into rust-lang:master Mar 7, 2018

mattico deleted the search-eljs-rebase branch March 7, 2018 15:11

Keats mentioned this pull request Mar 7, 2018

Add search index creation getzola/zola#140

Closed

Michael-F-Bryan mentioned this pull request Mar 15, 2018

Search results drawn on top of the page content #643

Closed

oberien mentioned this pull request Mar 16, 2018

Search rust-lang/reference#73

Closed

Search with Elasticlunr, updated #604

Search with Elasticlunr, updated #604

Conversation

mattico commented Feb 2, 2018 • edited Loading

mattico commented Feb 15, 2018

Michael-F-Bryan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattico Feb 18, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Michael-F-Bryan Feb 19, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattico commented Feb 20, 2018

Phaiax commented Feb 23, 2018

Michael-F-Bryan left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Michael-F-Bryan commented Feb 24, 2018 • edited Loading

sorin-davidoi commented Feb 24, 2018

mattico commented Feb 27, 2018

mattico commented Feb 28, 2018

Michael-F-Bryan commented Mar 5, 2018

mattico commented Mar 5, 2018

mattico commented Mar 6, 2018

Michael-F-Bryan commented Mar 7, 2018

mattico commented Mar 7, 2018

Michael-F-Bryan commented Mar 11, 2018

mattico commented Mar 11, 2018

Michael-F-Bryan commented Mar 15, 2018 • edited Loading

mattico commented Feb 2, 2018 •

edited

Loading

mattico Feb 18, 2018 •

edited

Loading

Michael-F-Bryan Feb 19, 2018 •

edited

Loading

Michael-F-Bryan left a comment •

edited

Loading

Michael-F-Bryan commented Feb 24, 2018 •

edited

Loading

Michael-F-Bryan commented Mar 15, 2018 •

edited

Loading