-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Search with Elasticlunr, updated #604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bed285a
to
d17976c
Compare
If there's anything I can do to make this easier to review, let me know. I'd be happy to break it into smaller commits. No pressure, though. Wouldn't want a good maintainer to get burned out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done an initial skim through the code. Overall I can't see any glaring issues and it looks pretty good.
We could probably do with a couple more tests, particularly around rendering. That way we'll be able to prevent accidental regressions in the future and it also helps act as "live examples" of how the pieces work.
Another question I had is about how we distribute the elasticlunr
JavaScript dependency. We currently vendor minified JavaScript files in the repo, but that makes things a bit annoying when people want to package for platforms like Debian and it's not easy to verify authenticity or whether something is open-source.
src/config.rs
Outdated
#[serde(default, rename_all = "kebab-case")] | ||
pub struct Search { | ||
/// Enable in browser searching. Default: `true` (if `search` feature is enabled). | ||
pub enable: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to have an enabled
flag? I would have thought we can see search is enabled because a output.html.search
is either present or not.
So an alternative to the enabled
flag would be to change the search
field in HtmlConfig
to be Option<Search>
... Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Especially now that the JS is conditionally included, the flag is not very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that would change search to be disabled by default. Is that your intention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dunno, to be honest it's one of those features you'll always want for your document.
It should probably be always turned on unless you manually deactivate the search
feature flag when installing mdbook
(e.g. when packaging for Debian).
src/config.rs
Outdated
pub enable: bool, | ||
/// The path to the searcher to use. If not specified, will use the searcher code included in | ||
/// MDBook. Default: `""` | ||
pub searcher: PathBuf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think we may end up having a similar issue to themes here. By adding this to the config file we're theoretically allowing people to swap in their own implementations, but because there is no well defined (and documented) interface to our searcher
, people probably won't be able to make use of this knob in practice.
What would one gain by being able to use their own searcher
implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the way it's setup is not ideal, I just copied the setup used for playpen_editor.
While we give no guarantees about the stability or interfaces for the searcher, the current system can be useful in practice for making small changes to search behavior (remove highlighting, change keyboard shortcuts, etc.). I'll agree, however, that this feature is unlikely to be used much if at all. In general I don't think many users will make significant theme changes out-of-tree until we provide a better way to do that.
So what change can we make to this PR? Here's what I'm thinking:
- Remove the
editor
andsearcher
settings and the corresponding file override behavior. - Add
copy-js
settings for both. If set to false, the JS code for the feature will not be copied to the output directory. This is more flexible since users can then include whatever JS they want withadditional-js
, and it reduces the complexity of this little-used feature to almost zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds quite reasonable 👍
Besides, it's not hard to put back in if we find people actually want the feature.
src/config.rs
Outdated
fn default() -> Search { | ||
// Please update the documentation of `Search` when changing values! | ||
Search { | ||
enable: cfg!(feature = "search"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice use of the cfg!()
macro! 👍
if cfg!(feature = "search") { | ||
data.insert("search_enabled".to_owned(), json!(true)); | ||
} else { | ||
warn!("mdBook compiled without search support, ignoring `output.html.search.enable`"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea with adding a warning! Should we also say something like "please reinstall with cargo install mdbook --force --features search
to use the search feature"?
@@ -690,9 +647,9 @@ mod tests { | |||
|
|||
#[test] | |||
fn anchor_generation() { | |||
assert_eq!(id_from_content("## `--passes`: add more rustdoc passes"), | |||
assert_eq!(utils::id_from_content("## `--passes`: add more rustdoc passes"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the id_from_content()
function has been moved to the utils
module it may make sense to relocate these tests as well. I'm also curious to see how the ID generation will go when your header includes emojis, bold, or bits of code.
} | ||
|
||
/// Write the given data to a file, creating it first if necessary | ||
pub fn write_file<P: AsRef<Path>>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for removing this from the HTML renderer! It actually started life as a helper method on MDBook
and got moved out because it coupled the entire HTML renderer to MDBook
's internals. At the time I was refactoring a different part of the system, so didn't want to go off on a tangent by also fixing write_file()
.
src/utils/mod.rs
Outdated
use std::borrow::Cow; | ||
|
||
pub use self::string::{RangeArgument, take_lines}; | ||
|
||
pub fn remove_html_tags<'a>(text: &'a str) -> Cow<'a, str> { | ||
let regex = Regex::new(r"<[^>]*?>").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a good idea to use a regex here instead of something like a HTML sanitizer? I'm not 100% familiar with how the pulldown-cmark
markdown renderer deals with angle brackets, but if we're not careful when calling it, we could accidentally clobber the wrong stuff (e.g. an innocent paragraph like "if `x < 0` and `x > -5`, blah blah blah").
Of course, that approach could also be overkill and 95% of the time the regex will work just fine...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a Regex on HTML does feel bad, but I didn't want to pull in a whole sanitizer. Angle brackets already have to be escaped in HTML blocks so it could clobber some data, but I blame the user 😄 . Worst-case scenario is that some pathological text inside of an HTML tag isn't in the search index.
} | ||
|
||
/// Renders markdown into flat unformatted text and adds it to the search index. | ||
fn render_item(index: &mut Index, search_config: &Search, item: &BookItem) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth creating a bunch of tests for this so we can be sure the text gets stripped as intended?
} | ||
|
||
/// Converts the index and search options to a JSON string | ||
fn write_to_json(index: Index, search_config: &Search) -> Result<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have access to any "known good" versions of the generated JSON? It's probably a good idea to add a test or two to make sure we're able to generate valid JSON, that way we can also prevent accidental regressions in the future if someone needs to come back and change things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's difficult to have any significant test for the entire search index. Tiny changes in the source material change the index drastically, and it's so large that it's difficult to determine if any change matters at all. If you have suggestions, I'm all ears.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, you make a good point... What I usually do is generate a small example and store that as a file, giving it a quick look through to make sure it seems alright. Then every time after that you can have a test which uses the same input and check the rendered document is the same as our known good copy.
I have a similar issue at work where it's not really possible to manually construct an object and check each of its elements are the same every time. For that I've got a create_fixture()
function and a constant which I manually toggle to make it regenerate the fixture.
/// Regenerates the test fixture (pre-nested list of cutlines).
///
/// # Note
///
/// You'll need to run `cargo test` manually here, if tests are being run via
/// `cargo watch` and you try to regenerate the test fixture we'll get midway
/// through writing out the fixture before `cargo watch` restarts the process.
/// Leaving a mangled fixture file.
const GENERATE_FIXTURE: bool = false;
// loads of tests
fn get_fixture() -> Vec<Cutline> {
if cfg!(windows) && GENERATE_FIXTURE {
let inputs = create_known_dummy_inputs();
let thing = generate_output(&inputs).unwrap();
let fixture_path =
Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/data/fixture.json");
let mut f = ::std::fs::File::create(&fixture_path).unwrap();
serde_json::to_writer_pretty(&mut f, &thing).unwrap();
nested
} else {
let jason = include_str!("data/fixture.json");
serde_json::from_str(jason).expect("Unable to deserialize the fixture")
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's a good solution! One of the issues I was running into is that rendering dummy_book is different when you're in a test (due to env vars?) so it's annoying to make fixtures.
src/theme/searcher/mod.rs
Outdated
} | ||
|
||
impl Searcher { | ||
pub fn new(src: &Path) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create a test for this to make sure the file loading logic is sound and people are able to override the scripts?
I think that addresses everything, except for the JS dependency question. I would like to address that, I'm working out a solution right now. I think we should leave that for another PR, though. |
I just wanted to say thank you for taking this over. 👍 (I wished I had enough time for this but nope, got a new job and more...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another review of the search code. Other than my question around #[serde(default)]
and the Option<Search>
item in the HTML settings, I think it's pretty much ready to merge. I'm just going to try it out on my machine first to make sure there aren't any UI issues.
This is really really cool by the way. Thank you for all the hard work!
@@ -152,15 +152,20 @@ pub struct Chapter { | |||
pub sub_items: Vec<BookItem>, | |||
/// The chapter's location, relative to the `SUMMARY.md` file. | |||
pub path: PathBuf, | |||
/// An ordered list of the names of each chapter above this one, in the hierarchy. | |||
pub parent_names: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing that we need to manually store the parent names and keep them in sync makes me question whether it may be better to represent a Book
as containing a "doubly-linked" tree of chapters, or even turn it into a graph structure. That way a Chapter
can contain a pointer back to its parent and you can use that to walk up and down the levels of the tree.
It would also hopefully simplify tracking section numbers and parsing. I might play around with this idea on a private branch and see how it goes.
@@ -425,29 +425,79 @@ pub struct HtmlConfig { | |||
pub livereload_url: Option<String>, | |||
/// Should section labels be rendered? | |||
pub no_section_label: bool, | |||
/// Search settings. If `None`, the default will be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If
None
, the default will be used.
I believe this is already sorted out by the #[serde(default)]
annotation on HtmlConfig
. That annotation tells serde to use Search::default()
if it can't deserialize the settings.
To me, when I see Option<Search>
I'd think this indicates that searching is optional and when no Search
is found in my book.toml
this means my book won't have a search bar... Am I interpreting it wrong, or is this comment out of sync with the implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is already sorted out by the #[serde(default)] annotation on HtmlConfig. That annotation tells serde to use Search::default() if it can't deserialize the settings.
The default it'll use is None
, since that's how Default::default::<Option>()
is implemented: https://doc.rust-lang.org/src/core/option.rs.html#900-906, and we're using #[derive(Default)]
. The html renderer will then manually use unwrap_or_default
to get the default configuration. The reason to do this is so we can print the warning message if the user is using mdbook compiled without feature = "search"
, but does have an [output.html.search]
table in their book.toml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be less confusing overall to just bring back the [output.html.search.enabled]
field.
pub editable: bool, | ||
/// Copy JavaScript files for the editor to the output directory? | ||
/// Default: `true`. | ||
pub copy_js: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably belongs in its own PR. Changing how the playpen works and whether we copy across JavaScript files is a non-trivial change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose. Should I revert the changes to editor
, but leave the output.html.search.copy-js
setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, don't worry about it for now, it's probably not worth the effort. As long as it's noted in the PR description it should be fine.
for (i, item) in book.iter().enumerate() { | ||
let mut depthfirstiterator = book.iter(); | ||
let mut is_index = true; | ||
while let Some(item) = depthfirstiterator.next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this just be a plain old for
loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure why this was changed. This is from the original PR.
tests/rendered_output.rs
Outdated
// | ||
// If you're pretty sure you haven't broken anything, change `GENERATE_FIXTURE` | ||
// above to `true`, and run `cargo test` to generate a new fixture. Then | ||
// change it back to `false`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got a nice laugh out of this block of comments, yet it's still surprisingly useful for people who see it because of a failing test. Thank you!
I just tried this out on my laptop. One thing I've noticed is the search index is only available when running mdbook with The browser console shows the following error.
After a brief search it seems like this is a Chrome-specific issue and there isn't really any way around it other than invoking chrome from the command-line with the Other than the CORS issue, the search functionality is really, really nice 👍 |
Don't think the issue is Chrome-specific - you just can't read arbitrary files from the filesystem. A (nasty) workaround could be to make a |
I looked into HTML sanitization. |
aae5676
to
515712c
Compare
This commit adds search functionality to mdBook, based on work done by @Phaiax. The in-browser search code uses elasticlunr.js to execute the search, using an index generated at book build time by elasticlunr-rs.
Someone on Reddit was wondering how the rust book was generated and said they checked the source. Thought I'd put this here. Might be a good idea to have a little footer "made with mdBook", but this'll do for now.
515712c
to
c37e9a9
Compare
Probably wait for https://github.com/notriddle/ammonia/pull/90 now. |
@mattico, I just looked through the PR again and I'm pretty happy with it 👍 On a side note, it looks like We can always hack around this by updating the For posterity's sake, this is what I see when running
Other than the annoying |
Ah, should've run more than cargo test. I'll look into it. |
35875e2
to
5720c1a
Compare
5720c1a
to
1251101
Compare
Fixed. |
And merged 🎉 |
4 months ago I looked at the search issue and thought "how hard can it be"?. 🎉🎊🎖️📯 |
I've been using Should I also make a post to /r/rust? Is there anything in particular you'd like me to say about |
That's a great idea! People should know to rebuild their docs.
Hmm. I'd mention that if anyone notices something which won't show up in search results that they should file an issue on here. There's possibly still some bugs in the index preprocessing or elasticlunr-rs. If you're going to have a "thanks to" section make sure to mention phaiax for writing most of this ^^ and notriddle for maintaining ammonia. There are a lot of questions on /r/rust like "what's a project I can contribute to to learn Rust". I think mdbook works well for that, the code is easy to understand and its fun to see the results in the browser. On the other hand sometimes working on it is more HTML than Rust, and a lot of the stuff that needs to be worked on is more architecture than self-contained improvements. |
@mattico I think we have a small CSS issue with the search results 😜 Do you know what may have caused this? I'm using Firefox on Windows, but I remember testing the search feature before our @sorin-davidoi, this should be a pretty trivial CSS fix, shouldn't it? |
* Add search with elasticlunr.js This commit adds search functionality to mdBook, based on work done by @Phaiax. The in-browser search code uses elasticlunr.js to execute the search, using an index generated at book build time by elasticlunr-rs. * Add generator comment Someone on Reddit was wondering how the rust book was generated and said they checked the source. Thought I'd put this here. Might be a good idea to have a little footer "made with mdBook", but this'll do for now. * Remove search/editor file override behavior * Use for loop for book iterator * Improve HTML regex * Fix search CORS in file URIs * Use ammonia to sanitize HTML * Filter html5ever log messages
@Phaiax's #472, manually rebased, plus a few improvements.
Major Changes from #472:
Chapter.parent_names
, rather than modifying theBookItems
iteratorSearchDocument
removed,render_item
renders directly intoelasticlunr::Index
Closes #472
Fixes #51