Skip to content

Book representation #371

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Michael-F-Bryan
Copy link
Contributor

@Michael-F-Bryan Michael-F-Bryan commented Jul 2, 2017

This PR adds a new internal representation for the book, plus functionality for loading that book using the Summary we got when parsing SUMMARY.md.

It is part of a larger issue (#359) which is working towards refactoring the Book structure and decouple it from the other bits like rendering and configuration.

@Michael-F-Bryan Michael-F-Bryan changed the title Book representation [WIP] Book representation Jul 2, 2017
@Michael-F-Bryan Michael-F-Bryan changed the title [WIP] Book representation Book representation Jul 3, 2017
@Michael-F-Bryan
Copy link
Contributor Author

Michael-F-Bryan commented Jul 3, 2017

Believe it or not but this PR is pretty much complete. It only took 2 or 3 commits (pretty sure the first commit is 8310cce) to recursively walk the Summary and load each chapter from disk. I also added an integration test and it can successfully load the example book, plus looking through the JSON representation it looks correct.

@azerupi or @budziq would you be able to have a look through #360 before merging this one. I branched directly from there instead of master, so this PR contains a lot of commits from that PR. When you merge it the history should clean itself up.

@budziq
Copy link
Contributor

budziq commented Jul 3, 2017

Wow that s a lot of work @Michael-F-Bryan !

But would you be willing to squash the commits somewhat?
The commits represent your internal work in progress and I'm not sure if any of them is self contained enough for a per commit review.
ATM the review will be quite hard for instance I cannot say if the code you have introduced has been plugged anywhere into the mdbook binary. I have not started to review yet just due to sheer number of internal WIP / refactor / lint fix commits (as opposed to rather manageable final size of the PR).

@Michael-F-Bryan
Copy link
Contributor Author

@azerupi, sure thing. Most of the commits are checkpoints I've made along the way and #360 will probably be easier to review as a single unit.

I'll squash #360 to make that easier to review and merge, then once we're all happy with that I can clean up the history for this commit.

Would you like me to post a small summary as a comment on #360 so you know what it's meant to do? I haven't tied any of this code into the actual mdbook binary, so the mdbook::loader module isn't being called/used by anything other than integration tests yet.

@budziq
Copy link
Contributor

budziq commented Jul 3, 2017

Would you like me to post a small summary as a comment on #360 so you know what it's meant to do?

I'm not sure if its worth the effort to keep two separate PR's Why not incorporate all of the changes to #360 change it's title and close this one? I would advise against merging it until it's actually used anyway.

As for the comment/summary. Yes please do! And please add descriptive information to the squashed commit.

@Michael-F-Bryan
Copy link
Contributor Author

Michael-F-Bryan commented Jul 3, 2017

Why not incorporate all of the changes to #360 change it's title and close this one?

Yeah, that's a good idea. I was thinking of keeping the two separate because they are logically separate (one parses SUMMARY.md, the other turns it into a Book), but merging them will probably be less work for you guys.

I'm not sure if I want to hold out on merging until it's actually used though. Integrating the new loader module with the rest of mdbook will take a bit of work and probably increase the amount of code to review to 1500-2000 lines.

Anyway, here's a rough summary of the work I've done so far:

Summary Module

  • Added a private submodule called mdbook::loader::summary which contains all the code for parsing SUMMARY.md
  • A Summary contains a title (optional), then some prefix, numbered, and suffix chapters (technically Vec<SummaryItem>)
  • A SummaryItem is either a Link (i.e. link to a chapter), or a separator
  • A Link contains the chapter name, its location relative to the book's src/ directory, and a list of nested SummaryItems
  • The SummaryParser (a state machine parser) uses pulldown_cmark to turn the SUMMARY.md string into a stream of Events, it then iterates over those events changing its behaviour depending on the current state,
    • The states are Start, PrefixChapters, NestedChapters(u32) (the u32 represents your nesting level, because lists can contain lists), SuffixChapters, and End
    • Each state will read the appropriate link and build up the Summary, skipping any events which aren't a link, horizontal rule (separator), or a list

Book Module

  • Created another private submodule, mdbook::loader::book
  • This submodule contains the data types representing a Book
    • For now the Book just contains a list of BookItems (either chapters or separators)
  • A Chapter contains its name, contents (as one long string), an optional section number (only numbered chapters have numbers, obviously), and any nested chapters
  • There's a function for loading a single Chapter from disk using it's associated Link entry from the SUMMARY.md
  • Another function builds up the Book by recursively visiting all Links and separators in the Summary and joining them into a single Vec<SummaryItem>. This is the only non-dumb-data-type item which is actually exported from the book module

Loader Module

  • The mdbook::loader module contains a Loader which ties together the functions and data types exposed by the summary and book modules into a more convenient API

Tests

  • Added a heap of tests, about half the PR is actually just unit and integration tests, in particular:
  • Added one or two unit tests for each state in the parser's state machine
  • Made sure you can load from disk by writing some files to a temporary directory
  • Added integration tests for parsing a dummy SUMMARY.md then asserting the result is exactly what we expected
  • Made sure the Loader can load the entire example-book from disk and doesn't crash or hit an error
  • Increased test coverage from 34.4% to 47.7% (as reported by cargo kcov) 😃

I've made sure everything has documentation (with #![deny(missing_docs)]), so if you check out a local copy of my book-representation branch you can open up the docs for a more high-level view of how loading works and what the data types are.

I feel like I'm exposing more than I'd like to here because all the data types (Summary, Book, etc) are completely public. I originally did that because it's meant to be a plain old data tree which people can inspect to modify and browse in plugins or when making their own Renderer, so let me know if things could be made private.

I'm also trying out the new rustfmt-nightly and have that set to automatically reformat on save, so if there are any style issues don't be shy about criticizing 😜

Sorry for the wall of words and the massive PR, in retrospect I really should have broken this up into much smaller pieces!

EDIT: Forgot to mention all the tests

@budziq
Copy link
Contributor

budziq commented Jul 3, 2017

@Michael-F-Bryan thanks for the summary. It will help with the review 👍

I'm not sure if I want to hold out on merging until it's actually used though.

I might just be me but I subscribe to the line of thought that anything that is not used or does not implement a required usecase should be immediately purged from the code base (mostly because I work on very large legacy codebases day to day and hate code that was "just left because it might be used someday" with heat of thousand suns) 😉 These things tend to quickly bitrot add complexity, reduce readability and maintainability.

On the other hand actually plugging your changes, seeing how they interact with rest of the codebase and replace the old versions will let you iterate and minimise the design if required.

Integrating the new loader module with the rest of mdbook will take a bit of work

No rush. I'll gladly wait to see how the final api will turn out 👍

and probably increase the amount of code to review to 1500-2000 lines.

As long as the changes are relatively self contained in commits this number is not that staggering. But First thing that comes to my mind is that you might be trying to make the design too broad. How about enumerating the actual usecases that are not covered by the legacy code and problems that you are trying to address in a short list here?

@Michael-F-Bryan
Copy link
Contributor Author

These things tend to quickly bitrot add complexity, reduce readability and maintainability.

When you put it that way it makes a lot of sense!

First thing that comes to my mind is that you might be trying to make the design too broad

I was worried about that too, but over half the code is actually tests to ensure the implementation does what I expect it to. I've found doing unit and integration tests like that also helps quite a bit with the overall API ergonomics.

I originally wanted to do this refactoring because I was trying to make a PDF backend and found the existing Renderer and MdBook components too complicated and coupled to the HTML backend, so I'm trying to make sure the API will be easy for others to build plugins and renderers on top of.

I've tried to make sure the loader module in general is quite self-contained. As far as I'm concerned, its only responsibility is to load some in-memory representation of a book into memory, using the SUMMARY.md as a guide. As such, it shouldn't really need to interact with any other parts of the system.

@Michael-F-Bryan
Copy link
Contributor Author

Here are a couple use cases I was thinking of:

MathJax Plugin Author

I want to have easy access to all of the source text from each chapter in a book. I should be able to implement a Visitor trait (which looks something like the syn::visit::Visitor) and only have to override a single method to do my mathjax substitution.

struct MathJaxPreprocessor;

impl Visitor for MathJaxPreprocessor {
  fn visit_chapter_content(&mut self, content: &mut String) {
    let pattern = Regex::new(...).unwrap();
    content = pattern.replace_all(...);
  }
}

PDF Renderer Author

As a renderer author I want complete access to all facets of the book, preferably in some sort of AST-like structure so I can walk the AST, generating a LaTeX document in memory which I can then compile to PDF. Similar to the plugin author I would like some sort of Visitor trait which I can implement, then I can create a mdbook-pdf crate which looks something like this:

use pdf::Renderer;
use mdbook::{MdBook, Visitor};

fn main() {
  let renderer = Renderer::new(...);
  let mut md = MdBook::new("/path/to/book/root");  // Get the actual path using `clap`
  md.set_renderer("pdf", Box::new(renderer));  // takes a boxed `Visitor`

  md.build().unwrap();
}

MdBook Developer

It'd be nice to structure the process of building a book as a pipeline of transformations, with each step manipulating a tree-like Book. kinda like how you'd do multiple passes and graph manipulations in a compiler. the Book would be a really dumb data object which lets you access its attributes and manipulate them. Each step can then be configured based on the contents of Book.toml. This should make the entire process easier to reason about and cleaner overall.

The Book should be completely (or as much as possible) decoupled from any particular renderer and allow me to make changes without breaking other people's code.


Thanks for asking for a couple use cases. Even as I was writing them up I figured out a nicer way to expose the Book's contents in a way that still allows it to evolve.

@budziq
Copy link
Contributor

budziq commented Jul 4, 2017

@Michael-F-Bryan thanks for the writeup. Here are some of my thoughts:

I originally wanted to do this refactoring because I was trying to make a PDF backend and found the existing Renderer and MdBook components too complicated and coupled to the HTML backend,

Yep that is a serious problem currently and major roadblock in supporting other renderers 👍

MathJax Plugin

This will probably the hardest one to tackle as MathJax is a superset of of Some LaTeX with math specific plugins + MathML (so the output of MathJax on html + MathJax.js will most likely be largely incompatible with pdf one that would be LaTeX based). I suspect that we cannot do this cleanly and it might be easier to have separate incompatible plugins for Math Symbols for each renderer (otherwise we might have a MathJax to image converter).

I would strongly advise to implement a second renderer (the more stripped and unfeatureful as possible, certainly lacking MathJax support) along this PR to actually show what is needed from this PR and most likely we will arrive at a simpler design which we might always rewrite again if need arises.

I want to have easy access to all of the source text from each chapter in a book. I should be able to implement a Visitor trait
...
AST-like structure so I can walk the AST,

Ar we sure that we need a design of such caliber (yet)? The book is a relatively flat sequential structure (summary is actually needed only to establish viewing order and render TOC).
Most likely there will be no need for actual AST representation for each page until we stumble on a real roadblock in implementation that would make implementation impossible otherwise.
Most likely html, epub, mobi/azw renderers would never need and AST (as these are html based and would be handled by pulldown-cmark possibly without need for knowledge about MarkDown AST)
If a renderer plugin would like to have access to MarkDown AST I would suggest it just uses the pulldown-cmark (we would not be able to cater to all possible renderer needs with our simplied AST anyway so lets not try to and delegat the work to the more qualified crate)

I would rather see a minimal pdf renderer along with major decoupling and simplification than a more complicated (and arguably over engineered) MdBook design without actual renderer implementation that warrants it.

md.set_renderer("pdf", Box::new(renderer)); // takes a boxed Visitor

I would suggest avoiding the visitor pattern if possible as this will needlessly complicate the design without much benefit. The question is which element should drive the iteration (the book or the renderer). Arguably the renderer will have a more domain knowledge about its output and requirements and will be able to make better choices about items and data from the book it would like to visit/revisit.

Most likely making the book contents iterable and accessible in some sane fashion from the renderer would make the design easier (Granted there would be some minor code duplication between renderers that we might extract into separate logic if we have more than 1-2 renderers 😉 )

book as a pipeline of transformations,

This is a nice idea and would be quite easily implemented with some simple imperative design that would boil down to following pseudocode. Please do not take it as a real suggestion buth rather another point of view for discussion:

  let mut md = MdBook::new("/path/to/book/root");
  // for renderer agnostic preprocessing that works on bare markdown (like the `{{#playpen}}` and upcoming `{{#include}}` )
  for prep in preprocessors {
      // possibly plugins here too
       prep.process(md)?;
  }
  for renderer in renderers {
       renderer.register_plugins(ctx); // get the plugins from some global context - 
       // run renderer dependent plugins and preprocessing internally in render implementation
       render.render(md).build()?;
  }

@Michael-F-Bryan
Copy link
Contributor Author

Are we sure that we need a design of such caliber (yet)? The book is a relatively flat sequential structure (summary is actually needed only to establish viewing order and render TOC).

I wasn't planning on making it too complex, to me a book is just a collection of chapters, and each chapter looks roughly like this:

struct Chapter {
  name: String,
  content: String,
  number: Option<SectionNumber>, // SectionNumber is just a newtype'd Vec<u32>
  nested_chapters: Vec<Chapter>,  // because chapters can have sub-chapters
}

I probably shouldn't have used the word "AST", what I was meaning is a graph where each node is a Chapter and each Chapter may have sub-chapters. The Visitor just gives you a convenient trait for all plugins and renderers to implement, allowing the MdBook to accept any boxed Visitor so people can do their thing. It also means users don't have to make sure they implement all the recursion manually, they can rely on Visitor's default impls for stuff they don't care about.

This is roughly the design we already have, except I've replaced the PathBuf with a plain String containing the content. Presumably we were storing the PathBuf so you can open a File instead of holding everything in memory. That feels like premature optimisation to me and just holding everything in memory (it's a couple hundred k of text at most) makes things a lot simpler for everyone and means you are no longer bound to the filesystem (i.e. tests can be done entirely in-memory instead of in a tempdir).

@budziq
Copy link
Contributor

budziq commented Jul 4, 2017

number: Option, // SectionNumber is just a newtype'd Vec

Would we be able to use this data except for formatting a string prefix for chapter name? I don't see a real possibility for it in case of visitor approach. In case of non visitor approach we might use these for some wacky indexing/traversal if need be (but please see my next comment)

I was meaning is a graph where each node is a Chapter and each Chapter may have sub-chapters.

I hope that we are not planning to create any cycles . I guess that we could use a tree structure. We might even get away with a plain list/vec representing the tree traversal and I would suggest a vec as a first POC to findout if we need anything more sophisticated.

It also means users don't have to make sure they implement all the recursion manually, they can rely on Visitor's default impls for stuff they don't care about.

  • Visitor approach will most likely make implementing navigation a lot harder.
  • implementing recursion by hand can be avoided by simply providing an iterator representing tree traversal (possibly double ended one). This is the approach I would advocate
  • please see my previous comment about the fact that we do not know if some renderer will not need a non sequential traversal imposed by visitor. (Nontrivial cases would require direct node indexing but we would need to encounter such case before adding such api)

Presumably we were storing the PathBuf so you can open a File instead of holding everything in memory.

Might be. I'm not opposed to storing everything in memory. We are looking at trivial amounts of data anyway.

Anyhow I would suggest that this effort would be driven by the needs of a budding pdf (or any other) renderer. Then we would be able to iterate on the design to find the cleanest and most maintainable one.

@Michael-F-Bryan
Copy link
Contributor Author

I've started an initial PDF renderer here. It actually works and took about 5 minutes to write up, I just need to break the markdown up and convert it to LaTeX but that problem is kinda orthogonal to the book's representation.

I've attached the outputted document after running it over the example book (just knock off the trailing ".txt", apparently github doesn't like attaching "*.tex" documents).

book-example.tex.txt

@budziq
Copy link
Contributor

budziq commented Jul 4, 2017

Very cool 👍 (although I would call it LaTex renderer which might be even more awesome amd versatile in its own right)
I venture that using iterator instead of visitor would make the implementation even shorter 😄

Also the second important question is if we bundle some additional renderrers with mdbook or do we split (even on repo level). I guess that I would start with approach similar to ripgrep. One canonical repo with several crates. Then you could add this WIP rendererr to this PR making the design much easier to reason. And then experiment with split to different repos.

@Michael-F-Bryan
Copy link
Contributor Author

Would we be able to use this data except for formatting a string prefix for chapter name

I think that's the only reason we need it. MdBook makes a distinction between prefix chapters, numbered chapters, and suffix chapters so I thought we'd still need it. I prefer having a newtype'd list of numbers though because I spent forever trying to figure out what the string in mdbook::book::BookItem::Chapter(String, Chapter) was. With a newtype you can add a trivial Display impl, plus get the extra context of knowing what it is and not needing to do string parsing to get the elements.

I hope that we are not planning to create any cycles . I guess that we could use a tree structure.

Yep, Rust's borrowing doesn't like graphs with cycles (you end up with multiple mutable references), plus a tree structure seems to be the logical way of representing a book because your book almost always reflects how you've laid things out on the file system.

I venture that using iterator instead of visitor would make the implementation even shorter

Haha, I'll implement a Book::iter() function which gives you an iterator over all of the elements. I guess if anyone wants to iterate over the chapters in a Book they can implement it themselves, like I did in the pdf renderer.

The visitor pattern may be a premature optimisation/generalisation on my part because I've been working on a toy language and that's about the only (feasible) way of inspecting and manipulating your AST 😜

@Michael-F-Bryan
Copy link
Contributor Author

Michael-F-Bryan commented Jul 7, 2017

Sorry it's taken me so long to get back to this! I got the go ahead to start working on a Rust library/DLL which will be integrated into a product at work so I've been a tad distracted.

I've squashed the commits to make the history a bit easier to navigate, plus I implemented a depth-first iterator over the items in a book as per popular demand 😉 I reckon my implementation may be a bit nicer than the existing one, so that's a bonus too.

Should my next step be to try and start integrating Book into the MdBook struct?

@Michael-F-Bryan
Copy link
Contributor Author

Michael-F-Bryan commented Jul 7, 2017

Ugh, apparently Appveyor's "stable" target isn't actually the stable release, so even though struct field shorthands are stable (as of 1.17) the build fails trying to compile the unicode-bidi dependency.

From what I can tell the dependency chain is hyper > url > idna > unicode-bidi, but I'm not sure what I can do to fix it on my end...

EDIT: Fixed. Rust was being installed directly from static.rust-lang.org and that's probably gone stale, so I'm using rustup instead.

@budziq
Copy link
Contributor

budziq commented Jul 7, 2017

Thanks a lot @Michael-F-Bryan . I would put the last commit azerupi@092b445 in a separate PR as it is not really related to rest of the work and could be merged immediately. I'll try to give a review of rest of the PR shortly

Copy link
Contributor

@budziq budziq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, here are some of my random thoughts.

I find the code very clean and very good craftsmanship 👍 ! But I have a feeling that Summary and Loader over more than a little over-complicated for relatively simple task at hand while the Book seams ok. Could we pare down the implementation and strip the number of entities to the bare minimum? Currently It's few times longer and more complex while performing exactly the same tasks (not counting the enormous amount of tests which I adore ❤️ ).

Don't get me wrong. I love the refactoring and separation for the most part (I haven't done a serious code review yet as Imho the top level organization could do some more thinking)

And I would love to see the code actually used in action replacing the original implementations (I also think that the actual removal and replacement of original implementations should be part of this PR) once it would be pared down.

@@ -0,0 +1,322 @@
#![allow(missing_docs, unused_variables, unused_imports, dead_code)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not add any of these pragmas. If anything I would add most of these in deny mode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! I added those during development to decrease the signal-to-noise ratio of the compiler output. Looks like I forgot to remove it afterwards.

src/lib.rs Outdated
extern crate log;
extern crate pretty_assertions;
#[cfg(test)]
extern crate tempdir;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need it here? It's used only in loader/book.rs tests mod

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the pretty_assertions crate invaluable when you try to debug assertions where the assert!() is comparing two complex structures. However you have a point in that it's not really worth adding yet another dependency for a feature which is only used 3 or 4 times... I think I'll remove it and see how I go without it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I meant tempdir here. pretty_assertions cannot be imported anywhere else due to macros.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, what would you prefer me to do here? Do you think I should try to remove the tempdir dependency, rearrange the imports, or even move the extern crate statement to loader/book.rs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tempdir is very handy in tests so I guess that we could make the import it localized to them as long as it's used only in one suite. If you plan to use it elsewere (or in other suites) then its fine where it is.

///
/// You need to pass in the book's source directory because all the links in
/// `SUMMARY.md` give the chapter locations relative to it.
pub fn load_book_from_disk<P: AsRef<Path>>(summary: &Summary, src_dir: P) -> Result<Book> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would suggest renaming the method to just load_book

@@ -0,0 +1,726 @@
#![allow(dead_code, unused_variables)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not add these pragmas


/// Enum representing any type of item which can be added to a book.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum BookItem {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we need separation between Chapter, SummaryItem and BookItem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra level of indirection from BookItem is necessary because the SUMMARY.md format says your book is (conceptually) a list of Chapters and section separators.

The difference between BookItem and SummaryItem is because summary parsing is a separate process to loading a book. Parsing the summary should just need to take in your SUMMARY.md as a string and not need to touch the filesystem or load chapters into memory. That's why Summary is a list of SummaryItems which are either section separators or Links (e.g. - [Introduction](./src/intro.md)"). Hence the distinct Link entity instead of reusing Chapter.

.iter_mut()
.enumerate()
.filter_map(|(i, item)| item.maybe_link_mut().map(|l| (i, l)))
.rev()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't just last() work here?

Copy link
Contributor Author

@Michael-F-Bryan Michael-F-Bryan Jul 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I originally tried too. What you are trying to do is get the last chapter from the sub_items so you can append the new Chapter to it, but what happens when you call last() and the last item is a Separator instead of a Chapter?

I agree that it's not overly pretty or easy to understand, I'll see if I can pull it out into a method.

Also, maybe_link_mut() feels like a bit of a hack so I don't need to write a match statement inside a closure. It needs to go...

///
/// This is roughly the equivalent of `[Some section](./path/to/file.md)`.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct Link {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need another entity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably want to look at my previous comment for more info, but basically a Chapter and a Link have a very different purpose even if they both look quite similar. One is a chapter and its contents loaded from disk (with all the IO and logic that entails), whereas the other represents a single entry in the SUMMARY.md file.

/// Given a particular level (e.g. 3), go that many levels down the `Link`'s
/// nested items then append the provided item to the last `Link` in the
/// list.
fn push_item_at_nesting_level(links: &mut Vec<SummaryItem>, mut item: SummaryItem, level: usize, mut section_number: SectionNumber)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this signature is quite awkward. Why not make this a method over summary?

appveyor.yml Outdated
- rust-%RUST_VERSION%-%TARGET%.exe /VERYSILENT /NORESTART /DIR="C:\Program Files (x86)\Rust"
- SET PATH=%PATH%;C:\Program Files (x86)\Rust\bin
- rustc -V
- ps: >-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please separate it to a different PR. This is orthogonal and is most likely to be merged much earlier than the rest of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I pulled it out into #373 so it won't be held up by this PR.

@Michael-F-Bryan
Copy link
Contributor Author

@budziq I definitely agree with what you are saying about over-complicating things and introducing loads of entities which may or may not even be necessary. I've had a bit of a re-think about what I'm doing and which things

I've completely removed the mdbook::loader::summary exports from mdbook's public interface (summary parsing is just an implementation detail of loading a book) and replaced the Loader struct with a single load_book() function because it doesn't really pull its weight. This removes a lot of the exposed entities while exporting the one bit of functionality we actually care about (i.e. the load_book function).

I think a lot of the complication comes about because I've considered parsing SUMMARY.md and loading a Book from disk to be two orthogonal tasks, whereas the original implementation combined both into a single function. This means I'm creating an intermediate tree structure (Summary, with its SummaryItem and Link) with all the extra complexity that entails, plus the bit which loads the book then needs to traverse this tree.

I'd kinda like to leave the SummaryParser in the summary submodule because parsing is one of those things which is hard to do well and requires a structured approach. Otherwise it'll turn into a code monster which has lots of hard to find bugs and implicit state and will make you very sad in the long run. At the very start of the PR I was just trying to clean up the existing implementation and slightly alter the Book structure but eventually gave up because of exactly those reasons.

Out of curiosity, I just went through and did a really rough count of lines of executable code excluding tests and docs after incorporating some of the feedback you've given me. This PR would replace 320 lines (230 from mdbook::parse and 90 from mdbook::book::bookitem) with about 450 lines (300 lines in mdbook::loader::summary, 120 from mdbook::loader::book, and 20 from the top level mdbook::loader). Obviously you'll want to take those numbers with a big pinch of salt because I'll be biased in what I call "executable lines of code", but they're some ballpark figures for how much change this PR is proposing.

@budziq
Copy link
Contributor

budziq commented Jul 8, 2017

Out of curiosity, I just went through and did a really rough count of lines of executable code excluding tests and docs after incorporating some of the feedback you've given me.

So lets take these out of the clean room! I very much like to see these lines replacing the original and how these interact with rest of the codebase in this PR (removing the old implementations would help show how all of it comes together).

@Michael-F-Bryan
Copy link
Contributor Author

@budziq I started integrating this PR in and surprisingly it merged quite nicely. I just needed to changed book::BookItem imports into loader::BookItem then fix a couple compilation errors and everything fell into place 😁

The only non-trivial issue is that MDBook::new() creates a book in a potentially uninitialised state, deferring checking whether a book is valid and loading it until you manually call parse_summary() which will mutate the book's state or throw an error. My Book doesn't like this because you can only construct it using the load_book() function (which requires everything to be valid), you can't create an unintialised Book then update it later on.

@budziq and @azerupi what are your thoughts on MDBook::new() checking the book directory up front and returning a Result<MDBook> so that any time you have a MDBook you know it's already been initialized and everything is correct? Then the MDBook::init() method will turn into a second constructor that also initializes everything on disk and returns your new MDBook (or an error if initialization failed for whatever reason).

I'm also a little scared because once the compiler was satisfied all the tests passed... The reason I'm worried is I know I've at least broken inter-chapter links, so how many other things could I have accidentally broken and not know about? 😨

@budziq
Copy link
Contributor

budziq commented Jul 8, 2017

what are your thoughts on MDBook::new()

I would make a builder type

@Michael-F-Bryan
Copy link
Contributor Author

I'm thinking I might need to back up a commit or two and rethink how I'm merging this into the main library. I'd prefer to try and keep the changes to MDBook as minimal as possible because this PR is mainly about the book's internal representation, whereas as I started integrating the builder in, I started creeping into configuration and overall design and broke a lot of stuff.

It'll be a bit easier when #374 lands because then I can use those tests to make sure I haven't accidentally introduced regressions.

@budziq
Copy link
Contributor

budziq commented Jul 15, 2017 via email

@Michael-F-Bryan
Copy link
Contributor Author

No stress. As I think you said at some point earlier on, I'd rather take a little longer and put some thought into the design than rush things 😁

I'm thinking the next victim for refactoring will probably be the configuration system. It feels like it's threaded through the entire codebase as it's evolved and features have been added, so it may need some TLC before we can do things like plugins and alternative renderers.

From the [pull request comment][pr], here's a rough summary of what
was done in the squashed commits.

---

\# Summary Parser

- Added a private submodule called `mdbook::loader::summary` which
contains all the code for parsing `SUMMARY.md`

- A `Summary` contains a title (optional), then some prefix, numbered,
and suffix chapters (technically `Vec<SummaryItem>`)

- A `SummaryItem` is either a `Link` (i.e. link to a chapter), or a
separator

- A `Link` contains the chapter name, its location relative to the
book's `src/` directory, and a list of nested `SummaryItems`

- The `SummaryParser` (a state machine-based parser) uses
`pulldown_cmark` to turn the `SUMMARY.md` string into a stream of
`Events`, it then iterates over those events changing its behaviour
depending on the current state,

  - The states are `Start`, `PrefixChapters`, `NestedChapters(u32)` (the
  `u32` represents your nesting level, because lists can contain lists),
  `SuffixChapters`, and `End`

  - Each state will read the appropriate link and build up the
  `Summary`, skipping any events which aren't a link, horizontal rule
  (separator), or a list

\# Loader

- Created a basic loader which can be used to load the `SUMMARY.md` in
a directory.

\# Tests

- Added a couple unit tests for each state in the parser's state
machine

- Added integration tests for parsing a dummy SUMMARY.md then asserting
the result is exactly what we expected

[pr]: https://github.com/azerupi/mdBook/pull/371#issuecomment-312636102
This is a squashed commit. It roughly encompasses the following changes.

---

\# Book

- Created another private submodule, mdbook::loader::book

- This submodule contains the data types representing a Book

- For now the Book just contains a list of BookItems (either chapters or
separators)

- A Chapter contains its name, contents (as one long string), an
optional section number (only numbered chapters have numbers,
obviously), and any nested chapters

- There's a function for loading a single Chapter from disk using it's
associated Link entry from the SUMMARY.md

- Another function builds up the Book by recursively visiting all Links
and separators in the Summary and joining them into a single
Vec<SummaryItem>. This is the only non-dumb-data-type item which is
actually exported from the book module

\# Loader

- Made the loader use the book::load_book_from_disk function for
loading a book in the loader's directory.

\# Tests

- Made sure you can load from disk by writing some files to a temporary
directory

- Made sure the Loader can load the entire example-book from disk and
doesn't crash or hit an error

- Increased test coverage from 34.4% to 47.7% (as reported by cargo
kcov)
@Michael-F-Bryan
Copy link
Contributor Author

Hey @budziq and @azerupi I've finally found time to get back to this PR 🎉

I feel like I'm going to need to make a couple major breaking changes to the way MDBook earlier than I originally intended. A lot of the configuration and initialization seems to be threaded through the type and that makes it really difficult to change anything without breaking the world. For example, the MDBook is currently used somewhat as a builder already to let you alter different bits of the configuration and relies on having an uninitialized state until you call the parse_config() method.

I'll probably need to refactor it to be less coupled to the configuration and rest of the system, and so you can only ever get an instance of MDBook if it's in a valid state ("RAII").

To help get a better idea of how things should look,

  • What is the overall purpose of the MDBook type?
  • Do you want to be able to change the configuration by calling methods on MDBook, or should you instead do all configuration up front then pass the BookConfig in?
  • Can you just pass the underlying Book (the in-memory representation of a book's contents) and BookConfig structs to some Renderer, skipping the need for a MDBook altogether? Then presumably testing is just another Renderer and the mdbook init command is just a function which takes a config.

@azerupi
Copy link
Contributor

azerupi commented Aug 5, 2017

What is the overall purpose of the MDBook type?

When I started, it was supposed to be the "facade" of the library. The type you use to manipulate the book. I am not attached to it per se, so if there is a better way to structure the code it's worth discussing. :)

Do you want to be able to change the configuration by calling methods on MDBook, or should you instead do all configuration up front then pass the BookConfig in?

Most of the time, configuration will be done once. But I would like to keep the ability to change the configuration on the fly. It doesn't mean the MDBook type needs a method for every configuration option though. I can for example imagine a method for returning a struct representing the global config (renderer independent options), another to return one (or multiple) registered renderers holding their own configuration, etc.

Can you just pass the underlying Book (the in-memory representation of a book's contents) and BookConfig structs to some Renderer, skipping the need for a MDBook altogether?

I think the MDBook type is more higher level. It will contain a list of books (multi-language), a configuration, a list of renderers, a list of plugins, etc. When we build, MDBook will figure out what to call and when to call it in the process. Does that make any sense?

I'm open to other ideas, it's just what I have come up with at this time :)

testing is just another Renderer

I like that idea! 👍

@Michael-F-Bryan
Copy link
Contributor Author

When I started, it was supposed to be the "facade" of the library. The type you use to manipulate the book.

I think it would be better in the long run if you can still load a book into memory and render it manually, so MDBook is just a convenience wrapper and not the pseudo-god object it is at the moment. I don't want to go and rewrite the MDBook convenience wrapper though because that'll result in loads of changes and breaking the entire world.

What do you think the best thing for me to do from here is? I've currently got a nicely decoupled way of loading the book from disk, now I need to convert the current system to using that instead of the old book representation. The issue is the current system assumes you can have a partially uninitialized state and be able to change all sorts of configuration things on the fly, while the way I've done things, you load the book from disk and shouldn't be able to change configuration options like the source directory afterwards because that'd result in an in inconsistent state.


Sorry it's taking so long to finish off this PR! I was really enthusiastic about making the internal representation for a book less coupled and "nicer" to start off with, but when I started trying to integrate it with the MDBook type and encountered a lot of friction trying to integrate it in I started to lose motivation. It's kinda demoralizing when a single trivial-looking change can result in 30+ compilation errors spread out over the entire crate from rendering to configuration or even the main mdbook binary :(

@budziq
Copy link
Contributor

budziq commented Aug 16, 2017

@Michael-F-Bryan

Sorry for late replies. It's pretty hectic lately. IMHO It would be best if MDBook was just a dumb iterable datastructure that happens to be renderrable and would not provide any functions for loading or composing or setters by itself. These things would be better left for some builder type (or types if there would be some mutually exclusive build variants) otherwise we'll always get inconsistent state


Sorry it's taking so long to finish off this PR!

Don't sweat it you are doing important work and doing great job at it 👍 There is no rush.

when I started trying to integrate it with the MDBook type and encountered a lot of friction trying to integrate

Yep. Largish refactors like this one tend to be on the painful side. Usually it is easier to refactor and integrate organically along the way instead of transplanting in one go.

It's kinda demoralizing when a single trivial-looking change can result in 30+ compilation errors

On the bright side I've looked on such things as a todo list given for free by the compiler. I always shudder at larger python code rewrites if test coverage is not great (end even if coverage is good it tends to require soaking most of the code to my frontal lobes before we even starte ;) ).

@azerupi
Copy link
Contributor

azerupi commented Aug 16, 2017

It would be best if MDBook was just a dumb iterable datastructure that happens to be renderrable

@budziq, I am not sure I agree with you here. I don't have much time to go into details, but the way I see it is that we indeed need a book struct that is iterable but it is not the MDBook struct.

The reason why I think that is that if we want to support multi-lang books we will need to easily duplicate those iterable book structures without duplicating all the common metadata. So I think we need an overarching structure (which was the MDBook struct) containing the different books, configurations, etc.

If I understand your stream of thoughts correctly, you want to dissociate the configuration from the book data as much as possible? That doesn't sound like a bad idea, but at some point we need something to orchestrate the whole, how do you envision that?

@budziq
Copy link
Contributor

budziq commented Aug 16, 2017

The reason why I think that is that if we want to support multi-lang books we will need to easily duplicate those iterable book structures without duplicating all the common metadata.

Right. I keep forgetting about the planned multilang support as It has been outside of my requirements (for now).

Well I kind of like the idea of Rendition/Variant/Translation (naming things is hard ;)) which would represent a single book (as in single language variant) with all the qualities that I've described previously (simple renderable collection). Then we could turn have a higher level entity spanning multiple variants. Which in turn would not be much else than a container and mapping between summaries, which may or may not be renderable (html one might be but it might make less sense for other renderers).

I'm still unclear if we would actually need any mapping between book translation summaries at all, except for some lints about translations diverging from the original. @azerupi do you have a use-case in mind in regard to the common metadata? Most likely I'm missing something there.

@Michael-F-Bryan
Copy link
Contributor Author

To me, it sounds like there are three general structures at play here. First there's a Book, this would roughly translate to the contents of your src/ directory and the markdown files inside (or a single language, if the book is multilingual), then there's some Config` struct which encapsulates all configuration.

On top of those two there'll be a MDBook which orchestrates the entire rendering process and lets you tie in the configuration with rendering and possibly also how books are loaded from disk if it's multilingual.

It sounds like the MDBook and Book structs may have been confused at some point (naming things is hard!).

Is there a issue/PR for multilingual support I can read through to get up to speed with how we're wanting to do it? From what I've seen of the MDBook struct so far, there isn't (yet) any concrete code for dealing with multiple versions of a book in different languages.

@azerupi
Copy link
Contributor

azerupi commented Aug 17, 2017

@Michael-F-Bryan Yes, that is mostly how I envisioned things :)

Is there a issue/PR for multilingual support I can read through to get up to speed

There isn't anything complete. The best I can give you is issue #146 where I described how I imagined the different structs. The related PR #147 that never got merged, could be a good starting point.

@azerupi
Copy link
Contributor

azerupi commented Aug 17, 2017

I'm still unclear if we would actually need any mapping between book translation summaries at all

I think we should leave that to the renderer. Like you said, only few renderers will need mapping between languages (mainly the HTML one), so it makes sense to let the renderers implement it to suit their needs.

@azerupi
Copy link
Contributor

azerupi commented Aug 20, 2017

@Michael-F-Bryan I propose that we try to integrate some part of this into master already.
The part I am most interested in is the book structures. So I propose that we try to integrate some mixture of the structs in this PR and #147 without all the other refactorings to begin and see where that gets us.

I am pushing this specific part to move one step closer to multi-lingual books. Does that sound ok? If you don't feel like making a new PR I can definitely take care of it. I have some free time today and this week in general :)

@Michael-F-Bryan
Copy link
Contributor Author

Michael-F-Bryan commented Aug 21, 2017

I had a couple hours between classes so I decided to finish this off. I've rolled back to when I finished the Book (before trying to integrate it into MDBook) and redone the integration, massively scaling back the number of changes.

I've made a PR (#409) and all the tests pass (woohoo! 🎉) so it looks like everything should work. @budziq and @azerupi should we start the reviewing process now?

Michael-F-Bryan referenced this pull request in Michael-F-Bryan/mdBook Sep 30, 2017
From the [pull request comment][pr], here's a rough summary of what
was done in the squashed commits.

---

\# Summary Parser

- Added a private submodule called `mdbook::loader::summary` which
contains all the code for parsing `SUMMARY.md`

- A `Summary` contains a title (optional), then some prefix, numbered,
and suffix chapters (technically `Vec<SummaryItem>`)

- A `SummaryItem` is either a `Link` (i.e. link to a chapter), or a
separator

- A `Link` contains the chapter name, its location relative to the
book's `src/` directory, and a list of nested `SummaryItems`

- The `SummaryParser` (a state machine-based parser) uses
`pulldown_cmark` to turn the `SUMMARY.md` string into a stream of
`Events`, it then iterates over those events changing its behaviour
depending on the current state,

  - The states are `Start`, `PrefixChapters`, `NestedChapters(u32)` (the
  `u32` represents your nesting level, because lists can contain lists),
  `SuffixChapters`, and `End`

  - Each state will read the appropriate link and build up the
  `Summary`, skipping any events which aren't a link, horizontal rule
  (separator), or a list

\# Loader

- Created a basic loader which can be used to load the `SUMMARY.md` in
a directory.

\# Tests

- Added a couple unit tests for each state in the parser's state
machine

- Added integration tests for parsing a dummy SUMMARY.md then asserting
the result is exactly what we expected

[pr]: https://github.com/azerupi/mdBook/pull/371#issuecomment-312636102
@Michael-F-Bryan
Copy link
Contributor Author

I'm closing this in favor of #409.

@Michael-F-Bryan Michael-F-Bryan deleted the book-representation branch December 22, 2017 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants