
An Overview of Servo's Layout Pipeline as of Fall 2013

This page is intended, along with comments in the code, to serve as a guide to the current incarnation of Servo's layout algorithm. In particular, the focus is on the sequence of steps that produce a display list from a DOM that has already been matched against CSS selectors. The current approach is to run a traversal over the DOM to construct a flow tree, an intermediate data structure that sits part-way in between the DOM and the display list (more information on flow tree concerns can be found here). A sequence of tree traversals is then performed on the flow tree, the last of which creates the display list.

The Traversal Abstraction

Traversals consist of two components:

  • A driver function that controls the order in which each node of the tree is visited
  • A visitor or kernel function that determines what happens when a node is visited. The key constraint is that the visitor function can only "see" the subtree rooted at the current node being visited. In fact, most visitors only set and read properties on immediate children.

The drivers are currently implemented in flow.rs, and the visitor logic is split between block.rs, float.rs, and inline.rs (this will probably change with parallel layout, see parallelism section below).
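
To make the split concrete, here is a minimal sketch in Rust, assuming a toy flow tree in which each flow owns its children directly; the names and the width rule are illustrative and do not correspond one-to-one with Servo's flow.rs. The driver alone decides the visiting order, while the visitor only ever touches the node and its immediate children.

```rust
// Simplified flow tree: each flow owns its children; no parent pointers.
struct Flow {
    children: Vec<Flow>,
    min_width: f32,
    pref_width: f32,
}

// Visitor ("kernel"): only looks at the node and its immediate children.
// A stand-in for the bubble-widths computation of a vertically stacking block.
fn bubble_widths_visit(flow: &mut Flow) {
    let (mut min, mut pref) = (flow.min_width, flow.pref_width);
    for child in &flow.children {
        min = min.max(child.min_width);
        pref = pref.max(child.pref_width);
    }
    flow.min_width = min;
    flow.pref_width = pref;
}

// Driver: fixes the visiting order (here bottom-up, i.e. post-order),
// independently of what the visitor does.
fn traverse_postorder(flow: &mut Flow, visit: fn(&mut Flow)) {
    for child in &mut flow.children {
        traverse_postorder(child, visit);
    }
    visit(flow);
}

fn main() {
    let leaf = Flow { children: vec![], min_width: 40.0, pref_width: 120.0 };
    let mut root = Flow { children: vec![leaf], min_width: 0.0, pref_width: 0.0 };
    traverse_postorder(&mut root, bubble_widths_visit);
    assert_eq!(root.pref_width, 120.0);
}
```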

Over the flow tree, we perform three main traversals that have distinct purposes:

  1. Bubble widths. This computes the minimum and preferred widths of each element, as used in CSS 2.1 by floats, tables, and inline-block elements. The exact computation is unspecified, and the current implementation probably does not cover all cases effectively, but it works well enough for basic tests. This is a bottom-up or post-order traversal, and we also do some other work here, such as setting up flags used when positioning floats.

  2. Assign widths. This traversal is responsible for computing the width of every element in the flow tree. Since the width of a flow generally depends on its parent's width, this operation is top-down or pre-order. For floats, we use the previously computed minimum and preferred widths to find the shrink-to-fit width (see the sketch after this list), and for blocks, the width depends on the values of the "width", "margin-left", and "margin-right" properties (see here for details).

  3. Assign height. In addition to computing heights for everything, this traversal lays out all text elements and floats. The structure of this traversal is somewhat stranger than the others due to parallelism concerns. Floats and text can generally be laid out using an in-order traversal of the tree, and often this is necessary because a float flow can extend beyond its parent flow. However, in-order traversals are inherently sequential and can't be parallelized, so we try to avoid them whenever possible. If an in-order traversal is not necessary for a certain subtree (as determined by the is_inorder flag), we traverse that subtree bottom-up.

  4. Construct display list. The final traversal computes the absolute position of everything in the flow tree and adds it to the display list if necessary. Computing the absolute position requires a top-down traversal, and currently this means the display list is more or less a pre-order representation of the flow tree. This will need to change to properly implement CSS 2.1 stacking contexts. One option may be to have multiple "add" points to the display list.
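
For reference, the shrink-to-fit rule mentioned in step 2 is simple to state on its own. The sketch below is a direct transcription of the CSS 2.1 formula (the smaller of the preferred width and the larger of the preferred minimum width and the available width), not Servo's actual code:

```rust
// shrink-to-fit = min(max(preferred minimum width, available width), preferred width)
fn shrink_to_fit(min_width: f32, pref_width: f32, available_width: f32) -> f32 {
    pref_width.min(min_width.max(available_width))
}

fn main() {
    // A float whose content wants 300px but has only 200px available is laid
    // out at 200px, and never narrower than its 80px minimum.
    assert_eq!(shrink_to_fit(80.0, 300.0, 200.0), 200.0);
    assert_eq!(shrink_to_fit(80.0, 300.0, 50.0), 80.0);
}
```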

Float Layout in Servo

As hinted at before, floats are computed on an in-order traversal of the flow tree (or possibly a subtree of the overall flow tree). The way this works in practice is that each flow stores a floats_in and a floats_out property, each of which is actually a FloatContext data structure. floats_in should have been set by the parent prior to this node being visited, and it is the current node's responsibility to use this to compute floats_out for possible later use by the parent node. For example, float flows simply add themselves to floats_in: `floats_out = floats_in.add_float(...);` (code)

The add_float method here takes into account all previous floats in floats_in, and ensures that the new float won't collide with previous ones or otherwise break the rules.

One peculiar thing about floats is that they have to exist in a number of different coordinate systems. Normally, flows only have to compute positions relative to their own frame of reference, but because floats can influence layout in other flows, we must perform a translation to ensure all flows agree on where the float is. This is done using the translate method, which conceptually moves all floats in the context by a certain vector.
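
The following toy sketch illustrates how floats_in and floats_out are threaded through a flow and what translate conceptually does. The FloatContext here is a stand-in that only stacks floats vertically; it is not Servo's real data structure, and the real add_float performs proper collision avoidance.

```rust
#[derive(Clone)]
struct Rect { x: f32, y: f32, w: f32, h: f32 }

#[derive(Clone, Default)]
struct FloatContext { floats: Vec<Rect> }

impl FloatContext {
    // Place a new float below any existing ones (a stand-in for the real
    // collision-avoidance logic) and return the updated context.
    fn add_float(mut self, w: f32, h: f32) -> FloatContext {
        let y = self.floats.iter().map(|f| f.y + f.h).fold(0.0, f32::max);
        self.floats.push(Rect { x: 0.0, y, w, h });
        self
    }

    // Shift every float by a vector, re-expressing the context in a
    // different flow's coordinate system.
    fn translate(mut self, dx: f32, dy: f32) -> FloatContext {
        for f in &mut self.floats {
            f.x += dx;
            f.y += dy;
        }
        self
    }
}

fn main() {
    let floats_in = FloatContext::default();
    // A float flow consumes floats_in and produces floats_out for its parent.
    let floats_out = floats_in.add_float(100.0, 50.0);
    // The parent translates the result into its own frame of reference.
    let in_parent = floats_out.translate(0.0, 20.0);
    assert_eq!(in_parent.floats[0].y, 70.0);
}
```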

As may be clear from the code snippet above, if used properly a FloatContext can be represented using a linearly-typed pointer. The current implementation, however, does not enforce this with Rust's unique pointers and instead provides a run-time guarantee of single ownership. This allows for flexibility if we want to store an intermediate FloatContext, which may be necessary for incremental reflow.

Line Breaking

Line breaking in Servo is much simplified by the fact that everything that must go on a line is collapsed into a list of boxes that hang off of an inline flow. However, there are still some tricky cases when floats are present.

One point of contention is that some boxes (such as text runs) are splittable, meaning we lazily decide whether to split them into two or more boxes in order to lay them out across lines. This significantly complicates the code, and may inhibit data structure optimizations further down the road. The most obvious alternative would be to create a new box for every unsplittable piece of text (i.e. every word), but this could incur significant allocation overhead that would dwarf any potential speedups.

In any case, the main line-breaking routine, found here, is essentially a reduction over the list of boxes. (Note that this assumes it is possible to do line layout in one pass; Servo currently doesn't handle vertical-align well enough for a lot of these issues to crop up.) At each step of the reduction we keep track of the "green zone": the region that the current line can expand into without running into floats or the wall of the containing block. If the line overflows this region, we can either try to move the line or start a new line. Both appear to be valid layouts according to CSS 2.1, and there doesn't seem to be good consensus among existing browsers for when to do one or the other. Servo will try to move the line in cases where we don't need some kind of "lookahead" to guarantee the line will fit at the new location. Otherwise, we just split the line and continue on.
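
As a rough illustration of the reduction, here is a sketch that handles only the simplest case: fixed-width, unsplittable boxes and no floats, so the green zone is just the width of the containing block and an overflowing box always starts a new line. It is not the actual routine, which also deals with splittable boxes and the move-vs-split decision.

```rust
struct LineBox { width: f32 }

struct Line { boxes: Vec<LineBox>, used: f32 }

fn break_lines(boxes: Vec<LineBox>, green_zone_width: f32) -> Vec<Line> {
    let mut lines = vec![Line { boxes: vec![], used: 0.0 }];
    for b in boxes {
        let overflows = {
            let line = lines.last().unwrap();
            !line.boxes.is_empty() && line.used + b.width > green_zone_width
        };
        if overflows {
            // The box would overflow the green zone: start a new line rather
            // than trying to move the current one (the "no lookahead" case).
            lines.push(Line { boxes: vec![], used: 0.0 });
        }
        let line = lines.last_mut().unwrap();
        line.used += b.width;
        line.boxes.push(b);
    }
    lines
}

fn main() {
    let boxes = vec![LineBox { width: 60.0 }, LineBox { width: 60.0 }, LineBox { width: 60.0 }];
    let lines = break_lines(boxes, 100.0);
    assert_eq!(lines.len(), 3); // each 60px box overflows a 100px line that already holds one
}
```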

One thing that Servo does not currently support which may be tricky to get right is the case where a float is a child of an inline, and hence has an "anchor point" in the middle of some text. In this case, if the float can fit on the same line as the anchor point, then it must be placed before any further line layout continues. Otherwise, we must delay placing the float until the end of the line, to ensure the float is placed below the lowest element in the line containing the anchor point.

Parallelism

One advantage of Servo's current design is that it makes it possible for the entirety of layout -- from DOM to display list -- to be parallelized. Parallel selector matching is sort of a beast unto itself, but the rest of layout consists of tree traversals. The parallel model for tree traversals is essentially fork-join parallelism, where top-down traversals correspond to forks, and bottom-up traversals to joins.

However, a naive implementation requires switching between stacks, which will be slow even with the most efficient multithreaded run-time. Instead, we use a work-stealing approach in which we maintain a queue of nodes to be visited. To traverse a tree top-down, we pull a node off the queue, visit it, and then add all of its children to the queue. To traverse bottom-up, after we visit a node we must check whether all of its siblings have been visited, and if so, add the node's parent to the queue. Any number of threads can perform these operations at once, and we can use a Chase-Lev deque to avoid synchronization overhead.
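
The bookkeeping for the bottom-up case can be sketched sequentially as follows, assuming the flow tree lives in a flat arena indexed by usize. A real implementation would replace the plain counter and VecDeque with atomic counters and a Chase-Lev deque shared between threads; the names here are illustrative.

```rust
use std::collections::VecDeque;

struct Node {
    parent: Option<usize>,
    children_remaining: usize, // how many children are still unvisited
}

fn traverse_bottom_up(nodes: &mut Vec<Node>, leaves: Vec<usize>, mut visit: impl FnMut(usize)) {
    let mut queue: VecDeque<usize> = leaves.into();
    while let Some(i) = queue.pop_front() {
        visit(i);
        if let Some(p) = nodes[i].parent {
            // The last child to finish is the one that enqueues the parent.
            nodes[p].children_remaining -= 1;
            if nodes[p].children_remaining == 0 {
                queue.push_back(p);
            }
        }
    }
}

fn main() {
    // Root (index 0) with two leaf children (indices 1 and 2).
    let mut nodes = vec![
        Node { parent: None, children_remaining: 2 },
        Node { parent: Some(0), children_remaining: 0 },
        Node { parent: Some(0), children_remaining: 0 },
    ];
    let mut order = vec![];
    traverse_bottom_up(&mut nodes, vec![1, 2], |i| order.push(i));
    assert_eq!(order, vec![1, 2, 0]); // the root is visited only after both children
}
```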

This approach also allows for some more exotic optimizations, such as "tree tiling", where multiple nodes are placed into a single block called a tile. Each tile is an atomic unit from a work-stealing perspective, and could possibly be made to fit in the L1 cache using techniques like pointer compression. Another cache optimization is to have the work-stealing be fixed across traversals. That is, the same thread accesses the same nodes on every traversal to avoid cache misses. It might also be worth investigating integrating SIMD parallelism with the work-stealing algorithm.

As of September 2013, parallel tree traversals have not been merged into the main branch, and development is going on here.

Safety

This type of work-stealing parallelism presents a problem for Rust and Servo because it breaks the abstractions that generally keep Rust code safe. A common theme in Rust is that unsafe code should present a safe interface through which it can be invoked. With tree traversals, the driver function and the tree data structure present such an interface to the visitor functions. In other words, it should be impossible to write a visitor function that type-checks yet introduces a data race or memory-safety bug.

The key to this is that visitor functions can only access the subtree rooted at a given node. This means that as long as our work-stealing algorithms don't visit a node at the same time as any of its descendants, there will be no data races. Memory-safety can be enforced using linear types, which e.g. prevent a visitor from handing off a node pointer to a separate thread.

The invariant that visitors only "see" their descendants is enforced by having each node store unique pointers to its children, and no sibling or parent pointers. However, this is problematic for bottom-up traversals, since we need to add a node's parent to the work queue with access to only the node itself. One way of describing it is that prior to being visited, the parent node is in a weird state of "collective ownership" where all the children together "own" the parent. After the last child is visited, the situation flips: the parent once again owns its children and is itself owned by the work queue. This is currently implemented using a number of unsafe pointers and manual synchronization, but it would be interesting to see if this pattern pops up elsewhere.
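
A minimal single-threaded sketch of that ownership invariant, assuming nodes own their children through Box and store no back-pointers, so that handing a visitor a `&mut Flow` gives it exclusive access to exactly that subtree:

```rust
struct Flow {
    children: Vec<Box<Flow>>, // unique ownership, no parent or sibling pointers
}

fn visit(subtree: &mut Flow) {
    // While this &mut reference is live, the borrow checker guarantees no
    // other code can touch `subtree` or anything reachable from it.
    for child in &mut subtree.children {
        visit(child); // only descendants are reachable from here
    }
}

fn main() {
    let mut root = Flow {
        children: vec![Box::new(Flow { children: vec![] })],
    };
    visit(&mut root);
}
```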

Future Considerations

There are a couple of things not listed above that might prove problematic down the road:

  • Percent heights are strange. Without them, height is a relatively straightforward bottom-up computation: we sum up the height of the children, and then add borders/margin/padding. But if one of the children has a height that is a percent of its parent's, there can be circular dependencies. CSS 2.1 doesn't define how these cycles are broken, and the most logical thing seems to be to solve all possible percent heights before computing content heights, and then do something sensible when there are conflicts.

However, this is not really possible when absolutely positioned elements are present, since a percent height on a positioned element must be computed as a fraction of its containing block, which may depend on its children's heights. This means we must lay out the entire containing block before we lay out any part of the positioned element, which is essentially an in-order dependency. The rules appear to get even more complicated when tables are involved, though we may be able to get around this since tables are not standardized.

  • Display list construction doesn't mix well with parallelism. The most natural way to build a display list is to compute the size and offset of each node. We could mix this computation into already existing layout traversals. The entire display list can then be allocated at once and each node can place display items at its computed offset. The advantage is that we can still perform parallel tree traversals without having to lock on the display list. However, this would be horribly unsafe since the only thing preventing data races is the assumption that the offset and size computations are correct. Another option would be to have each node compute a display list for its subtree. This would be safer and possibly easier to incrementalize, but we would have to perform some kind of "merge" that combines the display lists of all children, and this would probably be slow. It seems like there might be a tradeoff here for speed of the initial page load vs. speed of incremental reflow, and it might be interesting to explore which is more important.