jsalvata edited this page Nov 12, 2011 · 5 revisions

The name "unmarshaller"

"Unmarshaller" is not the perfect word for this kind of object, since it doesn't really process the character sequence (the parser does that), but just the lexemes issued by the parser. Also, some unmarshallers may end up not unmarshalling (i.e. building objects) at all, but doing some other kind of processing.

I chose it because I didn't want to fall back on broadly generic names such as "builder", "handler", and "processor", which are already overloaded enough in most libraries and applications. The remaining alternative would be an invented name(*), but people learning Scala have enough new vocabulary to learn already, and unmarshallers/furrulators will be (should be!) transparent enough to most application programmers that they don't deserve such attention.

Suggestions welcome.

(*) I would propose "furrulator", pronounced with a rolling R, with apologies for leaving non-Catalans out of the joke.

Method naming

The parser will generate code calling methods in the unmarshaller.

The following options have been considered as guidance for the choice of method names:

  • Similarity to the currently existing scala.xml.ValidatingMarkupHandler

  • Some existing standard (e.g. SAX2)

  • Match to the naming used in the Scala Language Specification and the W3C XML specification.

Since one of the aspects of the Typed XML proposal is a mechanism to produce the structure of unmarshallers from the XML grammar (à la RELAX), the choice has been made to use the non-terminal names in the specification.

Current API Reference

The following is a listing of the current method names and parameter signatures. All these methods return a new unmarshaller (which need not itself implement the XMLUnmarshaller trait) on which subsequent calls will be made (see Implementation). Methods with no documentation simply correspond to the non-terminal of the same name in the Scala/XML grammar productions, as per the Scala Language Specification 2.9 and the W3C XML 1.0 Recommendation.

XML Literals

startXmlExpr() Called at the start of an XML literal

endXmlExpr() Called at the end of an XML literal. Returns the desired object (a scala.xml.Node in the backward compatibility implementation)

sTag_name() Corresponds to '<' name -- attributes not included. See proposals below.

eTag()

charData() Called for all character data sequences: the CharData production, the {CharQ | CharRef} and {CharA | CharRef} segments in AttValue, the CData in CDSect, comment text, and the text in the (unspecified) <xml:unparsed> tags.

cdStart()

cdEnd()

pi(target: String, text: String)

entityRef(name: String)

scalaExpr(expression) Implementations will typically overload this method. E.g. the compatibility implementation has overloads for String, Option[Seq[Node]], and Seq[Node] for attribute values.
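
As an illustration of how these calls chain, here is a minimal sketch of what the generated code for the literal <a>hello</a> might look like. EchoUnmarshaller is a hypothetical class written for this example only; it assumes that charData receives the text as a String and it leaves out the attribute calls described in the next section.

```scala
object EchoExample {
  // Hypothetical unmarshaller that renders the calls back to text.
  // Every method returns an unmarshaller on which the next call is made.
  final class EchoUnmarshaller(out: String = "", open: List[String] = Nil) {
    def startXmlExpr(): EchoUnmarshaller = this
    // One method per tag name; "a" is the only tag this sketch knows.
    def sTag_a(): EchoUnmarshaller =
      new EchoUnmarshaller(out + "<a>", "a" :: open)
    // Assumption: the character data is passed as a String.
    def charData(text: String): EchoUnmarshaller =
      new EchoUnmarshaller(out + text, open)
    def eTag(): EchoUnmarshaller =
      new EchoUnmarshaller(out + "</" + open.head + ">", open.tail)
    // Returns the desired object, here just a String.
    def endXmlExpr(): String = out
  }

  // What the parser might generate for <a>hello</a>
  val result: String =
    new EchoUnmarshaller().startXmlExpr().sTag_a().charData("hello").eTag().endXmlExpr()
}
```

Because the tag name is part of the method name, an unmarshaller that has no sTag_b method rejects <b> at compile time.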

Attributes

These are presented separately because they are not stabilized yet.

startAttributes() Called after sTag_name to begin the attributes section.

endAttributes() Called after the last attribute.

startAttribute_name() Corresponds to Name '='

endAttribute() Called at the end of each attribute
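
A sketch of the call sequence for the attributes of <a href="x&amp;y"> may help here. The classes Tag, Attrs and AttrValue are hypothetical and exist only for this example; only the method names come from the listing above, and the String parameters of charData and entityRef are an assumption.

```scala
object AttrExample {
  // Unmarshaller in effect right after sTag_a (hypothetical).
  final class Tag {
    def startAttributes(): Attrs = new Attrs(Map.empty)
  }

  // Collects the attributes seen so far.
  final class Attrs(val attrs: Map[String, String]) {
    def startAttribute_href(): AttrValue = new AttrValue(attrs, "href", "")
    def endAttributes(): Attrs = this
  }

  // Accumulates the value of one attribute.
  final class AttrValue(done: Map[String, String], name: String, value: String) {
    def charData(text: String): AttrValue =
      new AttrValue(done, name, value + text)
    // Assumption: this sketch only resolves the predefined entity "amp";
    // any other reference is kept as written.
    def entityRef(ref: String): AttrValue = {
      val resolved = if (ref == "amp") "&" else "&" + ref + ";"
      new AttrValue(done, name, value + resolved)
    }
    def endAttribute(): Attrs = new Attrs(done + (name -> value))
  }

  // Calls the parser might generate for the attributes of <a href="x&amp;y">
  val attrs: Map[String, String] =
    new Tag().startAttributes()
      .startAttribute_href().charData("x").entityRef("amp").charData("y").endAttribute()
      .endAttributes()
      .attrs
}
```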

XML Patterns

startXmlPattern() Called at the start of an XML pattern

scalaPattern() Called for each embedded Scala pattern other than _* and varid @ _*.

scalaStarPattern() Called for each embedded Scala pattern of the form _* or varid @ _*

endXmlPattern Accessed at the end of an XML pattern. Returns an object with an unapplySeq function that returns a sequence with the values of the embedded Scala patterns (one for each call to scalaPattern or scalaStarPattern preceding the access to endXmlPattern).
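
To make the role of endXmlPattern concrete, here is a minimal sketch for the pattern <a>{x}</a>. For brevity it matches against plain strings rather than XML nodes, which is an assumption of this example and not part of the proposal; TextPattern and Extractor are hypothetical names.

```scala
object PatternExample {
  import java.util.regex.Pattern

  // The object obtained from endXmlPattern: usable as an extractor.
  final class Extractor(regex: Pattern) {
    def unapplySeq(input: String): Option[Seq[String]] = {
      val m = regex.matcher(input)
      // One value per embedded Scala pattern (one regex group each).
      if (m.matches()) Some((1 to m.groupCount()).map(i => m.group(i))) else None
    }
  }

  // Hypothetical pattern unmarshaller that builds a regular expression.
  final class TextPattern(regex: String, open: List[String]) {
    def sTag_a(): TextPattern =
      new TextPattern(regex + Pattern.quote("<a>"), "a" :: open)
    // Each embedded Scala pattern becomes a capturing group.
    def scalaPattern(): TextPattern = new TextPattern(regex + "(.*)", open)
    def eTag(): TextPattern =
      new TextPattern(regex + Pattern.quote("</" + open.head + ">"), open.tail)
    // Accessed without parentheses, as in the listing above.
    def endXmlPattern: Extractor = new Extractor(Pattern.compile(regex))
  }

  def startXmlPattern(): TextPattern = new TextPattern("", Nil)

  // Calls the parser might generate for the pattern <a>{x}</a>
  val A: Extractor = startXmlPattern().sTag_a().scalaPattern().eTag().endXmlPattern

  def content(input: String): Option[String] = input match {
    case A(x) => Some(x)
    case _    => None
  }
}
```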

Proposed improvements

Better grammar for attribute handling

For better correspondence to the grammar and simplicity, sTag_name, startAttributes, and endAttributes should be replaced by:

startSTag_name() corresponding to '<' name

endSTag() corresponding to the '>' at the end of an STag

Attribute maps

Instead of the sequence of calls to startAttribute_name, charData, entityRef, and endAttribute, it is possible to make one single call to:

attribute_name(value: ?)

or possibly

attributeExpr_name(value: ?)

The unmarshaller could overload these to support different kinds of values in different ways.

The responsibility for parsing the content for entities would then be with the unmarshaller.

This can be done without changes in the parser (e.g. if we want to do it in a plug-in) by re-building the attribute value before making the call. Ugly, but it would work.
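
A rough sketch of how an unmarshaller could overload attribute_name for different kinds of values. ImgTag and the width attribute are hypothetical, and the split between a String overload for literal values and an Int overload for embedded expressions is an assumption made for illustration.

```scala
object AttributeMapExample {
  // Hypothetical unmarshaller for the start tag of an <img> element.
  final class ImgTag(val attrs: Map[String, String]) {
    // Literal attribute value, e.g. width="10"
    def attribute_width(value: String): ImgTag =
      new ImgTag(attrs + ("width" -> value))
    // Embedded Scala expression of type Int, e.g. width={5 * 2}
    def attribute_width(value: Int): ImgTag =
      attribute_width(value.toString)
  }

  val fromLiteral: ImgTag = new ImgTag(Map.empty).attribute_width("10")
  val fromExpr: ImgTag = new ImgTag(Map.empty).attribute_width(5 * 2)
}
```

An expression of a type with no matching overload (say, a Boolean) would be rejected by the compiler.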

Attributes as named parameters

The sequence of calls to sTag_name, startAttributes, startAttribute_name, charData, entityRef, scalaExpr, and endAttributes which is currently used to express the attributes of a tag can be replaced by one single call to:

sTag_name(attr1=value1,attr2=value2,...)

This would have many advantages:

  • Resolves the issue of interleaving the sequence of attributes, making it easy to ensure that each attribute is provided at most once and to check for mandatory attributes.

  • Makes writing application-specific unmarshallers significantly easier, especially in cases where the returned type or content grammar depends on the presence or absence of attributes (e.g. in HTML, <script src=""> can't contain data).

This is difficult to implement, and requires other changes in the Scala language, as creating generic unmarshallers such as the current ScalaXMLUnmarshaller would require improving the repeated parameters language feature (SLS section 4.6.2) to support String=>value maps and providing support for those in scala.Dynamic/applyDynamic.
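
For an application-specific unmarshaller, the <script> case mentioned above could be sketched with plain overloading and named arguments, which already work today. All class names are hypothetical, and eTag returns a String here only to keep the example short.

```scala
object NamedAttributesExample {
  // <script> without src: character data is allowed.
  final class ScriptWithBody(rendered: String) {
    def charData(text: String): ScriptWithBody =
      new ScriptWithBody(rendered + text)
    def eTag(): String = rendered + "</script>"
  }

  // <script src="...">: there is no charData method, so any content
  // inside the element fails to compile.
  final class ExternalScript(rendered: String) {
    def eTag(): String = rendered + "</script>"
  }

  // Hypothetical HTML unmarshaller: the return type depends on
  // whether the src attribute is present.
  final class Html {
    def sTag_script(): ScriptWithBody = new ScriptWithBody("<script>")
    def sTag_script(src: String): ExternalScript =
      new ExternalScript("<script src=\"" + src + "\">")
  }

  val inline: String = new Html().sTag_script().charData("f()").eTag()
  val external: String = new Html().sTag_script(src = "app.js").eTag()
}
```

Duplicate attributes are caught for free as well, since Scala rejects a named argument that is given twice.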

Namespace support

The current implementation delegates namespace support to the unmarshallers. This is sufficient for the backward compatibility implementation to provide the same level of namespace support currently available.

It would however be desirable to improve on this to make it possible for unmarshallers to support different syntaxes and return types depending on namespace prefixes.

This can be accomplished by adding the following calls:

startPrefixed_prefix() called before any prefixed tag or attribute

endPrefixed() called just after any such tag or attribute

While this would strongly bind prefixes to namespaces upon creation of the unmarshaller, this is not an important limitation given the current design, which never uses two different unmarshallers in the same literal or pattern. The alternative would be to use the (escaped) URI in the method name, which would require a mechanism to declare namespace bindings in scopes larger than a single XML literal, as we don't want to be re-declaring them in every literal. This discussion is connected to the one in Declaring and using unmarshallers.
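
A sketch of how the proposed calls could let an unmarshaller treat prefixed tags differently. Doc and SvgScope are hypothetical, the svg prefix is hard-wired to the SVG namespace by the author of the unmarshaller (the binding limitation discussed above), and the exact position of endPrefixed() relative to the tag is an assumption.

```scala
object NamespaceExample {
  // Hypothetical unmarshaller that records the names of the tags it sees.
  final class Doc(val names: List[String]) {
    def startPrefixed_svg(): SvgScope = new SvgScope(names)
    // Unprefixed <rect>
    def sTag_rect(): Doc = new Doc(names :+ "rect")
  }

  // Unmarshaller in effect between startPrefixed_svg() and endPrefixed().
  final class SvgScope(names: List[String]) {
    // <svg:rect> is recorded with its namespace URI.
    def sTag_rect(): SvgScope =
      new SvgScope(names :+ "{http://www.w3.org/2000/svg}rect")
    def endPrefixed(): Doc = new Doc(names)
  }

  // Calls the parser might generate for <rect> followed by <svg:rect>
  val result: List[String] =
    new Doc(Nil)
      .sTag_rect()
      .startPrefixed_svg().sTag_rect().endPrefixed()
      .names
}
```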

Going Lexical

One recurrent complaint about Scala's XML support is that some lexical information is lost before it reaches the application. E.g. CData sections, encoded vs. unencoded entities, self-closing tags vs. empty tags, ...

This project offers the opportunity to fix this by making sure that all lexical details reach the unmarshaller. It will then be up to the unmarshaller to choose what to ignore.

This is not difficult to implement, but it obviously rules out implementing the whole project as a compiler plug-in, as it is not possible to insert a plug-in before the parser.
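
The idea that the unmarshaller chooses what to ignore can be sketched with the existing cdStart and cdEnd calls. Both implementations below are hypothetical, and charData is again assumed to take a String; they receive the same calls for <![CDATA[x<y]]> and differ only in whether the CDATA markers survive.

```scala
object LexicalExample {
  trait Text {
    def cdStart(): Text
    def charData(text: String): Text
    def cdEnd(): Text
    def out: String
  }

  // Keeps the CDATA markers exactly as written in the source.
  final class Lossless(val out: String) extends Text {
    def cdStart(): Text = new Lossless(out + "<![CDATA[")
    def charData(text: String): Text = new Lossless(out + text)
    def cdEnd(): Text = new Lossless(out + "]]>")
  }

  // Ignores the markers and escapes the text instead.
  final class Normalizing(val out: String) extends Text {
    def cdStart(): Text = this
    def charData(text: String): Text =
      new Normalizing(out + text.replace("&", "&amp;").replace("<", "&lt;"))
    def cdEnd(): Text = this
  }

  // The same calls are made regardless of the unmarshaller in use.
  def feed(u: Text): String = u.cdStart().charData("x<y").cdEnd().out

  val kept: String = feed(new Lossless(""))
  val dropped: String = feed(new Normalizing(""))
}
```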