Unmarshaller api
"Unmarshaller" is not the perfect word for this kind of object, since it doesn't really process the character sequence (the parser does that), but just the lexemes issued by the parser. Also, some unmarshallers may end up not unmarshalling (i.e. building objects) at all, but doing some other kind of processing.
I chose it because I didn't want to fall into broadly generic names such as "builder", "handler", and "processor", which are already overloaded enough in most libraries and applications. The third alternative would be a fantasy name(*), but people learning Scala have to learn enough new vocab already, and unmarshallers/furrulators will be (should be!) transparent enough to most application programmers that they don't deserve such attention.
Suggestions welcome.
(*) I would propose "furrulator", pronounced with a rolling R, with apologies for leaving non-catalans out of the joke.
The parser will generate code calling methods in the unmarshaller.
The following options have been considered as guidance for the choice of method names:
- Similarity to the currently existing scala.xml.ValidatingMarkupHandler
- Some existing standard (e.g. SAX2)
- Match to the naming used in the Scala Language Specification and the W3C XML specification.
Since one of the aspects of the Typed XML proposal is a mechanism to produce the structure of unmarshallers from the XML grammar (à la RELAX), the choice has been made to use the non-terminal names in the specification.
The current method names and parameter signatures are listed below. Every method returns a new unmarshaller (which need not itself implement the XMLUnmarshaller trait) on which subsequent calls will be made (see Implementation). Methods without documentation simply correspond to the non-terminal of the same name in the Scala/XML grammar productions, as per the Scala Language Specification 2.9 and the W3C XML 1.0 Recommendation.
startXmlExpr() Called at the start of an XML literal
endXmlExpr() Called at the end of an XML literal. Returns the desired object (a scala.xml.Node in the backward compatibility implementation)
sTag_name() Corresponds to '<' name
-- attributes not included. See proposals below.
eTag()
charData() Called for all character data sequences: the CharData production, the {CharQ* | CharRef} and {CharA* | CharRef} segments in AttValue, the CData in CDSect, comment text, and the text in the (unspecified) <xml:unparsed> tags.
cdStart()
cdEnd()
pi(target: String, text: String)
entityRef(name: String)
scalaExpr(expression) Implementations will typically overload this method. E.g. the compatibility implementation has overloads for String, Option[Seq[Node]], and Seq[Node] for attribute values.
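To make the call protocol concrete, here is a minimal sketch of the calls the parser might generate for the literal `<a>hi</a>`, with a toy unmarshaller that renders the calls back to text. `TextUnmarshaller`, its fields, and `LiteralDemo` are hypothetical names. The sketch also assumes that `charData` receives the text as a `String` argument, which the listing above does not spell out.

```scala
// Toy unmarshaller: each call returns the unmarshaller on which the next call is made.
// `TextUnmarshaller`, `out` and `stack` are hypothetical names, not part of the proposal.
final class TextUnmarshaller(out: String = "", stack: List[String] = Nil) {
  def startXmlExpr(): TextUnmarshaller = this
  // sTag_name: the element name is part of the method name, here for <a>.
  def sTag_a(): TextUnmarshaller = new TextUnmarshaller(out + "<a>", "a" :: stack)
  // Assumption: the character data is passed as a String argument.
  def charData(text: String): TextUnmarshaller = new TextUnmarshaller(out + text, stack)
  // Closes the most recently opened element.
  def eTag(): TextUnmarshaller =
    new TextUnmarshaller(out + "</" + stack.head + ">", stack.tail)
  // endXmlExpr returns the desired object, here simply the rendered text.
  def endXmlExpr(): String = out
}

object LiteralDemo {
  // What the parser might emit for the literal <a>hi</a>
  val rendered: String =
    new TextUnmarshaller().startXmlExpr().sTag_a().charData("hi").eTag().endXmlExpr()
}
```

Because every step returns a value of a known type, an unmarshaller can restrict which calls are legal next simply by returning a type that lacks the other methods.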
The attribute-related methods are presented separately because they are not stabilized yet.
startAttributes() Called after sTag_name to begin the attributes section.
endAttributes() Called after the last attribute.
startAttribute_name() Corresponds to Name '='
endAttribute() Called at the end of each attribute
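As an illustration of the attribute protocol, the following sketch records the calls that might be generated for the start tag `<a href="x&amp;y">`. `TraceUnmarshaller` and `AttributeTrace` are hypothetical names. The exact ordering is an assumption read off the descriptions above, and `charData` is again assumed to take the text as a `String`.

```scala
// Hypothetical recorder: it only logs the calls it receives, in order.
final case class TraceUnmarshaller(calls: Vector[String] = Vector.empty) {
  private def log(call: String): TraceUnmarshaller = copy(calls = calls :+ call)

  def sTag_a(): TraceUnmarshaller = log("sTag_a")
  def startAttributes(): TraceUnmarshaller = log("startAttributes")
  def startAttribute_href(): TraceUnmarshaller = log("startAttribute_href")
  def charData(text: String): TraceUnmarshaller = log("charData(" + text + ")")
  def entityRef(name: String): TraceUnmarshaller = log("entityRef(" + name + ")")
  def endAttribute(): TraceUnmarshaller = log("endAttribute")
  def endAttributes(): TraceUnmarshaller = log("endAttributes")
}

object AttributeTrace {
  // Calls assumed for the start tag <a href="x&amp;y">
  val calls: Vector[String] =
    TraceUnmarshaller()
      .sTag_a()
      .startAttributes()
      .startAttribute_href().charData("x").entityRef("amp").charData("y").endAttribute()
      .endAttributes()
      .calls
}
```

Note that the attribute value arrives in three pieces (text, entity reference, text), which is the interleaving that the proposals below try to simplify.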
startXmlPattern() Called at the start of an XML pattern
scalaPattern() Called for each embedded Scala pattern other than _* and varid @ _*.
scalaStarPattern() Called for each embedded Scala pattern of the form _* or varid @ _*
endXmlPattern Accessed at the end of an XML pattern: an object with an unapplySeq function returning a sequence with the values of the embedded Scala patterns (one for each call to scalaPattern or scalaStarPattern preceding the access to endXmlPattern).
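The pattern protocol can be sketched as follows for a pattern of the shape `<a>{x}{y}</a>`. Everything here is hypothetical: the `Elem` tree type stands in for whatever values are being matched, and `ElemExtractor`, `PatternUnmarshaller`, and `PatternDemo` are illustrative names. The only parts taken from the description above are the method names and the fact that `endXmlPattern` yields an object with an `unapplySeq` returning one value per embedded Scala pattern.

```scala
// Hypothetical minimal tree type standing in for the matched values.
final case class Elem(name: String, children: List[Any])

// Extractor obtained from endXmlPattern: yields one value per embedded Scala pattern.
final class ElemExtractor(name: String, holes: Int) {
  def unapplySeq(value: Any): Option[Seq[Any]] = value match {
    case Elem(n, children) if n == name && children.length == holes => Some(children)
    case _ => None
  }
}

// Hypothetical pattern unmarshaller: counts the embedded patterns it is told about.
final class PatternUnmarshaller(name: String = "", holes: Int = 0) {
  def startXmlPattern(): PatternUnmarshaller = this
  def sTag_a(): PatternUnmarshaller = new PatternUnmarshaller("a", holes)
  def scalaPattern(): PatternUnmarshaller = new PatternUnmarshaller(name, holes + 1)
  def eTag(): PatternUnmarshaller = this
  // Accessed without parentheses, as in the listing above.
  def endXmlPattern: ElemExtractor = new ElemExtractor(name, holes)
}

object PatternDemo {
  // Roughly what `case <a>{x}{y}</a> => ...` might expand to.
  val P: ElemExtractor =
    new PatternUnmarshaller()
      .startXmlPattern().sTag_a().scalaPattern().scalaPattern().eTag()
      .endXmlPattern

  def describe(value: Any): String = value match {
    case P(x, y) => "matched " + x + " and " + y
    case _       => "no match"
  }
}
```

The embedded patterns `x` and `y` are then matched by the compiler against the elements of the returned sequence, so the unmarshaller never needs to see them.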
For better correspondence to the grammar and simplicity, sTag_name, startAttributes, and endAttributes should be replaced by:
startSTag_name() corresponding to '<' name
endSTag() corresponding to the '>' at the end of an STag
Instead of the sequence of calls to startAttribute_name, charData, entityRef, and endAttribute, it is possible to make one single call to:
attribute_name(value: ?)
or possibly
attributeExpr_name(value: ?)
The unmarshaller could overload these to support different kinds of values in different ways.
The responsibility for parsing the content for entities would then be with the unmarshaller.
This can be done without changes in the parser (e.g. if we want to do it in a plug-in) by re-building the attribute value before making the call. Ugly, but it would work.
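A sketch of what such an overloaded single-call attribute method could look like for an `href` attribute. `AttrUnmarshaller` is a hypothetical name, and the choice of overloads is an assumption: `String` for a literal value rebuilt by the parser, `Option[String]` for an embedded Scala expression. Entity decoding is limited to the five predefined XML entities to keep the example short.

```scala
// Hypothetical unmarshaller that collects attributes into a map.
final case class AttrUnmarshaller(attrs: Map[String, String] = Map.empty) {
  // Entity handling now lives in the unmarshaller (predefined entities only).
  private def decode(raw: String): String =
    raw.replace("&lt;", "<").replace("&gt;", ">")
      .replace("&quot;", "\"").replace("&apos;", "'")
      .replace("&amp;", "&") // last, so "&amp;lt;" decodes to "&lt;" and not "<"

  // Literal attribute value, passed as written in the source.
  def attribute_href(value: String): AttrUnmarshaller =
    copy(attrs = attrs + ("href" -> decode(value)))

  // Embedded Scala expression of type Option[String]: None omits the attribute.
  def attribute_href(value: Option[String]): AttrUnmarshaller = value match {
    case Some(v) => copy(attrs = attrs + ("href" -> v))
    case None    => this
  }
}
```

With this shape, `<a href="x&amp;y">` would become a single `attribute_href("x&amp;y")` call, and the compiler picks the overload from the static type of the value.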
The sequence of calls to sTag_name, startAttributes, startAttribute_name, charData, entityRef, scalaExpr, and endAttributes, which is currently used to express the attributes of a tag, can be replaced by one single call to:
sTag_name(attr1=value1,attr2=value2,...)
This would have many advantages:
- Resolves the issue of interleaving the sequence of attributes, making it easy to ensure that each attribute is provided at most once and to check for mandatory attributes.
- Makes writing application-specific unmarshallers significantly easier, especially in cases where the returned type or content grammar depends on the presence or absence of attributes (e.g. in HTML, <script src=""> can't contain data).
However, this is difficult to implement and requires other changes to the Scala language: creating generic unmarshallers such as the current ScalaXMLUnmarshaller would require extending the repeated parameters language feature (SLS section 4.6.2) to support String=>value maps, and providing support for those in scala.Dynamic/applyDynamic.
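For an application-specific unmarshaller, as opposed to the generic case, ordinary overloading and named arguments already give a flavour of the benefit. The sketch below uses the HTML `script` example: the type returned by `sTag_script` depends on whether `src` is given, so character data is only accepted when it is absent. All type names are hypothetical, and `eTag()` returns the result directly to keep the sketch short.

```scala
// Hypothetical result types: a script either references a source or carries inline code.
sealed trait Script
final case class ExternalScript(src: String) extends Script
final case class InlineScript(code: String) extends Script

// Returned by sTag_script(src = ...): offers no charData, only eTag().
final class ExternalScriptBody(src: String) {
  def eTag(): ExternalScript = ExternalScript(src)
}

// Returned by sTag_script() without src: character data is accepted.
final class InlineScriptBody(code: String = "") {
  def charData(text: String): InlineScriptBody = new InlineScriptBody(code + text)
  def eTag(): InlineScript = InlineScript(code)
}

final class HtmlUnmarshaller {
  def sTag_script(src: String): ExternalScriptBody = new ExternalScriptBody(src)
  def sTag_script(): InlineScriptBody = new InlineScriptBody()
}

object ScriptDemo {
  // Assumed expansion of <script src="app.js"></script>
  val external: Script = new HtmlUnmarshaller().sTag_script(src = "app.js").eTag()
  // Assumed expansion of <script>run()</script>
  val inline: Script = new HtmlUnmarshaller().sTag_script().charData("run()").eTag()
}
```

Calling `charData` after `sTag_script(src = "app.js")` would be rejected at compile time, and a named argument cannot be supplied twice, which covers the "at most once" check.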
The current implementation delegates namespace support to the unmarshallers. This is sufficient for the backward compatibility implementation to provide the same level of namespace support currently available.
It would however be desirable to improve on this to make it possible for unmarshallers to support different syntaxes and return types depending on namespace prefixes.
This can be accomplished by adding the following calls:
startPrefixed_prefix() called before any prefixed tag or attribute
endPrefixed() called just after any such tag or attribute
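A sketch of how these two calls could let a prefix select a different set of legal element names. `DocUnmarshaller`, `SvgUnmarshaller`, and `PrefixDemo` are hypothetical names. The placement of `endPrefixed()` right after the prefixed start tag is one reading of "just after any such tag", not something the description settles.

```scala
// Hypothetical unmarshaller for the unprefixed vocabulary; records the names it sees.
final case class DocUnmarshaller(seen: Vector[String] = Vector.empty) {
  def sTag_p(): DocUnmarshaller = copy(seen = seen :+ "p")
  // Entering the "svg" prefix switches to a type that offers only SVG element names.
  def startPrefixed_svg(): SvgUnmarshaller = new SvgUnmarshaller(this)
}

// Hypothetical unmarshaller active only while inside the "svg" prefix.
final class SvgUnmarshaller(outer: DocUnmarshaller) {
  def sTag_circle(): SvgUnmarshaller =
    new SvgUnmarshaller(outer.copy(seen = outer.seen :+ "svg:circle"))
  // Leaving the prefixed tag hands control back to the enclosing unmarshaller.
  def endPrefixed(): DocUnmarshaller = outer
}

object PrefixDemo {
  // Calls assumed for <p> followed by <svg:circle>
  val seen: Vector[String] =
    DocUnmarshaller().sTag_p().startPrefixed_svg().sTag_circle().endPrefixed().seen
}
```

Here `sTag_circle` is only available between `startPrefixed_svg()` and `endPrefixed()`, so an unprefixed `<circle>` would not compile against `DocUnmarshaller`.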
While this would strongly bind prefixes to namespaces upon creation of the unmarshaller, this is not an important limitation given the current design, which never uses two different unmarshallers in the same literal or pattern. The alternative would be to use the (escaped) uri in the method name, which would require a mechanism to declare namespace bindings in scopes larger than a single XML literal, as we don't want to be re-declaring them in every literal. This discussion is connected to the one in Declaring and using unmarshallers.
One recurrent complaint about Scala's XML support is that some lexical information is lost before it reaches the application. E.g. CData sections, encoded vs. unencoded entities, self-closing tags vs. empty tags, ...
This project offers the opportunity to fix this by making sure that all lexical details reach the unmarshaller. It will then be up to the unmarshaller to choose what to ignore.
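As a small illustration of that division of labour, the two hypothetical unmarshallers below receive the same `cdStart`/`charData`/`cdEnd` calls for a CDATA section. One keeps the lexical markers and the other ignores them. The names `KeepCData` and `DropCData` are illustrative only, and `charData` is assumed to take the text as a `String`.

```scala
// Preserves the CDATA section markers in its output.
final case class KeepCData(out: String = "") {
  def cdStart(): KeepCData = copy(out = out + "<![CDATA[")
  def charData(text: String): KeepCData = copy(out = out + text)
  def cdEnd(): KeepCData = copy(out = out + "]]>")
}

// Receives the same calls but chooses to ignore the markers.
final case class DropCData(out: String = "") {
  def cdStart(): DropCData = this
  def charData(text: String): DropCData = copy(out = out + text)
  def cdEnd(): DropCData = this
}

object CDataDemo {
  // Calls assumed for the section <![CDATA[a<b]]>
  val kept: String = KeepCData().cdStart().charData("a<b").cdEnd().out
  val dropped: String = DropCData().cdStart().charData("a<b").cdEnd().out
}
```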
This is not difficult to implement, but it obviously rules out implementing the whole project as a compiler plug-in, as it is not possible to insert a plug-in before the parser.