
ISO-XML


This page gives a few hints on how the TextTransformer XML parser is derived from the standard specification of XML. Readers who are not interested in these details can continue to the next page.

The XML standard is described in detail at

http://www.xml.com/axml/testaxml.htm

XML is specified in an Extended Backus-Naur Form (EBNF) notation, which is itself standardized (see: http://www.cl.cam.ac.uk/~mgk25/iso-ebnf.html).

The standardized EBNF notation (ISO-EBNF) is fortunately similar to that of TextTransformer, but it was not designed for practical use. First, ISO-EBNF is a very elementary description that makes no distinction between tokens and productions. Second, it does not take into account whether the grammar can be recognized deterministically; in particular, the grammar does not satisfy the LL(1) condition.

Three steps are necessary to transform the XML grammar:

1. An import project (quick and dirty), similar to the project for the cocor import, imports the ISO-EBNF-XML rules as TextTransformer productions.

2. All productions that only describe character sets are transformed into tokens (see remarks below).

3. LL(1) conflicts are resolved, in a way similar to that described for the parser of email addresses.

A further problem is that XML documents in principle support Unicode, which TextTransformer does not yet support. (The option to create parser code based on wide characters is in progress.) However, the first 128 characters of ASCII and of UTF-8 encoded Unicode are identical, so the Tetra XML parser will read most XML documents despite its simplified token definitions.
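The ASCII/UTF-8 overlap mentioned above can be checked with a few lines of Python. This is only an illustrative sketch and not part of TextTransformer; the variable names are invented for the example.

```python
# The 128 ASCII characters have the same one-byte encoding in ASCII and UTF-8.
ascii_chars = [chr(cp) for cp in range(128)]
same_encoding = all(ch.encode("ascii") == ch.encode("utf-8") for ch in ascii_chars)

# A non-ASCII character becomes a multi-byte UTF-8 sequence in which every
# byte is >= 0x80, so it can never be mistaken for '<', '&' or other markup.
e_acute = "\u00e9".encode("utf-8")
high_bytes_only = all(b >= 0x80 for b in e_acute)

print(same_encoding, e_acute, high_bytes_only)
```

Because all XML markup characters lie in the ASCII range, a byte-oriented parser can pass the high bytes through as ordinary character data.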

Some further remarks concerning the transformation of ISO-EBNF:

ISO-EBNF has an operator without a counterpart in the Tetra syntax:

A - B matches any string that matches A but does not match B

Translating this operator is simple if A and B are characters or character sets: A - B can then be combined into a single set of characters.
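As a rough illustration of the character-set case, the following Python sketch computes A - B as a set difference and turns the result into one bracket expression. The sets A and B are hypothetical and chosen only for the example; they do not come from the XML grammar.

```python
import re
import string

# Hypothetical sets for illustration: A is all ASCII letters, B is the vowels.
A = set(string.ascii_letters)
B = set("aeiouAEIOU")
difference = A - B

# Build a single bracket expression that is equivalent to "A - B".
char_class = "[" + "".join(re.escape(c) for c in sorted(difference)) + "]"
pattern = re.compile(char_class + "+")

print(bool(pattern.fullmatch("xyz")))  # only characters from A - B
print(bool(pattern.fullmatch("xay")))  # contains 'a', which belongs to B
```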

If A and B are sequences of characters, there are two cases: either B is a permitted alternative to A - B at that position, or an occurrence of B in the input is an error.

Example:

ISO-EBNF: CData ::= (Char* - (Char* ']]>' Char*))

CDSect ::= CDStart CData CDEnd

CDEnd ::= ']]>'

Tetra: CData ::= ( Char )*

CDSect ::= CDStart CData CDEnd

CDEnd ::= ']]>'
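The CData case, where ']]>' is the permitted alternative that ends the repetition, can be sketched in Python as follows. This is not how Tetra implements it; the function name parse_cdsect is hypothetical, and the value of CDStart is taken from the XML specification rather than from this page.

```python
def parse_cdsect(text, pos=0):
    """Return (cdata, next_pos) for a CDATA section that starts at pos."""
    cdstart, cdend = "<![CDATA[", "]]>"
    if not text.startswith(cdstart, pos):
        raise ValueError("CDStart expected")
    pos += len(cdstart)
    start = pos
    # CData ::= ( Char )* : consume characters until CDEnd begins here.
    while pos < len(text) and not text.startswith(cdend, pos):
        pos += 1
    if pos >= len(text):
        raise ValueError("CDEnd expected")
    return text[start:pos], pos + len(cdend)


content, next_pos = parse_cdsect("<![CDATA[a<b]]>rest")
print(content, next_pos)
```

The subtraction in the ISO-EBNF rule is not needed here, because the loop stops at the first ']]>' and the content can therefore never contain it.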

ISO-EBNF: PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

Tetra: Name | XML EXIT

Tetra: XML ::= [Xx][Mm][Ll]
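The PITarget case, where an occurrence of B is an error, might look like this in Python. The Name pattern is a simplified, ASCII-only assumption, and parse_pi_target is a hypothetical helper; raising an exception stands in for the EXIT action.

```python
import re

# Simplified, ASCII-only approximation of the XML Name token (an assumption).
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")
XML = re.compile(r"[Xx][Mm][Ll]")


def parse_pi_target(text):
    """Return the processing instruction target at the start of text."""
    match = NAME.match(text)
    if match is None:
        raise ValueError("Name expected")
    name = match.group()
    # A target that is exactly 'xml' in any letter case is reserved.
    if XML.fullmatch(name):
        raise ValueError("reserved target: " + name)
    return name


print(parse_pi_target("xml-stylesheet href='style.css'"))
```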

ISO-EBNF frequently defines character sets as sequences of alternative characters. As far as possible, these should be combined into a single set using '[' and ']'. This speeds up the scanning of a text considerably.

Example:

ISO-EBNF: S ::= (#x20 | #x9 | #xD | #xA)+

Tetra: S ::= [ \t\r\n]+

The character set S in the example just given can even be deleted from the project. S is simply the set of ignorable characters. ISO-EBNF has no notion of ignorable characters as such; each position in the grammar where S can or must occur is specified explicitly. As a result, the grammar becomes quite cluttered and, in addition, many LL(1) conflicts arise.

Example:

XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

EncodingDecl ::= S 'encoding' ...

SDDecl ::= S 'standalone' Eq ...

After VersionInfo has been recognized, there are three ways to continue with S. If, however, S is defined as ignorable, the rule conforms to LL(1):

'encoding' | 'standalone' | '?>'

follows directly after VersionInfo.
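The effect of ignorable whitespace on the XMLDecl rule can be sketched in Python: the tokenizer skips whitespace, and one token of lookahead then selects the branch. The token pattern and the function names are assumptions made for this example, not generated Tetra code.

```python
import re

# One token, with ignorable whitespace skipped before and after it.
TOKEN = re.compile(r"""\s*(<\?xml|\?>|=|[A-Za-z]+|"[^"]*"|'[^']*')\s*""")


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_xml_decl(text):
    tokens = tokenize(text)
    index = 0

    def peek():
        return tokens[index] if index < len(tokens) else None

    def expect(value):
        nonlocal index
        if peek() != value:
            raise ValueError("%r expected" % value)
        index += 1

    def attribute(name):
        nonlocal index
        expect(name)
        expect("=")
        literal = peek()
        if literal is None or literal[0] not in "\"'":
            raise ValueError("quoted value expected")
        index += 1
        return literal[1:-1]

    expect("<?xml")
    result = {"version": attribute("version")}
    # One token of lookahead decides: 'encoding' | 'standalone' | '?>'
    if peek() == "encoding":
        result["encoding"] = attribute("encoding")
    if peek() == "standalone":
        result["standalone"] = attribute("standalone")
    expect("?>")
    return result


print(parse_xml_decl('<?xml version="1.0" encoding="UTF-8" ?>'))
```

Note that this sketch accepts input in which a required space is missing, which matches the trade-off of treating S as ignorable.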

If S is defined as ignorable, sequences of characters within which S must not be inserted should be combined into one token.

Example:

ISO-EBNF: EntityRef ::= "&" Name ";"

Tetra: EntityRef ::= &{Name};

where {Name} is a macro for the token Name.
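A comparable single-token definition in Python could look like the following sketch. The NAME pattern is a simplified, ASCII-only assumption that plays the role of the macro.

```python
import re

# Simplified, ASCII-only stand-in for the Name macro (an assumption).
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"

# The whole entity reference is one token, so no whitespace can slip inside.
ENTITY_REF = re.compile(f"&{NAME};")

print(bool(ENTITY_REF.fullmatch("&amp;")))
print(bool(ENTITY_REF.fullmatch("& amp;")))
```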

One problem remains: the ISO-EBNF specification requires spaces at some positions. You could define special tokens ending with a space, but the elegance gained by introducing ignorable characters would then be partially lost again. Since the XML parser example is not intended to verify XML conformance but to read and process XML documents, this point is not really a problem.

This page belongs to the TextTransformer Documentation
