Algorithms > Scanner algorithm

Scanner algorithm

Top Previous Next

A scanner tries to find the next token that matches the actual text. In contrast to other parser generators the TextTransformer not only uses one scanner, but can use a variety of little scanners, which are specifically testing only a sub set of all possible token. These sub sets consists of the first set of the actual node and the SKIP-alternatives. However the principle of the scanner remains the same: for each token of a set is tested, if it matches or not. If there are several matching tokens, an algorithm exists to decide between the candidates. In detail:

1. Testing the first set

The TextTransformer tries to find a match of the actual text and one of the tokens of the first set of the actual node. If there is exactly one matching token, at this point the analysis is finished.

2. Preference of the longest match

If there is more than one match at the same text position, the token with the longest match will be preferred; that means the token, which covers the greatest number of characters.

3. Preference of literal tokens

If there still several token are possible, literal tokens are preferred.

4. Preference of the longest match of the first sub match

If two candidates aren't literals, the token with the longer first sub match is chosen.

5. Preference of the longest match of the next sub match

If there is still an ambiguity the length of the matches of the further sub expressions will be consulted as criterion.

6. SKIP alternatives

If there is no matching token in the first set, SKIP alternatives will be tested. Is there exactly one match the lexical analysis at this point is finished.

7. next text position

If there is more than one candidate, to which can be skipped, the token is chosen, which matches at the nearest position in the text.

8. longest match (analog point 2-5)

If there are several token of the SKIP set, which all match at the same position of the text, the decision is made according to the longest match analogous to point 2-5.

It is important, that no definition of one of alternative tokens includes the definition of a different. The TextTransformer cannot give any warning in this case.

Example:

IDENT = \w+

NUMBER = \d+

Because "\w" among others includes the set of numbers, the definition of IDENT also includes the definition of NUMBER. If there is an alternative

IDENT | NUMBER

and a number is at the input position, it isn't clear, which alternative will match (->remark). The alternative of IDENT and NUMBER must not be formulated explicitly. A problem results in every case, where conflicting tokens are hidden in a common first set.

Generally it is recommended to define token as specifically as possible. A definition of

IDENT = [[:alpha:]|_]\w*

would avoid the problem mentioned above. IDENT cannot begin with a digit and NUMBER must begin with a digit.

Remark: Indeed IDENT will match. and if IDENT were defined as "[[:alpha:]|_|\d]+" NUMBER would match. But this is determined by the internal implementation of the Regex-Library and cannot be specified by rules.

This page belongs to the TextTransformer Documentation

Home Content German