
Text statistics: Token


In the project options, all ignorable characters are deactivated. The set of tokens must therefore recognize every part of a text, including linefeeds and spaces.


A text therefore consists of the following tokens:


WORD            words

NUMBER          numbers

ABBREVIATION    abbreviations

CONTINUATION    sequences of dots like "..."

LINEFEED        linefeeds

SENTENCE_END    ends of sentences (dot, exclamation mark and question mark)

SPECIAL_CHAR    all remaining characters
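
The requirement that the tokens cover every character of the text can be illustrated with a small standalone C++ sketch. It is not TextTransformer code: the function tokenize, the regular expressions and the order in which they are tried are assumptions made for illustration; only the token names and the ABBREVIATION pattern come from this page. Any character that no other pattern matches becomes a SPECIAL_CHAR, so nothing is skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// (token name, matched text)
using Token = std::pair<std::string, std::string>;

// Hypothetical tokenizer: splits the text into tokens without skipping anything.
std::vector<Token> tokenize(const std::string& text) {
    // Illustrative patterns (assumptions), tried in this order at each position.
    static const std::vector<std::pair<std::string, std::regex>> defs = {
        {"LINEFEED",     std::regex(R"(\r?\n)")},
        {"CONTINUATION", std::regex(R"(\.\.+)")},
        {"NUMBER",       std::regex(R"(\d+)")},
        {"ABBREVIATION", std::regex(R"(\w+\.)")},
        {"WORD",         std::regex(R"(\w+)")},
        {"SENTENCE_END", std::regex(R"([.!?])")},
    };

    std::vector<Token> tokens;
    auto pos = text.cbegin();
    while (pos != text.cend()) {
        std::smatch m;
        bool matched = false;
        for (const auto& def : defs) {
            // match_continuous: the match must start exactly at pos.
            if (std::regex_search(pos, text.cend(), m, def.second,
                                  std::regex_constants::match_continuous)) {
                tokens.emplace_back(def.first, m.str());
                pos += m.length();
                matched = true;
                break;
            }
        }
        if (!matched) {
            // Everything else, spaces included, is a one-character SPECIAL_CHAR.
            tokens.emplace_back("SPECIAL_CHAR", std::string(1, *pos));
            ++pos;
        }
    }
    return tokens;
}
```

Concatenating the matched texts of all tokens gives back the original text. Note that in this sketch a word directly followed by a dot is matched as ABBREVIATION; whether it really is an abbreviation is decided later in the token's action.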


The counters are updated in the actions of the tokens. For example, the WORD action:



m_iChars += xState.length();


Here the word counter is incremented by one and the character counter is increased by the number of characters in the word.
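
The behaviour of the WORD action can be reproduced outside of TextTransformer with the following minimal C++ sketch. The struct State is only a stand-in for xState that offers length(); the struct WordStatistics, the method onWord and the counter name m_iWords are hypothetical. Only m_iChars and xState.length() are taken from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for the state of the recognized token (assumption).
struct State {
    std::string text;
    std::size_t length() const { return text.size(); }
};

struct WordStatistics {
    std::size_t m_iWords = 0;  // hypothetical name for the word counter
    std::size_t m_iChars = 0;  // character counter, as used on this page

    // Corresponds to the WORD action: one more word, length() more characters.
    void onWord(const State& xState) {
        m_iWords += 1;
        m_iChars += xState.length();
    }
};
```

Calling onWord for "hello" and then for "hi" would leave m_iWords at 2 and m_iChars at 7.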


The action of the ABBREVIATION token, which is defined as (\w+)\. , is a little more complicated:


if(xState.length() > 2 &&





m_iChars += xState.length();


If the recognized text consists of a single letter followed by a dot or if the text preceding the dot is found in the list of abbreviations, the recognized text is interpreted as an abbreviation. Otherwise the dot marks the end of a sentence and the sentence counter is incremented.
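
The decision described above can be sketched in standalone C++ as follows. This is an illustration under assumptions, not the original action: the struct AbbreviationStatistics, the method onAbbreviation, the container m_Abbreviations with its sample entries, and the counters m_iAbbreviations and m_iSentences are hypothetical names. Only the length test, m_iChars and the described behaviour come from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

struct AbbreviationStatistics {
    // Hypothetical list of known abbreviations, stored without the final dot.
    std::set<std::string> m_Abbreviations{"Dr", "Mr", "etc"};
    std::size_t m_iAbbreviations = 0;  // hypothetical counter
    std::size_t m_iSentences = 0;      // hypothetical counter
    std::size_t m_iChars = 0;          // character counter, as used on this page

    // 'matched' is the text recognized by (\w+)\. and therefore ends with a dot.
    void onAbbreviation(const std::string& matched) {
        if (matched.empty()) return;  // cannot happen for a real match
        const std::string stem = matched.substr(0, matched.size() - 1);
        if (matched.size() > 2 && m_Abbreviations.count(stem) == 0) {
            // Longer than "letter + dot" and not in the list: end of a sentence.
            ++m_iSentences;
        } else {
            // Single letter followed by a dot, or a known abbreviation.
            ++m_iAbbreviations;
        }
        m_iChars += matched.size();
    }
};
```

With this sketch, "A." and "etc." are counted as abbreviations, while "house." increments the sentence counter; in all three cases the character counter grows by the full length of the matched text.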



This page belongs to the TextTransformer Documentation
