Examples > Text statistics

Token


In the project options, all ignorable characters are deactivated, so the set of tokens must recognize every part of a text, including linefeeds and spaces.

A text therefore consists of the following tokens:

WORD words

NUMBER numbers

ABBREVIATION abbreviations

CONTINUATION sequences of dots like "..."

LINEFEED linefeeds

SENTENCE_END ends of sentences (dot, exclamation mark or question mark)

SPECIAL_CHAR all remaining characters

The actions of the tokens update the counters. For example, the WORD action:

m_iWords++;
m_iChars += xState.length();

Here the word counter is incremented by one, and the character counter is increased by the number of characters in the word.
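The same bookkeeping can be sketched in standard C++, outside of TextTransformer. This is only an illustration: the struct TextStats and the function onWord are hypothetical names, and a std::string stands in for the text that xState delivers to the action.

```cpp
#include <cstddef>
#include <string>

// Hypothetical container for the counters. The member names follow the
// original actions; the struct itself is an assumption for illustration.
struct TextStats {
    std::size_t m_iWords = 0;
    std::size_t m_iChars = 0;

    // Mirrors the WORD action: one more word, plus its characters.
    void onWord(const std::string& text) {
        m_iWords++;
        m_iChars += text.length();
    }
};
```

Calling onWord("hello") and then onWord("to") on a fresh TextStats leaves m_iWords at 2 and m_iChars at 7.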

The action of the ABBREVIATION token, which is defined by the expression (\w+)\., is slightly more complicated:

if(xState.length() > 2 &&
   !m_mAbbr.findKey(xState.str(1)))
  m_iSentences++;
m_iWords++;
m_iChars += xState.length();

If the recognized text consists of a single letter followed by a dot, or if the text preceding the dot is found in the list of abbreviations m_mAbbr, the recognized text is interpreted as an abbreviation. Otherwise, the dot marks the end of a sentence and the sentence counter is incremented. In both cases the word and character counters are updated as in the WORD action.
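The decision described above can be tried out in standard C++. The following is a minimal sketch under stated assumptions, not the TextTransformer implementation: std::regex replaces the token definition, a std::set replaces the m_mAbbr map, the sample abbreviations are invented, and AbbreviationStats and onAbbreviation are hypothetical names.

```cpp
#include <cstddef>
#include <regex>
#include <set>
#include <string>

// Hypothetical stand-in for the project's counters and abbreviation list.
struct AbbreviationStats {
    std::size_t m_iSentences = 0;
    std::size_t m_iWords = 0;
    std::size_t m_iChars = 0;
    // Sample entries only (assumption); the real list belongs to the project.
    std::set<std::string> m_mAbbr{"etc", "Dr", "Prof"};

    // Handles a text of the form (\w+)\. and returns false if the text
    // does not have that form (in which case nothing is counted).
    bool onAbbreviation(const std::string& text) {
        static const std::regex pattern(R"((\w+)\.)");
        std::smatch match;
        if (!std::regex_match(text, match, pattern))
            return false;

        // Sub-expression 1 is the text preceding the dot.
        const std::string stem = match.str(1);

        // More than one letter before the dot and not a known abbreviation:
        // the dot ends a sentence.
        if (text.length() > 2 && m_mAbbr.count(stem) == 0)
            m_iSentences++;

        m_iWords++;
        m_iChars += text.length();
        return true;
    }
};
```

With these sample entries, "A." and "etc." are counted as words only, while "house." also increments the sentence counter.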

This page belongs to the TextTransformer Documentation
