Token
In the project options, all ignorable characters are deactivated, so the set of tokens must recognize every part of a text, including linefeeds and spaces.
A text therefore consists of the following tokens:

WORD            words
NUMBER          numbers
ABBREVIATION    abbreviations
CONTINUATION    sequences of dots like "..."
LINEFEED        linefeeds
SENTENCE_END    ends of sentences (dot, exclamation mark, and question mark)
SPECIAL_CHAR    all remaining characters
The counters are updated in the actions of the tokens. For example, the WORD action:
m_iWords++;
m_iChars += xState.length();
Here the word counter is incremented by one, and the character counter is increased by the number of characters in the word.
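The same bookkeeping can be sketched in standalone C++, outside of TextTransformer. This is only an illustration: the struct TextStats and the function onWord are hypothetical names, and the matched text is passed in as a std::string instead of being read from xState.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the counters that the token actions update.
struct TextStats {
    std::size_t words = 0;  // plays the role of m_iWords
    std::size_t chars = 0;  // plays the role of m_iChars
};

// Mirrors the WORD action: one more word, plus the length of the matched text.
void onWord(TextStats& stats, const std::string& matched) {
    stats.words++;
    stats.chars += matched.length();
}
```

Calling onWord once for each recognized word leaves the number of words in stats.words and the total length of those words in stats.chars.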
The action of the ABBREVIATION token is a little more complicated. The token is defined by the expression (\w+)\. and its action is:
if (xState.length() > 2 && !m_mAbbr.findKey(xState.str(1)))
  m_iSentences++;  // not an abbreviation: the dot ends a sentence
m_iWords++;
m_iChars += xState.length();
If the recognized text consists of a single letter followed by a dot, or if the text preceding the dot is found in the list of abbreviations, the recognized text is interpreted as an abbreviation. Otherwise the dot marks the end of a sentence and the sentence counter is incremented. In both cases the word counter and the character counter are updated as for a WORD.
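The decision described above can be tried out in standalone C++. This is a minimal sketch under stated assumptions, not the TextTransformer implementation: Counters and onAbbreviation are hypothetical names, a std::set stands in for the abbreviation list m_mAbbr, the abbreviations are assumed to be stored without the trailing dot, and the lookup is assumed to be case-sensitive.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the counters updated by the ABBREVIATION action.
struct Counters {
    std::size_t words = 0;
    std::size_t chars = 0;
    std::size_t sentences = 0;
};

// 'matched' is the full recognized text including the final dot, e.g. "etc.".
// Precondition (assumed): 'matched' is non-empty and ends with a dot.
void onAbbreviation(Counters& c,
                    const std::set<std::string>& abbreviations,
                    const std::string& matched) {
    // Text before the dot; plays the role of xState.str(1).
    const std::string stem = matched.substr(0, matched.length() - 1);

    // More than one letter before the dot and not a known abbreviation:
    // the dot is treated as the end of a sentence.
    if (matched.length() > 2 && abbreviations.count(stem) == 0)
        c.sentences++;

    // The word and character counters are updated in either case.
    c.words++;
    c.chars += matched.length();
}
```

With an abbreviation list containing only "etc", the inputs "etc." and "A." count as words without ending a sentence, while "house." counts as a word and also increments the sentence counter.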
This page belongs to the TextTransformer Documentation