Token


 

In the project options, all ignorable characters are deactivated, so the set of tokens must recognize every part of a text, including linefeeds and spaces.

 

A text therefore consists of the following tokens:

 

WORD            words
NUMBER          numbers
ABBREVIATION    abbreviations
CONTINUATION    sequences of dots like "..."
LINEFEED        linefeeds
SENTENCE_END    ends of sentences (dot, exclamation mark or question mark)
SPECIAL_CHAR    all remaining characters
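To make the token set more concrete, the following is a minimal sketch in plain C++ that assigns one already isolated piece of text to one of the token names above. It does not use the TextTransformer API. The function name classifyToken and all patterns except (\w+)\. for ABBREVIATION are assumptions made for illustration; the actual token definitions of the project may differ.

```cpp
#include <regex>
#include <string>

// Hypothetical classifier, not part of TextTransformer.
// Only the ABBREVIATION pattern (\w+)\. is taken from the documentation;
// every other pattern is an assumption for this sketch.
std::string classifyToken(const std::string& piece) {
    static const std::regex number(R"(\d+)");
    static const std::regex abbreviation(R"((\w+)\.)");
    static const std::regex word(R"(\w+)");
    static const std::regex continuation(R"(\.{2,})");
    static const std::regex linefeed(R"(\r?\n)");
    static const std::regex sentenceEnd(R"([.!?])");

    // NUMBER is tested before WORD because \w also matches digits.
    if (std::regex_match(piece, number))       return "NUMBER";
    if (std::regex_match(piece, abbreviation)) return "ABBREVIATION";
    if (std::regex_match(piece, word))         return "WORD";
    if (std::regex_match(piece, continuation)) return "CONTINUATION";
    if (std::regex_match(piece, linefeed))     return "LINEFEED";
    if (std::regex_match(piece, sentenceEnd))  return "SENTENCE_END";

    // Everything else, including a single space, falls through to here.
    return "SPECIAL_CHAR";
}
```

Because no characters are ignored, the fall-through case matters: anything that no other pattern matches still has to be consumed by some token.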

 

The counters are updated in the actions of the tokens. For example, the WORD action:

 

m_iWords++;

m_iChars += xState.length();

 

Here the word counter is incremented by one, and the character counter is increased by the number of characters in the word.
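The same bookkeeping can be tried outside of TextTransformer. The following is a minimal sketch in standard C++; the struct TextStatistics and its member onWord are hypothetical stand-ins for the member variables m_iWords and m_iChars and for the WORD action, and the matched text is passed as a std::string instead of xState.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the counters of the project.
struct TextStatistics {
    std::size_t words = 0;  // corresponds to m_iWords
    std::size_t chars = 0;  // corresponds to m_iChars

    // Mirrors the WORD action: one more word, plus its length in characters.
    void onWord(const std::string& matched) {
        ++words;
        chars += matched.length();
    }
};
```

Calling onWord("hello") and then onWord("to") on a fresh TextStatistics object leaves words at 2 and chars at 7.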

 

The action of the ABBREVIATION token, which is defined by the regular expression (\w+)\. is slightly more complicated:

 

if( xState.length() > 2 &&
    !m_mAbbr.findKey(xState.str(1)) )
  m_iSentences++;

 

m_iWords++;

m_iChars += xState.length();

 

If the recognized text consists of a single letter followed by a dot, or if the text preceding the dot is found in the list of abbreviations, the recognized text is treated as an abbreviation. Otherwise, the dot marks the end of a sentence and the sentence counter is incremented. In both cases the word and character counters are updated.
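The decision can be sketched in standard C++ as follows. This is an illustration under assumptions, not the project code: AbbreviationCounter and onAbbreviation are hypothetical names, a std::set plays the role of m_mAbbr, and it is assumed that xState.str(1) yields the text matched by the first sub-expression (\w+), that is, the text before the dot.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the counters and the abbreviation list.
struct AbbreviationCounter {
    std::set<std::string> knownAbbreviations;  // plays the role of m_mAbbr
    std::size_t words = 0;
    std::size_t chars = 0;
    std::size_t sentences = 0;

    // `matched` is the whole token text including the trailing dot, e.g. "etc."
    void onAbbreviation(const std::string& matched) {
        if (matched.empty()) return;  // nothing was recognized

        // Text before the dot (assumed equivalent of xState.str(1)).
        const std::string stem = matched.substr(0, matched.length() - 1);

        // Longer than "letter + dot" and not a known abbreviation:
        // the dot ends a sentence.
        if (matched.length() > 2 && knownAbbreviations.count(stem) == 0)
            ++sentences;

        // The token counts as a word in either case.
        ++words;
        chars += matched.length();
    }
};
```

With "etc" in the list, the inputs "etc." and "A." leave the sentence counter unchanged, while "house." increments it. As in the original action, the character count includes the dot.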

 

 



This page belongs to the TextTransformer Documentation
