Type | | Sub category | Writing style features (Korean and English)
---|---|---|---
F1 | Lexical | Character-level | Number of characters per post; Frequency of alphabetic characters, normalized by total number of characters; Frequency of upper case characters, normalized by total number of characters (English only); Frequency of digit characters, normalized by total number of characters; Frequency of white space characters, normalized by total number of characters; Frequency of tab characters, normalized by total number of characters; Frequency of letters, normalized by total number of alphabetic characters; Frequency of special characters, normalized
F1 | Lexical | Word-level | Number of words per post; Frequency of short words (length ≤ 3), normalized by total number of words; Frequency of characters in words, normalized by total number of characters; Average word length; Average sentence length in terms of characters; Average sentence length in terms of words; Word length frequency (length ≤ 20), normalized by total number of words; Frequency of emoticons per post (e.g. :), :(, ㅠㅠ, -.-)
F1 | Lexical | Richness | Total different words, normalized by total number of words; Hapax legomena, normalized by total number of words; Hapax dislegomena, normalized by total number of words; Yule's K; Simpson's D; Sichel's S; Brunet's W; Honoré's R
F2 | Syntactic | – | Frequency of punctuation marks, normalized by total number of words; Frequency of stop words, normalized by total number of words; Frequency of POS n-grams (n = uni, bi, tri), normalized by total number of words; Frequency of roots, normalized by total number of words
F3 | Structural | – | Number of sentences per post; Has greetings ∈ {0, 1}; Has URLs ∈ {0, 1}; Has quoted content including news ∈ {0, 1}; Has e-mail as signature ∈ {0, 1}; Has telephone number as signature ∈ {0, 1}
F4 | Content-specific | Character-level | Character n-grams (n = uni, bi, tri), normalized by total number of characters
F4 | Content-specific | Word-level | Word n-grams (n = uni, bi, tri), normalized by total number of words
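
Several of the F1 features in the table are simple ratios or closed-form statistics over a post, so a short sketch may make them concrete. The Python below is illustrative only and is not the authors' implementation: the function names `character_features` and `richness_features` are hypothetical, tokenization is a plain `\w+` regular expression (a real Korean pipeline would more likely use a morphological analyzer), and the richness measures use their commonly cited textbook definitions, with the Brunet exponent 0.172 taken as an assumed constant.

```python
import math
import re
from collections import Counter


def character_features(text: str) -> dict:
    """Character-level F1 ratios, each normalized by total characters.

    Hypothetical helper; str.isalpha() is True for Hangul as well as Latin letters.
    """
    n = len(text)
    if n == 0:
        return {}
    return {
        "num_chars": n,
        "alpha": sum(c.isalpha() for c in text) / n,
        "upper": sum(c.isupper() for c in text) / n,  # meaningful for English only
        "digit": sum(c.isdigit() for c in text) / n,
        "whitespace": sum(c.isspace() for c in text) / n,  # note: also counts tabs
        "tab": text.count("\t") / n,
    }


def richness_features(text: str) -> dict:
    """Vocabulary richness measures from the F1 'Richness' row.

    Assumption: words are lowercased runs of word characters.
    """
    words = re.findall(r"\w+", text.lower())
    n = len(words)
    if n == 0:
        return {}
    freq = Counter(words)               # word -> number of occurrences
    v = len(freq)                       # number of distinct words
    spectrum = Counter(freq.values())   # i -> V_i, count of words seen exactly i times
    v1 = spectrum.get(1, 0)             # hapax legomena
    v2 = spectrum.get(2, 0)             # hapax dislegomena
    sum_i2 = sum(i * i * vi for i, vi in spectrum.items())
    pairs = sum(vi * i * (i - 1) for i, vi in spectrum.items())
    return {
        "type_token_ratio": v / n,
        "hapax_legomena": v1 / n,
        "hapax_dislegomena": v2 / n,
        "yules_k": 1e4 * (sum_i2 - n) / (n * n),
        "simpsons_d": pairs / (n * (n - 1)) if n > 1 else 0.0,
        "sichels_s": v2 / v,
        "brunets_w": n ** (v ** -0.172),  # assumed exponent
        # Honore's R is undefined when every word is a hapax; report infinity then.
        "honores_r": 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf"),
    }


if __name__ == "__main__":
    post = "the cat sat on the mat and the dog sat too"
    print(character_features(post))
    print(richness_features(post))
```

Both functions return plain dictionaries so that their outputs can be merged into a single feature vector per post; the syntactic (F2) and n-gram (F4) features would need a POS tagger and a fixed vocabulary, which this sketch leaves out.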