Table 2 Writing style features that are extracted from the posts of selected users for this study

From: Comparing writing style feature-based classification methods for estimating user reputations in social media

 

| ID | Type | Sub category | Writing style features (Korean and English) |
|---|---|---|---|
| F1 | Lexical | Character-level | Number of characters per post |
| | | | Frequency of alphabetic characters, normalized by total number of characters |
| | | | Frequency of upper-case characters, normalized by total number of characters (English only) |
| | | | Frequency of digit characters, normalized by total number of characters |
| | | | Frequency of white-space characters, normalized by total number of characters |
| | | | Frequency of tab characters, normalized by total number of characters |
| | | | Frequency of letters, normalized by total number of alphabetic characters |
| | | | Frequency of special characters, normalized |
| | | Word-level | Number of words per post |
| | | | Frequency of short words (length ≤ 3), normalized by total number of words |
| | | | Frequency of characters in words, normalized by total number of characters |
| | | | Average word length |
| | | | Average sentence length in characters |
| | | | Average sentence length in words |
| | | | Word length frequency (length ≤ 20), normalized by total number of words |
| | | | Frequency of emoticons per post (e.g. :), :(, ㅠㅠ, -.-) |
| | | Richness | Number of distinct words, normalized by total number of words |
| | | | Hapax legomena, normalized by total number of words |
| | | | Hapax dislegomena, normalized by total number of words |
| | | | Yule’s K |
| | | | Simpson’s D |
| | | | Sichel’s S |
| | | | Brunet’s W |
| | | | Honoré’s R |
| F2 | Syntactic | – | Frequency of punctuation marks, normalized by total number of words |
| | | | Frequency of stop words, normalized by total number of words |
| | | | Frequency of part-of-speech (POS) n-grams (n = 1, 2, 3), normalized by total number of words |
| | | | Frequency of roots, normalized by total number of words |
| F3 | Structural | – | Number of sentences per post |
| | | | Has greetings ∈ {0, 1} |
| | | | Has URLs ∈ {0, 1} |
| | | | Has quoted content including news ∈ {0, 1} |
| | | | Has e-mail as signature ∈ {0, 1} |
| | | | Has telephone number as signature ∈ {0, 1} |
| F4 | Content-specific | Character-level | Character n-grams (n = 1, 2, 3), normalized by total number of characters |
| | | Word-level | Word n-grams (n = 1, 2, 3), normalized by total number of words |
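
The richness measures in F1 are listed by name only, so a small sketch may help make them concrete. The code below is an illustration, not the authors' implementation: it uses the commonly cited textbook definitions of each measure, and the table does not say which variants the study used. The function name `richness_features`, the dictionary keys, the whitespace tokenization in the usage lines, and the Brunet exponent 0.172 are all assumptions. Korean posts would likely need a morphological analyzer rather than a whitespace split.

```python
import math
from collections import Counter


def richness_features(tokens):
    """Vocabulary richness measures for one post, given its word tokens.

    Hypothetical helper: the formulas are the commonly cited definitions,
    which may differ from the variants used in the study.
    """
    n = len(tokens)                      # N: total number of words
    if n == 0:
        return {}
    word_counts = Counter(tokens)
    v = len(word_counts)                 # V: number of distinct words
    # spectrum[i] = V_i, the number of distinct words occurring exactly i times
    spectrum = Counter(word_counts.values())
    v1 = spectrum.get(1, 0)              # hapax legomena (words used once)
    v2 = spectrum.get(2, 0)              # hapax dislegomena (words used twice)

    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    # Simpson's D = sum_i V_i * (i / N) * ((i - 1) / (N - 1)); set to 0 when N = 1
    if n > 1:
        simpson_d = sum(vi * i * (i - 1) for i, vi in spectrum.items()) / (n * (n - 1))
    else:
        simpson_d = 0.0
    # Sichel's S = V_2 / V
    sichel_s = v2 / v
    # Brunet's W = N^(V^-a); a = 0.172 is an assumed, conventional constant
    brunet_w = n ** (v ** -0.172)
    # Honore's R = 100 * ln(N) / (1 - V_1 / V); undefined when every word is a hapax
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else math.inf

    return {
        "distinct_ratio": v / n,
        "hapax_legomena": v1 / n,
        "hapax_dislegomena": v2 / n,
        "yule_k": yule_k,
        "simpson_d": simpson_d,
        "sichel_s": sichel_s,
        "brunet_w": brunet_w,
        "honore_r": honore_r,
    }


# Usage: naive lower-casing and whitespace tokenization (an assumption)
post = "The cat sat on the mat the end"
features = richness_features(post.lower().split())
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Counting the frequency spectrum once and deriving every measure from it keeps the function to a single pass over the tokens. Several of these measures are unstable on very short posts, which is why the edge cases (one word, or every word a hapax) are handled explicitly.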