Comparing writing style feature-based classification methods for estimating user reputations in social media

Suh, Jong Hwan

doi:10.1186/s40064-016-1841-1

Research
Open access
Published: 02 March 2016

Jong Hwan Suh ORCID: orcid.org/0000-0002-5465-1103¹

SpringerPlus volume 5, Article number: 261 (2016)

Abstract

In recent years, the anonymous nature of the Internet has made it difficult to detect manipulated user reputations in social media, as well as to ensure the quality of users and their posts. To deal with this, this study designs and examines an automatic approach that adopts writing style features to estimate user reputations in social media. Under varying ways of defining Good and Bad classes of user reputations based on the collected data, it evaluates the classification performance of state-of-the-art methods: four writing style features, i.e. lexical, syntactic, structural, and content-specific, and eight classification techniques, i.e. four base learners (C4.5, Neural Network (NN), Support Vector Machine (SVM), and Naïve Bayes (NB)) and four Random Subspace (RS) ensemble methods based on the four base learners. With South Korea’s Web forum, Daum Agora, selected as a test bed, the experimental results show that the configuration of the full feature set containing content-specific features and RS-SVM, which combines RS and SVM, gives the best classification accuracy when the test bed poster reputations are segmented strictly into Good and Bad classes by the portfolio approach. Pairwise t tests on accuracy confirm two expectations drawn from the literature review: first, the feature set that adds content-specific features outperforms the others; second, ensemble learning methods are more viable than base learners. Moreover, among the four ways of defining the classes of user reputations, i.e. like, dislike, sum, and portfolio, the results show that the portfolio approach gives the highest accuracy.

Background

For the last decade, the development of the Internet and mobile devices has increased the popularity of social media (Sherchan et al. 2013; Beato et al. 2015). Social media varies in form and purpose, including blogs, e.g. Blogspot, microblogs, e.g. Twitter, discussion forums, e.g. Epinions, media sharing sites, e.g. YouTube, and social networks, e.g. Facebook (Kaplan and Haenlein 2010; Sun et al. 2015). People take advantage of social media as an electronic communication platform to share information, express their opinions, construct their social networks, and hear the voices of other users (Suh 2015; Sherchan et al. 2013; Li et al. 2013). Moreover, information and opinions shared and shaped through social media influence individual views of society and incite online and offline political participation, e.g. South Korea’s candlelight protests of 2008 and Occupy Wall Street in the United States in 2011. Thus, social media has become an open platform for political and social innovation (Suh 2015; de Zuniga 2012).

However, several problems must be resolved before social media can grow into a better online place for political and social innovation, as well as for the trustworthy sharing of information and opinions. First, the anonymous nature of the Internet makes it difficult to ensure the quality of users and their posts in social media. If there is no online user feedback on an anonymous user’s posts, e.g. like and dislike, there is no way to find out whether the anonymous user is good or bad before reading her/his post. This makes social media users vulnerable to low-quality, offensive, and deceptive posts and posters. Second, manipulation of user reputations arouses suspicion and mistrust of the online feedback system of social media in two ways: good-quality posts and their users can be given low reputations maliciously by other users, and some users can inflate their own reputations for certain reasons. Moreover, identity changes, stemming from the anonymity of the Internet, lead to insufficient past reputation records on social media users, and make it hard for existing approaches, e.g. the one suggested in Lai et al. (2013), to detect the manipulation of user reputations.

To resolve the abovementioned problems, user reputations in social media need to be estimated from concrete user features. Previous works hint that the writing styles of social media users can be such objective features (Koppel et al. 2009). Therefore, this paper proposes an automatic approach that adopts the writing styles of users as objective features for estimating user reputations in social media. Nevertheless, the following research gaps are identified through the literature review: first, no study has addressed the automated classification of user reputations in social media by using writing style analysis; second, when a way of defining user reputations into Good and Bad classes is given, it is unclear which writing style feature and classification technique will perform better for this task; third, there is no reference on how to define the classes of user reputations in social media for better performance.

Therefore, this paper proposes a research framework to find a better way of estimating user reputations in social media by using writing style features. First, the paper segments the test bed users into Good or Bad reputation classes by four proposed approaches: like, dislike, sum, and portfolio. Moreover, it extracts four writing style features, i.e. lexical (denoted by F1), syntactic (denoted by F2), structural (denoted by F3), and content-specific (denoted by F4), to represent the reputations of the test bed users. Next, it evaluates the classification performance of 32 configurations, resulting from combining four feature sets, i.e. F1, F1 + F2, F1 + F2 + F3, and F1 + F2 + F3 + F4, and eight classification techniques, i.e. four base learners (C4.5, Neural Network (NN), Support Vector Machine (SVM), and Naïve Bayes (NB)) and four Random Subspace (RS) ensemble methods based on the four base learners, with respect to accuracy for a given way of defining the classes of the test bed user reputations. In addition, it statistically compares the classification performance of different feature sets and different classification techniques by conducting pairwise t tests.

To sum up the contribution of this paper, it is the first work to deal with the estimation of user reputations in social media by using writing style features. If a system is built on the experimental results of this study, it can remedy the abovementioned shortcomings of social media reputation systems as follows. First, user reputation estimation based on writing style features helps protect users from being exposed to bad users and their harmful posts in social media. Second, it contributes to establishing trust among social media users when sharing and searching for information and opinions. Eventually, these help social media evolve into a trustworthy virtual place for political and social innovation and for information and opinion sharing.

The rest of this paper is organized as follows. “Literature reviews” section briefly introduces and reviews the relevant literature. “Proposed research framework” section outlines the proposed research framework, and explains it in detail. Subsequently, “Experimental results and discussions” section demonstrates and discusses the experimental results of applying the suggested research framework to the Web forum of South Korea, Daum Agora, chosen as a test bed. “Results on comparative studies” section evaluates the results with statistical comparisons. Finally, “Conclusions” section concludes the paper with a reflection on limitations and further works.

Literature reviews

Online anonymity

Online anonymity represents the incapability of others to identify an individual in computer-mediated communication (CMC) (Christopherson 2007). The online anonymity takes many different forms, grouped into three different types: first, visual anonymity is the most common type wherein physical characteristics are hidden although other identifying information is known; second, pseudonymity refers to the case when people use avatars or usernames as indicators of their online identity; third, full anonymity is said to exist where users remain unknowable after interaction has concluded, and occurs in the absence of any long-term usernames (Christie and Dill 2016). In this paper, the term anonymity refers to pseudonymity and full anonymity.

Due to the online anonymity, digesting posts in social media sometimes requires a great deal of risk taking, like doing businesses online without any physical interaction (Enders et al. 2008). For example, cyber criminals abuse the anonymous nature in social media to conduct malicious activities such as phishing scams, identity theft, and harassment (Iqbal et al. 2013). Hence, to alleviate such risk taking in reading posts, this paper aims at the objective user reputation system of social media, which is effective even under the anonymous circumstances.

Online user feedbacks and user reputations in social media

Social media users post their opinions regarding particular objects such as products, services, companies, people, and events (Lai et al. 2013; Shad Manaman et al. 2016). Online user feedback mechanisms play crucial roles in evaluating the qualities of posts and their users. The online user feedbacks in social media are intended to offer social control mechanism, allowing social translucence for improved accountability (Erickson and Kellogg 2000). Mainly there are two types of online user feedbacks in social media: recommendations and reputations (Li et al. 2013). First, recommendations help users identify posts and users that suit their needs or preferences. They are usually used to solve information overload problems (Li et al. 2013). Recommendation systems are classified into content-based filtering, collaborative filtering, or hybrid approaches (Sarwar et al. 2001; Jin et al. 2004; Li et al. 2005; Huang et al. 2004; Liu et al. 2014; Yang et al. 2014). Second, reputations are considered as a collective measure of trustworthiness based on the referrals or ratings of users in social media (Jøsang et al. 2007). Reputation systems let users rate other users, and the ratings help determine who to trust in certain environments where users have to interact among themselves in online settings (Agudo et al. 2010).

For reputation systems in particular, there are two categories of methods for calculating trust scores between users as user reputations: feature-based and graph-based. First, the feature-based method computes the trust score of a user from past ratings on the user’s posts (O’Donovan and Smyth 2005). Second, the graph-based approach derives the trust values from explicitly specified relations (e.g. friends) or trust relationships of the user (Golbeck 2005). Between the two, this study adopts the first method, which measures reputations from past ratings on the user’s posts, because under anonymity the relationships of a user are undisclosed or scarce.

Writing style features for characterizing user reputations in social media

According to systemic functional linguistic theory, a language has a textual dimension in which individuals convey their ideas by varying stylistic elements in their writing. Writing styles are influenced by education, gender, and vocabulary, as well as by subconscious factors described in psycholinguistics works. The statistical analysis of such writing styles, a.k.a. authorship analysis, can discriminate authorship in social media (Abbasi et al. 2008a). Because Web-based channels such as e-mail, newsgroups, and chat rooms are relatively casual compared with formal publications, social media users are more likely to leave their own styles in their writing (Zheng et al. 2006). Hence, if the reputations of social media users are characterized by their writing styles, authorship analysis can help resolve the problem of anonymity in the online communications of social media (Zhao et al. 2015; Iqbal et al. 2013). Previous works on writing style analysis in Table 1 also hint that writing styles extracted from posts can be objective features that characterize user reputations in social media. In practice, writing style features suit anonymous users in social media because other features, e.g. their relationships as graph-based features, are not available in most cases. Nonetheless, to the best of our knowledge, no previous work has classified the reputations of social media users by using their writing styles.

Table 1 Previous works that used writing style features in social media

The writing style markers known as the most effective discriminators of authorship in social media are lexical, syntactic, structural, and content-specific writing style features. Here, lexical, syntactic, and structural writing style features are called the content-free writing style features (Zhang et al. 2011; Jiang et al. 2014). Among the four writing style features in social media, the content-specific writing style features are expected to outperform the content-free writing style features for this study because of two characteristics: first, they consist of important keywords and phrases, so they are more meaningful and more representative than the other writing style features (Zhang et al. 2011); second, they contain a much larger number of n-grams extracted from the collected social media data, and large potential feature spaces are known to be effective for online text classification (Abbasi and Chen 2008). Nevertheless, all the writing style features used in Jiang et al. (2014) are considered for this study because no previous work has shown which writing style feature is more useful for estimating user reputations in social media.

Classification techniques for writing style analysis

Three major types of writing style analysis tasks are identification, similarity detection, and classification (Zheng et al. 2006; Abbasi and Chen 2005). First, identification entails comparing anonymous texts against those belonging to identified entities, where the anonymous text is known to be written by one of those entities. Second, similarity detection task requires the comparison of anonymous texts against other anonymous texts in order to assess the degree of similarity. Third, classification is related to categorizing objects in regards to their properties, e.g. gender, by using their writing styles as features that represent the properties. This study belongs to the third category of classification techniques for writing style analysis.

For the classification task, this paper adopts the supervised techniques because they have been extensively studied due to their predominant classification performance (Zheng et al. 2006; Abbasi and Chen 2009). In general, supervised techniques for classification consist of two steps: first, the extraction of features from training data and their conversion to feature vectors; second, training of the classifier on the feature vectors and application of the classifier to unseen instances. Hence, feature construction and learning method selection are crucial for accurate classification.

Referring to the classification techniques of previous writing style analysis works summarized in Table 1, four main supervised techniques are adopted as base learners for this study. To explain, first, C4.5, an extension of the ID3 algorithm, is a decision-tree building algorithm developed by Quinlan (1986), and it adopts a divide-and-conquer strategy and an entropy measure for object classification. Its goal is to classify mixed objects into their associated classes based on the objects’ attribute values. Second, NN has been popular because of its unique learning capability (Widrow et al. 1994), and has achieved good performance in many different applications (Giles et al. 1998; Kim and Lewis 2000; Tolle et al. 2000). Third, SVM is a novel learning machine first introduced by Vapnik (1995), and is based on the structural risk minimization principle from computational learning theory. Because SVM can handle millions of inputs with good performance (Cristianini and Shawe-Taylor 2000; Joachims 2002), it was introduced to writing style analysis in many previous works (Argamon et al. 2003; Vel et al. 2001; Diederich et al. 2003). Fourth, based on Bayes Theorem (Barnard 1958), NB is a fairly simple probabilistic classification algorithm that uses strong independence assumptions regarding various features (Yang et al. 2002). It assumes that the presence of any feature is entirely independent of the presence of the other features, and allows building classification models efficiently.

Among the four base learners, SVM is a highly robust technique that has provided powerful classification capabilities for online authorship analysis. In head-to-head comparisons, SVM significantly outperformed other supervised learning methods such as NN and C4.5 (Zheng et al. 2006; Abbasi and Chen 2009). Similarly, SVM is expected to outperform the other base learners for this study. However, for writing style analysis, it is unclear which classification technique consistently performs better than others for a given problem in a given domain.

Given this uncertainty, it is not uncommon to train multiple learners and create an integrated classifier based on overall performance (Wang et al. 2014). Hence, in addition to the four base learners, this paper combines an ensemble learning method with each of the four base learners. Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem. In contrast to base learners, which try to learn one hypothesis from the training data, ensemble learning methods try to construct a set of hypotheses and combine them for use. In general, ensemble learning methods are divided into two categories: first, Boosting and Bagging are instance partitioning methods; second, feature partitioning methods include RS (Polikar 2006; Zhou 2012; Wang et al. 2014).

For this study, RS is selected as the ensemble learning method because it showed better accuracy than Boosting and Bagging in Wang et al. (2014). RS is an ensemble construction technique proposed by Tin Kam (1998) that modifies the training dataset in the feature space. The rationale of RS is that, if one obtains better base learners in random subspaces than in the original feature space, the combined decision of such base learners can be superior to a single classifier constructed on the original training dataset in the complete feature set (Wang et al. 2014). RS is therefore combined with the four base learners selected for this study, and the resulting four multiple learner methods, i.e. RS-C4.5, RS-NN, RS-SVM, and RS-NB, are additionally adopted. Considering the superiority of SVM to the other three base learners, RS-SVM is expected to outperform the other three ensemble learning methods.
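The RS idea described above can be illustrated with a minimal sketch. It is not the WEKA implementation used in this study: a toy nearest-centroid learner stands in for the four base learners, and the function names, the number of learners, the subspace ratio, and the data are all assumptions made for illustration.

```python
import random
from collections import Counter

def train_centroid(rows, labels, feats):
    """Toy base learner: per-class mean vector over the chosen feature indices."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for k, f in enumerate(feats):
            acc[k] += row[f]
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict_centroid(model, row, feats):
    """Return the class whose centroid is closest within the feature subspace."""
    def dist(c):
        return sum((row[f] - model[c][k]) ** 2 for k, f in enumerate(feats))
    return min(sorted(model), key=dist)

def train_random_subspace(rows, labels, n_learners=11, subspace=0.5, seed=0):
    """Train one base learner per randomly drawn subset of the features."""
    rng = random.Random(seed)
    n_feats = len(rows[0])
    size = max(1, int(n_feats * subspace))
    ensemble = []
    for _ in range(n_learners):
        feats = sorted(rng.sample(range(n_feats), size))
        ensemble.append((feats, train_centroid(rows, labels, feats)))
    return ensemble

def predict_random_subspace(ensemble, row):
    """Combine the base learners' decisions by majority vote."""
    votes = Counter(predict_centroid(m, row, feats) for feats, m in ensemble)
    return votes.most_common(1)[0][0]

# Hypothetical data: four numeric features per user; the third is pure noise.
rows = [[1.0, 0.9, 5.0, 0.1], [0.9, 1.1, 4.0, 0.2],
        [0.1, 0.0, 5.0, 0.9], [0.0, 0.2, 4.0, 1.0]]
labels = ["Good", "Good", "Bad", "Bad"]
ensemble = train_random_subspace(rows, labels)
print(predict_random_subspace(ensemble, [0.95, 1.0, 4.5, 0.15]))
```

Each learner sees only part of the feature space, so no single feature dominates the combined decision; an odd number of learners avoids tied votes in the two-class case.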

Proposed research framework

To design and examine an automatic approach that uses writing style features for estimating user reputations in social media, this study proposes a research framework as outlined in Fig. 1. The research framework answers the research questions below.

RQ1 How do writing style features perform for estimating user reputations in social media?
RQ2 Which writing style features are the best at estimating user reputations by classification techniques in social media?
RQ3 Which classification technique is better suited at differentiating user reputations with writing style features in social media?
RQ4 Which method to define user reputation classes, i.e. Good and Bad, works better for estimating user reputations with writing style features in social media?

Ultimately, the research framework is intended for developing a system that is capable of differentiating between Good and Bad reputation users by using the stylistic tendencies inherent in their writing in social media. In a nutshell, it consists of four steps: data collection, data representation, classification, and evaluation with comparisons. The following subsections explain the details of each step in the research framework.

Collect data

This study uses the Web forum for data collection because it is a major type of social media with a balanced nature of discussions among participants and a relatively broader range of topics (Zhang et al. 2011). The data collection from the Web forum has two steps, crawling and parsing. First, the developed Web crawler programs collect the online data from the Web forum as HTML pages. Then, users whose posts had been evaluated at least once by the others are selected, and the posts and their past ratings are parsed out for the selected users from the raw HTML pages, and are stored in a relational database.

Represent user reputations by writing style features

In this step, writing style features are extracted as independent variables to represent the reputations of the collected and selected users from the Web forum. The class of a user reputation is obtained from her/his online user feedbacks by using different ways of defining user reputation classes. The details are explained in the following sub-sections.

Extract writing style features

This study generates different feature sets containing different types of writing style features. By doing so, we can compare and evaluate the performance of different writing style feature sets in estimating the classes of the selected users’ reputations. Table 2 lists the writing style features in social media adopted for this study. In this paper, the different writing style features are denoted as follows: lexical features F1, syntactic features F2, structural features F3, and content-specific features F4. The writing style features of Table 2 are based on the prior studies in Table 1, mainly Jiang et al. (2014). In addition, unlike the previous works, emotional writing style features are included in F1 for this study. Emoticons are graphic representations of facial expressions that often follow utterances in written CMC and are produced by ASCII symbols or by graphic symbols (Skovholt et al. 2014).
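As a rough illustration of what extracting lexical (F1) features from a post can look like, the sketch below computes a handful of counts. The feature names, the emoticon pattern, and the sample post are hypothetical and do not reproduce the feature list of Table 2.

```python
import re

# Assumed pattern for a few ASCII emoticons such as :) ;-( =D
EMOTICON = re.compile(r"[:;=]-?[)(DP]")

def lexical_features(post):
    """Compute a few illustrative lexical features for one post."""
    words = post.split()
    n_chars = len(post)
    return {
        "char_count": n_chars,
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in post) / n_chars if n_chars else 0.0,
        "emoticon_count": len(EMOTICON.findall(post)),
    }

features = lexical_features("Great point :) I agree 100% :-D")
print(features)
```

In a full pipeline, such per-post values would be aggregated per user before classification; how that aggregation is done is not specified here.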

Table 2 Writing style features that are extracted from the posts of selected users for this study

As a result, the four types of writing style features, i.e. F1, F2, F3, F4, are obtained after feature extraction. Based on those different types of writing style features, four feature sets are constructed in an incremental way: feature set F1; feature set F1 + F2; feature set F1 + F2 + F3; feature set F1 + F2 + F3 + F4. This incremental order implies the evolutionary sequence of features (Zheng et al. 2006; Abbasi and Chen 2008; Zhang et al. 2011).
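The incremental construction of the feature sets, and the 32 configurations obtained by pairing them with the eight classification techniques, can be enumerated with a short sketch. The variable names are illustrative only.

```python
from itertools import product

feature_types = ["F1", "F2", "F3", "F4"]
# Cumulative sets: [F1], [F1, F2], [F1, F2, F3], [F1, F2, F3, F4]
feature_sets = [feature_types[:k] for k in range(1, len(feature_types) + 1)]

base_learners = ["C4.5", "NN", "SVM", "NB"]
techniques = base_learners + [f"RS-{b}" for b in base_learners]

# Every (feature set, technique) pair is one experimental configuration.
configurations = [(" + ".join(fs), t) for fs, t in product(feature_sets, techniques)]
print(len(configurations))  # 32
```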

Next, for feature selection, the information gain (IG) heuristic is adopted due to its reported effectiveness in previous online text classification. IG (C, A) measures the decrease in entropy of a class C when a feature A is provided (Quinlan 1986; Shannon 1948; Zhang et al. 2011). The decrease in entropy reflects the additional information gained by adding feature A, and higher values between 0 and 1 indicate more information gained by providing certain features (Zhang et al. 2011). In this study, writing style features with IG (C, A) > 0.0025 are selected, following previous related works (Yang and Pedersen 1997; Abbasi et al. 2008b; Zhang et al. 2011).
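A minimal sketch of IG-based selection follows, computing IG (C, A) as the class entropy minus the class entropy conditioned on the feature. It assumes features have already been discretized (here, binary presence flags); the feature names and data are hypothetical, and the discretization used in the study is not described in this section.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(C, A) = H(C) - H(C | A) for a discrete feature A."""
    n = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def select_features(labels, columns, threshold=0.0025):
    """Keep the features whose information gain exceeds the threshold."""
    return [name for name, values in columns.items()
            if information_gain(labels, values) > threshold]

labels = ["Good", "Good", "Bad", "Bad"]
columns = {
    "uses_emoticon": [1, 1, 0, 0],  # perfectly separates the classes
    "has_digit": [1, 0, 1, 0],      # carries no class information
}
print(select_features(labels, columns))
```

With two balanced classes the class entropy is 1 bit, which is why IG values fall between 0 and 1 in this setting.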

Segment the selected users’ reputations into Good and Bad classes

The selected Web forum users are segmented into Good and Bad reputation groups based on the ratings of their posts. Social media offers various types of past ratings on posts, e.g. helpfulness of reviews and reviewer rankings on Amazon.com, likes on Facebook, retweets on Twitter, ratings of sellers on eBay, and ratings of answers on Q&A sites, but in general they reduce to two types of online user feedback: like and dislike. Hence, this paper proposes four ways of dividing social media users’ reputations into two classes: Good and Bad. The four approaches are indexed by the segmenting type s ∈ {like, dislike, sum, portfolio}, and the reputation classes for users, reputation_s, are respectively defined as

$$ reputation_{\text{like}}(user_{i}) = \begin{cases} \text{Good} & \text{if } like(user_{i}) \ge m_{\text{like}} \\ \text{Bad} & \text{otherwise} \end{cases}, $$

(1)

where like(user_i) is the number of likes that user_i obtained per post in social media, and m_like is the average of like(user_i) for i = 1, …, N.

$$ reputation_{\text{dislike}}(user_{i}) = \begin{cases} \text{Good} & \text{if } dislike(user_{i}) < m_{\text{dislike}} \\ \text{Bad} & \text{otherwise} \end{cases}, $$

(2)

where dislike(user_i) is the number of dislikes that user_i obtained per post in social media, and m_dislike is the average of dislike(user_i) for i = 1, …, N.

$$ reputation_{\text{sum}}(user_{i}) = \begin{cases} \text{Good} & \text{if } sum(user_{i}) \ge m_{\text{sum}} \\ \text{Bad} & \text{otherwise} \end{cases}, $$

(3)

where sum(user_i) is equal to like(user_i) − dislike(user_i), and m_sum is the average of sum(user_i) for i = 1, …, N.

$$ reputation_{\text{portfolio}}(user_{i}) = \begin{cases} \text{Good} & \text{if } like(user_{i}) \ge m_{\text{like}} \text{ and } dislike(user_{i}) < m_{\text{dislike}} \\ \text{Bad} & \text{otherwise} \end{cases}. $$

(4)
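Equations (1)–(4) translate directly into code. The sketch below applies the four segmenting types to a few made-up users; the function name, the input layout, and the numbers are assumptions for illustration.

```python
from statistics import mean

def segment_users(users):
    """users maps a user id to (likes per post, dislikes per post)."""
    m_like = mean(like for like, _ in users.values())
    m_dislike = mean(dislike for _, dislike in users.values())
    m_sum = mean(like - dislike for like, dislike in users.values())
    classes = {}
    for uid, (like, dislike) in users.items():
        good_like = like >= m_like            # Eq. (1)
        good_dislike = dislike < m_dislike    # Eq. (2)
        good_sum = like - dislike >= m_sum    # Eq. (3)
        classes[uid] = {
            "like": "Good" if good_like else "Bad",
            "dislike": "Good" if good_dislike else "Bad",
            "sum": "Good" if good_sum else "Bad",
            # Eq. (4): Good only if both the like and dislike conditions hold
            "portfolio": "Good" if good_like and good_dislike else "Bad",
        }
    return classes

users = {"u1": (4.0, 1.0), "u2": (1.0, 3.0), "u3": (4.0, 5.0)}
classes = segment_users(users)
for uid, result in classes.items():
    print(uid, result)
```

User u3 shows why the portfolio type is the strictest: many likes make it Good under the like type, but its many dislikes make it Bad under the portfolio type.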

Estimate reputation_s of the selected users

This paper adopts four base learners, i.e. C4.5, NN, SVM, and NB, which are commonly used in previous studies of writing styles in social media. Moreover, RS as an ensemble learning method is combined with the four base learners, resulting in RS-C4.5, RS-NN, RS-SVM, and RS-NB. In total, these eight classification techniques are used for this study. For an experiment, 100 users are randomly selected from the collected data for each reputation class, and tenfold cross-validation is performed to train and evaluate a classifier. To implement the eight classification techniques, the data mining toolkit WEKA (Waikato Environment for Knowledge Analysis) version 3.7.0 is used with all the default parameters because it is the most commonly used open-source toolkit with a collection of machine learning algorithms for solving data mining problems (Wang et al. 2014). In detail, the WEKA modules for the algorithms used in this study are as follows: the J48 module for C4.5, the Multilayer Perceptron module for NN, the SMO module for SVM, the Naïve Bayes module for NB, and the Random Subspace module for RS.
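The sampling and validation protocol can be sketched independently of WEKA. The code below draws 100 users per class and deals them into ten folds; stratifying the folds by class is an assumption of this sketch, and the user ids, pool sizes, and function names are hypothetical.

```python
import random

def sample_balanced(users_by_class, per_class=100, seed=0):
    """Randomly draw the same number of users from each reputation class."""
    rng = random.Random(seed)
    return {c: rng.sample(ids, per_class) for c, ids in users_by_class.items()}

def stratified_folds(sample, k=10, seed=0):
    """Shuffle each class and deal its users round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label, ids in sample.items():
        shuffled = ids[:]
        rng.shuffle(shuffled)
        for i, uid in enumerate(shuffled):
            folds[i % k].append((uid, label))
    return folds

def cross_validation_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical pools of user ids for the two reputation classes.
users_by_class = {"Good": [f"g{i}" for i in range(500)],
                  "Bad": [f"b{i}" for i in range(700)]}
sample = sample_balanced(users_by_class)
folds = stratified_folds(sample)
splits = list(cross_validation_splits(folds))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 180 20
```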

Evaluate results with comparisons

To assess the performance of each feature set and each classification technique, this paper adopts the standard classification performance metrics. For the given segmenting type s, they are defined as

$$ accuracy_{s} = \frac{{|\{ {\text{users}}\;{\text{classified}}\;{\text{correctly}}\;{\text{either}}\;{\text{as}}\;reputation_{s} = {\text{Good}}\;{\text{or}}\;reputation_{s} = {\text{Bad}}\} |}}{{|\{ {\text{total}}\;{\text{users}}\;{\text{belonging}}\;{\text{either}}\;{\text{to}}\;reputation_{s} = {\text{Good}}\;{\text{or}}\;reputation_{s} = {\text{Bad}}\} |}} \, . $$

(5)

$$ precision_{s} (i) = \frac{{ | {\text{\{ users}}\;{\text{classified}}\;{\text{correctly}}\;{\text{as}}\;reputation_{s} = i\}|}}{{ | {\text{\{ users}}\;{\text{classified}}\;{\text{either}}\;{\text{correctly}}\;{\text{or}}\;{\text{falsely}}\;{\text{as}}\;reputation_{s} = i\} |}}\quad {\text{for }}i = {\text{Good}},\;{\text{Bad}} . $$

(6)

$$ recall_{s} (i) = \frac{{ |\{ {\text{users}}\;{\text{classified}}\;{\text{correctly}}\;{\text{as}}\;reputation_{s} = i\} |}}{{ |\{ {\text{users}}\;{\text{belonging}}\;{\text{to}}\;reputation_{s} \;{ = }\;i\} |}}\quad {\text{for }}i \, = {\text{ Good}},{\text{ Bad}} . $$

(7)

$$ F\text{-}measure_{s}(i) = \frac{2 \times precision_{s}(i) \times recall_{s}(i)}{precision_{s}(i) + recall_{s}(i)}\quad \text{for } i = \text{Good},\;\text{Bad}. $$

(8)

To enhance understanding, Fig. 2 illustrates each metric of Eqs. (5)–(7). These four metrics have been widely used in information retrieval and text classification studies (Abbasi et al. 2008b; Li et al. 2008; Zhang et al. 2011). Among the four standard measures, accuracy assesses the overall classification correctness, while the others evaluate the correctness regarding each class. Therefore, this paper performs comparisons with respect to accuracy.
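For concreteness, the four metrics of Eqs. (5)–(8) can be computed from lists of actual and predicted classes as in the following sketch. The function name and the toy labels are illustrative; returning 0 when a denominator is empty is a convention chosen here, not one stated in the paper.

```python
def evaluate(actual, predicted, classes=("Good", "Bad")):
    """Accuracy plus per-class precision, recall, and F-measure."""
    pairs = list(zip(actual, predicted))
    report = {"accuracy": sum(a == p for a, p in pairs) / len(pairs)}
    for c in classes:
        true_pos = sum(a == c and p == c for a, p in pairs)
        n_predicted = sum(p == c for _, p in pairs)
        n_actual = sum(a == c for a, _ in pairs)
        precision = true_pos / n_predicted if n_predicted else 0.0
        recall = true_pos / n_actual if n_actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f_measure": f_measure}
    return report

actual = ["Good", "Good", "Good", "Good", "Bad", "Bad"]
predicted = ["Good", "Good", "Bad", "Bad", "Bad", "Bad"]
report = evaluate(actual, predicted)
print(report)
```

In this toy case the classifier is precise but not exhaustive for Good (precision 1.0, recall 0.5) and the reverse for Bad, while accuracy summarizes both classes in a single number.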

The comparisons are done by pairwise t tests because pairwise t test comparisons are the simplest kind of statistical tests and commonly used for comparing the performance of two algorithms (Derrac et al. 2011). To check whether the average difference in two approaches’ performance is significantly different from zero, this paper repeats the same experiments 50 times in two ways. First, to examine the effect of adding one feature set on accuracy for a certain classification technique, this paper conducts 24 (=three feature set comparisons × eight classification techniques) individual pairwise t tests. Second, to compare the performance of classification techniques on accuracy for a certain feature set, the paper conducts 64 (=24 + 24 + 16) individual pairwise t tests: 24 (=six technique comparisons × four feature sets) between base learners, i.e. Base versus Base, 24 (=six technique comparisons × four feature sets) between ensemble learning methods of RS, i.e. Ensemble versus Ensemble, and 16 (=four technique comparisons × four feature sets) between base learners and ensemble learning methods of RS, i.e. Base versus Ensemble.
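The sketch below shows how a paired t statistic is obtained from repeated runs and how the test counts above add up. The accuracy values are invented for illustration (five runs instead of 50), and two readings are assumed: that the feature set comparisons are between consecutive sets, and that Base versus Ensemble pairs each base learner with its own RS variant.

```python
import math
from itertools import combinations
from statistics import mean, stdev

def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees of freedom) for paired accuracy samples."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # fails if all diffs are equal
    return mean(diffs) / standard_error, n - 1

# Hypothetical accuracies of two techniques over the same five runs.
acc_first = [0.90, 0.92, 0.88, 0.91, 0.89]
acc_second = [0.85, 0.86, 0.84, 0.88, 0.83]
t_value, dof = paired_t_statistic(acc_first, acc_second)
print(round(t_value, 2), dof)

base = ["C4.5", "NN", "SVM", "NB"]
ensembles = [f"RS-{b}" for b in base]
n_feature_sets = 4

# Consecutive feature sets: F1 vs F1+F2, F1+F2 vs F1+F2+F3, and so on.
feature_tests = (n_feature_sets - 1) * len(base + ensembles)
base_vs_base = len(list(combinations(base, 2))) * n_feature_sets
ens_vs_ens = len(list(combinations(ensembles, 2))) * n_feature_sets
base_vs_ens = len(list(zip(base, ensembles))) * n_feature_sets
technique_tests = base_vs_base + ens_vs_ens + base_vs_ens
print(feature_tests, technique_tests)  # 24 64
```

The resulting t value is compared against the t distribution with n − 1 degrees of freedom to decide whether the mean difference departs significantly from zero.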

Experimental results and discussions

Test bed: a Korean Web forum

This study targeted South Korea as a test bed country because South Korea has shown that social media can be used not only to exchange information and opinions, but also to organize street protests and empower people to be active in them (Suh et al. 2010; Suh 2015). In particular, South Korea’s Web forum, called Daum Agora (http://agora.media.daum.net), was chosen as the test bed social media for three reasons: first, Daum Agora is one of the most popular and anonymous Web forums in South Korea; second, it has been used since the early age of social media, i.e. the mid-2000s, and hence contains large-scale data with millions of posts and comments starting from August 2004; third, it operates on a national scale, so most of the main topics in South Korea are discussed through Daum Agora (Suh 2015). In this sense, Daum Agora is a sufficient and suitable representative of the Web forums of South Korea.

The Web crawler program collected the posts generated on Daum Agora over the 5 years from 2007 to 2011 and stored them in the relational database for the experiments. In total, data on 2,565,918 posts from 91,968 users were collected. Among the collected users, those whose posts had been evaluated at least once by others were selected for the experiments, amounting to 22,131 users. Based on the collected data, the writing style features of these 22,131 users were extracted.

Next, the online user feedback on the posts of the selected 22,131 users was extracted from the collected data. In the test bed Daum Agora, there are two types of online user feedback: like and dislike. Based on these, the classes of the selected 22,131 users’ reputations were obtained by the different segmenting types. Table 3 shows the number of users that belong to each user reputation class under each segmenting type. Moreover, from the selected 22,131 users, 100 users and their posts were randomly sampled for each class of user reputations in an experiment.

Table 3 The number of users that belong to each class of reputation_s

Evaluations and discussions

Table 4 shows the experimental results on accuracy for different writing style feature sets and different classification techniques. The key findings are as follows. First, the feature set F1 + F2 + F3 + F4 gave the best accuracy for all the segmenting types except segmenting type s = sum, whereas RS-SVM gave the highest accuracy regardless of segmenting type. Second, among the 32 combinations, the feature set F1 + F2 + F3 + F4 with RS-SVM achieved the highest accuracy across all the segmenting types, i.e. 94.50 %. Overall, the results in Table 4 indicate the superiority of the feature set F1 + F2 + F3 + F4 and of RS-SVM. This is in line with the expectations stated in the Literature Reviews section; a possible reason is that both are well suited to handling tens of thousands of features, so they work best in combination.

Table 4 Accuracy (%) for different feature sets and different classification techniques

In addition, the best accuracy was identified for each of the four segmenting types, and these best accuracies were compared with each other. Across the segmenting types, the reputation_portfolio type gave the highest accuracy, i.e. 94.50 %, when the feature set F1 + F2 + F3 + F4 and the classification technique RS-SVM were used. On the other hand, the reputation_sum type gave the lowest best accuracy, i.e. 82.50 %, when the feature set F1 + F2 + F3 and the classification technique RS-SVM were used. Thus, the more accurate segmentation of user reputations by the portfolio approach appears to have contributed to its higher accuracy compared with the other segmenting types. In contrast, the reputation_sum type made it more difficult to classify users whose like(user_i) and dislike(user_i) are in strong conflict.
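
To make the Random Subspace (RS) idea behind RS-SVM concrete, the sketch below trains each ensemble member on a random subset of the feature columns and combines the members by majority vote. It is an illustration under stated assumptions, not the configuration used in the experiments: a simple nearest-centroid learner stands in for SVM so that the example needs only the standard library, and the class names, the parameters `n_members` and `subspace_ratio`, and the toy data are all hypothetical.

```python
import random
from collections import Counter

class NearestCentroid:
    """Stand-in base learner (not SVM): predicts the class with the closest mean vector."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, cls in zip(X, y):
            acc = sums.setdefault(cls, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict_one(self, row):
        def dist2(cls):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[cls]))
        return min(sorted(self.centroids), key=dist2)


class RandomSubspaceEnsemble:
    """Each member sees only a random subset of the features; prediction is a majority vote."""

    def __init__(self, n_members=11, subspace_ratio=0.5, seed=0):
        self.n_members = n_members
        self.subspace_ratio = subspace_ratio
        self.seed = seed

    def fit(self, X, y):
        rng = random.Random(self.seed)
        n_features = len(X[0])
        k = max(1, int(n_features * self.subspace_ratio))
        self.members = []
        for _ in range(self.n_members):
            idx = sorted(rng.sample(range(n_features), k))  # feature subset for this member
            X_sub = [[row[j] for j in idx] for row in X]
            self.members.append((idx, NearestCentroid().fit(X_sub, y)))
        return self

    def predict_one(self, row):
        votes = Counter(m.predict_one([row[j] for j in idx]) for idx, m in self.members)
        return votes.most_common(1)[0][0]


# Toy data (hypothetical): four writing style feature values per user.
X_toy = [[1.0, 0.9, 1.1, 1.0], [0.9, 1.0, 1.0, 1.1],
         [0.0, 0.1, -0.1, 0.0], [0.1, 0.0, 0.0, -0.1]]
y_toy = ["Good", "Good", "Bad", "Bad"]
ens = RandomSubspaceEnsemble(n_members=11, subspace_ratio=0.5, seed=0).fit(X_toy, y_toy)
```

Because each member works in a lower-dimensional subspace, the approach is commonly motivated for very large feature sets such as the tens of thousands of writing style features mentioned above.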

Table 5 shows the evaluation results on precision, recall, and F-measure. The feature set F1 + F2 + F3 + F4 achieved the highest precisions: 98.88 % with RS-NN for reputation_dislike = Good, and 94.95 % with RS-SVM for reputation_portfolio = Bad. Next, the feature set F1 + F2 + F3 + F4 and RS-NN gave the highest recalls: 100 % for reputation_sum = Good, and 99.00 % for reputation_dislike = Bad. The highest F-measures were achieved by the feature set F1 + F2 + F3 + F4 combined with RS-SVM when segmenting type s = portfolio, i.e. 94.53 % for reputation_portfolio = Good, and 94.47 % for reputation_portfolio = Bad. Taken together, these results show that the feature set F1 + F2 + F3 + F4 and ensemble learning methods gave the best precision, recall, and F-measure for both Good and Bad classes of user reputations.
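
For reference, the per-class measures reported above follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure as the harmonic mean of the two. The sketch below computes them for one class treated as the positive class; the function name `class_metrics` and the toy labels are assumptions made for illustration and do not reproduce any value in Table 5.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, F-measure) for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Toy labels (hypothetical): four Good users and four Bad users.
y_true = ["Good", "Good", "Good", "Good", "Bad", "Bad", "Bad", "Bad"]
y_pred = ["Good", "Good", "Good", "Bad", "Good", "Bad", "Bad", "Bad"]
good_metrics = class_metrics(y_true, y_pred, positive="Good")
bad_metrics = class_metrics(y_true, y_pred, positive="Bad")
```

Computing the measures once with Good as the positive class and once with Bad as the positive class yields the two rows of figures that are reported separately for each class.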

Table 5 Performance measures (%) for different feature sets and different classification techniques

Results on comparative studies

On comparisons of different feature sets

Table 6 shows the results of the pairwise t tests conducted to examine the effect of different feature sets on accuracy for a given classification technique. It reveals that, regardless of segmenting type, adding one type of writing style feature improved classification accuracy in most cases, the exception being the structural features F3. A likely reason for the insignificant effect of adding F3 is that this feature set is small, so it adds less representational capability than the other feature types.
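
A pairwise (paired) t test of this kind compares two configurations on matched accuracy observations. The sketch below computes the paired t statistic t = mean(d) / (sd(d) / sqrt(n)) from the differences d between matched observations. The pairing unit (for instance, accuracies from the same folds or runs) and the toy numbers are assumptions for illustration; the function name `paired_t_statistic` is hypothetical, and looking up the p value from the t distribution with n - 1 degrees of freedom is left out.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_d == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Toy accuracies in % (hypothetical): a richer feature set versus a smaller one.
acc_rich = [90, 92, 88, 91, 89]
acc_small = [85, 88, 86, 87, 84]
t_value, dof = paired_t_statistic(acc_rich, acc_small)
```

A large positive t value on these toy numbers would indicate that the first configuration is consistently more accurate than the second across the matched observations.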

Table 6 Pairwise t tests on accuracy for different feature sets

Moreover, the feature set F1 + F2 + F3 + F4 gave the best results for all eight classification techniques, regardless of the segmenting type. This suggests that the four feature types offer complementary discriminatory potential when they are used together. Thus, a large set of rich writing style features is beneficial for automated classification of the reputations of social media users. In particular, the results show that adding the content-specific writing style features F4 contributes to the best accuracy, as expected in the Literature Reviews section. This indicates that keywords and phrases on certain topics are a more important basis for judging users in social media than the content-free writing style features.

On comparisons of different classification techniques

Table 7 shows the results of the pairwise t tests performed to investigate the effect of different classification techniques on accuracy for a given feature set. For a given segmenting type s, classification techniques were compared in three parts: Base versus Base, Ensemble versus Ensemble, and Base versus Ensemble. Table 7 shows that the ranking of the eight classification techniques differs according to the selected feature set.

Table 7 Pairwise t tests on accuracy for different classification techniques

Table 7 yields two key findings. First, no single classification technique gave the best accuracy for all the feature sets in any given segmenting type. Second, ensemble learning methods are better than base learners in most configurations. The reason is that the ensemble learning methods consider the writing style features in their entirety, whereas the base learners only consider the average of the aggregated writing style features. This difference allowed the ensemble learning methods to preserve important information better than the base learners, and resulted in better accuracies.

On comparisons of different ways in defining the classes of user reputations

In Table 5, regarding segmenting types, it is notable that the segmenting type s = dislike gave the highest precision for both Good and Bad classes. This means that segmenting users by their dislike scores is the best way in terms of precision. Moreover, the segmenting type s = portfolio gave the highest F-measure when the feature set F1 + F2 + F3 + F4 and RS-SVM are combined: 94.53 % for the Good class and 94.47 % for the Bad class. One possible reason is that the more accurate segmentation of users by the portfolio approach contributed to a higher F-measure than the other segmenting types. However, because the segmenting type s = portfolio classifies users more strictly into the Bad class, its best precision and recall were lower than the best values of the other segmenting types.

Moreover, in Table 7, it is seen that, when reputation_portfolio was used, the feature set F1 + F2 + F3 + F4 and RS-SVM gave the best accuracy among all the configurations. A possible reason is that reputation_portfolio classified users into the Good class only if they are certainly good, and strictly filtered users into the Bad class if it is unclear whether they belong to the Good or the Bad class.

Conclusions

This paper proposed a research framework to design and examine an automatic system that classifies the reputations of social media users into Good and Bad classes by adopting writing styles. The framework was applied to Daum Agora, one of the most popular Web forums in South Korea, which was selected as a test bed.

The experimental results in Table 4 show that the configuration of the feature set F1 + F2 + F3 + F4 and RS-SVM gave the best accuracy, i.e. 94.50 %, when segmenting type s = portfolio. This shows that it is possible to classify user reputations in social media by writing style features with high accuracy (RQ1 is answered). In Table 6, the pairwise t tests on accuracy for different feature sets show that the feature set F1 + F2 + F3 + F4 ranked the best for all eight classification techniques regardless of segmenting type (RQ2 is answered). This suggests that keywords and phrases on certain topics affect user reputations more than the content-free writing style features do. According to Table 7, the pairwise t tests on accuracy for different classification techniques show that no single classification technique gave the best accuracy for all the feature sets in any given segmenting type, but ensemble learning methods turned out better than base learners (RQ3 is answered). The experimental results related to RQ2 and RQ3 indicate that the feature set F1 + F2 + F3 + F4 and the ensemble learning methods are each well suited to handling a large set of writing style features, and that this common strength produced a synergy effect. In addition, the paper concluded that combining the two types of online user feedback by the portfolio approach, i.e. segmenting type s = portfolio, gave better accuracy than the other segmenting types (RQ4 is answered). A potential explanation is that, because the suggested portfolio approach segments user reputations more strictly into Good and Bad classes, it is better able to address the problem of this study.

This paper contributes to the literature as follows. First, this study is the first work that adopts writing styles as objective features to automatically classify social media user reputations into Good and Bad classes. Second, this paper provides guidelines for system implementation in two ways: (1) which writing style features and classification technique should be used together for the best accuracy; (2) which segmenting type gives the best result with respect to accuracy. In particular, because social media measure user reputations in similar ways, namely through online user feedback such as like, dislike, or both, the results can be used as a reference for similar studies on other types of social media. Third, the paper helps maintain a healthy and trustworthy social media ecosystem by protecting users from bad users, and it makes it possible to manage user reputations that have been manipulated to be either lower or higher than their original values. As a consequence, it helps build trust between users by complementing the online user feedback system in social media.

Directions for further studies can be suggested based on this paper as follows. First, South Korea was selected as the test bed country for the reasons given above, but other target countries and a wider variety of languages may lead to additional implications. Hence, future research on other countries and regions, e.g. the US, Europe, China, Japan, and the Middle East, is recommended. Second, this study focused on writing styles as objective features to classify user reputations in social media, but other objective features may also be useful, e.g. the network structures of communications between users and their commenters. Third, in a similar vein, simpler or more sophisticated approaches should be considered to tackle the computational problem that ensemble learning methods take a great deal of time. Thus, further studies can revisit the problems and challenging issues that motivated this study from different perspectives on countries, languages, features, and techniques.

References

Abbasi A, Chen HC (2005) Applying authorship analysis to extremist-group web forum messages. IEEE Intell Syst 20(5):67–75. doi:10.1109/Mis.2005.81
Abbasi A, Chen H (2008) Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Trans Inf Syst. doi:10.1145/1344411.1344413
Abbasi A, Chen HC (2009) A comparison of fraud cues and classification methods for fake escrow website detection. Inf Technol Manag 10(2–3):83–101. doi:10.1007/s10799-009-0059-0
Abbasi A, Chen HC, Nunamaker JF (2008a) Stylometric identification in electronic markets: scalability and robustness. J Manag Inf Syst 25(1):49–78. doi:10.2753/Mis0742-1222250103
Abbasi A, Chen HC, Salem A (2008b) Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums. ACM Trans Inf Syst. doi:10.1145/1361684.1361685
Agichtein E, Castillo C, Donato D, Gionis A, Mishne G (2008) Finding high-quality content in social media. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, Palo Alto, California, USA, pp 183–194. doi:10.1145/1341531.1341557
Agudo I, Fernandez-Gago C, Lopez J (2010) A scale based trust model for multi-context environments. Comput Math Appl 60(2):209–216. doi:10.1016/j.camwa.2010.02.009
Argamon S, Šarić M, Stein SS (2003) Style mining of electronic messages for multiple authorship discrimination. In: Proceeding of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Washington, D.C., USA, pp 475–480. doi:10.1145/956750.956805
Argamon S, Whitelaw C, Chase P, Hota SR, Garg N, Levitan S (2007) Stylistic text classification using functional lexical features. J Am Soc Inf Sci Tec 58(6):802–822. doi:10.1002/Asi.20553
Barnard GA (1958) Studies in the history of probability and statistics: IX. Thomas Bayes’s essay towards solving a problem in the doctrine of chances. Biometrika 45(3–4):293–295. doi:10.1093/biomet/45.3-4.293
Beato F, Meul S, Preneel B (2015) Practical identity-based private sharing for online social networks. Comput Commun. doi:10.1016/j.comcom.2015.07.009
Benjamin V, Chen H (2012) Securing cyberspace: identifying key actors in hacker communities. In: Proceedings of the 2012 IEEE International conference on intelligence and security informatics (ISI), Arlington, Virginia, USA, pp 24–29. doi:10.1109/isi.2012.6283296
Christie C, Dill E (2016) Evaluating peers in cyberspace: the impact of anonymity. Comput Hum Behav 55(Part A):292–299. doi:10.1016/j.chb.2015.09.024
Christopherson KM (2007) The positive and negative implications of anonymity in Internet social interactions: “On the Internet, Nobody Knows You’re a Dog”. Comput Hum Behav 23(6):3038–3056. doi:10.1016/j.chb.2006.09.001
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines: and other kernel-based learning methods. Cambridge University Press, New York
de Zuniga HG (2012) Social media use for news and individuals’ social capital, civic engagement and political participation. J Comput Mediat Comm 17(3):319–336. doi:10.1111/j.1083-6101.2012.01574.x
Derrac J, García S, Molina D, Herrera F (2011) A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol Comput 1(1):3–18. doi:10.1016/j.swevo.2011.02.002
Diederich J, Kindermann O, Leopold E, Paass G (2003) Authorship attribution with support vector machines. Appl Intell 19(1–2):109–123. doi:10.1023/A:1023824908771
Enders A, Hungenberg H, Denker H-P, Mauch S (2008) The long tail of social networking: revenue models of social networking sites. Eur Manag J 26(3):199–211. doi:10.1016/j.emj.2008.02.002
Erickson T, Kellogg WA (2000) Social translucence: an approach to designing systems that support social processes. ACM Trans Comput-Hum Interact 7(1):59–83. doi:10.1145/344949.345004
Giles CL, Sun R, Zurada JM (1998) Neural networks and hybrid intelligent models: foundations, theory, and applications. IEEE Trans Neural Netw 9(5):721–723. doi:10.1109/TNN.1998.712147
Golbeck JA (2005) Computing and applying trust in web-based social networks. University of Maryland, College Park
Huang Z, Chung W, Chen H (2004) A graph model for E-commerce recommender systems. J Am Soc Inf Sci Technol 55(3):259–274. doi:10.1002/asi.10372
Huang CN, Fu TJ, Chen HC (2010) Text-based video content classification for online video-sharing sites. J Am Soc Inf Sci Technol 61(5):891–906. doi:10.1002/Asi.21291
Iqbal F, Binsalleeh H, Fung BCM, Debbabi M (2013) A unified data mining solution for authorship analysis in anonymous textual communications. Inf Sci 231:98–112. doi:10.1016/j.ins.2011.03.006
Jiang S, Chen H, Nunamaker JF, Zimbra D (2014) Analyzing firm-specific social media and market: A stakeholder-based event analysis framework. Decis Support Syst 67:30–39. doi:10.1016/j.dss.2014.08.001
Jin R, Chai JY, Si L (2004) An automatic weighting scheme for collaborative filtering. In: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, United Kingdom, pp 337–344. doi:10.1145/1008992.1009051
Joachims T (2002) Learning to classify text using support vector machines: methods, theory and algorithms. Kluwer Academic Publishers, Dordrecht
Jøsang A, Ismail R, Boyd C (2007) A survey of trust and reputation systems for online service provision. Decis Support Syst 43(2):618–644. doi:10.1016/j.dss.2005.05.019
Kaplan AM, Haenlein M (2010) Users of the world, unite! The challenges and opportunities of social media. Bus Horiz 53(1):59–68. doi:10.1016/j.bushor.2009.09.003
Kim YH, Lewis FL (2000) Optimal design of CMAC neural-network controller for robot manipulators. IEEE Trans Syst Man Cybern Part C Appl Rev 30(1):22–31. doi:10.1109/5326.827451
Koppel M, Schler J, Argamon S (2009) Computational methods in authorship attribution. J Am Soc Inf Sci Technol 60(1):9–26. doi:10.1002/Asi.20961
Lai CH, Liu DR, Lin CS (2013) Novel personal and group-based trust models in collaborative filtering for document recommendation. Inf Sci 239:31–49. doi:10.1016/j.ins.2013.03.030
Li Y, Lu L, Xuefeng L (2005) A hybrid collaborative filtering method for multiple-interests and multiple-content recommendation in E-commerce. Expert Syst Appl 28(1):67–77. doi:10.1016/j.eswa.2004.08.013
Li J, Zhang Z, Li X, Chen H (2008) Kernel-based learning for biomedical relation extraction. J Am Soc Inf Sci Technol 59(5):756–769. doi:10.1002/asi.v59:5
Li Y-M, Wu C-T, Lai C-Y (2013) A social recommender mechanism for e-commerce: combining similarity, trust, and relationship. Decis Support Syst 55(3):740–752. doi:10.1016/j.dss.2013.02.009
Liu H, Hu Z, Mian A, Tian H, Zhu X (2014) A new user similarity model to improve the accuracy of collaborative filtering. Knowl Based Syst 56:156–166. doi:10.1016/j.knosys.2013.11.006
O’Donovan J, Smyth B (2005) Trust in recommender systems. In: Proceedings of the 10th international conference on intelligent user interfaces, San Diego, California, USA, pp 167–174. doi:10.1145/1040830.1040870
Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6(3):21–45. doi:10.1109/MCAS.2006.1688199
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106. doi:10.1007/BF00116251
Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international conference on World Wide Web, Hong Kong
Shad Manaman H, Jamali S, AleAhmad A (2016) Online reputation measurement of companies based on user-generated content in online social networks. Comput Hum Behav 54:94–100. doi:10.1016/j.chb.2015.07.061
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423. doi:10.1002/j.1538-7305.1948.tb01338.x
Sherchan W, Nepal S, Paris C (2013) A survey of trust in social networks. ACM Comput Surv. doi:10.1145/2501654.2501661
Skovholt K, Gronning A, Kankaanranta A (2014) The communicative functions of emoticons in workplace e-mails: :-). J Comput Mediat Comm 19(4):780–797. doi:10.1111/jcc4.12063
Suh JH (2015) Forecasting the daily outbreak of topic-level political risk from social media using hidden Markov model-based techniques. Technol Forecast Soc 94:115–132. doi:10.1016/j.techfore.2014.08.014
Suh JH, Park CH, Jeon SH (2010) Applying text and data mining techniques to forecasting the trend of petitions filed to e-People. Expert Syst Appl 37(10):7255–7268. doi:10.1016/j.eswa.2010.04.002
Sun J, Wang G, Cheng X, Fu Y (2015) Mining affective text to improve social media item recommendation. Inform Process Manag 51(4):444–457. doi:10.1016/j.ipm.2014.09.002
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844. doi:10.1109/34.709601
Tolle KM, Chen HC, Chow HH (2000) Estimating drug/plasma concentration levels by applying neural networks to pharmacokinetic data sets. Decis Support Syst 30(2):139–151. doi:10.1016/S0167-9236(00)00094-4
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
de Vel O, Anderson A, Corney M, Mohay G (2001) Mining e-mail content for author identification forensics. SIGMOD Rec 30(4):55–64. doi:10.1145/604264.604272
Wang G, Sun JS, Ma J, Xu KQ, Gu JB (2014) Sentiment classification: the contribution of ensemble learning. Decis Support Syst 57:77–93
Widrow B, Rumelhart DE, Lehr MA (1994) Neural networks: applications in industry, business and science. Commun ACM 37(3):93–105. doi:10.1145/175247.175257
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the fourteenth international conference on machine learning
Yang YM, Slattery S, Ghani R (2002) A study of approaches to hypertext categorization. J Intell Inf Syst 18(2–3):219–241. doi:10.1023/A:1013685612819
Yang X, Guo Y, Liu Y, Steck H (2014) A survey of collaborative filtering based social recommender systems. Comput Commun 41:1–10. doi:10.1016/j.comcom.2013.06.009
Zhang YL, Dang Y, Chen HC (2011) Gender classification for web forums. IEEE T Syst Man Cy A 41(4):668–677. doi:10.1109/Tsmca.2010.2093886
Zhao L, Hua T, Lu C-T, Chen I-R (2015) A topic-focused trust model for Twitter. Comput Commun. doi:10.1016/j.comcom.2015.08.001
Zheng R, Li JX, Chen HC, Huang Z (2006) A framework for authorship identification of online messages: writing-style features and classification techniques. J Am Soc Inform Sci Technol 57(3):378–393. doi:10.1002/Asi.20316
Zhou Z-H (2012) Ensemble methods: foundations and algorithms, 1st edn. Chapman and Hall/CRC, London

Acknowledgements

This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2013R1A1A3011816).

Competing interests

The author declares that he has no competing interests.

Author information

Authors and Affiliations

Moon Soul Graduate School of Future Strategy, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
Jong Hwan Suh

Corresponding author

Correspondence to Jong Hwan Suh.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Suh, J.H. Comparing writing style feature-based classification methods for estimating user reputations in social media. SpringerPlus 5, 261 (2016). https://doi.org/10.1186/s40064-016-1841-1

Received: 12 October 2015
Accepted: 15 February 2016
Published: 02 March 2016
DOI: https://doi.org/10.1186/s40064-016-1841-1
