Automatic topic identification of health-related messages in online health community using text classification
© Lu; licensee Springer. 2013
Received: 11 December 2012
Accepted: 11 June 2013
Published: 10 July 2013
To help patients participate in online health communities and obtain the informative and emotional support they need, this paper proposes an approach for automatically identifying the topics of health-related messages, so that patients can efficiently reach the messages most relevant to their queries. We present a feature-based classification framework for automatic topic identification. We first collected messages related to a set of predefined topics from an online health community. We then combined three types of features (n-gram-based features, domain-specific features and sentiment features) to build four feature sets for representing health-related text. Finally, three text classification techniques, C4.5, Naïve Bayes and SVM, were used to evaluate our topic classification model. Comparing the feature sets and classification techniques, we found that n-gram-based features, domain-specific features and sentiment features were all effective in distinguishing different types of health-related topics. In addition, feature reduction based on information gain further improved topic classification performance. Among the classification techniques, SVM significantly outperformed C4.5 and Naïve Bayes. The experimental results demonstrate that the proposed approach can identify the topics of online health-related messages efficiently.
Keywords: Online health community; Text classification; Topic identification; Topic classification
Health issues are a primary concern for many people. Patients diagnosed with serious diseases, and their caregivers, in particular often seek explanatory information about the disease or its treatment. Traditionally, health professionals such as doctors have been the primary source of medical information for patients. However, health professionals often fail to fully meet patients' information needs. Many patients feel that doctors are too busy to answer their questions (Umefjord et al. 2003), and many doctors give their patients only basic medical information without taking the time required to explain the details fully (Dickerson et al. 2006). This view is supported by Tyson (2000), who suggested that the current doctor-patient relationship lacks attention to detail. The Internet is changing the way people experience life, including healthcare. Many studies on health communication have shown that patients increasingly use the Internet for health information and support. A study of US-based cancer patients and their caregivers indicated that 80% of them were interested in health-related information on the Internet and 65% expressed an interest in online support groups (Monnier et al. 2002). Patients with chronic diseases in particular prefer to search for health information online to be better informed about their illnesses (Bansil et al. 2006). In recent years, with the advent of social media services such as Wikipedia, Facebook, online forums and medical Q&A sites, patients have become more likely to obtain health information and share health experiences on these websites (Kinnane & Milne 2010). A recent survey showed that 34% of Internet users have read someone else's commentary or experience about health issues in an online news group, website or blog, and 24% of them have consulted online reviews of particular drugs or medical treatments (Fox & Jones 2009).
These health-related social media services enable patients to take a more active role in decisions about their health through social support and the ability to explore treatment options (Gerber & Eiser 2001). Convenience and anonymity are also important reasons why patients turn to the Internet: they can obtain health-related knowledge easily and quickly, and they are not embarrassed to ask health professionals online or to communicate with other members about their conditions (Anderson et al. 2003).
Although different types of social media applications can be used to obtain health-related information, online health communities are among the most popular. In an online health community, patients and their caregivers can share their experiences and exchange information of interest. The emotional support and encouragement offered by community members is also important for patients suffering from serious illness and helps them cope with their diseases significantly better than those who face serious diseases alone.
Different participants seek different types of information and support. Some patients undergoing treatment prefer to discuss treatment knowledge. Some survivors, even after their disease has been cured, are still willing to share their experiences and advice on how they deal with life's daily challenges. Moreover, some patients with serious illness post messages of blessing simply to exchange emotional support. Participants thus have their own purposes and are interested only in specific topics. However, in current online health communities, messages on the same topic are scattered across many different threads, which makes it difficult for participants to summarize and quickly find the messages that interest them. One solution is to classify these messages into categories according to their topics. This lets participants get a sense of what the community covers, quickly find the issues they care about and become involved more easily, thereby gaining valuable information for their health self-education, self-care and self-management. In addition, for websites that provide health-related social media services, topic identification and categorization could help web designers and developers optimize the human-computer interface and provide personalized tools and functions, such as topic tags, that help users find topics of interest quickly. However, with the explosion of online messages, classifying messages manually is becoming impractical. To address this issue, we propose a topic identification approach based on text classification for automatically identifying the topics of health-related messages in online health communities.
In recent years, text classification techniques have been widely used for topic identification of online text. In medical informatics, previous studies focused primarily on medical literature and clinical narratives. Take the MEDLINE biomedical database as an example. MEDLINE contains millions of references to biomedical journal articles, which makes it very hard for researchers to reach the most relevant documents efficiently by searching. To address this issue, many studies attempted to design text classification systems for classifying MEDLINE documents automatically (Yetisgen-Yildiz & Pratt 2005). Text classification techniques were also often applied to clinical narratives to distinguish different types of patients, so that personalized medical services could be provided for them. For example, researchers have proposed a topic modeling approach for matching patient education materials to patients' clinical notes, so that relevant education articles can be recommended to patients automatically (Kandula et al. 2011). However, these studies mainly focused on professionally written medical text, which differs significantly from the user-generated medical text in online health communities. Few studies have attempted to apply text classification techniques to topic identification of such user-generated medical text.
In the data collection step, we first selected an online health community and collected all its web pages using the Web crawler software Offline Explorer. We then parsed the pages to extract the available health-related messages and tagged each message with its given topic category. Next, noisy and unreliable data were filtered out by text preprocessing, including stop-word removal and word stemming.
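The preprocessing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the paper names neither its stop-word list nor its stemming algorithm, so `STOP_WORDS`, `toy_stem` and `preprocess` are hypothetical stand-ins, and a real system would more likely use a full stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; the paper does not name its list).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "was", "i", "my", "for"}

# Suffixes stripped by this toy stemmer, tried in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip at most one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message):
    """Lowercase, tokenize on letters, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return [toy_stem(tok) for tok in tokens if tok not in STOP_WORDS]


print(preprocess("I was starting the treatments"))
```

The token lists produced this way are the kind of input from which the feature sets in the next section would be built.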
Feature set generation
(1) N-gram-based features. As mentioned above, current text classification techniques are predominantly based on the bag-of-words (BOW) model, which represents a document by the words occurring in it and performs classification based mainly on the presence or absence of these words. Some researchers have applied different classification techniques to BOW-based features, but the results showed that the various techniques performed similarly. Later work attempted to improve the BOW-based text representation and showed that adding word n-grams (sequences of n words) indeed improved performance. However, sequences of length n > 3 were shown not to be useful and might even decrease performance. In this study, therefore, n-gram-based features (n ≤ 3) were incorporated into the feature set.
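To make the n-gram representation concrete, the sketch below counts all word n-grams up to length 3 in a tokenized message. The function name `ngram_features` and the choice of raw counts (rather than presence/absence or another weighting) are assumptions made for illustration; the paper does not specify its weighting scheme.

```python
from collections import Counter


def ngram_features(tokens, max_n=3):
    """Count every word n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the message.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return counts


features = ngram_features(["breast", "cancer", "treatment"])
for gram, count in sorted(features.items()):
    print(gram, count)
```

A three-token message yields three unigrams, two bigrams and one trigram, which shows how quickly the feature space grows and why a feature reduction step becomes useful.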
(2) Domain-specific features. Studies have found that integrating domain-specific knowledge into the textual feature representation can improve classification performance (Zhu et al. 2009). Since health-related messages in online communities contain much medical knowledge, incorporating medical domain-specific features could enhance the performance of topic identification significantly. Medical domain-specific features were therefore introduced in this study as an additional feature dimension. In previous studies, the UMLS Metathesaurus, the world's largest repository of biomedical concepts, was widely used to extract medical terminology from medical text. We also used UMLS to extract health-related medical terminology as domain-specific features. The UMLS Metathesaurus consists of 1.7 million biomedical concepts, each assigned to at least one of 134 semantic types. The health-related semantic types used in our study are listed in Table 1.
Table 1 The UMLS semantic types used
Amino acid, peptide, or protein
Injury or poisoning
Mental or behavioral dysfunction
Body location or region
Biomedical occupation or discipline
Body part, organ, or organ component
Disease or syndrome
Sign or symptom
Therapeutic or preventive procedure
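The idea of domain-specific features can be sketched as a lexicon lookup that keeps only terms whose semantic type is in Table 1. Everything below is hypothetical: `TOY_LEXICON` is a hand-made stand-in for a UMLS Metathesaurus lookup, its term-to-type assignments are illustrative rather than verified against UMLS, and the longest-match strategy is an assumption, since the paper does not describe its extraction procedure.

```python
# Hypothetical mini-lexicon standing in for a UMLS Metathesaurus lookup.
# The semantic type assigned to each term is illustrative only.
TOY_LEXICON = {
    "chemotherapy": "Therapeutic or preventive procedure",
    "mastectomy": "Therapeutic or preventive procedure",
    "breast": "Body part, organ, or organ component",
    "nausea": "Sign or symptom",
    "lymphedema": "Disease or syndrome",
}


def domain_features(tokens, lexicon=TOY_LEXICON, max_len=3):
    """Return (term, semantic type) pairs for lexicon terms found in tokens."""
    found = []
    i = 0
    while i < len(tokens):
        # Prefer the longest phrase (up to max_len tokens) starting at position i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i : i + n])
            if phrase in lexicon:
                found.append((phrase, lexicon[phrase]))
                i += n
                break
        else:
            i += 1  # no lexicon term starts here
    return found


print(domain_features("nausea after chemotherapy".split()))
```

Either the matched terms or their semantic types could then serve as features; which of the two the study used is not stated in the text above.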
(3) Sentiment features. In addition to obtaining healthcare knowledge and medical information about disease-related symptoms, medication and treatments, patients in online health communities also expect to obtain emotional support from other community members. Patients diagnosed with cancer or chronic diseases in particular often post messages to vent their frustration, seek sympathetic encouragement and show compassion or empathy for others. These messages therefore usually contain many sentiment words with high sentiment polarity, which can be used to measure whether a health-related message provides informative support or emotional support and thus help improve topic identification performance. To obtain these sentiment features, we used SentiWordNet as a lexical resource and extracted the terms with high sentiment polarity scores. SentiWordNet assigns each WordNet synset a triple of polarity scores (positivity, negativity and objectivity). It currently consists of around 207,000 word-sense pairs, or 117,660 synsets, and has been widely used as a lexicon in recent sentiment analysis studies. In this study, we selected the terms with a subjectivity score above 0.5 as sentiment features and incorporated them into the feature set.
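The threshold rule for sentiment features can be sketched as below. Two assumptions are made loudly here: subjectivity is taken to be positivity plus negativity (that is, 1 minus objectivity), which the paper does not define explicitly, and `TOY_SCORES` is a hypothetical table with invented scores standing in for SentiWordNet, where scores are attached to synsets rather than to bare terms.

```python
# Hypothetical (positivity, negativity) scores standing in for SentiWordNet.
# Objectivity is implied as 1 - positivity - negativity.
TOY_SCORES = {
    "wonderful": (0.75, 0.0),
    "scared": (0.0, 0.625),
    "surgery": (0.0, 0.0),
    "tired": (0.125, 0.25),
}


def subjectivity(term, scores=TOY_SCORES):
    """Subjectivity assumed to be positivity + negativity; unknown terms score 0."""
    pos, neg = scores.get(term, (0.0, 0.0))
    return pos + neg


def sentiment_features(tokens, threshold=0.5, scores=TOY_SCORES):
    """Keep the tokens whose subjectivity score exceeds the threshold."""
    return [tok for tok in tokens if subjectivity(tok, scores) > threshold]


print(sentiment_features(["scared", "surgery", "wonderful", "tired"]))
```

With real SentiWordNet data, a term with several senses would first need its per-synset scores combined, for example by averaging; that choice is also left open by the text.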
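Information gain was used to remove irrelevant or redundant n-gram-based features. The information gain of a feature t is defined as
$$G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$
where $G(t)$ denotes the information gain for feature $t$, $m$ the number of classes, $P(c_i)$ the probability of class $c_i$, $P(c_i \mid t)$ the conditional probability of $c_i$ given $t$ (and $P(c_i \mid \bar{t})$ the conditional probability of $c_i$ given that $t$ does not occur), $P(t)$ the probability of feature $t$ occurring, and $P(\bar{t})$ the probability of feature $t$ not occurring.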
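The information gain G(t) described above can be computed directly from labelled documents, treating each feature as present or absent. The sketch below is an illustration, not the authors' code: the function name, the use of base-2 logarithms and the toy documents are all assumptions.

```python
import math


def information_gain(docs, labels, term):
    """Information gain of a present/absent feature.

    docs is a list of sets of terms; labels holds the class of each document.
    """
    n = len(docs)
    classes = set(labels)

    def plogp(p):
        # Convention: 0 * log(0) is taken to be 0.
        return p * math.log2(p) if p > 0 else 0.0

    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]

    # First term: entropy of the class distribution.
    gain = -sum(plogp(labels.count(c) / n) for c in classes)
    # Second and third terms: weighted class distributions given t present or absent.
    for subset in (with_t, without_t):
        if subset:
            weight = len(subset) / n
            gain += weight * sum(plogp(subset.count(c) / len(subset)) for c in classes)
    return gain


docs = [{"chemo"}, {"chemo"}, {"hugs"}, {"hugs"}]
labels = ["treatment", "treatment", "support", "support"]
print(information_gain(docs, labels, "chemo"))
```

A term that separates the two classes perfectly gets the full class entropy as its gain, while a term that occurs in no document gets a gain of zero. Features would then be ranked by gain and only those above a chosen cutoff retained.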
Classification and evaluation
In this study, three state-of-the-art classification techniques, SVM with a polynomial kernel, C4.5 and Naïve Bayes, were used to perform the classification task. SVM is a powerful statistical machine-learning technique first introduced by Vapnik (Vapnik 1995). Because of its good performance and its ability to handle millions of inputs, SVM has been widely used in text classification studies. C4.5 is a decision-tree building algorithm developed by Quinlan (Quinlan 1986). Based on a divide-and-conquer strategy and the entropy measure, C4.5 classifies mixed objects into categories according to their attribute values. The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem with strong independence assumptions; it uses the feature values of a new instance to estimate the probability of each category. It has also been used for text classification tasks in previous studies.
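Of the three techniques, Naïve Bayes is the simplest to show end to end. The sketch below is a small multinomial Naïve Bayes with add-one smoothing, written from scratch for illustration. It is an assumption that a multinomial variant with this smoothing matches what was used in the study; the class name `TinyNaiveBayes` and the training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # term frequencies per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of log likelihoods of the observed terms.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore terms never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["chemo", "surgery"], ["radiation", "chemo"],
              ["hugs", "prayers"], ["prayers", "love"]]
train_labels = ["treatment", "treatment", "emotional support", "emotional support"]
model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["chemo"]), "|", model.predict(["prayers", "hugs"]))
```

The independence assumption appears in the inner loop: each term contributes its own log likelihood, regardless of which other terms occur in the message.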
Research testbed and data collection
In this study, we conducted our experiment on a popular breast cancer online community (http://apps.komen.org/Forums) supported by the Susan G. Komen Breast Cancer Foundation. Breast cancer continues to be a serious problem for women in the US, characterized by high incidence and mortality rates and a detrimental impact on quality of life. Many studies have shown that breast cancer is one of the cancers that Internet users most commonly search for information about online (Castleton et al. 2011). The Susan G. Komen forum is one of the largest online communities for breast cancer survivors and activists, with more than twenty thousand members. More importantly, this online health community was already organized into several message boards based on different topics, thus providing a gold standard for evaluating the performance of our topic classification.
We collected the health-related messages in three message boards: treatment board, emotional support board and survivorship board. In the treatment board, members share their experiences, thoughts and advice on surgeries, reconstruction and chemotherapy. In the emotional support board, members post their well wishes and provide emotional support to those suffering from breast cancer. In the survivorship board, members share their experiences and advice on how they deal with life's daily challenges, and how to find a new routine of life while dealing with breast cancer. To evaluate the performance of our topic classification model, we randomly selected 4,041 messages from the collected data and tagged them with a predefined topic category label. Among these messages, 1,224 messages were tagged as treatment, 991 messages were tagged as emotional support, and 1,826 messages were tagged as survivorship.
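As a quick check on the figures above, the three category counts sum to the 4,041 sampled messages, and the class shares follow directly. The majority-class baseline computed below is not reported in the paper; it is added here only as a reference point for reading accuracy figures on an unbalanced dataset.

```python
# Message counts per topic, as reported in the text.
counts = {"treatment": 1224, "emotional support": 991, "survivorship": 1826}

total = sum(counts.values())
shares = {topic: n / total for topic, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(counts.values()) / total

print(total)
for topic, share in shares.items():
    print(f"{topic}: {share:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
```

Survivorship is the largest class at roughly 45% of the sample, so any useful classifier on this data should clearly exceed that level of accuracy.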
Results and discussion
Comparison of different feature sets
[Table: Pairwise t-tests on accuracy and F-measure for different feature sets. Columns: P-value on accuracy; P-value on F-measure.]
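A pairwise t-test of this kind compares two configurations on matched evaluation runs. The sketch below computes the paired t statistic and its degrees of freedom only. It assumes that the pairs are per-fold scores from cross-validation, which the surviving text does not confirm, and the accuracy values are invented for illustration; they are not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Invented per-fold accuracies for two hypothetical feature sets.
fold_acc_a = [0.80, 0.82, 0.81, 0.83, 0.79]
fold_acc_b = [0.75, 0.78, 0.77, 0.80, 0.76]
t, df = paired_t_statistic(fold_acc_a, fold_acc_b)
print(round(t, 2), df)
```

The p-value is then read from the t distribution with the given degrees of freedom, for example from a statistics table or a statistics library.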
Comparison of different classification techniques
[Table: Pairwise t-tests on accuracy and F-measure for different classification techniques. Columns: P-value on accuracy; P-value on F-measure.]
Comparison of different topics
[Table: Performance measures of different topic groups. Feature set: FS4 (selected F1+F2+F3).]
Conclusions and future research
With the development of online health communities, automatic topic identification of health-related messages could help patients find topics of interest and facilitate their involvement in these communities. In this paper, we proposed a topic identification model based on text classification techniques. To evaluate the effectiveness of this model, we conducted experiments on one breast cancer online community using different feature sets and classification techniques. We found that n-gram-based features, domain-specific features and sentiment features all contributed significantly to topic identification performance. In terms of classifiers, SVM outperformed C4.5 and Naïve Bayes significantly. The combination of feature set FS4 and the SVM classifier produced the best classification results. The experimental results also demonstrated that the proposed approach could identify the topics of online health-related messages effectively.
The paper also has some limitations that need further consideration. First, the study showed that incorporating medical domain-specific features and sentiment features could enhance the performance of topic classification significantly, but other features could be used to improve performance further. For example, messages within a single thread most likely belong to the same topic, so such structural features should be considered in future research. Second, we adopted information gain to remove irrelevant or redundant n-gram-based features, but other feature reduction methods, such as the Markov blanket, have proved effective in some studies. Further research could explore and compare different feature reduction methods to obtain the best feature sets for topic classification. Lastly, classification techniques other than SVM, C4.5 and Naïve Bayes should be considered in future research to improve topic identification performance.
This research was supported by the National Natural Science Foundation of China under Grant 71171131.
- Abbasi A, Chen H, Salem A: Sentiment analysis in multiple languages: feature selection for opinion classification in web forums. ACM Trans Inf Syst 2008, 26(3):1-34.
- Anderson JG, Rainey MR, Eysenbach G: The impact of cyberhealthcare on the physician-patient relationship. J Med Syst 2003, 27(1):67-84. doi:10.1023/A:1021061229743
- Bansil P, Keenan NL, Zlot AI, Gilliland JC: Health-related information on the web: results from the HealthStyles survey, 2002–2003. Preventing Chronic Disease: Public Health Research, Practice, and Policy 2006, 3(2):1-10.
- Castleton K, Fong T, Wang-Gillam A, Waqar MA, Jeffe DB, Kehlenbrink L, Gao F, Govindan R: A survey of Internet utilization among patients with cancer. Support Care Cancer 2011, 19(8):1183-1190. doi:10.1007/s00520-010-0935-5
- Dickerson SS, Boehmke M, Ogle C, Brown JK: Seeking and managing hope: patients' experiences using the internet for cancer care. Oncol Nurs Forum 2006, 33(1):E8-E17. doi:10.1188/06.ONF.E8-E17
- Fox S, Jones S: The social life of health information. Pew Internet. 11 June 2009. Available from: http://www.pewinternet.org/reports/2009/8-the-social-life-of-health-information.aspx
- Gerber BS, Eiser AR: The patient-physician relationship in the internet age: future prospects and the research agenda [electronic version]. J Med Internet Res 2001, 3(2):e15. doi:10.2196/jmir.3.2.e15
- Kandula S, Curtis D, Hill B, Zeng-Treitler Q: Use of topic modeling for recommending relevant education material to diabetic patients. AMIA Annu Symp Proc 2011, 2011:674-682.
- Kinnane N, Milne D: The role of the Internet in supporting and informing carers of people with cancer: a literature review. Support Care Cancer 2010, 18:1126.
- Monnier J, Laken M, Carter CL: Patient and caregiver interest in Internet-based cancer services. Cancer Pract 2002, 10(6):305-310. doi:10.1046/j.1523-5394.2002.106005.x
- Quinlan JR: Induction of decision trees. In Machine Learning. Netherlands: Kluwer Academic Publishers; 1986:81-106.
- Salton G, Wong A, Yang CS: A vector space model for automatic indexing. Commun ACM 1975, 18:613-620. doi:10.1145/361219.361220
- Tyson T: The Internet: tomorrow's portal to non-traditional health care services. J Ambul Care Manage 2000, 23:1-7.
- Umefjord G, Petersson G, Hamberg K: Reasons for consulting a doctor on the internet: web survey of users of an Ask the Doctor service. J Med Internet Res 2003, 5(4):e26. doi:10.2196/jmir.5.4.e26
- Vapnik VN: The nature of statistical learning theory. New York: Springer-Verlag; 1995.
- Witten I, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. San Francisco: Morgan Kaufmann Publishers; 2005.
- Yetisgen-Yildiz M, Pratt W: The effect of feature representation on MEDLINE document classification. AMIA Annu Symp Proc 2005:849-853.
- Zhu S, Zeng J, Mamitsuka H: Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics 2009, 25:1944-1951. doi:10.1093/bioinformatics/btp338
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.