Extraction of a group-pair relation: problem-solving relation from web-board documents

This paper aims to extract a group-pair relation as a Problem-Solving relation, for example a DiseaseSymptom-Treatment relation and a CarProblem-Repair relation, between two event-explanation groups: a problem-concept group (a symptom/CarProblem-concept group) and a solving-concept group (a treatment/repair-concept group), from hospital-web-board and car-repair-guru-web-board documents. The Problem-Solving relation (particularly the Symptom-Treatment relation), including its graphical representation, benefits non-professional persons by supporting preliminary problem-solving knowledge. The research addresses three problems: how to identify an EDU (an Elementary Discourse Unit, which is a simple sentence) with the event concept of either a problem or a solution; how to determine a problem-concept EDU boundary and a solving-concept EDU boundary as two event-explanation groups; and how to determine the Problem-Solving relation between these two event-explanation groups. Therefore, we apply word co-occurrence to identify a problem-concept EDU and a solving-concept EDU, and machine-learning techniques to determine a problem-concept EDU boundary and a solving-concept EDU boundary. We propose using k-means and Naïve Bayes to determine the Problem-Solving relation between the two event-explanation groups, involving clustering features. In contrast to previous works, the proposed approach enables group-pair relation extraction with high accuracy.

types whilst an 'agent' in linguistic typology is an initiator of an event, and a 'patient' is an entity undergoing change. Khoo and Na (2006) stated that "concepts and relations are the foundation of knowledge and thought while the concepts are the building blocks of knowledge and the relations are the cement linking up the concepts into the knowledge structures." (p. 157). The relations and the concepts of knowledge structures are necessary not only for a search engine (Lei et al. 2006), but also for both reasoning and inference in information extraction, information retrieval, question-answering, and text summarization applications through certain web sites (Katrenko et al. 2010).
In much research (Konstantinova 2014; Kim et al. 2009; Girju 2003), semantic relation determination from texts for various applications mostly relies on relations, i.e. is-a, part-of, and cause-effect, between two entities expressed as noun phrases without any explanation. Some previous studies (Song et al. 2011; Pechsiri and Piriyakul 2010) on knowledge acquisition for reasoning applications attempted to determine semantic relations, i.e. disease-treatment and cause-effect, which connect either one entity concept or one event concept without explanation to either a vector of entity concepts or a vector of event concepts as the explanation. However, our research focuses on extracting the group-pair relation as a Problem-Solving relation from web-board documents. The group-pair relation links two event-explanation groups (two vectors of event concepts) where each group is explained by several event concepts, including its boundary determination. Thus, a Problem-Solving relation links a problem-concept group and a solving-concept group. The web-board documents that contain the Problem-Solving relations expressed by experts or practitioners can provide the declarative knowledge and the procedural knowledge for reasoning and inference in other systems of web applications, where the declarative knowledge is "knowing that something is the case/problem" (Hardin 2002, p. 227) and the procedural knowledge is "knowing how to do something or to solve the problem including motor skills, cognitive skills, and cognitive strategies" (Hardin 2002, p. 227). Therefore, our research concerns the extraction of the Problem-Solving relation, i.e. a DiseaseSymptom-Treatment relation and a CarProblem-Repair relation, from Thai documents of two domains, a medical-healthcare domain and a car-repair domain, downloaded from the hospital's web-board on a non-government-organization website (http://haamor.com/) and the car-repair-guru web-boards (https://www.gotoknow.org/posts/113664), respectively, for an application with an open source recommendation engine such as a web-based question-answering system.
The Problem-Solving relation on the web-board documents is mostly based on the event explanation with the event semantics of verbs (Pustejovsky 1991) on both the problem-concept group as the problem explanation and the solving-concept group as the solving explanation, described by patients/users and experts, i.e. professional medical practitioners and mechanics. Each medical-healthcare-consulting/car-repair-guru document contains both the disease-symptom-event/carProblem-event explanation and the treatment-event/repair-event explanation, which are expressed in the form of several EDUs [an EDU is an elementary discourse unit, which is a simple sentence/clause defined by Carlson et al. (2003)]. In addition to the solving-event explanation of the Problem-Solving relation, there are two kinds of solution on web-board documents: the actual solution reported by patients/users from their experience, and the recommended solution recorded by experts. For example, each medical-healthcare-consulting document from the web-board contains several EDUs of the symptom concepts along with either the actual-treatment-concept EDUs followed by the recommended-treatment-concept EDUs, or only the recommended-treatment-concept EDUs, as shown in the following EDU-Sequence form.
where Dsym, AT, and RT are a group of disease-symptom-concept EDUs (as a symptom-concept EDU boundary or vector), a group of actual-treatment-concept EDUs (as a treatment-concept EDU boundary or vector), and a group of recommended-treatment-concept EDUs (as a treatment-concept EDU boundary or vector), respectively, as follows: n1-n7 are the numbers of EDUs in the sequence and are ≥0, except n2 and n6, which are ≥1.
Moreover, the extracted DiseaseSymptom-Treatment relation from medical-healthcare-consulting documents is represented by a Problem-Solving-Map (PSM), which is the graphical representation of the symptom events with the corresponding treatment events (Fig. 2). The PSM representation helps non-professional people to understand easily how to solve their health problems at the preliminary stage. Thus, the extracted Problem-Solving relation of our research will benefit an automatic question-answering system on preliminary problem-solving web-boards while patients wait for experts.
There are several techniques (Yeleswarapu et al. 2014; Abacha and Zweigenbaum 2011; Fader et al. 2011; Song et al. 2011; Rosario 2005) that have been used to extract the semantic relations of problems and solutions or effects from documents (see section "Related work"). The group-pair relation as the Problem-Solving relation in our research is extracted from the downloaded Thai documents of medical-healthcare consultation and carProblem consultation from the hospital web-boards and the car-repair-guru web-boards, respectively. However, Thai documents have some specific characteristics, such as zero anaphora or implicit noun phrases and the absence of word and sentence delimiters. All of these characteristics are involved in three main problems when extracting the Problem-Solving relation from the web-board documents (see section "Research-problems of problem-solving relation extraction"). The first problem is how to identify a problem-concept EDU, i.e. a symptom-concept EDU and a carProblem-concept EDU, and a solving-concept EDU, i.e. a treatment-concept EDU and a repair-concept EDU. The second problem is how to identify the problem-concept EDU boundary, i.e. the symptom-concept EDU boundary (Dsym) and the CarProblem-concept EDU boundary, and the solving-concept EDU boundary, i.e. the treatment-concept EDU boundary (AT/RT) and the repair-concept EDU boundary. The third problem is how to determine the Problem-Solving relation, i.e. the DiseaseSymptom-Treatment relation and the CarProblem-Repair relation, from the medical-healthcare-consulting documents and the car-repair-guru documents, respectively. To address these problems, we need to develop a framework which combines machine learning techniques and linguistic phenomena to learn the several EDU expressions of the Problem-Solving relations. Therefore, we apply a learned relatedness value (Guthrie et al. 1991; Chaudhari et al. 2011) for the words of a word co-occurrence (called "Word-CO") with a problem concept or a solving concept to identify a problem-concept EDU or a solving-concept EDU. The Word-CO expression in our research is the event expression of two or three adjacent words (after stemming words and eliminating stop words) as a word order pair or a sequence of words existing in one EDU with either a problem concept or a solving concept. The first word of the Word-CO is a verb expression on an EDU with a general Thai linguistic expression (see section "Research-problems of problem-solving relation extraction") where "$verb \rightarrow verb_{strong} \mid verb_{weak}\text{-}noun1 \mid verb_{weak}\text{-}noun2$". This verb expression can be represented as $v_{co}$ ($v_{co} \rightarrow verb_{strong} \mid verb_{weak}\text{-}noun1 \mid verb_{weak}\text{-}noun2$). The second word of the Word-CO is the co-occurred word, $w_{co}$, of $v_{co}$, which exists immediately after $v_{co}$ once words have been stemmed and stop words eliminated. Three different machine learning techniques, Maximum Entropy (ME) (Csiszar 1996; Berger et al. 1996; Fleischman et al. 2003), Support Vector Machines (SVM) (Cristianini and Shawe-Taylor 2000), and the Logistic Regression Model (LR) (Freedman 2009), are applied to determine the problem-concept EDU boundary and the solving-concept EDU boundary from the consecutive EDUs. There are two reasons for using these machine learning techniques for the boundary determination: (1) our data on each group of consecutive EDUs (i.e.
Dsym as a symptom-concept EDU vector and AT/RT as a treatment-concept EDU vector) are based on a vector of binary features of Word-CO occurrences on the problem-concept EDU vector and the solving-concept EDU vector, and (2) there is a diversity of Word-CO occurrences, including some Word-CO occurrences with dependency. ME is a probabilistic classifier that belongs to the class of exponential models (Csiszar 1996), and SVM is based on the concept of hyperplanes that separate a multidimensional space into different class labels (Cristianini and Shawe-Taylor 2000). LR is used to describe data and to explain the relationship between one dependent binary variable and one or more metric (interval or ratio scale) independent variables (Freedman 2009). We also propose using Naïve Bayes (Mitchell 1997) to determine the Problem-Solving relation from documents after clustering the objects of posted problems on the web-boards and clustering solving features as the feature reduction. Our research is organized into six sections. In section "Related work", related work is summarized. Research problems in extracting Problem-Solving relations from documents are described in section "Research-problems of problem-solving relation extraction", and section "A framework for problem-solving relation extraction" shows our framework for extracting the Problem-Solving relation. In section "Evaluation and discussion", we evaluate our proposed model including discussion and then present the conclusion in section "Conclusion".

Fig. 2 The problem-solving-map representation of the DiseaseSymptom-treatment relation (panel labels: Concept Level; Group-Pair Relation Level)

Related work
Several strategies (Yeleswarapu et al. 2014; Abacha and Zweigenbaum 2011; Fader et al. 2011; Song et al. 2011; Rosario 2005) have been proposed to extract a disease-treatment relation, a symptom-treatment relation, a drug-adverse-event relation, and other relations from textual data. Rosario (2005) extracted semantic relations from bioscience texts. In general, the entities are often realized as noun phrases, and the relationships often correspond to grammatical functional relations, as shown in the following example.
Therefore administration of TJ-135 may be useful in patients with severe acute hepatitis accompanying cholestasis or in those with autoimmune hepatitis.
The disease hepatitis and the treatment TJ-135 are entities, and the semantic relation is: hepatitis is treated or cured by TJ-135. The goals of her work were to identify the semantic roles DIS (Disease) and TREAT (Treatment), and to identify the semantic relation between DIS and TREAT from bioscience abstracts. She identified the entities (DIS and TREAT) by using MeSH, and the relationships between the entities by using a neural network based on five graphical models with lexical, syntactic, and semantic features. Her results were 79.6 % accurate in the relation classification when the entities were hidden, and 96.9 % when the entities were given.
Abacha and Zweigenbaum (2011) extracted semantic relations between medical entities (treatment relations between a medical treatment and a problem, e.g. a disease symptom) by using a linguistic pattern-based method to extract the relation from selected MEDLINE articles.
Linguistic Pattern: ... E1 ... be effective for E2 ... | ... E1 was found to reduce E2 ...
where E1, E2, or Ei is a medical entity (as well as UMLS concepts and semantic types) identified by MetaMap.
Their treatment relation extraction was based on a pair of medical entities or noun phrases occurring within a single sentence, as shown in the following example:
Fosfomycin (E1) and amoxicillin-clavulanate (E2) appear to be effective for cystitis (E3) caused by susceptible isolates.
Finally, their results showed 75.72 % precision and 60.46 % recall.
Song et al. (2011) extracted procedural knowledge from MEDLINE abstracts, as shown in the following example, by using Support Vector Machine (SVM) compared to Conditional Random Field (CRF), along with natural language processing.
"… 〈In a total gastrectomy〉 (Target), 〈clamps are placed on the end of the esophagus and the end of the small intestine〉 (P1). 〈The stomach is removed〉 (P2) and 〈the esophagus is joined to the intestine〉 (P3) …", where P1, P2, and P3 are the solution procedures.
They defined procedural knowledge as a combination of the Target and a corresponding solution consisting of one or more related procedures/methods. SVM and CRF were utilized with four feature types to classify the Target: a content feature (after word stemming and stop-word elimination) with unigrams and bi-grams in a target sentence, a position feature, a neighbor feature, and an ontological feature. In addition, other features, namely a word feature, a context feature, a predicate-argument structure, and an ontological feature, were utilized to classify procedures from several sentences. The precisions were 0.7279 for CRF and 0.8369 for SVM, and the recalls were 0.7326 for CRF and 0.7957 for SVM.
Fader et al. (2011) identified the relation between two noun-phrase arguments occurring within one sentence in an open IE (Information Extraction) setting. The open IE setting involved a massive corpus in which a pre-specified vocabulary was not required and the target relations could not be specified in advance. A relation phrase or a verb phrase was then applied to connect the two arguments, although some relation phrases induced uninformative and incoherent extractions. To solve this problem, Fader et al. (2011) introduced syntactic constraints and lexical constraints. The syntactic constraints, such as "every multi-word relation phrase must begin with a verb, end with a preposition, and be a contiguous sequence of words in the sentence", i.e. 'has a cameo in', 'made a deal with', etc., can eliminate the problems of uninformative and incoherent extractions. If the relation phrase has too many words, a lexical constraint is used to separate valid relation phrases with a confidence score using a logistic regression classifier. Their precision and recall were 0.8 and 0.62, respectively.
Yeleswarapu et al. (2014) applied a semi-automatic pipeline for the detection and extraction of drug-adverse event (drug-AE) pairs from unstructured data, such as user-comment blogs and MEDLINE abstracts, and from a structured database (the Food and Drug Administration Adverse Event Reporting System). The 12 drugs, diseases and symptoms or adverse events were based on noun phrases, including named entity recognition by using the PubMed dictionary. The Information Component (IC) value, computed by using the Bayesian Confidence Propagation Neural Network, is a measure of the disproportionality between entities of the drug-adverse event pairs. The standard deviation for each IC provides a measure of the robustness of the value. The IC is thus a measure of the strength of the dependency between a drug and an AE (Adverse Event). An IC value of zero indicates that there is no quantitative dependency between the drug and AE combinations. If the IC value increases over time and is positive, the positive quantitative association between the drug and the AE is likely to be high. Thus, each drug-AE pair extracted from multiple data sources by Yeleswarapu et al. (2014) implies the relation/association between a certain drug and its adverse events. However, their proposed model extracts the drug-AE pairs from user blogs with less strength of the drug-AE association (based on IC values) than from both the MEDLINE abstracts and the adverse event databases.
In most of the previous works, i.e. Abacha and Zweigenbaum (2011) and Rosario (2005), the treatment relation between the medical treatment and the problem (as a disease or a symptom) occurs within one sentence. The drug-AE relation (Yeleswarapu et al. 2014) also occurs within one sentence with several noun phrases including named entities. Furthermore, Fader et al. (2011) worked on the verb phrase as the relation phrase linking two noun-phrase arguments within one sentence, whereas the work of Song et al. (2011) could determine several sentences of the treatment method, but there was only one sentence of the problem as the Target disease or symptom. The Problem-Solving relation of this research is a group-pair relation between two groups of several sentences/EDUs, the problem-concept EDU group and the solving-concept EDU group, which results in many Word-CO features with ambiguity, diversity, and dependency occurrences when considering the Problem-Solving relation determination. This research has another research problem to consider: the Problem-Solving relation occurrence and the non-Problem-Solving relation occurrence can occur in the same group pair, with the same problem-concept EDU group and the same solving-concept EDU group. However, the expression of our Problem-Solving relation is based on the event explanation with several EDUs, providing more interesting information for people to understand clearly. Therefore, we propose using the Naïve Bayes classifier to determine the Problem-Solving relation from documents, where clustering is required to enhance the correct relation extraction. The clustering technique is applied to organize similar problem objects from the problem-concept EDU groups (i.e. symptom-concept EDU vectors and carProblem-concept EDU vectors) and to reduce Word-CO features by clustering the Word-CO features with similar solving concepts in the solving-concept EDU groups (i.e. treatment-concept EDU vectors and repair-concept EDU vectors).

Research-problems of problem-solving relation extraction
The group-pair relation extraction of this research involves several problems based on the following general Thai linguistic expression of each EDU after stemming words and eliminating stop words:
where NP1 and NP2 are noun phrases, VP is a verb phrase, adv is an adverb, adj is an adjective, AdjPhrase is an adjective phrase, and PrepPhrase is a preposition phrase. For example:
(a) "ผู้ป่วยมีอาการแน่นหน้าอก" (A patient has a tight chest symptom.) = "(ผู้ป่วย/patient-noun1)/NP1 (มี/have-$verb_{weak}$ อาการ/symptom-noun2 แน่นหน้าอก/tight_chest-AdjPhrase)/VP"
(b) "แผลที่บริเวณนิ้วมือเป็นสีเขียวคล้ำ" (A scar at the finger area is dark green color.)
Therefore, to extract the Problem-Solving relation from documents after passing the pre-processing step of the word-cut and EDU determination, there are three problems that must be solved: how to identify a problem-concept EDU and a solving-concept EDU, how to determine the problem-concept EDU boundary and the solving-concept EDU boundary, and how to determine the Problem-Solving relation from the medical-healthcare-consulting documents and the car-repair-guru documents.

How to identify problem-concept EDU and solving-concept EDU
According to the corpus behavior study of the medical-healthcare domain and the car-repair domain, most of the symptom/carProblem-concept EDUs and the treatment/repair-concept EDUs are event expressions expressed by verb phrases. This concept-EDU identification problem can be solved by learning the relatedness from two consecutive words on each EDU, after stemming words and eliminating stop words, to form the Word-CO of each EDU with the symptom/carProblem concept or the treatment/repair concept. The first word of the Word-CO is a verb expression, $v_{co}$, related to the symptom/carProblem concept or the treatment/repair concept (where $v_{co} \in V_{co}$, $V_{co} = V_{co1} \cup V_{co2}$, $V_{co1}$ is a set of verbs related to the symptom/carProblem concepts, and $V_{co2}$ is a set of verbs related to the treatment/repair concepts). The second word of the Word-CO is a co-occurred word, $w_{co}$ ($w_{co} \in W_{co}$; $W_{co} = W_{co1} \cup W_{co2}$). $W_{co1}$ and $W_{co2}$ are co-occurred word sets inducing the $v_{co1} w_{co1}$ co-occurrence and the $v_{co2} w_{co2}$ co-occurrence to have the symptom/carProblem concept and the treatment/repair concept, respectively, where $v_{co1} \in V_{co1}$, $w_{co1} \in W_{co1}$, $v_{co2} \in V_{co2}$, and $w_{co2} \in W_{co2}$. All concepts of $V_{co1}$, $V_{co2}$, $W_{co1}$, and $W_{co2}$ from the annotated corpus are obtained from WordNet (Miller 1995).
According to the medical-healthcare-consulting document shown in Fig. 1, there is no clue (i.e. 'และ/and', 'หรือ/or', etc.) in either EDU4 or EDU11 to identify the symptom boundary (EDU2-EDU4) or the treatment boundary (EDU9-EDU11), respectively. In addition, in the car-repair-guru documents, there is also no clue in EDU5 and EDU7 to identify the carProblem-concept EDU boundary (EDU1-EDU5) and the repair-concept EDU boundary (EDU6-EDU7), respectively, as shown in the following example.
After the problem-concept EDU and the solving-concept EDU have been identified by using the Word-CO from section "How to identify problem-concept EDU and solving-concept EDU", we then determine the problem-concept EDU boundary and the solving-concept EDU boundary by applying ME, SVM, and LR to learn a Word-CO pair from a sliding window of two consecutive EDUs with a sliding distance of one EDU.
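The sliding window over consecutive EDUs can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `word_co_pairs` and the toy English Word-CO strings are hypothetical, and each EDU is assumed to be already reduced to its Word-CO.

```python
from typing import List, Tuple


def word_co_pairs(edu_word_cos: List[str]) -> List[Tuple[str, str]]:
    """Slide a window of two consecutive EDUs with a step of one EDU.

    Each element of edu_word_cos is the Word-CO of one EDU (hypothetical
    representation); the result holds one Word-CO pair per window position.
    """
    return [
        (edu_word_cos[j], edu_word_cos[j + 1])
        for j in range(len(edu_word_cos) - 1)
    ]


# Toy Word-COs (made up for illustration, not taken from the corpus).
toy_edus = ["have symptom", "feel pain", "vomit blood", "take medicine"]
toy_pairs = word_co_pairs(toy_edus)
```

With n EDUs the window yields n - 1 pairs, so every EDU except the first and the last appears in two pairs, once as the left element and once as the right element.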

How to determine the problem-solving relation
The relation results of a problem-concept group and a solving-concept group vary between people, i.e. patients, drivers, and other users, even though they have the same problems. For example:
According to examples m) and n), the CarProblem-Repair relation occurs only in n) because EDU6 of n) contains 'เป็นปกติ/be normal' as the Class-cue-word of the Problem-Solving relation.

Therefore, we propose automatically learning the Problem-Solving relation in documents by using the Naïve Bayes classifier, with clustering of objects from several symptom/carProblem-concept EDU vectors and clustering of features as the feature reduction of all features from treatment/repair-concept EDU vectors. Each symptom/carProblem-concept EDU and each treatment/repair-concept EDU are represented by the Word-CO with the symptom/carProblem concept, $v_{co1} w_{co1}$, and the Word-CO with the treatment/repair concept, $v_{co2} w_{co2}$, respectively. Each symptom/carProblem-concept EDU boundary and each treatment/repair-concept EDU boundary is represented by a symptom/carProblem-concept EDU vector, $\langle v_{co1\text{-}1} w_{co1\text{-}1}, v_{co1\text{-}2} w_{co1\text{-}2}, \ldots, v_{co1\text{-}a} w_{co1\text{-}a} \rangle$, and a treatment/repair-concept EDU vector, $\langle v_{co2\text{-}1} w_{co2\text{-}1}, v_{co2\text{-}2} w_{co2\text{-}2}, \ldots, v_{co2\text{-}b/c} w_{co2\text{-}b/c} \rangle$, respectively.
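The relation learning proposed above can be illustrated with a small Naïve Bayes sketch. Everything specific in it is an assumption made for illustration: the cluster labels (`P1`, `S1`, ...) stand in for the clustered problem objects and clustered solving features, the classes follow the {"yes", "no"} Class-type set, and the multinomial form with add-one smoothing is one common variant, not necessarily the one used in this research.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Example = Tuple[List[str], str]  # (cluster-label features, class "yes"/"no")
Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_nb(examples: List[Example]) -> Model:
    """Count classes and per-class feature occurrences."""
    class_counts: Counter = Counter(label for _, label in examples)
    feat_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for feats, label in examples:
        for f in feats:
            feat_counts[label][f] += 1
            vocab.add(f)
    return class_counts, feat_counts, vocab


def predict_nb(model: Model, feats: List[str]) -> str:
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = "", -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in feats:
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training pairs: problem cluster + solving-feature cluster.
toy_train = [
    (["P1", "S1"], "yes"),
    (["P1", "S1"], "yes"),
    (["P2", "S2"], "no"),
    (["P2", "S2"], "no"),
]
toy_model = train_nb(toy_train)
```

Calling `predict_nb(toy_model, ["P1", "S1"])` scores both classes and returns the one whose smoothed likelihood is higher. Working on cluster labels instead of raw Word-COs keeps the feature space small, which is the purpose of the feature reduction described above.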

A framework for problem-solving relation extraction
There are five steps in our framework. The first step is corpus preparation, followed by Word-CO concept learning, especially of problem concepts (i.e. symptom/carProblem concepts) and solving concepts (i.e. treatment/repair concepts). The feature extraction step is then carried out for the Problem-Solving relation learning step, which is followed by the Problem-Solving relation extraction step, as shown in Fig. 3.

Corpus preparation
This step is the preparation of a medical-healthcare corpus and a car-repair corpus in the form of EDUs from the medical-healthcare-consulting documents and the car-repair documents downloaded from the hospital web-board (http://haamor.com/) and the car-repair-guru web-board (https://www.gotoknow.org/posts/113664, http://pantip.com/topic/31660469), respectively. The step involves using Thai word segmentation tools (Sudprasert and Kawtrakul 2003), including named entity recognition (Chanlekha and Kawtrakul 2004). After the word segmentation is achieved, EDU segmentation is then dealt with (Chareonsuk et al. 2005). The result is 6000 EDUs in the medical-healthcare corpus and 2000 EDUs in the car-repair corpus. The medical-healthcare corpus consists of three disease categories with 2000 EDUs in each disease category, i.e. a Gastro-intestinal disease, a Heart-Brain disease, and a Childhood disease. These corpora are separated into two parts: a learning part (4500 EDUs from the medical-healthcare-consultation documents and 1500 EDUs from the car-repair documents) and an evaluation part (1500 EDUs from the medical-healthcare-consultation documents and 500 EDUs from the car-repair documents). The learning part is used to learn the Word-CO concepts, the boundaries (the problem-concept EDU boundary and the solving-concept EDU boundary), and the Problem-Solving relation, based on tenfold cross validation. The evaluation part is used to test or evaluate the feature extraction (as the correct boundary determination) and the Problem-Solving relation extraction (see section "Evaluation and discussion"). In addition, in this step the corpus is semi-automatically annotated with the Word-CO concepts of the problem concepts and the solving concepts, along with the Class-cue-word annotation that specifies the cue word of the Problem-Solving relation with the Class-type set {"yes", "no"}, as shown in Fig. 4, an example of the Problem-Solving relation annotation.
All the concepts of the Word-CO refer to WordNet (http://word-net.princeton.edu/obtain) and MeSH after translating from Thai to English, by using Lexitron (the Thai-English dictionary) (http://lexitron.nectec.or.th/).
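The corpus sizes given in this step can be checked with a short sketch. The helper names `split_corpus` and `ten_folds` are hypothetical, and assigning EDUs to folds as contiguous index blocks is an assumption; the text does not state how the folds were formed.

```python
from typing import List, Tuple


def split_corpus(n_edus: int, n_learning: int) -> Tuple[range, range]:
    """Split EDU indices into a learning part and an evaluation part."""
    return range(0, n_learning), range(n_learning, n_edus)


def ten_folds(indices: range, k: int = 10) -> List[range]:
    """Cut the learning part into k equal contiguous folds.

    Assumes len(indices) is divisible by k, which holds for 4500 and 1500.
    """
    size = len(indices) // k
    return [indices[i * size:(i + 1) * size] for i in range(k)]


# Medical-healthcare corpus: 6000 EDUs, 4500 for learning.
med_learning, med_evaluation = split_corpus(6000, 4500)
# Car-repair corpus: 2000 EDUs, 1500 for learning.
car_learning, car_evaluation = split_corpus(2000, 1500)

med_folds = ten_folds(med_learning)
car_folds = ten_folds(car_learning)
```

Under these assumptions each cross-validation fold holds 450 medical-healthcare EDUs or 150 car-repair EDUs, with nine folds used for training and one held out in each round.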

Word-CO concept learning
Fig. 3 System overview where the input is text or downloaded documents and the output is the problem-solving relation, i.e. a DiseaseSymptom-treatment relation and a CarProblem-repair relation (block labels: Word-CO Learning of Problem-Concept/Solving Concept; WordNet; PubMed; Longdo; Word-CO: Problem Concept/Solving Concept; Problem EDU Identification along with Boundary Determination; Solving EDU Identification along with Boundary Determination; Feature Extraction; Problem-Solving Relation Extraction; Problem-Solving Relation)
Following Guthrie et al. (1991) and Chaudhari et al. (2011), the relatedness value, r, was applied in this research to indicate the relatedness between two consecutive words of the Word-CO, $v_{coi} w_{coi}$, from the annotated corpora after stemming words and eliminating stop words, with either the problem concept (i.e. a symptom/carProblem concept) or the solving concept (i.e. a treatment/repair concept), as shown in Eq. (1). Each $v_{coi} w_{coi}$ existing on several EDUs of documents has a relatedness value $r(v_{coi}, w_{coi})$ with either a positive or a negative concept. For example, if $v_{coi}$ is $v_{co1}$, one relatedness value of a $v_{co1} w_{co1}$ occurrence is for the problem concept (i.e. a symptom/carProblem concept) as the positive concept. Another relatedness value of the same $v_{co1} w_{co1}$ occurrence is for the non-problem concept (i.e. a non-symptom/non-carProblem concept) as the negative concept.
If $v_{coi}$ is $v_{co2}$, one relatedness value of a $v_{co2} w_{co2}$ occurrence is for the solving concept (i.e. a treatment/repair concept) as the positive concept. Another relatedness value of the same $v_{co2} w_{co2}$ occurrence is for the non-solving concept (i.e. a non-treatment/non-repair concept) as the negative concept. Only the $v_{coi} w_{coi}$ occurrence of the positive concept (the problem concept or the solving concept) with a higher $r(v_{coi}, w_{coi})$ value than that of the negative concept (the non-problem concept or the non-solving concept) is collected as an element of $VW_{problem}$ or $VW_{solving}$, respectively, where $v_{co1} w_{co1} \in VW_{problem}$, $VW_{problem}$ is a set of Word-COs with the problem concepts, $v_{co2} w_{co2} \in VW_{solving}$, and $VW_{solving}$ is a set of Word-COs with the solving concepts. $VW_{problem}$ and $VW_{solving}$ are used to identify the problem-concept EDU and the solving-concept EDU, respectively.
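The relatedness value of Eq. (1), i.e. the Word-CO count divided by the sum of the two word counts minus the Word-CO count, and the selection of Word-COs can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the toy English word pairs are made up, and the counts are taken separately over the Word-COs annotated with the positive concept and those annotated with the negative concept, which is one possible reading of the description above.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

WordCO = Tuple[str, str]  # (v_co, w_co)


def relatedness(f_vw: int, f_v: int, f_w: int) -> float:
    """Eq. (1): f_vw / (f_v + f_w - f_vw)."""
    denom = f_v + f_w - f_vw
    return f_vw / denom if denom else 0.0


def scores(pairs: Iterable[WordCO]) -> Dict[WordCO, float]:
    """Relatedness of every Word-CO observed under one concept label."""
    pairs = list(pairs)
    f_vw = Counter(pairs)
    f_v = Counter(v for v, _ in pairs)
    f_w = Counter(w for _, w in pairs)
    return {p: relatedness(c, f_v[p[0]], f_w[p[1]]) for p, c in f_vw.items()}


def collect_word_cos(positive: Iterable[WordCO],
                     negative: Iterable[WordCO]) -> Set[WordCO]:
    """Keep Word-COs whose positive-concept r exceeds the negative-concept r."""
    pos, neg = scores(positive), scores(negative)
    return {p for p, r in pos.items() if r > neg.get(p, 0.0)}


# Toy annotated Word-COs (hypothetical, for illustration only).
toy_positive = [("have", "fever"), ("have", "fever"),
                ("have", "cough"), ("take", "walk")]
toy_negative = [("have", "fever"), ("have", "lunch"), ("have", "lunch"),
                ("take", "walk"), ("take", "walk")]
toy_vw_problem = collect_word_cos(toy_positive, toy_negative)
```

In this toy data, ("take", "walk") is dropped because its relatedness is the same under both concepts, while the two remaining Word-COs score higher under the positive concept and are kept.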
$$r(v_{coi}, w_{coi}) = \frac{f_{v_{coi} w_{coi}}}{f_{v_{coi}} + f_{w_{coi}} - f_{v_{coi} w_{coi}}} \qquad (1)$$
where $r(v_{coi}, w_{coi})$ is the relatedness of the Word-CO with a problem/symptom concept if coi = co1 or a solving/treatment concept if coi = co2; $v_{coi} \in V_{coi}$ and $w_{coi} \in W_{coi}$; $V_{co1}$ is a set of verbs with the problem/symptom concepts; $V_{co2}$ is a set of verbs with the solving/treatment concepts; $W_{co1}$ is the co-occurred word set having the problem/symptom concept in the $v_{co1} w_{co1}$ co-occurrence; $W_{co2}$ is the co-occurred word set having the solving/treatment concept in the $v_{co2} w_{co2}$ co-occurrence; $f_{v_{coi}}$ is the number of $v_{coi}$ occurrences; $f_{w_{coi}}$ is the number of $w_{coi}$ occurrences; and $f_{v_{coi} w_{coi}}$ is the number of co-occurrences of $v_{coi}$ and $w_{coi}$.

Feature extraction
This step involves the extraction of two feature groups, a problem feature group and a solving feature group, to learn the Problem-Solving relation in the next step, for example,
the feature extraction on the medical-healthcare domain; the problem feature group is the symptom-concept EDU boundary (Dsym, represented by a symptom-concept EDU vector, $\langle v_{co1\text{-}1} w_{co1\text{-}1}, v_{co1\text{-}2} w_{co1\text{-}2}, \ldots, v_{co1\text{-}a} w_{co1\text{-}a} \rangle$) and the solving feature group is the treatment-concept EDU boundary (AT/RT, represented by a treatment-concept EDU vector, $\langle v_{co2\text{-}1} w_{co2\text{-}1}, v_{co2\text{-}2} w_{co2\text{-}2}, \ldots, v_{co2\text{-}b/c} w_{co2\text{-}b/c} \rangle$). Therefore, after the starting EDUs of the problem-concept EDU boundary and the solving-concept EDU boundary have been identified by $v_{coi} w_{coi}$ from $VW_{problem}$ and $VW_{solving}$, the problem-concept EDU boundary (i.e. Dsym) and the solving-concept EDU boundary (i.e. AT/RT) are determined by each of the following techniques: ME, SVM, and LR, along with sliding a window of two adjacent EDUs with a distance of one EDU. (Where coi = co1, $v_{co1} w_{co1}$ is a Word-CO with a symptom/carProblem concept, called a "symptom/carProblem Word-CO" or "Problem Word-CO", and where coi = co2, $v_{co2} w_{co2}$ is a Word-CO with a treatment/repair concept, called a "treatment/repair Word-CO" or "Solving Word-CO".)
Fig. 4 DiseaseSymptom-treatment relation annotation
ME (Csiszar 1996; Berger et al. 1996; Fleischman et al. 2003) can be used as the classifier of the class r, where the chosen class is $\arg\max_r p(r|x)$, to determine either the Dsym boundary classes or the AT/RT boundary classes as shown in Eq. (2). Here r is the Dsym boundary class or the AT/RT boundary class (the boundary is ending if r = 0, otherwise r = 1), and x is the binary vector of Word-CO ($v_{coi} w_{coi}$) features containing all Word-CO pairs, $v_{coi\text{-}j} w_{coi\text{-}j}\ v_{coi\text{-}j+1} w_{coi\text{-}j+1}$. According to Eq. (2), both $\lambda_l$ of each $v_{co1\text{-}j} w_{co1\text{-}j}$ and $\lambda_l$ of each $v_{co2\text{-}j} w_{co2\text{-}j}$ are the results from the supervised learning of ME by sliding the window of two adjacent EDUs with a distance of one EDU through the problem/symptom-concept EDU boundary and through the solving/treatment-concept EDU boundary, respectively.
Then, all λ l of v co1 w co1 and all λ l of v co2 w co2 from the ME learning are used to determine and extract Dsym and the AT/RT, respectively from the testing corpus with Eq. (2).
where v coi_j+1 w coi_j ∈ VW problem and v coi_j+1 w coi_j+1 ∈ VW problem if coi = co1 and VW probelm is a set of Work-CO with the problem/symptom concepts. v coi_j+1 w coi_j ∈ VW solving and v coi_j+1 w coi_j+1 ∈ VW solving if coi = co2 and VW solving is a set of Work-CO with the solving/treatment concepts.
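The sliding-window boundary determination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each EDU is reduced to a single Word-CO string, that the feature vector simply marks which vocabulary Word-COs occur in the current two-EDU window, and that p(r|x) takes the usual log-linear form with one weight per feature and class. The function names, the toy vocabulary, and the weights in `lambdas` are all hypothetical.

```python
import math

def window_features(edu_a, edu_b, vocabulary):
    """Binary vector marking which Word-COs occur in the two-EDU window."""
    present = {edu_a, edu_b}
    return [1 if word_co in present else 0 for word_co in vocabulary]

def maxent_probs(x, lambdas):
    """Log-linear p(r|x) for r in {0, 1}; lambdas[r] holds one weight per feature."""
    scores = {r: math.exp(sum(lam * xi for lam, xi in zip(weights, x)))
              for r, weights in lambdas.items()}
    z = sum(scores.values())  # normalising constant
    return {r: s / z for r, s in scores.items()}

def find_boundary_end(edus, start, vocabulary, lambdas):
    """Slide a two-EDU window one EDU at a time; stop when r = 0 (boundary ends)."""
    j = start
    while j + 1 < len(edus):
        x = window_features(edus[j], edus[j + 1], vocabulary)
        probs = maxent_probs(x, lambdas)
        if max(probs, key=probs.get) == 0:  # r = 0: the boundary ends at EDU j
            break
        j += 1
    return j

# Hypothetical toy data: weights are illustrative, not learned values.
vocabulary = ["have+fever", "have+diarrhea", "take+medicine"]
edus = ["have+fever", "have+diarrhea", "take+medicine", "drink+water"]
lambdas = {1: [1.0, 1.0, -3.0], 0: [0.0, 0.0, 0.0]}
end = find_boundary_end(edus, 0, vocabulary, lambdas)
print(edus[: end + 1])
```

With these toy weights the window containing the treatment Word-CO is classified as r = 0, so the extracted symptom boundary covers only the first two EDUs.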
SVM (Cristianini and Shawe-Taylor 2000) with the linear kernel: An input $x = (x_1, \ldots, x_n)$ is assigned to the positive class if $f(x) \ge 0$ and to the negative class if $f(x) < 0$, where the linear function $f(x)$ can be written as

$$ f(x) = \langle w \cdot x \rangle + b = \sum_{j=1}^{n} w_j x_j + b \tag{3} $$

where x is a dichotomous vector, w is a weight vector, b is a bias, and $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ are the parameters that control the function. The SVM learning determines $w_j$ and b for each Word-CO feature, $v_{coi\text{-}j}w_{coi\text{-}j}$ ($x_j$), in each Word-CO pair, $v_{coi\text{-}j}w_{coi\text{-}j}\, v_{coi\text{-}(j+1)}w_{coi\text{-}(j+1)}$, by supervised learning while sliding a window of two consecutive EDUs with a sliding distance of one EDU, where j = 1, 2, …, n and n is End-of-Boundary. The weight vector of all $v_{co1\text{-}j}w_{co1\text{-}j}$ and the weight vector of all $v_{co2\text{-}j}w_{co2\text{-}j}$ from the SVM learning are used with Eq. (3) to determine the boundary of Dsym and the boundary of AT/RT, respectively, from the testing corpus. All Dsym features and all AT/RT features are then extracted for the Problem-Solving/DiseaseSymptom-Treatment relation learning.

LR (Freedman 2009): The logistic regression model of the research is based on linear logistic regression with binary vector data. The distinguishing feature of the logistic regression model is that the outcome variable is binary or dichotomous. The logistic function accepts input with any value from negative to positive infinity and returns a value between 0 and 1, which is therefore interpretable as a probability and shows which attributes are influential in predicting the given outcome. The logistic function is given in Eq. (4), where F(x) is interpreted as the probability of the given outcome to be predicted, $x_1$ and $x_2$ are attribute variables, and $\beta_0$, $\beta_1$, and $\beta_2$ are the model estimators that weight each attribute. The research applies Eq. (4) to extract the features within each boundary (Dsym, AT/RT), with F(x) interpreted as the probability of either "Continue" as the "C" class or "End-of-Boundary" as the "E" class, by the following rules.

Rule 1 (C-Class): If $F(x)_{C\text{-}Class} \ge 0.5$ then "Continue" (keep sliding two consecutive EDUs)
Rule 2 (E-Class): If $F(x)_{E\text{-}Class} \ge 0.5$ then "End-of-Boundary" (stop sliding two EDUs)

where $x_1$ and $x_2$ are the attribute variable pair of each Word-CO pair, $v_{coi\text{-}j}w_{coi\text{-}j}\, v_{coi\text{-}(j+1)}w_{coi\text{-}(j+1)}$, of each EDU pair from the supervised learning of LR in Eq. (5) by sliding a window of two adjacent EDUs with a sliding distance of one EDU, where j = 1, 2, …, n and n is End-of-Boundary.
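The two decision rules above can be illustrated with a short sketch. It assumes the standard logistic form F(x) = 1 / (1 + exp(-(β0 + β1·x1 + β2·x2))), which matches the symbols named in the text but is not reproduced from the paper, and it treats the "E" probability as the complement of the "C" probability in a single binary model. The linear SVM decision function of Eq. (3) is included for comparison. All function names and coefficient values are hypothetical.

```python
import math

def svm_decision(x, w, b):
    """Linear SVM score f(x) = sum_j w_j * x_j + b; f(x) >= 0 means the positive class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def logistic(x1, x2, beta0, beta1, beta2):
    """Standard logistic function over two attribute variables (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x1 + beta2 * x2)))

def lr_boundary_class(x1, x2, betas):
    """Apply Rule 1 / Rule 2: 'C' (continue sliding) or 'E' (end of boundary)."""
    p_continue = logistic(x1, x2, *betas)
    # Assumption: P(E) = 1 - P(C), so exactly one rule fires (ties go to 'C').
    return "C" if p_continue >= 0.5 else "E"

# Hypothetical coefficients, chosen only to show both outcomes.
betas = (-1.0, 1.5, 1.5)
print(lr_boundary_class(1, 1, betas))  # both Word-COs of the pair are present
print(lr_boundary_class(0, 0, betas))  # neither Word-CO is present
print(svm_decision([1, 0], [1.0, -2.0], 0.5))
```

Because the threshold is 0.5, Rule 1 is equivalent to checking that the linear term β0 + β1·x1 + β2·x2 is non-negative, which makes the LR rule directly comparable to the sign test on f(x) in the SVM case.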

Problem-solving relation learning
The Problem-Solving relation occurrences in the documents of this research contain several problem EDUs and several solving EDUs, which result in many problem-Word-CO features and many solving-Word-CO features, i.e. 197 different symptom-Word-CO features and 118 different treatment-Word-CO features. Hence, the research improves the determination of the correct Problem-Solving relation by applying a clustering technique to group similar problem objects and to reduce the number of solving-concept Word-CO features before learning the Problem-Solving relation. The research clustered the n samples of the posted problems on the web-board by using k-mean as shown in Eq. (6) (Aloise et al. 2009), where $k_1$ is the number of clusters for the problem object clustering and $k_2$ is the number of clusters for the solving feature clustering. $k_1$ and $k_2$ are predefined in the range from 2 to 10. The experts then selected $k_1 = 6$, $k_2 = 7$ for the DiseaseSymptom-Treatment relation learning and $k_1 = 5$, $k_2 = 6$ for the CarProblem-Repair relation learning.

In Eq. (6), $x_j$ is a problem-concept EDU vector, i.e. Dsym, of an object, $\langle v_{co1\text{-}1}w_{co1\text{-}1}, v_{co1\text{-}2}w_{co1\text{-}2}, \ldots, v_{co1\text{-}a}w_{co1\text{-}a} \rangle$, with j = 1, 2, …, n posted problems, and $\mu_k$ is the mean vector of the kth cluster. The $v_{co1\text{-}i}w_{co1\text{-}i}$ with the highest number of occurrences in each cluster is selected as the cluster representative. For example, the symptom cluster set (Y) {rhinorrhoea-based-cluster, abdominalPain-based-cluster, brainSymptom-based-cluster, …, nSymptom-based-cluster} is obtained in this research.

Equation (6) is also used to cluster the solving features, i.e. AT/RT, with $x_j$ taken as a Word-CO element. For example, $x_j$ is a Word-CO element, $v_{co2\text{-}i}w_{co2\text{-}i}$, of AT ∪ RT, with j = 1, 2, …, m Word-COs, $v_{co2}w_{co2}$. After clustering the treatment features, the general concept (based on WordNet and MeSH) of $v_{co2\text{-}i}w_{co2\text{-}i}$ with the highest number of occurrences in each cluster is selected as the cluster representative. The treatment cluster set (Z) {relax-based-cluster, foodControl-based-cluster, injectionControl-based-cluster, …, mTreatment-based-cluster} is then obtained in this research.
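The clustering step and the choice of a cluster representative can be sketched as below. This is a plain k-means loop under stated assumptions, not the authors' code: each posted problem is a binary vector over a small hypothetical Word-CO vocabulary, the initial means are picked by hand, and the representative is taken to be the Word-CO with the largest occurrence count inside the cluster. The number of clusters is fixed at 2 here only for illustration.

```python
def squared_distance(x, mu):
    """Squared Euclidean distance between a vector and a cluster mean."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def assign_cluster(x, means):
    """Index k of the nearest mean: arg min over k of the squared distance."""
    return min(range(len(means)), key=lambda k: squared_distance(x, means[k]))

def kmeans(vectors, means, iterations=10):
    """Alternate assignment and mean update until the means stop changing."""
    labels = []
    for _ in range(iterations):
        labels = [assign_cluster(x, means) for x in vectors]
        new_means = []
        for k, old in enumerate(means):
            members = [x for x, lab in zip(vectors, labels) if lab == k]
            if members:
                new_means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_means.append(old)  # keep the mean of an empty cluster
        if new_means == means:
            break
        means = new_means
    return labels, means

def representative(vectors, labels, k, vocabulary):
    """Word-CO with the highest occurrence count among the members of cluster k."""
    counts = [0] * len(vocabulary)
    for x, lab in zip(vectors, labels):
        if lab == k:
            counts = [c + xi for c, xi in zip(counts, x)]
    return vocabulary[counts.index(max(counts))]

# Hypothetical toy data: four posted problems over four Word-COs.
vocabulary = ["have+diarrhea", "vomit", "have+headache", "be+dizzy"]
vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
labels, means = kmeans(vectors, [list(vectors[0]), list(vectors[2])])
print(labels, representative(vectors, labels, 0, vocabulary))
```

In the paper the value of k is chosen by experts from the range 2 to 10, so in practice this loop would be run once per candidate k and the resulting clusters inspected.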
After clustering the extracted feature vectors from section "Feature extraction", the Problem-Solving relation, i.e. the DiseaseSymptom-Treatment relation, is learnt by using Weka to determine the probabilities of $y_1, \ldots, y_a, z_1, \ldots, z_h$ with the Class-type set of the DiseaseSymptom-Treatment relation, {'yes', 'no'}, where $y_1, \ldots, y_a \in Y$, $z_1, \ldots, z_h \in Z$, and h is max(b, c) from AT and RT. The Class-type set is specified on any five EDUs right after AT or RT. An element of the Class-type set is determined from the following set of Class-cue-word patterns.

Problem-solving relation extraction
The objective of this step is to recognize and extract the Problem-Solving relation from the test corpus by using Naïve Bayes. For example, the DiseaseSymptom-Treatment relation extraction by Naïve Bayes is shown in Eq. (7), with the probabilities of $y_1, \ldots, y_a, z_1, \ldots, z_h$ from the previous step and with the algorithm shown in Fig. 5.
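The Naïve Bayes decision for a single symptom/treatment group pair can be sketched as follows. It is an illustrative sketch only: the class-conditional probabilities are made-up numbers rather than values learned with Weka, the feature names reuse cluster labels mentioned earlier, the `dt=` feature name is hypothetical, and the small default probability for unseen features is an added assumption. Log probabilities are summed instead of multiplying raw probabilities to avoid underflow.

```python
import math

def naive_bayes_class(features, likelihoods, priors, unseen=1e-6):
    """Return the class maximising P(class) * product of P(feature | class)."""
    best_class, best_score = None, float("-inf")
    for cls, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            # Assumption: unseen features get a small default probability.
            score += math.log(likelihoods[cls].get(feature, unseen))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Hypothetical probabilities for illustration only.
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {
    "yes": {"abdominalPain-based-cluster": 0.5,
            "foodControl-based-cluster": 0.4,
            "relax-based-cluster": 0.1,
            "dt=FoodPoisoning": 0.3},
    "no": {"abdominalPain-based-cluster": 0.2,
           "foodControl-based-cluster": 0.1,
           "relax-based-cluster": 0.5,
           "dt=FoodPoisoning": 0.3},
}

pair = ["abdominalPain-based-cluster", "foodControl-based-cluster", "dt=FoodPoisoning"]
print(naive_bayes_class(pair, likelihoods, priors))
```

Each feature stands for one symptom cluster, one treatment cluster, or the disease topic, so the feature list for a document pair has one entry per cluster that appears in its Dsym and AT/RT groups.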
Moreover, the extracted DiseaseSymptom-Treatment relation of this step can be used to construct PSM as shown in Fig. 6.

$$ \mathrm{Cluster}(x_j) = \arg\min_{1 \le k \le K} \lVert x_j - \mu_k \rVert^2 \tag{6} $$

$$ \begin{aligned} \mathrm{SymTreat\_RelClass} &= \arg\max_{class \in Class} P(class \mid y_1, y_2, \ldots, y_a, z_1, z_2, \ldots, z_h, dt) \\ &= \arg\max_{class \in Class} P(y_1 \mid class) P(y_2 \mid class) \cdots P(y_a \mid class) P(z_1 \mid class) P(z_2 \mid class) \cdots P(z_h \mid class) P(dt \mid class) P(class) \end{aligned} \tag{7} $$

where $y_1, y_2, \ldots, y_a \in Y$ and Y is a problem/symptom cluster set; $z_1, z_2, \ldots, z_h \in Z$ and Z is a solving/treatment cluster set; dt = DiseaseTopic; and Class = {"yes", "no"}.

… car-repair-guru documents from the hospital's web-boards and the car-repair-guru web-boards, respectively. The test corpora, which consist of 500 EDUs for each disease category (Gastro-intestinal disease, Heart-Brain disease, and Childhood disease) and 500 EDUs for the GeneralCar-Problem category, are used to evaluate the feature extraction and the Problem-Solving relation extraction, based on the judgments of three experts with max win voting. Each category of the test corpora contains on average 30 posted problem-solving documents with several topic names. The feature extraction, as Problem Word-CO occurrences and Solving Word-CO occurrences, is evaluated as the problem EDU identification and the solving EDU identification, respectively. The feature extraction is also evaluated as the boundary determination of a problem-concept EDU boundary (i.e. Dsym and a carProblem-concept EDU boundary) and of a solving-concept EDU boundary (i.e. AT/RT and a repair-concept EDU boundary). The evaluations of the Problem Word-CO identification and the Solving Word-CO identification are based on the precision and the recall of using $VW_{problem}$ and $VW_{solving}$ to identify the problem-concept EDUs and the solving-concept EDUs, respectively.
In addition, the results of using three different models (ME, SVM, and LR) to learn the boundary of each EDU group (a problem-concept EDU group and a solving-concept EDU group) are evaluated by the correctness percentage of the EDU boundary determination (see Table 1).
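The evaluation protocol above can be illustrated with a short sketch. It assumes that "max win voting" means taking the label chosen by most of the three experts as the gold label, and it computes precision and recall for a binary decision such as "this EDU is a problem-concept EDU". The labels below are toy values for illustration and are not taken from the test corpora.

```python
from collections import Counter

def majority_vote(expert_labels):
    """Gold label = the label given by most experts (assumed meaning of max win voting)."""
    return Counter(expert_labels).most_common(1)[0][0]

def precision_recall(predicted, gold):
    """Precision and recall of the positive (True) class."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy judgments from three hypothetical experts for four EDUs.
expert_judgments = [
    (True, True, False),
    (False, False, True),
    (True, True, True),
    (False, False, False),
]
gold = [majority_vote(labels) for labels in expert_judgments]
predicted = [True, True, False, False]  # hypothetical system output
print(gold, precision_recall(predicted, gold))
```

With three experts and a binary label a tie cannot occur, so the majority label is always well defined.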
From Table 1, the average precisions of using $VW_{problem}$ and $VW_{solving}$ to identify the symptom-concept EDUs and the treatment-concept EDUs are 0.889 and 0.896, respectively, with average recalls of 0.769 and 0.852, respectively. The reason for the low recall of the symptom-concept-EDU identification is that a Word-CO of two adjacent words (after stop-word removal and word stemming), $v_{co1}w_{co1}$, is insufficient to cover some symptom concepts, e.g. 'รู้สึก/feel มี/there is อะไร/something กดทับ/pressing on หน้าอก/chest' (feel tight chest). Moreover, some Cause-Effect relation occurrences involve a problem Word-CO occurrence and reduce the precision of the symptom-concept-EDU identification, because EDUs are incorrectly identified as symptom-concept EDUs, as shown in the following topic name of the AbdominalDisease category (where the problem Word-CO occurrence, 'has + diarrhea', of EDU at-1 is an effect of taking the flatulence relief medicine but is not an abdominal disease symptom).

Table 1 also shows two boundary evaluations, of the problem-concept group (the problem-concept EDU boundary) and the solving-concept group (the solving-concept EDU boundary), in two different domains, a medical-healthcare domain and a car-repair domain. According to the disease categories in Table 1, each disease category shows …

The reason for the low recall in determining the Problem-Solving relation is the variation of the posted problems and solving steps between people under the same topic name. For example, for the 'Food Poisoning' topic name in the medical-healthcare domain, the variation of the posted food-poisoning symptoms is shown in the following sets: {'have a headache', 'have a colic', 'vomit', 'be dizzy'}, {'have diarrhea', 'have fever', 'be nauseated', 'vomit'}, {'have diarrhea', 'vomit'}, {'have diarrhea', 'have a colic'}, etc., which results in variation of the actual treatments.
Both the symptom variation and the actual treatment variation affect both the object clusters and the feature clusters in the relation learning step.

Conclusion
In this paper, we presented the extraction of a group-pair relation between two event-explanation groups, each expressed by several EDUs with boundary consideration, from downloaded documents. The group-pair relation that we addressed in our research is the Problem-Solving relation, i.e. a DiseaseSymptom-Treatment relation and a CarProblem-Repair relation, where disease symptoms and car problems are the problem-event explanation group, and the treatment steps and repair steps are the solving-event explanation group. Given the limited literature on determining semantic relations, particularly group-pair relations, from texts for preliminary problem diagnosis, our research extracted the group-pair relations as an explanation-based relation from web-board documents for preliminary problem solving. Our proposed method of extracting the group-pair/Problem-Solving relation from texts is based on two EDU vectors, a problem-concept EDU vector and a solving-concept EDU vector, where each EDU is represented by a Word-CO feature. Each Word-CO feature consists of a verb as the first word and, as the second word, the co-occurring word right after the first word, with either a problem-event concept or a solving-event concept. The evaluation shows that the accuracy of the Problem-Solving relation extraction depends on the corpus domain and also on the corpus behavior, i.e. the number of different Word-CO features, the number of Word-CO features, etc. In contrast to previous works where the relations occur within one sentence or one vector of sentences, our proposed approach (based on two vectors