
Detecting evolution of bioinformatics with a content and coauthorship analysis
SpringerPlus volume 2, Article number: 186 (2013)
Abstract
Bioinformatics is an interdisciplinary research field that applies advanced computational techniques to biological data. Bibliometric analysis has recently been adopted to understand the knowledge structure of a research field through its citation patterns. In this paper, we explore the knowledge structure of Bioinformatics from the perspective of a core open access Bioinformatics journal, BMC Bioinformatics, using trend analysis, content and coauthorship network similarity, and principal component analysis (PCA). Publications in four core journals, including Bioinformatics – Oxford Journal, and four conferences in Bioinformatics were harvested from DBLP. After converting publications into TF-IDF term vectors, we calculate the content similarity. We also calculate the social network similarity based on the coauthorship network by using an overlap measure between two coauthorship networks. Key terms are extracted and analyzed with PCA, and the coauthorship network is visualized. The experimental results show that Bioinformatics is fast-growing, dynamic and diversified. The content analysis shows an increasing topical overlap among Bioinformatics journals, and the coauthorship network similarity indicates that more research groups are participating in Bioinformatics research.
Background
Bioinformatics is an interdisciplinary field that involves research, development, or application of computational tools and methods for utilizing biological, medical, behavioral, or health data. Ouzounis and Valencia (2003) have provided a review of the early stages of the long history of the bioinformatics discipline. Recently, evolution and trends in bioinformatics research have been studied (Patra & Mishra 2006; Bhaskar et al. 2006; Perez-Iratxeta et al. 2007). The field has been characterized as an emerging discipline that has arisen from the needs of biologists to utilize and help interpret the vast amounts of data that are constantly being gathered in genomics, proteomics and functional genomics research.
Studying a particular research field by its publication pattern is the realm of bibliometric analysis, a research method used in library and information science. It utilizes quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh 1996). Bibliometric analysis has recently been applied to trace the development of the Bioinformatics field. Bansard et al. (2007) analyzed the bioinformatics and medical informatics literature to identify trends shared by both research fields, so that both could benefit from potential collaborative initiatives in the future. Their study shows that Bioinformatics and Medical Informatics are independent developments with limited overlap, although both undergo fast changes and apply advanced computer techniques to processing massive biological data. Huang et al. (2011) analyzed the citation patterns in bioinformatics journals by normalizing the journal impact factor available in the Journal Citation Report (JCR) published by Thomson Reuters. Glänzel et al. (2009) retrieved the core literature in bioinformatics by combining textual components with bibliometric, citation-based techniques. Janssens et al. (2007) analyzed the domain with text mining and bibliometrics-aided techniques, aiming to improve the classification of literature through the combination of linguistic and bibliometric tools. Ibáñez et al. (2009) applied supervised learning to predict the citation count of an article within the first few years after publication, arguing that such a tool would pave the way for new assessment systems. Jeong et al. (2009) investigated whether active members of conferences (conference organizers, keynote speakers, program committee members, etc.) are representative of scholars' prominence as measured by citation counts and h-index.
Over the past decades, Bioinformatics has expanded rapidly, and an understanding of the field has become of paramount importance. The goal of this paper is to detect the trends and the knowledge structure of the field of Bioinformatics with several approaches, including trend analysis, content and coauthorship network similarity, Principal Component Analysis (PCA) of keywords, and visualization of author coauthorship. To the best of our knowledge, no studies have analyzed Bioinformatics with coauthorship network analysis, a variation of social network analysis that uses authors as the units of analysis and is constructed by connecting pairs of researchers who author the same paper. This differs from an author cocitation network, which is constructed from pairs of authors cited together by a paper; the cocitation counts of author pairs are taken as a variable that indicates their “distances” from each other. Coauthorship networks have been widely used to study the structure of collaborations and the status of individual researchers (Liu et al. 2005; Newman 2004). For example, Cunningham (2001), Mutschke (2001), and Liu et al. (2005) have studied the structure of scientific collaborations in the digital library discipline. Most of the existing coauthorship network studies focus on topological features of a static coauthorship network, including centrality, largest component, diameter, clustering coefficient, average separation, and average number of collaborators (Newman 2004; Ding 2011; Milojević 2010). The content similarity helps to determine whether there are overlapping or related topics in different disciplines. The coauthorship network similarity helps us to understand the similarity of collaborative groups in different disciplines. A high coauthorship network similarity between two disciplines implies that there are collaborative groups participating in the scientific work of both disciplines.
These two disciplines are likely to have common interests and/or highly relevant topics. However, content similarity and coauthorship network similarity are not necessarily correlated. That is, two disciplines may have many common topics while the collaborative groups who work on these common topics in the two disciplines are not the same. The experimental results of our study confirm this possibility. We analyze keywords extracted from the datasets by PCA to understand what topics constitute the core literature. We also analyze the coauthorship network with visualization.
Methods
Problem definition and notation
We first present the definitions of publication and coauthor network.
Definition 1 (Publication)
A publication can be a peer-reviewed paper from any conference or journal. It has attributes including title, publication year, conference name (or journal name if it is published in a journal) and author list. A publication is the smallest unit in our study. We represent a publication $t_i$ by a tuple $\{W_{t_i}, N_{t_i}, \mathit{opt}_{t_i}\}$, where $W_{t_i}$ is the TF-IDF term vector of publication $t_i$. $W_{t_i} = \{w_1^{t_i}, w_2^{t_i}, \dots, w_{|t_i|}^{t_i}\}$ is formed by the terms from the title of $t_i$, where $w_j^{t_i}$ denotes the TF-IDF score of the $j$-th term of $t_i$. Similarly, $N_{t_i}$ is the coauthor network (defined below) associated with the publication $t_i$, and $\mathit{opt}_{t_i}$ is the publication year of $t_i$.
Definition 2 (Coauthor Network)
A coauthor network associated with a publication $t_i$ is a fully connected graph $N_{t_i} = \langle V_{t_i}, E_{t_i} \rangle$, where $V_{t_i} = \{p_1^{t_i}, p_2^{t_i}, \dots, p_{|V_{t_i}|}^{t_i}\}$ is the set of authors of publication $t_i$, and $E_{t_i}$ denotes the coauthor relationships between authors in $V_{t_i}$. Every two coauthors are connected, so the coauthor network is a fully connected network.
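Definition 2 can be illustrated with a minimal Python sketch. The function names and the representation of edges as `frozenset` pairs are illustrative assumptions, not part of the original method description; the sketch only shows that each author list yields a clique and that several such cliques can be merged by set union.

```python
from itertools import combinations


def coauthor_network(authors):
    """Return (nodes, edges) of the fully connected coauthor graph of one paper."""
    nodes = set(authors)
    # Every unordered pair of distinct coauthors becomes one undirected edge.
    edges = {frozenset(pair) for pair in combinations(sorted(nodes), 2)}
    return nodes, edges


def merge_networks(author_lists):
    """Union the per-publication graphs of several papers into one network."""
    nodes, edges = set(), set()
    for authors in author_lists:
        v, e = coauthor_network(authors)
        nodes |= v
        edges |= e
    return nodes, edges


# Hypothetical author lists of two publications.
papers = [["A", "B", "C"], ["B", "D"]]
nodes, edges = merge_networks(papers)
```

A single-author paper contributes a node but no edges, so it appears as an isolated node in the merged network.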
In this study, we are interested in the content similarity and social network similarity between BMC Bioinformatics and the other core Bioinformatics journals/conferences in a prespecified period of time. We define $A_x$ as a conference/journal $x$ in the time period $\mathit{opt}_{A_x}$. $A_x$ can be represented by a triple $\{W_{A_x}, N_{A_x}, \mathit{opt}_{A_x}\}$, where $W_{A_x}$ is the centroid of the TF-IDF term vectors of all publications of $x$ in the time period $\mathit{opt}_{A_x}$, $W_{A_x} = \frac{1}{|A_x|} \sum_{\mathit{opt}_{t_i} \in \mathit{opt}_{A_x}} W_{t_i}$; $N_{A_x}$ is the aggregated coauthorship network of all $N_{t_i}$ of the publications from $x$ with $\mathit{opt}_{t_i} \in \mathit{opt}_{A_x}$; and $\mathit{opt}_{A_x}$ is the prespecified period of time. For example, let $A_x$ represent the publications of BMC Bioinformatics in 2001. Then $W_{A_x}$ is the centroid of the TF-IDF term vectors of all BMC Bioinformatics publications during 2001, $N_{A_x}$ is the coauthorship network integrating all coauthorship networks of the BMC Bioinformatics publications in 2001, and $\mathit{opt}_{A_x}$ denotes the year 2001.
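As a rough illustration of the term vectors and their centroid, the following sketch builds sparse TF-IDF vectors from titles and averages them. The weighting variant (raw term frequency times $\log(N/\mathit{df})$), the whitespace tokenizer, and all function names are assumptions made for this example; the exact TF-IDF variant is not specified in the text.

```python
import math
from collections import Counter


def tfidf_vectors(titles):
    """Map each title to a sparse {term: weight} dict (raw tf, idf = log(N/df))."""
    docs = [Counter(title.lower().split()) for title in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles each term occurs.
    df = Counter(term for doc in docs for term in doc)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in doc.items()}
        for doc in docs
    ]


def centroid(vectors):
    """Average sparse vectors term by term; absent terms count as zero."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


# Hypothetical titles from one venue in one year.
titles = [
    "protein structure prediction",
    "gene expression clustering",
    "protein interaction network",
]
vectors = tfidf_vectors(titles)
w_venue = centroid(vectors)
```

Here `w_venue` plays the role of $W_{A_x}$ for a venue with three publications in the chosen period.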
Content similarity and social network similarity
As defined in the above section, the content of an individual publication is represented as a TF-IDF term vector and the content of a conference/journal is represented by the centroid of the term vectors of the publications from this conference/journal. Thus, the content similarity between two conferences/journals is the cosine similarity of their centroid term vectors, $\cos(W_{A_i}, W_{A_j})$. Content similarity measures the topical commonality between two conferences/journals in terms of the keywords used in these domains. Scientific articles published in two totally different journals seldom use similar keywords; for example, the literature in criminology has very few keywords in common with the literature in cancer research. In contrast, BMC Bioinformatics and Bioinformatics are two core journals in Bioinformatics, each with its own focus but with overlapping topics. By using content similarity, we measure how similar the literatures of different journals are in terms of the vocabularies adopted by the corresponding authors.
As defined above, $A_x$ is represented by the centroid of the term vectors of its publications. Thus, the content similarity between $A_i$ and $A_j$ is measured by the cosine similarity of their centroid term vectors, defined as:
$$\cos(W_{A_i}, W_{A_j}) = \frac{\sum_{k=1}^{n} w_k^{A_i} \, w_k^{A_j}}{\sqrt{\sum_{k=1}^{n} \left(w_k^{A_i}\right)^2} \, \sqrt{\sum_{k=1}^{n} \left(w_k^{A_j}\right)^2}}$$
where $w_k^{A_i}$ represents the $k$-th term of the centroid $W_{A_i}$. The numerator is the dot product of the TF-IDF term vectors $W_{A_i}$ and $W_{A_j}$, and the denominator is the product of the Euclidean lengths of $W_{A_i}$ and $W_{A_j}$. The effect of the denominator is to normalize $W_{A_i}$ and $W_{A_j}$ to unit vectors. The cosine similarity function measures the angle between two vectors in an $n$-dimensional space ($n$ is the number of terms in the term vector). The smaller the angle, the more similar the two TF-IDF term vectors are.
Content similarity measures the similarity of two journals in one aspect only; it cannot measure similarity in terms of the contributing authors and their collaboration networks. The literatures of two journals may have high content similarity while the contributing authors of the two journals are quite different. For example, the topics covered in Bioinformatics and Medical Informatics are very similar, yet the contributing authors in these two domains can be very different. They study similar diseases, but one focuses on prevention in the general public while the other focuses on findings in the treatment of these diseases. As a result, there is a high similarity in content but there may be a low similarity in terms of the contributing authors. In this work, we propose to measure the similarity of coauthorship networks to complement the content similarity analysis.
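The cosine similarity of two centroid vectors can be sketched as follows. Representing vectors as sparse dictionaries and returning 0.0 for an all-zero vector are choices made for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical centroid vectors of two venues.
w_a = {"protein": 0.5, "gene": 0.5}
w_b = {"protein": 1.0}
similarity = cosine_similarity(w_a, w_b)
```

Because TF-IDF weights are non-negative, the result lies between 0 (no shared terms) and 1 (identical direction).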
We propose the social network similarity analysis by considering the intersection of important authors involved in two journals. It is worth emphasizing that, rather than purely counting the number of overlapping authors in both journals, we assign a significance score to each individual author of a journal and measure the similarity by the significance scores of the contributing authors in the two journals. Two coauthorship networks are considered similar if the importance of the contributing authors is very close in both networks. If a contributing author is important in one coauthorship network but not as important in the other, this contributes to a lower social network similarity value. Two coauthorship networks will not be considered similar merely because they have a similar set of contributing authors. $\mathit{Overlap}(N_{A_i}, N_{A_j})$ is defined as:
where $\mathit{SS}(P_k^{A_m})$ represents the significance score of the author $P_k$ in $A_m$. In this work, we employ degree centrality to measure the significance of an author. In a coauthorship network, edges are undirected, so there is no difference between in-degree and out-degree. Given a node $i$, its degree centrality $\deg(i)$ equals the number of edges attached to this node, which reflects the importance of this node in the given network. As a result, in this work $\mathit{SS}(P_k^{A_i}) = \deg(P_k^{A_i})$ in the coauthorship network $N_{A_i}$. The larger the degree centrality of an author in a coauthorship network, the more coauthors this author has. $N_{A_i} \cap N_{A_j}$ denotes the intersection of authors in two coauthorship networks, and similarly $N_{A_i} \cup N_{A_j}$ denotes the union of authors in two coauthorship networks. From the definition above we can see that the larger the intersection of important authors, the higher the value of $\mathit{Overlap}(\cdot,\cdot)$. For example, if $N_{A_i}$ represents the coauthorship network of journal $i$ in 2002 and $N_{A_j}$ represents the coauthorship network of journal $j$ in 2002, $\mathit{Overlap}(N_{A_i}, N_{A_j})$ quantifies the coauthorship network similarity of these two journals in 2002.
In general, the social network similarity over time may increase, decrease, or remain unchanged. When the social network similarity increases, it may be due to the same group submitting to both journals in that particular year. It may also be due to other reasons, such as authors with high degree centrality in one journal also having high degree centrality in the other journal. In addition, the social network similarity is measured in each individual year. It does not accumulate over time, which means that the similarity in 2010 is not necessarily based on more data than that of the previous year.
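A degree-weighted overlap in this spirit can be sketched in Python. The sketch assumes a weighted-Jaccard form (sum of the smaller significance score over all authors, divided by the sum of the larger one); this form is an illustrative assumption and may differ from the exact definition of Overlap used here. Only the use of degree centrality as the significance score, and the intersection and union of authors, come from the text.

```python
from collections import Counter


def degree_centrality(edges):
    """Number of undirected edges attached to each author."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def overlap(edges_i, edges_j):
    """Assumed weighted-Jaccard overlap of two coauthorship networks."""
    ss_i = degree_centrality(edges_i)
    ss_j = degree_centrality(edges_j)
    authors = set(ss_i) | set(ss_j)  # union of authors that have coauthors
    # A Counter returns 0 for a missing author, so authors found in only
    # one network add nothing to the numerator but do enlarge the denominator.
    shared = sum(min(ss_i[p], ss_j[p]) for p in authors)
    total = sum(max(ss_i[p], ss_j[p]) for p in authors)
    return shared / total if total else 0.0


# Hypothetical coauthor edges of two journals in the same year.
edges_i = [("A", "B"), ("B", "C")]
edges_j = [("B", "C"), ("C", "D")]
score = overlap(edges_i, edges_j)
```

Under this assumed form, an author who is central in one network but peripheral in the other lowers the score, which matches the behaviour described above. Authors without any coauthor have degree 0 and do not affect the result.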
Based on the notation above, the similarity between $A_i$ and $A_j$, regularized by social network analysis, is defined as
$$\mathit{Sim}(A_i, A_j) = \epsilon \cdot \cos(W_{A_i}, W_{A_j}) + (1 - \epsilon) \cdot \mathit{Overlap}(N_{A_i}, N_{A_j})$$
where $W_{A_i}$ and $W_{A_j}$ are the centroids of the TF-IDF vectors associated with $A_i$ and $A_j$ respectively, $N_{A_i}$ and $N_{A_j}$ are the social networks associated with $A_i$ and $A_j$ respectively, and $0 \le \epsilon \le 1$.
When $\epsilon$ is 1, it measures only the content similarity. When $\epsilon$ is 0, it measures only the social network similarity.
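The two boundary cases are consistent with a linear interpolation between the two measures. A minimal sketch under that assumption follows; the function name and the argument check are illustrative.

```python
def combined_similarity(content_sim, network_sim, epsilon):
    """Blend content and coauthorship similarity with weight epsilon in [0, 1]."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie between 0 and 1")
    # epsilon = 1 keeps only the content term, epsilon = 0 only the network term.
    return epsilon * content_sim + (1.0 - epsilon) * network_sim


# Hypothetical similarity values for one pair of venues in one year.
balanced = combined_similarity(0.8, 0.1, 0.5)
```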
PCA of keywords and coauthor analysis in the field of Bioinformatics
First, we extract keywords from the datasets. Keywords of a journal within a one-year unit are selected based on their TF-IDF score. Given a set of publications from a journal within a selected time interval, we extract the title of each publication and aggregate the titles to represent the content of the journal in this time interval. Second, we preprocess the content by lowercasing, stemming, and removing stop words. Finally, we compute the TF-IDF score of each word of a journal within the time unit and rank the words by TF-IDF score in descending order. The top words are returned as the keywords of the journal in this time unit. After extracting the keyword lists, we apply PCA to them. PCA is widely adopted in multivariate statistics as an effective unsupervised dimension reduction method and has been extended in many different directions. The main justification for using PCA for dimension reduction is that it relies on singular value decomposition (SVD), which gives the best low-rank approximation to the original data in the L2 norm.
For coauthor analysis, we compute every pair of coauthors in the data collection. We then select the top 500 coauthor pairs to build an adjacency matrix. We use betweenness centrality to calculate the node distance. Betweenness centrality is a measure based on the number of shortest paths between any two nodes that pass through a particular node. Nodes around the edge of the network tend to have a low betweenness centrality, whereas a high betweenness centrality indicates that the individual connects different parts of the network.
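To make the dimension reduction step concrete, the sketch below finds the leading principal axis of a small data matrix by power iteration on the covariance matrix and projects the rows onto it. It is a pure-Python stand-in for an SVD-based PCA, not the implementation used in the study. The layout of the matrix (one row per keyword, one column per feature such as its TF-IDF score in a journal-year) is an assumption, since the text does not state it.

```python
import math


def top_component(rows, iterations=200):
    """Leading principal axis and row scores; needs at least two rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered columns.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Non-uniform start vector; power iteration can still fail in the rare
    # case that the start is exactly orthogonal to the leading axis.
    start_norm = math.sqrt(sum((j + 1) ** 2 for j in range(d)))
    axis = [(j + 1) / start_norm for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * axis[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance along the current direction
        axis = [x / norm for x in nxt]
    scores = [sum(c[j] * axis[j] for j in range(d)) for c in centered]
    return axis, scores


# Hypothetical keyword-by-feature matrix whose rows lie on the line y = x.
rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
axis, scores = top_component(rows)
```

Plotting the scores on the first two principal axes gives the kind of two-dimensional keyword map discussed in the Results; the second axis would require deflating the covariance matrix and repeating the iteration.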
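The coauthor analysis can be sketched in two steps: ranking coauthor pairs and computing betweenness centrality on the resulting graph. Ranking pairs by the number of papers they share is an assumption (the text does not state the ranking criterion), and the betweenness routine follows Brandes' standard algorithm for unweighted, undirected graphs rather than any specific tool used in the study.

```python
from collections import Counter, defaultdict, deque
from itertools import combinations


def top_coauthor_pairs(author_lists, k=500):
    """The k author pairs that share the most papers (assumed ranking)."""
    counts = Counter()
    for authors in author_lists:
        counts.update(combinations(sorted(set(authors)), 2))
    return [pair for pair, _ in counts.most_common(k)]


def betweenness_centrality(edges):
    """Unnormalized betweenness of each node in an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    score = {node: 0.0 for node in adj}
    for source in adj:
        # Breadth-first search: count shortest paths and record predecessors.
        dist = {source: 0}
        sigma = defaultdict(float)
        sigma[source] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes inward.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair of endpoints was visited from both sides.
    return {node: value / 2.0 for node, value in score.items()}


# Hypothetical author lists; "B" bridges "A" and "C".
pairs = top_coauthor_pairs([["A", "B"], ["B", "C"], ["A", "B"]], k=500)
centrality = betweenness_centrality(pairs)
```

In this toy graph only "B" lies on a shortest path between two other authors, so it is the only node with non-zero betweenness.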
Results
In this section, we describe the dataset used in this study. As introduced in previous sections, the main focus of this paper is to analyze the trends and the content and social network similarity between BMC Bioinformatics and other core Bioinformatics literature.
Data collection
We constructed the dataset by extracting publication data from DBLP. The dataset consists of 16,061 peer-reviewed papers in Bioinformatics areas from 2001 to 2010. We extracted the title, publication year, author list, and conference proceedings name or journal name for each individual paper. It is important to note that our dataset covers the publications in four major conferences and four major journals in this area from 2001 to 2010. Specifically, we extracted papers from the following conferences: International Conference on Intelligent Systems for Molecular Biology (ISMB), Pacific Symposium on Biocomputing (PSB), International Conference on Bioinformatics & Computational Biology (BIOCOMP), and IEEE International Symposium on Biomedical Imaging (ISBI). In addition, we extracted papers from the following journals: Bioinformatics, Journal of Biomedical Informatics, PLoS Computational Biology and BMC Bioinformatics. We then empirically chose one year as a unit and divided the dataset into 10 non-overlapping time intervals. It is worth mentioning that, at the time we collected our dataset, some conference information was not completely available in DBLP. For example, DBLP only provides bibliography data for BIOCOMP from 2006 to 2010; ISMB 2009 and 2010 are unavailable; PLoS Computational Biology from 2001 to 2004 is unavailable; similarly, ISBI 2001 and 2003 are unavailable in DBLP. Last but not least, name disambiguation is an important yet very challenging preprocessing step in bibliometrics. DBLP employs a simple heuristic method to differentiate homonymous persons. Besides the automatic method, DBLP developers also devote daily manual effort to further alleviating the ambiguous name problem. However, as mentioned in [DBLP: Some Lessons Learned], in many cases homonyms remain undetected. Name disambiguation is out of the scope of this paper. We simply used the disambiguated names provided by DBLP, although they may not be perfect.
Tables 1 and 2 summarize the basic statistics of the data collection.
The trend in the coauthorship networks of Bioinformatics
Before we examine the content similarity and social network similarity, we first look at the trend in the coauthorship networks of Bioinformatics journals and conferences. Figure 1 presents the number of components in the coauthorship networks of Bioinformatics journals and conferences over 10 years.
A social network component is a subgraph that is connected within but disconnected from other subgraphs. That means a node $n_i$ in a social network component must have a path to all other nodes in the same component but does not have a path to any node of other components.
A node $n_i$ is part of a component $c_j$ as long as there is a link between $n_i$ and any one node of $c_j$. Therefore, a component of a social network can be easily identified iteratively, starting from a node and following its links until no other nodes can be found.
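The iterative identification of components described above can be sketched as a graph traversal. The function name and the input format (a node set plus a list of undirected edges) are assumptions for this example.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into components by repeatedly following links."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Grow one component from `start` until no new node can be reached.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adj.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Hypothetical network: one group of three, one pair, one solo author.
nodes = {"A", "B", "C", "D", "E", "F"}
edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(nodes, edges)
```

The number of components is `len(components)`, and the number of authors per component follows from the size of each set.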
As shown in Figure 1, the Bioinformatics journal has the largest number of components in 2001-2007, and the number increased consistently during this time frame; the number of components in BMC Bioinformatics and PLoS Computational Biology also increased steadily between 2001 and 2007. Journal of Biomedical Informatics (JBI) and all four conferences (ISMB, PSB, BIOCOMP, and ISBI) have a small number of components and do not change as much as the other three journals. In 2007-2010, the number of components in BMC Bioinformatics is larger than that in Bioinformatics. This indicates that the number of collaborative groups in Bioinformatics journals, except for JBI, has increased substantially over the ten-year period. BMC Bioinformatics shows a particularly rapid and consistent increase during the entire period of the experiments. In BMC Bioinformatics, the biggest increase occurred in 2003-2005, when the number of components rose from 185 to 548. It is important to note that the number of components can be affected by the number of papers published in a period. When the number of papers published in a period is high, the probability of having a higher number of components is also higher.
Figures 2 and 3 present the number of papers per component and the number of authors per component, respectively, in the coauthorship networks of the four journals and four conferences over ten years. BMC Bioinformatics has the largest number of papers per component in the journal category and ISMB has the largest number of papers per component in the conference category. However, the differences observed in the number of papers per component were not statistically significant. For the number of authors per component, Bioinformatics has the largest number, followed by BMC Bioinformatics, in the journal category. In the conference category, ISBI has the largest number of authors per component. The differences observed in the number of authors per component were not statistically significant.
In spite of the rapid growth in the number of components in BMC Bioinformatics, the number of papers per component and the number of authors per component in BMC Bioinformatics do not show the same increasing trend. Overall, we observed similar patterns in the number of authors and in the number of papers per component in both journals and conferences, with a moderate increase over 2001-2010. In addition, the experimental results show a marginal deviation in the number of papers per component, which ranges from 2.93 to 3.63 on average. The number of authors per component shows a bigger deviation than the number of papers, ranging from 4.21 to 7.06 on average. Both the lowest and the highest averages belong to conferences (BIOCOMP is the lowest and ISBI is the highest). The results of the trend analysis indicate rapid growth in the number of components, while the increase in the number of papers and authors within a component is not as high as the increase in the number of components. This implies that collaborative groups in Bioinformatics have become diversified and departmentalized, which in turn presumably results from more researchers entering the field with specific research interests and backgrounds.
Similarity analysis based on content and Coauthorship network
We also conducted similarity analysis among conferences, between conferences and journals, and among journals in terms of content and coauthorship network similarity. Figures 4 and 5 show the relationships among these three categories by content similarity and coauthorship network similarity, respectively.
Figure 4 illustrates the content similarity among journals, between conferences and journals, and among conferences in ten consecutive periods of time. It shows a dramatic increase in content similarity among journals over the ten-year period. The similarity between journals and conferences increases sharply between 2005 and 2006, and then the curve fluctuates from 2006 to 2011. The relationship among conferences shows low content similarity, with a peak below 0.2 in 2008.
As illustrated in Figure 5, the overall coauthorship network similarity is considerably lower than the content similarity. In particular, the coauthorship similarity among conferences is very low. The similarity among journals increases dramatically from 2003 to 2006 and then stabilizes at around 0.1. The similarity between journals and conferences fluctuates between 0.02 and 0.06 over the ten-year span. Figure 6 presents the combination of content and coauthorship network similarity among journals, between conferences and journals, and among conferences in ten consecutive periods of time. When both the content and the coauthorship network similarity are taken into consideration, the relationship among conferences shows the same low similarity pattern observed in the content and the coauthorship network similarity separately. The experimental results indicate that 1) the main themes and topics of conferences are somewhat different from the topics covered in journals and 2) the contributing groups of journals and conferences do not overlap much. This implies that the field of Bioinformatics is becoming diversified. In addition, there is a high content similarity among journals, whereas the coauthorship network similarity is not as high as the content similarity.
Identification of keywords and key authors in the field of Bioinformatics
Figures 7 and 8 show the results of applying PCA to keywords and the visualization of the coauthorship network extracted from the datasets, respectively. As illustrated in Figure 7, keywords are grouped into three distinct clusters. In the upper left corner of the two-dimensional space, there are terms related to Systems Biology such as diverg (divergent), key (key), trace (trace), cytoscap (Cytoscape), illumina (Illumina), cgh (CGH), sbml (SBML), channel (Channel), interactom (Interactome), sirna (siRNA), coeffici (coefficient), and block (block). To reduce term variation, we applied stemming to the terms; the term within parentheses above is the original term appearing in the text. In the lower right corner, there appear terms related to medical imaging such as coher (coherent), enhanc (enhancement), tissu (tissue), mri (MRI), ct (CT), ultrasound (ultrasound), nonrigid (non-rigid), echocardiographi (echocardiography), morpholog (morphology), threedimension (three-dimension), and simultan (simultaneous). In the lower left corner, there is a dense cluster in which most terms are grouped. The core of the cluster consists of terms related to Bioinformatics including synapt (synaptic), neuron (neuron), plastic (plastic), algori (algorithm), care (care), proteinligand (protein-ligand), database (database), mirna (miRNA), and metagenom (metagenomics). On the right side of the cluster, there are terms related to medical informatics including vertebr (vertebrate), network (network), human (human), extract (extract), cancer (cancer), inform (information), semisupervis (semi-supervised), gap (gap), diabet (diabetes), and bayesian (Bayesian).
Figure 8 shows the visualization of coauthorship with the betweenness centrality measure. Several disconnected subgraphs indicate that, in Bioinformatics, a variety of segmented research groups work on similar research topics. The biggest connected graph is located in the lower right corner; its main subject is computational biology, and it includes researchers such as Masao Nagasaki, Arthur W. Toga, and Lei Xie. This subgraph also includes bio-imaging researchers such as Paul M. Thompson and Agatha D. Lee. A sparsely connected subgraph in the middle includes researchers such as Ivo L. Hofacker, who is interested in RNA Bioinformatics, Kathleen Marchal in Molecular Biology, and Lawrence Hunter in Computational Pharmacology. Broadly speaking, researchers located in the middle subgraph specialize in computational biology.
The experimental results show that Bioinformatics is fast-growing, dynamic and diversified, which confirms the findings of Huang et al. (2011) and Bansard et al. (2007). In addition, the experimental results yield several interesting findings. Our analysis shows a substantial growth of collaborative groups in journals, which can be attributed to the solid increase in papers published in these journals. Such a growth trend is not observed in conferences. With respect to content similarity, the comparison among journals indicates a steady increase in ten consecutive periods of time. This implies that the works published in these journals have become more similar than before. From the perspective of coauthorship network similarity, no uniform pattern was observed across the three comparisons. The social network similarity between conferences is very low, and the similarity between conferences and journals is also low. The coauthorship network similarity among journals shows a steady increase until 2006 and then saturates. The content similarity between journals is relatively high, and yet the coauthorship network similarity between these journals is moderately low. That means the collaborative groups are contributing very similar work to Bioinformatics journals, but the contributing groups are not completely the same across these journals. This can be attributed to the different properties of the communities around these journals. In the future, it will be interesting to identify the properties of these communities in order to understand how these journals differ.
Conclusions
The scientific literature is continuously developing. To gain a better understanding of such development, we conduct the trend analysis, the content and the coauthorship analysis, and PCA of keywords and visualization of coauthorship in the Bioinformatics research domain. The content similarity helps us to understand the development of topical similarity between different journals/conferences and the coauthorship network similarity helps us to understand the similarity of collaborative groups between different journals/conferences. In this work, we find that the field of Bioinformatics keeps growing. More researchers enter the field and collaborate with others although the collaboration rate is not as high as the publication growth in the field. It is also found that bioinformatics related journals are highly similar in terms of contents. It is interesting to note that there is the moderate increase in content similarity between conferences and journals but have fluctuation in terms of coauthorship network similarity. It indicates that both journals and conferences cover some overlapping topics but the contributing collaborative groups are dynamic and quite not similar. In addition, we find that there are three distinct clusters by PCA that is applied to the keyword list. Visualization of the coauthorship network reveals that several disjoint research groups that study the similar topics. This implies that the community is big and sparse so that they do not have a chance to collaborate with each other. Another interpretation is that it is the closed community so that the collaboration among different research groups does not frequently occur.
This work illustrates how content similarity and coauthorship network similarity supplement each other in bibliometric studies, which is useful for understanding the development of the scientific literature of Bioinformatics. In the future, we plan to investigate coauthorship network similarity across consecutive time periods further, so that we may understand how collaborative groups develop within a discipline. For example, we may learn whether there is a dominating collaborative group and how such a group develops over consecutive time periods.
References
Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D, Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux JL: Medical informatics and bioinformatics: a bibliometric study. IEEE Trans Inf Technol Biomed 2007, 11(3):237-243.
Bhaskar H, Hoyle D, Singh S: Machine learning in bioinformatics: A brief survey and recommendations for practitioners. Comput Biol Med 2006, 36(10):1104-1125. 10.1016/j.compbiomed.2005.09.002
Cunningham SJ: The birth of a discipline: An analysis of the 1994-2000 ACM digital libraries conferences. Proceedings of the 8th International Conference on Scientometrics and Informetrics 2001.
Ding Y: Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of Informetrics 2011, 5(1):187-203. 10.1016/j.joi.2010.10.008
Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79: 109-129. 10.1007/s11192-009-0407-1
Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. J Am Soc Inf Sci Technol 2011. 10.1002/asi.21707
Ibáñez A, Larrañaga P, Bielza C: Predicting citation count of Bioinformatics papers within four years of publication. Bioinformatics 2009, 25(24):3303-3309. 10.1093/bioinformatics/btp585
Janssens F, Glänzel W, De Moor B: Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. Proc. of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07) 2007, San Jose, California, Aug 2007, 360-369.
Jeong S, Lee S, Kim HG: Are you an invited speaker? A bibliometric analysis of elite groups for scholarly events in bioinformatics. J Am Soc Inf Sci Technol 2009, 60(6):1118-1131. 10.1002/asi.21056
Liu X, Bollen J, Nelson ML, Van de Sompel H: Coauthorship networks in the digital library research community. Inf Process Manag 2005, 41: 1462-1480. 10.1016/j.ipm.2005.03.012
Milojević S: Modes of collaboration in modern science: Beyond power laws and preferential attachment. J Am Soc Inf Sci Technol 2010, 61(7):1410-1423. 10.1002/asi.21331
Mutschke P: Enhancing information retrieval in federated bibliographic data sources using author network based stratagems. Proceedings of Research and Advanced Technology for Digital Libraries: 5th European Conference 2001, 2163: 287-299. 10.1007/3-540-44796-2_25
Newman MEJ: Coauthorship networks and patterns of scientific collaboration. Proc Natl Acad Sci U S A 2004, 101: 5200-5205. 10.1073/pnas.0307545100
Osareh F: Bibliometrics, Citation Analysis and Co-Citation Analysis: A Review of Literature I. Libri 1996, 46: 149-158.
Ouzounis C, Valencia A: Early bioinformatics: the birth of a discipline - a personal view. Bioinformatics 2003, 19(17):2176-2190. 10.1093/bioinformatics/btg309
Patra SK, Mishra S: Bibliometric study of Bioinformatics Literature. Scientometrics 2006, 67: 477-489.
Perez-Iratxeta C, Andrade-Navarro MA, Wren JD: Evolving research trends in bioinformatics. Brief Bioinform 2007, 8(2):88-95.
Acknowledgements
This work was supported by National Research Foundation of Korea Grant funded by the Korean Government (NRF20122012S1A3A2033291) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. 2012033242).
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MS led the collaboration to design and conduct the experiments and to write up the paper. CCY supervised XT and participated in writing up the paper. XT collected the datasets for the experiments and developed the modules for content analysis and network analysis. All authors read and approved the final manuscript.
Christopher C Yang contributed equally to this work.
Keywords
 Betweenness Centrality
 Collaborative Group
 Medical Informatics
 Content Similarity
 Network Similarity