- Open Access
Influence analysis of Github repositories
© The Author(s) 2016
- Received: 14 December 2015
- Accepted: 22 July 2016
- Published: 5 August 2016
With the support of cloud computing techniques, social coding platforms have changed the style of software development. Github is now the most popular social coding platform and project hosting service. Software developers of various levels keep joining Github and use it to host their public and private software projects. The large numbers of software developers and software repositories on Github pose new challenges to the world of software engineering. This paper tackles one of the important problems: analyzing the importance and influence of Github repositories. We propose a HITS-based influence analysis on graphs that represent the star relationship between Github users and repositories. A weighted version of HITS is applied to the overall star graph, and it generates a set of top influential repositories that differs from the results of the standard HITS algorithm. We also conduct the influence analysis on per-month star graphs, and study the monthly influence ranking of top repositories.
- Social coding
- Influence analysis
The rapid development of social coding tools is leading to a revolution in software product development. Social interactions have become an important factor in the evaluation of the software development process.
Version control systems (VCS) are an essential part of a social coding platform. Nowadays, various VCS tools, e.g. CVS, SVN and Git, are frequently used by software development teams. With them, decentralized team work is possible, and the development process becomes more productive. Software developers can work on their own versions, and submit changes to the VCS. Different versions of the software are managed by the VCS, and potential conflicts are avoided.
Early VCS tools were used only by relatively small software development teams, and were mostly deployed within local area networks, such as company LANs. The number of projects maintained within those early systems was also relatively small. Because Git makes distributed coding collaboration easier, it has been gaining popularity.
With the recent advances in Internet and cloud computing technology, distributed social coding has received a big boost. Popular social coding platforms can now host millions of software projects, and more and more people accept the idea of “social coding”. Contributions to a software development process are most likely to be made by a distributed, collaboration-motivated virtual community. Software developers across the world can take part in the same software project, modifying different parts of the code and generating different branches in the project source tree. A software team no longer has explicit boundaries. A software project may be developed by an ever-changing set of software engineers, and a software engineer may contribute to a set of different software projects hosted on a remote server.
Social coding has tremendously changed the style of software development activities. The social network of software developers continuously interacts with the life cycle of software projects. Several social coding platforms help software engineers around the world contribute to software projects together. Distributed development tools, e.g. Git, act as the foundation of social coding platforms. Based on Git, the Github platform has attracted many developers to work on millions of open source software projects. In Github, projects have evolved into repositories, which carry more information. The number of Github users and repositories keeps growing.
Github is not only a host of software projects, but also a data source that records software development activities. Many researchers perform analysis on Github Repositories and Github data. Some investigate the collaboration of Github users based on their activities on repositories (Avelino et al. 2015; Jurado and Marín 2015; Lima et al. 2014; Vasilescu et al. 2015b). Some study language importance, or predict the trends of popular programming languages (Casalnuovo et al. 2015; Ray et al. 2014).
As Github is an open social coding platform, there are no restrictions on the creation of new users and repositories. New developers keep coming to Github, and new public repositories are created all the time. Picking out capable or influential developers from millions of Github users is therefore becoming an increasingly important issue. Naturally, the expertise level of a developer is judged by the quality of the repositories they own, and by their contributions to Github repositories. Ranking the importance of Github repositories is thus necessary for the evaluation of the Github ecosystem.
In Github, each repository is associated with a set of meta information. The size of the repository, the set of people who starred the repository, etc., are provided by the open Github API. The direct ranking of Github repositories based on size, number of stars, and number of forks has been studied. However, the ranking of repositories considering social relations in the Github platform has not been studied yet.
In this paper, we analyze the importance of Github repositories by considering the social relationship between users and repositories. We consider two important features of Github repositories: star and fork. We use the star relationship to create a star graph, and apply social analysis algorithms on the star graph. The results are then analyzed and the social influence factors of Github repositories are calculated.
We built a data acquisition module, which collects Github data from multiple data sources. The retrieved data is processed, and used to build the important social graphs.
We proposed a HITS based repository influence analysis, on the star graph constructed from the star relationship between Github users and repositories.
We evaluated the weighted version of the HITS algorithm. By comparing the results, we found that a more reasonable ranking is generated by combining the fork number and the star relationship.
We proposed a language-specific analysis, and evaluated the difference of the programming language influence on Github repositories.
In this paper, we analyze the importance of software repositories using social analysis techniques. In this section, we will present some background information, including link analysis, social coding platform, and the Github timeline data.
Link analysis algorithms
The basic idea in this paper is to perform social influence analysis on Github repositories using link analysis techniques. Link analysis was first used in ranking web pages. HITS and PageRank are the two major link analysis algorithms, which we explain in some detail.
PageRank is a link analysis algorithm used to rank the result pages of the Google search engine (Kaplan 2008). PageRank was named after one of the founders of Google, Larry Page.
PageRank is a way of measuring the importance of web pages. Google's definition: “PageRank works by counting the number and quality of links to determine a rough estimate of how important the web site is. The underlying assumption is that more important web sites are likely to receive more links from other web sites”.
The ranks of pages are calculated iteratively until the result converges.
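The iterative computation can be illustrated with a minimal power-iteration sketch. The damping factor of 0.85, the tolerance, the handling of pages without out-links, and the toy link graph are illustrative assumptions, not values taken from this paper.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Each page shares its rank equally among its out-links.
                new_rank[t] += damping * rank[p] / len(targets)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:  # stop once the ranks no longer change
            break
    return rank


# Hypothetical toy web graph: page "c" receives the most links.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
```

In this toy graph, page "c" is linked to by three other pages and ends up with the highest rank, while page "d", which nobody links to, ends up with the lowest.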
Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm proposed in 1999 by Dr. Jon Kleinberg of Cornell University (Kleinberg 1999). The HITS algorithm divides Web pages into two types, namely hub pages and authority pages. The authority pages are generally recognized as the important pages on a particular topic. The hub pages, which can be regarded as pages that evaluate other pages, link to a collection of authority pages on a particular topic. There is a mutually reinforcing relationship between authority pages and hub pages: a good authority page should be pointed to by many hub pages, while a good hub page should point to many authority pages. The HITS algorithm makes use of this mutually reinforcing relationship and obtains the page ranks through an iterative computation loop. During the iterative computation, the authority weights and hub weights are recalculated and updated until the values converge.
We adopt the HITS algorithm as the basic social analysis technique, and improve it by using Github meta information as weights.
Social coding platform
Collaborative coding tools, including CVS, SVN and Git, have changed the ways of software development. Social coding platforms built on them have become containers for collaboration among software developers on software repositories. Several social coding platforms, including SourceForge and GoogleCode, have contributed to the prosperity of open source projects. As more and more people become used to maintaining code with Git, the Git-backed code hosting platform now attracts millions of developers to put their software projects there.
Github is a Web-based Git repository hosting service, which offers all of the distributed version control and source code management (SCM) functionality of Git. Github provides a Web-based graphical interface. It also provides access control and several features such as bug tracking, feature requests, task management, and wikis for every project. Github provides star and fork functionalities, through which Github users and repositories form a real social network.
Although there are other hosts of open source projects that also advocate social coding, like Bitbucket and Gitorious, Github is still the most popular one.
In Feb 2012, Github publicly announced that its timeline data was available on BigQuery for analysis. Moreover, it offered prizes for the best visualization of the data.
Github provides the social interaction data for free. It faithfully records the important actions a Github user performs on repositories. A clean API is provided for interested people to access the event data. Project timelines can be constructed from those Github events. This functionality makes Github even more popular, not only as a software project hosting service, but also as a target of software engineering research.
As the Github platform is becoming popular, analyzing the social activities on the Github platform is a new trend in software engineering (Lima et al. 2014). People observe user activities on Github repositories, and analyze Github repository features to gain insights into the Github data. Two broad categories of research work are closely related to the work in this paper: user collaboration, and repository analysis.
Hauff and Gousios (2015) observe the activities of users on Github, and conduct a quantitative analysis of users’ skills and interests based on the observation. Casalnuovo et al. (2015) take a step further, and try to relate the social links between users and users’ language experience to the productivity of developers. The following relationship demonstrates a user’s interest in other Github users. Yu et al. (2014) mine follow networks, and discover several social patterns on Github. People are also interested in other social features of Github users, e.g. leadership, team diversity, and gender diversity. McDonald et al. (2014) explore the concepts of distributed leadership, and propose a theory of leadership sharing, to support a model of developer contribution to open source projects. Vasilescu et al. (2015b) present a large data set of social diversity attributes of programmers in Github teams, for researchers to study the effect of team diversity in decentralized teams. Vasilescu et al. (2015a) also study the correlation of gender and tenure diversity with team productivity. Their results show that gender and tenure diversity are positive predictors of productivity.
As Github repositories are important assets of Github users, their popularity and quality are strong indicators of their owners’ capability. Therefore, the analysis of Github repositories has become an important research branch. Researchers have studied various features of Github repositories, trying to analyze them from different aspects. Jurado and Marín (2015) perform a study of the project issues of Github repositories. They observe the sentiment aspects of Github project issues. Yu et al. (2015) study pull requests, and discuss the complex issue of pull request evaluation latency on Git-enabled social coding platforms. Avelino et al. (2015) study the truck factor of popular Github repositories. A project’s truck factor is the number of developers it would need to lose to destroy its progress. Cosentino et al. (2014) evaluate the openness of Github projects with three metrics: the distribution of the project community, the rate of acceptance of external contributions, and the time it takes to become an official collaborator of the project. Tsay et al. (2014) study how to evaluate contributions on Github.
Recent works on Github analysis have revealed many secrets in Github data. However, we found that more efforts should be made to combine social interactions and Github repository features, in order to give a reasonable ranking of Github repositories.
There has been work on evaluating the popularity of Github users (Xavier et al. 2014). We focus on analyzing the popularity (influence) of Github repositories. Similar to the work on evaluating the effect of programming languages on open source projects (Ray et al. 2014), we build language-specific social graphs, and conduct language-specific analysis to get the per-language repository influence. People are also interested in the dynamics of Github data. Loyola and Ko (2014) evaluated how the contributor groups on a Github project evolve over time. Considering the evolving nature of Github activities, we also perform an evolutionary study of repository influence ranking.
In this section, we present the HITS based social influence analysis for Github repositories. The details of the HITS based analysis are presented. Apart from the basic HITS analysis, we also discuss how to perform language-specific analysis, and how to use the Github meta information to build weighted HITS analysis of Github repositories.
HITS based analysis
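\[ a(v) = \sum_{u \,:\, (u, v) \in E} h(u) \]
\[ h(v) = \sum_{w \,:\, (v, w) \in E} a(w) \]
Here a(v) and h(v) denote the authority and hub scores of node v, and E is the set of directed edges of the star graph.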
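The two update rules can be sketched in a few lines of plain Python. This is a minimal illustration, not the SNAP-based implementation used in the experiments; the fixed iteration count, the Euclidean normalisation, and the toy user and repository names are assumptions.

```python
import math


def hits(edges, iterations=50):
    """Basic HITS on a list of directed (source, target) edges."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the nodes pointing to v.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub update: sum the authority scores of the nodes u points to.
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalise with the Euclidean norm so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical star graph: an edge (user, repo) means "user starred repo".
star_edges = [
    ("alice", "repo_x"), ("bob", "repo_x"), ("carol", "repo_x"),
    ("alice", "repo_y"), ("bob", "repo_z"),
]
hub, auth = hits(star_edges)
top_repo = max(auth, key=auth.get)
```

On a star graph, only repositories receive authority scores and only users receive hub scores, so the repository ranking is read from `auth`.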
To perform topic distillation, we create language-specific star graphs for the major languages. We then apply the HITS algorithm on those language-specific graphs to analyze the influence of repositories in different programming languages.
The database userMetaDb is later used to extract user ids and user names.
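Building the language-specific graphs amounts to partitioning the star edges by the language of the starred repository. The sketch below assumes a hypothetical mapping `repo_language` from repository name to its main language (for example, taken from the repository meta information); the function name and the toy data are illustrative only.

```python
from collections import defaultdict


def split_star_edges_by_language(star_edges, repo_language):
    """Group (user, repo) star edges by the language of the starred repo."""
    graphs = defaultdict(list)
    for user, repo in star_edges:
        language = repo_language.get(repo)
        if language is not None:  # skip repositories without a known language
            graphs[language].append((user, repo))
    return dict(graphs)


# Hypothetical toy data.
star_edges = [("alice", "repo_x"), ("bob", "repo_x"), ("bob", "repo_y"), ("carol", "repo_z")]
repo_language = {"repo_x": "JavaScript", "repo_y": "Python"}
per_language = split_star_edges_by_language(star_edges, repo_language)
```

Each per-language edge list can then be ranked separately with any HITS implementation.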
Weighted HITS analysis
For the Github ecosystem, treating all links of the user-star-repository relationship equally may not be appropriate, because the importance of Github repositories varies. Important factors include the fork count and the size of a repository.
We can use features of Github repositories as weights and perform a weighted HITS analysis. Line 26 of Algorithm 3 is changed to “node.hub = w”, so that the node’s hub value is initialized with the weight w.
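The sketch below does not reproduce Algorithm 3. It illustrates a different, commonly used way to bring repository features into HITS: each star edge carries the weight of the starred repository, and both update steps multiply by that weight. The function name, the logarithmic scaling of fork counts, and the toy data are all assumptions made for illustration.

```python
import math


def weighted_hits(edges, repo_weight, iterations=50):
    """Edge-weighted HITS: edge (user, repo) has weight repo_weight[repo]."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update, scaled by the weight of the starred repository.
        auth = {n: 0.0 for n in nodes}
        for user, repo in edges:
            auth[repo] += repo_weight.get(repo, 1.0) * hub[user]
        # Hub update, scaled by the same edge weight.
        hub = {n: 0.0 for n in nodes}
        for user, repo in edges:
            hub[user] += repo_weight.get(repo, 1.0) * auth[repo]
        # Normalise so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical data: both repositories have two stars, but very different fork counts.
star_edges = [("alice", "repo_x"), ("bob", "repo_x"), ("alice", "repo_y"), ("carol", "repo_y")]
fork_counts = {"repo_x": 10, "repo_y": 1000}
weights = {repo: math.log1p(forks) for repo, forks in fork_counts.items()}

hub, auth = weighted_hits(star_edges, weights)
plain_hub, plain_auth = weighted_hits(star_edges, {})  # all weights default to 1.0
```

With equal weights the two repositories tie, whereas the fork-based weights move the heavily forked repository ahead. The same function accepts size-based weights in place of fork-based ones.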
Improve HITS algorithm with repository’s fork information
The fork count of a Github repository is regarded as one of the most important indicators of the popularity of that repository.
The experimental setup for the Github social influence analysis is presented in this section. Firstly, the setup of the data collector and the experimental environment is explained in detail. Secondly, we present the results of the basic HITS analysis and the weighted analysis on the complete star graph. Thirdly, we present the monthly influence analysis of Github repositories, and track the dynamics of repository ranking.
We collect the experimental data from two data sources: the original Github API, and the githubarchive Web site. The meta information about Github users and repositories is retrieved with the Github API. This work is done by scraping the Github data URLs.
We get the Github event data from githubarchive. The timeline data keeps growing, and needs to be crawled every 2–3 months due to Github restrictions. As Github-oriented analysis is becoming popular, there are dedicated archive sites that continuously crawl Github data and archive it for researchers to download. GHTorrent and githubarchive are two typical Github data archive sites, which have been used in recent studies on Github analysis.
The experiments are conducted on a machine with an Intel i5 CPU and 8 GB of RAM, running BSD Unix. The star graph is created using the SNAP library, a C++ library that facilitates social analysis on large graphs. Currently, we construct the star graph from 4 years of star event data, from Jan 1, 2012 until Dec 31, 2015. We use the Python wrapper for SNAP, and connect it to our Github database through Python.
Analysis of Github repositories
Top 10 Github repositories with HITS algorithm
Top 10 Github repositories with fork-weighted HITS algorithm
Top 10 Github repositories with size-weighted HITS algorithm
Monthly analysis of Github repository ranking
Popular Github repositories keep attracting the attention of capable software developers. The influence of a repository may change over time. Based on this observation, we analyze the Github event data month by month.
angular/angular.js: It is a very popular HTML client-side enhancement library. It keeps a relatively stable high rank during the observed months.
mbostock/d3: It achieves relatively high ranks in all 48 months.
FortAwesome/Font-Awesome: It is a popular font and CSS toolkit.
vhf/free-programming-books: It is an authoritative page that lists freely available programming books. It is a popular information source, but not a software project. It gets its first top-1 rank in month 21, and stays popular afterwards, as the monthly ranking indicates.
facebook/react-native: It is a framework for building native apps with React. It has been newly popular since March 2015.
daneden/animate.css: It is a cross-browser library of CSS animations. It keeps relatively steady high rankings throughout the 4 years.
hakimel/reveal.js: It is a popular HTML presentation framework. It keeps relatively steady high rankings throughout the 4 years.
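The monthly split can be sketched as a simple bucketing step over the star events. The event layout (a `created_at` timestamp, a user, and a repository) and the ISO 8601 timestamp format are assumptions about the archived event data; the function name and the toy events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def star_edges_by_month(star_events):
    """Bucket (created_at, user, repo) star events into per-month edge lists."""
    monthly = defaultdict(list)
    for created_at, user, repo in star_events:
        # Assumed timestamp format, e.g. "2015-03-26T18:02:11Z".
        ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
        monthly[ts.strftime("%Y-%m")].append((user, repo))
    # Return the months in chronological order.
    return dict(sorted(monthly.items()))


# Hypothetical toy events.
events = [
    ("2015-03-26T18:02:11Z", "alice", "repo_x"),
    ("2012-01-05T09:30:00Z", "bob", "repo_y"),
    ("2015-03-27T07:45:59Z", "carol", "repo_x"),
]
monthly_edges = star_edges_by_month(events)
```

Each monthly edge list defines one per-month star graph, which can be ranked independently to follow how a repository's rank moves over the 48 months.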
Figure 5 shows the monthly ranking of the fork-weighted HITS top 10 results. Those 10 repositories present steadier ranks than those in Fig. 4. The repository octocat/Spoon-Knife is a highly forked repository, but it is not actually popular. It is a project with few files and a small size. On inspection, it turns out to be an example project made to demonstrate the fork feature of Github. This explains why it has many forks while exhibiting low monthly influence ranks. All other repositories come with monthly ranks that fit well with the fork-weighted ranks.
We also studied the monthly ranking of the top 10 results of the size-weighted HITS algorithm. Figure 6 shows the monthly ranking of the size-weighted HITS top 10 results. As shown in Fig. 6, Github repositories of very large size tend to be starred by fewer developers. Among the top 10 repositories returned by the size-weighted HITS algorithm, only three repositories (practicalswift/osx, mustafaramadhan/kloxo, angular/angular.js) are continuously influential during the 4 years.
From the results depicted in Figs. 4, 5 and 6, we can see that the social influence of a repository tends to change month by month. The top influential repositories are not highly influential in every month. However, they often keep a steady influence rank, with high ranks in several months.
Language-specific monthly analysis of Github repository ranking
Two popular repositories, including angular/angular.js, stand out at the top of the results of the standard HITS and fork-weighted HITS algorithms.
YH provided the ideas and wrote the draft. JZ, XB and SY helped improve the research idea and revised the paper draft. ZY helped revise the paper and design the experiments. All authors read and approved the final manuscript.
Many thanks to the anonymous reviewers for their detailed comments. This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant 61300017.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Avelino G, Valente MT, Hora A (2015) What is the truck factor of popular Github applications? A first assessment. PeerJ Prepr 3:e1233
- Casalnuovo C, Vasilescu B, Devanbu PT, Filkov V (2015) Developer onboarding in Github: the role of prior social links and language experience. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering, ESEC/FSE 2015, Bergamo, Italy, August 30–September 4, 2015
- Cosentino V, Izquierdo JLC, Cabot J (2014) Three metrics to explore the openness of Github projects. CoRR. arXiv:1409.4253
- Hauff C, Gousios G (2015) Matching Github developer profiles to job advertisements. In: 12th IEEE/ACM working conference on mining software repositories, MSR 2015, Florence, Italy, May 16–17, 2015, pp 362–366
- Jurado F, Marín PR (2015) Sentiment analysis in monitoring software development processes: an exploratory case study on Github’s project issues. J Syst Softw 104:82–89
- Kaplan DT (2008) Google’s pagerank and beyond: the science of search engine rankings by Amy N. Langville; Carl D. Meyer. Am Math Mon 115:765–768
- Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46:604–632
- Lima A, Rossi L, Musolesi M (2014) Coding together at scale: Github as a collaborative social network. CoRR. arXiv:1407.2535
- Loyola P, Ko IY (2014) Population dynamics in open source communities: an ecological approach applied to Github. In: 23rd international world wide web conference, WWW ’14, Seoul, Republic of Korea, April 7–11, 2014, companion volume, pp 993–998
- McDonald N, Blincoe K, Petakovic E, Goggins SP (2014) Modeling distributed collaboration on Github. Adv Complex Syst 17:7–8
- Ray B, Posnett D, Filkov V, Devanbu PT (2014) A large scale study of programming languages and code quality in Github. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, (FSE-22), Hong Kong, China, November 16–22, 2014, pp 155–165
- Tsay J, Dabbish L, Herbsleb JD (2014) Influence of social and technical factors for evaluating contribution in Github. In: 36th international conference on software engineering, ICSE ’14, Hyderabad, India, May 31–June 07, 2014, pp 356–366
- Vasilescu B, Posnett D, Ray B, van den Brand MGJ, Serebrenik A, Devanbu PT, Filkov V (2015a) Gender and tenure diversity in Github teams. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems, CHI 2015, Seoul, Republic of Korea, April 18–23, 2015, pp 3789–3798
- Vasilescu B, Serebrenik A, Filkov V (2015b) A data set for social diversity studies of Github teams. In: 12th IEEE/ACM working conference on mining software repositories, MSR 2015, Florence, Italy, May 16–17, 2015, pp 514–517
- Xavier J, Macedo A, Maia MA (2014) Understanding the popularity of reporters and assignees in the Github. In: The 26th international conference on software engineering and knowledge engineering, Hyatt Regency, Vancouver, BC, Canada, July 1–3, 2013, pp 484–489
- Yu Y, Yin G, Wang HM, Wang T (2014) Exploring the patterns of social behavior in Github. In: Proceedings of the 1st international workshop on crowd-based software development methods and technologies, CrowdSoft 2014, Hong Kong, China, November 17, 2014, pp 31–36
- Yu Y, Wang HM, Filkov V, Devanbu PT, Vasilescu B (2015) Wait for it: determinants of pull request evaluation latency on Github. In: 12th IEEE/ACM working conference on mining software repositories, MSR 2015, Florence, Italy, May 16–17, 2015, pp 367–371