Herzig and Zeller define "mining software archives" as a process to "obtain lots of initial evidence" by extracting data from software repositories. They further define "data sources" as product-based artefacts such as source code, requirement artefacts or version archives, and claim that these sources are unbiased, but noisy and incomplete.
The idea behind coupled change analysis is that developers frequently change code entities together when fixing defects or introducing new features. These couplings between entities are often not made explicit in the code or in other documents. Developers who are new to a project, in particular, do not know which entities need to be changed together. Coupled change analysis aims to extract this coupling from the project's version control system. From the commits and the timing of changes, it may be possible to identify which entities frequently change together. This information can then be presented to developers who are about to change one of the entities, to support them in their further changes.
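As an illustration, coupled changes can be approximated by counting how often two entities appear in the same commit. The following is a minimal sketch, not a method taken from the literature cited here: it assumes each commit has already been reduced to a list of changed file names, and the function names `co_change_counts` and `suggest`, the `min_support` threshold and the sample file names are all hypothetical. It also ignores the timing of changes and looks only at co-occurrence within a commit.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each unordered pair of entities appears in the same commit."""
    counts = Counter()
    for changed in commits:
        # sorted(set(...)) drops duplicates and gives every pair a canonical order
        for pair in combinations(sorted(set(changed)), 2):
            counts[pair] += 1
    return counts


def suggest(entity, counts, min_support=2):
    """Return (partner, count) for entities that changed with `entity`
    at least `min_support` times, most frequent first."""
    partners = []
    for (a, b), n in counts.items():
        if n >= min_support and entity in (a, b):
            partners.append((b if a == entity else a, n))
    return sorted(partners, key=lambda p: (-p[1], p[0]))


# Hypothetical history: each inner list is the set of files touched by one commit.
commits = [
    ["Parser.java", "Lexer.java"],
    ["Parser.java", "Lexer.java", "README.md"],
    ["Parser.java", "Ast.java"],
    ["Lexer.java", "Parser.java"],
]
counts = co_change_counts(commits)
print(suggest("Parser.java", counts))
```

A developer about to edit `Parser.java` would here be pointed to `Lexer.java`, which changed with it in three of the four commits. The support threshold filters out pairs that co-occurred only once, which are more likely to be coincidental.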
Commit Analysis
Version control systems contain many different kinds of commits, e.g. bug fix commits, new feature commits and documentation commits. To make data-driven decisions based on past commits, one needs to select the subset of commits that meets a given criterion. This selection can be based on the commit message or on the commit content.
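Message-based selection can be sketched with a few keyword patterns. The example below is only an assumption about how such a filter might look: the labels, the keyword lists, and the `classify` and `select` helpers are hypothetical, and real commit messages are noisy enough that keyword matching will misclassify some commits. Patterns are tried in order, so a message matching several categories receives the first label.

```python
import re

# Hypothetical keyword patterns; order matters because the first match wins.
KEYWORDS = {
    "bug fix": re.compile(r"\b(fix(e[sd])?|bug|defect)\b", re.IGNORECASE),
    "documentation": re.compile(r"\b(docs?|documentation|readme|javadoc)\b", re.IGNORECASE),
    "new feature": re.compile(r"\b(add(s|ed)?|implement(s|ed)?|feature)\b", re.IGNORECASE),
}


def classify(message):
    """Return the first label whose pattern matches the message, else 'other'."""
    for label, pattern in KEYWORDS.items():
        if pattern.search(message):
            return label
    return "other"


def select(commits, label):
    """Keep only the commits whose message is classified as `label`."""
    return [c for c in commits if classify(c["message"]) == label]


# Hypothetical commits, represented as dictionaries with an id and a message.
history = [
    {"id": "a1", "message": "Fix null pointer in parser"},
    {"id": "b2", "message": "Add README section on setup"},
    {"id": "c3", "message": "Implement incremental parsing"},
    {"id": "d4", "message": "Bump version"},
]
bug_fixes = select(history, "bug fix")
print([c["id"] for c in bug_fixes])
```

Content-based selection would instead inspect the changed files or the diff itself, for example treating a commit that touches only documentation files as a documentation commit.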
Documentation generation
It is possible to generate useful documentation by mining software repositories. For instance, Jadeite computes usage statistics and helps newcomers quickly identify commonly used classes. For certain kinds of structured documentation, such as subclassing directives, more advanced techniques can synthesize full sentences.
Data & Tools
The primary mining data comes from version control systems. Early mining experiments were done on CVS repositories. Later, researchers extensively analyzed SVN repositories. Git repositories are now dominant, but special care must be taken when handling branches and forks.
Tools:
is a Java library that allows existing tools to analyse the evolution of software systems by providing a common API for different version control systems and issue trackers.
is a Python Framework to analyse Git repositories.
is a Java tool to search for patterns in past commits.