August 7, 2023

Benchmarking Risk & Quality KPIs in Popular Open Source Projects

by Mark Greene


Shepherdly analyzed six popular open-source repositories to examine how established and novel quality KPIs are reflected in some of the most widely used software tools.

The following repositories were selected:


How does bug prediction work?

In short, Shepherdly’s approach extends existing research but incorporates a core principle of using bug-fix data as its primary labeling mechanism. From there, the technique involves tracing back to the PR that introduced the change. Features are then collected and a predictive model is constructed.
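The tracing step can be sketched in a simplified, SZZ-style way (this is a generic illustration of the technique, not Shepherdly's actual implementation): given the diff of a bug-fix commit, the line ranges it removed or modified on the pre-fix side are the ones `git blame` would map back to the commit, and from there the PR, that introduced them. The function name is hypothetical.

```python
import re

def deleted_line_ranges(diff_text):
    """Parse unified-diff hunk headers (@@ -start,count +... @@) and return
    the (start, end) line ranges on the pre-fix side of the diff. In an
    SZZ-style pipeline, these are the lines that `git blame` would trace
    back to the bug-introducing change."""
    ranges = []
    for m in re.finditer(r"^@@ -(\d+)(?:,(\d+))? \+", diff_text, flags=re.M):
        start = int(m.group(1))
        count = int(m.group(2)) if m.group(2) else 1
        if count:  # count == 0 means a pure insertion; nothing to blame
            ranges.append((start, start + count - 1))
    return ranges
```

Each returned range would then be fed to `git blame` on the fix commit's parent to recover the introducing commit SHA.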

Past research has established common features that have proven useful, such as Lines Changed, Author Familiarity, Bug Proneness, and File Size, to name a few. While these examples may feel intuitive, their thresholds and relative importance vary across repositories and over time. For this reason, static heuristics are not suitable for bug prediction; instead, a predictive model that is periodically retrained, or even capable of online learning, is required.
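As a deliberately minimal illustration of a retrainable predictive model, here is a stdlib-only logistic-regression sketch. The post does not describe Shepherdly's actual model or feature encoding, so both are hypothetical here; "periodic retraining" amounts to calling the trainer again on freshly labeled data.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit a tiny logistic-regression model with SGD.
    X: list of feature rows (e.g. [lines_changed, author_familiarity, ...],
    scaled to comparable ranges); y: 0/1 labels from bug-fix tracing.
    Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted bug probability
            err = p - label
            w = [wi - lr * err * xi for wi, xi in zip(w, row)]
            b -= lr * err
    return w, b

def predict(w, b, row):
    """Probability that this PR introduces a bug."""
    z = b + sum(wi * xi for wi, xi in zip(w, row))
    return 1.0 / (1.0 + math.exp(-z))
```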

What makes a PR risky?

If you were to ask engineers this question, you'd likely hear at least one of the features commonly cited in research.
While there are common directional trends across repositories, with varying degrees of importance, some features deviate from intuition entirely.

Below we select the most important feature groups in each repository's model and compare the differences in importance. Feature importance measures how much each feature contributes to the model's predictions; it's a way to understand which features most influence the model's decisions.

Feature Group Definitions:
Size of Change: How many modifications (e.g., lines, files) are in the pull request.
Bug Proneness: How many bug reports are linked to the files within the pull request.
Size/Age of File: How many lines a file contains, the age of the file, and how long it has been since its last modification.
Review Activity/Reviewer Familiarity: How many comments a PR received as well as the familiarity those reviewers have with the repository or organization.
Author Familiarity: A proxy for how familiar the author is with the source files within the repository or organization. For example, commit frequency.
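One standard way to compute feature importance (the post does not say which method Shepherdly uses) is permutation importance: shuffle one feature's values and measure how much the model's score drops. A bigger drop means the model leans on that feature more. A minimal sketch, assuming the model is a per-row callable:

```python
import random

def permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Average score drop when each feature column is shuffled.
    model: callable row -> prediction; metric: (y_true, y_pred) -> score
    where higher is better. Returns one importance per feature."""
    rng = random.Random(seed)
    baseline = metric(y, [model(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the feature's link to the labels
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - metric(y, [model(row) for row in Xp]))
        importances.append(sum(drops) / n_repeats)
    return importances
```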

How many latent bugs are expected?

With a predictive model, it is also possible to measure the expected number of bugs introduced. In the aggregate, this is incredibly powerful as it introduces a novel quality KPI: latent bugs. This is accomplished by measuring the delta between bugs predicted and bug fixes over a 60-day period.

Now, one important thing to note is that these repositories do not have equivalent pull request activity. To control for that, we’ll create a ratio of latent bugs per 100 pull requests.
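The arithmetic above can be written out directly. The helper names are illustrative; the 60-day window and per-100-PR normalization come from the post.

```python
def expected_bugs(risk_scores):
    """Sum of per-PR bug probabilities = expected number of bugs
    introduced across those PRs (linearity of expectation)."""
    return sum(risk_scores)

def latent_bugs_per_100_prs(predicted_bugs, observed_fixes, total_prs):
    """Latent bugs = bugs the model expected minus bug fixes actually
    observed over the window (e.g. 60 days), normalized per 100 PRs so
    repositories with different activity levels are comparable."""
    latent = max(predicted_bugs - observed_fixes, 0)
    return latent, 100.0 * latent / total_prs
```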

How often are bug fixes happening?

Shepherdly incorporates human labels for bug fixes but is able to extend this even further by using NLP classification, giving a more complete picture of this often underreported development activity. This metric is expressed as a percentage of PRs directly linked to fixing a bug. While this percentage is team specific and can change over time, this gives you an up-to-date view of how much maintenance activity is happening.
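A toy stand-in for the classification step shows the shape of the metric. Shepherdly's actual NLP classifier is presumably a trained model; the keyword pattern below is a crude hypothetical heuristic, not their implementation.

```python
import re

# Crude keyword heuristic standing in for a real NLP classifier
BUG_FIX_PATTERN = re.compile(
    r"\b(fix(es|ed)?|bug|defect|regression|hotfix)\b", re.IGNORECASE)

def looks_like_bug_fix(pr_title, pr_body=""):
    """Flag a PR as a bug fix when its title or body uses fix/bug language."""
    return bool(BUG_FIX_PATTERN.search(f"{pr_title}\n{pr_body}"))

def bug_fix_rate(prs):
    """Percentage of PRs classified as bug fixes; prs is a list of
    (title, body) pairs."""
    flagged = sum(looks_like_bug_fix(t, b) for t, b in prs)
    return 100.0 * flagged / len(prs)
```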

How long does it take to identify and resolve bugs?

Today, it’s common to start the timer when the bug was first discovered/reported. But that’s not the complete picture. Shepherdly digs deeper into the file revision history to truly understand the age of a bug by measuring when it was introduced into the code.
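The difference between the two clocks can be made explicit. A small sketch, assuming the introduction, report, and fix dates have already been recovered (e.g., the introduction date via revision history):

```python
from datetime import date

def bug_age_days(introduced_on, reported_on, fixed_on):
    """Two views of a bug's lifetime: the conventional clock starts at the
    report, while tracing back to the introducing change gives the true age.
    The gap between them is how long the bug sat undetected."""
    return {
        "report_to_fix": (fixed_on - reported_on).days,
        "true_age": (fixed_on - introduced_on).days,
        "latent_period": (reported_on - introduced_on).days,
    }
```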

How does risk relate to cycle time?

Note: none of the contributors in the projects analyzed had the risk score as an input, so no behavior change could be observed.

Outside of that stipulation, many (but not all) of the repositories analyzed skew toward spending more time in review on higher-risk changes. This maximizes the chance of catching obvious regressions and applying appropriate remediation.

Do riskier PRs get more attention?

Note: none of the contributors in the projects analyzed had the risk score as an input, so no behavior change could be observed.

While PRs can certainly be approved without commentary, active and rigorous review would be considered a better practice in this context. Furthermore, given typical time constraints on engineers, focusing that energy on the highest risk changes is an optimization teams are currently operating without. Shepherdly can give developers a cheat sheet of prioritized PRs to review.

Deep Dives

Stay tuned, we’ll be doing deep dives into some of these repositories to examine development behaviors against risk scores.
