August 7, 2023

Benchmarking Risk & Quality KPIs in Popular Open Source Projects

by Mark Greene


Shepherdly analyzed six popular open-source repositories to examine how established and novel quality KPIs are reflected in some of the most widely used software tools.

The following repositories were selected:


How does bug prediction work?

In short, Shepherdly’s approach extends existing research but incorporates a core principle of using bug-fix data as its primary labeling mechanism. From there, the technique involves tracing back to the PR that introduced the change. Features are then collected and a predictive model is constructed.
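The tracing step can be sketched in a simplified, SZZ-style way (this is a generic illustration of the technique, not Shepherdly's actual implementation): given the diff of a bug-fix commit, the line ranges it removed or modified on the pre-fix side are the ones `git blame` would map back to the commit, and from there the PR, that introduced them. The function name is hypothetical.

```python
import re

def deleted_line_ranges(diff_text):
    """Parse unified-diff hunk headers (@@ -start,count +... @@) and return
    the (start, end) line ranges on the pre-fix side of the diff. In an
    SZZ-style pipeline, these are the lines that `git blame` would trace
    back to the bug-introducing change."""
    ranges = []
    for m in re.finditer(r"^@@ -(\d+)(?:,(\d+))? \+", diff_text, flags=re.M):
        start = int(m.group(1))
        count = int(m.group(2)) if m.group(2) else 1
        if count:  # count == 0 means a pure insertion; nothing to blame
            ranges.append((start, start + count - 1))
    return ranges
```

Each returned range would then be fed to `git blame` on the fix commit's parent to recover the introducing commit SHA.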

Past research has established common features that have proven useful, such as Lines Changed, Author Familiarity, Bug Proneness, and File Size, to name a few. While these examples may feel intuitive, their thresholds and relative importance vary across repositories and over time. For this reason, static heuristics are not suitable for bug prediction; instead, a predictive model that is periodically retrained, or even capable of online learning, is required.
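As a deliberately minimal illustration of a retrainable predictive model, here is a stdlib-only logistic-regression sketch. The post does not describe Shepherdly's actual model or feature encoding, so both are hypothetical here; "periodic retraining" amounts to calling the trainer again on freshly labeled data.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit a tiny logistic-regression model with SGD.
    X: list of feature rows (e.g. [lines_changed, author_familiarity, ...],
    scaled to comparable ranges); y: 0/1 labels from bug-fix tracing.
    Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted bug probability
            err = p - label
            w = [wi - lr * err * xi for wi, xi in zip(w, row)]
            b -= lr * err
    return w, b

def predict(w, b, row):
    """Probability that this PR introduces a bug."""
    z = b + sum(wi * xi for wi, xi in zip(w, row))
    return 1.0 / (1.0 + math.exp(-z))
```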

What makes a PR risky?

If you were to ask engineers this question, you'd likely hear at least one of the features commonly cited in research.
While there are common directional trends across repositories, with varying degrees of importance, some features deviate from intuition entirely.

Below we select the most important feature groups in each repository's model and compare the differences in importance. Feature importance measures how much each feature contributes to the model's predictions; it's a way to understand which features most influence the model's decisions.

Feature Group Definitions:
Size of Change: How many modifications (e.g., lines, files) are in the pull request.
Bug Proneness: How many bug reports are linked to the files within the pull request.
Size/Age of File: How many lines a file contains, the age of the file, and how long it has been since its last modification.
Review Activity/Reviewer Familiarity: How many comments a PR received as well as the familiarity those reviewers have with the repository or organization.
Author Familiarity: A proxy for how familiar the author is with the source files within the repository or organization. For example, commit frequency.
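One standard way to compute feature importance (the post does not say which method Shepherdly uses) is permutation importance: shuffle one feature's values and measure how much the model's score drops. A bigger drop means the model leans on that feature more. A minimal sketch, assuming the model is a per-row callable:

```python
import random

def permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Average score drop when each feature column is shuffled.
    model: callable row -> prediction; metric: (y_true, y_pred) -> score
    where higher is better. Returns one importance per feature."""
    rng = random.Random(seed)
    baseline = metric(y, [model(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the feature's link to the labels
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - metric(y, [model(row) for row in Xp]))
        importances.append(sum(drops) / n_repeats)
    return importances
```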

How many latent bugs are expected?

With a predictive model, it is also possible to measure the expected number of bugs introduced. In the aggregate, this is incredibly powerful as it introduces a novel quality KPI: latent bugs. This is accomplished by measuring the delta between bugs predicted and bug fixes over a 60-day period.

Now, one important thing to note is that these repositories do not have equivalent pull request activity. To control for that, we’ll create a ratio of latent bugs per 100 pull requests.
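The arithmetic above can be written out directly. The helper names are illustrative; the 60-day window and per-100-PR normalization come from the post.

```python
def expected_bugs(risk_scores):
    """Sum of per-PR bug probabilities = expected number of bugs
    introduced across those PRs (linearity of expectation)."""
    return sum(risk_scores)

def latent_bugs_per_100_prs(predicted_bugs, observed_fixes, total_prs):
    """Latent bugs = bugs the model expected minus bug fixes actually
    observed over the window (e.g. 60 days), normalized per 100 PRs so
    repositories with different activity levels are comparable."""
    latent = max(predicted_bugs - observed_fixes, 0)
    return latent, 100.0 * latent / total_prs
```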

How often are bug fixes happening?

Shepherdly incorporates human labels for bug fixes but is able to extend this even further by using NLP classification, giving a more complete picture of this often underreported development activity. This metric is expressed as a percentage of PRs directly linked to fixing a bug. While this percentage is team specific and can change over time, this gives you an up-to-date view of how much maintenance activity is happening.
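A toy stand-in for the classification step shows the shape of the metric. Shepherdly's actual NLP classifier is presumably a trained model; the keyword pattern below is a crude hypothetical heuristic, not their implementation.

```python
import re

# Crude keyword heuristic standing in for a real NLP classifier
BUG_FIX_PATTERN = re.compile(
    r"\b(fix(es|ed)?|bug|defect|regression|hotfix)\b", re.IGNORECASE)

def looks_like_bug_fix(pr_title, pr_body=""):
    """Flag a PR as a bug fix when its title or body uses fix/bug language."""
    return bool(BUG_FIX_PATTERN.search(f"{pr_title}\n{pr_body}"))

def bug_fix_rate(prs):
    """Percentage of PRs classified as bug fixes; prs is a list of
    (title, body) pairs."""
    flagged = sum(looks_like_bug_fix(t, b) for t, b in prs)
    return 100.0 * flagged / len(prs)
```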

How long does it take to identify and resolve bugs?

Today, it’s common to start the timer when the bug was first discovered/reported. But that’s not the complete picture. Shepherdly digs deeper into the file revision history to truly understand the age of a bug by measuring when it was introduced into the code.
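The difference between the two clocks can be made explicit. A small sketch, assuming the introduction, report, and fix dates have already been recovered (e.g., the introduction date via revision history):

```python
from datetime import date

def bug_age_days(introduced_on, reported_on, fixed_on):
    """Two views of a bug's lifetime: the conventional clock starts at the
    report, while tracing back to the introducing change gives the true age.
    The gap between them is how long the bug sat undetected."""
    return {
        "report_to_fix": (fixed_on - reported_on).days,
        "true_age": (fixed_on - introduced_on).days,
        "latent_period": (reported_on - introduced_on).days,
    }
```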

How does risk relate to cycle time?

Note: none of the contributors in the projects analyzed had the risk score as an input, so no behavior change could be observed.

Outside of that stipulation, many (but not all) of the repositories analyzed skew toward spending more time in review on higher-risk changes. This maximizes the chance of catching obvious regressions and applying appropriate remediation.

Do riskier PRs get more attention?

Note: none of the contributors in the projects analyzed had the risk score as an input, so no behavior change could be observed.

While PRs can certainly be approved without commentary, active and rigorous review would be considered a better practice in this context. Furthermore, given typical time constraints on engineers, focusing that energy on the highest risk changes is an optimization teams are currently operating without. Shepherdly can give developers a cheat sheet of prioritized PRs to review.

Deep Dives

Stay tuned, we’ll be doing deep dives into some of these repositories to examine development behaviors against risk scores.
