Build failure rate: why and how to find your frequently failing builds

:clapper: You can find the video version of this content at: https://youtu.be/fZwsr1H0A4g

Why is it important to track frequently failing builds?

For every failed build, you’ll most likely have to 1) spend time fixing the failure, 2) retry the build, and 3) wait for the retried build to finish.

Even if you can work on something else while the build is running, you still pay for a context switch when you come back to it once the build is done.

If the build failure rate is high for a specific app or Workflow, people are frequently spending time debugging, fixing, and then retrying builds.

The failure rate is especially important for long builds (see “Monitoring and optimizing your mobile builds - what, why and how to track”), as engineers have to wait even longer each time they push a fix and run a new build.

Tracking and reducing the frequency of failed builds can help minimize the time and effort spent on resolving build failures and increase the overall efficiency and productivity of your team.

The main goal is to reduce wait time throughout the development process. If you have builds that fail frequently, then sooner rather than later you’ll hit a failing build where you have to check why it failed, fix the issue, retry, and wait for the build to hopefully pass. Along the way, you’ll likely lose time either to context switching or because you can’t do anything useful while you wait for the results of that build.

How to find frequently failing builds and how to diagnose what’s causing failures?

When you open Bitrise Insights, you can find the Build failure rate chart on the Overview page.

Here you can see your overall build failure rate trend in the whole workspace and you can also see the top 5 most frequently failing apps’ build failure rate trends.

From here you can continue your investigation using either the View details button or by clicking the Builds page under the Explore section in the left sidebar:

After you open the Builds Explore page, switch to the Failure rate tab:

On the Builds Explore page, you will most likely start from the workspace level (unless some filters are already applied), which is what the upper chart shows. Using the intelligent breakdown (the second, lower chart), you can drill into the data to find which application, which Workflow, and which step is causing the build failure rate trend that you are checking.

Let’s go through an example. Here on the breakdown chart you can see the application that fails most frequently in this workspace:

Filter down to that application, and on the next level, you’ll find the per workflow breakdown.

The upper chart now shows what is filtered on, so in this case, it’s the selected app’s failure rate. On the lower, breakdown chart you can see which Workflow fails most frequently:

Let’s filter down to that Workflow:

The upper chart now reflects this filtering, and the breakdown chart has switched to the per-step failure rate. Using the breakdown chart, you can find out which step is causing the failure rate trend that you are investigating.
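To make the drill-down concrete, here is a minimal Python sketch of the same app, then Workflow, then step breakdown, run over a handful of made-up build records. It is only an illustration of the idea, not how Bitrise Insights computes its charts: the record layout, the helper names `failure_rate_by` and `failures_by_step`, the step names, and all the numbers are assumptions. At the step level it simply counts which step failed, which may differ from the exact per-step rate shown in the product.

```python
from collections import Counter

# Hypothetical build records (not real Bitrise data).
# "failed_step" is None when the build succeeded.
builds = [
    {"app": "bullseye", "workflow": "test", "failed_step": "xcode-test"},
    {"app": "bullseye", "workflow": "test", "failed_step": "xcode-test"},
    {"app": "bullseye", "workflow": "test", "failed_step": None},
    {"app": "bullseye", "workflow": "deploy", "failed_step": None},
    {"app": "other-app", "workflow": "test", "failed_step": "git-clone"},
    {"app": "other-app", "workflow": "test", "failed_step": None},
    {"app": "other-app", "workflow": "test", "failed_step": None},
    {"app": "other-app", "workflow": "test", "failed_step": None},
]


def select(builds, **filters):
    """Keep only the builds that match every filter, e.g. app="bullseye"."""
    return [b for b in builds if all(b[k] == v for k, v in filters.items())]


def failure_rate_by(builds, key, **filters):
    """Failure rate per value of `key` among the filtered builds."""
    selected = select(builds, **filters)
    total = Counter(b[key] for b in selected)
    failed = Counter(b[key] for b in selected if b["failed_step"] is not None)
    return {value: failed[value] / total[value] for value in total}


def failures_by_step(builds, **filters):
    """Number of failed builds per failing step among the filtered builds."""
    selected = select(builds, **filters)
    return Counter(b["failed_step"] for b in selected if b["failed_step"] is not None)


# Level 1: which app fails most often in the workspace?
per_app = failure_rate_by(builds, "app")
worst_app = max(per_app, key=per_app.get)

# Level 2: within that app, which Workflow fails most often?
per_workflow = failure_rate_by(builds, "workflow", app=worst_app)
worst_workflow = max(per_workflow, key=per_workflow.get)

# Level 3: within that Workflow, which step is failing?
per_step = failures_by_step(builds, app=worst_app, workflow=worst_workflow)
```

Each level reuses the filters chosen at the previous level, which mirrors how the upper chart narrows while the breakdown chart moves one level deeper.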

Under the graphs, you can also see the build history which is filtered based on the filters that you set at the top and also on the time range that you set in the top right corner.

Hovering over the bars, you can see how long specific steps took and in which build that step failed:

When you find the builds that correlate with the trend you’re checking, you can quickly jump to the relevant build’s page and continue your investigation there:

There’s another page that is worth checking periodically, the Bottlenecks page:

On the Bottlenecks page, Insights shows you negative trends from the last 7 days. The relevant Bottleneck is the Failing workflows one, which lists the workflows whose failing builds consumed the most time. The workflows are ordered by the time impact of their failing builds.

This is usually a good place to check, as the time impact calculation here reflects both how frequently the builds of a given workflow fail and how long those failing builds take. The time impact listed on this page is the total build time of the failing builds for that workflow in the last 7 days. In the example above, the bullseye app’s test workflow builds failed in 73.68% of the cases, and in total those failed builds consumed 1 hour and 38 minutes in the last 7 days.

By listing the workflows based on time impact instead of just the failure rate, the Bottlenecks page helps you focus on the most impactful build failure trends. For example, suppose one workflow had only a few builds and all of them failed, while another workflow had tens or hundreds of builds and failed in 50% of the cases. If the builds of both workflows take about the same time, the second workflow will be ranked higher, as overall its failed builds caused more wait time for engineers.
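The difference between the two rankings can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual Insights implementation: the workflow names `rare-workflow` and `busy-workflow`, the build counts, and the 10-minute build duration are all made up, and time impact is taken to be the summed duration of failed builds, as described above.

```python
from collections import defaultdict

# Hypothetical build records: (workflow, failed, duration_minutes).
# rare-workflow: 3 builds, all failed. busy-workflow: 100 builds, half failed.
builds = (
    [("rare-workflow", True, 10)] * 3
    + [("busy-workflow", True, 10)] * 50
    + [("busy-workflow", False, 10)] * 50
)


def summarize(builds):
    """Per workflow: failure rate and total minutes spent in failed builds."""
    stats = defaultdict(lambda: {"total": 0, "failed": 0, "failed_minutes": 0})
    for workflow, failed, minutes in builds:
        entry = stats[workflow]
        entry["total"] += 1
        if failed:
            entry["failed"] += 1
            entry["failed_minutes"] += minutes
    return {
        workflow: {
            "failure_rate": entry["failed"] / entry["total"],
            "time_impact_minutes": entry["failed_minutes"],
        }
        for workflow, entry in stats.items()
    }


summary = summarize(builds)

# Ranking by failure rate alone puts the rarely built workflow first...
by_rate = sorted(summary, key=lambda w: summary[w]["failure_rate"], reverse=True)

# ...while ranking by time impact puts the frequently built workflow first.
by_impact = sorted(
    summary, key=lambda w: summary[w]["time_impact_minutes"], reverse=True
)
```

With these numbers, the rarely built workflow has a 100% failure rate but only 30 minutes of failed build time, while the busy workflow has a 50% failure rate and 500 minutes of failed build time, so it comes first when sorted by time impact.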

Keeping an eye on your build failure rate and improving it helps you reduce wait time during the app development process and increase the efficiency and productivity of your team.

If you have any questions or feedback, please let us know using the Give feedback button in the bottom left corner of any of the Bitrise Insights pages:

Happy data digging!

You can find the rest of the “What, Why and How to track” learning series at: what-why-how-track