Repeated DNS resolution failures

Recently (about 1-2 weeks ago) our builds on bitrise.io started to fail randomly due to failures in name resolution.
A few examples:
https://www.bitrise.io/build/8f8aeff445e4cf83
https://www.bitrise.io/build/6b5ea6aeb5beae07
https://www.bitrise.io/build/4950f573ba5a574c
https://www.bitrise.io/build/4b53b075fc8d97be
The issue has never recurred when the same builds were rerun.

As far as we are aware, the DNS configuration of these domains has not changed, and there are no reachability issues outside bitrise.io; however, the domains are accessed less frequently from elsewhere than from Bitrise builds.

Hey @koral

Do you see any network issues in any of the other steps?

No, in the failed builds these are the only issues.

We will check whether the source of this issue is on our side. These VMs are hosted on Google Compute Engine with the standard DNS configuration, so there is only a very small probability that this issue comes from us. Thank you for the report :slight_smile:

OK, I’ve also set up an external DNS reachability monitor at port-monitor.com. It runs a check every minute, so if the issue occurs again we’ll know whether it was limited to bitrise.io or global.

Thank you very much, please let us know about the results :slight_smile:

The issue appeared again on 2017-10-27T18:28:50Z (+20/-0 s).
Here is the build log: https://www.bitrise.io/build/19b44839b2e52522
That domain was not being monitored; however, a second domain is monitored, which is handled by the same DNS servers but points to a different IP, and that one has 100% uptime.

Additionally, there was another issue at about 2017-10-27T18:33:00Z (+60/-0 min)
in this build: https://www.bitrise.io/build/924f50f7c14e28ab
That build was aborted due to a timeout (75 min), although it usually takes 20-25 min.

All of this indicates that something went wrong on the bitrise.io side.

Thanks for the feedback! I made an internal note to check this.

Is it always the same URL which fails? Asking because we did not see any DNS resolution issues, in fact we had no DNS resolution issue in any of our continuous “control” builds.

If it’s always the same URL which fails then it’s more likely that the issue is either on that service’s side, or somewhere in-between (e.g. on domain / DNS server level), but not on bitrise.io / on the bitrise.io build workers.

There are 2 domains: one serving a REST API with several URLs, and one serving a Cisco VPN.

It seems that the services are not even reached, because the domains cannot be resolved from Bitrise.

What do you suggest then? Changing DNS server to something else, e.g. CloudFlare?

Hard to say. Keep in mind that these build VMs are always clean, in every build. That means a brief DNS blip might not affect your monitoring service, which has already resolved the IP, but could affect an environment which has never resolved it. AFAIK this might happen, for example, if the DNS - IP refresh frequency is too low. But again, it’s quite hard to say, as we can’t reproduce this issue :confused:
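
One way to gather evidence on this point is to log a timestamped lookup from inside the clean build VM, so the result can be lined up against the external monitor’s timeline. Below is a minimal sketch, not something we have run against the affected builds. `check_dns` is a hypothetical helper name, the lookup tool is assumed to be `host`, and the hostname in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: perform one DNS lookup and print a UTC-timestamped result.
# Usage: check_dns <hostname> [lookup command...]
# The lookup command defaults to `host`; it can be overridden, e.g. for testing.
check_dns() {
  local name="$1"
  shift
  local -a lookup=("$@")
  if (( ${#lookup[@]} == 0 )); then
    lookup=(host)
  fi

  local ts status
  # ISO 8601 UTC timestamp, easy to compare with monitor logs.
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  if "${lookup[@]}" "$name" >/dev/null 2>&1; then
    status="ok"
  else
    status="FAILED"
  fi

  echo "${ts} ${name} ${status}"
  # Return non-zero when the lookup failed, so callers can react to it.
  [[ "$status" == "ok" ]]
}

# Example (placeholder hostname, not one of the affected domains):
# check_dns api.example.com
```

Running this as the first step of a build would show whether a cold lookup on a fresh VM already fails before any other step touches the network.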

The issue reappeared today (in random order):
https://www.bitrise.io/build/7e8e639e221f780c
https://www.bitrise.io/build/f4a2695b29f671a1
https://www.bitrise.io/build/f8ff3e150cb92a0d

A few minutes after the latest occurrence I could not reproduce it:
https://www.bitrise.io/build/f8ff3e150cb92a0d

Can you add a retry around the call in the related Script step? We don’t see any issues in any of our tests, and we have no other reports either. It would help a lot if you could add a retry there and let us know whether it helps.
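
A minimal sketch of such a retry, assuming the failing call is a single command inside a Bash Script step. The `retry` helper, the attempt count, the delay, and the URL in the usage comment are all hypothetical placeholders, not taken from the affected builds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1

  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed, retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Example (placeholder URL): up to 3 attempts, 5 seconds apart.
# retry 3 5 curl --fail --silent --show-error https://api.example.com/health
```

If the call succeeds on the second or third attempt, that points to a transient resolution problem; if every attempt fails, the outage lasts at least as long as the total retry window.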

OK, I’ve added steps like this after the affected ones:

- script:
    is_always_run: true
    run_if: .IsBuildFailed
    inputs:
    - content: |-
        #!/usr/bin/env bash
        host <host from previous step>

Let us know how it goes! :wink: