Continuous integration playbook
- Maintainers: DevInfra Team.
- Audience: any software engineer, no prior infrastructure knowledge required.
- TL;DR This document sums up what to do in various scenarios that can block the CI.
Sourcegraph’s continuous integration (CI) is what enables us to feel confident when delivering our changes to our users, and is one of the key components enabling Sourcegraph to deliver quality software. While the DevInfra team is in charge of managing the CI as a tool, it is essential for every engineer to be able to unblock themselves when there is a problem, in order to be autonomous.
This page lists common failure scenarios and provides a step-by-step guide to get the CI back into an operational state.
Prerequisites
In order to handle problems with the CI, the following elements are necessary:
- Have access to the sourcegraph-ci project on Google Cloud Platform.
  - Ask #it-tech-ops for access if you do not have access.
- Have the gcloud CLI installed.
- Have the kubectl CLI installed.
- Gain access to the CI cluster by authenticating against it with gcloud and kubectl.
- Request access to the DevX Day2Day entitle bundle by typing /access_request in Slack.
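Authenticating against the cluster usually comes down to fetching credentials with gcloud and then checking that kubectl can reach the cluster. The following is a minimal sketch under stated assumptions: the cluster name and zone are not given on this page, so they are passed in as arguments and must be looked up separately; only the sourcegraph-ci project name comes from the list above.

```shell
# Sketch only: CLUSTER and ZONE are placeholders that this page does not
# specify. Look them up in the Google Cloud console for sourcegraph-ci.
ci_cluster_login() {
  cluster="${1:-}"
  zone="${2:-}"
  if [ -z "$cluster" ] || [ -z "$zone" ]; then
    echo "usage: ci_cluster_login CLUSTER ZONE" >&2
    return 2
  fi
  # Log in, then write a kubeconfig entry for the cluster.
  gcloud auth login &&
    gcloud container clusters get-credentials "$cluster" \
      --zone "$zone" --project sourcegraph-ci &&
    # Sanity check: listing namespaces shows that kubectl can reach the cluster.
    kubectl get namespaces
}
```

If the last command lists namespaces such as buildkite and buildkite-bazel, the setup is most likely complete.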
Scenarios
buildchecker has locked the main branch
- Severity: major
- Impact:
  - No pull requests may be merged except by the authors of the last few failed builds and the @dev-infra team.
  - Pull request builds may be failing as well.
- Possible causes:
  - buildchecker will lock/restrict push access to the main branch if a series of failed builds is detected. This can indicate that a regression has been merged into main or that critical build infrastructure is failing.
Actions
buildchecker will still allow the authors of the last few failed builds, as well as the @dev-infra team, to push to the main branch so as to make any changes necessary to restore the pipeline to a healthy state.
- Follow the “Build has failed on the main branch” guide.
- If the issue has been resolved, wait for buildchecker to unlock the branch or manually trigger a run (click “Run workflow”).
Build has failed on the main branch
- Severity: minor
- Impact: that commit won’t be deployed on k8s.sgdev.org and sourcegraph.com until a later build passes.
- Possible causes:
  - The main branch runs additional checks compared to Pull Request builds, so it’s possible that one of those checks failed.
    - 💡 The checks are dynamically generated by our pipeline generation tool. The main branch notably has much more exhaustive checks than other branches.
  - The main branch has changes that weren’t in the Pull Request branch and those changes are causing a failure.
  - The main branch is failing due to a previous build.
Actions
- Check your build on Buildkite.
  - Find its link directly in the #buildkite-main channel.
  - 💡 Or run sg ci status in your shell, with the main branch checked out.
- Search for the failing steps, and browse the logs (💡 run sg ci logs in your shell, with the main branch checked out).
  - Look for a failure explanation: it can be a test that failed or a command that returned a non-zero exit code.
- Check the previous builds on the main branch on Buildkite.
  - Are they failing with the exact same error?
    - Yes: see the “Builds are all failing on the main branch with the same error” scenario.
    - No: see next point.
- Is that a real failure or a flake?
  - Restart that step. Maybe it will fail again, but if it doesn’t it’ll save you time.
    - 💡 You can go to 3. while it runs.
  - See the “Is this a failure or a flake?” scenario.
  - Did restarting it fix the problem?
    - Yes: that’s a flake. See the “Spotted a flake” scenario.
    - No: see next point.
- Does the failure point to a problem with the code that was shipped on that commit?
  - Yes, and it’s a very quick fix that can get merged promptly:
    - Write a short message on #buildkite-main and tell others that you’re fixing it.
    - Submit the fix with another PR and get it merged as soon as possible.
  - Yes, but it’s not easily and/or quickly fixed:
    - Revert the offending Pull Request.
    - Open a GitHub issue mentioning the build and the context to explain to the team owning that test what happened.
    - Check out the PR branch.
    - Rebase it so it includes the changes that broke it when merged in the main branch.
    - Create a build using sg ci build main-dry-run in order to get the CI to run the exact same checks it does on the main branch.
  - No, but it seems to fail in a step or code from another team:
    - Reach out to a member of the team responsible for that test.
    - Go for a. or b. from the previous points.
  - No, and there is suspicion of a flake.
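The last steps of the “not easily fixed” path (check out the PR branch, rebase it, request a main-like build) can be sketched as a small helper. This is only an illustration: it assumes the Git remote is named origin and that the sg CLI is installed; the function name is hypothetical.

```shell
# Sketch only: assumes a remote named "origin" and an installed sg CLI.
# BRANCH is the pull request branch whose changes were reverted from main.
main_dry_run() {
  branch="${1:-}"
  if [ -z "$branch" ]; then
    echo "usage: main_dry_run BRANCH" >&2
    return 2
  fi
  # Bring the PR branch up to date with main so the build sees the same
  # code that failed after the merge.
  git fetch origin main &&
    git checkout "$branch" &&
    git rebase origin/main &&
    # Ask the CI to run the checks it would run on the main branch.
    sg ci build main-dry-run
}
```

Each command only runs if the previous one succeeded, so a rebase conflict stops the helper before any build is requested.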
Builds are all failing on the main branch with the same error
- Severity: major
- Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
- Possible causes:
  - A previous Pull Request introduced a change that causes a test to fail.
  - A previous Pull Request introduced a change that modified state in an unexpected way and broke the CI.
  - An external dependency is not available anymore and is causing builds to fail.
  - A rate-limited API is throttling us and causing builds to fail.
Actions
- Identify the error in common with the recent builds on Buildkite.
  - 💡 See How to use loki here
- Find the build where the problem appeared for the first time.
  - 💡 Often it’s the first build that became red, but check that the error is the same to be sure.
- Is this an external failure or an internal one?
  - 💡 External failures are about downloading a dependency, like a package in a script or in a Dockerfile. Often they’ll manifest in the form of an HTTP error.
  - 💡 If unsure, ask for help on #dev-chat.
  - If it’s an external failure:
    - See the “SSH into an agent” scenario.
    - Try to reproduce the faulty HTTP request so you can observe what the problem is. Is it the same failure?
      - Yes: Do you know how to fix it? If not, escalate by creating an incident (/incident on Slack).
      - No: escalate by creating an incident (/incident on Slack).
  - If it’s an internal failure:
    - Does it involve a faulty build environment in the agents? (a given tool is not found where it should have been present, or has an incorrect version)
      - See the “SSH into an agent” scenario.
      - Try to find an agent that recently ran the faulty step successfully (look for a green build on the main branch).
        - Can you see a difference? If yes, take note.
      - Do you know how to fix it?
        - Yes: apply the fix.
        - No: escalate by creating an incident (/incident on Slack).
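When reproducing a faulty HTTP request, it often helps to look at the status code alone first. The sketch below is an assumption-laden illustration: the helper names and the diagnosis categories are hypothetical, and the URL to test has to be taken from the failing step’s logs.

```shell
# Print only the HTTP status code returned for a URL. curl prints 000 when
# the request could not be completed at all (DNS, connection, timeout).
http_status() {
  curl -sS -L -o /dev/null -w '%{http_code}' --max-time 30 "$1"
}

# Map a status code to a rough diagnosis. The categories are illustrative,
# not an official taxonomy.
classify_status() {
  case "$1" in
    2??) echo "ok" ;;
    429) echo "rate-limited" ;;
    401|403) echo "auth-or-forbidden" ;;
    404) echo "not-found" ;;
    5??) echo "upstream-error" ;;
    000) echo "unreachable" ;;
    *) echo "other" ;;
  esac
}

# Usage, with the URL copied from the failing step's logs:
#   classify_status "$(http_status "$URL")"
```

Running the same check from an agent and from your own machine can hint at whether the problem is specific to the CI environment. Note that some services also answer 403 when rate limiting, so the response body is worth reading too.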
Builds are failing on the main branch with different errors
- Severity: major
- Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
- Possible causes:
  - A previous Pull Request introduced a change that causes a test to fail.
  - An external dependency is not available anymore and is causing builds to fail under certain conditions.
  - A rate-limited API is throttling us and causing builds to fail.
Actions
- Escalate by creating an incident (/incident on Slack).
- Get some help by pinging @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.
Builds are all failing in my branch, on Bazel jobs, with many timeouts or cache/disk related errors or container errors.
- Severity: major
- Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
- Possible causes:
  - A previous Pull Request introduced a change that causes a test to fail. If that’s the case, you should see the problem on the main build corresponding to the commit you branched off from.
  - A previous Pull Request introduced a change that modified state in an unexpected way and broke the CI. If that’s the case, you should see the problem on the main build corresponding to the commit you branched off from.
  - A previous build did not properly tear down containers used in e2e test suites.
  - Agents are in a corrupted state due to a previous build.
  - Agents ran out of disk space.
Actions
- Escalate by creating an incident (/incident on Slack).
- Get some help by pinging @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.
- Request access to the DevX Day2Day entitle bundle by typing /access_request in Slack.
- Restart the agents by scaling the corresponding deployment to 0, then back to 2.
  - kubectl scale --replicas=0 -n buildkite-bazel deployments/buildkite-agent-bazel
  - Observe the pod count going down.
  - kubectl scale --replicas=2 -n buildkite-bazel deployments/buildkite-agent-bazel
  - The agent autoscaler will adjust the final replica count on its own.
- If you saw cache-related errors in the job logs, restart the remote cache by scaling the corresponding deployment to 0, then back to 1.
  - kubectl scale --replicas=0 -n buildkite-bazel deployments/ci-bazel-remote-cache
  - Observe the pod count going down.
  - kubectl scale --replicas=1 -n buildkite-bazel deployments/ci-bazel-remote-cache
  - Do not scale it above 1 instance: it uses a persistent disk that can only be accessed by a single instance.
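The scale-down, wait, scale-up sequence is the same for both deployments, so it can be wrapped in one helper. This is a minimal sketch, not an official tool: the function name and the POLL_SECONDS variable are hypothetical, and it assumes kubectl is already authenticated against the CI cluster.

```shell
# restart_deployment NAMESPACE DEPLOYMENT REPLICAS
# Scales a deployment to 0, waits until it reports no remaining pods, then
# scales it back up. POLL_SECONDS is a hypothetical knob for the wait loop.
restart_deployment() {
  ns="${1:-}"; deploy="${2:-}"; replicas="${3:-}"
  if [ -z "$ns" ] || [ -z "$deploy" ] || [ -z "$replicas" ]; then
    echo "usage: restart_deployment NAMESPACE DEPLOYMENT REPLICAS" >&2
    return 2
  fi
  kubectl scale --replicas=0 -n "$ns" "deployments/$deploy" || return 1
  # .status.replicas counts non-terminated pods; it is empty or 0 once
  # none are left.
  while :; do
    left="$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.status.replicas}')" || return 1
    if [ -z "$left" ] || [ "$left" = "0" ]; then
      break
    fi
    sleep "${POLL_SECONDS:-5}"
  done
  kubectl scale --replicas="$replicas" -n "$ns" "deployments/$deploy"
}

# Usage, mirroring the commands above:
#   restart_deployment buildkite-bazel buildkite-agent-bazel 2
#   restart_deployment buildkite-bazel ci-bazel-remote-cache 1   # never above 1
```

Pods that are still terminating are not counted, so it can be worth confirming with kubectl get pods before relying on the restart.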
Spotted a flake
- Severity: minor
- Impact: Some builds will fail randomly, creating noise and slowing down the engineering team
- Possible causes:
  - Tests relying on timing.
  - Race conditions.
  - End-to-end tests are delicate by nature and can fail randomly due to the complexity of all involved components.
Actions
- What kind of step is failing?
  - Is this an end-to-end test?
    - 💡 E2E tests are fragile by nature, there is no way around it.
    - Take note.
  - Is this a Docker image build step?
    - 💡 This should really not be happening.
    - Is the error about the Docker daemon?
      - Yes: this is a CI infrastructure flake. Ping @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.
      - No: reach out to the team owning that Docker image immediately.
  - Anything else:
    - Take note of the failing step and go to next point.
- Is that flake related to the CI infrastructure?
  - CI infrastructure problems often involve:
    - Docker daemon not being reachable.
    - Missing tools that we use to run the steps, such as go, node, comby, …
    - Errors from asdf, which is used to manage the above tools.
  - Yes: ping @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.
    - If nobody is online to help, reach out for help in #dev-chat.
- Is that flake related to the code?
  - See the process described in the flaky tests page.
Is this a failure or a flake?
- Severity: minor
- Impact: Some builds will fail randomly, creating noise and slowing down the engineering team
- Possible causes:
  - Tests relying on timing.
  - Race conditions.
  - End-to-end tests are delicate by nature and can fail randomly due to the complexity of all involved components.
Actions
- Immediately restart the faulty step.
  - 💡 It will save you time while you’re looking at the logs.
- Is the step passing now?
  - Yes: see the “Spotted a flake” scenario.
  - No: give it another try, and see next point.
- Check on Grafana if there are any occurrences of the failures that were previously observed:
  - Go to the “Explore” section.
  - Make sure to select grafanacloud-sourcegraph-logs in the dropdown at the top of the page.
  - Scope the time window to 7 Days to make sure to find previous occurrences, if there are any.
  - Enter a query such as {app="buildkite"} |= "your error message" where “your error message” is a string that approximately identifies the failure cause observed in the failing step.
  - Is there a build that failed exactly like this?
    - Yes:
      - 💡 Double check that you’re looking at the same step by inspecting the labels of the message (click on the line to make them visible).
      - That’s a flake. See the “Spotted a flake” scenario.
    - No: it’s not a flake, reach out to the team owning those tests.
You can also refer to the Loom walkthrough “how to find out if a CI failure is a recurring flake”.
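Error messages copied from logs often contain double quotes or backslashes, which break the query string if pasted as-is. The helper below is a small sketch (the function name is hypothetical) that builds the same kind of query as the one shown above while escaping those characters.

```shell
# Build a LogQL line-filter query for the Buildkite logs.
# Backslashes and double quotes in the message are escaped so that the
# message stays a valid double-quoted LogQL string.
loki_query() {
  escaped="$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf '{app="buildkite"} |= "%s"\n' "$escaped"
}

# Usage: paste the printed query into the Grafana "Explore" query field.
#   loki_query 'cannot find "go"'
```

A shorter, more distinctive fragment of the error usually matches better than a full line, since timestamps and paths tend to differ between builds.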
Builds are not being created on Buildkite
- Severity: major
- Impact: It’s possible to merge a PR without going through CI. No builds are produced and it’s impossible to deploy the new commits.
- Possible causes:
  - GitHub is experiencing an outage that is affecting webhooks.
  - Buildkite is experiencing an outage.
  - Webhooks that trigger the builds have been deleted.
Actions
- Inspect the webhooks status in the sourcegraph/sourcegraph repository settings.
  - If you’re not authorized to see this page, ping @dev-infra-support or escalate to @github-owners.
  - Check the status of the webhook: if it’s not green, something is wrong. However, a green status is no guarantee that the webhook is operating as usual! If GitHub Webhooks is experiencing degraded performance, it might not be emitting events to the endpoint at all anymore, and the green status may only reflect the last submission before the outage started. See the next step to verify the status of Webhooks.
- Check GitHub Status.
- Check Buildkite Status.
- A possible way to mitigate a GitHub outage is to recreate the webhook.
  - Delete the old buildkite webhook.
  - Create a new one by following these instructions.
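The webhook checks can also be done from a terminal. The following sketch assumes the GitHub CLI (gh) is installed and authenticated with enough permissions to read repository webhooks, which this page does not state; the function names are hypothetical. It relies on the GitHub REST endpoints for repository webhooks and on the public GitHub status page API.

```shell
# Sketch only: assumes gh is installed and allowed to read webhooks.
REPO="sourcegraph/sourcegraph"

# List each webhook with its target URL and the last response GitHub recorded.
list_webhooks() {
  gh api "repos/$REPO/hooks" \
    --jq '.[] | "\(.id)\t\(.config.url)\t\(.last_response.code)\t\(.last_response.status)"'
}

# Show the recent deliveries for one webhook: recent_deliveries HOOK_ID
recent_deliveries() {
  gh api "repos/$REPO/hooks/$1/deliveries" \
    --jq '.[] | "\(.delivered_at)\t\(.event)\t\(.status_code)"'
}

# Fetch the overall GitHub status as JSON from the public status page.
github_status() {
  curl -sS https://www.githubstatus.com/api/v2/status.json
}
```

If the most recent delivery is much older than the latest push, that suggests events are no longer being emitted, even when the last recorded response looks healthy.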
SSH into an agent
- Severity: none
- Impact: none (unless a destructive action is performed)
- Possible cause:
  - Need to investigate a problem and suspect the agent is at fault
Actions
- Identify if you want to look at a Bazel agent or a stateless one. Bazel agents are under the buildkite-bazel namespace, and stateless agents are under the buildkite namespace.
- Request access to the DevX Day2Day entitle bundle by typing /access_request in Slack.
- Find the pod you want to SSH into with one of the following methods:
  - Use kubectl get pods -n $NAMESPACE -w to observe the currently running agents and get the pod name (k9s works here too).
  - From a Buildkite build page, click the “Timeline” tab of a job and see the entry for “Accepted Job”. The “Host name” in the entry is also the name of the pod that the job was assigned to.
- Use kubectl exec -n $NAMESPACE -it buildkite-agent-xxxxxxxxxx-yyyyy -- bash to open a shell on the Buildkite agent.
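Picking the right namespace is the step that is easiest to get wrong, so it can be encoded once. The helpers below are a sketch with hypothetical names; the namespaces and the kubectl commands are the ones given in the steps above.

```shell
# Map an agent kind to its namespace: bazel or stateless.
agent_namespace() {
  case "${1:-}" in
    bazel) echo "buildkite-bazel" ;;
    stateless) echo "buildkite" ;;
    *) echo "unknown agent kind: ${1:-} (expected bazel or stateless)" >&2
       return 2 ;;
  esac
}

# List the agent pods of a given kind: agent_pods KIND
agent_pods() {
  ns="$(agent_namespace "${1:-}")" || return 2
  kubectl get pods -n "$ns"
}

# Open an interactive shell on a pod: agent_shell KIND POD_NAME
agent_shell() {
  ns="$(agent_namespace "${1:-}")" || return 2
  kubectl exec -n "$ns" -it "${2:-}" -- bash
}
```

For example, agent_shell bazel followed by the pod name taken from the “Host name” entry opens a shell on that Bazel agent.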
Replacing Agents
- Severity: minor
- Impact: May fail ongoing builds, but that’s fine.
- Possible causes:
  - A newer version of the agents needs to be deployed.
Actions
- Refer to the instructions here to remove currently deployed agents. The buildkite-job-dispatcher will deploy jobs with any updated config.
Agent availability issues
- Severity: major
- Impact: Builds stuck in “waiting for agent”
- Possible cause:
  - Agent dispatch malfunction or GCP infrastructure outage
Actions
- Check dispatcher dashboard for health metrics
- Check dispatched agents for availability issues
- Check dispatcher logs for details
For more details, see the source: buildkite-job-dispatcher