Sourcegraph.com observability

We provide some tooling to make Sourcegraph.com easier to monitor and observe. This includes observability for relevant critical infrastructure such as our CI/CD pipelines.

For general observability development, please refer to the observability development documentation instead, which includes links to useful how-to guides.

Monitoring

For metrics and alerting, see the Sourcegraph monitoring guide.

Grafana Cloud

We have a Grafana Cloud instance at sourcegraph.grafana.net. Accounts are automatically provisioned when you log in with GSuite OAuth.

Logs

Logs in Grafana Cloud are provided by Grafana Loki, a log aggregation system that uses a PromQL-like query language called LogQL.

Loki allows you to easily query for logs, filter for fields within structured logs, and even generate metrics from logs. The official LogQL documentation provides a complete reference, or you can refer to this cheatsheet for a brief overview.

Cloud logs

The Loki instance in Grafana Cloud is currently configured to ingest logs from Sourcegraph.com, which are pushed by grafana-agent using its Loki configuration. To query these, you can start with a LogQL query like:

{deploy="sourcegraph",app="sourcegraph-frontend"}
  | logfmt
  | lvl="warn"
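
If you want to run the same query outside the Grafana UI, the sketch below builds a request URL for Loki's HTTP `query_range` endpoint. It is a minimal illustration rather than our actual setup: `LOKI_BASE_URL` and the `build_query_range_url` helper are hypothetical names, the real endpoint of our Grafana Cloud Loki instance is not given here, and authentication is left out entirely.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical placeholder: the real Loki endpoint for the Grafana Cloud
# instance is not stated in this document.
LOKI_BASE_URL = "https://loki.example.com"

# The example LogQL query from above, written on a single line.
QUERY = '{deploy="sourcegraph",app="sourcegraph-frontend"} | logfmt | lvl="warn"'


def build_query_range_url(logql, end, lookback=timedelta(hours=1), limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint.

    `end` must be a timezone-aware datetime; the query window covers the
    `lookback` period that finishes at `end`.
    """
    start = end - lookback
    params = {
        "query": logql,
        # Loki accepts start and end as Unix timestamps in nanoseconds.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Only prints the URL; sending it would also require credentials.
    print(build_query_range_url(QUERY, end=datetime.now(timezone.utc)))
```

For ad-hoc exploration, the Explore view in Grafana remains the simpler option; a URL like this is mostly useful for scripting.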

CI logs

The sourcegraph/sourcegraph CI pipeline also uses sg to upload pipeline logs to Loki. These uploads only happen for failed builds on main; we do not publish data for successful builds or branch builds (for those, you can refer to our build traces). To query logs, you can start with a LogQL query like:

{app="buildkite",branch="main",state="failed"}
  |~ "FAILED:"

For more examples, refer to the CI dashboard, which is a set of graphs based on the contents of uploaded logs: select a panel and click “Explore” to see the underlying query.

A demo is also available that walks through one of the most common use cases of this functionality: finding out whether a failed build is a recurring flake.

Cloudflare

Cloudflare Analytics is used to extract useful data about the performance of our WAF, as well as the overall traffic distribution to our instances. Note that the retention of analytics data is relatively short due to the limits on our plan.

This section gives a quick overview of how to access Cloudflare analytics and how to interface with their GraphQL API. Note that in most cases, you’ll be able to get much richer metrics from the existing dashboards in our own internal monitoring.

GraphQL API

Cloudflare Analytics provides a somewhat limited API for retrieving monitoring data. Note that you can only retrieve relatively recent data, and that the number of operations you can perform is limited.

Tools

Cloudflare recommends using GraphiQL, a lightweight Electron app, to interface with their API due to its relative ease of use. Configuration instructions are here. The auth key and email can be found here. The tool also helps enumerate the available parameters, and is quite useful for exploring the API.

Available data

The Cloudflare API mainly contains network layer information about communications to and from the service. The entire list of datasets is enumerated here. For example, the number of requests and page views per minute, along with the number of unique visitors, can be found with the following query. Note that the results are explicitly ordered by datetimeMinute_ASC, since the default response ordering is not based on time.

{
  viewer {
    zones(filter: {zoneTag: [ZONE_TAG]}) {
      httpRequests1mGroups(limit: 10000, filter: {datetime_gt: "", datetime_lt: ""}, orderBy: [datetimeMinute_ASC]) {
        sum {
          requests
          pageViews
        }
        uniq {
          uniques
        }
        dimensions {
          datetimeMinute
        }
      }
    }
  }
}
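
To run the query above from a script instead of GraphiQL, something like the following sketch could work. It is an illustration under assumptions, not our tooling: `build_request` and `fetch` are hypothetical helper names, the zone tag is assumed to be passed as a quoted string, the time bounds are assumed to be ISO 8601 timestamps, and authentication is assumed to use the same email and auth key as GraphiQL through the `X-Auth-Email` and `X-Auth-Key` headers.

```python
import json
import urllib.request
from string import Template

# Cloudflare's GraphQL Analytics API endpoint.
CLOUDFLARE_GRAPHQL_URL = "https://api.cloudflare.com/client/v4/graphql"

# Same query as above, with $-placeholders for the values the caller supplies.
QUERY_TEMPLATE = Template("""
{
  viewer {
    zones(filter: {zoneTag: $zone_tag}) {
      httpRequests1mGroups(
        limit: 10000,
        filter: {datetime_gt: $start, datetime_lt: $end},
        orderBy: [datetimeMinute_ASC]
      ) {
        sum { requests pageViews }
        uniq { uniques }
        dimensions { datetimeMinute }
      }
    }
  }
}
""")


def build_request(zone_tag, start, end, email, api_key):
    """Build (but do not send) a POST request for the per-minute query."""
    # json.dumps turns each value into a double-quoted string literal.
    query = QUERY_TEMPLATE.substitute(
        zone_tag=json.dumps(zone_tag),
        start=json.dumps(start),
        end=json.dumps(end),
    )
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        CLOUDFLARE_GRAPHQL_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: same email and auth key used to configure GraphiQL.
            "X-Auth-Email": email,
            "X-Auth-Key": api_key,
        },
        method="POST",
    )


def fetch(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch(build_request(zone_tag, start, end, email, api_key))` with real credentials sends the query. Keeping request construction separate from sending makes it possible to inspect the payload before spending one of the limited API operations.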