Managed Instance technical documentation

Operations

Please review the Managed Instances operations guide for instructions.

Release process

SOC2/CI-100

Sourcegraph upgrades all test and customer instances according to the SLA.

The release process consists of the following steps:

A new version is released by the release guild.
A tracking GitHub issue is created in the Sourcegraph Customer repository with the mi2 env create-tracking-issue -e prod $TARGET_VERSION command.
The GitHub issue is labeled with team/cloud, which automatically notifies the Cloud Team to perform the Managed Instances upgrade. The label is part of the issue template.
The Cloud Team upgrades all instances in the following order:

Stage	Working days since release	Action	Condition not met?
1	0-2	Upgrade internal instances by Cloud Team (incl. demo and rctest)	-
2	3-4	Time for verification by Sourcegraph teams	New patch created -> start from stage 1
3	5-6	Upgrade 30% of trials and 10% of customers	New patch created -> upgrade internal instances within 1 working day, then start from stage 2
4	7-8	Upgrade 100% of trials and 40% of customers	New patch created -> upgrade internal instances within 1 working day, then start from stage 3
5	9-10	Upgrade 100% of customers	New patch created -> upgrade internal instances within 1 working day, then start from stage 3
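The stage table can be expressed as data, which makes the schedule and its restart rules easy to check. The following is a minimal Python sketch, assuming working days are counted from 0 on the release day. The Stage class and the stage_for_day and on_new_patch functions are hypothetical illustrations, not part of any Cloud Team tooling.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    number: int
    first_day: int                # first working day since release (inclusive)
    last_day: int                 # last working day since release (inclusive)
    action: str
    restart_from: Optional[int]   # stage to resume from if a new patch is created
    internal_first: bool          # re-upgrade internal instances (1 working day) before resuming


# Hypothetical encoding of the stage table; None means the table lists no rule.
STAGES = [
    Stage(1, 0, 2, "Upgrade internal instances (incl. demo and rctest)", None, False),
    Stage(2, 3, 4, "Verification by Sourcegraph teams", 1, False),
    Stage(3, 5, 6, "Upgrade 30% of trials and 10% of customers", 2, True),
    Stage(4, 7, 8, "Upgrade 100% of trials and 40% of customers", 3, True),
    Stage(5, 9, 10, "Upgrade 100% of customers", 3, True),
]


def stage_for_day(working_day: int) -> Stage:
    """Return the stage scheduled for a given number of working days since release."""
    for stage in STAGES:
        if stage.first_day <= working_day <= stage.last_day:
            return stage
    raise ValueError(f"no stage scheduled for working day {working_day}")


def on_new_patch(stage: Stage) -> Tuple[int, bool]:
    """Return (stage to resume from, whether internal instances are re-upgraded first)."""
    if stage.restart_from is None:
        # Assumption: a new patch during stage 1 simply restarts stage 1.
        return 1, False
    return stage.restart_from, stage.internal_first
```

For example, stage_for_day(6).action returns "Upgrade 30% of trials and 10% of customers", and on_new_patch(stage_for_day(7)) returns (3, True): internal instances are upgraded first, then the rollout resumes from stage 3.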

After each instance is upgraded, its uptime checks are verified, including through automated monitoring.

Sample upgrade:

Tracking issue for 5.1.4
GitHub pull requests for the 5.1.4 upgrade

Release process for patch releases

With the bi-weekly patch release schedule, the Cloud Team uses a simplified release process so that Cloud customers receive patches as soon as possible.

Stage	Working days since release	Action
1	0-2	Patch internal instances by Cloud Team (incl. demo, clouddev and rctest)
2	3-5	Patch trials and customer instances.
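Because the schedule is expressed in working days, turning it into calendar deadlines means skipping weekends. The following is a minimal Python sketch, assuming the release day counts as working day 0 and ignoring public holidays; add_working_days and patch_deadlines are hypothetical helpers.

```python
from datetime import date, timedelta
from typing import Dict


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days (Mon-Fri) after `start`.

    Assumption: only weekends are skipped; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


# Patch stages from the table: (stage, last working day since release, action)
PATCH_STAGES = [
    (1, 2, "Patch internal instances (incl. demo, clouddev and rctest)"),
    (2, 5, "Patch trials and customer instances"),
]


def patch_deadlines(release: date) -> Dict[int, date]:
    """Map each patch stage to the calendar date by which it should be complete."""
    return {stage: add_working_days(release, last_day) for stage, last_day, _ in PATCH_STAGES}
```

For a patch released on Friday 2024-01-05, this gives Tuesday 2024-01-09 for internal instances and Friday 2024-01-12 for trials and customers.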

Known limitations of managed instances

Sourcegraph managed instances currently run on Kubernetes, specifically Google Kubernetes Engine (GKE).

The current Cloud architecture has been tested to support a workload of more than 100,000 repositories (440 GB of Git storage) and 10,000 simulated users on an n2-standard-32 VM.

Security

Isolation: Each managed instance is created in an isolated GCP project with restrictive gcloud access ACLs and network ACLs.
Admin access: Both the customer and Sourcegraph personnel will have access to an application-level admin account. Learn more about how we ensure secure access to your instance.
VM/SSH access: Only Sourcegraph personnel have access to the underlying GCP environment, and only through GCP IAP TCP proxy access. Sourcegraph personnel can make changes or provide data from the environment upon request by the customer.
Inbound network access: The customer may choose between having the deployment accessible via the public internet and protected by their SSO provider, or, for additional security, restricted to an allowlist of IP addresses only (such as their corporate VPN). Filtering of the IP allowlist is performed by our WAF provider, Cloudflare. Note: in addition to the customer-provided IP allowlist, traffic from well-known public code hosts (e.g. GitHub.com) is also permitted to access selected Sourcegraph endpoints so that certain features keep working.
Outbound network access: The Sourcegraph deployment has unfettered egress TCP/ICMP access, and customers need to allow the Sourcegraph deployment to contact their code host. This can be done by making their code host publicly accessible, or by allowing the static IP of the Sourcegraph deployment to access their code host.
Web Application Firewall (WAF) protections: All managed instances are proxied through Cloudflare and leverage security features such as rate limiting and the Cloudflare WAF.
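The inbound allowlist rule above can be sketched as a small decision function using Python's ipaddress module. This only illustrates the logic; the actual filtering is done by Cloudflare and is not configured this way. All CIDR ranges and the endpoint path below are hypothetical placeholders (RFC 5737 documentation ranges). Real code host ranges would come from the provider; GitHub.com, for instance, publishes its ranges through its meta API.

```python
import ipaddress
from typing import Iterable, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def _networks(cidrs: Iterable[str]) -> List[Network]:
    return [ipaddress.ip_network(c) for c in cidrs]


# Hypothetical placeholder values, not real configuration.
CUSTOMER_ALLOWLIST = _networks(["203.0.113.0/24"])  # e.g. corporate VPN egress
CODE_HOST_RANGES = _networks(["198.51.100.0/24"])   # stand-in for a public code host
CODE_HOST_PATHS = ("/hypothetical/code-host-endpoint",)


def is_allowed(source_ip: str, path: str) -> bool:
    """Return True if a request from source_ip to path passes the allowlist sketch."""
    ip = ipaddress.ip_address(source_ip)
    # Customer-allowlisted addresses may reach any endpoint.
    if any(ip in net for net in CUSTOMER_ALLOWLIST):
        return True
    # Code host traffic is limited to selected endpoints only.
    if any(ip in net for net in CODE_HOST_RANGES):
        return path.startswith(CODE_HOST_PATHS)
    # Everything else is blocked.
    return False
```

Keeping the code host exception path-scoped means a public code host range never grants access to the rest of the instance.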

Access can be requested in #it-tech-ops and requires manager approval.

Monitoring and alerting

SOC2/CI-86 SOC2/CI-25

Each managed instance is created in an isolated GCP project, with exclusive resources.

Metrics are visible to Sourcegraph employees with access in a centralized GCP metrics scoping project. All metrics can be seen in the scoped projects dashboard.

Every customer managed instance has alerts configured:

a cloud provider-managed uptime check, configured in the instance's dedicated GCP project
- v2.0
instance performance metrics alerts, configured in the scoping project for all managed instances
additional v2.0 infrastructure performance metrics alerts, configured per instance
application performance metrics alerts, based on application log events

Alerting flow:

When an alert is triggered, it is sent to Opsgenie.
From Opsgenie, the alert is sent to the on-call Cloud engineer and to Slack channels (#opsgenie, #alerts-managed-instances).
The on-call Cloud engineer decides what type of alert it is and whether an incident should be opened, and if so, follows the incident procedure. The on-call Cloud engineer should use the generated managed instances operations guide to check, assess, and repair the broken managed instance.
When the alert is closed via incident resolution, post-mortem actions have to be assigned and performed.
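The alerting flow can be modeled as a few state transitions, which makes the rule about post-mortem actions explicit. This is a minimal Python sketch; the Alert class and the route, triage, and close functions are hypothetical and do not represent the Opsgenie API or any existing Cloud tooling.

```python
from dataclasses import dataclass, field
from typing import List

SLACK_CHANNELS = ("#opsgenie", "#alerts-managed-instances")


@dataclass
class Alert:
    instance: str
    summary: str
    status: str = "open"
    notified: List[str] = field(default_factory=list)
    incident_opened: bool = False
    postmortem_actions: List[str] = field(default_factory=list)


def route(alert: Alert) -> Alert:
    """Opsgenie fans the alert out to the on-call Cloud engineer and Slack channels."""
    alert.notified = ["on-call-cloud", *SLACK_CHANNELS]
    return alert


def triage(alert: Alert, open_incident: bool) -> Alert:
    """The on-call engineer decides whether the alert warrants an incident."""
    alert.incident_opened = open_incident
    return alert


def close(alert: Alert, postmortem_actions: List[str]) -> Alert:
    """Close the alert; incidents cannot be closed without post-mortem actions."""
    if alert.incident_opened and not postmortem_actions:
        raise ValueError("alerts resolved via an incident need post-mortem actions assigned")
    alert.postmortem_actions = list(postmortem_actions)
    alert.status = "closed"
    return alert
```

Alerts that never become incidents can be closed without post-mortem actions, while incident-resolved alerts are rejected until actions are assigned.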

Opsgenie alerts
Sample managed instance incident: customer XXX is down.

List Trials

Please visit go/cloud-ops