Clearly during COVID-19, Slack became more mainstream. I was wondering how they cope with that scale and found this article. They shared how they deploy. These are the interesting points from the article.
Deploys require a careful balance of speed and reliability.
As understand from their article, although they are available globally, they still mainly operate in US timezone.
Merged code is only deployed during North America business hours to make sure we are fully staffed for any unexpected problems.
They deploy every 2 hours!
Every day, we do about 12 scheduled deploys
They look like don’t have the enterprise toil in their release cycle. Engineers building software looks like also responsible making it live. This is good engineering and one of the core principles of DevOps.
an engineer is designated as the deploy commander in charge of rolling out the new build to production
I imagine, when there is a problem in that release, they are trying to fix it first rolling back and then sending the fix from the release branch. This is still a good fall-forward strategy
builds can be rolled back if there is a spike in errors and easily hotfixed if we detect a problem after release.
We investigate the issue, identify the PR that is causing problems, revert it, cherry-pick in that revert, and make a new build. However — sometimes we don’t catch a problem before it reaches production. In this scenario, it’s critical to restore service, so we immediately roll back to a previous working build before starting our investigation.
They look like had a rollout issue due to number of instances and developed a pull-based system instead of push-based. I think this makes the release process more atomic and self-serve. It can also help reducing blast radius if something goes wrong.
Instead of pushing the new build to our servers using a sync script, each server pulls the build concurrently when signalled by a Consul key change. This allows us to maintain a high velocity of deploys even as we continue to scale
During a deploy, the new code would be copied to the unused cold directory. Then, once the server was drained of active processes, we would switch directories instantaneously.