Scaling operations

Case study: Zalora

Summary

Zalora was facing challenges scaling its operations after rapidly growing to reach millions of customers. Christopher Forno, prior to joining the Singapore Data Company, was tasked with starting and leading DevOps at Zalora. Zalora’s transformation from traditional systems administration to DevOps brought it stability, improved new feature testing, and positioned it for further growth.

Initial state

When Forno started, Zalora’s servers had already outgrown their previously allocated network space and had been divided into 3 networks. Administration tasks thus had to be committed 3 times to 3 different Puppet git repositories. Small differences between the 3 networks led to errors and slowed down the work of system administrators. Adding more servers to handle more traffic would exacerbate the problem further.

Code deployments were done by the release manager directly on production servers with the aid of engineers.

Aside from being error-prone, this led to some confusion about the real state of production systems. When investigating an error, system administrators would often have to manually search for changes made to live servers that were not reflected in commit logs.

The rapid pace of new feature development was bottlenecked by the rate at which system administrators could configure, test, and deploy new infrastructure. Engineers worked around the problem by deploying the necessary systems into the QA environment, leading to configuration drift between production and QA. To maintain stability for testing new features, engineers and QA had adopted a split development/QA iteration cycle.
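To make "configuration drift" concrete, the following is a minimal sketch of comparing two environments. It is not a tool Zalora used. The function name `config_drift` and the package names and versions are hypothetical, chosen only for illustration.

```python
def config_drift(production: dict, qa: dict) -> dict:
    """Return every setting whose value differs between the two environments.

    Each entry maps a setting name to a (production_value, qa_value) pair.
    A setting that is absent from one environment is reported as None there.
    """
    drift = {}
    for key in sorted(production.keys() | qa.keys()):
        prod_value = production.get(key)
        qa_value = qa.get(key)
        if prod_value != qa_value:
            drift[key] = (prod_value, qa_value)
    return drift


# Hypothetical package versions, for illustration only.
production = {"nginx": "1.10", "php": "5.6"}
qa = {"nginx": "1.10", "php": "7.0", "redis": "3.2"}

for name, (prod_value, qa_value) in config_drift(production, qa).items():
    print(f"{name}: production={prod_value} qa={qa_value}")
```

A check like this only reports what each environment declares. It cannot see manual changes made on live servers, which was the other half of the problem described above.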

Solution

Forno identified that Zalora needed to:

Define an ideal network and completely describe it in a single repository. This repository would contain everything necessary to instantiate a working network capable of serving traffic for a single country (e.g. Zalora Singapore), on AWS or on bare metal, either as a production network, as a QA network, or as a development VM (in which case VirtualBox also needed to be supported).
Migrate all systems over to this single network definition and put all future changes through this single repository.
Integrate code deployment with infrastructure deployment.
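The idea behind the first point can be sketched as one definition parameterised by purpose and backend, so that production, QA, and development networks differ only in those parameters. The sketch is in Python rather than the Nix language Zalora used. `NetworkSpec`, `instantiate`, the role list, and the machine naming scheme are all hypothetical, and restricting VirtualBox to development networks is an assumption.

```python
from dataclasses import dataclass

# Hypothetical machine roles; the real Zalora network layout is not described.
ROLES = ("web", "database", "cache")

# Assumption: VirtualBox is only meaningful for development VMs.
VALID_BACKENDS = {
    "production": {"aws", "bare-metal"},
    "qa": {"aws", "bare-metal"},
    "dev": {"aws", "bare-metal", "virtualbox"},
}


@dataclass(frozen=True)
class NetworkSpec:
    country: str   # e.g. "sg" for Zalora Singapore
    purpose: str   # "production", "qa", or "dev"
    backend: str   # "aws", "bare-metal", or "virtualbox"


def instantiate(spec: NetworkSpec) -> dict:
    """Expand one shared definition into the machines for a concrete network."""
    if spec.purpose not in VALID_BACKENDS:
        raise ValueError(f"unknown purpose: {spec.purpose}")
    if spec.backend not in VALID_BACKENDS[spec.purpose]:
        raise ValueError(f"{spec.backend} is not supported for {spec.purpose}")
    return {
        f"{spec.country}-{spec.purpose}-{role}": {"role": role, "backend": spec.backend}
        for role in ROLES
    }


production = instantiate(NetworkSpec("sg", "production", "aws"))
laptop = instantiate(NetworkSpec("sg", "dev", "virtualbox"))
print(sorted(production))
print(sorted(laptop))
```

Because both networks come from the same `ROLES` definition, a change to the definition reaches every environment through one commit instead of three.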

Doing so enabled the following:

Developers could run a working copy of the website on their own laptop and develop new features in an environment identical to production at any time (without needing to halt for QA testing).
QA had its own network, which was not shared with developers. When developers wanted QA to test changes they just had to push the changes to the appropriate location. When QA needed to test multiple changes concurrently, DevOps could just deploy another network.
Deployments could be handled via a point-and-click GUI, eliminating “fat finger” outages and creating a single source of truth for both engineers and operations to understand the entire state of the system.

DevOps was further improved as follows:

All deployments as well as errors and notices were then logged to a central location, accessible to all developers.
Transparent secure tunnels between networks were created, allowing for cross-network communication with no overhead for developers. This enabled deploying to multiple data centers for cost savings and to reduce latency to customers.
All data sources were consolidated into a single location (the Zalora data warehouse) for the benefit of the BI team.

HR and onboarding

Forno hired 7 staff (6 full-time, 1 part-time) and 1 technical writer/editor (to document everything).

They were hired to take over day-to-day systems management, and to have enough spare resources to build the new systems in parallel.

They were trained by being asked to read and improve the existing documentation, by being assigned tangential tasks that would introduce them to the company's systems without overwhelming them, and by being encouraged to communicate among themselves for knowledge transfer. All were already competent with most of the systems Zalora was using and quickly came up to speed with the rest.

As the team integrated each system, Forno arranged and attended meetings with the responsible developers until he was confident that DevOps had established a reliable and persistent communication channel. In the special case of the technical writer, he continuously encouraged people from all departments of the company to talk to her.

This gave other departments the chance to understand what DevOps could do for them and who could resolve their issues.

Notes on the technology used

The technology we used (NixOS, the Nix language, and NixOps) is unique in how it handles changes. The deployed servers are essentially read-only: their filesystem cannot be modified via SSH, even by root. This provides some practical benefits:

any deployed changes can be rolled back;
the configuration in git is strongly guaranteed to describe the state of the systems.
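These two properties can be modelled loosely as a history of immutable configuration "generations", where deploying adds a generation and rolling back re-activates an earlier one. The sketch below is a simplified Python model under that assumption. It is not how NixOS or NixOps is implemented, and the `Deployment` class and configuration keys are hypothetical.

```python
class Deployment:
    """Toy model: every deploy records a full configuration snapshot."""

    def __init__(self, initial_config: dict):
        # Snapshots are copied on the way in and never edited afterwards.
        self._generations = [dict(initial_config)]
        self._current = 0

    @property
    def active(self) -> dict:
        # Return a copy so callers cannot edit the recorded snapshot.
        return dict(self._generations[self._current])

    def deploy(self, config: dict) -> int:
        """Record a new generation and make it the active one."""
        self._generations.append(dict(config))
        self._current = len(self._generations) - 1
        return self._current

    def rollback(self) -> int:
        """Re-activate the generation before the active one."""
        if self._current == 0:
            raise RuntimeError("no earlier generation to roll back to")
        self._current -= 1
        return self._current


server = Deployment({"php": "5.6"})       # hypothetical configuration
server.deploy({"php": "7.0"})
server.active["php"] = "edited by hand"   # changes a copy, not the server
print(server.active)                      # still the deployed snapshot
server.rollback()
print(server.active)                      # the previous snapshot again
```

In this model the only way to change the active configuration is through `deploy` or `rollback`, which is why the recorded history always matches the running state.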

Most alternatives fall into two categories: traditional configuration management or containers.

Traditional configuration management was partially responsible for the problems the company was facing: the Puppet configuration did not match the actual state of the servers, meaning that system administrators were afraid to make changes for fear of breaking the system in an irreversible way. The company had also tried alternatives such as Salt with similar results.

Containers are a new approach to operations that addresses deployment by allowing developers to build containers that can then be tested and shipped into production as-is. They are popular for two reasons. They suit hosting companies (for example, those using tools like Google's Kubernetes), because the host can use lightweight containers instead of heavyweight virtual machines, which in turn allows finer control over tenancy. They are also easy for developers to understand.

However, at the time there were no container-based solutions for managing a network of machines, meaning that all management would have continued to be manual.

The other disadvantage of containers was their opaqueness: only the developers who built a container had access to the source and build system necessary to identify and fix problems in production. This becomes an issue when scheduling pager duty, because at least one developer for every system needs to be available at any time on any day to deal with any problems.

Today

N.B.: this case study was approved for release in March 2017.

Zalora continues to grow, supported by the DevOps team of 3 full-timers. After the development phase finished, the team naturally shrank over time to the minimum size necessary for maintenance and support.