Centralised and self-service monitoring helps identify problems early and diagnose errors in the heat of an emergency.
- Design a central logging system if one doesn't exist already. Identify all systems that should be logging to it and provide them with an API and documentation for how to do so.
- Recommend how to provide web access to the central logging system and train all interested employees on how to use it.
- Propose an alerting system that both monitors the central logging system for deviations and exposes an emergency pass-through API for alerting the business to urgent errors.
- Formalise an on-call policy and schedule for handling emergencies from the alerting system.
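As a rough illustration of what "monitoring the central logging system for deviations" could mean in practice, the sketch below compares the latest error count against a baseline of recent counts. The function name, the threshold of three standard deviations, and the idea of counting errors per interval are all assumptions made for this example; a real alerting system would be designed around the business's own log data.

```python
from statistics import mean, stdev


def is_deviation(history, current, threshold=3.0):
    """Return True if `current` sits more than `threshold` standard
    deviations above the mean of `history`.

    `history` is a list of error counts from earlier intervals
    (hypothetical input; the source does not specify a metric).
    """
    if len(history) < 2:
        # Not enough data to estimate a baseline, so never alert.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any increase counts as a deviation.
        return current > baseline
    return (current - baseline) / spread > threshold
```

A check like this would run on a schedule against the central logs, while the emergency pass-through API would bypass it entirely and page the on-call person directly.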
Tested and automated backups protect the business from catastrophic disasters.
- Trace all data sources to an automated backup. If no automated backups exist for a data source, document how to put them in place.
- Gather and define redundancy and latency requirements, and review backups against them. Where backups fall short, provide a plan for meeting the requirements.
- Test all existing backups to ensure that they can be used for recovery in case of disaster. Point out any failures and how to repair them.
- Create a plan for continuous backup testing to ensure that backups are always ready in case of disaster.
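One possible shape for an automated backup test is to restore a backup into a scratch directory and compare file checksums against the original data. The sketch below assumes file-based data and uses hypothetical function names; databases and other data sources would need their own restore and comparison steps.

```python
import hashlib
from pathlib import Path


def tree_checksums(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    root = Path(root)
    sums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            sums[path.relative_to(root).as_posix()] = digest
    return sums


def verify_restore(source_dir, restored_dir):
    """Return the sorted list of files that are missing from, or differ in,
    the restored copy. An empty list means the restore matched."""
    expected = tree_checksums(source_dir)
    actual = tree_checksums(restored_dir)
    return sorted(name for name in expected if actual.get(name) != expected[name])
```

For data that changes between backups, the comparison would more likely be made against a checksum manifest captured at backup time rather than against the live source.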
Multi-layered security measures protect against both external and internal threats.
- Test network perimeters for vulnerabilities (also known as penetration testing) and create a plan for closing them (through firewalls, security patches, etc.).
- Audit communication channels to ensure that data is not being transferred "in the clear" over the internet. Provide a plan for encrypting any clear channels discovered.
- Identify employees with direct access to sensitive data (customer emails, etc.) and propose indirect alternatives that still allow them to do their job.
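A first pass at auditing communication channels can be as simple as scanning configured endpoints for unencrypted URL schemes. The sketch below is a minimal example under that assumption; the list of cleartext schemes and the function name are illustrative, and a scheme check only inspects configuration, not actual network traffic.

```python
from urllib.parse import urlsplit

# Assumed set of schemes that carry data unencrypted; extend as needed.
CLEARTEXT_SCHEMES = {"http", "ftp", "telnet", "ws"}


def find_clear_channels(endpoints):
    """Return the endpoints whose URL scheme sends data in the clear."""
    return [
        url for url in endpoints
        if urlsplit(url).scheme.lower() in CLEARTEXT_SCHEMES
    ]
```

For example, given a list containing `https://api.example.com` and `http://legacy.example.com/feed`, only the second would be reported as a channel to encrypt.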
Well-tuned systems scale larger, provide better customer experience, and save the business money.
- Review any profiling systems already in place. Provide recommendations and plans for setting up continuous and ad-hoc profiling to proactively discover slow code.
- Identify likely failure points and propose high availability solutions, including cost and a plan for setup.
- Review server and network specifications for any wasteful over-provisioning and provide estimated cost savings for resizing to more reasonable specifications.
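To show how an estimated cost saving from resizing might be calculated, the sketch below assumes that cost scales linearly with capacity and that servers should be sized so peak load reaches about 70% of capacity. Both assumptions, along with the `Server` type and its fields, are hypothetical simplifications; real pricing tiers are rarely linear.

```python
from dataclasses import dataclass

# Assumed headroom policy: size servers so peak load uses 70% of capacity.
TARGET_PEAK_UTILISATION = 0.7


@dataclass
class Server:
    name: str
    monthly_cost: float
    peak_utilisation: float  # fraction of capacity used at peak, 0.0 to 1.0


def estimated_monthly_saving(server, target=TARGET_PEAK_UTILISATION):
    """Estimate the saving from resizing, assuming cost is linear in capacity."""
    if server.peak_utilisation >= target:
        # Already at or above the target, so there is nothing to reclaim.
        return 0.0
    resized_cost = server.monthly_cost * server.peak_utilisation / target
    return server.monthly_cost - resized_cost


def total_saving(servers, target=TARGET_PEAK_UTILISATION):
    """Sum the estimated monthly savings across all servers."""
    return sum(estimated_monthly_saving(s, target) for s in servers)
```

Under these assumptions, a server costing 1000 per month that peaks at 35% utilisation could be roughly halved in size, for an estimated saving of about 500 per month.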
If your technical staff is not equipped to implement all recommendations, we have direct experience in all areas and can help:
- Train existing developers and/or administrators in new systems or best practices.
- Identify existing competencies among employees who are not officially in a DevOps or System Administration role.
- Implement urgent and critical tasks ourselves until your DevOps team is running self-sufficiently at full speed.
Due to the sensitive nature of the work, particularly with security projects, we cannot discuss our current and past projects. However, our CTO's previous experience building DevOps at Zalora, the largest online fashion company in Southeast Asia, covers many of the topics discussed above, so we include it for illustration.
Zalora was facing challenges scaling its operations after rapidly growing to reach millions of customers. Before joining the Singapore Data Company, Christopher Forno was tasked with starting and leading DevOps at Zalora. Zalora's transformation from traditional systems administration to DevOps brought it stability, improved its testing of new features, and positioned it for further growth.