{"id":5482,"date":"2023-04-05T22:19:22","date_gmt":"2023-04-05T12:19:22","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5482"},"modified":"2024-01-01T12:27:17","modified_gmt":"2024-01-01T02:27:17","slug":"inside-disneys-site-reliability-engineering-practice","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/inside-disneys-site-reliability-engineering-practice\/","title":{"rendered":"Inside Disney’s Site Reliability Engineering practice"},"content":{"rendered":"
It is no small feat to run an ecosystem of entertainment experiences to delight a wide range of people, from young children to older “Disney adults”.<\/p>\n\n\n\n
Almost every Disney experience relies on a sophisticated technology stack working in the background.<\/p>\n\n\n\n
\n“Steve Jobs once said technology amplifies human ability. At Disney, we use technology to create digital experiences that bring magic to people all around the world.” \u2014 Jason Cox, Director of Global SRE at Disney<\/p>\n<\/blockquote>\n\n\n\n
Disney\u2019s SRE teams have ensured that the magic keeps happening, even as experiences and their underlying technology become more and more complex.<\/p>\n\n\n
History of SRE at Disney<\/strong><\/h2>\n\n\n
Jason Cox has been the Director of Global SRE at Disney since 2011. <\/p>\n\n\n\n
He has helped Disney remain a global leader in the entertainment industry by keeping the company up to date as technology has become a critical driver of entertainment.<\/p>\n\n\n\n
His promotion to Systems Reliability Engineering czar at the company came with its own challenges.<\/p>\n\n\n\n
\ud83d\udccc For the record, SRE at Disney stands for Systems Reliability Engineering<\/em>.<\/p>\n\n\n\n
He had already been at the company for several years, working in the operations team of Disney\u2019s Internet group. The biggest hurdle to effectiveness he saw was that his team and every other group operated in silos.<\/p>\n\n\n\n
The silos were not limited to departments such as technology and product. Disney had four large divisions, each with its own autonomous CTO: Studios, Consumer Products and Interactive, Parks and Support, and Media Networks.<\/p>\n\n\n\n
While this allowed each division to operate independently, it also resulted in institutionalized Shadow IT. Every division had its own way of doing things, even for very trivial work.<\/strong><\/p>\n\n\n\n
However, rapid growth necessitated cross-company changes, so a DevOps transformation was undertaken. Jason and his team applied DevOps principles to each part of the business, not just technically but culturally, by breaking down silos and scaling DevOps.<\/p>\n\n\n\n
The goal was to improve communication and collaboration, getting technologists from various disciplines to work together and embrace new ideas.<\/p>\n\n\n\n
The following issues were top of mind for the SRE group:<\/p>\n\n\n\n
\n
- As Disney\u2019s businesses expanded digitally, the <\/strong>workload and firefighting increased. Development teams faced immense increases in workload as their server counts jumped from tens of servers to hundreds and thousands.<\/li>\n\n\n\n
- Bureaucracy and manual processes <\/strong>\u2014 like tickets for even the smallest of requests \u2014 slowed down engineering teams\u2019 ability to deal with business needs and customer demands for rapid iteration. For example, cloud accounts took weeks to provision rather than minutes.<\/li>\n\n\n\n
- Production systems were suffering from low reliability, security, resiliency, and quality<\/li>\n\n\n\n
- Agile work practices were increasing the rate at which development work reached production, which in turn strained operations because of the scale and speed of changes to production systems<\/li>\n\n\n\n
- Engineers were burning out due to high cognitive load trying to perform operational heroics during extended periods of firefighting and<\/em> having no time to improve on the work – they only had time to react and move on to the next problem<\/li>\n<\/ul>\n\n\n\n
Disney\u2019s servers were nicknamed after Snow White’s dwarfs, such as Grumpy, Sleepy, and Dopey, to reflect how each server behaved. The amusing names revealed a more significant issue: the difficulty of staying on top of server configurations.<\/p>\n\n\n\n
\ud83d\ude05 Servers began to take on the personality they were named after<\/strong><\/p>\n\n\n\n
\n
- Grumpy regularly showed errors<\/li>\n\n\n\n
- Sleepy kept suffering from high latency<\/li>\n\n\n\n
- Bashful would disappear from the network for days<\/li>\n<\/ul>\n\n\n\n
To address these issues, the newly minted SRE teams worked to create a more centralized IT infrastructure that would streamline operations across all divisions.<\/p>\n\n\n\n
They championed several initiatives including:<\/p>\n\n\n\n
\n
- the implementation of company-wide systems of operation \u2014 like self-service portals \u2014 that would allow different departments to work more efficiently<\/li>\n\n\n\n
- adoption of newer technologies such as cloud computing and virtualization, which have allowed Disney to scale its operations more effectively<\/li>\n\n\n\n
- building infrastructure-as-code (IaC) and coupling it with the application code while leveraging technology such as containerization and serverless architecture<\/li>\n\n\n\n
- shifting focus to reliability with the aim of delivering more reliable applications and experiences through platform abstraction<\/li>\n<\/ul>\n\n\n\n
\n“Our DevOps transformation at Disney focused on technology, leadership, and community. Technology is crucial because it amplifies human ability.” \u2014 Jason Cox, Director, Global SRE at Disney<\/p>\n<\/blockquote>\n\n\n\n
With Jason’s timely vision and his team\u2019s hard work, Disney has been able to stay ahead of the competition and remain a leader in the entertainment industry.<\/p>\n\n\n
How SRE has helped Disney\u2019s tech operations<\/strong><\/h2>\n\n\n
The Systems Reliability Engineering (SRE) team at Disney revolutionized its technology landscape by emphasizing the importance of core DevOps processes.<\/p>\n\n\n\n
By incorporating best practices such as continuous integration and delivery, automated testing, and monitoring<\/strong>, the team was able to improve the efficiency and reliability of various systems and applications.<\/p>\n\n\n\n
The team also worked closely with cross-functional teams to identify and address key pain points, resulting in a more streamlined and effective workflow.<\/p>\n\n\n\n
As a result of these efforts, the Disney SRE team was able to significantly enhance the overall performance and scalability of Disney’s technology infrastructure, leading to improved customer experiences.<\/p>\n\n\n\n
In particular, they addressed 2 key challenges\u2026<\/p>\n\n\n
Disney’s challenge: poor visibility across systems<\/strong><\/h3>\n\n\n
With operations spread out across multiple locations and environments, it was crucial for Disney SREs to have a way to track and analyze data from all of their systems in one place<\/strong>. To achieve this, they invested in sophisticated technology and trained their staff to use it effectively.<\/p>\n\n\n\n
They recognized the importance of keeping a close eye on their operations, so they sought to implement systems that could provide them with comprehensive monitoring and observability capabilities.<\/p>\n\n\n\n
They also established clear protocols for identifying and addressing issues, as well as for reporting on progress and performance<\/strong>.<\/p>\n\n\n\n
By taking these steps, Disney was able to ensure that their systems were running smoothly and efficiently, which in turn allowed them to provide the high-quality experiences that their customers have come to expect.<\/p>\n\n\n
\u21aa\ufe0f Solution: comprehensive observability<\/strong><\/h3>\n\n\n
Disney employs a variety of methods and technologies to ensure the effectiveness of their monitoring and observability processes. In addition to Splunk, which they use for log analysis, they also utilize Grafana for metrics visualization and PagerDuty for incident management.<\/p>\n\n\n\n
Disney’s use of Splunk allows them to efficiently analyze logs<\/strong>, which helps to identify potential problems and expedite the process of resolving them.<\/p>\n\n\n\n
By using Grafana to visualize metrics<\/strong>, Disney can gain a better understanding of their systems’ performance and proactively address any issues that arise.<\/p>\n\n\n\n
Furthermore, PagerDuty’s incident management capabilities ensure that the appropriate teams are notified in real-time<\/strong> when any critical events occur.<\/p>\n\n\n\n
In summary, Disney’s strategic use of these tools and techniques enables them to maintain a highly effective monitoring and observability system, which is essential to ensuring the efficient operation of their complex technological infrastructure.<\/p>\n\n\n\n\n\n
\n\n\nDisney’s challenge: driving consistent reliability at scale<\/strong><\/h3>\n\n\n
Disney faced a significant challenge in achieving consistent reliability at scale across all their environments and locations. This was especially challenging due to the sheer size of the organization and the diverse range of locations where they operate.<\/p>\n\n\n
\u21aa\ufe0f Solution 1: deploy configuration management<\/strong><\/h3>\n\n\n
To address this challenge, Disney turned to the use of configuration management tools such as Puppet and Chef. These tools have been crucial in helping Disney achieve consistent infrastructure across their numerous environments and locations.<\/p>\n\n\n\n
In fact, with the help of Puppet and Chef, the average time to deploy a new environment was reduced from 2 weeks to just 2 hours.<\/p>\n\n\n\n
By having a centralized system for managing configurations<\/strong>, Disney is able to ensure that all of its systems are running smoothly and are up-to-date.<\/p>\n\n\n\n
Let\u2019s go through 2 examples where configuration management has made a positive impact:<\/p>\n\n\n\n
Example 1 of configuration management<\/strong><\/p>\n\n\n\n
Before implementing CM, employees would spend eight hours each night manually updating the 100 servers<\/strong> involved in the “Toy Story Mania” attraction. But now, thanks to configuration management, a single person can update the entire fleet in just 30 minutes<\/strong>!<\/p>\n\n\n\n
By enforcing configuration and converging each system toward the same desired state, configuration management has also helped reduce system drift for Disney. This means that each system is more consistent and performs at a higher level, leading to improved operations and better results.<\/p>\n\n\n\n
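To make the idea of drift and convergence concrete, here is a minimal Python sketch. It is not how Puppet, Chef, or any Disney tooling works internally. The setting names, host names, and values are all hypothetical, and real tools manage files, packages, and services rather than a dictionary.

```python
# Hypothetical desired state for one server role. The keys and values are
# illustrative only and are not taken from any Disney system.
DESIRED = {
    'ntp_server': 'time.example.internal',
    'app_version': '2.4.1',
    'log_level': 'info',
}


def find_drift(actual, desired):
    '''Return {key: (actual_value, desired_value)} for each setting that differs.'''
    return {
        key: (actual.get(key), want)
        for key, want in desired.items()
        if actual.get(key) != want
    }


def converge(actual, desired):
    '''Return a copy of actual with every managed setting reset to its desired value.'''
    converged = dict(actual)
    converged.update(desired)
    return converged


# Two hypothetical servers that have each drifted in a different way.
fleet = {
    'grumpy': {'ntp_server': 'time.example.internal', 'app_version': '2.3.0', 'log_level': 'info'},
    'sleepy': {'ntp_server': 'time.example.internal', 'app_version': '2.4.1', 'log_level': 'debug'},
}

for host, state in fleet.items():
    drift = find_drift(state, DESIRED)
    fleet[host] = converge(state, DESIRED)
    print(host, 'drifted settings:', sorted(drift))
```

Running the loop a second time would report no drifted settings, because converging a system that already matches the desired state changes nothing. That idempotence is what makes it safe to enforce the same configuration repeatedly across a whole fleet.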
Example 2 of configuration management<\/strong><\/p>\n\n\n\n
Disney was also able to ensure consistency across the 220 stores they have across the U.S., each with multiple point-of-sale devices. By converging these devices through configuration management, employees could easily verify that everything was working as intended.<\/p>\n\n\n\n
\u2192 This is important because it allows the stores to provide a consistent experience for customers, regardless of which store they visit.<\/p>\n\n\n\n
In addition, configuration management helps to ensure that employees are able to spend more time helping customers and less time troubleshooting technical issues. By streamlining its technical infrastructure, Disney has also been able to reduce costs associated with maintenance and support.<\/p>\n\n\n\n
\u2192 This has allowed them to invest more in other areas of their business, such as marketing and product development.<\/p>\n\n\n
\u21aa\ufe0f Solution 2: bespoke automation tools<\/h3>\n\n\n
In addition to using open-source tools, Disney SREs also developed their own internal configuration management tool called the “Disney Deployment Framework.”<\/p>\n\n\n\n
This framework allows them to automate the application deployment process and ensure consistency across different environments<\/strong>. By having a tailored solution that fits the unique needs of the organization, Disney is able to achieve even greater levels of reliability and consistency.<\/p>\n\n\n\n
Disney SREs also emphasize the importance of testing configuration management code rigorously. They have developed a tool called “Simba” that allows them to test changes to infrastructure code before deploying it to production<\/strong>. By doing so, they can catch any issues before they cause problems for the business.<\/p>\n\n\n\n
The impact of configuration management has been \u201ctruly magical\u201d for Disney. By streamlining their IT processes and ensuring that everything is running smoothly, they are able to focus on delivering an exceptional experience to their customers.<\/p>\n\n\n
What is Disney\u2019s SRE culture like?<\/strong><\/h2>\n\n
The 3 C\u2019s value system<\/h3>\n\n\n
The value system at Disney consists of four goals: better, faster, safer, and happier.<\/p>\n\n\n\n
\u201cHow do we go for higher quality? That is taking it to the next level of quality.\u201d<\/p>\n\n\n\n
\u201cGo faster. Got to get it to market faster.\u201d<\/p>\n\n\n\n
Disney\u2019s SRE team culture takes inspiration from this for three values of its own \u2014 the 3 C\u2019s:<\/p>\n\n\n\n