{"id":5482,"date":"2023-04-05T22:19:22","date_gmt":"2023-04-05T12:19:22","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5482"},"modified":"2024-01-01T12:27:17","modified_gmt":"2024-01-01T02:27:17","slug":"inside-disneys-site-reliability-engineering-practice","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/inside-disneys-site-reliability-engineering-practice\/","title":{"rendered":"Inside Disney’s Site Reliability Engineering practice"},"content":{"rendered":"

Introduction<\/strong><\/h2>\n\n\n

It is no small feat to run an ecosystem of entertainment experiences to delight a wide range of people, from young children to older “Disney adults”.<\/p>\n\n\n\n

Almost every Disney experience relies on a sophisticated technology stack working in the background.<\/p>\n\n\n\n

\n

“Steve Jobs once said technology amplifies human ability. At Disney, we use technology to create digital experiences that bring magic to people all around the world.” \u2014 Jason Cox, Director of Global SRE at Disney<\/p>\n<\/blockquote>\n\n\n\n

Disney\u2019s SRE teams have ensured that the magic keeps happening, even as experiences and their underlying technology become more and more complex.<\/p>\n\n\n

History of SRE at Disney<\/strong><\/h2>\n\n\n

Jason Cox has been the Director of Global SRE at Disney since 2011. <\/p>\n\n\n\n

He has helped Disney remain a global leader in the entertainment industry by keeping the company up to date as technology has become a critical driver of entertainment.<\/p>\n\n\n\n

His promotion to Systems Reliability Engineering czar at the company came with its share of challenges.<\/p>\n\n\n\n

\ud83d\udccc For the record, SRE at Disney is denoted as Systems Reliability Engineering<\/em>.<\/p>\n\n\n\n

He had already been at the company for several years, working in the operations team of Disney\u2019s Internet group. The biggest hurdle to effectiveness that he saw was that his team and every other group operated in silos.<\/p>\n\n\n\n

The silos weren\u2019t only between departments such as technology and product. Disney had four large divisions, each with its own autonomous CTO: Studios, Consumer Products and Interactive, Parks and Resorts, and Media Networks.<\/p>\n\n\n\n

While this allowed each division to operate independently, it also resulted in institutionalized Shadow IT. Every division had its own way of doing things, even for very trivial work.<\/strong><\/p>\n\n\n\n

However, rapid growth necessitated cross-company changes, so a DevOps transformation was undertaken. Jason and his team applied DevOps principles to each part of the business, not just technically but culturally, by breaking down silos and scaling DevOps.<\/p>\n\n\n\n

The goal was to improve communication and collaboration, getting technologists from various disciplines to work together and embrace new ideas.<\/p>\n\n\n\n

The following issues were top of mind for the SRE group:<\/p>\n\n\n\n

    \n
  • As Disney\u2019s businesses expanded digitally, the <\/strong>workload and firefighting increased. Development teams faced immense increases in workload as their server counts jumped from 10s of servers to 100s and 1000s.<\/li>\n\n\n\n
  • Bureaucracy and manual processes <\/strong>(such as tickets for even the smallest of requests) slowed down engineering teams\u2019 ability to respond to business needs and customer demands for rapid iteration. For example, cloud accounts took weeks to provision rather than minutes.<\/li>\n\n\n\n
  • Production systems were suffering from low reliability, security, resiliency, and quality<\/li>\n\n\n\n
  • Agile work practices were increasing the velocity of changes reaching production, which in turn strained operations because of the scale and speed of those changes to production systems<\/li>\n\n\n\n
  • Engineers were burning out due to high cognitive load trying to perform operational heroics during extended periods of firefighting and<\/em> having no time to improve on the work – they only had time to react and move on to the next problem<\/li>\n<\/ul>\n\n\n\n

    Disney\u2019s servers were nicknamed after Snow White’s dwarves (Grumpy, Sleepy, and Dopey) to reflect server behaviors, but this amusing habit revealed a more significant issue: the difficulty of staying on top of server configurations.<\/p>\n\n\n\n

    \ud83d\ude05 Servers began to take on the personality they were named after<\/strong><\/p>\n\n\n\n

      \n
    • Grumpy regularly showed errors<\/li>\n\n\n\n
    • Sleepy kept suffering from high latency<\/li>\n\n\n\n
    • Bashful would disappear from the network for days<\/li>\n<\/ul>\n\n\n\n

      To address these issues, the newly minted SRE teams worked to create a more centralized IT infrastructure that would streamline operations across all divisions.<\/p>\n\n\n\n

      They championed several initiatives including:<\/p>\n\n\n\n

        \n
      • the implementation of company-wide systems of operation \u2014 like self-service portals \u2014 that would allow different departments to work more efficiently<\/li>\n\n\n\n
      • adoption of newer technologies such as cloud computing and virtualization, which have allowed Disney to scale its operations more effectively<\/li>\n\n\n\n
      • building infrastructure-as-code (IaC) and coupling it with the application code while leveraging technology such as containerization and serverless architecture<\/li>\n\n\n\n
      • shifting focus to reliability with the aim of delivering more reliable applications and experiences through platform abstraction<\/li>\n<\/ul>\n\n\n\n
        \n

        “Our DevOps transformation at Disney focused on technology, leadership, and community. Technology is crucial because it amplifies human ability.” \u2014 Jason Cox, Director, Global SRE at Disney<\/p>\n<\/blockquote>\n\n\n\n

        With Jason’s timely vision and his team\u2019s hard work, Disney has been able to stay ahead of the competition and remain a leader in the entertainment industry.<\/p>\n\n\n

        How SRE has helped Disney\u2019s tech operations<\/strong><\/h2>\n\n\n

        The Systems Reliability Engineering (SRE) team at Disney revolutionized its technology landscape by emphasizing the importance of core DevOps processes.<\/p>\n\n\n\n

        By incorporating best practices such as continuous integration and delivery, automated testing, and monitoring<\/strong>, the team was able to improve the efficiency and reliability of various systems and applications.<\/p>\n\n\n\n

        The team also worked closely with cross-functional teams to identify and address key pain points, resulting in a more streamlined and effective workflow.<\/p>\n\n\n\n

        As a result of these efforts, the Disney SRE team was able to significantly enhance the overall performance and scalability of Disney’s technology infrastructure, leading to improved customer experiences.<\/p>\n\n\n\n

        In particular, they addressed 2 key challenges\u2026<\/p>\n\n\n

        Disney’s challenge: poor visibility across systems<\/strong><\/h3>\n\n\n

        With operations spread out across multiple locations and environments, it was crucial for Disney SREs to have a way to track and analyze data from all of their systems in one place<\/strong>. To achieve this, they invested in sophisticated technology and trained their staff to use it effectively.<\/p>\n\n\n\n

        They recognized the importance of keeping a close eye on their operations, so they sought to implement systems that could provide them with comprehensive monitoring and observability capabilities.<\/p>\n\n\n\n

        They also established clear protocols for identifying and addressing issues, as well as for reporting on progress and performance<\/strong>.<\/p>\n\n\n\n

        By taking these steps, Disney was able to ensure that their systems were running smoothly and efficiently, which in turn allowed them to provide the high-quality experiences that their customers have come to expect.<\/p>\n\n\n

        \u21aa\ufe0f Solution: comprehensive observability<\/strong><\/h3>\n\n\n

        Disney employs a variety of methods and technologies to ensure the effectiveness of their monitoring and observability processes. In addition to Splunk, which they use for log analysis, they also utilize Grafana for metrics visualization and PagerDuty for incident management.<\/p>\n\n\n\n

        Disney’s use of Splunk allows them to efficiently analyze logs<\/strong>, which helps to identify potential problems and expedite the process of resolving them.<\/p>\n\n\n\n

        By using Grafana to visualize metrics<\/strong>, Disney can gain a better understanding of their systems’ performance and proactively address any issues that arise.<\/p>\n\n\n\n

        Furthermore, PagerDuty’s incident management capabilities ensure that the appropriate teams are notified in real-time<\/strong> when any critical events occur.<\/p>\n\n\n\n

        In summary, Disney’s strategic use of these tools and techniques enables them to maintain a highly effective monitoring and observability system, which is essential to ensuring the efficient operation of their complex technological infrastructure.<\/p>\n\n\n\n\n\n


        \n\n\n

        Disney’s challenge: driving consistent reliability at scale<\/strong><\/h3>\n\n\n

        Disney faced a significant challenge in achieving consistent reliability at scale across all their environments and locations. This was especially challenging due to the sheer size of the organization and the diverse range of locations where they operate.<\/p>\n\n\n

        \u21aa\ufe0f Solution 1: deploy configuration management<\/strong><\/h3>\n\n\n

        To address this challenge, Disney turned to configuration management tools such as Puppet and Chef. These tools have been crucial in helping Disney achieve consistent infrastructure across their numerous environments and locations.<\/p>\n\n\n\n

        In fact, with the help of Puppet and Chef, the average time to deploy a new environment was reduced from 2 weeks to just 2 hours.<\/p>\n\n\n\n

        By having a centralized system for managing configurations<\/strong>, Disney is able to ensure that all of its systems are running smoothly and are up-to-date.<\/p>\n\n\n\n

        Let\u2019s go through 2 examples where configuration management has made a positive impact:<\/p>\n\n\n\n

        Example 1 of configuration management<\/strong><\/p>\n\n\n\n

        Before implementing configuration management, employees would spend eight hours each night manually updating the 100 servers<\/strong> involved in the “Toy Story Mania” attraction. Now, a single person can update the entire fleet in just 30 minutes<\/strong>!<\/p>\n\n\n\n

        By enforcing configuration and converging every system toward the same desired state, configuration management has also helped reduce system drift for Disney. Each system is therefore more consistent and behaves more predictably, leading to improved operations and better results.<\/p>\n\n\n\n
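As a rough illustration of what enforcing and converging configuration means, here is a minimal Python sketch. It is not how Puppet or Chef work internally, and it is not Disney code: the host names, configuration keys, values, and function names are all made up for the example. The idea is simply to compare each host against one desired state, report the differences (the drift), and correct them.

```python
# Hypothetical sketch: detect configuration drift and converge hosts
# toward a single desired state. Every name and value is illustrative.

DESIRED = {'ntp_server': 'time.example.internal', 'app_version': '2.4.1'}


def find_drift(actual, desired):
    '''Return {key: (actual_value, desired_value)} for every drifted key.'''
    return {
        key: (actual.get(key), want)
        for key, want in desired.items()
        if actual.get(key) != want
    }


def converge(actual, desired):
    '''Return a copy of the host config with every drifted key corrected.'''
    fixed = dict(actual)
    for key in find_drift(actual, desired):
        fixed[key] = desired[key]
    return fixed


# Two made-up hosts: one has drifted to an older app version.
fleet = {
    'grumpy': {'ntp_server': 'time.example.internal', 'app_version': '2.3.0'},
    'sleepy': {'ntp_server': 'time.example.internal', 'app_version': '2.4.1'},
}

drift_before = {host: find_drift(cfg, DESIRED) for host, cfg in fleet.items()}
fleet = {host: converge(cfg, DESIRED) for host, cfg in fleet.items()}
drift_after = {host: find_drift(cfg, DESIRED) for host, cfg in fleet.items()}
```

Running the converge step a second time changes nothing, which is the property that lets a configuration management run be repeated safely across a whole fleet.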

        Example 2 of configuration management<\/strong><\/p>\n\n\n\n

        Disney was also able to ensure consistency across the 220 stores they have across the U.S., each with multiple point-of-sale devices. By converging these devices through configuration management, employees could easily verify that everything was working as intended.<\/p>\n\n\n\n

        \u2192 This is important because it allows the stores to provide a consistent experience for customers, regardless of which store they visit.<\/p>\n\n\n\n

        In addition, configuration management helps to ensure that employees are able to spend more time helping customers and less time troubleshooting technical issues. By streamlining its technical infrastructure, Disney has also been able to reduce costs associated with maintenance and support.<\/p>\n\n\n\n

        \u2192 This has allowed them to invest more in other areas of their business, such as marketing and product development.<\/p>\n\n\n

        \u21aa\ufe0f Solution 2: bespoke automation tools<\/h3>\n\n\n

        In addition to using open-source tools, Disney SREs also developed their own internal configuration management tool called the “Disney Deployment Framework.”<\/p>\n\n\n\n

        This framework allows them to automate the application deployment process and ensure consistency across different environments<\/strong>. By having a tailored solution that fits the unique needs of the organization, Disney is able to achieve even greater levels of reliability and consistency.<\/p>\n\n\n\n

        Disney SREs also emphasize the importance of testing configuration management code rigorously. They have developed a tool called “Simba” that allows them to test changes to infrastructure code before deploying it to production<\/strong>. By doing so, they can catch any issues before they cause problems for the business.<\/p>\n\n\n\n

        The impact of configuration management has been \u201ctruly magical\u201d for Disney. By streamlining their IT processes and ensuring that everything is running smoothly, they are able to focus on delivering an exceptional experience to their customers.<\/p>\n\n\n

        What is Disney\u2019s SRE culture like?<\/strong><\/h2>\n\n

        The 3 C\u2019s value system<\/h3>\n\n\n

        The value system at Disney consists of four goals: better, faster, safer, happier.<\/p>\n\n\n\n

        \u201cHow do we go for higher quality? That is taking it to the next level of quality.\u201d<\/p>\n\n\n\n

        \u201cGo faster. Got to get it to market faster.\u201d<\/p>\n\n\n\n

        Disney\u2019s SRE team culture takes inspiration from this for three values of its own, the 3 C\u2019s: collaboration, curiosity, and courage.<\/p>\n\n\n\n

        \"\"<\/figure>\n\n\n\n
        <\/div>\n\n\n\n

        This value set has allowed operations-facing engineers at Disney to:<\/p>\n\n\n\n

          \n
        • become less transactional and more integrated with the work<\/li>\n\n\n\n
        • do less manual work and drive more self-service and automation<\/li>\n<\/ul>\n\n\n

          \u201cLet\u2019s fix the job title\u201d<\/h3>\n\n\n

          Until 2017, Disney\u2019s operations engineers \/ SREs were called \u201cSystems operators\u201d. They changed the naming to \u201cSystems Engineers\u201d to reflect that they weren\u2019t just the people \u201coperating the train\u201d but also those who designed the train, built the train, the tracks, and the bridges.<\/p>\n\n\n\n

          Disney\u2019s SREs began their journey at the company by working with other teams to espouse the above ethos \u2014 that they were integrated with the value chain and not outside of it.<\/p>\n\n\n\n

          \n

          We began to engineer our future, and as part of that, we became embedded with the teams that we’re supporting: the product teams and the business teams. \u2014 Jason Cox<\/p>\n<\/blockquote>\n\n\n

          Toward a generative culture<\/h3>\n\n\n

          SRE leadership has eschewed the traditional management mantra of fear, power orientation, command-and-control, and bureaucracy. They aim to lead with a generative culture that empowers knowledge workers to achieve their edge<\/em>.<\/strong><\/p>\n\n\n\n

          At the same time, the SRE teams have a distinctive way of keeping engineers from becoming arrogant or seeing themselves as A-player rockstars above everyone else.<\/p>\n\n\n

          Fostering a service mindset<\/h3>\n\n\n

          \u201cMy team sits at the bottom<\/em> of the corporate hierarchy\u201d \u2014 at least that\u2019s the mindset that Jason Cox has instilled in his SRE teams. They are there to be of service to the business stakeholders and the technologists on the ground.<\/p>\n\n\n\n

          In Jason\u2019s words, \u201cMy goal is to say, \u2018How can I help?\u2019 So as I go into each one of these segments, I say, \u2018I’m with corporate. I’m here to help.\u2019”<\/p>\n\n\n\n

          People on the ground are still apprehensive when he and his team approach them, especially with the \u201cI\u2019m with corporate\u201d phrase being part of their approach.<\/p>\n\n\n\n

          He adds that they work and continually communicate to make sure that people and especially engineers on the ground don\u2019t see them as an imposing force, \u201chere to take away your fun\u201d.<\/p>\n\n\n

          Developing T-shaped skillsets<\/h3>\n\n\n

          SRE teams acted as tech evangelists who championed the positive changes mentioned earlier across the organization, and they augmented product teams with their T-shaped skills.<\/p>\n\n\n\n

          T-shaped skills refer to the concept that SREs should possess a broad range of skills and knowledge across different fields, while also having deep expertise in a particular area.<\/p>\n\n\n\n

          For instance, a Site Reliability Engineer may have experience in software development, operations, project management, security, and more. This can be a very thin layer of knowledge in one or several of these areas, just enough to be empathetic and to understand when to lean in and help.<\/p>\n\n\n\n

          However, they should also have a deeper understanding of one area, such as cloud computing, security, networks, or automation. By having this broad and deep skill set simultaneously, SREs can better understand and empathize with their colleagues in different disciplines.<\/p>\n\n\n

          A culture of continuous learning<\/h3>\n\n\n

          T-shaped skills have in turn helped Disney\u2019s SREs foster more effective collaboration and problem-solving. Furthermore, possessing a diverse range of skills and knowledge can help SREs identify and address issues that may arise in a complex system.<\/p>\n\n\n\n

          To continue developing their skills, Disney ensures its SREs participate in a community of practice, nicknamed Jedi Engineering Training (JET).<\/p>\n\n\n\n

          This community has two distinct benefits:<\/p>\n\n\n\n

            \n
          1. New technologies are promoted while a community of people is fostered around discussing technologies and collaborative problem-solving.<\/li>\n\n\n\n
          2. External and internal experts visit to speak about their current projects. The community uses these insights to work through problems, adopt innovations, and connect with each other.<\/li>\n<\/ol>\n\n\n

            Parting words<\/strong><\/h2>\n\n\n

            As you have read, Disney\u2019s Systems Reliability Engineering (SRE) practice has revolutionized the company\u2019s technology landscape by emphasizing core DevOps processes such as continuous integration and delivery, automated testing, and observability.<\/p>\n\n\n\n

            They have also worked to create a more centralized IT infrastructure that would streamline operations across all divisions.<\/p>\n\n\n\n

            Disney’s SRE culture consists of collaboration, curiosity, and courage. They aim to lead toward a generative culture that empowers knowledge workers to achieve their edge.<\/p>\n\n\n\n

            What Walt Disney himself said about Disney\u2019s secret rings true for how a successful Site Reliability Engineering team can operate:<\/p>\n\n\n\n

            \u201cThere\u2019s really no secret about our approach. We keep opening new doors and doing new things \u2014 because we\u2019re curious. And curiosity keeps leading us down new paths.\u201d<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"

            Introduction It is no small feat to run an ecosystem of entertainment experiences to delight a wide range of people, from young children to older “Disney adults”. Almost every Disney experience relies on a sophisticated technology stack working in the background. “Steve Jobs once said technology amplifies human ability. At Disney, we use technology to […]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[60,8],"tags":[49,78],"_links":{"self":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5482"}],"collection":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/comments?post=5482"}],"version-history":[{"count":6,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5482\/revisions"}],"predecessor-version":[{"id":5751,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5482\/revisions\/5751"}],"wp:attachment":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/media?parent=5482"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/categories?post=5482"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/tags?post=5482"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}