Kyle Forster:<\/strong> I would say one of our core observations came from looking at the way 10x SREs think about troubleshooting.<\/p>\n\n\n\n<\/p>\n\n\n\n
So many of the, let’s just bring it back to like looking at dashboards and then running command line, running investigations in the command line, folks that were real 10xers were so facile in those basic tools that they could really think at a much higher level about like, “Okay, I’m going to form a hypothesis about this area of the system.”<\/p>\n\n\n\n
“I’m going to form these two or three tests about this hypothesis.” <\/p>\n\n\n\n
“I’m going to confirm or deny.” Their ability to think very abstractly about troubleshooting was amazing. And then there were folks that were like, “Oh, wait, is it dash F or dash E on a command line flag?” People who are getting up to speed, people who are using new tools.<\/p>\n\n\n\n
You’re just using so much of your brain to think about, alright, which command line flag am I supposed to be using, that you lose the thread of like, hey, here’s the hypothesis that I’m trying to prove or disprove. <\/p>\n\n\n\n
So I said, hey, if we could just automate a heck of a lot of this, like, here is the right troubleshooting command to use.<\/p>\n\n\n\n
And if somebody could just say, “Hey, here’s my hypothesis. Oh yeah, that command looks right. That command looks right.”<\/p>\n\n\n\n
We could actually free people up to do much, much, much better troubleshooting. Now, along the way, we came up with a few things. Well, we can’t just generate these troubleshooting commands out of thin air.<\/p>\n\n\n\n
And it’s a real pain in the neck for an organization to try to enumerate all of them. I mean, this is back to early runbooks days. <\/p>\n\n\n\n
As soon as you write a runbook for your own organization, it’s stale almost instantly. But could we have a whole series of really, really, really good SREs write troubleshooting scripts, write troubleshooting commands as an open source community?<\/p>\n\n\n\n
And write commands that are relevant to, you know, at least in the Kubernetes world, there are about 800 open source packages that represent about 75% of the workloads running in the world.<\/p>\n\n\n\n
So if we can get an open source community saying, “Hey, here’s how you troubleshoot these a hundred open source packages.”<\/p>\n\n\n\n
Frankly, we’ve covered a gigantic amount of the Kubernetes universe. <\/p>\n\n\n\n
So if you use that as a starting point, when you say you want to build the largest repository of troubleshooting commands in the world, you can get to a fair degree of completeness for the Kubernetes ecosystem with tens of people and a year, two years.<\/p>\n\n\n\n
But it’s not like hundreds of people for a thousand years. So I think that we’re actually doing pretty well on that front. <\/p>\n\n\n\n
And then can you present something to SREs that says like, “Hey, given what you’re looking for, here’s ten ways to troubleshoot Postgres. Given what you’re looking for, here are five different ways to figure out what your storage utilization is on all of your volumes that underpin this set of Kubernetes namespaces.”<\/p>\n\n\n\n
Can we start getting some basic suggestions that really free people’s cognitive load up so that they’re not thinking, what was that command line flag? <\/p>\n\n\n\n
And they’re thinking, “I actually wonder if this is a storage utilization issue that’s actually sitting underneath here.”<\/p>\n\n\n\n
Let’s get the machine to do the basic thinking for me while I can think of the really, really high-level causes because I know my system really well.<\/p>\n\n\n\n
I think that kind of naturally evolved for us into saying, Hey, frankly, we’re successful enough with suggesting troubleshooting tasks in the Kubernetes ecosystem to SREs. <\/p>\n\n\n\n
Could we also become a tool that SREs actually give to their app developers? And then the real test is the thing that I like to think our newer releases are doing more and more.<\/p>\n\n\n\n
Can you take an app developer who has maybe deep understanding of their part of the system, of their particular part of their particular service or their subservice? <\/p>\n\n\n\n
Who really doesn’t know that much about the rest of the system? And can you help people who don’t really know the rest of the system do a pretty good job of basic troubleshooting?<\/p>\n\n\n\n
So can you help somebody troubleshoot something that they don’t really understand? <\/p>\n\n\n\n
Not to say they can do it completely, but if you get somebody pretty far along, that takes away a huge amount of the workload of an SRE team, the really, really high-toil work. You know, “Hey, the app developers said Kubernetes is down.”<\/p>\n\n\n\n
Kubernetes is not down. The app developers said that Postgres is down. The database is not down. It’s something else. Like we know the second that comes over the Slack channel. <\/p>\n\n\n\n
We know it’s something else. <\/p>\n\n\n\n
And we know that for somebody reasonably senior, three or four hours of their life just went away trying to show that it was actually the something else. That’s a very high-toil way to spend half a day.<\/p>\n\n\n\n
And we’ve kind of found a way to make it much, much more efficient for everybody, and kind of enjoyable.<\/p>\n\n\n\n
Ash Patel:<\/strong> And it fits really well into a you build it, you run it type model.<\/p>\n\n\n\nKyle Forster:<\/strong> Certainly we’ve seen a lot of organizations with you build it, you run it that are not getting any ROI at all. Just a huge amount of organizational churn.<\/p>\n\n\n\nAnd without us, what they’ve got is: you build it, you run it, something goes down, and then you wait for the SRE team to get online, and then you desperately beg them for help, and you get them to help run it for you. <\/p>\n\n\n\n
And we see organization after organization that did this, like, you build it, you run it, laid off a whole bunch of people, and now they’re just completely stuck and they’re getting no ROI out of this terrible change.<\/p>\n\n\n\n
Can we help them get, like, a little bit farther along so that we can help some of these organizations survive and get those SRE teams back on their feet and get them really doing high-value work that they want to be doing, high-value reliability architecture work, instead of, “Oh my god, it’s 6 am again, and the you build it, you run it team can’t figure out which Kubernetes namespace they’re supposed to be in.”<\/p>\n\n\n\n
Ash Patel:<\/strong> So I can see that there are two types of people who this can really help. The first is, obviously we’re talking about SREs here. <\/p>\n\n\n\nSo the SREs who can think a little bit more strategically about the system, helping them think more at a higher level, and then say, hey, I want to achieve this.<\/p>\n\n\n\n
But I don’t want to have to remember all these different things that are going to add to my cognitive load. <\/p>\n\n\n\n
Let’s have the machine actually give me suggestions of what I can punch into the CLI to actually get going with my Kubernetes clusters. That’s the first one I’m hearing. And the second one I’m hearing is…<\/p>\n\n\n\n
Kyle Forster:<\/strong> Hmm for myself, like, as a developer, I’m pretty mediocre in C. Like bottom 25% because when I have to manage my own memory, I am not good at it. <\/p>\n\n\n\nBut for me, I had a huge career unlock moment when I discovered Java and the machine would just manage memory for me and I could go a lot faster.<\/p>\n\n\n\n
And I just look at our team compared to other teams that I’ve worked on: the ability to work in Java or Golang or Python compared to working in low-level, machine-level C.<\/p>\n\n\n\n
The team goes so much faster because it’s just, you can think at an architectural level, you can design for performance instead of trying to eke out clock cycles for performance.<\/p>\n\n\n\n
You just make people into better developers, and I think we could do the exact same thing with SRE. Like, the exact same. <\/p>\n\n\n\n
And we’ll look at a lot of the older ways and say, oh wow, that’s kind of the equivalent of programming in assembly. Nobody does that anymore unless you really, really, really have to.<\/p>\n\n\n\n
Ash Patel:<\/strong> So we’ve talked a little bit about how your product can actually help SREs. <\/p>\n\n\n\nI want to dig in a little bit deeper into what your ideal customer would look like because you are a vendor and you are trying to help solve problems at a commercial level. <\/p>\n\n\n\n
So we’ve talked a little bit about the problems they’re facing.<\/p>\n\n\n\n
What is the typical… structure of the SRE teams or cloud computing teams that you’re usually dealing with? How is their work and organization structured in terms of your ideal customer?<\/p>\n\n\n\n
Kyle Forster:<\/strong> We’ve done really well, fundamentally, when the team is understaffed. Sometimes that’s because one or two senior people have recently left. <\/p>\n\n\n\nSometimes that’s because the organization did a shift left (“Wow, we shall shift left”) and then did an unfortunate layoff around that, and the shift left isn’t working. <\/p>\n\n\n\n
And now the team is just horrifically far behind on a reliability roadmap because all of their time gets spent helping app developers just get unblocked day to day.<\/p>\n\n\n\n
So we do really well with organizations that have kind of experienced almost what it’s like working at the higher level of Maslow’s pyramid here. <\/p>\n\n\n\n
Or if you go back to the SRE book, like, “Hey, here’s the functions of a senior SRE.” There’s some great chapters on that, and they’ve had a glimpse of that.<\/p>\n\n\n\n
And now, unfortunately, they’re back to operating at a really, really low level of, we’re spending so much time helping our app developers get unblocked that we’re just not able to help out in the architecture conversations. <\/p>\n\n\n\n
We’re just not able to do real optimization even just optimization of dashboards and alerts.<\/p>\n\n\n\n
We cannot free up any time at all to reduce our own toil. <\/p>\n\n\n\n
Because we’re just stuck in it, trying to make sure our app developers aren’t blocked, because every time they get blocked, you know, fire rains down on our heads. <\/p>\n\n\n\n
We tend to do really, really well with those types of organizations, because that’s where the pain is really, really bad.<\/p>\n\n\n\n
And we need to be able to say, “Hey, with one hour of your time, we can help.” None of the teams that we work with can afford a week to do a vendor implementation. <\/p>\n\n\n\n
They can’t even afford a day to do a POC. I mean, it has to be like an hour to do a POC and we have to show value that fast.<\/p>\n\n\n\n
So, we kind of work on these extraordinarily time-constrained teams, and that’s where I think that we do the best work.<\/p>\n\n\n\n
Ash Patel:<\/strong> So, we’ve talked about their pains. When they’ve used your product for a day, what would the win look like at the end of it? How would they feel?<\/p>\n\n\n\nKyle Forster:<\/strong> Our best moments come in two phases. The first phase is when somebody says, “Wow, you just turned my junior SRE colleague into a 10x’er.”<\/p>\n\n\n\nAnd we’ve seen that a few times. And that just makes me so happy because that person is going to get an awesome promotion. When you can kind of instantly take somebody who’s fairly new to the profession, fairly new to the system and, in our case, fairly new to Kubernetes, and you can suddenly get them ripping through it, everybody is happy, everybody wins. I love seeing that as the first stage, you know, of like, “In my organization there used to be a lot of people waiting around for me, and now all of a sudden they can help themselves.” <\/p>\n\n\n\n
Like, that’s an awesome moment. And the second moment that we look for is, hey, they actually just started giving our tool to people outside of the SRE team.<\/p>\n\n\n\n
And hey, all of a sudden, we have app developers in another time zone. They don’t wait for us in the morning anymore. <\/p>\n\n\n\n
Whenever there’s something that they can’t fix by themselves, like we show up and there’s a perfect triage report on our desk that shows 50 things that were tried, shows the two issues, shows a nice summary, shows us exactly what we need to do.<\/p>\n\n\n\n
I love seeing that moment of, “Oh wow, my junior people are vastly more productive and self-sufficient than they were.”<\/p>\n\n\n\n
“Now suddenly my app developers while they don’t have the full credentials, you know, we can’t give them everything, but you know, they’re suddenly much more self-sufficient than they were.”<\/p>\n\n\n\n
There’s this kind of like, aha, everybody wins moment, and that, for me is very, very, very satisfying.<\/p>\n\n\n\n
Ash Patel:<\/strong> I like that idea of enabling junior SREs to be 10xers, because that shows them that there’s potential for their career progression. <\/p>\n\n\n\nThey can start thinking about bigger and more complex things, because they don’t have to worry about memorizing How does this work?<\/p>\n\n\n\n
How does that work? And that’s a great feeling to have.<\/p>\n\n\n\n
So, it’s a relatively new space that you’re in. Where do you see your category heading in the next two to three years?<\/p>\n\n\n\n
Kyle Forster:<\/strong> If you look at monitoring and observability, broadly speaking, like dashboards and products that serve dashboards, there was somewhere around $6 billion spent last year on products that all feed metrics, or stuff built on top of metrics, into dashboards.<\/p>\n\n\n\nWe asked teams that we interviewed, “Hey, for your senior engineers, at least within Kubernetes, how often are things entirely resolved by your dashboards?” <\/p>\n\n\n\n
And the answer came back: 80% of the time, their senior engineers wind up back on the CLI. For non-Kubernetes environments, I suspect the number’s probably lower because these are mostly older, more mature operational environments.<\/p>\n\n\n\n
When there aren’t very many new services, when there aren’t very many new applications, when things aren’t changing very fast, my guess is that there’s a lot more that’s resolved on just the dashboards alone.<\/p>\n\n\n\n
But if an organization is doing a lot of engineering work, there is always a lot of stuff that’s not resolvable on the dashboards alone.<\/p>\n\n\n\n
And I think we have this kind of funny quirk in our industry right now that like, Okay, six billion dollars spent on dashboards, and anything that’s not there has to be done at the assembly language level. <\/p>\n\n\n\n
I think that there’s just a better industry structure here. I think if we can start by making the junior people into 10xers, we have a huge win. <\/p>\n\n\n\n
I think when we can make the app developers much more self-sufficient, then I think what we’ll see is a troubleshooting category that’s kind of roughly the same size as the observability category.<\/p>\n\n\n\n
Ash Patel:<\/strong> There isn’t a troubleshooting category right now. I would say it’s kind of a subset of incident management, but you want it to be its own category.<\/p>\n\n\n\nKyle Forster:<\/strong> I think it’ll become its own category because incident management, I feel like incident management plays a gigantic role, but it’s very much a production role, whereas troubleshooting, a huge amount of the troubleshooting that goes on is going on in dev and test environments in addition to production.<\/p>\n\n\n\nYou ideally want to troubleshoot prod the same way you troubleshoot dev and test. <\/p>\n\n\n\n
But the number of teams that we talked to who said, you know, “Hey, in theory, my job is prod, but if I look at my schedule day to day, wow, a heck of a lot of time gets spent in our test environment, just helping get people unblocked.”<\/p>\n\n\n\n
I think that’s where these categories are actually very, very, very separate. And the starting point, at least, that we’ve seen is totally separate. <\/p>\n\n\n\n
So I think we’ll wind up integrating with incident management more than folding into the category.<\/p>\n\n\n\n
Ash Patel:<\/strong> In terms of SRE as a broader field, rather than going into different areas like observability and sort of management and troubleshooting, where do you see SRE heading in the next few years?<\/p>\n\n\n\nKyle Forster:<\/strong> It’s a good question. I kind of think the profession is at a little bit of a crossroads.<\/p>\n\n\n\nMostly because of the huge downward economy over the last year and the push to shift left. And I think that we either find a way that SRE teams make shift left successful and use that, frankly, as an opportunity to leapfrog. <\/p>\n\n\n\n
Go from the basic fingers on keyboard of troubleshooting work to say that’s part of the job, but that’s not like 130% of my day.<\/p>\n\n\n\n
Instead, like, real reliability architecture and reliability optimizations are a big hunk of my day, or we’ll see, hey, organizations that did a lot of shift left and then the SRE teams just don’t provide enough value. <\/p>\n\n\n\n
They’re just sitting there like unblocking app developers, unblocking app developers, unblocking app developers, and there’ll be some new category that comes in for that kind of high end architecture.<\/p>\n\n\n\n
Ash Patel:<\/strong> I feel that SRE really is at a crossroads. It was showing a lot of growth potential during the pandemic, 2020, 2021, 2022. <\/p>\n\n\n\nIt’s lost steam this year. There’s a lot of dissatisfaction, a lot of disengagement from SREs that I’ve spoken with.<\/p>\n\n\n\n
I feel that can turn around. It’s just, we need to power through this. It’s probably looking at the role like you said, rather than just sitting in front of a keyboard and just trying to play around and figure out what’s going on. <\/p>\n\n\n\n
It’s going to be a lot of using assisted technologies to support you.<\/p>\n\n\n\n
But you’re actually doing a lot of think work outside of just typing commands on a keyboard. <\/p>\n\n\n\n
You’re actually probably getting up, like how UX people do, at a whiteboard and actually drawing out what’s happening in the system, where you think we should investigate, et cetera. <\/p>\n\n\n\n
And I feel that’s where they could move towards to feel less like reactive people and more proactive when they’re even dealing with incidents.<\/p>\n\n\n\n
Kyle Forster:<\/strong> I agree.<\/p>\n\n\n\nAsh Patel:<\/strong> What piece of advice would you give to SREs, ideally related to SRE work?<\/p>\n\n\n\nKyle Forster:<\/strong> Well, first, look, I’m a vendor here, so I’m going to start by pitching my own book. <\/p>\n\n\n\nI think that there is a certain, and I see it, I see it in every single organization, including my current one.<\/p>\n\n\n\n
There’s a certain like, “Oh, thank you. You just solved my problem. You just saved me.” <\/p>\n\n\n\n
This effusive thank you that comes from an app developer who you’ve just unblocked. Or from a junior SRE, whom you’ve just unblocked. And there’s this incredible feeling of like, I am the hero and I just did this.<\/p>\n\n\n\n
And it’s because of expertise through a bunch of hard-won scars, and it’s an incredibly gratifying feeling. <\/p>\n\n\n\n
And every single time somebody feels that way, the person on the other side said, “oh, thank goodness. Ash just got me unblocked, man. His team made a bunch of crappy decisions that got me blocked in the first place.”<\/p>\n\n\n\n
We have these incredibly perverse incentives. <\/p>\n\n\n\n
I really saw it, frankly, working on the Kubernetes platform: the sheer number of times when people would do this, amongst my clients, and I could see it happening. <\/p>\n\n\n\n
They’d say not to me but to their lead at Kubernetes say, “Oh, thank goodness you just got me unblocked. Oh, I hate Kubernetes, this stupid system just got me totally blocked in the first place.” <\/p>\n\n\n\n
So, I think that as an industry, we do need to sort of wrestle through that because it feels really, really, really good to unblock people, but you create this extraordinary anti-sentiment for your team and for the technology base that your team has chosen.<\/p>\n\n\n\n
And it’s really weird. And I think that figuring out a way of saying like, “Hey, that feels good to get somebody unblocked, but now structurally I need to design myself out of this loop,” is really, really necessary, especially as teams get small and teams just have to stick together. <\/p>\n\n\n\n
Otherwise, “Wow, Ash was awesome. Shame none of his colleagues could help me out. There must be a bunch of dummies just sitting around.”<\/p>\n\n\n\n
It’s just the reality of being human and it really does do terrible things for teams. <\/p>\n\n\n\n
So I think figuring out this deep misalignment of incentives is going to be a really important role for very, very senior SREs who have both the technical maturity and the organizational and emotional maturity to just kind of spot this when it’s happening and figure out a way to design around the anti-pattern.<\/p>\n\n\n\n
Ash Patel:<\/strong> In summary, stick together whatever happens. As SREs, we’ve got to stick together.<\/p>\n\n\n\nThank you, Kyle, for coming in and having a candid conversation about things affecting SRE, and in particular, how your solution can help them improve their working lives.<\/p>\n\n\n\n
Kyle Forster:<\/strong> Appreciate you having me on.<\/p>\n","protected":false},"excerpt":{"rendered":"Episode 10 [SREpath Podcast] Ash Patel interviews Kyle Forster of RunWhen about his perspective on AI and its usefulness in achieving reliability goals. RunWhen has developed a tool that uses visual cluster mapping and GenAI for troubleshooting Kubernetes problems. Its localhost version has hit over 1900 downloads in the 6 weeks since launch. Transcript Don’t want to […]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[48],"tags":[30,41,75],"_links":{"self":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5823"}],"collection":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/comments?post=5823"}],"version-history":[{"count":15,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5823\/revisions"}],"predecessor-version":[{"id":6060,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5823\/revisions\/6060"}],"wp:attachment":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/media?parent=5823"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/categories?post=5823"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/tags?post=5823"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}