#10 Using AI for Kubernetes troubleshooting self-service (with Kyle Forster)

Episode 10 [SREpath Podcast]

Ash Patel interviews Kyle Forster of RunWhen about his perspective on AI and its usefulness in achieving reliability goals.

RunWhen has developed a tool that uses visual cluster mapping and GenAI for troubleshooting Kubernetes problems. Its localhost version has hit over 1900 downloads in the 6 weeks since launch.

Transcript

Don’t want to listen to the episode right now? Read the conversation below.

Ash Patel: This is an interview episode of the SREpath Podcast.

Kyle Forster from RunWhen will join me to share his learnings from helping SREs and their teams.

Yes, Kyle is an SRE software vendor, but this is not a sponsored episode. Simply a quest to expand the body of knowledge of SRE.

Kyle, great to have you.

Kyle Forster: Thank you so much for having me on. I appreciate it.

Ash Patel: So, let’s get right into it. What do you do SRE-wise in terms of you and your company?

Kyle Forster: Sure. So our company, we build an expert community of SREs who are all contributing troubleshooting scripts.

We then build an AI layer on top, these digital assistants that our customers use to search through their troubleshooting scripts and to suggest to all of their engineers what to run and when.

That’s the name of the company.

Our goal is that it’s something that not only SREs use but really SREs would give to their app developers so that they’re not tied down every second trying to help troubleshoot something that went wrong in a dev environment or something that went wrong on a test.

Ash Patel: Interesting. But why SRE?

There are so many other spaces you could be in. You could work directly with platform people, DevOps. It seems to be all getting abstracted out. Why SRE people in particular?

Kyle Forster: We looked at who does the most troubleshooting, and at which of our users are the most interested in working with us.

Certainly quite a few use the platform engineering title. Certainly, some just use a DevOps title and a very large number use an SRE title.

What we found was a lot of this was org structure and team size dependent. But an awful lot of it came down to patterns that anybody would recognize across any engineering team.

When one of our app developers gets stuck, usually a couple of people try to help out, and then there’s kind of always one or two people in the organization that wind up just getting pulled into every single complicated troubleshooting session over and over and over and over.

That person’s time was so valuable that we kind of set out to say, “All right, how can we help that person get some of their time back to do strategic projects?”

Ash Patel: You’ve got an interesting personal history working with SREs and, in particular, with some of the tools I use a lot, especially Kubernetes.

Your work history is interesting because your job at Google was related to that space.

Do you want to talk about that a little bit?

Kyle Forster: So, you know, even before I was actually on the Kubernetes team at Google, I used to give out Niall Murphy’s book to all of our customers when I was at an earlier startup that I had founded.

We had a theme around networking and how the big hyperscale companies do networking and the SRE principles fit really well for networking professionals that wanted to kind of go a little bit further down the path in their own careers.

At Google, obviously sitting on the Kubernetes team, I was working with SRE leads from our various customers all day long.

They were major, major, major stakeholders in every Kubernetes deployment, and that’s kind of what gave me a lot of, call it, opinions on the profession: where it had been and where it was headed.

Right before we actually left to start RunWhen, I remember doing a series of LinkedIn searches, and at the time there were about 4,000 people on LinkedIn outside of Google and Facebook that were carrying SRE in their job title.

And I remember it was amazing. In the first year and a half, that number went from 4,000 up to something like 65,000.

I repeated the exact same search.

So you just see the huge explosion in use in the job title. It was just amazing. Absolutely, absolutely amazing.

But clearly, I think SRE in the enterprise means something different because you have a very different set of tools than you have in these hyperscale, much more homogenous environments.

The challenges are different and that means that the solutions are different.

And so I think that SRE as a profession really, really had to adapt and evolve a lot over the course of that growth in ways that I don’t think any of the original folks who coined the term at Google ever expected.

Ash Patel: SRE is interpreted in so many ways.

And in some ways, I’ve seen people interpret it as things that I would say engineers would not like.

What is one of the biggest anti-patterns you’ve seen in your years working with SREs?

Kyle Forster: I would say SRE in the Google form was not SRE without SLOs.

That was just an absolutely critical thing. But in many ways, the SLOs were not always external, customer-facing ones.

You would have many, many, many services internally that were internal facing and that each had their own SLOs to their users. It’s a little bit different from the way that a lot of enterprise production environments work.

You don’t always have the sense of, hey, we have many, many, many internal services and we have many of our own internal customers.

More often, at a different scale and in different industries, the org chart just looks a little bit different.

You also have a little bit less of the backdrop of Google, where there’s an intense culture around being data- and measurement-driven, but intense forgiveness for the individuals whenever the metrics on the thing that they’re driving happen to go very wonky.

In most, certainly not all, but in a lot of enterprise cultures, there’s much, much, much less forgiveness there. That showed in the sheer number of times I was involved, in our early days, in conversations asking, “Hey, should SLOs be tied to compensation?” At Google, I don’t think anybody would have ever considered it.

It would just never even be something that people thought about.

And I found I actually had to help multiple SRE leaders through tricky conversations where they’d say, “Hey, my fear is, as we measure this, that my own personal compensation is going to get tied to this and the compensation of all my team members is going to get tied to this. And I’m not entirely in control of these metrics,” which is a very scary thing.

So that I think is one really good example of an area where SRE and the Google mindset really, really, really had to adapt and evolve to make the profession relevant outside of Google.

Ash Patel: This is something that I keep saying to people: if you copy what’s in the Site Reliability Engineering book from 2016, that’s seven years old now.

And a lot of things have changed in the last seven years.

Kyle Forster: I find it’s hard to copy.

We found a series of teams that copied it successfully at scales of 50 SREs and above.

But there are a lot of patterns that start showing up when you have teams of that size. They tended to have full-time software development teams just building tools for the SRE teams.

But if you think about the number of, frankly, massive-scale financial services institutions and massive-scale retailers that can afford those types of SRE teams, they’re few and far between.

So as soon as you start talking about a more typical enterprise scale, three to ten people on the team, maybe we’re lucky if we have one full-time software developer who’s building tools.

You just have to adjust a lot of the thinking.

I think the book is actually fascinating, but it’s a book about how Google does it. I think Niall has actually done a whole series of great posts on this: that book is the way that Google does it.

There’s a lot that you adapt, actually, as you think about how your organization has to do it.

You have to do a lot of deep thinking there, and I think that’s important.

Ash Patel: I’ll add my own opinion for a brief moment here. I say, “Hey, you should pick a little bit out of every organization that can provide you a good example. So look at Uber, Netflix, Spotify. All of these companies, as well as Google, have had some interesting experiences with SRE.”

People should explore what all of these different companies have done, then put the pieces together and adapt the practices to their own unique context. We could keep going on about this for hours.

We’ve had some good chats about this, Kyle. But, I want to go back to what RunWhen actually does for SREs. So, in under 5 minutes, well let’s try and timebox this. What can RunWhen do to make an SRE’s life easier?

Kyle Forster: I would say one of our core observations came from looking at the way 10x SREs think about troubleshooting.

Let’s just bring it back to looking at dashboards and then running investigations on the command line. Folks that were real 10xers were so facile in those basic tools that they could really think at a much higher level, like, “Okay, I’m going to form a hypothesis about this area of the system.”

“I’m going to form these two or three tests about this hypothesis. I’m going to confirm or deny.”

Their ability to think very abstractly about troubleshooting was amazing. And then there were folks that were like, “Oh, wait, is it dash F or dash E on a command line flag?” People who are getting up to speed, people who are using new tools.

You’re just using so much of your brain to think about, alright, which command line flag am I supposed to be using, that you lose the thread of like, hey, here’s the hypothesis that I’m trying to prove or disprove.

So I said, hey, what if we could just automate a heck of a lot of this, like, here is the right troubleshooting command to use?

And what if somebody could say, “Hey, here’s my hypothesis. Oh yeah, that command looks right. That command looks right.”

We could actually free people up to do much, much, much better troubleshooting. Now, along the way, we came up with a few things. Well, we can’t just generate these troubleshooting commands out of thin air.

And it’s a real pain in the neck for an organization to try to enumerate all of them. I mean, this is back to early runbooks days.

As soon as you write a runbook for your own organization, it’s stale almost instantly. But could we have a whole series of really, really, really good SREs write troubleshooting scripts, write troubleshooting commands as an open source community?

And write commands that are relevant to, you know, at least in the Kubernetes world, there are about 800 open source packages that represent about 75% of the workloads running in the world.

So if we can get an open source community saying, “Hey, here’s how you troubleshoot these hundred open source packages,” frankly, we’ve covered a gigantic amount of the Kubernetes universe.

So if you use that as a starting point, when you say you want to build the largest repository of troubleshooting commands in the world, you can get to a fair degree of completeness for the Kubernetes ecosystem with tens of people and a year, two years.

But it’s not like hundreds of people for a thousand years. So I think that we’re actually doing pretty well on that front.

And then can you present something to SREs that says like, “Hey, given what you’re looking for, here’s ten ways to troubleshoot Postgres. Given what you’re looking for, here are five different ways to figure out what your storage utilization is on all of your volumes that underpin this set of Kubernetes namespaces.”

Can we start getting some basic suggestions that really free people’s cognitive load up so that they’re not thinking, what was that command line flag?

And they’re thinking, “I actually wonder if this is a storage utilization issue that’s actually sitting underneath here.”

Let’s get the machine to do the basic thinking for me while I can think of the really, really high-level causes because I know my system really well.
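
To make the idea of suggesting commands from a stated hypothesis concrete, here is a minimal illustrative sketch. It is not RunWhen’s implementation: the catalog entries, the tags, and the `suggest` function are all hypothetical, and the matching is plain keyword overlap rather than any AI layer. The commands are ordinary `kubectl` and `psql` invocations picked only as examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One troubleshooting task in a (hypothetical) shared catalog."""
    title: str
    command: str
    tags: frozenset


# Hypothetical catalog; a real one would be community-maintained and far larger.
CATALOG = [
    Task(
        "List persistent volume claims in a namespace",
        "kubectl get pvc -n <namespace>",
        frozenset({"storage", "volume", "kubernetes"}),
    ),
    Task(
        "Show recent events in a namespace",
        "kubectl get events -n <namespace> --sort-by=.lastTimestamp",
        frozenset({"kubernetes", "events"}),
    ),
    Task(
        "Check Postgres connection count",
        'psql -c "SELECT count(*) FROM pg_stat_activity;"',
        frozenset({"postgres", "database", "connections"}),
    ),
]


def suggest(hypothesis, catalog=CATALOG, limit=5):
    """Rank tasks by how many of their tags appear in the hypothesis text."""
    words = set(hypothesis.lower().split())
    scored = [(len(words & task.tags), task) for task in catalog]
    # Keep only tasks with at least one matching tag, best match first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [task for _, task in ranked[:limit]]


if __name__ == "__main__":
    for task in suggest("I wonder if this is a storage volume issue"):
        print(f"{task.title}: {task.command}")
```

The point of the sketch is the division of labor described above: the engineer supplies the hypothesis, and the lookup supplies the exact command and flags.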

I think that kind of naturally evolved for us into saying, hey, frankly, we’re successful enough with suggesting troubleshooting tasks in the Kubernetes ecosystem to SREs.

Could we also become a tool that SREs actually give to their app developers? And then the real test is the thing that I like to think our newer releases are doing more and more.

Can you take an app developer who has maybe a deep understanding of their part of the system, of their particular service or subservice, but who really doesn’t know that much about the rest of the system?

And can you help people who don’t really know the rest of the system do a pretty good job of basic troubleshooting?

So can you help somebody troubleshoot something that they don’t really understand?

Not to say they can do it completely, but if you get somebody pretty far along, that covers a huge amount of the workload of an SRE team, the really, really high-toil stuff. You know, “Hey, the app developers said Kubernetes is down.”

Kubernetes is not down. The app developers said that Postgres is down. The database is not down. It’s something else. Like we know the second that comes over the Slack channel.

We know it’s something else.

And we know that for somebody reasonably senior, three or four hours of their life just went away trying to show that it was actually the something else. That’s a very high-toil way to spend half a day.

And we’ve kind of found a way to make it much, much more efficient for everybody and kind of enjoyable.

Ash Patel: And it fits really well into a “you build it, you run it” type model.

Kyle Forster: Certainly we’ve seen a lot of organizations with you build it, you run it that are not getting any ROI at all. Just a huge amount of organizational churn.

And what they’ve got, without us, is you build it, you run it, something goes down, and then you wait for the SRE team to get online, and then you desperately beg them for help, and you get them to help run it for you.

And we see organization after organization that did this, like, you build it, you run it, laid off a whole bunch of people, and now they’re just completely stuck and they’re getting no ROI out of this terrible change.

Can we help them get, like, a little bit further along so that we can help some of these organizations survive and get those SRE teams back on their feet and get them really doing high-value work that they want to be doing, high-value reliability architecture work, instead of, “Oh my god, it’s 6 am again, and the you build it, you run it team can’t figure out which Kubernetes namespace they’re supposed to be in.”

Ash Patel: So I can see that there are two types of people who this can really help. The first is, obviously we’re talking about SREs here.

So the SREs who can think a little bit more strategically about the system, helping them think more at a higher level, and then say, hey, I want to achieve this.

But I don’t want to have to remember all these different things that are going to add to my cognitive load.

Let’s have the machine actually give me suggestions of what I can punch into the CLI to actually get going with my Kubernetes clusters. That’s the first one I’m hearing. And the second one I’m hearing is…

Kyle Forster: Hmm, for myself, as a developer, I’m pretty mediocre in C. Like bottom 25%, because when I have to manage my own memory, I am not good at it.

But for me, I had a huge career unlock moment when I discovered Java and the machine would just manage memory for me and I could go a lot faster.

And I just look at our team compared to other teams that I’ve worked on, at the ability to work in Java or Golang or Python compared to working in low-level, machine-level C.

The team goes so much faster because it’s just, you can think at an architectural level, you can design for performance instead of trying to eke out clock cycles for performance.

You just make people into better developers, and I think we could do the exact same thing with SRE. Like, the exact same.

And we’ll look back at a lot of the older ways and say, oh wow, that’s kind of the equivalent of programming in assembly. Nobody does that anymore unless you really, really, really have to.

Ash Patel: So we’ve talked a little bit about how your product can actually help SREs.

I want to dig in a little bit deeper into what your ideal customer would look like because you are a vendor and you are trying to help solve problems at a commercial level.

So we’ve talked a little bit about the problems they’re facing.

What is the typical structure of the SRE teams or cloud computing teams that you’re usually dealing with? How is their work and organization structured in terms of your ideal customer?

Kyle Forster: We’ve done really well, fundamentally, when the team is understaffed. Sometimes that’s because one or two senior people have recently left.

Sometimes that’s because the organization did a shift left, “Wow, we shall shift left,” and then did an unfortunate layoff around that, and the shift left isn’t working.

And now the team is just horrifically far behind on a reliability roadmap because all of their time gets spent helping app developers just get unblocked day to day.

So we do really well with organizations that have kind of experienced almost what it’s like working at the higher level of Maslow’s pyramid here.

Or if you go back to the SRE book, like, “Hey, here are the functions of a senior SRE.” There are some great chapters on that. They’ve had a glimpse of that.

And now, unfortunately, they’re back to operating at a really, really low level of, we’re spending so much time helping our app developers get unblocked that we’re just not able to help out in the architecture conversations.

We’re just not able to do real optimization even just optimization of dashboards and alerts.

We cannot free up any time at all to reduce our own toil.

Because we’re just stuck in it, trying to make sure our app developers aren’t blocked, because every time they get blocked, you know, fire rains down on our heads.

We tend to do really, really well with those types of organizations, because that’s where the pain is really, really bad.

And we need to be able to say, “Hey, with one hour of your time, we can help.” None of the teams that we work with can afford a week to do a vendor implementation.

They can’t even afford a day to do a POC. I mean, it has to be like an hour to do a POC and we have to show value that fast.

So, we kind of work on these extraordinarily time-constrained teams, and that’s where I think that we do the best work.

Ash Patel: So, we’ve talked about their pains. What would a day, when they’ve used your product, what would the win look like at the end of it? How would they feel?

Kyle Forster: Our best moments come in two phases. The first phase is when somebody says, “Wow, you just turned my junior SRE colleague into a 10x’er.”

And we’ve seen that a few times. And that just makes me so happy because that person is going to get an awesome promotion. When you can kind of instantly take somebody who’s fairly new to the profession, fairly new to the system and, in our case, fairly new to Kubernetes, and you can suddenly get them ripping through it, everybody is happy, everybody wins.

I love seeing that as the first stage, you know, of like, “In my organization there used to be a lot of people waiting around for me, and now all of a sudden they can help themselves.”

Like, that’s an awesome moment. And the second moment that we look for is, hey, they actually just started giving our tool to people outside of the SRE team.

And hey, all of a sudden, we have app developers in another time zone. They don’t wait for us in the morning anymore.

Whenever there’s something that they can’t fix by themselves, like we show up and there’s a perfect triage report on our desk that shows 50 things that were tried, shows the two issues, shows a nice summary, shows us exactly what we need to do.

I love seeing that moment of, “Oh wow, my junior people are vastly more productive and self-sufficient than they were.”

“Now suddenly my app developers, while they don’t have the full credentials, you know, we can’t give them everything, are much more self-sufficient than they were.”

There’s this kind of like, aha, everybody wins moment, and that, for me is very, very, very satisfying.

Ash Patel: I like that idea of enabling junior SREs to be 10xers, because that shows them that there’s potential for their career progression.

They can start thinking about bigger and more complex things, because they don’t have to worry about memorizing, “How does this work? How does that work?” And that’s a great feeling to have.

So, it’s a relatively new space that you’re in. Where do you see your category heading in the next two to three years?

Kyle Forster: If you look at monitoring and observability, broadly speaking, like dashboards and products that serve dashboards, there was somewhere around $6 billion spent last year on products that all feed metrics, or stuff built on top of metrics, into dashboards.

We asked teams that we interviewed, “Hey, for your senior engineers, at least within Kubernetes, how often are things entirely resolved by your dashboards?”

And the answer came back: 80% of the time, their senior engineers wind up back on the CLI. For non-Kubernetes environments, I suspect the number’s probably lower because these are mostly older, more mature operational environments.

When there aren’t very many new services, when there aren’t very many new applications, when things aren’t changing very fast, my guess is that there’s a lot more that’s resolved on just the dashboards alone.

But if an organization is doing a lot of engineering work, there is always a lot of stuff that’s not resolvable on the dashboards alone.

And I think we have this kind of funny quirk in our industry right now that like, Okay, six billion dollars spent on dashboards, and anything that’s not there has to be done at the assembly language level.

I think that there’s just a better industry structure here. I think if we can start by making the junior people into 10xers, we have a huge win.

I think when we can make the app developers much more self-sufficient, then I think what we’ll see is a troubleshooting category that’s kind of roughly the same size as the observability category.

Ash Patel: There isn’t a troubleshooting category right now. I would say it’s kind of a subset of incident management, but you want it to be its own category.

Kyle Forster: I think it’ll become its own category because incident management, I feel like incident management plays a gigantic role, but it’s very much a production role, whereas troubleshooting, a huge amount of the troubleshooting that goes on is going on in dev and test environments in addition to production.

You ideally want to troubleshoot prod the same way you troubleshoot dev and test.

But a number of the teams that we talked to were, you know, “Hey, in theory, my job is prod, but if I look at my schedule day to day, wow, a heck of a lot of time gets spent in our test environment, just helping get people unblocked.”

I think that’s where these categories are actually very, very, very separate. And the starting point, at least, that we’ve seen is totally separate.

So I think we’ll wind up integrating with incident management more than folding into that category.

Ash Patel: In terms of SRE as a broader field, rather than going into different areas like observability, incident management and troubleshooting, where do you see SRE heading in the next few years?

Kyle Forster: It’s a good question. I kind of think the profession is at a little bit of a crossroads.

Mostly because of the downward economy over the last year and the huge push to shift left. And I think that we either find a way for SRE teams to make shift left successful and use that, frankly, as an opportunity to leapfrog.

Go from the basic fingers-on-keyboard troubleshooting work to saying, that’s part of the job, but that’s not like 130% of my day.

Instead, real reliability architecture and reliability optimizations are a big hunk of my day. Or we’ll see organizations that did a lot of shift left, and then the SRE team just doesn’t provide enough value.

They’re just sitting there like unblocking app developers, unblocking app developers, unblocking app developers, and there’ll be some new category that comes in for that kind of high end architecture.

Ash Patel: I feel that SRE really is at a crossroads. It was showing a lot of growth potential during the pandemic: 2020, 2021, 2022.

It’s lost steam this year. There’s a lot of dissatisfaction, a lot of disengagement from SREs that I’ve spoken with.

I feel that can turn around. It’s just, we need to power through this. It’s probably looking at the role like you said, rather than just sitting in front of a keyboard and just trying to play around and figure out what’s going on.

It’s going to be a lot of using assistive technologies to support you.

But you’re actually doing a lot of think work outside of just typing commands on a keyboard.

You’re actually probably getting up at a whiteboard, like UX people do, and drawing out what’s happening in the system, where you think we should investigate, and so on.

And I feel that’s where they could move towards to feel less like reactive people and more proactive when they’re even dealing with incidents.

Kyle Forster: I agree.

Ash Patel: What piece of advice would you give to SREs, ideally related to SRE work?

Kyle Forster: Well, first, look, I’m a vendor here, so I’m going to start by pitching my own book.

I think that there is a certain, and I see it, I see it in every single organization, including my current one.

There’s a certain like, “Oh, thank you. You just solved my problem. You just saved me.”

This effusive thank you that comes from an app developer who you’ve just unblocked. Or from a junior SRE, whom you’ve just unblocked. And there’s this incredible feeling of like, I am the hero and I just did this.

And it’s because of expertise through a bunch of hard-won scars, and it’s an incredibly gratifying feeling.

And every single time somebody feels that way, the person on the other side says, “Oh, thank goodness, Ash just got me unblocked. Man, his team made a bunch of crappy decisions that got me blocked in the first place.”

We have these incredibly perverse incentives.

I really saw it, frankly, working on the Kubernetes platform. The sheer number of times I could see it happening amongst my clients.

They’d say, not to me but to their Kubernetes lead, “Oh, thank goodness you just got me unblocked. Oh, I hate Kubernetes, this stupid system just got me totally blocked in the first place.”

So, I think that as an industry, we do need to sort of wrestle through that because it feels really, really, really good to unblock people, but you create this extraordinarily negative sentiment toward your team and toward the technology base that your team has chosen.

And it’s really weird. And I think that figuring out a way of saying like, “Hey, that feels good to get somebody unblocked, but now structurally I need to design myself out of this loop,” is really, really necessary, especially as teams get small and teams just have to stick together.

Otherwise, “Wow, Ash was awesome. Shame none of his colleagues could help me out. There must be a bunch of dummies just sitting around.”

It’s just the reality of being human and it really does do terrible things for teams.

So I think figuring out this deep misalignment of incentives is going to be a really important role for very, very senior SREs who have both the technical maturity and the organizational and emotional maturity to spot this when it’s happening and figure out a way to design around the anti-pattern.

Ash Patel: In summary, stick together, whatever happens, as SREs. We’ve got to stick together.

Thank you Kyle for coming in and having a candid conversation about things affecting SRE, and in particular, how your solution can help them improve their working lives.

Kyle Forster: Appreciate you having me on.
