Deep Dive

E331

Today's Deep-Dive: StackStorm

Episode Transcript

If you really look at the modern digital world, it's just this massive web of event chains, right?

Totally.

A customer clicks something over here, and that triggers a process way over there, which then, you know, alerts some monitoring system.

Yeah, and for anyone in DevOps or for the SREs listening, managing all that, it can feel impossible.

Exactly.

You're just trying to stitch it all together and make sure the right thing happens every single time.

And you're definitely looking for a shortcut through all that complexity.

And that is, well, that's exactly what we're diving into today.

Our mission here is to really unpack StackStorm.

The famous IFTTT for Ops.

That's the one.

We're going to get past the jargon and explain this really powerful, event-driven automation platform so that, you know, beginners can get it.

Not just what it is, but how it actually fundamentally changes IT operations.

We're basing this on its GitHub docs and the core features overview.

And before we jump in, we have to thank the people who make these deep dives possible.

Of course.

This deep dive is supported by SafeServer.

SafeServer assists with digital transformation, helping you move to the future of IT infrastructure.

They also provide, well, amazing hosting for software like the platform we're discussing today.

So if you're serious about modernizing your stack, find out more at www.safeserver.de.

Again, that's www.safeserver.de.

Okay, let's start with that core idea.

IFTTT for ops.

We all know if this, then that for like our personal apps.

Right.

Linking your smart lights to your email or whatever.

Yeah.

But applying that same simple logic to massive enterprise level infrastructure, that seems like a total game changer.

It is.

And it's mostly because of the scale and the criticality of the work.

StackStorm is fundamentally a platform that's built to integrate all your different services and tools.

All of them.

All of them.

Your monitoring systems, your ticketing platforms, cloud providers, deployment tools, you name it.

And its whole focus is on event driven automation for the really tough tasks.

Think auto remediation, really sophisticated incident response, or these complex multi-stage deployments.

It basically brings structure to the chaos.

And the scale.

I mean, the sources mention some huge numbers, something like 160 integration packs.

With over 6,000 actions available on what they call the StackStorm Exchange.

So no, this is not just about running a few simple bash scripts.

And it has a full rules engine, workflow capabilities, and it even supports chat ops, right?

So you can manage everything from Slack.

Exactly.

But what's really crucial here, and this is a thing I think you, the learner, should really internalize, is the philosophical shift.

Okay.

All the contents, the rules, and the workflows that dictate all the automation logic, it's all stored as code.

Wait, let me stop you there.

If the operational logic, the if this, then that, is stored as code, how does that actually help an SRE?

I mean, isn't that just adding more complexity?

It's actually the opposite.

By treating your automation logic just like you treat your application code, you immediately get all the benefits of the modern DevOps lifecycle.

Ah, so you're talking about version control with Git, code reviews?

Code reviews by peers, testing environments, auditability, clear governance, all of it.

You're not relying on that one engineer who knows the magic fix at 3 a.m. anymore.

You're relying on a process that's been written down, reviewed, and versioned.

Precisely.

It just elevates your operations to the same standard of reliability as your actual code base.

That makes so much sense.

Moving from like hero knowledge to a codified process, that's a huge unlock.

Okay, so let's move from the philosophy to the real world.

Give us some concrete examples of how this works.

Absolutely.

Let's start with something that happens way too often.

The dreaded alert.

The 2 a.m. page.

The 2 a.m. page.

So our first pattern is facilitated troubleshooting.

So picture this.

A monitoring tool, let's say it's Sensu or New Relic, it captures a system failure.

Normally, a human gets that alert, logs into four different systems, runs diagnostics.

It's the swivel chair integration.

The swivel chair, exactly.

With StackStorm, that alert is the trigger.

So instead of a human logging in, StackStorm acts instantly.

It becomes a kind of digital first responder.

And what's its first move?

It just immediately runs a whole series of diagnostic checks.

It pings the physical node, it checks the state of AWS or OpenStack instances, verifies app components, pulls log snippets.

And then what?

It puts all of those results, all correlated and with context, directly into a shared space like Slack or a Jira ticket.

So the human engineer steps in already fully informed.

It saves critical minutes.
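The "digital first responder" pattern described above can be sketched in a few lines of plain Python. Everything here is illustrative: the check functions, the alert shape, and the report format are invented for this sketch, not StackStorm's real API (in StackStorm these would be actions chained in a workflow).

```python
# Illustrative sketch of facilitated troubleshooting: an alert arrives,
# diagnostic checks run automatically, and one correlated report is
# produced for the human engineer. All names here are hypothetical.

def ping_node(host):
    # Placeholder for a real ping/SSH reachability check.
    return {"check": "ping", "host": host, "ok": True}

def check_instances(host):
    # Placeholder for an AWS/OpenStack instance-state lookup.
    return {"check": "instances", "host": host, "ok": True}

def tail_logs(host, lines=5):
    # Placeholder for pulling recent log snippets.
    return {"check": "logs", "host": host, "snippet": "<last %d lines>" % lines}

def handle_alert(alert):
    """Run every diagnostic and return one correlated report."""
    host = alert["host"]
    results = [ping_node(host), check_instances(host), tail_logs(host)]
    # In a real setup this report would be posted to Slack or a Jira ticket.
    return {"alert": alert["name"], "host": host, "diagnostics": results}

report = handle_alert({"name": "system_failure", "host": "db-01"})
```

The point is the shape: the alert is the trigger, the checks fan out automatically, and the engineer steps in already holding the correlated results.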

I can see that.

That turns the 2 a.m. panic from an investigation into just a quick verification.

But what about actually solving the problem?

That brings us to our second pattern, automated remediation.

And this is where StackStorm's multi-step workflows really shine.

Okay.

Let's take an OpenStack compute node failure.

The goal here is graceful failure handling.

The hardware failure kicks off a really complex workflow.

It doesn't just reboot the machine.

What does it do?

First, the workflow verifies the failure.

Then, if it's confirmed, it properly evacuates all the running virtual machines onto healthy nodes.

And it tells people.

Yep.

Automatically emails the VM owner about potential downtime.

But here's the really smart part.

The failsafe.

Right.

Because it can't just keep going blindly if something goes wrong.

Exactly.

These workflows have explicit conditions.

So if the evacuation step times out, or if it detects it was only partially successful, the workflow just freezes.

It saves its state and calls PagerDuty to alert a human engineer with all the data it's collected so far.

So it knows its own limits.

It knows exactly when a human needs to step in.

That balance is fundamental for building trust in the automation.
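That failsafe logic can be sketched like this. The function names and the "stuck VM" scenario are invented for illustration; a real StackStorm workflow would express the same transition conditions declaratively rather than in imperative Python.

```python
# Illustrative sketch of the remediation failsafe: if evacuation only
# partially succeeds, freeze the workflow and page a human with all
# the state collected so far. Every name here is hypothetical.

def evacuate_vms(node, vms):
    # Placeholder: pretend one VM fails to move off the broken node.
    moved = [vm for vm in vms if vm != "vm-3"]
    return {"node": node, "moved": moved,
            "stuck": [vm for vm in vms if vm == "vm-3"]}

def page_human(state):
    # Placeholder for a PagerDuty call; here we just record the handoff.
    return {"paged": True, "state": state}

def remediate(node, vms):
    result = evacuate_vms(node, vms)
    if result["stuck"]:
        # Partial success: stop automating and escalate with full context.
        return {"status": "frozen", "handoff": page_human(result)}
    return {"status": "remediated", "handoff": None}

outcome = remediate("compute-7", ["vm-1", "vm-2", "vm-3"])
```

The key design choice is the explicit condition: the automation knows its own limits and hands off with context instead of pushing blindly forward.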

That is robust.

Okay.

Last one.

How does this apply to something as high stakes as CI/CD, continuous deployment?

Well, CI/CD with StackStorm goes way beyond what a tool like Jenkins can do on its own.

Jenkins can handle the build and test phase.

StackStorm takes over for the whole orchestration.

The automation can provision a new AWS cluster, deploy the code, and then it starts to carefully shift traffic over using the load balancer.

And it's watching what happens.

Constantly.

It's pulling real-time performance data from a tool like New Relic.

Based on the metrics you define, latency, error rates, it intelligently decides whether to fully roll forward the new deployment or to instantly trigger a rollback.

It manages the full life cycle.
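The core of that decision is simple to sketch. The thresholds and metric names below are made up for illustration; in practice they would come from your own New Relic (or similar) monitoring and your own service-level targets.

```python
# Illustrative sketch of the metrics-gated deployment decision:
# compare live metrics against thresholds you define, then choose
# roll-forward or rollback. Thresholds here are hypothetical.

THRESHOLDS = {"latency_ms": 250, "error_rate": 0.01}

def decide(metrics):
    """Return 'roll_forward' if every metric is within bounds, else 'rollback'."""
    for name, limit in THRESHOLDS.items():
        # A missing metric counts as unhealthy rather than passing silently.
        if metrics.get(name, float("inf")) > limit:
            return "rollback"
    return "roll_forward"

print(decide({"latency_ms": 180, "error_rate": 0.002}))  # prints "roll_forward"
print(decide({"latency_ms": 400, "error_rate": 0.002}))  # prints "rollback"
```

Treating a missing metric as a failure is a deliberately conservative default: in a deployment gate, silence from monitoring should never look like health.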

Wow, the value there is just so clear when you put it like that.

You're getting consistency, incredible speed, and you're freeing up your best people from doing these repetitive stressful tasks.

You shift from doing the work to writing smarter operational code.

Okay, so now that we know what it does, let's get under the hood.

You said it was a modular architecture.

Can you break down those core components for us, the building blocks?

For sure.

It's all built on loosely coupled microservices that talk over a message bus, so things can fail or scale independently.

So to follow the flow, think of it in terms of sensory input, brains, and then actions.

First up, you have sensors and triggers.

The eyes and ears.

Exactly, the eyes and ears of the platform.

Sensors are just Python plugins that are always watching external systems.

When an event happens, a host goes down, a GitHub repo changes, the sensor fires off a StackStorm trigger, and the trigger is just the platform's internal version of that event.
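As a toy model of that sensor-to-trigger flow: a sensor watches an event feed and, when something matches, dispatches an internal trigger onto a bus. This mimics the shape of StackStorm's Python sensors, but the class and field names below are invented, not the real `st2` API.

```python
# Toy model of the sensor -> trigger flow. A sensor watches a (fake)
# monitoring feed and emits trigger objects onto a stand-in message bus.
from collections import deque

trigger_bus = deque()  # stand-in for the real message bus

class HostDownSensor:
    """Watches a fake monitoring feed and emits triggers for down hosts."""
    def poll(self, events):
        for event in events:
            if event.get("status") == "down":
                # The trigger is the platform's internal form of the event.
                trigger_bus.append({"trigger": "monitoring.host_down",
                                    "payload": {"host": event["host"]}})

HostDownSensor().poll([{"host": "web-1", "status": "up"},
                       {"host": "web-2", "status": "down"}])
```

Only the down host produces a trigger; healthy events pass through without touching the bus.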

So the trigger is the "if this." Then you need hands for the "then that."

The hands are the actions.

These are the outbound integrations, what StackStorm actually does.

They can be super simple, like an SSH command or really complex, like an integrated call to Docker or Puppet.

And anything can be an action.

Basically, yeah.

Any script or command line tool can become a first-class action just by adding a little bit of metadata.

And what decides which action runs for which trigger?

That's the rules engine.

That's the brain.

The rules are that coded link.

They map a specific trigger to a specific action.

The rule can apply matching criteria, like only run if CPU usage is over 90 percent.

That maps the data from the trigger to the inputs the action needs.
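A toy version of that rule makes the linkage concrete. The structure loosely echoes a StackStorm rule (trigger, criteria, action, parameter mapping), but this is a hand-rolled sketch, not the platform's actual rule format.

```python
# Toy rules engine: a rule links a trigger to an action, with matching
# criteria and a mapping from trigger payload to action inputs.

rule = {
    "trigger": "monitoring.cpu_alert",
    # Only fire if CPU usage is over 90 percent, as in the example above.
    "criteria": lambda payload: payload["cpu_percent"] > 90,
    "action": "restart_service",
    # Map data from the trigger to the inputs the action needs.
    "params": lambda payload: {"host": payload["host"]},
}

def evaluate(rule, trigger_name, payload):
    """Return (action, inputs) if the rule matches, else None."""
    if trigger_name == rule["trigger"] and rule["criteria"](payload):
        return (rule["action"], rule["params"](payload))
    return None

match = evaluate(rule, "monitoring.cpu_alert",
                 {"host": "app-3", "cpu_percent": 97})
```

The same evaluation with `cpu_percent` at, say, 40 would return `None`: the rule is the coded if-this-then-that.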

Okay, but what if I need to run five different actions in a row?

Do I need five different rules?

No.

That is where workflows come in.

They're the assembly line.

The uber actions.

The uber actions, exactly.

Workflows are how you stitch multiple actions together.

They define the order, the complex transition conditions we talked about, and they make sure the output from step one becomes the input for step two.
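That assembly-line behavior, each step's output feeding the next step's input, can be sketched with a tiny runner. The three steps mirror the CI/CD example from earlier, but every name here is invented for illustration; real StackStorm workflows are defined declaratively, not as Python functions.

```python
# Toy workflow runner: actions run in order, and the output of step N
# becomes the input of step N+1, as in a StackStorm workflow.

def provision(_):
    return {"cluster": "aws-cluster-1"}

def deploy(ctx):
    return {"cluster": ctx["cluster"], "version": "v2"}

def shift_traffic(ctx):
    return {"cluster": ctx["cluster"], "version": ctx["version"], "live": True}

def run_workflow(steps, initial=None):
    ctx = initial or {}
    for step in steps:
        ctx = step(ctx)  # chain: each step consumes the previous output
    return ctx

final = run_workflow([provision, deploy, shift_traffic])
```

A real engine adds the transition conditions discussed earlier (retry, freeze, escalate), but the chaining of outputs to inputs is the essence.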

Got it.

And finally you have packs.

Correct.

Packs are just the shareable kits.

They group everything together.

Sensors, actions, rules, workflows into one simple unit.

And that's why the StackStorm Exchange exists, so people can share these operational patterns.

And you can access it all through an API, a command line, or a UI.

Full REST API, a really powerful CLI, and a web UI.

Yep.

That sounds technically impressive, but who's actually using this?

I mean, real-world adoption is the true test, right?

Oh, absolutely.

And this isn't just theory.

It's very established.

It's Apache 2.0 licensed, actively developed, and used by companies with just massive infrastructure needs.

Like who?

Well, Netflix, for one.

They use StackStorm to build their own internal platform they call Winston.

Winston.

Yeah.

And it's dedicated specifically to event-driven diagnostics and auto remediation.

They just needed that reliable, super-fast response time in their cloud environment.

StackStorm gave them the framework.

Wow, okay.

That's a huge name.

What about in the security space?

Target is a big user there.

They realized that the real power was in its flexibility.

So they used the existing integrations to get up and running fast, which freed up their teams to focus entirely on building custom security features and automating their unique compliance checks.

So they used it to accelerate their own security development.

Exactly.

And then you have someone like Pearson, who uses it for internal efficiency.

They basically took all these small, specific operational tasks and turned them into individual actions.

Then they could easily orchestrate them into bigger, reliable macro tasks that they share across the whole organization.

It really does keep coming back to consistency and speed.

And the fact that it's mostly Python, I think the source said 94%, that makes it so much more accessible for engineers today.

Right.

They can write automation logic instead of clicking through some opaque vendor tool.

It's all about making chat ops better, automating security response, and letting teams focus on innovation, not just running the same old playbooks.

So to kind of sum it all up for you, the learner, StackStorm lets you transform those ad hoc operational patterns and that tribal knowledge into defined, code-based, event-driven processes.

And because everything is code and it all goes over that message bus, the entire system is fully auditable.

Every single action is recorded.

Every action, manual or automated, is recorded and stored.

You can send it all to Splunk or LogStash or whatever you use.

That audit trail is huge.

And it actually brings us to our final thought for you, the listener, to kind of chew on.

Given that all this automation logic, the workflows, the rules, is stored and managed as code with full version control, how does that change the very definition of a bug in the future?

That's a great question.

Right.

Is a failure in the system a traditional coding error, like a syntax mistake?

Or is it actually a flaw in the operational logic that you wrote into the automation?

It feels like it requires a totally different kind of peer review, something to think about.

Absolutely.

And thank you once again to our sponsor, Safe Server, for supporting this deep dive and for all their work in digital transformation.

They provide incredible hosting for platforms just like StackStorm.

You can find out more about their services and how they can help your team at www.safeserver.de.

That's www.safeserver.de.
