Monitorama 2019 Recap

Day 1

John Allspaw kicked off the event with a talk called “Taking Human Performance Seriously.” He opened with a thought exercise: “How long would your system/application continue running if you stopped doing anything? Stopped updating code, responding to alerts, stopped doing anything.” Judging by the show of hands, only two people believed their apps would last longer than one week. Everyone else expected less.

John’s talk was a fantastic way to kick off Monitorama, covering automation, safety, risk, and human error, and how our beliefs about these have changed dramatically over the last 50-plus years.

Nora Jones followed with a talk on Chaos Engineering traps that tied in with the previous talk. Nora mentioned that we can use Chaos Engineering as a way to build adaptive decision-making capabilities in engineers. But she has seen companies fall into many traps when implementing Chaos Engineering, and she highlighted eight of them in her talk. A couple of the traps that resonated with me were:

“You can measure your success with CE by counting the number of vulnerabilities you find.”

Counting errors and vulnerabilities doesn’t tell you much about the health of the team or the organization, and in many ways it’s not a valuable OKR. Goodhart’s Law applies here. One formulation says, “All metrics of scientific evaluation are bound to be abused,” and the more common phrasing is “when a measure becomes a target, it ceases to be a good measure.” Nora continued by saying the goal is to use the vulnerabilities we find to move further along a journey toward resilience. Looking at what went right in a Chaos experiment also helps us understand what we are good at.

“Chaos Engineers should be rule enforcers.”

As Nora said, part of the success in Chaos Engineering is building relationships, and it’s not fair to force everyone to fix every vulnerability that you find. The goal is to build context, not to take control away from the engineers or shift their priorities. Fixing every bug or vulnerability is also a trap; every fix has an ROI, and you need to focus on the most valuable things first.

In a wonderful break from the norm, Dave Cadwallader was welcomed on stage with his kids Boden and Zenna to talk about teaching our observability tools to our kids. The Cadwallader family not only talked about a project they created with Grafana and Prometheus to check whether their garage door was open, but also showed off an idea for a children’s book that simplifies Prometheus concepts.
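To make the garage door idea concrete, here is a minimal sketch of how a door state could be exposed as a Prometheus gauge in the text exposition format. It is not the Cadwallader family's code: the metric name `garage_door_open`, the `door` label, and the `read_door_sensor` function are all hypothetical and exist only for illustration.

```python
from typing import Dict, Optional


def render_gauge(name: str, help_text: str, value: int,
                 labels: Optional[Dict[str, str]] = None) -> str:
    """Render one gauge sample in the Prometheus text exposition format.

    Label values are assumed to be simple strings; no escaping is done.
    """
    label_str = ""
    if labels:
        pairs = ",".join('{}="{}"'.format(k, v) for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        "# HELP {} {}\n".format(name, help_text)
        + "# TYPE {} gauge\n".format(name)
        + "{}{} {}\n".format(name, label_str, value)
    )


def read_door_sensor() -> bool:
    """Hypothetical sensor read; a real project would query hardware here."""
    return True


if __name__ == "__main__":
    # 1 means open, 0 means closed (an assumed convention for this sketch).
    state = 1 if read_door_sensor() else 0
    print(render_gauge("garage_door_open",
                       "1 if the garage door is open, 0 if closed.",
                       state, {"door": "main"}), end="")
```

A Prometheus server would scrape text like this from an HTTP endpoint, and a Grafana panel could then graph or alert on the gauge.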

The rest of the first day’s talks covered a wide range of topics, such as real-world observability tooling with BPF, Service Level Objectives, and a few higher-level talks on modern observability and its tradeoffs. The organizers continue to do a fantastic job creating an inclusive environment with wonderful talks from a diverse range of presenters.

Day 2

Day 2 kicked off with Aysylu Greenberg from Google talking about “Software Supply Chain Observability with Grafeas and Kritis,” two open source projects she works on that help companies manage their build and release pipelines. Software supply chain management is really about understanding what happens to the code from source all the way through to production. What is great about these tools is that they aim to help companies manage the complexity of tracking bugs and vulnerabilities and enforcing policy, not only for their software binaries but also for their Dockerized applications and Kubernetes deployments.

Luke Demi from Coinbase came on next and made the case that we should be moving away from separate logs and metrics and instead treat everything as structured events. It’s a concept that we love to chat about at CHAOSSEARCH. Luke had the crowd rolling with laughter as he talked about all the pains and challenges of scaling log and event collection. His main pain point was that he currently needs to run multiple solutions to handle both metrics and logs, and ends up with separate, isolated environments. Luke argued that we need to move towards a world where structured events are captured and saved for later search and analysis. As he said in his talk, “Metrics do not give you context, you need context to investigate the real world.”
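As a rough illustration of the structured events idea (not code from Luke's talk), the sketch below records each request as one event that carries its own context, then derives a metric from those events after the fact. The field names (`route`, `status`, `duration_ms`, `user_id`) and the helper names are assumptions made for this example.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, List


def make_event(name: str, **context: Any) -> Dict[str, Any]:
    """Build one structured event: a name plus arbitrary context fields."""
    event = {"event": name}
    event.update(context)
    return event


def to_log_line(event: Dict[str, Any]) -> str:
    """Serialize the event as one JSON line, ready to ship to storage."""
    return json.dumps(event, sort_keys=True)


def count_by(events: Iterable[Dict[str, Any]], field: str) -> Counter:
    """Derive a metric after the fact: count events per value of a field."""
    return Counter(e.get(field) for e in events)


def slow_requests(events: Iterable[Dict[str, Any]],
                  threshold_ms: int) -> List[Dict[str, Any]]:
    """Return full events, context included, slower than the threshold."""
    return [e for e in events if e.get("duration_ms", 0) > threshold_ms]


# Hypothetical sample data for illustration only.
events = [
    make_event("http_request", route="/orders", status=200, duration_ms=41, user_id="u1"),
    make_event("http_request", route="/orders", status=500, duration_ms=950, user_id="u2"),
    make_event("http_request", route="/login", status=200, duration_ms=12, user_id="u1"),
]

if __name__ == "__main__":
    for e in events:
        print(to_log_line(e))
    print(count_by(events, "status"))
    print(slow_requests(events, 500))
```

The counter plays the role of a metric, while the slow-request query returns the user and route behind the number, which is the context a pre-aggregated metric would have thrown away.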

Another talk later in the day, from Dave Josephsen, highlighted the challenge of querying data cost-effectively using Athena. Dave talked about moving a huge amount of log data into Amazon S3 and then, to get more efficient queries without spending huge amounts of money, transforming it further into Parquet files. The downside, of course, is that you now have an additional transformation step for your data. Presto and Athena also can’t take advantage of any kind of search index on the data. That’s where it’s cool to talk with the attendees here about what we are building at CHAOSSEARCH: we bring the power of Elasticsearch APIs and text search to your data in Amazon S3.

The rest of the Day 2 talks centered on different ways to enable engineers to monitor and manage their applications in production, which many say is a core requirement for effective DevOps. Finally, a series of interesting lightning talks (5-minute presentations) closed out a wonderful Day 2 of the event.

Day 3

In the early days of the Monitorama conference, the final day of the show was traditionally used for folks to show off new and interesting open source projects in longer “workshop” type sessions. As the years have progressed, the pace at which new open source monitoring tools are created has slowed dramatically. The organizers still make sure that most of the talks on the last day relate to open source monitoring tools and demos, but they keep the talks to a more standard 30-minute length.

This year we got a few more talks related to distributed tracing, and the Grafana open source team showed off Loki, a log aggregation system that, instead of indexing the contents of your logs, takes the same approach that Prometheus does and simply lets you attach tags (Loki calls them labels) that group log lines into streams. The upside of using tags is that you’re no longer indexing your data; the downside is that, with no index over the log contents, you can no longer run indexed searches across your logs. It’s a very early project, but it will be interesting to see how well it’s adopted in the community.
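The tradeoff described above can be sketched with a toy model. This is not Loki's implementation or API; the `LabelIndexedLogs` class and its methods are hypothetical. Only the tag sets are indexed, so selecting streams by tag is cheap, while finding text inside the lines means scanning every line in the selected streams.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterator, List, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


class LabelIndexedLogs:
    """Toy log store that indexes tag sets only, never the log contents."""

    def __init__(self) -> None:
        # One stream (list of raw lines) per unique set of tags.
        self._streams: Dict[LabelSet, List[str]] = defaultdict(list)

    def push(self, labels: Dict[str, str], line: str) -> None:
        """Append a line to the stream identified by its tags."""
        self._streams[frozenset(labels.items())].append(line)

    def select(self, selector: Dict[str, str]) -> Iterator[str]:
        """Yield lines from every stream whose tags include the selector."""
        wanted = set(selector.items())
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:
                yield from lines

    def grep(self, selector: Dict[str, str], needle: str) -> List[str]:
        """Brute-force text match: scans each selected line, no text index."""
        return [line for line in self.select(selector) if needle in line]


if __name__ == "__main__":
    store = LabelIndexedLogs()
    store.push({"app": "api", "env": "prod"}, "GET /orders 200")
    store.push({"app": "api", "env": "prod"}, "upstream timeout on /orders")
    store.push({"app": "worker", "env": "prod"}, "job 17 timeout")
    print(store.grep({"app": "api"}, "timeout"))
```

Narrow selectors keep the scan small, which is why the choice of tags matters so much in this design.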

We also heard a few talks on Day 3 that did not show off new monitoring tools but instead addressed the human side of monitoring. The speakers did a great job reminding us that there are still humans and complex organizations behind all of those graphs and alerts. That such a wonderfully diverse group of talks can be presented at a single-track conference shows just how special an event the organizers have created. In the end, Monitorama 2019 was yet again an amazing event, and it continues to be the place to be for anyone interested in monitoring and observability.