Post-mortem about system outage on 2018-04-03

Andreas · April 2, 2018, 10:56pm

Problem

During the last few weeks, we found the system a little greedy on memory consumption, so we had to do some additional hand-holding from time to time. We experienced similar behavior tonight and this time had the opportunity to analyze the root cause.

Reason

Under certain circumstances we don’t fully understand yet, the system triggers a message loop on the MQTT bus. After that, all services consuming messages from the bus start suffering badly. It looks like this error becomes more likely to be triggered as the acquisition system receives more traffic.

Root cause

We traced the root cause back to a feature released with Kotori 0.20.0 and enabled on our platform in May 2017. Funnily enough, this feature is actually about error handling.

Under certain conditions, an invalid message received from the MQTT bus kicked off the message loop, and things obviously spiraled out of control.
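To illustrate how an error-reporting feature can turn into a message loop, here is a minimal sketch. It is not Kotori's code and does not claim to reproduce the exact mechanism, which we have not described in detail here. It assumes a hypothetical topic layout where error reports are published to an `error.json` topic next to the data topic, and where the consumer also receives messages from that topic. `ToyBus`, `error_topic_for`, `make_handler` and `simulate` are made-up names, and the bus is an in-memory stand-in for a real MQTT broker.

```python
import json
from collections import deque

ERROR_SUFFIX = "error.json"  # hypothetical topic suffix, chosen for this sketch


class ToyBus:
    """In-memory stand-in for an MQTT broker with a single consumer."""

    def __init__(self, max_deliveries=50):
        self.queue = deque()
        self.delivered = 0
        self.max_deliveries = max_deliveries  # safety cap so the demo terminates

    def publish(self, topic, payload):
        self.queue.append((topic, payload))

    def run(self, handler):
        # Deliver queued messages until the queue is empty or the cap is hit.
        while self.queue and self.delivered < self.max_deliveries:
            topic, payload = self.queue.popleft()
            self.delivered += 1
            handler(self, topic, payload)
        return self.delivered


def error_topic_for(topic):
    # "a/b/data.json" -> "a/b/error.json"; an error topic maps onto itself.
    return topic.rsplit("/", 1)[0] + "/" + ERROR_SUFFIX


def make_handler(guard):
    def handler(bus, topic, payload):
        # With the guard enabled, never process our own error reports.
        if guard and topic.endswith("/" + ERROR_SUFFIX):
            return
        try:
            json.loads(payload)
        except ValueError as exc:
            # The plain-text error report is itself not valid JSON, so an
            # unguarded consumer will fail on it and report an error again.
            bus.publish(error_topic_for(topic), "Decoding failed: %s" % exc)
    return handler


def simulate(guard, max_deliveries=50):
    bus = ToyBus(max_deliveries)
    bus.publish("testdrive/node-1/data.json", "{not valid json")
    return bus.run(make_handler(guard))
```

In this toy setup, `simulate(guard=False)` only stops because of the delivery cap, while `simulate(guard=True)` handles the invalid message, emits one error report and then goes quiet. A real fix may look quite different, for example by validating topics more strictly or by not subscribing to error topics at all.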

Mitigation

We just disabled the “Error signalling over MQTT” feature (code) completely until further notice. We will enable it again when it’s back from the garage.

Final words

Building message loops is one of the fine arts when running bus or network systems on a shared medium, so we are happy to finally be part of the family ;]! It is really strange that this hasn’t happened before, as we have been running the system in this configuration for almost a year now. However, we are always happy to catch such edge cases, as they let us add more robustness to the system, which we will do with the next software release.

We are sorry for any inconvenience this might have caused you.

With kind regards,
the people of Hiveeyes.

Andreas · April 4, 2018, 2:09pm

A minor update on this

It looks like the troubles we had were also related to the “Instant Dashboard” feature in some way. So it was probably a combination of the error signalling and the instant dashboard features that eventually triggered the teardown in some edge case. We are still investigating the circumstances.

Mitigation

As a countermeasure, we also turned off the instant dashboard feature. We are already working on a new release of Kotori with improved robustness, which is coming soon.

Impact on the platform

We ask newcomers to our system, as well as people wanting to submit measurement data on the "testdrive" channels, for their patience until we have restored the convenience features. Currently, the system will neither create instant dashboards when receiving data on a new acquisition channel nor report any errors that happen while processing ingress data over MQTT.

Relax

The core data acquisition functionality is not impaired by having disabled these two subsystems. The recently introduced monitoring probes tell us everything is fine - SNAFU.

Have fun!

Andreas · April 8, 2018, 10:23pm

News

We identified the root causes of both problems and tried to address them properly. For everyone who dares to have another look under the hood, the related amendments to the code base are:

MQTT error signalling robustness

https://github.com/hiveeyes/kotori/commit/c6798fb8

Grafana 5 compatibility for “Instant Dashboard” feature

They will be shipped to the platform with the next release, probably later tonight.

Andreas · April 11, 2018, 10:03pm

We have shipped the updates required to mitigate the problems from this hiccup. You might want to follow up at Kotori release 0.21.1.

Have fun.