Over the last few weeks, the system has been a little greedy with memory, so we had to do some additional hand-holding from time to time. We saw similar behavior tonight, and this time we had the opportunity to analyze the root cause.
Reason
In circumstances we don't fully understand yet, the system triggers a message loop on the MQTT bus. Once that happens, all services consuming messages from the bus start suffering badly. The error appears to become more likely as the acquisition system receives more traffic.
Root cause
We traced the root cause back to a feature released with Kotori 0.20.0 and enabled on our platform in May 2017. Funnily enough, this feature is actually about error handling.
Under certain conditions, an invalid message received from the MQTT bus kicked off the message loop, and things spiraled out of control from there.
Mitigation
We disabled the “Error signalling over MQTT” feature (code) completely until further notice. We will enable it again when it's back from the garage.
Final words
Building message loops is one of the fine arts of running bus or network systems on a shared medium, so we are happy to finally be part of the family ;]! It is really strange that this hasn't happened before, as we have been running the system in this configuration for almost a year now. However, we are always happy to catch such edge cases so we can make the system more robust, as we will do with the next software release.
We are sorry for any inconvenience this might have caused you.
It looks like the troubles we had were also related to the “Instant Dashboard” feature in some way. So it was probably a combination of the error signalling and the instant dashboard features that eventually triggered the breakdown in some edge case. We are still investigating the circumstances.
Mitigation
As a countermeasure, we also turned off the instant dashboard feature. We are already working on a new release of Kotori with improved robustness, which is coming soon.
Impact on the platform
We ask newcomers to our system, as well as people wanting to submit measurement data on the "testdrive" channels, for their patience until we have restored the convenience features. The system currently will neither create instant dashboards when receiving data on a new acquisition channel nor report any errors that happen while processing ingress data over MQTT.
Relax
The core data acquisition functionality is not impaired by disabling these two subsystems. The recently introduced monitoring probes tell us everything is fine - SNAFU.
We identified the root causes of both problems and tried to address them properly. For everyone who dares to have another look under the hood, the related amendments to the code base are: