System outage on October 3, 2018



Problem

Our host machine “cirrus” experienced a reboot on Wed Oct 3 11:55:55 2018 CEST.

Mitigation

We started the DAQ machines “elbanco” and “eltiempo” again around 12:56 CEST.

Post-mortem

  • The cause and circumstances of the reboot are not known yet. There are no traces in the log files or anywhere else. We believe it was due to a sudden power loss.
  • The sensor probe notifications reported by the monitoring system were not received in time because my aNag application had frozen, which occurred just today for the very first time. Bummer!
  • Some data channels lacked recent data after the system had recovered. Making an educated guess that this might be related to the issue described in “Imkerliche Daten nicht im Grafana sichtbar” (“Beekeeping data not visible in Grafana”), we didn’t hesitate to follow “Repair InfluxDB TSI index files”, which instantly solved the issue.
  • We are still busy restoring some auxiliary services which are not reboot-safe yet. We will improve this gradually by adding appropriate sysvinit/systemd wrappers, a task which is long overdue.
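
To illustrate the last point, here is a minimal sketch of what a systemd wrapper for one such service could look like, rendered from a small Python helper. The service name, binary path and user are hypothetical placeholders and do not describe any of the actual auxiliary services; `render_unit` is an illustrative helper, not an existing tool.

```python
from textwrap import dedent


def render_unit(name: str, exec_start: str, user: str = "daemon") -> str:
    """Render a minimal systemd service unit as text.

    All arguments are hypothetical placeholders for illustration.
    """
    return dedent(f"""\
        [Unit]
        Description={name}
        # Start only after the network is up (assumption: the service needs it).
        After=network-online.target
        Wants=network-online.target

        [Service]
        User={user}
        ExecStart={exec_start}
        # Bring the process back if it exits with an error.
        Restart=on-failure
        RestartSec=5

        [Install]
        # Once enabled, the unit is started on every boot.
        WantedBy=multi-user.target
        """)


if __name__ == "__main__":
    # Hypothetical service name and command line.
    print(render_unit("aux-service", "/usr/local/bin/aux-service --foreground"))
```

Such a unit would typically be saved as `/etc/systemd/system/aux-service.service` and activated with `systemctl enable --now aux-service`, so that it comes back on its own after the next reboot.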

All in all, a real Murphy. Sorry for any inconvenience this might have caused you. Have fun and enjoy some good music.