Monitoring in-flight data and the whole database for freshness

Andreas · March 31, 2018, 6:03pm

Proposal

While we are at it, let's think about how to improve the monitoring of our systems so that future data loss events get detected.

Status quo

Monitoring in-flight data

In the past, we focused on the individual beekeeper and built sensor infrastructure for detecting data loss on individual data acquisition channels, see also:

This currently works by watching individual MQTT bus topics and deducing events from detected anomalies.

Problem

We still lack good overall visibility into many parts of the data acquisition system. To overcome this, we want to start by improving the situation around isolated vs. federated data, see Datenmischwerk.

Improvements

Monitoring the whole database for freshness

Let’s just start with a minimal spike suitable for monitoring InfluxDB databases for freshness of data. It should be able to work as a sensor for Icinga2.

To minimize complexity for a quick solution, we will send HTTP requests to the Grafana API in order to query the InfluxDB database. By checking the JSON response, we can see whether any measurements have arrived since a defined point in the past:

# Address of Grafana API endpoint
datasource=https://luftdaten.getkotori.org/grafana/api/datasources/proxy/2/query

# InfluxDB query
query='SELECT * FROM earth_43_sensors WHERE time > now() - 5m LIMIT 1'

# Is there any data for the given query?
http $datasource db==luftdaten_testdrive q=="$query" | jq '.results[0].series != null'

# The result
false

This example yields “false” under data loss conditions, indicating that no data has arrived for more than five minutes. It uses the fine programs HTTPie and jq.
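For readers who prefer a script over a shell pipeline, the same check could be sketched in Python roughly as follows. This is only an illustration using the standard library, not the code of any released sensor; the function names `build_query`, `has_fresh_data` and `check_freshness` are made up for this example.

```python
import json
import urllib.parse
import urllib.request


def build_query(table: str, max_age: str) -> str:
    """Build an InfluxQL query returning at most one record newer than max_age."""
    return f"SELECT * FROM {table} WHERE time > now() - {max_age} LIMIT 1"


def has_fresh_data(response: dict) -> bool:
    """Mirror the jq filter '.results[0].series != null' on a decoded response."""
    results = response.get("results") or [{}]
    return results[0].get("series") is not None


def check_freshness(datasource: str, database: str, table: str, max_age: str) -> bool:
    """Query InfluxDB through the Grafana datasource proxy (needs network access)."""
    params = urllib.parse.urlencode({"db": database, "q": build_query(table, max_age)})
    with urllib.request.urlopen(f"{datasource}?{params}", timeout=10) as reply:
        return has_fresh_data(json.load(reply))
```

Calling `check_freshness(datasource, "luftdaten_testdrive", "earth_43_sensors", "5m")` with the endpoint address from above would then correspond to the HTTPie call, assuming the endpoint is reachable without authentication.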

Andreas · April 1, 2018, 7:39am

We just released the monitoring-check-grafana 0.2.0 sensor probe for detecting data loss conditions. It works as a plugin for Icinga2. Have fun!

Andreas · April 1, 2018, 7:32am

Hi again,

this is rather technical, but nevertheless you might be interested.

Following the proposal about monitoring the whole database for freshness to detect data loss or other dropout conditions of feeds into different datasources, we added an appropriate monitoring sensor to our setup, which currently probes a spot sample of some production datasources.

By querying a table with appropriate time constraints, the sensor is able to detect when a database table that normally receives new records at regular intervals has become stale.

The source code is available at

Enjoy!

Andreas · April 1, 2018, 9:47am

As this is nearly to-the-glass monitoring, probing the very same Grafana API endpoints the frontend uses for fetching metric data just before rendering it to the display, you are welcome to ping us if you want us to enable such an end-to-end probe for your personal data acquisition channels, so that you get notified by email on data loss. We are now able to configure such a thing in a few minutes.

clemens · April 1, 2018, 3:11pm

Hi Andreas, I got some emails with the subject “Grafana datasource freshness”, so I assume you have configured this. We have sensors with different update intervals: some send every hour, some every 10 minutes. Do you have to configure this by hand, or is there automatic logic behind it, e.g. the first data arrived about every hour, so that interval is picked up as the default?

Andreas · April 1, 2018, 4:01pm

Dear Clemens,

It’s not a self-learning system yet, as far as we can tell ;], sorry.
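Just to illustrate what such an automatic default might look like, here is a purely hypothetical sketch: it takes the median gap between recent readings as the expected interval and multiplies it by a tolerance factor. The sensor does not do this today; `estimate_interval`, `suggest_warning_threshold` and the factor of 2 are assumptions made up for this example.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List


def estimate_interval(timestamps: List[datetime]) -> timedelta:
    """Return the median gap between consecutive readings."""
    if len(timestamps) < 2:
        raise ValueError("need at least two readings to estimate an interval")
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(ordered, ordered[1:])]
    return timedelta(seconds=median(gaps))


def suggest_warning_threshold(timestamps: List[datetime], tolerance: float = 2.0) -> timedelta:
    """Suggest a staleness threshold as a multiple of the estimated interval."""
    return estimate_interval(timestamps) * tolerance
```

The median is used here instead of the mean so that a single long dropout in the history does not inflate the estimated interval.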

The Icinga2 configuration object looks like this:

# ========
# Hiveeyes
# ========

# Monitoring sensors for checking a Grafana datasource against data becoming stale
# https://github.com/daq-tools/monitoring-check-grafana

# Feed from Open Hive Teststand
# https://swarm.hiveeyes.org/grafana/d/000000217/open-hive-teststand
object Service "Grafana datasource freshness for Open Hive Teststand" {
  import "generic-service"
  check_command         = "check-grafana-datasource-stale"

  host_name             = "elbanco.hiveeyes.org"
  vars.sla              = "24x7"

  # Configure sensor here
  vars.grafana_uri      = "https://swarm.hiveeyes.org/grafana/api/datasources/proxy/XXX/query"
  vars.grafana_database = "hiveeyes_open_hive_test42"
  vars.grafana_table    = "default_1_sensors"
  vars.grafana_warning  = "1h"
  vars.grafana_critical = "2d"

  vars.notification.mail.users += [ "clemens-gruber" ]
}

You will get the idea.
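To make the two thresholds a bit more tangible, here is a minimal Python sketch of how values like "1h" and "2d" could be turned into a check result. It assumes that data older than the warning threshold yields WARNING and data older than the critical threshold yields CRITICAL, using the usual plugin exit codes 0, 1 and 2. This is an illustration of the idea only, not code taken from the plugin; `parse_duration` and `freshness_state` are names invented for this example.

```python
import re
from datetime import datetime, timedelta

# Duration suffixes as used in the thresholds above, mapped to timedelta keywords.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}

# Conventional exit codes for Icinga/Nagios-style check plugins.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_duration(text: str) -> timedelta:
    """Convert a string like '1h' or '2d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{UNITS[unit]: int(amount)})


def freshness_state(last_record: datetime, now: datetime, warning: str, critical: str) -> int:
    """Classify the age of the most recent record against both thresholds."""
    age = now - last_record
    if age > parse_duration(critical):
        return CRITICAL
    if age > parse_duration(warning):
        return WARNING
    return OK
```

With the settings shown above, a feed whose last record is five hours old would come out as WARNING under this reading, and one that has been silent for three days as CRITICAL.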

Actually, the main goal of this sensor machinery was to have at least one stable telemetry client as a reference for end-to-end monitoring of the whole telemetry and backend domain. That’s why we chose your “Teststand”, thanks! So, if the system receives fresh data on this channel and can prove it, we can be fairly confident that the overall system works.

We added you to the list of notification recipients for your convenience. Please let us know if you want to opt out. Otherwise, feel free to send us requests for monitoring other individual data acquisition channels on your behalf.

Cheers!