Comment 3 for bug 504696

Stuart Bishop (stub) wrote :

Instead of querying the _sl.sl_status table, let's query a snapshot of it.

Create a simple script, lagmon.py, that mirrors this table to the ReplicationStatus table, along with the snapshot timestamp. It will do this every 5 seconds.

This script needs to be smart enough to reconnect when the database dies.
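A minimal sketch of what the refresh loop with reconnection might look like. The names run_lagmon, connect and refresh_snapshot are hypothetical and not part of any existing code; the database access is passed in as callables so the sketch makes no assumption about the driver or about the ReplicationStatus schema.

```python
import time


def run_lagmon(connect, refresh_snapshot, interval=5.0,
               max_iterations=None, sleep=time.sleep):
    """Refresh the snapshot every `interval` seconds, reconnecting on failure.

    connect          -- hypothetical callable returning a new connection
    refresh_snapshot -- hypothetical callable that copies _sl.sl_status into
                        ReplicationStatus using the given connection
    max_iterations   -- None means loop forever (only set for testing)
    """
    conn = None
    done = 0
    while max_iterations is None or done < max_iterations:
        try:
            if conn is None:
                conn = connect()
            refresh_snapshot(conn)
        except Exception:
            # Assumption: any failure means the connection is unusable.
            # A real script would probably catch only the driver's own
            # error class and log the failure.
            if conn is not None:
                try:
                    conn.close()
                except Exception:
                    pass
            conn = None  # forces a reconnect on the next pass
        done += 1
        sleep(interval)
```

A failed refresh simply leaves the old snapshot in place, so the loop never needs to do anything special when the database is unavailable.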

This script needs to be started on boot.

The replication_lag() and replication_lag(integer) stored procedures should be altered to query this mirrored table. Lag will be the mirrored lag, plus (CURRENT_TIMESTAMP - snapshot_timestamp).
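The lag arithmetic, sketched in Python purely for illustration (the real change would live in the stored procedures). The function name effective_lag and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def effective_lag(mirrored_lag, snapshot_timestamp, now=None):
    """Return the mirrored lag plus the age of the snapshot.

    mirrored_lag       -- timedelta copied from the snapshot
    snapshot_timestamp -- timezone-aware datetime when the snapshot was taken
    now                -- stands in for CURRENT_TIMESTAMP; defaults to UTC now
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return mirrored_lag + (now - snapshot_timestamp)


# Example: a snapshot taken 5 seconds ago that recorded 3 seconds of lag
# is reported as 8 seconds of lag.
taken = datetime(2010, 1, 8, 12, 0, 0, tzinfo=timezone.utc)
lag = effective_lag(timedelta(seconds=3), taken,
                    now=taken + timedelta(seconds=5))
```

Because the snapshot age is always added, the reported value can only overestimate the true lag, and it keeps growing for as long as the snapshot is not refreshed.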

If querying the replication status becomes slow for whatever reason, it only slows down how often lagmon.py can refresh the snapshot. All the existing systems keep going fine, using the worst potential lag. If lagmon.py crashes or is killed, everything will just see the lag increase and cope as normal.

We may want our Nagios monitoring to query the real _sl.sl_status table directly (at the moment it uses the replication_lag() stored procedures, like the appservers do), and to monitor lagmon.py's output separately.