Replication lag checks can block

Bug #504696 reported by Stuart Bishop
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Stuart Bishop
Milestone: 10.01

Bug Description

For various perfectly normal reasons, querying the _sl.sl_status view can get slow. This view is what replication_lag() uses. On production we have seen cases where dozens of appservers query this information in its slow state, causing timeouts and leading the load balancers to conclude the appservers have died.

The preferred approach would be a less intrusive way of querying database lag, which would also be faster.

The alternative is to set the statement timeout to something small, like 0.25 or 0.5 seconds, before doing the lag checks. If the timeout fires, we can assume we are lagged and should raise a Retry exception to run the request in master-only mode.
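The statement-timeout fallback could look roughly like this. This is a sketch, not Launchpad's actual code: the Retry exception, function name, and DB-API cursor wiring are all illustrative assumptions.

```python
class Retry(Exception):
    """Hypothetical exception telling the publisher to retry the
    request in master-only mode."""


def check_lag_with_timeout(cur, timeout_ms=250):
    """Query replication_lag() under a short statement timeout.

    If the lag query itself is cancelled by the timeout, assume we are
    badly lagged and raise Retry so the request runs against the
    master only.
    """
    cur.execute("SET statement_timeout = %d" % timeout_ms)
    try:
        cur.execute("SELECT replication_lag()")
        return cur.fetchone()[0]
    except Exception as e:
        # PostgreSQL cancels the statement with a "statement timeout"
        # message; a real implementation would catch the driver's
        # specific QueryCanceled exception instead of string-matching.
        if "statement timeout" in str(e):
            raise Retry("lag check timed out; assuming we are lagged")
        raise
    finally:
        # In real code the aborted transaction would need a rollback
        # before this RESET can succeed.
        cur.execute("RESET statement_timeout")
```

The short timeout bounds the damage: a slow _sl.sl_status query costs at most a quarter second per request instead of hanging dozens of appservers.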

Related branches

Revision history for this message
Tom Haddon (mthaddon) wrote :

We've now been hit twice by this today, 10:14 UTC, 11:18 UTC.

The following query can help diagnose the problem - if you see a lot of replication_lag queries, that's a problem:

select current_query from pg_stat_activity where usename like 'lpnet%';

Changed in launchpad-foundations:
importance: Undecided → Critical
Revision history for this message
Tom Haddon (mthaddon) wrote :

And again 13:44 UTC

Revision history for this message
Stuart Bishop (stub) wrote :

Instead of querying the _sl.sl_status view directly, let's query a snapshot of it.

Create a simple script, lagmon.py, that mirrors this table to the ReplicationStatus table, along with the snapshot timestamp. It will do this every 5 seconds.

This script needs to be smart enough to reconnect when the database dies.

This script needs to be started on boot.
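The refresh loop in lagmon.py could be sketched as below. The connection factory, SQL, and column layout are illustrative assumptions; the key behaviours from the comments above are refreshing every 5 seconds and surviving database restarts by reconnecting.

```python
# Sketch of lagmon.py's main loop (assumed wiring; not the real script).
import time

# Illustrative refresh: replace the snapshot with the current
# _sl.sl_status contents plus the time we took it.
REFRESH_SQL = """
    DELETE FROM ReplicationStatus;
    INSERT INTO ReplicationStatus
        SELECT CURRENT_TIMESTAMP AS snapshot_timestamp, *
        FROM _sl.sl_status;
"""


def run(connect, interval=5, sleep=time.sleep):
    """Mirror _sl.sl_status every `interval` seconds.

    If the database dies, drop the connection and reconnect on the
    next tick rather than crashing.
    """
    conn = None
    while True:
        try:
            if conn is None:
                conn = connect()
            cur = conn.cursor()
            cur.execute(REFRESH_SQL)
            conn.commit()
        except Exception:
            # Database went away; retry with a fresh connection.
            conn = None
        sleep(interval)
```

If the loop is ever too slow or stops entirely, the snapshot simply ages and consumers see a growing lag, which is the fail-safe behaviour described below.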

The replication_lag() and replication_lag(integer) stored procedures should be altered to query this mirrored table. Lag will be the mirrored lag, plus (CURRENT_TIMESTAMP - snapshot_timestamp).
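The actual change would be in the stored procedures, but the adjusted lag arithmetic can be expressed in Python for clarity (function and parameter names are illustrative):

```python
from datetime import datetime, timedelta, timezone


def effective_lag(mirrored_lag, snapshot_timestamp, now=None):
    """Lag as seen through the snapshot: the mirrored lag plus the
    age of the snapshot itself (CURRENT_TIMESTAMP - snapshot_timestamp).

    This deliberately over-reports: a stale snapshot makes the lag
    look worse, never better.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return mirrored_lag + (now - snapshot_timestamp)
```

Because the snapshot's age is always added, consumers see a worst-case bound on lag rather than an optimistic stale value.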

When querying the replication status gets slow for whatever reason, it only slows down how quickly lagmon.py can refresh the snapshot. All the existing systems keep going fine, using the worst-case potential lag. If lagmon.py crashes or is killed, everything just sees the lag increase and copes as normal.

We may want our Nagios monitoring systems to use the real _sl.sl_status view (at the moment they use the replication_lag() stored procedures, like the appservers), and to monitor lagmon.py's output separately.

Revision history for this message
Stuart Bishop (stub) wrote :

I've thought about an alternative and potentially more efficient implementation:

- lagmon.py as above, but stuffing the results into memcache rather than a DB table.
- Appservers query memcached for lag, falling back to the replication_lag() stored procedures if memcache is down for any reason.
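The memcache-first lookup with a database fallback could be sketched as follows; the key name and the way the clients are passed in are hypothetical, not Launchpad's actual API:

```python
def get_replication_lag(mc_get, db_lag, node):
    """Return the lag for slave `node`.

    Try memcached first (via `mc_get`); if memcached is down or the
    key is missing, fall back to the replication_lag() stored
    procedure (via `db_lag`).
    """
    try:
        cached = mc_get("replication-lag:%d" % node)
    except Exception:
        # memcached unreachable: fall back to the database.
        cached = None
    if cached is not None:
        return cached
    return db_lag(node)
```

The fallback keeps behaviour identical to today's when memcached is unavailable, at the cost of the DB round trip the cache was meant to avoid.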

The problem with this approach is that the appservers still need to query the database to determine their slave node number, which could potentially change between requests once database load balancers and config file reloading are brought into the equation.

Gary Poster (gary)
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Stuart Bishop (stub)
Changed in launchpad-foundations:
status: New → In Progress
milestone: none → 10.01
Revision history for this message
Diogo Matsubara (matsubara) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released