Replication lag checks can block

Bug #504696 reported by Stuart Bishop
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Stuart Bishop
Milestone: 10.01

Bug Description

For various perfectly normal reasons, querying the _sl.sl_status view can get slow. This view is what replication_lag() uses. On production we have seen cases where dozens of appservers query this information in its slow state, causing timeouts and leading the load balancers to conclude the appservers have died.

The preferred approach would be a less intrusive way of querying database lag, which would also be faster.

The alternative is to set the statement timeout to something small, like 0.25 or 0.5 seconds, before doing the lag checks. If the timeout fires, we can assume we are lagged and should raise a Retry exception to run the request in master-only mode.
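The statement-timeout fallback could look roughly like this. This is a sketch, not Launchpad's actual code: the Retry exception, function name, and DB-API cursor wiring are all illustrative assumptions.

```python
class Retry(Exception):
    """Hypothetical exception telling the publisher to retry the
    request in master-only mode."""


def check_lag_with_timeout(cur, timeout_ms=250):
    """Query replication_lag() under a short statement timeout.

    If the lag query itself is cancelled by the timeout, assume we are
    badly lagged and raise Retry so the request runs against the
    master only.
    """
    cur.execute("SET statement_timeout = %d" % timeout_ms)
    try:
        cur.execute("SELECT replication_lag()")
        return cur.fetchone()[0]
    except Exception as e:
        # PostgreSQL cancels the statement with a "statement timeout"
        # message; a real implementation would catch the driver's
        # specific QueryCanceled exception instead of string-matching.
        if "statement timeout" in str(e):
            raise Retry("lag check timed out; assuming we are lagged")
        raise
    finally:
        # In real code the aborted transaction would need a rollback
        # before this RESET can succeed.
        cur.execute("RESET statement_timeout")
```

The short timeout bounds the damage: a slow _sl.sl_status query costs at most a quarter second per request instead of hanging dozens of appservers.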

Related branches

Revision history for this message
Tom Haddon (mthaddon) wrote :

We've now been hit twice by this today, 10:14 UTC, 11:18 UTC.

The following query can help diagnose the problem - if you see a lot of replication_lag queries, that's a problem:

select current_query from pg_stat_activity where usename like 'lpnet%';

Changed in launchpad-foundations:
importance: Undecided → Critical
Revision history for this message
Tom Haddon (mthaddon) wrote :

And again 13:44 UTC

Revision history for this message
Stuart Bishop (stub) wrote :

Instead of querying the _sl.sl_status view directly, let's query a snapshot of it.

Create a simple script, lagmon.py, that mirrors this table to the ReplicationStatus table, along with the snapshot timestamp. It will do this every 5 seconds.

This script needs to be smart enough to reconnect when the database dies.

This script needs to be started on boot.
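The refresh loop in lagmon.py could be sketched as below. The connection factory, SQL, and column layout are illustrative assumptions; the key behaviours from the comments above are refreshing every 5 seconds and surviving database restarts by reconnecting.

```python
# Sketch of lagmon.py's main loop (assumed wiring; not the real script).
import time

# Illustrative refresh: replace the snapshot with the current
# _sl.sl_status contents plus the time we took it.
REFRESH_SQL = """
    DELETE FROM ReplicationStatus;
    INSERT INTO ReplicationStatus
        SELECT CURRENT_TIMESTAMP AS snapshot_timestamp, *
        FROM _sl.sl_status;
"""


def run(connect, interval=5, sleep=time.sleep):
    """Mirror _sl.sl_status every `interval` seconds.

    If the database dies, drop the connection and reconnect on the
    next tick rather than crashing.
    """
    conn = None
    while True:
        try:
            if conn is None:
                conn = connect()
            cur = conn.cursor()
            cur.execute(REFRESH_SQL)
            conn.commit()
        except Exception:
            # Database went away; retry with a fresh connection.
            conn = None
        sleep(interval)
```

If the loop is ever too slow or stops entirely, the snapshot simply ages and consumers see a growing lag, which is the fail-safe behaviour described below.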

The replication_lag() and replication_lag(integer) stored procedures should be altered to query this mirrored table. Lag will be the mirrored lag, plus (CURRENT_TIMESTAMP - snapshot_timestamp).
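The actual change would be in the stored procedures, but the adjusted lag arithmetic can be expressed in Python for clarity (function and parameter names are illustrative):

```python
from datetime import datetime, timedelta, timezone


def effective_lag(mirrored_lag, snapshot_timestamp, now=None):
    """Lag as seen through the snapshot: the mirrored lag plus the
    age of the snapshot itself (CURRENT_TIMESTAMP - snapshot_timestamp).

    This deliberately over-reports: a stale snapshot makes the lag
    look worse, never better.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return mirrored_lag + (now - snapshot_timestamp)
```

Because the snapshot's age is always added, consumers see a worst-case bound on lag rather than an optimistic stale value.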

When querying the replication status gets slow for whatever reason, it only slows down how quickly lagmon.py can refresh the snapshot. All the existing systems keep going fine, using the worst-case potential lag. If lagmon.py crashes or is killed, everything just sees the lag increase and copes as normal.

We may want our Nagios monitoring systems to use the real _sl.sl_status view (at the moment they use the replication_lag() stored procedures, like the appservers), and to monitor lagmon.py's output separately.

Revision history for this message
Stuart Bishop (stub) wrote :

I've thought about an alternative and potentially more efficient implementation:

- lagmon.py as above, but stuffing the results into memcache rather than a DB table.
- Appservers query memcached for lag, falling back to the replication_lag() stored procedures if memcache is down for any reason.
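The memcache-first lookup with a database fallback could be sketched as follows; the key name and the way the clients are passed in are hypothetical, not Launchpad's actual API:

```python
def get_replication_lag(mc_get, db_lag, node):
    """Return the lag for slave `node`.

    Try memcached first (via `mc_get`); if memcached is down or the
    key is missing, fall back to the replication_lag() stored
    procedure (via `db_lag`).
    """
    try:
        cached = mc_get("replication-lag:%d" % node)
    except Exception:
        # memcached unreachable: fall back to the database.
        cached = None
    if cached is not None:
        return cached
    return db_lag(node)
```

The fallback keeps behaviour identical to today's when memcached is unavailable, at the cost of the DB round trip the cache was meant to avoid.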

The problem with this approach is that the appservers still need to query the database to determine their slave node number, which could potentially change between requests once database load balancers and config file reloading are brought into the equation.

Gary Poster (gary)
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Stuart Bishop (stub)
Changed in launchpad-foundations:
status: New → In Progress
milestone: none → 10.01
Revision history for this message
Diogo Matsubara (matsubara) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released