Old journal files not deleted

Bug #1028016 reported by Peter Beaman on 2012-07-23
This bug affects 2 people
Affects: Akiban Persistit
Importance: Critical
Assigned to: Peter Beaman

Bug Description

At a design partner site the journal grew to 100 files, dating back at least a week. It should have been trimmed to no more than 10 files. This report is a concise restatement of the initial bug, reported as #1027104.

The first remaining journal file is akiban_journal.000000000252.
The JournalManagerMXBean looks like this:

Closed = false;
CurrentTimestamp = 1941180015;
BlockSize = 1000000000;
AppendOnly = false;
PageMapSize = 2;
JournalFilePath = /mnt/akiban_data_dir/akiban_journal;
LastValidCheckpointTimestamp = 1941178415;
BaseAddress = 252000012130;
JournalCreatedTime = 1341411912660;
LiveTransactionMapSize = 0;
PageListSize = 2;
CurrentAddress = 352002952537;
CopyingFast = false;
FlushInterval = 100;
CopierInterval = 10000;
RollbackPruningEnabled = true;
Copying = false;
JournaledPageCount = 77;
ReadPageCount = 7;
CopiedPageCount = 94705;
DroppedPageCount = 6;
LastCopierException = null;
LastFlusherException = null;
LastValidCheckpointTimeMillis = 1342807124812;
SlowIoAlertThreshold = 2000;
TotalCompletedCommits = 160;
CommitCompletionWaitTime = 2062;
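
The size of the backlog can be read off the bean values above. The following is a minimal sketch, assuming a journal address maps to a file number by integer division by BlockSize and that the file suffix is the 12-digit zero-padded file number. Both assumptions agree with the values in this report but are not taken from the Persistit source, and the class and method names are hypothetical.

```java
class JournalSpan {
    // Assumption: file number = journal address / block size.
    static long generation(long address, long blockSize) {
        return address / blockSize;
    }

    // Assumption: the file suffix is the file number padded to 12 digits.
    static String fileName(String path, long generation) {
        return String.format("%s.%012d", path, generation);
    }

    public static void main(String[] args) {
        long blockSize = 1_000_000_000L;     // BlockSize
        long base = 252_000_012_130L;        // BaseAddress
        long current = 352_002_952_537L;     // CurrentAddress

        long first = generation(base, blockSize);
        long last = generation(current, blockSize);

        System.out.println(fileName("akiban_journal", first)); // akiban_journal.000000000252
        System.out.println("files retained: " + (last - first + 1)); // files retained: 101
    }
}
```

Under these assumptions BaseAddress still points into file 252 while CurrentAddress is in file 352, which is consistent with roughly 100 journal files remaining on disk.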

Upon investigation we found there was a very old version of a page in the page map for file 252, and even though newer versions of that page were written to the journal and copied, the old version remained. Edited journal dumps for files 252, 253 and 254 are attached.

Peter Beaman (pbeaman) wrote :

Journal info

description: updated
visibility: private → public
Peter Beaman (pbeaman) on 2012-07-23
Changed in akiban-persistit:
milestone: none → 3.1.3
Peter Beaman (pbeaman) on 2012-07-23
Changed in akiban-persistit:
assignee: nobody → Peter Beaman (pbeaman)
status: Confirmed → In Progress
Peter Beaman (pbeaman) wrote :

Ignore previous journal dump - a more complete version is attached.

Peter Beaman (pbeaman) on 2012-07-23
Changed in akiban-persistit:
status: In Progress → Fix Committed
Peter Beaman (pbeaman) wrote :

Found.

This bug was caused by a configuration change in Akiban Server and a poorly handled condition in the JournalManager.

The configuration change was to remove the akiban_txn.v0 volume from the default server configuration file. That volume is a vestige of the old pre-MVCC Persistit transaction mechanism, is no longer needed, and was rightly removed.

However, the system in question began its journal before that configuration change and, as a result, has a copy of page 0 for a volume by that name in the page map. Of note: the file is still present on the host system, but Akiban Server no longer opens the volume.

The final piece of the bug is faulty handling of this situation in JournalManager#readForCopy; a TODO comment there even indicates that something different should be done. In general the JournalManager is defensive. Its first priority is not to lose any data, so when it cannot find a way to write a page to the correct volume, it simply retains the page. That is what it did with the old page in file 252, which remained in the page map while the system wrote over 100 journal files.
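
The effect of a single retained page can be illustrated with a small model. This is only a sketch under stated assumptions, not the JournalManager code: the class, its fields, the volume name akiban_data and the simplification of one page per volume are all hypothetical, and the rule that the journal can be trimmed only up to the oldest address still in the page map is an assumption inferred from the behavior described above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RetainedPageSketch {
    // Hypothetical, simplified page map: volume name -> journal address of
    // the oldest page version not yet copied to that volume.
    final Map<String, Long> pageMap = new HashMap<>();
    final Set<String> openVolumes = new HashSet<>();

    // Copy pages whose volume is open and drop them from the map; retain
    // every other entry so that no data is lost.
    void copyPages() {
        pageMap.keySet().removeIf(openVolumes::contains);
    }

    // Assumption: the journal can be trimmed only up to the oldest address
    // that is still present in the page map.
    long baseAddress(long currentAddress) {
        return pageMap.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(currentAddress);
    }

    public static void main(String[] args) {
        RetainedPageSketch journal = new RetainedPageSketch();
        journal.openVolumes.add("akiban_data");               // hypothetical open volume
        journal.pageMap.put("akiban_data", 352_000_000_000L); // illustrative recent address
        journal.pageMap.put("akiban_txn", 252_000_012_130L);  // stale page 0, volume never opened

        journal.copyPages();
        System.out.println(journal.baseAddress(352_002_952_537L)); // 252000012130
    }
}
```

In this model the base address never advances past the stale entry, no matter how many newer pages are copied.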

To fix this we should do the following:

1. Log an ERROR-level message and trigger an Alert when this situation arises.
2. Add a way to tell Persistit that the volume in question will never exist again, so that Persistit can simply discard its pages.
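
The two steps above might look roughly like this. It is a hedged sketch and not the committed fix: the class, the retiredVolumes set and the alerts counter are hypothetical names, the page map is simplified to one entry per volume, and java.util.logging's SEVERE level stands in for an ERROR-level message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

class CopierFixSketch {
    private static final Logger LOG = Logger.getLogger("CopierFixSketch");

    // Hypothetical, simplified page map: volume name -> journal address.
    final Map<String, Long> pageMap = new HashMap<>();
    final Set<String> openVolumes = new HashSet<>();
    // Hypothetical: volumes declared as never existing again (step 2).
    final Set<String> retiredVolumes = new HashSet<>();
    // Hypothetical stand-in for an Alert mechanism (step 1).
    int alerts = 0;

    void copyPages() {
        pageMap.keySet().removeIf(volume -> {
            if (openVolumes.contains(volume)) {
                return true;  // page copied to its volume
            }
            if (retiredVolumes.contains(volume)) {
                return true;  // page discarded, volume is gone for good
            }
            // Still retain the page, but no longer silently.
            LOG.severe("Cannot copy page, volume not found: " + volume);
            alerts++;
            return false;
        });
    }
}
```

The defensive default is unchanged: a page for an unknown volume is kept unless the volume has been explicitly declared gone.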

In the meantime, the problem can be worked around in Akiban Server. It affects only sites that ran a version of the server before r1801. Adding a configuration parameter to server.properties allows the missing volume to be found, which satisfies readForCopy. The presence of a file named akiban_txn.v0 in the Akiban Server data directory indicates that this might be necessary. Add the following line to server.properties to enable this:

persistit.volume.99=${datapath}/akiban_txn.v0,create,pageSize:16384,initialSize:16K,extensionSize:16K,maximumSize:64K

Other than cluttering the server.properties file, this property has no appreciable cost and does not need to be removed later.
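
Checking whether a site is a candidate for the workaround can be scripted. The sketch below only tests for the akiban_txn.v0 file in a data directory passed on the command line and prints the property line given above; the class name and the command-line convention are assumptions made for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class WorkaroundCheck {
    // The property line from this report, reproduced verbatim.
    static final String PROPERTY_LINE =
            "persistit.volume.99=${datapath}/akiban_txn.v0,create,pageSize:16384,"
            + "initialSize:16K,extensionSize:16K,maximumSize:64K";

    // True if the old transaction volume file is present in the data directory.
    static boolean needsWorkaround(Path dataDir) {
        return Files.isRegularFile(dataDir.resolve("akiban_txn.v0"));
    }

    public static void main(String[] args) {
        // Assumption: the data directory is the first argument, else the current directory.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : ".");
        if (needsWorkaround(dataDir)) {
            System.out.println("akiban_txn.v0 found; consider adding to server.properties:");
            System.out.println(PROPERTY_LINE);
        } else {
            System.out.println("akiban_txn.v0 not found in " + dataDir);
        }
    }
}
```

For the site in this report the directory would be /mnt/akiban_data_dir, the one shown in JournalFilePath.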

Changed in akiban-persistit:
status: Fix Committed → Fix Released