Improved MOVE_EVENT handling (was: Time warp problem)

Bug #778140 reported by Siegfried Gevatter
This bug affects 1 person
Affects             | Status  | Importance | Assigned to        | Milestone
-------------------------------------------------------------------------
Zeitgeist Framework | Triaged | Medium     | Siegfried Gevatter |

Bug Description

MOVE EVENTS
============================================

PRESENTATION

By definition, Zeitgeist's events are immutable, and the subject meta-data
they contain is a snapshot of how a given resource was back when the event
happened.

To be useful, some way of linking event subjects to their physical
representation is needed. The primary identifier for doing this is the
subject's URI.

However, URIs, especially local ones, are transient and may change. To solve
this problem, a new field was added to subjects, and it is special in that
it isn't considered to be immutable. This is the `current_uri' field.

INITIAL IDEA

When a subject is inserted, its `current_uri' field is initially set to the
same value as its `uri' field. When Zeitgeist receives a MOVE_EVENT for that
file (with a coherent timestamp), the value of `current_uri' is updated to
its new file name.

The idea here is that this is done in a way that, if we deleted the
`current_uri' of all subjects and restored them looking at all MOVE_EVENTs
in the database, the result would be the same as before.
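The replay property described above could be sketched as follows. This is only an illustration, not Zeitgeist code: the `Event` class, the `(timestamp, old_uri, new_uri)` tuples used for moves and the function name `rebuild_current_uris` are all hypothetical, and the rule "a move affects events older than itself" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical, minimal stand-in for a stored event with one subject.
    timestamp: int
    uri: str
    current_uri: str = ""


def rebuild_current_uris(events, moves):
    """Recompute every current_uri from scratch using only the stored moves.

    `moves` is an iterable of (timestamp, old_uri, new_uri) tuples
    (an assumed representation of MOVE_EVENTs).
    """
    # Step 1: forget all previous renames.
    for event in events:
        event.current_uri = event.uri
    # Step 2: replay the moves oldest first, so the order in which they
    # were inserted into the database does not matter.
    for move_time, old_uri, new_uri in sorted(moves):
        for event in events:
            if event.timestamp < move_time and event.current_uri == old_uri:
                event.current_uri = new_uri
```

Because the moves are sorted by timestamp before being replayed, inserting them in a different order gives the same `current_uri' values, which is the property the initial idea asks for. The cost is a pass over the events for every move, which is the "rebuilding the database" concern.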

CURRENT IMPLEMENTATION

As of now, `current_uri' is initially set to the same value as `uri'.
Once a MOVE_EVENT is inserted, all events with a timestamp before that of the
move are updated.

However, after the point the MOVE_EVENT has been inserted, it is never
considered again. This is so for performance reasons, since the initial plan
would require pretty much "rebuilding the database".
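A minimal sketch of this one-shot behavior, under assumptions: the `Event` class and `apply_move` function are hypothetical names, and matching on `current_uri' with a strict "older than the move" comparison is a guess at the rule, chosen because it reproduces the table sequence posted in the comments to this bug.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical, minimal stand-in for a stored event with one subject.
    timestamp: int
    uri: str
    current_uri: str = ""

    def __post_init__(self):
        # On insertion, current_uri starts out equal to uri.
        if not self.current_uri:
            self.current_uri = self.uri


def apply_move(events, move_time, old_uri, new_uri):
    """Update already stored events older than the move, exactly once.

    The move is not remembered afterwards, so events inserted later and
    moves arriving out of order are not reconciled.
    """
    for event in events:
        if event.timestamp < move_time and event.current_uri == old_uri:
            event.current_uri = new_uri
```

Applying "move A to T at 35" followed by "move T to L at 15" to events with uri A at timestamps 0, 10, 20 and 50 leaves them with `current_uri' L, L, T and A, which is the inconsistent state shown in the comments.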

PROBLEMS

There are numerous problems with this implementation, at least in theoretical
situations.

One problem is that of events coming in after the MOVE_EVENT (maybe because
the application is batching them). In this case they won't be updated.
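One conceivable way to handle such late events would be to consult the stored moves at insertion time. The sketch below is an assumption, not existing behavior: `resolve_current_uri` is a hypothetical helper and moves are represented as `(timestamp, old_uri, new_uri)` tuples.

```python
def resolve_current_uri(uri, timestamp, moves):
    """Return the current_uri a late-inserted event should receive.

    Follows, oldest first, every stored move that happened after the
    event's own timestamp and whose source matches the URI reached so far.
    """
    current = uri
    for move_time, old_uri, new_uri in sorted(moves):
        if move_time > timestamp and old_uri == current:
            current = new_uri
    return current
```

With moves a.txt -> b.txt at T7 and b.txt -> c.txt at T12 already stored, an event for a.txt with timestamp T5 that arrives late would get c.txt, while one with timestamp T10 would keep a.txt, since no later move starts from a.txt.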

We also have the opposite problem, a MOVE_EVENT coming in late after another
conflicting MOVE_EVENT happened. For instance, we have the following events:
 > T5 a.txt, T10 a.txt, T15 a.txt
We receive a first MOVE_EVENT from a.txt to b.txt with timestamp T7. Now we
have (time / current_uri):
 > T5 a.txt, T10 b.txt, T15 b.txt
Finally, we receive a further MOVE_EVENT from a.txt to c.txt with timestamp T0.
The result is:
 > T5 c.txt, T10 b.txt, T15 b.txt
This is totally inconsistent; the correct result would have been:
 > T5 c.txt, T10 c.txt, T15 b.txt

Further, even if implemented as described in the "initial idea" section, the
concept is flawed: events may be inserted retrospectively using their
already updated URI. This could give rise to further inconsistencies.

PROPOSAL

No clear way to avoid this problem is evident. Maybe the best idea is to
formalize the current behavior by documenting it and requesting that MOVE
and DELETE events be inserted in near real time (for local files).

OUTSTANDING ISSUES

a) Deletion of MOVE_EVENT
What happens upon deletion of a MOVE_EVENT? Should the current_uri changes be reverted?

b) Insertion of other events
When inserting an event, should Zeitgeist check whether a MOVE_EVENT happened for that URI after the event's timestamp, and update it accordingly?

c) Directories
Should the insertion of a MOVE_EVENT with the renaming from "file:///home/user/dir1" to "file:///home/user/dir2" also update all events with uri "file:///home/user/dir1/*" to "file:///home/user/dir2/*"? I think so.
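For point c), the prefix rewrite could look roughly like this. It is a sketch under assumptions: `rewrite_uri` is a hypothetical helper, and it compares plain strings without any URI normalization or unescaping.

```python
def rewrite_uri(uri, old_dir, new_dir):
    """Map `uri` from below `old_dir` to the same place below `new_dir`.

    URIs outside the moved directory are returned unchanged.
    """
    old_base = old_dir.rstrip("/")
    new_base = new_dir.rstrip("/")
    if uri == old_base:
        # The directory itself was renamed.
        return new_base
    # Require the trailing slash so that "dir1" does not match "dir10".
    old_prefix = old_base + "/"
    if uri.startswith(old_prefix):
        return new_base + "/" + uri[len(old_prefix):]
    return uri
```

Matching on the directory name plus a trailing slash, rather than on the bare prefix, keeps a sibling such as "file:///home/user/dir10" from being rewritten by accident.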

SEE ALSO

Related to this, please also check my proposal for improved DELETE_EVENT handling in bug #954206.

Siegfried Gevatter (rainct) wrote :

For 0.8.0 I'll be fixing the MOVE_EVENT handling to only rename events with a timestamp older than that of the move event. The real fix could be postponed for later, or considered for 0.8.0 if there's enough time.

Changed in zeitgeist:
assignee: nobody → Siegfried Gevatter (rainct)
importance: Undecided → Medium
status: New → Triaged
Siegfried Gevatter (rainct) wrote : Re: Time warp problem in MOVE_EVENT handling

Yet another issue is whether a MOVE_EVENT should also be taken into account for events that have a lower timestamp than the MOVE_EVENT but are inserted after it.

summary: - MOVE_EVENT handling doesn't support unsorted events for the same uri
+ Time warp problem in MOVE_EVENT handling
Seif Lotfy (seif) wrote :

We have 2 options here:
1) Do not allow inserting a move event that lies before another move event in time
2) Extract the history of the subject, then rebuild and change the DB according to the subject_current_uri

Seif Lotfy (seif) wrote :

I prefer the first solution :)

Seif Lotfy (seif) wrote :

timestamp | uri | current_uri
-----------------------------
00 | A | A
10 | A | A
20 | A | A
30 | B | B
40 | C | C
50 | A | A

move A to T at 35

timestamp | uri | current_uri
-----------------------------
00 | A | T
10 | A | T
20 | A | T
30 | B | B
35* | A | T
40 | C | C
50 | A | A

move T to L at 15

timestamp | uri | current_uri
-----------------------------
00 | A | L
10 | A | L
15* | A | L
20 | A | T
30 | B | B
35* | A | T
40 | C | C
50 | A | A

Problems:
1) We cannot trace the origin of event 20: its origin uri is A while its current uri is T, without there being an A -> T or L -> T move before 20 occurs.
2) If one tries to reproduce the event timeline, the move event 35* is invalid, since A was already moved to L, so we would need to modify the move event.

description: updated
summary: - Time warp problem in MOVE_EVENT handling
+ Improved MOVE_EVENT handling (was: Time warp problem)
Changed in zeitgeist:
milestone: none → 0.9.1