V1 hashing protocol is go!

Written for FileStore by Jason Gerard DeRose on 2013-06-01

So big news! The Version 1 Dmedia Hashing Protocol has been finalized, and is
now used by default.

The only downside is that because of the switch to our Dbase32 encoding, it
wasn't possible to support V0 alongside V1. The first time you load a FileStore
containing a V0 files/ layout, it will move this directory to files0/,
preserving your data.

But to actually use these files, you must run this upgrade script from a
terminal:

    novacut-v0-v1-upgrade

Or optionally, if you only have Dmedia installed, run this (which the above
calls before upgrading the Novacut databases):

    dmedia-v0-v1-upgrade

For more help, please watch this screencast that walks you through the upgrade
process:

    http://www.youtube.com/watch?v=NOyEIab0E-U

There will certainly be additional protocols added in the future, but from V1
forward, all protocols will be supported indefinitely. Newly added files will
be hashed according to whatever the newest protocol version is at the time, but
you will always be able to resolve and verify a file according to an ID computed
with an older protocol version.

This is important so that we can always preserve the link integrity between,
say, a Novacut edit and the files that edit references.

V1 is a big milestone not just because it finalizes the details of V1, but also
because it finalizes our protocol framework, the design aspects we expect to be
the same across all protocol versions from V1 onward.

Each new protocol version will use a different digest size, and we'll use the
ID length to determine what protocol version was used to compute an ID. So for
example, V1 uses a 240-bit digest (48 characters when Dbase32 encoded), and V2
might use a 280-bit digest (56 characters when Dbase32 encoded).
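
To make the length-based dispatch concrete, here is a minimal sketch of how an
ID's length could select a protocol version. This is not the actual
`filestore` API: the names `DIGEST_BITS`, `id_length()`, and
`protocol_version()` are hypothetical, and the V2 entry is only the "might use"
example from above, not a real protocol.

```python
# Hypothetical sketch: infer the protocol version from the length of an ID.
# V1's 240-bit digest is from this announcement; the V2 entry is purely the
# illustrative 280-bit example, not a real protocol.
DIGEST_BITS = {
    1: 240,  # V1 (finalized)
    2: 280,  # hypothetical V2
}

BITS_PER_CHAR = 5  # a base32 encoding carries 5 bits per character


def id_length(digest_bits):
    """Return the encoded ID length for a digest of *digest_bits* bits."""
    if digest_bits % 40 != 0:
        # 40 bits is a whole number of bytes (5) and of characters (8)
        raise ValueError('digest size must be a multiple of 40 bits')
    return digest_bits // BITS_PER_CHAR


# Map each encoded ID length back to the version that produces it
VERSION_BY_LENGTH = {
    id_length(bits): version for (version, bits) in DIGEST_BITS.items()
}


def protocol_version(_id):
    """Return the protocol version implied by the length of *_id*."""
    try:
        return VERSION_BY_LENGTH[len(_id)]
    except KeyError:
        raise ValueError(
            'no known protocol uses {}-character IDs'.format(len(_id))
        )
```

Because every version has a distinct digest size, the lookup is unambiguous,
and an ID of unexpected length is rejected outright rather than guessed at.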

Things that will (or can) change between protocol versions:

    * By definition, the digest size will change

    * The underlying hash function likely will change

    * The leaf size might change

And things we don't expect to change:

    * The digest size will always be a multiple of 40 bits

    * The root-hash and leaf-hashes will use the same digest size

    * The leaf size will be a multiple of 2 MiB

    * The root-hash will be cryptographically tied to the file-size

    * The leaf-hashes will be cryptographically tied to the leaf-index

    * Dbase32 will always be the "official" digest encoding, and will be the
      encoding used by standard FileStore layouts
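
To illustrate what "cryptographically tied" means in the list above, here is a
rough sketch using BLAKE2b from the Python standard library as a stand-in,
since Skein isn't in the standard library. To be clear, this is NOT the V1
protocol and will not produce V1 IDs: the function names, the personalization
strings, the way the index and size are encoded as keys, and the 8 MiB leaf
size are all assumptions made for illustration. See the specification for the
real details.

```python
import hashlib

DIGEST_BYTES = 30            # 240 bits, same size for leaf and root hashes
LEAF_SIZE = 8 * 1024 * 1024  # assumed for this sketch; a multiple of 2 MiB


def hash_leaf(leaf_index, leaf_data):
    """Hash one leaf, tying the digest to the leaf's position in the file."""
    key = str(leaf_index).encode()  # assumed encoding of the leaf-index
    return hashlib.blake2b(
        leaf_data, digest_size=DIGEST_BYTES, key=key, person=b'sketch/leaf'
    ).digest()


def hash_root(file_size, leaf_hashes):
    """Hash the concatenated leaf-hashes, tying the digest to the file-size."""
    key = str(file_size).encode()  # assumed encoding of the file-size
    return hashlib.blake2b(
        leaf_hashes, digest_size=DIGEST_BYTES, key=key, person=b'sketch/root'
    ).digest()


def hash_file(data, leaf_size=LEAF_SIZE):
    """Compute a root-hash for *data* held in memory (sketch only)."""
    leaf_hashes = b''.join(
        hash_leaf(index, data[start:start + leaf_size])
        for (index, start) in enumerate(range(0, len(data), leaf_size))
    )
    return hash_root(len(data), leaf_hashes)
```

The point of the tying is that identical leaf data at a different leaf-index
yields a different leaf-hash, and the same leaf-hashes paired with a different
file-size yield a different root-hash, so leaves can't be reordered or a size
misreported without the verification failing.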

We want to start research on V2 soon, but we also want to be very leisurely
about finalizing V2 (which is why we should start research soon). I hope it
will be at least 5 years till we add another protocol.

If anyone is interested in such research, the Protocol base class makes it quite
easy to build experimental protocols:

    http://bazaar.launchpad.net/~dmedia/filestore/trunk/view/head:/filestore/protocols.py

Also, folks might be interested in reading the V1 protocol specification:

    http://docs.novacut.com/filestore/specification.html

(Thanks again to Robert von Burg for working on an experimental Java
implementation, which provided lots of great feedback that helped make the
specification clearer and more correct.)

And one last note, just because people tend to ask a lot: I'll explain why we
stuck with Skein for V1, rather than using the SHA-3 winner (Keccak).

The biggest reason we stuck with Skein is performance, specifically performance
of 64-bit software implementations. For our current 240-bit digest size, Skein
is roughly twice as fast as Keccak (on my hardware anyway). And that is
significant, because even Skein can't quite keep up with the read throughput of
today's fastest SSDs when hashing on a single core.

Although Keccak can be faster than Skein when implemented in hardware, we have
to make decisions based on the here and now. Today, no such hardware
implementations are readily available, and it might be many years before they
are.

Another issue is that the Skein parameter system made our cryptographic tying very
easy, but HMAC (or equivalent) hasn't been defined yet for Keccak, and defining
our own would be very risky. Yet lots of practical experience working through
different protocol iterations, and building useful software on top, has made me
more convinced than ever that this cryptographic tying is an extremely important
and pragmatic feature for our use case.

And the last thing is I've embraced the idea that we will have additional
protocol versions sooner rather than later. We will support V1 forever, but it
won't be the default forever. I think it's important that we architect the
`filestore` package to support multiple protocol versions as soon as possible,
long before we actually have a second protocol. This is inevitable, so let's
start practicing now.

We really needed to commit to a stable protocol version *now*, and I think at
this moment Skein is the best choice. Plus, if we switched to a different hash
algorithm now, it would still take time to understand it as well as we
understand Skein, to be confident about the way in which we are using it. That
could easily mean another year till we have a stable protocol.

Anyway, thanks to the many people who have patiently reviewed the many protocol
iterations leading up to V1, especially David Jordan, Hagen Fürstenau, and
Robert von Burg.

-Jason
