changelog shortlog tags branches files raw gz bz2 help

Mercurial > hg > ventivac / changeset: first version of description of ventisrv fileformat (and rationale for design decisions). will re-read tomorrow and add some make-up.

changeset 123: d1f69ce075ec
parent 122: 84abb5444a76
child 124: cc6bd0bff47f
author: Mechiel Lukkien <mechiel@ueber.net>
date: Sun, 19 Aug 2007 23:19:45 +0200
files: doc/mkfile doc/ventisrv-fileformat.ms
description: first version of description of ventisrv fileformat (and rationale for design decisions). will re-read tomorrow and add some make-up.
     1.1--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2+++ b/doc/mkfile	Sun Aug 19 23:19:45 2007 +0200
     1.3@@ -0,0 +1,23 @@
     1.4+#<fonts.pal
     1.5+NPROC = 1
     1.6+FILES = \
     1.7+	ventisrv-fileformat.ps\
     1.8+
     1.9+PRE=$FONTS'.ps 9
    1.10+.nr PS 9
    1.11+.vs 11
    1.12+.nr VS 11
    1.13+.nr dP 1
    1.14+.nr dV 1p
    1.15+.nr dT 4
    1.16+.nr XT 4
    1.17+'
    1.18+
    1.19+all:V: $FILES
    1.20+
    1.21+%.ps:D:	%.ms
    1.22+	cat $stem.ms | tbl | troff -ms | lp -dstdout > $target
    1.23+	#cat $stem.ms | tbl | groff -Tps -ms > test.ps
    1.24+
    1.25+%.pdf: %.ps
    1.26+	ps2pdf <$stem.ps >$stem.pdf
     2.1--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     2.2+++ b/doc/ventisrv-fileformat.ms	Sun Aug 19 23:19:45 2007 +0200
     2.3@@ -0,0 +1,67 @@
     2.4+.TL
     2.5+Ventisrv file formats
     2.6+.AU
     2.7+Mechiel Lukkien
     2.8+mechiel@xs4all.nl
     2.9+.AI
    2.10+Google Summer of Code, for the Plan 9/Inferno project
    2.11+.br
    2.12+August 2007
    2.13+.AB
    2.14+This paper describes the file format used by ventisrv for its data and index file.
    2.15+It also provides some rationalisation for design decisions.
    2.16+.AE
    2.17+.SH
    2.18+Introduction
    2.19+.PP
    2.20+At startup, ventisrv reads the index file sequentially to read a part of the score of each stored block into memory.  The data file keeps the actual data and all relevant meta-data, enough to reconstruct an index file (though to reconstruct an index file, the data file has to be read entirely).  The data file is a concatenation of a data header and the data contents, compression changes this slightly.  The index file is a concatenation of headers referencing a block in the data file (and in the same order).  The file format is described first for the simple case, without compression, and for this case first the format of the data file, and next the format for the index file.  Then the adjustments made for to the file format to support compression are mentioned.
    2.21+.SH
    2.22+Format of data file
    2.23+.PP
    2.24+First, an empty file (zero length) data file, is simply a data file without any block stored in it.  A block is stored by writing a header, called Dhdr, to the data file, followed by the data itself.  The header is 41 bytes long: a 4 byte magic, 20 byte score, 1 byte data type, 2 byte size, 4 bytes `connection time'.  The fixed magic value is 0x2f9d81e5.  The size field indicates the number of bytes following the header.  Even though 2 bytes can address up to 64 kilobytes, only values up to 56 kilobytes are valid as a consequence of the venti protocol which does not allow larger blocks.  Note that the `zero score', the score belonging to the empty data block, is never stored on disk, it is handled internally by ventisrv, though such a block is valid in the file format.  The score in the header is of course the score of the data following the header.  Ventisrv checks whether the score in the header matches the score it calculates from the data, to detect e.g. disk failures.  The `connection time' is the time (in seconds since UNIX epoch) at which the connection was started.  It is useful to relate blocks to an accidental or malicious batch of writes.
    2.25+.PP
    2.26+This is the definition of a Dhdr in Limbo (with functions removed), along with the `magic':
    2.27+.P1
    2.28+Dhdrmagic:      con big 16r2f9d81e5;
    2.29+
    2.30+Dhdr: adt {
    2.31+        score:  Score;		# 20 bytes
    2.32+        dtype:  int;		# 1 byte
    2.33+        size:   int;		# 2 bytes
    2.34+        conntime:       big;	# 4 bytes
    2.35+};
    2.36+.P2
    2.37+.SH
    2.38+Format of the index file
    2.39+.PP
    2.40+For each block (header and data) written to the data file, a header is written to the index file.  Headers in the index and data file are always in the same order.  A header in the index file, called Ihdr, is 15 bytes in size: the first 8 bytes of the score (which it 20 bytes), a 1 byte type and a 6 byte offset into the data file.
    2.41+.PP
    2.42+Only 8 bytes from the score are stored because storing more is not useful (if they were needed, more main memory would be needed to run the ventisrv than fits into current and future computers).  Also, the index file has to be read into memory at ventisrv startup, so it is best to keep it as small as possible.  Even 8 bytes are more than needed for almost all ventisrv installations.  Note that index headers to not contain a `magic' number and do not have data following them (so perhaps the name `Ihdr' is misleading).
    2.43+.PP
    2.44+Only 6 bytes are used for storing the offset into the data file.  More address space will never be needed because main memory will run out first when storing such large amounts of data.
    2.45+.PP
    2.46+Below the definition of an Ihdr in Limbo (with functions removed) is given.  The field `compressed' is only used for compression and can be ignored for now.
    2.47+.P1
    2.48+Ihdr: adt {
    2.49+        halfscore:      array of byte;	# 8 bytes
    2.50+        dtype:  int;			# 1 byte
    2.51+        offset: big;			# 6 bytes
    2.52+        compressed:     int;
    2.53+};
    2.54+.P2
    2.55+.SH
    2.56+Format adjustments to support compression
    2.57+.PP
    2.58+After the basic ventisrv functionality was implemented, support for compressing blocks of data was added.  The most straight-forward solution would be to add a bit to the index header to indicate whether the block is compressed; and add a similar bit to the header in the data file, along with the size of the compressed payload (i.e. data actually on disk, which will be decompressed in the actual data).  The actual implementation is a bit different.  A new header can now occur in the data file, the Fhdr (F for flate, the compression algorithm used, implemented by Inferno's filter-deflate(2) module.  An Fhdr is of variable size, it can contain information about one or more data blocks.  This is necessary because the compressed payload following the header contains data for multiple blocks.  The only reason for compressing multiple blocks into a single `compressed payload' is that the compression ratio will be higher, because the search history for the compression algorithm will be larger (and it does not have to build up such a history for each block to compress).
    2.59+.PP
    2.60+The fixed-length part of an Fhdr is 7 bytes long:  a 4 byte magic, a 1 byte count for the number of blocks stored in the compressed payload, and the size of the compressed payload.  The fixed value of the magic is 0x78c66a15.  The maximum number of compressed blocks in a single Fhdr is 256.  The size of the compressed payload is currently kept <= 56 kilobytes, though they can be up to 64 kilobytes.  The maximum size cannot be much larger because the entire compressed payload up to the needed block has to be decompressed when a single block is needed.
    2.61+.PP
    2.62+The variable-length part of the header follows immediately after the fixed-length part.  This variable part is made up of a header for each block stored in the compressed payload.  Each such header looks much like a Dhdr, it is 27 bytes in size:  a 20 byte score, 1 byte date type, 2 byte size and 4 byte connection time.  The size is the size of the uncompressed data.  To illustrate, consider an Fhdr that represents two blocks.  On disk, it will start off with 7 bytes of fixed-size header.  The `count' will be set to 2.  This header is immediately followed by 27 bytes for the first block and another 27 bytes for the second block.  After this a compressed payload follows of the `size' specified in the fixed-length part of the header.  Note that the variable length part of the header is not compressed.  This allows for determining whether a score is present by only reading the header.  Compressing the 27 bytes would not be of much use anyway, since 20 bytes of out 27 are the score, which is a random value to the compression algorithm.
    2.63+.PP
    2.64+The index header changes only slightly:  the most significant bit of the `data file offset' now indicates whether the header in the data file it points to is a Dhdr (when the bit is not set) or an Fhdr (when it is set).  Headers in the index file are still in the same order of appearance as the blocks in the data file.  Note that each stored score is given a header in the index file.  This includes possible multiple scores in a single Fhdr in the data file:  they each get an Ihdr, with the data file offset pointing to the same location.
    2.65+.SH
    2.66+Conclusions
    2.67+.PP
    2.68+Support for compression makes the file format less simple, but not significantly so.  Improvements could be made in the area of compression.  For example, another compression algorithm can be used, one that does not need to build-up compression history as much, or has some predefined histories to choose from.  Also, since compression is relatively slow, a faster compression algorithm would be welcome.  The header format would not necessarily have to change to accommodate for this.
    2.69+.PP
    2.70+The index and data files contain enough information to cross-check the validity of the data blocks.  Ventisrv performs these checks on the most recently written blocks in these files.  The data file is always written before the index header is written, though not flushed explicitly, so the index file may be flushed in the background by a file system scheduler.  In any case, missing index entries are automatically added by ventisrv at startup, missing data file blocks are a fatal error.  The only remaining problem is the question what to do with a permanently damaged and non-recoverable (e.g. from backup) data block.  Ideally, it should be possible to mark a data block as invalid, at least in the data file.