[ntar-workers] On Markers and Bookmarks

Sat Jul 2 02:21:53 GMT 2005

Ciao Gianluca,

On Thu, 2005-06-30 at 22:48 -0700, Gianluca Varenni wrote:
>
> In general, I like the idea of having some indexing records to the file. And 
> maybe they could be put at the very end of a section, in order to avoid 
> recomputing the offsets).

cool! That was exactly my plan with the Section End Block.

> My question is: what is the best/most useful 
> indexing scheme we should use? I think it heavily depends on the application 
> processing the trace file. Maybe the best approach is to put some very 
> simple markers (as several folks here on the mailing list proposed) at 
> *reasonable* places (optional, of course), leaving the application to build 
> its own tree/db/whatever data structure to fast access the data.

So what's the verdict on the PacketsBlock/GoPBlock idea? Given that, I
think it wouldn't be that hard to do the following:

- Fix a baseline granularity for indexing, say a GoPBlock size on the
order of a few megabytes. We build and index on the timestamps of the
first packet in GoPBlocks, yielding the start offsets of those
GoPBlocks.
- The contents of an index block would encode a search tree structure.
For small sections, this'll be equivalent to a list, and as the traces
grow bigger, represent more and more complex trees -- think B-tree. Each
node contains a number of slots (encoding of their number, or whether it
can be fixed etc, to be discussed etc), each of which is labelled by a
timestamp and either pointing to another node or containing a file
offset.

Internal node:

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 	      Tree node ID                       |0|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Node pointer                       |x|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Timestamp (High)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Timestamp (Low)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                 ...                               
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Node pointer                       |x|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Timestamp (High)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Timestamp (Low)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Node pointer                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Leaf node:

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 	      Tree node ID                       |1|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       File Offset (High)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       File Offset (Low)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

(MSB of first word = node type, internal (0) or leaf (1). x = unused)

To navigate, you read the index block into memory, building up the tree
structure, and then walk the tree to find the GoPBlock closest to what
you need by comparing the timestamp you're looking for to the ones
stored inside the node slots. Inside that resulting GoPBlock, you search
sequentially.

The other way round, you could build up the tree structure in memory as
you pump out the packets and GoPBlocks, and when you stop the capture,
you serialize it out as an index block inside the SEB.

A 1TB file would have roughly 200K 5MB GoPBlocks. Even with generous
node sizes the entire tree structure should fit within a few megabytes.
(Thinking *very* back-of-envelope -- don't yell at me :)

I'm pretty shafted at the moment but if you like the idea I could likely
offer around a day or so per week in July to hack this up.

Cheers,
Christian.
-- 
________________________________________________________________________
                                          http://www.cl.cam.ac.uk/~cpk25
                                                    http://www.whoop.org