Saturday, 22 November 2014

Breaking Down The File

The Record Format

At the top level the IMQ files have a sequence of records, arranged in a basic "length/data" format.

So the file content is a series of data fields formatted as:

[L0][L1][D0][D1]...[Dn]

Where L0 & L1 combine to give you a 16 bit length, and then D0..Dn are the target data.

Using some off the shelf Qt functions we can simply read the entire file content into a standard Qt byte holding structure, break out each record individually into a list, and later store the header entries in a hash map.

Loading The File

Initially we can load the entire file into memory with a fragment like:
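#include <QFile>
#include <QByteArray>

QFile fin(filename);   // filename: placeholder for the path to the IMQ file
QByteArray ba;

  // Open read-only and bail out if that fails
  if (!fin.open(QIODevice::ReadOnly))
    return false;
  ba = fin.readAll();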

The QByteArray structure gives us an array of raw bytes with simple accessors and modifier hooks. So following this the "ba" contains all the file data. Since the files are relatively small (a couple of hundred K) this isn't a big deal.

Breaking Out The Fields

We can then break this down into a set of independent fields, using QList to manage a list of QByteArray objects, each of which holds a basic record from the input file, i.e.

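  QList<QByteArray> _data;
  QByteArray ba;         // the whole file, as loaded above
  QByteArray bn;
  int rlength, length;

  while (ba.size() >= 2) {
    // Grab the two length bytes and drop them from the buffer
    bn = ba.left(2);
    ba.remove(0, 2);

    // Assumption: the low byte of the length is stored first
    length = (
              ((unsigned char) bn.at(1) << 8) |
              ((unsigned char) bn.at(0)));
    bn = ba.left(length);
    _data.append(bn);

    // Odd-length records carry one byte of padding
    rlength = length;
    if ((rlength % 2) != 0) {
      rlength++;
    }
    ba.remove(0, rlength);
  }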

In this piece of logic we:
  • Grab two bytes to get a length
  • Remove the length bytes from the start of the QByteArray
  • Get the given length of data as a new QByteArray
  • Append the new QByteArray to the end of the QList
  • Remove the bytes from the main QByteArray; if the field length was odd, remove an extra byte
The extra byte is a side effect of the file format, which specifies that every record must contain an even number of bytes.
We should probably be more careful about the endianness of the 16 bit length value construction relative to the host, but since this doesn't impact our construction code this is left as "an exercise for the reader".
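The steps above can also be sketched without Qt, which makes them easy to try in isolation. This is only a minimal sketch: splitRecords is a hypothetical name, std::vector stands in for QByteArray and QList, and it assumes the low byte of the length comes first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split a raw buffer into length-prefixed records.
// Assumption: the low byte of the 16 bit length is stored first.
std::vector<std::vector<std::uint8_t>> splitRecords(const std::vector<std::uint8_t>& raw)
{
    std::vector<std::vector<std::uint8_t>> records;
    std::size_t pos = 0;

    while (pos + 2 <= raw.size()) {
        // Built from individual bytes, so the host byte order does not matter.
        const std::size_t length = static_cast<std::size_t>(raw[pos]) |
                                   (static_cast<std::size_t>(raw[pos + 1]) << 8);
        pos += 2;

        if (pos + length > raw.size())
            break;  // truncated record: stop rather than read past the end

        const auto begin = raw.begin() + static_cast<std::ptrdiff_t>(pos);
        records.emplace_back(begin, begin + static_cast<std::ptrdiff_t>(length));

        // Every record occupies an even number of bytes, so skip the pad byte.
        pos += length + (length % 2);
    }
    return records;
}
```

Walking an index through the buffer avoids repeatedly removing bytes from the front, but for files of a few hundred K either approach is fine.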

At this point we have a set of records, which break down into five distinct regions; all the Voyager images follow the same layout:
  1. The image label - all the parameters that are associated with this image, such as the instrument used to capture, time of capture, etc. as well as data pointers (more on this later). This is a variable number of records, of which the last simply contains the string "END".
  2. The image histogram - Always two records, which combine to make a table of 256x32 bit integers, indicating the histogram of image elements.
  3. The Encoding histogram - Always three records, which combine to make a table of 511x32 bit integers. This is a set of offset/frequency values which are used to generate the binary tree for Huffman decompression (more on this later)
  4. Engineering table - Always one record, which contains "other" engineering data. For now I'm ignoring this field - we don't need it to decompress/view the image.
  5. The Image Object - This is 800 variable length records, each of which represents a single line of image data. The line is compressed and will extract to an 8 bit 800 pixel wide line, resulting in an 800x800x8bpp grey scale image.
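Since the histogram tables span more than one record, they need to be stitched back together before use. A rough sketch of that, with assumptions flagged: readTable and Record are hypothetical names, std::vector stands in for the Qt containers, and each 32 bit integer is assumed to be stored low byte first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Record = std::vector<std::uint8_t>;

// Hypothetical helper: concatenate `count` records starting at index `first`
// and read the result as 32 bit unsigned integers.
// Assumption: each integer is stored low byte first.
std::vector<std::uint32_t> readTable(const std::vector<Record>& records,
                                     std::size_t first, std::size_t count)
{
    // Join the records so a value split across a record boundary still works.
    std::vector<std::uint8_t> bytes;
    for (std::size_t i = first; i < first + count && i < records.size(); ++i)
        bytes.insert(bytes.end(), records[i].begin(), records[i].end());

    std::vector<std::uint32_t> table;
    for (std::size_t i = 0; i + 4 <= bytes.size(); i += 4) {
        table.push_back(static_cast<std::uint32_t>(bytes[i]) |
                        (static_cast<std::uint32_t>(bytes[i + 1]) << 8) |
                        (static_cast<std::uint32_t>(bytes[i + 2]) << 16) |
                        (static_cast<std::uint32_t>(bytes[i + 3]) << 24));
    }
    return table;
}
```

With this, the image histogram would come from two consecutive records and should yield 256 entries, and the encoding histogram from three records yielding 511.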

How Many Values In the Encoding Histogram?

The decompression table is 511 entries, but the image only has 8bpp (256 values). At first glance this is a little odd.

This is because only the first pixel is an absolute value; the rest of the line is a set of "offset" values from the previous pixel. This leads to some efficient compression since the pixel-to-pixel differences tend to cluster and compress well, but the side effect is that the value swings range from -255 to +255 steps (i.e. "0+255=255", or "255-255=0"), requiring 511 distinct values (2 x 255 + 1) to cover the range.
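To make the range concrete, here is a small sketch of the idea, not the actual IMQ encoding: toDifferences and fromDifferences are hypothetical names, and the sign convention (current minus previous) is an assumption for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// First element is the absolute pixel value; each later element is the
// difference from the previous pixel, so it lies in the range -255..+255.
// Assumption: difference = current - previous (illustrative only).
std::vector<int> toDifferences(const std::vector<std::uint8_t>& line)
{
    std::vector<int> out;
    for (std::size_t i = 0; i < line.size(); ++i)
        out.push_back(i == 0 ? int(line[0]) : int(line[i]) - int(line[i - 1]));
    return out;
}

// Inverse of toDifferences: rebuild the pixels by accumulating the offsets.
std::vector<std::uint8_t> fromDifferences(const std::vector<int>& diffs)
{
    std::vector<std::uint8_t> line;
    int current = 0;
    for (std::size_t i = 0; i < diffs.size(); ++i) {
        current = (i == 0) ? diffs[0] : current + diffs[i];
        line.push_back(static_cast<std::uint8_t>(current));
    }
    return line;
}
```

A line of pixels 0, 255, 0 becomes 0, +255, -255, which hits both extremes of the difference range.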

I'll go over this in a bit more detail when we actually come to uncompress the image.

Breaking down the header

The header is basically a set of entries each of which is of the form "Tag = Value".
There are only three exceptions to this general rule:
  • END
  • END_OBJECT
  • Pure Comment records

The "END" is used to flag the end of the label header, and the start of the first image histogram field follows it.

Comment-only records start with "/*" and have no other data in them.

The END_OBJECT is related to the "OBJECT" statement. OBJECT is used to qualify a particular set of associated data elements in the file, so for example the file we're using has an Object per histogram, and each Object describes the specifics. Importantly, objects may have overlapping tag names, e.g.:

[31]    " ITEMS = 256"
[33]    " ITEM_BITS                       = 32"
[34]    "END_OBJECT"
[36]    " ITEMS = 511"
[38]    " ITEM_BITS = 32"
[39]    "END_OBJECT"

Actually, since we know the item size and format is fixed, you could just ignore these fields for now; in practice, however, I track the "current" object when parsing through and prefix its name onto the tag in the data structure.
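One way the object tracking could look is sketched below. The names parseLabels and trim are hypothetical, std::map stands in for the Qt hash, the "OBJECT = name" form of the statement and the "OBJECT.TAG" key format are assumptions, and nested objects are not handled.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: strip leading and trailing whitespace.
static std::string trim(const std::string& s)
{
    const auto first = s.find_first_not_of(" \t\r\n");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t\r\n");
    return s.substr(first, last - first + 1);
}

// Parse "Tag = Value" records, prefixing tags with the current object name.
// Assumption: objects are opened with a record of the form "OBJECT = name".
std::map<std::string, std::string> parseLabels(const std::vector<std::string>& records)
{
    std::map<std::string, std::string> labels;
    std::string currentObject;

    for (const std::string& raw : records) {
        const std::string line = trim(raw);
        if (line == "END") break;                                     // end of the label
        if (line.empty() || line.compare(0, 2, "/*") == 0) continue;  // comment-only record
        if (line == "END_OBJECT") { currentObject.clear(); continue; }

        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        const std::string key = trim(line.substr(0, eq));
        const std::string value = trim(line.substr(eq + 1));

        if (key == "OBJECT") { currentObject = value; continue; }
        labels[currentObject.empty() ? key : currentObject + "." + key] = value;
    }
    return labels;
}
```

With the prefix in place, the two ITEMS entries shown above no longer collide, because each is stored under its own object name.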

Data Locations

There are a couple of pointer fields in the original image data - these are fields which start with the "^" character.
The pointers are offset by one from the array location (since the pointer values are "1" based), and can be used to locate the data structures by looking directly at the QList used to store the records.
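The off-by-one adjustment is trivial, but it is easy to get wrong, so here is a hedged sketch. pointerToIndex is a hypothetical name, and it assumes the pointer value has already been extracted from the label as a plain decimal string.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: turn a 1-based pointer value (e.g. "3") into a
// 0-based index into the record list.
// Returns false if the value is not a positive integer.
bool pointerToIndex(const std::string& value, std::size_t& index)
{
    try {
        std::size_t used = 0;
        const long n = std::stol(value, &used);
        if (used != value.size() || n < 1)
            return false;  // trailing junk, or not a valid 1-based position
        index = static_cast<std::size_t>(n - 1);
        return true;
    } catch (const std::exception&) {
        return false;      // not a number at all, or out of range
    }
}
```

The resulting index can then be used directly against the list of records, provided it is checked against the list size first.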

Storing the header

Under Qt we can use the QHash template class as a simple lookup dictionary, so we can make a declaration like:
QHash<QString, QString> _labels;
This allows us to do a simple split at the "=" to separate the key and value, then we can just insert it in the hashtable with
_labels[key] = value;
And retrieve it with:
value = _labels[key];
So, for example, if we dump all the incoming header items into the hashtable we can then retrieve the name of the probe with
name = _labels["SPACECRAFT_NAME"];

Obviously when we insert these in the hash then the ordering is lost, and the pointer records are no longer useful.

Next up will be building the compression table and decompressing the image...

Things I'm glossing over

Removing comments, which start with "/*" from the records, using the .trimmed() and .simplified() methods to clean up the whitespace in the QByteArray entries and error handling throughout, none of which are particularly interesting...