Thursday, 27 November 2014

Cassini - Getting out the header information

Using Regular Expressions to Extract the tags

Rather than dump all the tags into a Hash, as we did with the Voyager data, we will use a regular expression to parse out the tag data from the image file.

In a Vicar record we can isolate the data value by looking for an occurrence of a specific tag with a pattern like:

"one or more spaces" "the tag" "optional spaces" "=" "optional spaces"
In regexp speak this would be something like:

QString label;
...
QString pattern;
pattern = "\\s+" + label + "\\s*\\=\\s*";

Where '\s' is regular expression speak for a whitespace character, "+" means "1 or more", and "*" means "0 or more". In the Qt code each backslash has to be doubled, because the C++ compiler treats a single backslash in a string literal as the start of an escape sequence.

One place the above pattern will fall down is LBLSIZE, since this is at the very start of the file and won't have a leading space. So we want the pattern to take account of this special case - we can do that with a simple flag:

bool start;
...

    if (!start)
      pattern = "\\s+" + label;
    else
      pattern = label;

    pattern += "\\s*\\=\\s*";
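
As a rough sketch of the same idea outside Qt, here is the pattern builder written against the C++ standard library, with std::regex standing in for QRegularExpression. The names buildTagPattern() and hasTag() are made up for this example and aren't part of the code above; the "=" is left unescaped since it has no special meaning in a regular expression.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: builds the pattern text for one tag.
// 'start' is true only for the tag at the very start of the file (LBLSIZE),
// which has no whitespace in front of it.
std::string buildTagPattern(const std::string& label, bool start)
{
    std::string pattern;
    if (!start)
        pattern = "\\s+";      // one or more whitespace characters
    pattern += label;          // the tag itself
    pattern += "\\s*=\\s*";    // optional spaces, "=", optional spaces
    return pattern;
}

// Hypothetical helper: true if the tag occurs anywhere in 'header'.
bool hasTag(const std::string& header, const std::string& label, bool start)
{
    const std::regex re(buildTagPattern(label, start));
    return std::regex_search(header, re);
}
```

With a header that begins "LBLSIZE=2048 NLB=0 NL=1024", hasTag(header, "NL", false) finds " NL=" without being fooled by "NLB", while hasTag(header, "LBLSIZE", false) fails because nothing precedes the tag. That second case is the one the start flag is there to handle.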



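We can put this into a Qt regular expression and look for a match in a byte array (QRegularExpression works on a QString, so the QByteArray is converted implicitly in the call to match()) with: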

QByteArray ba;
...
QRegularExpression re;
...
  re.setPattern(pattern);
  QRegularExpressionMatch match = re.match(ba);

  if (match.hasMatch())
  {

...

And then we can locate the end of the match (i.e. the start of the data value) with:
  int off = match.capturedEnd();

One minor point is that the QRegularExpression class is Qt5 only - for Qt4 there's QRegExp, which provides similar functionality but with a different API - I'm sticking to Qt5 for this.
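
For anyone not on Qt5, the same match-then-offset step can be sketched with std::regex from the standard library. This is only an illustration under that assumption: findValueOffset() is a hypothetical name, and position() plus length() plays the role of capturedEnd().

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the offset of the first character after
// "TAG =" (where the value starts), or -1 if the tag is not found.
long findValueOffset(const std::string& header, const std::string& label,
                     bool start)
{
    std::string pattern = start ? "" : "\\s+";
    pattern += label;
    pattern += "\\s*=\\s*";

    const std::regex re(pattern);
    std::smatch match;
    if (!std::regex_search(header, match, re))
        return -1;

    // position() is where the match begins and length() how long it is,
    // so their sum is the equivalent of capturedEnd() in the Qt code.
    return static_cast<long>(match.position(0) + match.length(0));
}
```

Returning -1 for a missing tag is just one choice; the Qt version gets the same information from hasMatch().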

At this point we extract the data value. This could be done with a more complex regular expression, but to keep it simple we'll just implement a character by character walk, keeping an eye out for quoting:

  while ((off < ba.size()) && (!done))
  {
    char ch = ba[off++];

    if (ch == '\'')
      quoted = !quoted;

    // An unquoted space ends the value and is not copied to the result
    if ((!quoted) && (ch == ' '))
      done = true;
    else
      result.append(ch);
  }

i.e. if we hit a space outside a quoted string we're done; otherwise we just append the character to the output.
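
Pulled together as a standalone function, the walk might look like the sketch below. It uses std::string rather than QByteArray, extractValue() is a name invented here, and it assumes (as the loop above does) that a plain space is the only terminator we care about. The closing space is not copied into the result, but the quote characters are.

```cpp
#include <string>

// Hypothetical helper: copies the value that starts at 'off' in 'data'.
// Stops at the first space that is not inside single quotes, or at the
// end of the data, whichever comes first.
std::string extractValue(const std::string& data, std::size_t off)
{
    std::string result;
    bool quoted = false;

    while (off < data.size()) {
        const char ch = data[off++];

        if (ch == '\'')
            quoted = !quoted;          // entering or leaving a quoted string
        else if (!quoted && ch == ' ')
            break;                     // unquoted space: end of the value

        result.push_back(ch);
    }
    return result;
}
```

Keeping the quotes in the result means the caller can still tell a string value such as 'BYTE' from a number such as 1024.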

The tags we're looking for

Since we're scanning for a specific set of tags, here's what we want to find:

 "LBLSIZE"
The size of the label header, in bytes. We can use this to locate the image data.

  "TYPE"
Type of data - this should always say 'IMAGE' for the files we're handling.

  "ORG"
The file organisation; this tells us how the pixel data is arranged, and what the various size tags actually mean. In this case we're primarily looking for a value of 'BSQ' in this field, which means "Band Sequential". Here each line is a run of samples, the lines follow one another to make up a band, and the bands are stored one after another. The alternative arrangements interleave the bands line by line (BIL) or pixel by pixel (BIP). For now we can check for BSQ and only process files with this data layout.

  "FORMAT"
The format of pixel samples. We're going to deal with 'BYTE' (8bpp) and 'HALF' (signed 16bpp) only.

  "NL"
  "NS"
  "NB"

The count of lines, samples and bands in the image. Essentially this gives us the information we need to recover and correctly size the output image. Note that since we use the record size and binary prefix (RECSIZE & NBB) we don't actually need to use NS directly; more on this when we extract image data.

  "N1"
  "N2"
  "N3"

For BSQ Organised files these are equivalent to 'NS', 'NL' and 'NB' respectively. Since we only handle BSQ we use these interchangeably.

  "NBB"
Binary prefix bytes - this tells us how many bytes at the start of each line are binary prefix rather than pixel data, so we can skip them when we extract image data from the pixel area.

  "NLB"
Number of binary header lines (records) that sit ahead of the image data; we skip over these - more on this when we process the image.

  "RECSIZE"
Record size - this tells us about the underlying size of data chunks in the label and pixel data regions - more on this when we process the image.

  "MISSION_NAME"
  "INSTRUMENT_NAME"
  "IMAGE_TIME"
  "TARGET_NAME"
  "FILTER_NAME"
Image metadata we can use to classify the images.
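
The checks described above for TYPE, ORG and FORMAT could be sketched as follows. This assumes the tag values have already been parsed into a map of strings and that string values still carry their single quotes (e.g. 'BSQ'); isSupportedImage() and unquote() are hypothetical helpers, and std::map stands in for a QHash.

```cpp
#include <map>
#include <string>

// Hypothetical helper: strips one pair of surrounding single quotes, if any.
std::string unquote(const std::string& v)
{
    if (v.size() >= 2 && v.front() == '\'' && v.back() == '\'')
        return v.substr(1, v.size() - 2);
    return v;
}

// Hypothetical helper: true only for the files this walkthrough handles,
// i.e. TYPE is IMAGE, ORG is BSQ and FORMAT is BYTE or HALF.
bool isSupportedImage(const std::map<std::string, std::string>& labels)
{
    // Look up a tag and return its unquoted value ("" if it is missing).
    auto value = [&labels](const std::string& key) {
        const auto it = labels.find(key);
        return it == labels.end() ? std::string() : unquote(it->second);
    };

    const std::string format = value("FORMAT");
    return value("TYPE") == "IMAGE"
        && value("ORG") == "BSQ"
        && (format == "BYTE" || format == "HALF");
}
```

A missing tag is treated the same as an unsupported value, so a file without an ORG tag is simply rejected.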

And how we get them

This is fairly simple - we just walk a list of tags, extract the value using the regular expression parser (which for this example is in "getLabelValue()"), and drop it in a hash, i.e.
 static const char* fl_labels[] = {
  "LBLSIZE",
  "NBB",

...
And then we have, from the previous code, something like:

QHash<QString, QString> _labels;
const int sz = sizeof(fl_labels)/sizeof(fl_labels[0]);
...
  for (int i=0; i < sz; i++)
  {
    // The tag name doubles as the hash key
    QString key = fl_labels[i];
    QString value = getLabelValue(_data, fl_labels[i], i==0);
    _labels[key] = value;
  }

(Note the "i==0", which is the special casing for the start flag in the regular expression function; it relies on LBLSIZE being the first entry in the list.) So, once again, we do a file read to pull the data into a QByteArray with:

QString nm;
...
QFile fin;
...
  fin.setFileName(nm);
  if (!fin.open(QIODevice::ReadOnly))
    return false;

  _data = fin.readAll();
  fin.close();




and pass it through the tag parser above. Done!
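
To show the whole flow in one place, here is a compact sketch using only the C++ standard library instead of Qt. Everything in it is an assumption made for illustration: std::string replaces QByteArray, std::unordered_map replaces QHash, and this getLabelValue() is a guess at the shape of the real function rather than a copy of it.

```cpp
#include <fstream>
#include <iterator>
#include <regex>
#include <string>
#include <unordered_map>
#include <vector>

// Reads the whole file into 'data'; returns false if it cannot be opened.
bool readFile(const std::string& name, std::string& data)
{
    std::ifstream fin(name, std::ios::binary);
    if (!fin)
        return false;
    data.assign(std::istreambuf_iterator<char>(fin),
                std::istreambuf_iterator<char>());
    return true;
}

// Returns the value of one tag, or an empty string if the tag is missing.
// 'start' marks the tag at the very start of the file (no leading space).
std::string getLabelValue(const std::string& data, const std::string& label,
                          bool start)
{
    const std::string pattern =
        std::string(start ? "" : "\\s+") + label + "\\s*=\\s*";
    const std::regex re(pattern);
    std::smatch match;
    if (!std::regex_search(data, match, re))
        return std::string();

    // The value begins where the match ends.
    std::size_t off =
        static_cast<std::size_t>(match.position(0) + match.length(0));

    std::string result;
    bool quoted = false;
    while (off < data.size()) {
        const char ch = data[off++];
        if (ch == '\'')
            quoted = !quoted;
        else if (!quoted && ch == ' ')
            break;                    // unquoted space ends the value
        result.push_back(ch);
    }
    return result;
}

// Walks the tag list and stores each value; the first tag gets the start flag.
std::unordered_map<std::string, std::string>
parseLabels(const std::string& data, const std::vector<std::string>& tags)
{
    std::unordered_map<std::string, std::string> labels;
    for (std::size_t i = 0; i < tags.size(); ++i)
        labels[tags[i]] = getLabelValue(data, tags[i], i == 0);
    return labels;
}
```

Usage would be a call to readFile(name, data) followed by parseLabels(data, tags), with LBLSIZE as the first entry of tags. Searching the whole file works but is wasteful; once LBLSIZE is known, the search could be limited to that many bytes.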