Wednesday, 18 May 2016

Faster Image Processing Using Halide

Halide: Why?

Halide is a Domain Specific Language (DSL) targeted at image processing.

Essentially Halide provides a mechanism to separate the operations on an image from the way the data is loaded and the processing is scheduled, so we can specify "What to do with the image data" independently from "How the data is loaded and processed".

This is very useful when we're looking at processing images using multicore systems, GPUs, SSE and a bunch of other machine specific optimizations: we have a number of available optimization methods, and the best approach for a given target may not be obvious.

So at the top level Halide allows us to first specify the image processing we apply to the incoming data, and then experiment with different scheduling choices, all whilst ensuring that our processing remains consistent. This lets us rapidly prototype the way we process data to establish the fastest approach on a given target.

For way more information go to http://halide-lang.org

Building it

Clone it from GitHub - you'll also need a recent LLVM, so get that too:

export HALIDE_ROOT=`pwd`

git clone https://github.com/halide/Halide.git
svn co https://llvm.org/svn/llvm-project/llvm/branches/release_37 llvm3.7
svn co https://llvm.org/svn/llvm-project/cfe/branches/release_37 llvm3.7/tools/clang

cd llvm3.7
mkdir build
cd build
cmake -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_TARGETS_TO_BUILD="X86;ARM;NVPTX;AArch64;Mips;PowerPC" -DLLVM_ENABLE_ASSERTIONS=ON -DCMAKE_BUILD_TYPE=Release ..
make -j8

cd ..
export LLVM_CONFIG=`pwd`/build/bin/llvm-config
export CLANG=`pwd`/build/bin/clang
export LLVM_ROOT=`pwd`/build

cd ..

mkdir halide_build
cd halide_build
make -f ../Halide/Makefile -j8


At this point you can build the tutorials with lines like:
  make -f ../Halide/Makefile -j8 tutorial_lesson_02_input_image

Using it

If you're coming back to this in a fresh shell (i.e. not straight after the build above), redo the exports:

export HALIDE_ROOT=`pwd`
cd llvm3.7
export LLVM_CONFIG=`pwd`/build/bin/llvm-config
export CLANG=`pwd`/build/bin/clang
export LLVM_ROOT=`pwd`/build
cd ..

The bare minimum

Halide drops into the C++ language as a handful of extra classes which operate (more or less) transparently as new types, method calls and operations.

Halide operates on dense data structures, representing grayscale images as 2-dimensional arrays and handling colour as a third dimension - so colour is dealt with as separate planes rather than packed RGB.

Although Halide can actually cope with up to 4 dimensions, we'll mostly stick with the "2D and maybe colour" representation for actual images.

Halide expects us to define a set of functions that define how to calculate a point in the output image, and separately how to schedule those functions (i.e. how to feed them data, parallelise operations, etc).

Basic Image handling

To load an image:
#include "Halide.h"
#include "halide_image_io.h"
using namespace Halide::Tools;

...
  Halide::Image<float> input = load_image("input.png");

 
And to save an image:

  save_image(input, "output.png");
 

And build it with a compile line like:
 
  g++ foo.cpp -g \
    -I $HALIDE_ROOT/halide_build/include \
    -I $HALIDE_ROOT/Halide/tools \
    -L $HALIDE_ROOT/halide_build/bin \
    -lHalide \
    `libpng-config --cflags --ldflags` \
    -lpthread \
    -ldl \
    -o foo \
    -std=c++11

Then run it, using the library path prefix for Halide:

  LD_LIBRARY_PATH=$HALIDE_ROOT/halide_build/bin ./foo
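
Putting those pieces together, the smallest useful program is just a load followed by a save - no processing yet, but it proves the includes, library paths and toolchain are all happy (a minimal sketch; "input.png" is whatever test image you have to hand):

#include "Halide.h"
#include "halide_image_io.h"
using namespace Halide::Tools;

int main()
{
  // Round-trip the image: load it, then write it straight back out.
  Halide::Image<float> input = load_image("input.png");
  save_image(input, "output.png");
  return 0;
}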

Processing the Image

To tell Halide how to process the image data we provide a set of pure functions which essentially describe the value at an output pixel in terms of the input pixel data.

When we say that the functions are pure, we mean that there should be no state involved in the processing: no global variables are modified and the input image data is not changed. Regardless of the order in which pixels are processed, we should always produce the same output values for a given set of input data.

All image processing is described as a chain of these pure functions, which transform input co-ordinates to output values.
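
For instance, a trivial two-stage chain might look like the sketch below (hedged: the stage names are made up, and we assume a 2D greyscale input image for simplicity):

  Halide::Func darker, shifted;
  Halide::Var x, y;

  // Stage one halves the input; stage two consumes stage one's output.
  darker(x, y) = input(x, y) / 2.0f;
  shifted(x, y) = darker(x, y) + 0.25f;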

So here's a simple example. First define an input & output image pair:

  Halide::Image<float> input = Halide::Tools::load_image("input.png");   
  Halide::Image<float> output(input.width(), input.height(), input.channels());

Then we define a function:
    Halide::Func brighter;
   
And the variables we will use to index the image data:
    Halide::Var x, y, c;

These are the X location, Y location and Colour channel respectively.
 
Then we define the details of the Halide function, which specifies our input-to-output transform - in this case over our 3-dimensional representation of the output value (i.e. X and Y plus colour). Note that we use Halide's sqrt() rather than the math library version.

    brighter(x, y, c) = (0.9f * Halide::sqrt(input(x, y, c)/2))  + 0.1f;

This is just an arbitrary 1:1 pixel manipulation which turns up the brightness of the image: the function will always produce the same output regardless of the order in which input values are supplied.
   
Then we apply this function to the output: the realize() call compiles the Halide processing pipeline and runs it to generate the output image. We then save the image using the Halide library tools.

  brighter.realize(output);
  Halide::Tools::save_image(output, "output.png");

And that's it. This is currently pretty slow: running it usually takes around 120-130 ms over a single 3856x3856 image (not counting image load and save times).

Scheduling

At this point we can play with the schedule.

Initially we can vectorise the algorithm. This causes us (on x86) to use SSE to issue the calculation as four-wide vectors:

  brighter.vectorize(x, 4);

Doing this and re-running brings run times down to around the 105 ms mark. However, we can also parallelise the operations across the y dimension with:

  brighter.parallel(y);
 
Adding this, our run times fall into the 95-100 ms range. We also played with parallelising the colour channel and with changing the vector width, but neither speeds us up further.

We can also try experimenting with tiling, reordering, unrolling, etc. - see the Halide docs for the complete set. Our example is simple enough that these don't help, though a tiled schedule is sketched below for reference.
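
For reference, a tiled version of the schedule looks something like this (a sketch - the 64x64 tile size is an arbitrary choice):

  Halide::Var x_outer, y_outer, x_inner, y_inner;
  brighter.tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
          .vectorize(x_inner, 4)
          .parallel(y_outer);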

The Code so far

So, here's our simple example processing function, and timing code:

#include <stdio.h>
#include <sys/time.h>

void difftime(const char* prefix, struct timeval* t1)
{
  struct timeval res;
  struct timeval t2;

  gettimeofday(&t2, NULL);
  timersub(&t2, t1, &res);
  printf("%s %d.%06d\n", prefix, (int)res.tv_sec, (int)res.tv_usec);
}

static void process_vec(Halide::Image<float>& thisimg, Halide::Image<float>& outimg)
{
  Halide::Func brighter;
  Halide::Var x, y, c;

  brighter(x, y, c) = (0.9f * Halide::sqrt(thisimg(x, y, c)/2)) + 0.1f;

  brighter.vectorize(x, 4);
  brighter.parallel(y);
  brighter.realize(outimg);
}


static int process_image(const char* nm_in, const char* nm_out)
{
  Halide::Image<float> input = Halide::Tools::load_image(nm_in);
  Halide::Image<float> output(input.width(), input.height(), input.channels());

  for (int i = 0; i < 10; i++)
  {
    struct timeval t1;

    gettimeofday(&t1, NULL);
    process_vec(input, output);
    difftime("Process ", &t1);
  }
  Halide::Tools::save_image(output, nm_out);

  return 0;
}

Startup Costs

The first run of Halide's realize() function is always slow - if we run the process over the image repeatedly, the first run is always significantly longer than the subsequent runs (around 370 ms for the basic version, and about 500 ms in the vectorised case).

This is for a number of reasons - partly we're dragging code and data into the cache, but we also have the overhead of the JIT compile itself, which we pay on every program run.
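
As an aside, if we stick with the JIT path we can at least move the compile cost out of the timed loop - Halide's compile_jit() forces the pipeline to compile without running it (a sketch, reusing brighter and output from the example above):

  brighter.compile_jit();    // pay the JIT compile cost once, up front
  brighter.realize(output);  // subsequent realize() calls reuse the compiled code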

Ideally we don't want to JIT every time: once we have a "good" schedule we just want to use it. The good news is that we can ask Halide to compile the processing code in advance, and apply it to images on demand.

Compiling pipelines in advance

Halide offers a way to compile the pipeline in advance, referring to this as Ahead of Time (AOT) compilation in the docs.

In this scenario we define the Halide function as before; however, rather than realize() it on the image, we use the compile_to_file() function - this generates an object file that we can link into our program, and a header file to interface with the generated image processing code and structures.

Compiling The Function

So for the vectorised code above we have this very simple standalone program, which we compile and run separately:

#include "Halide.h"

int main (int argc, char* argv[])
{
  Halide::Func brighter;
  Halide::Var x, y, c;
  Halide::ImageParam in_img(Halide::type_of<float>(), 3);

  brighter(x, y, c) = (0.9f * Halide::sqrt(in_img(x, y, c)/2))  + 0.1f;
  brighter.vectorize(x, 4);
  brighter.parallel(y);

  std::vector<Halide::Argument> args = {in_img};
  brighter.compile_to_file("processor_obj", args);
}

We describe the "brighter" algorithm as before, add a vector describing the inputs (in this case just the incoming image), and compile the output to the file(s) "processor_obj".

When we compile and run this it produces two new files, "processor_obj.h" and "processor_obj.o" - these are all we need to run the processing function, and we link them into our target application.

Using the Precompiled Object

Now inside the header file we see "buffer_t", which is the structure we use to describe the image data we supply to the processing function, and the output buffer it fills in.

This structure specifies the layout of the image data in memory for Halide; we then simply call the function from the compiled object, supplying populated instances of this buffer type.

One key piece of information is that, as mentioned earlier, Halide expects us to be using colour planes rather than packed RGB values. So instead of having the Red, Green and Blue values for each pixel together in memory, Halide implicitly expects all the Red values for all of the pixels in the image in a 2D array, followed by all the Green values and then all the Blue values.

So now we have the following function, which maps our image data into the necessary structure:

static void make_buffer_refs(float* img, buffer_t* ref, int width, int height)
{
    ref->host = (uint8_t*)img;
    ref->elem_size = sizeof(float);

    ref->stride[0] = 1;             /* consecutive pixels within a line */
    ref->stride[1] = width;         /* line-to-line gap */
    ref->stride[2] = height*width;  /* colour-plane-to-plane gap */

    ref->extent[0] = width;         /* pixels per line */
    ref->extent[1] = height;        /* lines per colour plane */
    ref->extent[2] = 3;             /* three colour planes (RGB) */
}

In this case we're handling an image of float data with the given width and height, and filling in the buffer descriptor "ref".

The two basic entries in the buffer_t structure are host and elem_size: "host" points at the base of the pixel data, and "elem_size" is the size of an individual pixel element - in this case the size of a float.

Following this we tell Halide about the stride information: i.e. how the data is packed into memory and the gaps between elements.

This is the stride[] data - Halide supports up to four dimensions here, and for colour data we use the first three for the per-pixel gap, the per-line gap and the per-colour-plane gap respectively.

In our case everything is tightly packed, so the per-pixel stride is one data element (pixels are tightly packed), the per-line stride is the width (no gap at the end of lines), and the per-plane stride is height*width (each colour plane follows directly on from the preceding one). If we had to deal with sparser data structures (e.g. aligned lines or image planes) we would modify this information, as sketched below.
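
For instance, if each line were padded out to an alignment boundary, only the strides change (a sketch - the 16-float line alignment is a made-up example):

  int padded_width = (width + 15) & ~15;   /* hypothetical 16-element line alignment */

  ref->stride[1] = padded_width;           /* line-to-line gap now includes the padding */
  ref->stride[2] = height * padded_width;  /* as does the plane-to-plane gap */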

We also tell Halide how large the data input is: again we specify the dimensions via an array, in this case the extent[] array, which says we have "width" pixels per line, "height" lines per colour plane, and for this case 3 colour planes (RGB).

This function is for the case where we're dealing with RGB data as floats - if we were dealing with greyscale images we would leave the [2] element of both stride[] and extent[] unset (see the sketch below), and for different data types we would modify the element size accordingly.
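
A greyscale variant might look like this (a sketch following the same pattern - note that the AOT pipeline would also need to be built against a 2-dimensional ImageParam):

static void make_grey_buffer_refs(float* img, buffer_t* ref, int width, int height)
{
    ref->host = (uint8_t*)img;
    ref->elem_size = sizeof(float);

    ref->stride[0] = 1;        /* consecutive pixels within a line */
    ref->stride[1] = width;    /* line-to-line gap */

    ref->extent[0] = width;    /* pixels per line */
    ref->extent[1] = height;   /* stride[2]/extent[2] stay unset for greyscale */
}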

Halide also supports a number of other fields which we can use for smarter memory layouts, but this will do for our case for now.

So to use this we simply have two buffer references, one for the input and one for the output image:

buffer_t input_buf = {0};
buffer_t output_buf = {0};

And we set these for the input and output image data, i.e.:

  make_buffer_refs(img, &input_buf, input_image.size().width, input_image.size().height);
  make_buffer_refs(output_img, &output_buf, input_image.size().width, input_image.size().height);

Then we call the processing function declared in the Halide header, "processor_obj.h" - this is defined as:

 int processor_obj(buffer_t *_in_img_buffer, buffer_t *_brighter_buffer) HALIDE_FUNCTION_ATTRS;

And to use it we call this with the input and output buffer structures:

  processor_obj(&input_buf, &output_buf);

And that's it: timing this we see an initial run of ~190 ms, and around 50 ms when called repeatedly (a bit less than twice as fast as the JIT-ed version overall).

One interesting side effect of the ahead of time compilation is that we don't have to link with the Halide libraries any more - just the compiled processing object file; so with OpenCV providing our image load/store code we can compile with a line like:

  g++ process.cpp processor_obj.o \
   `pkg-config opencv --cflags --libs` -lpthread -ldl -o process

to generate our executable. This gives us a neat workflow: we play with the JIT version of the code to find the "best" version of the processing pipeline, then generate the object and header and copy them into our target application build - we can forget all about Halide at this point and just treat the function as a black box, which is nice.

On that OpenCV code: for the example it's dumber than a bag of spanners (it's just a mock-up to get data in and out of the Halide domain), but for completeness here it is:

#include <opencv2/opencv.hpp>
#include <stdlib.h>
#include <sys/time.h>
#include "processor_obj.h"  // the AOT-generated interface

buffer_t input_buf = {0};
buffer_t output_buf = {0};

cv::Mat input_image = cv::imread("input.png");

  input_image.convertTo(input_image, CV_32FC3); // Convert to 3 channel floats

float* img;
float* tgt;

  int img_sz = input_image.size().width * input_image.size().height * 3*sizeof(float);
  img = (float*)malloc(img_sz);
  tgt = img;

  float* output_img;
  output_img = (float*)malloc(img_sz);

  // Fill the planar buffer in the order the strides describe:
  // x varies fastest, then y, then the colour plane.
  for (int channel = 0; channel < 3 ; channel++)
    for (int y = 0; y < input_image.size().height ; y++)
      for (int x = 0; x < input_image.size().width ; x++)
      {
        cv::Vec3f px_vl = input_image.at<cv::Vec3f>(cv::Point(x,y));
        float value = px_vl.val[channel];
        *tgt++ = (value/255.0f);
      }

  make_buffer_refs(img, &input_buf, input_image.size().width, input_image.size().height);
  make_buffer_refs(output_img, &output_buf, input_image.size().width, input_image.size().height);

  for (int i=0 ; i < 10 ; i++)
  {
    struct timeval t1;

    gettimeofday(&t1, NULL);
    processor_obj(&input_buf, &output_buf);
    difftime("Process ", &t1);
  }

  cv::Mat output;
  output.create(input_image.size().height, input_image.size().width, CV_32FC3); // rows, then cols

  tgt = output_img;

  // Walk the output planes in the same order to rebuild the packed image.
  for (int channel = 0; channel < 3 ; channel++)
    for (int y = 0; y < input_image.size().height ; y++)
      for (int x = 0; x < input_image.size().width ; x++)
      {
        float value = *tgt++;
        cv::Vec3f px_vl = output.at<cv::Vec3f>(cv::Point(x,y));
        px_vl.val[channel] = value*255.0f;
        output.at<cv::Vec3f>(cv::Point(x,y)) = px_vl;
      }
  cv::imwrite("opencv_output.png", output);


Note the rather hacky 255.0 scaling of float values (our original code expects 0.0-1.0 values, but OpenCV's convertTo produces floats from 0.0 to 255.0) and the cheesy read/modify/write of pixel values on the transfer to the image output.

Also, if we're feeling picky, we should do something about converting OpenCV's colour order (BGR) to RGB in the planes, but we're lazy and it doesn't matter for this test - though the fix is sketched below.
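
If we did care, the fix is a one-liner on each side of the processing (assuming OpenCV's cvtColor):

  cv::cvtColor(input_image, input_image, CV_BGR2RGB);  // before splitting into planes
  ...
  cv::cvtColor(output, output, CV_RGB2BGR);            // before imwrite()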

Tuesday, 17 May 2016

A Better PDS Parser

The Improved Parser



This (source tarball via Google Drive) is a slightly better PDS parser than we've made previously.

Like most parsers it's actually fairly uninteresting: there are a number of oddities in the format that would make the logic a little convoluted as a simple regexp parser - specifically, the field separator (\r\n) can appear in several places without indicating a new field.

However using this parser we can handle several new files as well as Rosetta, including the Venus Express VMC (which we used the VICAR header from previously) and the Mars Express files - which leads us to the more interesting problem of parsing Mars Express files.

Mars Express Files

These are 16bpp greyscale images; one of the interesting things in the Mars Express data, however, is the presence of prefix data on each image line.

The details of the binary prefix are in the document HRSC_LABEL_HEADER.PDF in the data archive documentation directory. This is a 68-byte header which sits in front of every line in the image data.

This has several fields, but the main thing to consider initially is the leading 12 bytes: an 8-byte double and a 4-byte float, which provide the Ephemeris time and Exposure time for each line. The Ephemeris time is the time the line was sampled, and the Exposure time is the time in milliseconds it took to acquire the line.
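
A minimal sketch of pulling those two fields out of a prefix (LinePrefix and parse_line_prefix() are made-up names, and we assume the bytes are already in the host's byte order - real files may need a byte swap depending on the label's declared data type):

#include <stdint.h>
#include <string.h>

struct LinePrefix {
    double ephemeris_time;  /* when this line was sampled */
    float  exposure_ms;     /* how long the line took to acquire, in ms */
};

static LinePrefix parse_line_prefix(const uint8_t* prefix)
{
    LinePrefix out;
    memcpy(&out.ephemeris_time, prefix, sizeof(double));              /* bytes 0-7 */
    memcpy(&out.exposure_ms, prefix + sizeof(double), sizeof(float)); /* bytes 8-11 */
    return out;
}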

In theory, then, the start (Ephemeris) time of each line should be the start time of the previous line plus the exposure time. Using this information we can spot gaps in the incoming record, since the Ephemeris time will jump unexpectedly.

It's complicated by the fact that the exposure time is not constant over a file: it changes over the sequence of a capture, and so across particularly large gaps the line calculation will be off.

Also, in a couple of files the gap isn't exactly the exposure time - for example H0165_0068_P12.IMG has a consistent double spacing (this could be me screwing up, of course).

Ideally we should do multiple passes, building up a list of capture times and intervals, and fill gaps in from that. For now, however, we just take the case where we skip 3 or more exposure times between lines and flag that as a gap, as in the sketch below.
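
The check itself is simple (a sketch using the hypothetical LinePrefix from above; the millisecond-to-second conversion assumes the Ephemeris times are in seconds):

static bool line_gap(const LinePrefix& prev, const LinePrefix& cur)
{
    double exposure_s = prev.exposure_ms / 1000.0;            /* ms -> s */
    double delta = cur.ephemeris_time - prev.ephemeris_time;  /* actual line spacing */
    return delta >= 3.0 * exposure_s;                         /* flag jumps of 3+ exposures */
}
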

Be warned that the Mars Express file can be very large, i.e. HRSC H0165_0000_ND2.IMG has a resolution of 5176x256600 and a size of ~1G