Profile!

We all agree that premature optimization is bad, right? Why do some of us still do it?

I have some pixel art that I want to magnify, and I like the way that Maxim Stepin’s hq2x family of filters look. A quick search led me to the hqx project that implements the filter and gives you a command line tool to run it offline. Downloaded it, compiled it, and gave it a try:

jfischer@solace:~/hqx_test$ file test.png
test.png: PNG image, 16 x 24, 8-bit/color RGBA, non-interlaced
jfischer@solace:~/hqx_test$ time ~/bin/hqx test.png out.png

real    0m1.777s
user    0m1.680s
sys     0m0.096s

Ok, yeah, nearly 2 seconds to convert a 16x24 image isn’t working for me. This is supposed to be a realtime algorithm; there’s no reason it should be this slow. I took a brief look at the code and spotted what I thought was the culprit, but just to be sure I profiled the program.

First, I rebuilt the program with profiling enabled (-pg for gcc):

jfischer@solace:~/hqx_test$ cd ~/hqx-read-only/src/
jfischer@solace:~/hqx-read-only/src$ gcc -o ~/bin/hqx -I. -lIL -pg *.c

Now whenever I run ~/bin/hqx it’ll write out a file named gmon.out in my current working directory with the information the profiler gprof needs. Let’s give that a try:

jfischer@solace:~/hqx_test$ time ~/bin/hqx test.png out.png

real	0m1.719s
user	0m1.604s
sys	0m0.112s
jfischer@solace:~/hqx_test$ gprof ~/bin/hqx gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.00      1.51     1.51        1     1.51     1.51  hqxInit
  0.00      1.51     0.00     3002     0.00     0.00  rgb_to_yuv
  0.00      1.51     0.00     2978     0.00     0.00  Interpolate_2
  0.00      1.51     0.00     2184     0.00     0.00  yuv_diff
  0.00      1.51     0.00     1957     0.00     0.00  Interpolate_3
  0.00      1.51     0.00     1162     0.00     0.00  Interp8
  0.00      1.51     0.00     1121     0.00     0.00  Interp3
  0.00      1.51     0.00      960     0.00     0.00  Interp6
  0.00      1.51     0.00      530     0.00     0.00  Interp2
  0.00      1.51     0.00      467     0.00     0.00  Interp7
  0.00      1.51     0.00      450     0.00     0.00  Interp1
  0.00      1.51     0.00      434     0.00     0.00  Diff
  0.00      1.51     0.00      245     0.00     0.00  Interp5
  0.00      1.51     0.00        1     0.00     0.00  hq4x_32
  0.00      1.51     0.00        1     0.00     0.00  hq4x_32_rb

Yup, hqxInit is what I thought the culprit was. Let’s take a look at that function:

uint32_t   RGBtoYUV[16777216];

HQX_API void HQX_CALLCONV hqxInit(void)
{
    /* Initalize RGB to YUV lookup table */
    uint32_t c, r, g, b, y, u, v;
    for (c = 0; c < 16777215; c++) {
        r = (c & 0xFF0000) >> 16;
        g = (c & 0x00FF00) >> 8;
        b = c & 0x0000FF;
        y = (uint32_t)(0.299*r + 0.587*g + 0.114*b);
        u = (uint32_t)(-0.169*r - 0.331*g + 0.5*b) + 128;
        v = (uint32_t)(0.5*r - 0.419*g - 0.081*b) + 128;
        RGBtoYUV[c] = (y << 16) + (u << 8) + v;
    }
}

hqxInit spends a second and a half at startup time to create a lookup table to convert any 8-bit per channel RGB value to the equivalent YUV value. The rest of the program finishes up pretty much instantaneously. It also uses 64 MB of memory to do this. I can understand why the author (who is not Maxim Stepin, as far as I can tell) would do this: hqx is made to be usable as a library for realtime filtering in a larger program. For the command line tool though, this is ridiculous.

For the command line tool usage, there’s a super easy way to speed this up: just drop the lookup table and do the math to convert the RGB value to YUV every time. Let’s give that a try and replace the existing rgb_to_yuv function with one that doesn’t use the lookup table:

/* Lookup table version
static inline uint32_t rgb_to_yuv(uint32_t c)
{
    // Mask against MASK_RGB to discard the alpha channel
    return RGBtoYUV[MASK_RGB & c];
}
*/

static inline uint32_t rgb_to_yuv(uint32_t c)
{
    /* Mask the alpha channel out of the color */
    c = MASK_RGB & c;
    uint32_t r, g, b, y, u, v;

    r = (c & 0xFF0000) >> 16;
    g = (c & 0x00FF00) >> 8;
    b = c & 0x0000FF;
    y = (uint32_t)(0.299*r + 0.587*g + 0.114*b);
    u = (uint32_t)(-0.169*r - 0.331*g + 0.5*b) + 128;
    v = (uint32_t)(0.5*r - 0.419*g - 0.081*b) + 128;
    return (y << 16) + (u << 8) + v;
}

Then remove the call to hqxInit from the beginning of the program. With those changes, the program runs the way I expected it to:

jfischer@solace:~/hqx_test$ time ~/bin/hqx -s 2 test.png out.png

real    0m0.016s
user    0m0.008s
sys     0m0.004s

jfischer@solace:~/hqx_test$ gprof ~/bin/hqx gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00     3002     0.00     0.00  rgb_to_yuv
  0.00      0.00     0.00     2184     0.00     0.00  yuv_diff
  0.00      0.00     0.00      747     0.00     0.00  Interpolate_3
  0.00      0.00     0.00      602     0.00     0.00  Interp2
  0.00      0.00     0.00      549     0.00     0.00  Interp1
...

So for my use, it’s fixed. If the author really wants to keep the lookup table around, there are some other ways to approach it that would avoid the startup penalty. The first one that comes to mind is memoization: the first time you look up a particular RGB value, calculate its YUV equivalent and store it. Subsequent lookups for that value can skip the calculation and just retrieve the stored value. It’d look something like this:

extern uint32_t RGBtoYUV[16777216];

static inline uint32_t rgb_to_yuv(uint32_t c)
{
    /* Mask the alpha channel out of the color */
    c = MASK_RGB & c;

    if (RGBtoYUV[c] == 0) {
        uint32_t r, g, b, y, u, v;

        r = (c & 0xFF0000) >> 16;
        g = (c & 0x00FF00) >> 8;
        b = c & 0x0000FF;
        y = (uint32_t)(0.299*r + 0.587*g + 0.114*b);
        u = (uint32_t)(-0.169*r - 0.331*g + 0.5*b) + 128;
        v = (uint32_t)(0.5*r - 0.419*g - 0.081*b) + 128;
        RGBtoYUV[c] = (y << 16) + (u << 8) + v;
    }

    return RGBtoYUV[c];
}

That way, you only generate lookup entries for the RGB values you actually use. You’re still burning 64 MB of memory for that table though, and your access pattern into it is probably going to be all over the place, wreaking havoc on the CPUs caches. Plus, this way you’re introducing a branch that wasn’t there before. You’d have to test and profile a bunch to determine the most performant way to do this for a particular application, but my gut feeling is that skipping the lookup table and just doing the math every time is the fastest way to go.