Algorithmically Speaking...

Caching Static Strings

18/7/2012

So, it's taken me long enough to get around to it but here it is, my first actual blog post. I'm going to start with something really basic, partly because I'm not yet comfortable with putting my own stuff out there yet, and partly because I'd like to aim some stuff at the 'newbies' about the dead-simple tricks and pitfalls that many 'seasoned' coders take for granted.

Before I get started, bear in mind that all code will probably be in C/C++, under the Works On My Machine certification, so if it doesn't work for you and there are no obvious errors, you may have to port/refactor it yourself. Where the size of variables matters, they are defined as follows:

UI64 - unsigned 64-bit integer
UINT - unsigned 32-bit integer
WORD - unsigned 16-bit integer
BYTE - unsigned 8-bit integer
XXXX * - pointer of type XXXX

SI64 - signed 64-bit integer
SINT - signed 32-bit integer
word - signed 16-bit integer
byte - signed 8-bit integer

I ummed and ahhed for ages over what I should write about first, but a past posting on another forum about strings gave me the incentive to write about efficient static-string handling; it's a simple task that is sometimes taken for granted and often left unoptimised; it's also something simple enough that I can write about it without making myself into too big of a dill. Firstly, I'll start by recalling a rule of thumb coined by Michael Abrash (who amongst his many other accomplishments, was a major contributor to the original Quake source) in his wonderful Black Book which I still live by today:

Know Your Data

The idea behind KYD is this: if you know exactly what format your data takes and how it should be processed ahead of time, then you can often make certain assumptions as to how to speed that processing up. Conversely, if you don't know your data through-and-through, you can end up making false assumptions, producing code that is buggy and/or does more than it really needs to.

As an example, I'm going to use file/path lists as a base, and optimise them using an indexed string cache; file/path lists tend to contain often-repeated strings, making them perfect for demonstrating the benefits of 'string indexing' as opposed to 'string storage'.

Consider the following list of files and imagine logging them starting at 'C:/Data':

    C:/Data/Path1/File1.dat
    C:/Data/Path1/File2.dat
    C:/Data/Path1/File3.dat
    C:/Data/Path2/File1.dat
    C:/Data/Path2/File2.dat
    C:/Data/Path2/File3.dat
    C:/Data/Path3/File1.dat
    C:/Data/Path3/File2.dat
    C:/Data/Path3/File3.dat

A naive approach might do something like this:

    typedef struct
    {
        char name[MAXPATH+1]; // NB: The +1 accounts for the 0 string terminator
        UI64 size;
    }
    FILEPATH;

    std::vector<FILEPATH> list;

Producing:

    list[0] { "C:/Data/Path1/File1.dat",10240 }
    list[1] { "C:/Data/Path1/File2.dat",10240 }
    list[2] { "C:/Data/Path1/File3.dat",10240 }
    list[3] { "C:/Data/Path2/File1.dat",10240 }
    list[4] { "C:/Data/Path2/File2.dat",10240 }
    list[5] { "C:/Data/Path2/File3.dat",10240 }
    list[6] { "C:/Data/Path3/File1.dat",10240 }
    list[7] { "C:/Data/Path3/File2.dat",10240 }
    list[8] { "C:/Data/Path3/File3.dat",10240 }

When filled, [list] would be a vector of files which one could simply iterate sequentially, storing the entire filename in [FILEPATH->name]. The trouble is, most people have embraced the use of long filenames so doing this would be useless, as MAXPATH is usually defined as 260 bytes total and that is hardly enough to store many fully-qualified filenames. Looking carefully at the list again, two main assumptions can be made about file/path names:

Every file/path's fully qualified name is the sum of it's own name and the names of its respective root paths;
Every file/path in the same path has the same 'root' name.

So, C:/Data/Path1 is the root for both C:/Data/Path1/File1 and C:/Data/Path1/File2, right? Using that as a guide, a better approach would be to use a tree something like this (somewhat simplified):

    typedef struct FILEPATH
    {
        char name[MAXPATH+1];
        UI64 size;
        struct FILEPATH *root;
        struct FILEPATH *path;
        struct FILEPATH *file;
    }
    FILEPATH;

    FILEPATH *tree;

Producing:

    [tree] { "C:/Data",0 }
        [->path] { "Path1",0,... }
            [->file] { "File1",10240,... }
            [->file] { "File2",10240,... }
            [->file] { "File3",10240,... }
        [->path] { "Path2",0 }
            [->file] { "File1",10240,... }
            [->file] { "File2",10240,... }
            [->file] { "File3",10240,... }
        [->path] { "Path3",0 }
            [->file] { "File1",10240,... }
            [->file] { "File2",10240,... }
            [->file] { "File3",10240,... }

This contains the same data, but in a tree format that can be recursed at will which allows fully-qualified filenames longer than 260 bytes; retrieving a fully-qualified filename is simply a matter of concatenating the name of the previous root path to the current path, stepping back to the previous root path, and repeating until the current root is [tree]. It also exposes the fact that in any path containing more than one file/subpath, fully-qualified filenames contain data that would be repeated in other files/subpaths in the same path; this is exactly the kind of thing to look for when optimising code.

This all may seem very obvious - after all, we almost always see files within tree structures - but the principal of breaking problems down into the most simple steps is something many people forget to do, and as a result the obvious is sometimes overlooked. This comes under the definition of knowing your data; knowing why a tree structure is a good-fit solution for this data is key to knowing how to optimise it.

String theory

Assuming the use of such a format for logging a path, we are still left with an issue that must be addressed, that of memory usage and overheads. For a small path with minimal files/subpaths, the above would work perfectly well; for a path containg thousands of files and subpaths however, it's clear that the tree is still going to hog a lot of memory for name storage. What's worse is that much of that space is actually wasted, because the average length of an individual file/path name is much less than the 260 bytes allocated for each and every tree entry.One way around that could be to replace the 'static' name buffer and dynamically allocate just enough space per entry to fit the name (ie: replace 'char []' with 'char *'); this reduces the memory usage to acceptable limits, but suffers from a loss of performance from the name allocation, and still-wasted memory due to the size of the 'char *' pointer and the overhead from the name allocation. I used this method a lot in the past because (at the time) it was quick to code and offered at least a mild memory optimisation in many cases, but knowing the bottlenecks I was determined to figure out a better way.

It eventually occurred to me that the repetition inherent in file/path trees could work to my advantage; by storing a string only once and referencing it as required, any number of files/paths could use the same string as often as required without adding extra overhead. This prompted the design of a simple 'string cache', a block of memory which would store any string just once and return a pointer to the string when required. Initially I used 'char *' to link into the cache, but that opened up a whole new set of 'pointer-based' issues; if the cache gets full it cannot be resized without resetting every pointer into it, for example. As a result, I finally decided the best solution was as follows:

   typedef struct FILEPATH
   {
       UINT name;
       //...
   }
   FILEPATH;

   char *stringcache; ( == "\0C:/Data\0Path1\0File1\0File2\0File3\0Path2\0Path3\0" )

In this case, [stringcache] is a block of memory containing all of the unique strings (0-terminated) and [FILEPATH->name] is an index into [stringcache] of the first character of the file/path name. In this case, [stringcache] can be resized or appended to without compromising any 'pointers' to existing strings; existing strings will always start at the same index, new strings will have increasingly higher indices, and a string lookup is a simple as char *look = stringcache + FILEPATH->name. Also note the 0-terminator at the start of [stringcache], which effectively defines index 0 as a NULL string; without this, an index of 0 would denote a real string which can cause various niggling bugs.

So, that's it, we have a way to store and reference static strings without bogging the system down in overheads and reducing duplicate strings, right? Yes...and no. Having somewhere to store the strings and a way to read them back is all well-and-good, but there is a rather large gap in the process: how is the actual string storage performed, and does it really matter anyway? Yes, it certainly matters; reading the strings is easy, low-overhead and fast, but a bad storage method will make adding strings a lot slower than it can be. Once again, it helps to Know Your Data.

Boyer-Moore-or-less

The obvious, brute-force method of adding a string to the cache is to compare the new string to each existing string in turn, returning the index of the existing string if it matches or appending the new string to the cache and returning the new index. Fine for a few hundred short strings, not so fun for logging the contents of an umpteen-gigabyte path. Also, given that the cache itself is a bunch of 0-terminated strings stuck together end-to-end, a Knuth-Pratt-Morris or Boyer-Moore implementation would be wasted here; those algorithms as written are better used searching for substrings in large blocks of text, which the string cache technically is not. However, taking a small leaf from Boyer-Moore and scanning back-to-front produces quite fast results, and once again there are some simple assumptions that can be made when scanning the cache backwards for matches:

More character-match runs tend to occur at the start of strings rather than at the end, so scanning backwards often discards non-matches sooner;
All cached strings must be 0-terminated, so as soon as a character-match fails we can skip to the next string.

Point 1 is a huge one; think of the innocent until proven guilty rule, and imagine the difference if that rule is reversed - the onus of the defendant changes. In the case of strings, you can scan from the end and quit as soon as you discover the final respective characters of two strings don't match...or you could scan from the start only to find that those strings only differ by that last character, therefore wasting time on redundant comparisons. The difference can be quite drastic. As an example:

   "an insanely long file name.dat\0"
   "an insanely long file name.bak\0"

Scanning from the front requires 27 'matches' to be found before the mismatched characters are found; scanning from the back effectively finds the mismatch on the first comparison (ignoring the NULL terminators), and is clearly faster. Given the nature of the data being used, this simple shift in thinking produces consistently faster results in the long run; note that this may not always be the case, as a 'bad' set of paths could still produce a cache that takes worst-case time to scan.

Point 2 is an algorithm-dependant extension to Point 1; to demonstrate it I'll extend the example to this:

                                                       "an insanely long file name.dat\0" <- A
   "this is an insanely long file name.dat\0an insanely long file name.bak\0"
             "an insanely long file name.dat\0"                                            <- B

In this example, A will produce the same mismatch as before, and so it skips to B; B shows where the next match could (and does) occur. In this case, B does not technically match the entire string, but it does match a complete, 0-terminated substring and is therefore considered a match. This effectively provides more potential for removing duplicated strings with no extra overhead, because under certain circumstances substrings can be used as complete strings; during initial testing I found that in certain cases (such as logging large paths full of music or image files) substring matching happens surprisingly often. Of course, this works best if earlier strings are longer than later strings to increase the likelihood of a substring match; if that option is available, use it.

Conclusion

That pretty-much concludes my algorithm for a static-string cache; the Snippet Swappit page contains a complete C++ class similar to that which resides in my personal toolbox and is at the core of much of my string-handling code. Note that a major (and here undocumented) feature of that class is a vector of cached string lengths; though the cache will work without it, I've found that caching the lengths saves repeated calls to 'strlen()' and often results in faster code at the obvious cost of memory, but your milage may vary.

In any case, I hope this provides some food for thought, and shows that even the simplest of tasks can be optimised simply by knowing exactly what your data is and what it's meant to do. And if you can figure out a way to make it better - or a better algorithm entirely - I'd love to see it; there is no such thing as the best or fastest code (another Abrash-ism) and if there was, I sure don't have it. :P

0 Comments