## Compile-time optimization of data structures?

Or: “it’s all fun and games until you realize you’re a monkey”. Bob brings up a big point about the expressiveness of a language and its relation to concise code. In a similar carpal-tunnel vein (though certainly narrower in scope), I’ve had trouble finding decent work on compile-time optimization of data structures, driven by hints about space and time trade-offs and how the structures are intended to be used.

In other words, if you enjoyed and/or revisited an undergraduate course on data structures, chances are good that, for any given problem at hand, you can concoct some monstrosity of a hybrid data structure that keeps both efficiency and your use of the code in mind.

The handwavy question I’m posing is: how much of concocting this crazy structure is actually clever thinking about the problem at hand, and how much is “this data structure is good for this, and that one for that” kind of algorithmic work? The end result of the algorithmic work is a tired hand and wordy, unexpressive code. I have a sneaking suspicion that a lot of the work is indeed algorithmic, and it’s worth exploring what can be automated at compile time based on potentially simple rules and patterns.

Perhaps a good example to keep things somewhat concrete is the notion of a set, essentially an unordered collection of unique elements. It has some basic binary operations like union and intersection, and there are a variety of things one can imagine doing with a set: query for membership of an element, add an element to it, etc. [1] For instance, in a graph, every node might have an associated set of nodes it’s connected to [2], or the graph may alternatively be represented as a set of edges and a set of nodes, where each edge is a set of two nodes. Now, the last thing I’m advocating is always using some mathematical structure to express what objects you’re dealing with in your program, but it does turn out that a “set object” can be quite handy. Here is some simple Python code that reads a list of file names from standard input and prints the number of unique files, treating each file as a set of its lines (keep in mind there can be duplicate lines as well as entirely duplicate files):

```python
import sys

uniqpars = set()
for parfile in sys.stdin:
    parset = set()
    pf = open(parfile.rstrip())
    for line in pf:
        parset.add(line)
    pf.close()
    # Freeze the per-file set so it's hashable and can itself
    # be a member of the outer set; duplicate files collapse here.
    uniqpars.add(frozenset(parset))
print(len(uniqpars))
```

The fact that the example above uses a set in a traditional object-oriented way [3] is somewhat beside the point. What sometimes matters are simple questions: how fast are membership and addition operations? Are we doing a lot of unions? How much space does my set actually take? There are many ways to implement a set depending on how you’re using it. A few examples:

Super-fast lookup, but potential memory or false positive concerns:

- A directly-mapped array or bitset if you have enough space to store your entire possible set of members.
- A valueless hash table if you have a decent amount of memory around, a potentially super-large set of possible members and want to store only the members you see. These also have super-quick lookup and insertion/deletion times.
- A Bloom filter if memory is a concern, you have a super-large set of possible members, don’t mind not storing the members, and are willing to accept a small probability of false positives.

The structures above are great for fast operations, provided the set sizes don’t fluctuate much and you either have the memory or are willing to deal with false positives.
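To make the Bloom-filter trade-off concrete, here’s a toy sketch. The bit-array size, the double-hashing scheme, and the class name are all arbitrary choices of mine, not a canonical design:

```python
import hashlib

class BloomFilter(object):
    """A toy Bloom filter: constant space, members themselves are never
    stored, and membership tests may yield false positives (but never
    false negatives)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # one big integer used as a bit array

    def _positions(self, item):
        # Derive several bit positions from a single digest
        # (a simple double-hashing trick).
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

Anything added is guaranteed to test positive afterward; the false-positive rate is then governed by how full the bit array gets relative to the number of hash functions.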

- Binary search trees, especially the self-balancing kind, are great for sets whose sizes fluctuate a lot [4] or where many union operations are being done, for example (they preserve an ordering of the elements that can be useful). If the maximum or minimum element needs to be accessed quickly, binary heaps are often a good way to implement those kinds of sets.
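As a tiny illustration of the min-access case, Python’s `heapq` module gives O(1) peeks at the smallest element; pairing it with a hash set for membership is my own quick sketch, not a standard structure:

```python
import heapq

# A toy "min-aware set": a hash set for O(1) membership plus a binary
# heap for O(1) peeks at the minimum element.
members = set()
heap = []

def add(x):
    if x not in members:
        members.add(x)
        heapq.heappush(heap, x)  # O(log n) insertion

def minimum():
    return heap[0]  # smallest element in O(1); removal would be O(log n)

for value in [42, 7, 19, 7]:  # the duplicate 7 is ignored
    add(value)

print(minimum())     # 7
print(19 in members) # True
```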

Obviously, there are a variety of ways to implement this kind of object. Object-oriented languages like C++ and Java have ways to separate the notion of an object from its implementation [5], but I reiterate, the object-orientedness is not necessarily the point. It’s simply a convenient way to organize and express the fact that you’re using a set object with commonly-used functions.
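For what it’s worth, the same separation is easy to sketch in Python with an abstract base class. The two implementations below (the names are mine) expose one interface but make different space/time trade-offs:

```python
from abc import ABC, abstractmethod
import bisect

class SetLike(ABC):
    """A minimal 'set object' interface; client code depends only on
    this, so implementations can be swapped freely."""
    @abstractmethod
    def add(self, item): ...
    @abstractmethod
    def __contains__(self, item): ...

class HashSet(SetLike):
    """Backed by the built-in hash set: fast membership and insertion."""
    def __init__(self):
        self._items = set()
    def add(self, item):
        self._items.add(item)
    def __contains__(self, item):
        return item in self._items

class SortedListSet(SetLike):
    """Keeps members in sorted order: slower insertion, but the
    ordering is preserved (handy for range or min/max queries)."""
    def __init__(self):
        self._items = []
    def add(self, item):
        pos = bisect.bisect_left(self._items, item)
        if pos == len(self._items) or self._items[pos] != item:
            self._items.insert(pos, item)
    def __contains__(self, item):
        pos = bisect.bisect_left(self._items, item)
        return pos < len(self._items) and self._items[pos] == item
```

Code written against `SetLike` runs unchanged with either implementation, which is exactly the hook a compile-time optimizer would want.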

What seems deserving of more attention is this: if there are handy structures with well-defined operations, like the set, it’s probably worth using those structures and optimizing them for what we need based on some helpful directives at compile time [6]. In other words, suppose I’m using some objects with well-defined operations, like sets, in combination with each other. Suppose I also have a way to express how I’m using my information with these objects. Finally, suppose I can use techniques that some resourceful computer scientists have come up with to piece these objects together in a relatively optimal way based on how I plan to use them. And presto! We could potentially do a lot of work from the get-go to spare the programmer some implementation details and still wind up with some pretty efficient code.
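Python has no real compile-time hook for this, but the flavor of it can be mocked up at construction time: a factory that picks an implementation from usage hints. All the hint names and selection rules below are invented for illustration:

```python
class BitsetSet:
    """Direct-mapped bitset over a small integer universe:
    O(1) membership, one bit per possible member."""
    def __init__(self, universe_size):
        self.universe_size = universe_size
        self.bits = 0
    def add(self, n):
        self.bits |= 1 << n
    def __contains__(self, n):
        return bool(self.bits & (1 << n))

def make_set(universe_size=None, memory_tight=False):
    """Choose a set implementation from usage 'directives' -- a runtime
    stand-in for what compile-time selection might look like."""
    if universe_size is not None and not memory_tight:
        # Small, known universe and memory to spare: direct-mapped bits.
        return BitsetSet(universe_size)
    # A Bloom filter would slot in when memory is tight and false
    # positives are acceptable; fall back to a plain hash set here.
    return set()
```

The programmer states intent (`universe_size`, `memory_tight`); the selection logic, however crude here, is the part one could imagine a compiler doing from declarative annotations.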

OK, so there are a lot of assumptions there, but I imagine one can begin by cataloging more examples of commonly-used objects, their uses, and common implementations, both on their own and as hybrids with other data structures. Perhaps some interesting rules and techniques would come out of it.

And all that just reminded me of some Ph.D. work a friend is doing at the moment. Maybe this topic will be revisited.

——

[1] Note that I’m just focusing on data structures and basic operations as well as their optimization with other data structures and ignoring the more language-oriented aspects.

[2] Often implemented as an adjacency list.

[3] That is, the way it can check equality of sets on insertion is the fact that each member is some object that has some defined way of saying it’s “equal” to another object. In the example above, there is a set of lines and a “set of sets of those lines”. With object-oriented techniques the “equals” function works for both even though the member types are different (i.e. a “line”, and a “set of lines”).

[4] As a brief motivator, a custom allocator can be associated with a data structure to deal with this issue (though it can get hairy at times). A good review of allocation techniques as well as some implementation motivation can be found at this starting point.

[5] Though C++ is super-guilty of not using a multiple-inheritance trick to do this in practice.

[6] I don’t think this is too far-fetched given we’ve done quite a lot of work in linear algebra and higher-level matrix-oriented languages tend to do a good job with running optimized routines based on a few directives.