Geet Duggal

Explorations in the Computer and Natural Sciences

Archive for the ‘Compilation and Languages’ Category

Flex and Bison Fun

without comments

So I’ve been writing parsers lately with Flex and Bison.

Some notes of interest:

  • The “info” pages on both are really helpful and I’ve learned most of what I needed from them (as opposed to Googling around or finding a book)
  • The Regular Expression Explorer has been quite handy
  • While learning, I wrote a bunch of toy parsers, and I realized that I was copying “template code” a lot. This seemed unnecessary. A tool like this simply takes a BNF-like input and outputs into different languages.  Seems pretty cool but haven’t played with it too much yet.

Fun.

Written by Geet

August 13, 2009 at 9:24 pm

How can I trust you, code?

without comments

I’ve been programming a bit lately in C++ Boost for two reasons (1) I find the template metaprogramming aspects intriguing and the generic programming aspects super-convenient. Yet, when I’m really concerned with knowing precisely what’s going on to avoid wasting CPU cycles and memory, I go back to C. But then of course C is doing all sorts of things for me as well (just see the difference in speed in your code with the -O3 optimization flag).

One trend I’ve noticed after reading a lot of code is that bit-twiddlers will often over-optimize. I’m more and more convinced that this not only affects the readability of the code, but it may also be indicative of a lack of trust in the compiler’s optimization. The more control you have, the better you feel, right?

On the other end of things, some people trust objects and functions as black boxes and could care less what’s going on underneath. This post is not for them.

I hate to pull the big CS “language” and “compiler” words out of my pocket again [1], but how can I trust you language and compiler? When do I know I’m over-twiddling and it’s better that the compiler do the work for me? While there are guidelines, I think what it comes down to is a decent understanding of the compiler you’re dealing with and then, basic experimentation.

Most of my questions about whether certain optimizations are actually occuring are typically answered by doing timing tests, creating little “experiment” code bits to test a simple hypothesis, or look directly at the assembly code (e.g. the “-S” flag in gcc).

This did get me thinking, though, with so many different compilers and architectures–even with standards and benchmarks–a lot of these little questions aren’t answered. This process of creating precise tests and getting the answers seems to be the best approach [2].

—–
[1] But I do believe these topics have some of the most interesting research opportunities from a CS perspective. As I foray into scientific computing and its applications to the sciences, I hope I will have the chance to tackle a CSee problem or two along the way.
[2] Wouldn’t it be nice if there was a nice repository of contributed tests for a variety of architectures, compilers, and languages, and that these tests can be tagged with all this information? In the spirit of my last post, I say let’s start a massive conglomerate blog where people can post these tests. And then there could be another one where people just bitch about the goodness and badness of the tests. It’ll be like a somewhat moderated coding newsgroup for today’s age. Oh how I’ve blurred the line between honest desire and sarcasm here.

Written by Geet

November 12, 2008 at 12:32 am

Compile-time optimization of data structures?

without comments

Or: “it’s all fun and games until you realize you’re a monkey. Bob brings up a big point about the expressiveness of a language and its relation to concise code. In a similar carpal tunnel vein (but certainly a bit less meaningful in scope) I find it difficult to find decent work done on compile-time optimization of data structures based on suggestions for space and time efficiencies and how they intend to be used.

In other words, if you enjoyed and/or revisited an undergraduate course on data structures, based on any given problem at hand, the chances are good that you can concoct some monstrosity of a hybrid data structure that keeps efficiency and your use of the code in mind.

The handwavy question I’m posing is, what aspects of concocting this crazy structure are actually due to clever thinking about the problem at hand vs. “this data structure is good for this, and this one for that” kind of algorithmic work? The end result of the algorithmic work is a tired hand and wordy, unexpressive code. I have this sneaking suspicion that a lot of the work is indeed algorithmic and it’s worth exploring what can be automated at compile-time based on potentially simple rules and patterns.

Perhaps a good example to keep things somewhat concrete is the notion of a set, essentially an unordered collection of unique elements. It has some basic binary operations like union and intersection and there are a variety of things one imagine doing with a set: query for membership of an element, add an element to it, etc [1]. For instance, in a graph, every node might have an associated set of nodes it’s connected to [2], or it may alternatively be represented as a set of edges and a set of nodes where each edge is a set of two nodes. Now the last thing I’m advocating is to always use some mathematical structure to express what objects you’re dealing with in your program, but it does turn out that a “set object” can be quite handy. Here is some simple Python code that basically takes as an input a set of files and returns the number of unique files where each element of a file is a line (keep in mind there can be duplicate lines as well as entire duplicate files):

import sysfrom sets import Set
uniqpars = Set()
for parfile in sys.stdin:
  parset = Set()pf = open(parfile.rstrip())
  for line in pf:
    parset.add(line)
    pf.close()
    uniqpars.add(parset)
print len(uniqpars)

The fact that the example above uses a set in a traditional object-oriented way [3] is somewhat beside the point. What sometimes matters are simple questions like how fast are membership and addition operations or are we doing a lot of unions or how much space does my set actually take? There are many ways to implement a set depending on how you’re using it. A few examples:

Super-fast lookup, but potential memory or false positive concerns:

  • A directly-mapped array or bitset if you have enough space to store your entire possible set of members.
  • A valueless hash table if you have a decent amount of memory around, a potentially super-large set of possible members and want to store only the members you see. These also have super-quick lookup and insertion/deletion times.
  • A Bloom filter if memory is a concern, you have a super-large set of possible members, don’t mind not storing the members, and are willing to accept a small probability of false positives.

The structures above are great for fast operations given the set sizes don’t fluctuate that much and you have the memory or are willing to deal with false positives.

  • Binary search trees, especially the self-balancing kind, are great for sets with sizes that fluctuate a lot [4] or many union operations are being done, for example (they preserve an ordering of the elements that can be useful). If maximum or minimum element values need to be quickly accessed, binary heaps are often good ways to implement those kind of sets.

Obviously, there are a variety of ways to implement this kind of object. Object-oriented languages like C++ and Java have ways to separate the notion of an object from its implementation [5], but I reiterate, the object-orientedness is not necessarily the point. It’s simply a convenient way to organize and express the fact that you’re using a set object with commonly-used functions.

What seems to be deserving of more attention is if there are some handy structures with well-defined operations like a set, it’s probably worth using those structures and optimizing for what we need to based on some helpful directives at compile-time [6]. In other words, suppose I’m using some objects with well-defined operations like a set with each other. Suppose I also have a way to express how I’m using my information with these objects. Finally, suppose I can use techniques that some resourceful computer scientists have come up with to piece together these objects in a relatively optimal way based on how I plan to use the structures. And presto! We could potentially do a lot of work from the get-go to save the programmer some implementation details yet still wind up with some pretty efficient code.

OK, so there are a lot of assumptions there, but I imagine one can begin by citing more examples of commonly-used objects, they’re uses, and common implementations by themselves and as hybrids with other data structures. Perhaps there are some interesting rules and techniques that could come about.

And all that just reminded of some Ph.D. work a friend is doing at the moment. Maybe this topic will be revisited.

——
[1] Note that I’m just focusing on data structures and basic operations as well as their optimization with other data structures and ignoring the more language-oriented aspects.
[2] Often implemented as an adjacency list
[3] That is, the way it can check equality of sets on insertion is the fact that each member is some object that has some defined way of saying it’s “equal” to another object. In the example above, there is a set of lines and a “set of sets of those lines”. With object-oriented techniques the “equals” function works for both even though the member types are different (i.e. a “line”, and a “set of lines”).
[4] As a brief motivator an allocator can be associated with a data structure to deal with this issue (though it can get hairy at times). A good review of allocation techniques as well as some implementation motivation can be found at this starting point.
[5] Though C++ is super-guilty of not using a multiple-inheritance trick to do this in practice.
[6] I don’t think this is too far-fetched given we’ve done quite a lot of work in linear algebra and higher-level matrix-oriented languages tend to do a good job with running optimized routines based on a few directives.

Written by Geet

July 8, 2008 at 3:07 am

R project map-reduce-filter tidbit

without comments

The R-project is a great tool set for statistical computing (this is its speciality) and even just to have around for quick calculations, plots [1], and data manipulations. The community is large and the source is open. It provides a nifty Unix-like environment to work in and is available on three major operating systems. </advertisement>

Because of the large community size and the highly-interpreted nature of the language, there is definitely an “impure” feeling about using it since some packages have procedures that will call their own C or Fortran code while others use the language directly. I personally like to see it as a good platform for exploratory data analysis/prototyping ideas, and like to leave the more heavy-lifting to something..different.

That said, since its internals are kind of Scheme-like, the expressiveness of the language for data manipulation in particular can be quite handy [2]. The introduction describes a function “tapply” which is useful for applying functions to groups of items with the same category. In that neighborhood there is also “lapply/sapply/mapply” [3] which are like the traditional “map” function. “subset” is very much like the traditional “filter” function.

Not-so-advertised are the “Map“, “Reduce“, and “Filter” functions (tricky-little capital letters). The differences between the traditional FP functions above and the two R analogs listed in the last paragraph are mostly conveniences for the way R treats its data.

If you use R or are interested in experimenting with it, keep these functions in mind because they can make just a few lines of code do some pretty awesome things.

——
[1] Gnuplot and matplotlib are also pretty good open source alternatives to plotting, and there are of course any non-open source options. In my opinion, experimentation with R is definitely worth the time if you’re playing with plots, and willing to side-step a bit from the Python bandwagon, since Python does have some well-developed statistical and scientific computing tools.

[2] See their “Manuals” section for some good introductory documentation and language definition. Many scripting languages these days support higher order functions.

[3] “mapply” is a neat sort-of multimodal map.

Written by Geet

June 21, 2008 at 6:39 pm

High-level UI translation

without comments

With all the rave on generic higher-level interfaces looking more native to the host OS (as opposed the more free-form interface experimentation and design with generic low-level interfaces), I wonder if there has been much work done on actually translating the high-level widget description to OS (widget toolkit)-specific code with function-call bindings for events. This is clearly doable considering Firefox already defines these bindings for the three major OSes. OS-independent GUI code can then be translated into a native look and feel while not sacrificing efficiency because the interface description compiles to native OS widget code.

Not that I’m super-thrilled about the “widget-based” standards we’ve created for high level interfaces these days, but it’s a thought….

Update: WxWidgets seems to do the trick.

Written by Geet

May 20, 2008 at 2:25 am

The Structure and Interpretation of *

without comments

For a while now I’ve been meaning to note a couple links that may be of interest.  Some computer science undergrads are familiar with a happy-little purple book by Abelson and Sussman titled The Structure and Interpretation of Computer Programs (SICP, pronounced “sick-pea”) [1].   For those of you who don’t know, if you click on that last link, you’ll notice the entire text is available online, which is great because this book is a superb reference (at least it has been for me).  SICP is known for its usage of a LISP-like variant: Scheme to introduce a precocious individual to computer science.

If you can get through that and an introductory classical mechanics course, then why not learn about classical mechanics vis-a-vis Scheme?  Also available online, The Structure and Interpretation of Classical Mechanics (SICM or “sick-’em”, I presume).
——————–
[1] Sadly, most schools that adopt MIT’s SICP program probably teach the course with some pretension as the first (or one of the first) computer science course for undergrads.  While the upper-echelon students enjoy and potentially fly through the material, most well-intentioned and eager students seem to become disenchanted with the complexity and challenge of the material.  I appreciate the rationale for teaching this sooner than later: this book is a great intro to real computer science, not for-loops and try-catch statements.  That being said, perhaps a lighter introduction with the book is in order, because even the best of students will just scrape at the surface of the cool stuff in it.

Written by Geet

May 7, 2008 at 2:15 am