Flex and Bison Fun
So I’ve been writing parsers lately with Flex and Bison.
Some notes of interest:
- The “info” pages on both are really helpful and I’ve learned most of what I needed from them (as opposed to Googling around or finding a book)
- The Regular Expression Explorer has been quite handy
- While learning, I wrote a bunch of toy parsers, and I realized that I was copying “template code” a lot. This seemed unnecessary. A tool like this simply takes a BNF-like input and outputs into different languages. Seems pretty cool but haven’t played with it too much yet.
Fun.
I’ve Spent the Last Fifteen Minutes
YouTubing Nora the piano cat thanks to this video:
via Erik.
Is clustering just an art?
Clustering data is the practice of grouping or partitioning a set of points based on how similar they are to one another. One of the most basic forms of this would be by eye when looking at a plot of data. For example, if we see two clumps on the x and y axes, then we say these data are grouped into two clusters.

plot(matrix(ncol=2,byrow=TRUE,jitter(c(rep(c(1,1),50),rep(c(2,2),50)))), xlim=c(.5,2.5), ylim=c(0.5,2.5),xlab=NA, ylab=NA)
Here, I artificially created two clusters in the R project tool grouped around (1,1), and (2,2). The job of clustering tools is to figure out that these are indeed clusters. This becomes harder when dealing with more than two dimensions, different types of data, and the question of what a cluster actually is.
Clustering tools are often used as a form of exploratory data analysis and fit within the class of unsupervised learning techniques. It’s not hard to see their practical relevance when they can help form groups from data elements having many different properties.
For some time now, I’ve heard the phrase “clustering is an art not a science.” This becomes even more evident when faced with the frustrating quantity of clustering algorithms for specialized tasks and intimate interaction with the data to realize what’s actually happening. Moreover, I can’t help sneaking suspicion on how certain techniques seem to overlap with one another. Now I’m not trying to knock specialized techniques. In fact, quite the opposite. For the thoughtful ones that seem to tackle a particular class of data, at least they do that and can be used. That being said, without being able to formally lay out what’s actually happening, it’s hard to have a sound basis on which to compare methods.
What I’d like to (quickly) highlight in this post is a couple of interesting notions that tickle the possibility of a cleaner way to think about clustering. I’m not going to focus on the usual classifications of algorithmic approaches (e.g. hierarchical vs. partitioning). While those distinctions are useful in choosing techniques and tools to cluster with, I think it might be valuable to look at the problem without concerning ourselves too much with traditional algorithms.
A data point can be quite flexible. It’ s just a tuple. It can even be of mixed types (e.g. the first dimension can be some point in the real numbers, the second can be one of n nominal elements of a set, …).
All that matters is that we can define the distance between any two tuples. Even with mixed types, there are reasonable ways to approach this [1]. With this, we have a metric space.
We can stick this metric space in a Euclidean space. In the mid-80s, it was shown that any metric space with n elements (in our case tuples) can be embedded in a O(log n)-dimensional Euclidean space with some reasonable constraints on distortion [2]. Distortion is essentially the discrepancy between the true distances in the n element metric space and the distances in the Euclidean space. A decade later, randomized algorithms were developed to actually do this (e.g [3]).
The curse of dimensionality is a problem where the higher we go in dimension, the more likely it is for points to hover around the edges of some abstract hypersphere of our cluster. This is because as the dimension increases, most of the volume of the hypersphere is contained in some small distance from the edge of the sphere. Say our tuple was just a vector in the real number space, then it may be possible to reduce the dimensionality considerably and not bias our cluster shapes too much due to high dimension.
These facts (the first reference being quite practical and the last two more foundational) are quite relevant to clustering and I’d argue make the whole thing a little more of “a science” (eh hem, “a math”).
There are, I feel, two big issues I’ve ignored ’till yet, and these seem to really make up the “art” part.
1. We still have to take these points in some lower-dimensional Euclidean space and partition them into groups. This requires some assumption of what a cluster actually is. We can assume they’re spherical, but that’s likely not always the case. What about different types of shapes? While clustering shape sounds minor and in some cases may not be practically that relevant, to my knowledge, most ways of defining distances between clusters tend to restrict all the clusters to relatively homogeneous shapes. This seems like a problem. Maybe it can be dealt with mathematically in some way?
2. Let’s go back to the start where I talk about defining distances between points. This can be a potentially arbitrary process depending on the methods by which the data was gotten in the first place. But even aside from that, the structure one is clustering on may suggest a variety of alternate ways to define distance between two points. For example, we can define the distance between two nodes on a graph in many ways (not just the shortest path).
While for the moment I’m certainly more at home as a “user/developer” of tools in this regard, here’s to theory.
——
[1] Kaufman L and Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons (1990).
[2] Bourgain J. “On Lipschitz embedding of finite metric spaces in Hilbert space”. Israel Journal of Math 52, 46-52 (1985).
[3] Linial N, London E, and Rabinovich U. “The geometry of graphs and some of its algorithmic applications”. Combinatorica 15, 215-245 (1995).
Git on that train
Warning: As the title suggests, I’m on a train.
In the last week and a half or so I’ve been using Git for a project amongst coworkers and most recently for my own code and text files. I was a bit skeptical. But after going through their excellent documentation, seeing the videos, and most importantly, a lot of tinkering, I’m realizing that it’s making life better for me.
There are a lot of resources that compare SCMs, so I don’t want to worry here about which is better and why. But I’d like to share two things that I’ve really liked about using Git.
- First off, using something like Git doesn’t necessarily have to be for a big collaborative project. This may be a sort of different take than the usual, but I like to see it as a tool that helps me see different “views” of my files depending on the job. Anyone who’s written a lot of text or code realizes that it’s actually quite hard to make things as modular as one would like and that sometimes we’re relegated to grabbing snippets of text here and there rather than making black boxes. While this is not encouraged as an engineering practice, it’s sometimes very useful for play. With Git, I’m less concerned about making all my source code fit in a super-consistent modular framework and more concerned with focusing on doing a task cleanly. This is possible because I can branch and merge with relative ease (which I understand is a pain in the ass with other SCMs). In a branch I can move some files around and do whatever I want without affecting the source. I can then cherry-pick commits I like from that into another branch. These branches can look wholly different, but the code can still be updated. This flexibility makes me worry less about organization and directory structure and more about just choosing the right, minimalistic view for the job. I now see things in terms of diffs and commits and Git provides the machinery to do real work with them.
- Git is minimalistic, local, and fast. Which is great. It’s a small source code base that compiles quickly and gives me handy command line utilities. Proper use of them = power (though even with good documentation there’s a learning curve involved). Unlike a lot of SCMs, Git is designed to be local. While it can do a lot of stuff over the network, it’s modus operandi is in a local repository (which is just one .git/ directory in your root directory). I, in fact, don’t even use the SSH/SSL features layered on top. Git helps me realize what I’ve changed and worked on and I build patches from there. You can email them to whoever, and applying them is easy.
And if I’m frustrated with some apparent inadequacy, it’s likely I can find some post with Linus himself justifying it with a little intellegence (e.g.).
Handy Makefile
The following is a snippet for a handy and concise Makefile that will make executables for all C files in a directory. It’s good for “sandboxing” this illustrates some of Make’s useful features without before consulting a larger resource. Tip of the hat to Erik for helping me polish it up.
# Flags for gcc FLAGS = -D_GNU_SOURCE -O3 -g # All C files as sources, and chop off the .c for targets SOURCES = $(wildcard *.c) TARGETS = $(patsubst %.c, %, $(SOURCES)) all: $(TARGETS) # All targets without an extension depend on their .c files %: %.c @echo "Building $@" @gcc $(FLAGS) $< -o $@ # The "@" symbol suppresses Make actually displaying the command. clean: @echo "Removing hidden files" @rm -rf .*.swp *.dSYM ._* 2> /dev/null @echo "Removing executables" @rm -rf $(TARGETS) 2> /dev/null
The nice thing about Make is that it’s useful not only for things like C code. I’ve even used it (quite some time ago) to piece together tracks of music using ecasound.
Wikipedia-like blog entries?
This is a sort-of half-baked notion that may have some flame war started somewhere on the net, but I can’t resist.
The combination of a noble call for better Wikipedia articles in a specific subject with a recent example of someone blogging to correct his own entry makes me fantasize: what if individuals wrote blog-like entries that were, in effect, articles like those on Wikipedia?
Why not just create an HTML page or contribute to Wikipedia? As far as contributing to Wikipedia, don’t get me wrong: I think that’s great. But as outlined in linked “noble call”, there are some legitimate issues–most notably that of authorship. Blogging allows one to retain authorship and control but, unlike HTML, it facilitates commenting and allows RSSee updates to interested readers and contributors.
In the second link, the creator of Bittorent is correcting his own Wikipedia entry. This makes sense at some level for the sake of clarification and posterity. There’s no guarantee that if he edits his own entry (say anonymously), the changes will persist, and it may not be appealing to constantly keep watch of minor changes to articles either.
In the days of web one point yore (say before Wikipedia and Wolfram’s Math/Physics pages), if I Googled “Maxwell’s equations”, any page put up by some schmuck like me would probably be dicey at best. But now, I can just feel lucky and get some pretty decent information from Wikipedia.
But due to the number of people editing the entry, there is a certain lack of authorship and potentially unwillingness to even contribute in the first place.
These days, we see lots of academic bloggers posting their lecture notes on specific topics online (some even making the lecture notes themselves posts rather than linking to PDFs), so parts of me feel like we’re already seeing this kind of behavior, though lectures tend to be less encyclopedic.
The thing I like about this approach is that it’s decentralized and personable in the sense that we can use the fact that we trust/value certain people’s discourse more than others to our advantage. Potential downsides: (1) we’re that much more reliant on a YourFavoriteArticleRank algorithm to rank the articles for us (for example, when you link to another article, do you link to your favorite one? the top-ranked one? eh hem, Wikipedia?), (2) more people/groups would need to get blogs, and (3) the notion of a “collaborative article” is substituted with a “first author” (the writer of the post) and “contributers” (commenters). But those concerns don’t seem that dire.
Science in the recent stimulus
Prodded by prof Tao’s post, I spent a little too much time sifting through official documents on the congressional budget as well as other articles related to NSF funding and the stimulus amendment.
Jake Young at Pure Pedantry has a nice post discussing the issue and outlines recent attention given to this subject by Science and Nature. Something I thought I’d forward on….
</politic>
Column-based text editing
Many text editors have a “block” or “column” mode where items can be copied, pasted, and edited vertically rather than horizontally. I’ve never come across an editor (or something like a Vi mod) that allows actual column-based editing. This implies no scrolling up or down. When you reach the bottom of the screen you go back to the top like a two-column paper. This would imply scrolling left or right. This mode would seem somewhat natural for a variety of reasons: it’s often easier to read lines of code and sentences when there is a more limited width (and this can also be more space-efficient for wider screens):
Assuming no explicit line-breaks within textual paragraphs and code that can fit within the column limits, I wonder if there are some other major issues with this…
Caribbean Christmas Tree
I’m not an avid traveler, but occasionally I’m prodded from where my rump rests to go somewhere. This time, my cousin who was born and raised in Trinidad left his Seattle-Google residence to get married in his home country.
On the trip, I had the chance to visit a bird sanctuary where the Scarlet Ibis is the main attraction. One of the neat things was that in the evening time, all the birds congregate around a similar area to sleep. On our tour, it was a mini island rock with a bunch of trees that looked much like this:
Pretty sharp contrast. Guess they do Christmas a little differently there.
A fun electromagnetic induction demo
A couple of days ago, I had a lunchtime-preoccupation with antennas and electromagnetic induction. We observe a neat phenomenon if you put two wires near each other, and Maxwell stated the phenomenon quite eloquently in his treatise on electromagnetism. Basically, the instant you change the voltage across one of the wires, the wire next to it (but not touching it) responds with a current that quickly goes away.
By wiggling around the voltage in certain patterns, we can find interesting ways to transmit information back and forth between one another, and through induction we can do this without direct physical contact. These days it’s easy to take for granted the all the internets sent to us via WiFi and forget about how highly developed an area radio is.
But to get taste of the physical communication, not much is required: basically some power sources and wires. A wire can in fact be practically used as an antenna for certain radio frequencies.
A simple way to check out induction with a laptop is to see if you can transmit a song that usually plays from your speaker through the air and back to yourself. Since the sound card does both audio input and output as well as the analog-to-digital conversion, a lot of the hairy parts are taken care of for us.
To do this, we need to have the sound output (i.e. from your headphone audio jack) go to a wire. Next to that wire is another wire that goes into the microphone or “line in” audio jack. Keep in mind, these are just wires–that’s all. I just forced myself to part with the headphone that came with my music player and another that I got on an airline flight.
As you can see, the ends of the two headphones lay prostrate next to the two wires connected into the microphone and headphone jacks:
So, now all you have to do is start up some music and record the input with some audio recording program (I used GarageBand since I’m on a Mac):
The green bars towards the middle-right wiggle around while recording indicating that the wire is receiving something. Doing the same thing without the wires in the line in and out yields no green bars wiggling around.
Because there’s so much power loss, the signal is faint, so I digitally amplified it some in the software. But horray! I rather shittily transmitted sound back tomyself over air via just two wires. This, I think, is a neat way to illustrate induction. Listen for yourself. You’ll need to turn up your volume a bit.




