Monte carlo sampling the modularity space

Warning: I’m not really doing this entry justice, but it should be good-enough for logging purposes.

For the past few years, the complex networks community has spawned a thread in community detection: very roughly the problem of detecting subgraphs within a network that have a greater density of edges within-community than not.

Though greedy algorithms optimizing a very successful criteria called modularity (obviously there are many alternative methods to optimizing modularity overviewed in the literature) are good for very large networks, one side-issue is properly analyzing the variety of partitions that could be generated that have similar modularities. This has been explored a bit (too many links for the spirit of this post).

One way to tackle this issue is to use a technique called Markov Chain Monte Carlo (MCMC), in which the transition probabilities in your Markov chain are from one possible partitioning of the network to another. Since the number of possible partitions of n nodes in the network is equivalent to the number of possible set partitions with n members, we can’t possibly exhaustively explore the entire space for even moderately sized networks. But we can, if careful, sample from this space with proportional to a bias of our choice because it can be shown that the Markov chain will eventually reach a state of equilibrium.

This technique, among other things, helps us learn more about the properties of the space of partitions as well as analyze whether the partitioning buys us much (i.e. if the nodes really stay in one partition a lot or not for the ensemble of partitions generated at equilibrium). I’ve come across and have been pointed to literature where it’s in statistics, quantum chemistry, and statistical physics, and am most certain it’s used in many more fields of study. Part of the appeal in using the technique is that it allows for a surprising amount of flexibility in choosing the transition probabilities and is conceptually easy to grasp (I wonder how badly it’s abused).

I can now link you to some code I wrote.

It uses the same transition scheme described in Massen-Doye ’06, but samples directly proportional to the modularity rather than via parallel tempering on the Boltzmann distribution (an exploratory suggestion from Aaron Clauset). The results show that this works quite well itself, and the modularity of the partitions at equilibrium usually hover below the maximum determined modularity, as expected.

With this “starting point” (the cleaned up code) I have tried a variety of other transition schemes as well as done some thinking on what would be valuable in terms of the target distribution. This exploration serves as a good basis for continuing to play with modularity, its connection to hierarchical structure, and even dynamical systems.

On a less technical level, I have been slowly gathering notes on what’s actually happening (a.k.a WAH?) in the field of statistical analysis of complex networks (both static and dynamic) as a whole and what certain analysis techniques buy us for particular networks. Right now I’m still engrossed in seminal work, but as they become more refined I’ll post a link to them so that I can perhaps get more feedback on my “worldview” of the area.

And even more tangentially related, those of you who may reproducing research, there is a slight error in the reporting of the modularity of the karate club network for the fast algorithm (the author has been aware of this) and that the second pass of the CNM algorithm should be run at step MAXQ-1 to get the partition with optimal modularity.