Lately I’ve found that prototyping code in a higher-level language like Python is more enjoyable, readable, and time-efficient than directly coding in C (against my usual instinct when coding something that I think needs to go fast). Mostly this is because of the simplicity in syntax of the higher level language as well as the fact that I’m not caught up in mundane aspects of making my code more optimized/efficient. That being said, it is still desirable to make portions of the code run efficiently and so creating a C/ctypes module is called for.
I recently created an application (I won’t go into the details now) that had a portion of it that could be significantly sped up if compiled as a C module. This spawned a whole exploration into speeding up my Python code (ideally while making minimal modifications to it).
I created a C module directly, used the Shedskin compiler to compile my Python code into C++, and tried the JIT solutions PyPy and Unladen Swallow. The time results for running the first few iteration for this application were surprising to me:
While this is not an exhaustive test, PyPy consistently beats a handwritten module using C++ and STL! Moreover, PyPy required little modification to my source (itertools had some issues) . I’m surprised that Uladen and Shedskin took so long (all code was compiled at O3 optimization and run on multiple systems to make sure the performance numbers were relatively consistent).
Apparently out-of-the-box solutions these days can offer nearly a 10x improvement over default Python for a particular app. and I wonder what aspects of PyPy’s system accounts for this large performance improvement (their JIT implementation?).
 Uladen required no modifications to my program to run properly and Shedskin required quite a few to get going. Of course, creating a C-based version took a moment :-).
Update 1: Thanks for the comments below. I added Cython, re-ran the analysis, and emailed off the source to those who were interested.
Update 2: The main meat of the code is a nested for loop that does string slicing and comparisons and it turns out that it’s in the slicing and comparisons that was the bottleneck for Shedskin. The new numbers are below with a faster matching function for all tests (note that this kind of addition requires call ‘code twiddling’, where we find ourselves fiddling with a very straightforward, readable set of statements to gain efficiency).
cython: 26.486s (3.5s after adding a few types)
So C comes out the winner here, but Shedskin and Cython are quite competitive. PyPy’s JIT performance is impressive and I’ve been scrolling through some of the blog entries on their website to learn more about why this could be. Thanks to Mark (Shedskin) and Maciej (PyPy) for their comments in general and and to Mark for profling the various Shedskin versions himself and providing a matching function. It would be interesting to see if the developers of Unladen and Cython have some suggestions for improvement.
I also think it’s important not to look at this comparison as a ‘bake-off’ to see which one is better. PyPy is doing some very different things than Shedskin, for example. Which one you use at this point will likely be highly dependent on the application and your urge to create more optimized code. I think in general hand-writing C code and code-twiddling it will almost always get faster results, but this comes at the cost of time and headache. In the meanwhile, the folks behind these tools are making it more feasible to take our Python code and optimize it right out of the box.
Update 3: I also added (per request below :-)) just a few basic ‘cdef’s and types to my Cython version. It does a lot better, getting about 3.5s on average per run!