Speed up your Python: Unladen vs. Shedskin vs. PyPy vs. Cython vs. C

Lately I’ve found that prototyping code in a higher-level language like Python is more enjoyable, readable, and time-efficient than directly coding in C (against my usual instinct when coding something that I think needs to go fast).  Mostly this is because of the simplicity in syntax of the higher level language as well as the fact that I’m not caught up in mundane aspects of making my code more optimized/efficient. That being said, it is still desirable to make portions of the code run efficiently and so creating a C/ctypes module is called for.

I recently created an application (I won’t go into the details now), one portion of which could be significantly sped up if compiled as a C module.  This spawned a whole exploration into speeding up my Python code (ideally while making minimal modifications to it).

I created a C module directly, used the Shedskin compiler to compile my Python code into C++, and tried the JIT solutions PyPy and Unladen Swallow.  The timing results for running the first few iterations of this application were surprising to me:

cpython:   59.174s
shedskin:  1m18.428s
c-stl:     12.515s
pypy:      10.316s
unladen:   44.050s
cython:    39.824s

While this is not an exhaustive test, PyPy consistently beats a handwritten module using C++ and STL!  Moreover, PyPy required little modification to my source (itertools had some issues) [1].  I’m surprised that Unladen and Shedskin took so long (all code was compiled at O3 optimization and run on multiple systems to make sure the performance numbers were relatively consistent).

Apparently out-of-the-box solutions these days can offer roughly a 6x improvement over default Python for a particular app (59.2s down to 10.3s under PyPy in the table above), and I wonder what aspects of PyPy’s system account for this large performance improvement (their JIT implementation?).
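
To make the comparison concrete, here is a small sketch that turns the timings from the table above into speedup factors relative to CPython. The numbers are copied from the table; the `speedups` helper is just an illustrative name, not part of any of the tools.

```python
# Wall-clock times in seconds, copied from the first timing table.
times = {
    "cpython": 59.174,
    "shedskin": 78.428,  # reported as 1m18.428s
    "c-stl": 12.515,
    "pypy": 10.316,
    "unladen": 44.050,
    "cython": 39.824,
}


def speedups(times, baseline="cpython"):
    """Return {name: baseline_time / time}; values above 1.0 beat the baseline."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


if __name__ == "__main__":
    # Print the fastest implementation first.
    for name, factor in sorted(speedups(times).items(), key=lambda kv: -kv[1]):
        print(f"{name:10s} {factor:5.2f}x")
```

A factor below 1.0 means the run was slower than CPython, which is the case for the first Shedskin measurement.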

[1] Unladen required no modifications to my program to run properly, while Shedskin required quite a few to get going.  Of course, creating a C-based version took a moment :-).
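
As mentioned in the comments below, the itertools issue came down to re-coding ‘product’ as nested for loops. The real source isn’t published, so the following is only a sketch of that kind of rewrite; the function names and inputs are made up for illustration.

```python
from itertools import product


def pairs_with_product(xs, ys):
    """Build every (x, y) combination using itertools.product."""
    return [(x, y) for x, y in product(xs, ys)]


def pairs_with_loops(xs, ys):
    """Same result with plain nested loops, avoiding itertools entirely."""
    out = []
    for x in xs:
        for y in ys:
            out.append((x, y))
    return out


if __name__ == "__main__":
    print(pairs_with_product("ab", [1, 2]))
    print(pairs_with_loops("ab", [1, 2]))
```

One difference to keep in mind: product reads each input once and stores it, while the nested-loop version walks `ys` again for every `x`, so `ys` has to be a list or string rather than a one-shot iterator.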

Update 1: Thanks for the comments below.  I added Cython, re-ran the analysis, and emailed off the source to those who were interested.

Update 2: The main meat of the code is a nested for loop that does string slicing and comparisons, and it turns out that the slicing and comparisons were the bottleneck for Shedskin.  The new numbers are below, with a faster matching function used for all tests (note that this kind of change requires what I call ‘code twiddling’, where we find ourselves fiddling with a very straightforward, readable set of statements to gain efficiency).

cpython:      59.593s
shedskin0.6:  8.602s
shedskin0.7:  3.332s
c-stl:        1.423s
pypy:         8.947s
unladen:      29.163s
cython:       26.486s (3.5s after adding a few types)
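
Since the lab source isn’t public, here is a hypothetical sketch of the kind of ‘code twiddling’ described in Update 2: a readable nested loop that slices and compares, next to a variant that compares character by character so no temporary slice is created. The function names and the exact shape of the loop are my assumptions, not the actual matching function Mark provided.

```python
def find_matches_slicing(text, patterns):
    """Readable version: slice the text at each position and compare."""
    hits = []
    for i in range(len(text)):
        for p in patterns:
            if text[i:i + len(p)] == p:
                hits.append((i, p))
    return hits


def matches_at(text, i, p):
    """Compare p against text starting at index i without building a slice."""
    if i + len(p) > len(text):
        return False
    for k in range(len(p)):
        if text[i + k] != p[k]:
            return False
    return True


def find_matches_twiddled(text, patterns):
    """Same nested loop, but using the explicit matching function."""
    hits = []
    for i in range(len(text)):
        for p in patterns:
            if matches_at(text, i, p):
                hits.append((i, p))
    return hits


if __name__ == "__main__":
    print(find_matches_slicing("abcabc", ["bc", "cab"]))
    print(find_matches_twiddled("abcabc", ["bc", "cab"]))
```

In plain CPython the explicit loop is likely to be slower than the slice, so this sort of rewrite mainly pays off when the code is compiled, which is the trade-off against readability mentioned above.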

 

So C comes out the winner here, but Shedskin and Cython are quite competitive.  PyPy’s JIT performance is impressive and I’ve been scrolling through some of the blog entries on their website to learn more about why this could be. Thanks to Mark (Shedskin) and Maciej (PyPy) for their comments in general, and to Mark for profiling the various Shedskin versions himself and providing a matching function. It would be interesting to see if the developers of Unladen and Cython have some suggestions for improvement.

I also think it’s important not to look at this comparison as a ‘bake-off’ to see which one is better.  PyPy is doing some very different things than Shedskin, for example.  Which one you use at this point will likely depend heavily on the application and on your urge to create more optimized code.  I think in general hand-writing C code and code-twiddling it will almost always get faster results, but this comes at the cost of time and headache.  In the meantime, the folks behind these tools are making it more feasible to take our Python code and optimize it right out of the box.

Update 3: I also added (per request below :-)) just a few basic ‘cdef’s and types to my Cython version.  It does a lot better, getting about 3.5s on average per run!

21 comments

  1. fijal · November 25, 2010

    Hey.

    Can I find source code to reproduce those results?

    Cheers,
    fijal

  2. mark dufour · November 25, 2010

    hi,

    I’d like to know why shedskin performs so badly here. if you don’t wish to publish your source code, could you perhaps send it to me in private?

    thanks!

  3. Geet · November 25, 2010

    Hi Mark and Fijal — thanks for the quick responses!

    I’ll try to comment-up the source and send it to you both in the next couple of days. (I do not wish to publish the source because it’s a portion of a class lab that I simply wanted to speed up :-))

    I’ll also send a README with exactly what I did for each test. Again, they weren’t exhaustive and not meant to be a ‘show-down’. I just wanted to see what worked fastest for me.

    Any corrections/comments to my procedures would be greatly appreciated and I’ll update my entry accordingly.

  4. Nelle · November 25, 2010

    Hello,

    I would also be interested in seeing the code. Would it be possible for you to send it to me as well ?

    Many thanks

  5. Mike R · November 25, 2010

    You’d mentioned itertools had issues with PyPy. Can you elaborate?

    I’ve been trying to find less time-intensive ways of speeding up code than rewriting the slow parts in C. It sounds like PyPy might be a good starting point, but itertools makes it into virtually all the complicated code.

  6. joaquin · November 25, 2010

    Why didn’t you try Cython?

  7. bryan · November 25, 2010

    Adding Cython to your list would be interesting.

  8. Geet · November 25, 2010

    Hi Mike — my only issue with itertools was having to re-code ‘product’ as nested for loops. Joaquin and Bryan — I added Cython in as well. Thanks for the comments!

  9. gregor · November 26, 2010

    Hello. I would like to dig deeper into that subject. Could you send some source code that I could start with?
    I want to check why Unladen is so slow.
    Thanks.

  10. Carl Friedrich Bolz · November 26, 2010

    Just wanted to add that in the meantime, PyPy has added better itertools support (which took about 20 minutes 🙂 ). Not sure about product though.

  11. Geet · November 26, 2010

    Carl — nice to hear that itertools support is improved.
    Gregor — I sent you a version of the test code so you can check out Unladen

  12. mark dufour · November 27, 2010

    note that shedskin can probably generate even faster code when you use -bw flags (avoid checking for index-out-of-bounds or wrap-around). this reduces the runtime by about 20% here.

  13. Rob · November 28, 2010

    Are there plans in PyPy to support the multiprocessing module? I’m looking to try it out for one of my projects, but the lack of multi-threading support makes it a no-go for the time being.

  14. mark dufour · November 28, 2010

    do you really need threads, or would processes work too..?

  15. Pingback: Python Hatchlings part 0 | RoBlog
  16. Robert · November 30, 2010

    Did you annotate any types in the Cython code? This is how one typically gets speeds close to that of pure C. I’d love to see how fast I could get it going with Cython.

  17. Geet · November 30, 2010

    Hi Robert — thanks for the comment! I have not done any type annotation. I just tried out some very basic type annotations and I get running times around 3.5 seconds — does make a big difference! (I’ll update the post sometime today)

  18. Francisco Costa · February 28, 2011

    Hi, can you send me that source code?
    thanks!

  19. Geet · February 28, 2011

    Sure, I’ll send it to your email

  20. louis · June 24, 2011

    Geet,
    I have been searching high and low for such a comparison as I am moving from matlab. Another request for source please.

  21. dr.benton · August 23, 2011

    Hi Geet —

    I’m trying to get my head around how to incorporate pypy into standard python to achieve these kind of speed-ups.

    Could i ask you to please pass me your pypy source code, and maybe also the appropriate “import” statements (or equiv) to fold the pypy bits into vanilla python?

    And any links that you found helpful or could recommend would be very greatly appreciated. I am seriously struggling to wade through what documentation there is/i can find!
