Sunday, June 29, 2008

How fast is Scheme? Well....

Thinking that it'd be a good idea to become familiar with Scheme as well as Common Lisp (but, er... probably after I've read Practical Common Lisp and amn't mentally overloaded, what with me being so mentally busy and needing to focus my efforts in a more single-threaded manner), I checked out Gambit Scheme in the hopes that it'd produce faster code than the other free Scheme implementations.
I don't know much about Scheme (or Lisp at all really; I've only become interested in it during the last month or so) but it seems that the Scheme compilers produce quite sluggish code, at least looking through the grainy, distorted lens that is the Computer Language Benchmarks Game.

From some of the microbenchmarks published by the Gambit guy(s) on their homepage, it looks as if Gambit fares really well against the Chicken and MzScheme compilers, but I wanted to examine its performance relative to SBCL, which is the Common Lisp compiler I've arbitrarily chosen.
Since Gambit isn't featured on the language shootout, I pulled the SBCL code for a few of the tests there, along with the Gambit code already prepared on the Gambit site, and ran a few comparisons. (Why hasn't Gambit appeared on the shootout site, for the nosy hackers who want to find out whether less mainstream (man, I hate that word) languages are ridiculously slow? Many of them aren't! And they're more expressive and comfortable than C++/Java; use them!)
SBCL comes out very much on top; in general the Gambit programs take two or three times as long to do their job (although I haven't looked at memory usage). But as far as Scheme compilers go, Gambit seems to be improving things.

(Really dodgy) timings on my MacBook (input values = whatever's on the shootout site):

Gambit:
pidigits
real 0m3.611s
user 0m3.508s
sys 0m0.066s

binary-trees
real 0m7.519s
user 0m7.309s
sys 0m0.130s

fannkuch
real 0m12.571s
user 0m12.089s
sys 0m0.205s

fasta
real 0m46.334s
user 0m42.284s
sys 0m2.668s

Lisp SBCL:
pidigits
real 0m2.422s
user 0m2.018s
sys 0m0.308s

binary-trees
real 0m1.899s
user 0m1.723s
sys 0m0.127s

fannkuch
real 0m6.221s
user 0m5.982s
sys 0m0.098s

fasta
real 0m20.730s
user 0m15.267s
sys 0m4.375s

(Gambit/SBCL ratios of real time:
pidigits: 1.5
binary-trees: 3.959
fannkuch: 2.021
fasta: 2.235)


One thing I've noticed from these coarse measurements is that the "sys" component of the execution time is higher for SBCL on some of the benchmarks (pidigits and fasta, in particular); maybe this means a lot of allocation and de-allocation of stuff is going on... or something.
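
(For what it's worth, these numbers are just from wrapping each run in bash's time builtin, roughly like the line below; the executable name is whatever gsc/gcc spat out, and the SBCL programs were invoked as described on the shootout site:)

time ./binary-trees 16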

8 comments:

  1. Anonymous, 29/6/08 17:56

    SBCL pre-allocates a pretty good chunk of memory by default when it starts up, which may account for the sys timings.

    When you're doing more interesting things that live longer than fractions of a second it pays off, though, much like a JIT compiler (which I ranted about plenty in my own post linking to this one). The preallocation saves the need to constantly call malloc/free once you start dealing with big-ish datasets, as a pure C/C++ implementation would need to. Of course, you could do the same thing in C as well.

    I think 'grainy, distorted lens' is a pretty apt description for most 'benchmarks' in general. =)

  2. Absolutely, prael; I'm really impressed by the performance of SBCL on the tougher benchmarks on the shootout site. Preallocating your own heap makes sense, although I wonder what happens if you start multiple Lisp images (e.g. if you distribute programs as compiled executables, will different programs share a common heap or will they hog all of the machine's RAM?).

    It looks like Ikarus Scheme grabs a chunk of heap similarly, but MzScheme and Chicken don't, judging from the memory usage stats.

  3. Anonymous, 30/6/08 00:32

    I'm pretty certain that they would actually all allocate their own heaps. But you can just pass a switch to SBCL to tell it not to preallocate so much, or perhaps consider a threading approach instead. Maybe even try something like MemCache, depending on what sorts of computations you're doing. There are a lot of possibilities. As far as distributing binaries goes, though, I'd just tell it to prealloc as little as possible.
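
    (The switch I'm thinking of is --dynamic-space-size, if I remember the name right; it takes the heap size in megabytes. So something like

        sbcl --dynamic-space-size 128

    should cap the preallocation at roughly 128MB; the 128 is just an arbitrary figure for illustration.)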

  4. Anonymous, 30/6/08 01:45

    This post caught my eye (as the person who posted the Gambit versions of the shootout programs to the Gambit web site ;-).

    First, the reason they're not on the shootout page is that alioth never seemed to send me my passwords (but it seems that the user names "lucier" and "bjlucier" are now forever taken). It wasn't worth the time to get it fixed.

    SBCL's compiler is a great optimizing compiler, and Gambit's compiler can't touch it. However, I don't see the differences that you're seeing on my 2.3GHz Core 2 Duo running SBCL 1:1.0.11 (an older version) on Ubuntu 8.04 and the 64-bit version of Gambit-C 4.2.8. (If you're using the universal Gambit installer on your Macbook, it installs the 32-bit version. You could check by asking (fixnum? (expt 2 32)).)

    Gambit:
    pidigits: 1.308u 0.012s 0:01.34
    binary-trees: 2.256u 0.064s 0:02.32
    fannkuch: 6.880u 0.020s 0:06.90
    fasta: 18.957u 0.084s 0:19.08

    SBCL:
    pidigits: 0.844u 0.148s 0:01.05
    binary-trees: 1.940u 0.116s 0:02.06
    fannkuch: 4.280u 0.024s 0:04.40
    fasta: 15.448u 0.156s 0:15.64

    (I changed the runtime instructions on the wiki for binary-trees to be "gsi -:m100000 binary-trees 16" to give it a 100MB minimum heap size.)

    So the ratios of User+System times are
    pidigits: 1.33
    binary-trees: 1.13
    fannkuch: 1.60
    fasta: 1.22

    I don't see these ratios as so big that one would use them as the sole basis to choose SBCL over Gambit. (There are lots of other good reasons to choose SBCL over Gambit.) And, of course, there are other benchmarks (e.g., for bignum arithmetic, which are not in the shootout because they're not easy to do in C) where Gambit would win.

  5. Anonymous, 30/6/08 01:56

    Speak of the devil?

    I just came back here to mention a fascinating thread involving you, Gambit-C, and performance.

    https://webmail.iro.umontreal.ca/pipermail/mslug/2007-December/000241.html

  6. Anonymous

    Brad Lucier said: "And, of course, there are other benchmarks (e.g., for bignum arithmetic, which are not in the shootout because they're not easy to do in C) where Gambit would win."

    Yet another person apparently with expert knowledge about why something is or is not in the shootout!

    Strangely, without enough expert knowledge to even get an alioth id working :-)

  7. Anonymous, 30/6/08 07:41

    For benchmarks which allocate a lot (like the ones used in this evaluation), the performance of the memory management will dominate the computation. Using a large heap makes memory management more efficient because the garbage collector gets called less often. With the objective of being memory-friendly with other processes, Gambit's runtime uses a small heap by default (500KB to 1MB). When I run benchmarks I typically tell the runtime to use a heap of at least 10MB (with the switch -:m10000). I've seen benchmarks double in speed with this switch.

    Moreover, what level of optimization did you request from the Gambit compiler? Did you use the -D___SINGLE_HOST option (or --enable-single-host when you compiled Gambit)? That also boosts performance by a large factor.

    Now I'm not saying that Gambit can beat SBCL at all benchmarks. SBCL generates machine code directly (with no constraints). The machine code generated by Gambit is really generated by the C compiler, and the cost of supporting tail-calls in C adds an overhead that can't be avoided (so programs with frequent function calls suffer more). On the other hand, Gambit is extremely portable. I'm not sure how portable SBCL is... The port matrix for SBCL (http://www.sbcl.org/platform-table.html) is sparse and it takes considerable time to port to a new platform. Gambit's matrix is dense (any platform with a decent C compiler is supported).

    Marc

  8. Brad and Marc,

    Thanks for your comments! I think I'm at fault here: in my tests, I never specified the heap size manually; I simply used gsc/gcc to build an executable with no single-host option to gsc, only passing -O3 and a no-frame-pointer optimisation to gcc.
    Later on, I'll run the tests again with these options; silly of me not to ask for a larger heap, even though it's mentioned on the Gambit site under each shootout example.
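
    Something like this is what I have in mind, assuming I've read the Gambit docs correctly (the include/library paths and file names below are placeholders for my setup, and N stands in for the shootout input value):

        # compile the Scheme source to C, then build a single-host executable
        gsc -link fannkuch.scm
        gcc -O3 -fomit-frame-pointer -D___SINGLE_HOST \
            -I/usr/local/Gambit-C/include -L/usr/local/Gambit-C/lib \
            fannkuch.c fannkuch_.c -lgambc -o fannkuch

        # ask the Gambit runtime for at least a 10MB heap (-:m takes kilobytes)
        time ./fannkuch -:m10000 N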

    Maybe once an application is running for a little while, even if it started out with a small heap, things will even out as the allocator keeps requesting more and more memory and the heap is expanded automatically?
    Or would heap fragmentation in that case slow things down a bit?

    Still, using a small heap in the usual case and allowing it to be manually set indicates that Gambit could work pretty well on embedded hardware without lots of memory. Which is nice!

    Oisín
