Silly benchmarks

04 May 2017

It started with me being curious: if I use an Integer in Kotlin, am I going to be paying a penalty for using a boxed primitive? So I wrote some silly benchmarks to confirm. First, in Java:

public class SumTestPrimitive {
  public static void main(String[] args) {
    long sum = 0;
    for (long i = 1; i <= Integer.MAX_VALUE; i++) {
      sum += i;
    }
    
    System.out.println(sum);
  }
}

public class SumTestBoxed {
  public static void main(String[] args) {
    Long sum = 0L;
    for (Long i = 1L; i <= Integer.MAX_VALUE; i++) {
      sum += i;
    }
    
    System.out.println(sum);
  }
}

On my system (a mid-2011 MacBook Pro, 2.4 GHz i5) the primitive version takes 1.77 seconds, and the boxed version takes 20.8 seconds. Then in Kotlin:

fun main(args: Array<String>) {
	var sum = 0L
	var i = 0L
	while (i <= Integer.MAX_VALUE) {
		sum += i
		i += 1L
	}

	println(sum)
}

The Kotlin one takes 1.81 seconds. Tiny bit slower than the Java primitive one, but that’s probably just due to needing a little more time for Kotlin’s runtime to load. Kotlin does unboxed primitives properly, yay!

Now I’m curious though: how do the other languages I use on the regular perform? Let’s try Clojure first, both a straightforward implementation and one tailored to match the Java one better:

(defn sum-test-straightforward []
  (loop [sum 0
         i 0]
    (if (<= i Integer/MAX_VALUE)
      (recur (+ sum i) (+ i 1))
      sum)))

(defn ^long sum-test-gofast []
  (loop [sum 0
         i 0]
    (if (<= i Integer/MAX_VALUE)
      (recur (unchecked-add sum i) (unchecked-add i 1))
      sum)))

sum-test-straightforward took 5.1 seconds, and sum-test-gofast 1.69 seconds. The gofast one is comparable to the Java one, probably a little slower: I ran these at a REPL, so there’s no startup time involved.

Ok, how about Common Lisp? I can think of 3 approaches to take off the top of my head.

;; 2147483647 is the same value as Java's Integer/MAX_VALUE.

(defun sum-test-iterative ()
  (let ((sum 0))
    (dotimes (i 2147483647)      
      (setf sum (+ sum i)))
    sum))

(defun sum-test-recursive ()
  (labels ((sum-fn (sum i)
             (if (<= i 2147483647)
                 (sum-fn (+ sum i) (+ 1 i))
                 sum)))
    (sum-fn 0 0)))

(defun sum-test-loop ()
  (loop for i from 1 to 2147483647 sum i))

Using ClozureCL, all 3 of these perform abysmally:

sum-test-iterative: 62.98 seconds, 2.82 of which were GC time. 20 GiB allocated.
sum-test-recursive: 74.11 seconds, 3.52 of which were GC. 20 GiB allocated.
sum-test-loop: 50.7 seconds, 2.58 of which were GC. 20 GiB allocated.

SBCL does much better off the bat, but still not great:

sum-test-iterative: 7.7 seconds, no allocation
sum-test-recursive: 17.1 seconds, no allocation
sum-test-loop: 7.63 seconds, no allocation

Adding some type annotations and optimize flags helped SBCL, but ClozureCL’s times stayed the same:

(defun sum-test-iterative ()
  (declare (optimize speed (safety 0)))
  (let ((sum 0))
    (declare ((signed-byte 64) sum))
    (dotimes (i 2147483647)
      (setf sum (the (signed-byte 64) (+ sum i))))
    sum))

SBCL’s sum-test-iterative drops down to 3.13 seconds, still no allocation. No change on Clozure. I’m probably doing something wrong here, but it’s not clear to me what. The disassembly of sum-test-iterative on SBCL shows that there’s still an allocation going on there: maybe the problem is just that 64-bit integers don’t work unboxed due to SBCL’s pointer tagging?

* (disassemble 'sum-test-iterative)

; disassembly for SUM-TEST-ITERATIVE
; Size: 69 bytes. Origin: #x1002AF97D7
; 7D7:       31C9             XOR ECX, ECX                    ; no-arg-parsing entry point
; 7D9:       31C0             XOR EAX, EAX
; 7DB:       EB2D             JMP L3
; 7DD:       0F1F00           NOP
; 7E0: L0:   488BD1           MOV RDX, RCX
; 7E3:       48D1F9           SAR RCX, 1
; 7E6:       7304             JNB L1
; 7E8:       488B4AF9         MOV RCX, [RDX-7]
; 7EC: L1:   488BD0           MOV RDX, RAX
; 7EF:       48D1FA           SAR RDX, 1
; 7F2:       4801D1           ADD RCX, RDX
; 7F5:       48D1E1           SHL RCX, 1
; 7F8:       710C             JNO L2
; 7FA:       48D1D9           RCR RCX, 1
; 7FD:       41BB00070020     MOV R11D, #x20000700            ; ALLOC-SIGNED-BIGNUM-IN-RCX
; 803:       41FFD3           CALL R11
; 806: L2:   4883C002         ADD RAX, 2
; 80A: L3:   483B057FFFFFFF   CMP RAX, [RIP-129]              ; [#x1002AF9790] = FFFFFFFE
; 811:       7CCD             JL L0
; 813:       488BD1           MOV RDX, RCX
; 816:       488BE5           MOV RSP, RBP
; 819:       F8               CLC
; 81A:       5D               POP RBP
; 81B:       C3               RET
NIL

Next up, Swift:

var sum: UInt64 = 0
var count: UInt64 = 2147483647

for i: UInt64 in 0 ..< count {
    sum = sum + i
}

print(sum)

Without optimizations, 16 minutes 50 seconds. Holy shit.

With optimizations, 1.11 seconds.

Ok, last one, C:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
	long sum = 0;
	const long count = strtol(argv[1], NULL, 10);
	for (long i = 0; i <= count; i++) {
		sum += i;
	}
	printf("%ld", sum);
	return 0;
}

Why take in the count parameter from the command line? Because Clang cheats. If I use the constant in there, it’s smart enough to just precalculate the whole thing and just return the final result.

Without optimizations:

solace:sum-tests jfischer$ clang -o sum-test SumTest.c 
solace:sum-tests jfischer$ time ./sum-test 2147483647
2305843008139952128
real	0m8.247s
user	0m8.190s
sys	0m0.035s

With optimizations:

solace:sum-tests jfischer$ clang -Os -o sum-test SumTest.c
solace:sum-tests jfischer$ time ./sum-test 2147483647
2305843008139952128
real	0m0.006s
user	0m0.002s
sys	0m0.002s

It turns out, Clang still cheats even if the loop counter comes from outside. I’m pretty sure it recognized what I’m doing and just turned that loop into Gauss’ trick for computing an arithmetic series. It doesn’t matter what loop count I give it, it always takes the same amount of time with optimizations.

I can’t read/write assembly, but playing around on godbolt.org makes it look like that’s the case: https://godbolt.org/g/FmL66q. (There’s no loop in the disassembly.) And I can’t figure out how to trick it into not doing that, so I’ll call it quits for now.

Jonathan Fischer

Silly benchmarks