From Elegance to Speed
code lisp clojure physics .....
Next: Another Migration
Prev: Common Lisp How-Tos
Making Things Fast
A while back, we used a toy particle physics trigger to explore concepts in lazy evaluation and functional programming in Clojure. We came up with the following function which selected groups of eight events clustered close together in time:
(defn smt-8 [times]
(->> times
(partition 8 1)
(map (juxt identity
(comp (partial apply -)
(juxt last first))))
(filter (comp (partial > 1000) second))))
Though elegant, this function processed events at a rate of 250 kHz, which I found slightly disappointing, performance-wise.
I have been interested in writing faster programs lately, which is one of the reasons I've been learning Common Lisp. Common Lisp lets you create extremely high-level programming constructs, but also lets you get closer to the metal when you need to. In this post we'll look at some ways for improving performance and compare our results to the original 250 kHz event rate.
First Steps
The first step is to translate our code as is to Common Lisp. We're going to punt entirely on the question of laziness for this post, though we might take it up in the future.
Here's my first attempt. We have to start with our source of random times. In Clojure, we had:
(def times (iterate #(+ % (rand-int 1000)) 0))
Since I'm going to benchmark performance, we'll need to set a limit to the number of events (times) we'll process. So, in Common Lisp:
(defun time-sequence (n)
(loop repeat n
for y = 0 then (+ y (random 1000))
collect y))
where n
is the number of event times to process. Here and
throughout what follows, I'm using the loop
macro, which is a Swiss
army knife for iteration with a somewhat strange, non-Lispy, but
readable syntax.
We also want to set our random state so as to guarantee different results each time we execute our simulation:
(setf *random-state* (make-random-state t))
My first attempt in Common Lisp looked like this benchmarking snippet:
(timing
(length
(->> 1000000
time-sequence
(partition-n 8 1)
(mapcar (juxt #'car (compose #'car #'last) #'identity))
(mapcar #'(lambda (l) `(,(- (cadr l) (car l)) ,(caddr l))))
(remove-if-not #'(lambda (l) (< (car l) 1000))))))
This is less lovely than the Clojure, partly because of all the
hash-quoting of functions (Common Lisp is a Lisp-2, and Clojure is a
Lisp 1). ->>
is from the arrow-macros
library, and compose
is
from the cl-utilities
library. The other Clojure-ish functions,
partition-n
(a replacement for Clojure's partition
, whose name
collides with an entirely different function in Common Lisp) and
juxt
are from cl-oju
, a small library I've been writing for those
still-frequent times when I want a Clojure function or idiom in Common
Lisp. (The library ignores laziness for now, since I haven't needed it
yet.)
timing
is a macro I adapted from the Common Lisp Cookbook, which
captures both the results of a computation and the elapsed CPU time
used (not wall clock time) in msec:
(defmacro timing (&body forms)
(let ((run1 (gensym))
(run2 (gensym))
(result (gensym)))
`(let ((,run1 (get-internal-run-time))
(,result (progn ,@forms))
(,run2 (get-internal-run-time)))
`(duration ,(- ,run2 ,run1) msec... result ,,result))))
For now, result
is simply the number of eight-fold time clusters
occuring within 1000 units of time. The execution time was roughly a
second for the first few times I ran this:
'(DURATION 1045 MSEC... RESULT 235)
'(DURATION 1554 MSEC... RESULT 201)
'(DURATION 827 MSEC... RESULT 164)
This already processed events at 1 MHz, roughly 4x faster than the Clojure speed of 250kHz.
Later, however, when I revisited the code, I noticed it ran significantly faster:
'(DURATION 435 MSEC... RESULT 193)
'(DURATION 279 MSEC... RESULT 189)
'(DURATION 205 MSEC... RESULT 177)
'(DURATION 601 MSEC... RESULT 180)
This is 2.6 Mhz, 10x the Clojure code speed. It took me a while to figure out that I had changed the SBCL heap size in order to handle larger arrays for testing (this is one area where laziness would help!). The larger heap size was making the code run faster.
It should be said at this point that I made no attempt in my original blog post to optimize the Clojure code for performance. Nevertheless, I think it's interesting that my first attempt to write the same algorithm in Common Lisp, using roughly the same idioms, performed so much faster.
Interlude
Looking back over my career, it seems I have been moving gradually
from fast, lower-level languages to slower, more expressive,
high-level languages: FORTRAN
and C, to Perl and Python, then to
Clojure (which is faster than Python for most long-running programs,
while being arguably more expressive[fn::Reasonable people can
disagree on the relative expressiveness of languages; for me, Lisps
win on expressiveness because I can bend the language to my will using
macros in a way that I cannot with non-Lisps.]). I've enjoyed this
trajectory, because higher level languages let you implement so much, so
quickly.
My feelings on this are shifting, however. I wrote some code a decade ago for a system that has been running constantly for a decade on thousands of devices. My code is not particularly CPU efficient and, therefore, not as energy-efficient as it could be. As the planet warms, I think it's important to try and find small efficiencies where we can… not necessarily under the assumption that it will make a massive difference, but because it's just the right thing to do as conscientious professionals. (The fact that the sins of my past are executing on processors running on top of and within Antarctic ice adds an extra bit of irony here).
Common Lisp is particularly interesting to me of late because, having been designed largely for Good Old-Fashioned AI, it allows you to work at a very high level (you can, for example, add Prolog-like functionality to a Lisp in a few hundred lines of code [fn::See Paul Graham, On Lisp, p. vii]) – but, you can generate very fast (i.e., more energy efficient) code as well, as we shall see shortly.
Improvements
The next thing I tried was to separate the event generation from the trigger calculation and to use a vector for the input times, instead of a list:
(defun time-sequence-vec (n)
(make-array `(,n)
:initial-contents
(loop repeat n
for y = 0 then (+ y (random 1000))
collect y)))
(defparameter array-size (* 100 1000 1000))
(defparameter *s* (time-sequence-vec array-size))
I also make the test sample much larger to improve the run durations and statistics.
Both Graham1 and Norvig[fn::Peter
Norvig, Paradigms of Artificial Intelligence Programming: Case Studies
in Common Lisp] emphasize tuning your algorithm first before turning
to other optimizations. In our case the problem is O(n), which is
the best time complexity we can hope for, but our use of function
composition, while elegant, could be turned into something more
efficient. To improve the algorithm proper, I perform the entire
computation in a single loop
statement. I also change the output
slightly to show the index for matching 8-fold events, start and end
time, and event duration:
(defun second-try ()
(loop for x below (- array-size 8)
if (< (- (elt *s* (+ x 8))
(elt *s* x))
1000)
collect `(index ,x
startval ,(elt *s* x)
endval ,(elt *s* (+ x 8))
diff ,(- (elt *s* (+ x 8))
(elt *s* x)))))
(take 4 (second-try))
;;=>
'((INDEX 69087 STARTVAL 34504666 ENDVAL 34505615 DIFF 949)
(INDEX 153990 STARTVAL 76991815 ENDVAL 76992630 DIFF 815)
(INDEX 237070 STARTVAL 118617491 ENDVAL 118618358 DIFF 867)
(INDEX 294951 STARTVAL 147513479 ENDVAL 147514346 DIFF 867))
;; Time it, a few times:
(timing (length (second-try)))
;;=>
'(DURATION 2158 MSEC... RESULT 2427)
;;=>
'(DURATION 2167 MSEC... RESULT 2427)
;;=>
'(DURATION 2162 MSEC... RESULT 2427)
;;=>
Note that the starting and ending time values increase monotonically,
and the reported time differences (DIFF
) are under 1000, as desired2.
The new trigger function improves the processing rate to 46.2 MHz. At this point I'm starting to wonder what the assembly language looks like:
(disassemble 'second-try)
;;=>
; disassembly for SECOND-TRY
; Size: 1061 bytes. Origin: #x228FE0B0 ; SECOND-TRY
; 0B0: 498B7510 MOV RSI, [R13+16] ; thread.binding-stack-pointer
; 0B4: 488975F8 MOV [RBP-8], RSI
; 0B8: 488B3591FFFFFF MOV RSI, [RIP-111] ; 'ARRAY-SIZE
; 0BF: 8B56F5 MOV EDX, [RSI-11]
; 0C2: 4A8B142A MOV RDX, [RDX+R13]
; 0C6: 83FA61 CMP EDX, 97
; 0C9: 480F4456F9 CMOVEQ RDX, [RSI-7]
; 0CE: 83FA51 CMP EDX, 81
; 0D1: 0F84BF030000 JEQ L17
; 0D7: BF10000000 MOV EDI, 16
; 0DC: FF1425B000B021 CALL QWORD PTR [#x21B000B0] ; GENERIC--
;; .................. many lines omitted ................
; 4B6: L24: CC18 BREAK 24 ; UNBOUND-SYMBOL-ERROR
; 4B8: 27 BYTE #X27 ; '*S*
; 4B9: L25: 6880000000 PUSH 128
; 4BE: FF14252800B021 CALL QWORD PTR [#x21B00028] ; ALLOC-TRAMP-R11
; 4C5: E9FFFEFFFF JMP L12
; 4CA: L26: 6A10 PUSH 16
; 4CC: FF14252800B021 CALL QWORD PTR [#x21B00028] ; ALLOC-TRAMP-R11
; 4D3: EB94 JMP L14
(258 lines in total.)
My assembler-fu is not strong, but I notice the GENERIC--
and we do
not need generic functions: there are many opportunities here to
introduce type declarations.
After declaring to the compiler we want more speed and less
type-checking ((declaim (optimize (speed 3) (debug 0) (safety 0)))
),
let's declare our array to be type fixnum
(which on SBCL is an
integer from -4611686018427387904 to 4611686018427387903).
(defparameter *s* (make-array array-size
:element-type 'fixnum))
;; Initialize array:
(loop
for y = 0 then (+ y (random 1000))
for i below array-size
do (setf (elt *s* i) y))
(timing
(length
(loop for x below (- array-size 8)
if (< (- (elt *s* (+ x 8))
(elt *s* x))
1000)
collect `(index ,x
startval ,(elt *s* x)
endval ,(elt *s* (+ x 8))
diff ,(- (elt *s* (+ x 8))
(elt *s* x))))))
;;=>
'(DURATION 1964 MSEC... RESULT 2435)
;;=>
'(DURATION 1939 MSEC... RESULT 2435)
;;=>
'(DURATION 2010 MSEC... RESULT 2435)
… maybe a slight improvement. Let's now type-hint where we can throughout the tight loop:
(timing
(length
(loop for x below (- array-size 8)
if (< (- (the fixnum (elt *s* (the fixnum (+ x 8))))
(the fixnum (elt *s* (the fixnum x))))
1000)
collect `(index ,x
startval ,(elt *s* x)
endval ,(elt *s* (+ x 8))
diff ,(- (elt *s* (+ x 8))
(elt *s* x))))))
;;=>
'(DURATION 1740 MSEC... RESULT 2435)
;;=>
'(DURATION 1738 MSEC... RESULT 2435)
;;=>
'(DURATION 1753 MSEC... RESULT 2435)
A 10% or so improvement… not too bad.
At this point I'm tempted to look at the macroexpansion of my loop
expression and see what it's doing. For simplicity, I collect
just
the starting time index for times satisfying the trigger:
(macroexpand-1
'(loop for x below (- array-size 8)
if (< (- (the fixnum (elt *s* (the fixnum (+ x 8))))
(the fixnum (elt *s* (the fixnum x))))
1000)
collect x))
;;=>
'(block nil
(let ((#:loop-limit-711 (- array-size 8)) (x 0))
(declare (type (and number real) x)
(type (and number real) #:loop-limit-711))
(sb-loop::with-loop-list-collection-head (#:loop-list-head-712
#:loop-list-tail-713)
(tagbody
sb-loop::next-loop
(when (>= x #:loop-limit-711) (go sb-loop::end-loop))
(if (<
(- (the fixnum (elt *s* (the fixnum (+ x 8))))
(the fixnum (elt *s* (the fixnum x))))
1000)
(sb-loop::loop-collect-rplacd
(#:loop-list-head-712 #:loop-list-tail-713) (list x)))
(sb-loop::loop-desetq x (1+ x))
(go sb-loop::next-loop)
sb-loop::end-loop
(return-from nil
(sb-loop::loop-collect-answer #:loop-list-head-712))))))
I notice that x
and :loop-limit-711
have an overly-general type declaration. I'm not sure how to type-hint x
, but I can type-hint the upper bound:
(timing
(length
(loop for x below (- (the fixnum array-size) 8)
if (< (- (the fixnum (elt *s* (+ (the fixnum x) 8)))
(the fixnum (elt *s* (the fixnum x))))
1000)
collect `(index ,x
startval ,(elt *s* x)
endval ,(elt *s* (+ x 8))
diff ,(- (elt *s* (+ x 8))
(elt *s* x))))))
;;=>
'(DURATION 1460 MSEC... RESULT 2513)
;;=>
'(DURATION 1455 MSEC... RESULT 2513)
;;=>
'(DURATION 1486 MSEC... RESULT 2513)
Another modest but significant improvement.
At this point I reached for the SBCL profiler to see what I could learn from it. Most of the processing time is spent in array lookups, which I don't think we can avoid.
Let's look at the assembler again. Since our trigger matches are rare,
most of our time is going to be spent in finding the events, rather
than in collecting the detailed results for the matches. If we switch
back to using collect x
to avoid the extra assembly associated with
building `(index ,x startval ...)
:
(defun trigger ()
(loop for x below (- (the fixnum array-size) 8)
if (< (- (the fixnum (elt *s* (+ (the fixnum x) 8)))
(the fixnum (elt *s* (the fixnum x))))
1000)
collect x))
(disassemble 'trigger)
;;=>
;; disassembly for TRIGGER
;; Size: 448 bytes. Origin: #x22610EA0 ; TRIGGER
;; 0EA0: 498B4510 MOV RAX, [R13+16] ; thread.binding-stack-pointer
;; 0EA4: 488945F8 MOV [RBP-8], RAX
;; 0EA8: 488B05B1FFFFFF MOV RAX, [RIP-79] ; 'ARRAY-SIZE
;; 0EAF: 8B70F5 MOV ESI, [RAX-11]
;; 0EB2: 4A8B342E MOV RSI, [RSI+R13]
;; 0EB6: 83FE61 CMP ESI, 97
;; 0EB9: 480F4470F9 CMOVEQ RSI, [RAX-7]
;; 0EBE: 83FE51 CMP ESI, 81
;; 0EC1: 0F8477010000 JEQ L11
;; 0EC7: 40F6C601 TEST SIL, 1
;; 0ECB: 0F8568010000 JNE L10
;; 0ED1: 48D1FE SAR RSI, 1
;; 0ED4: 4883EE08 SUB RSI, 8
;; 0ED8: 488975F0 MOV [RBP-16], RSI
;; 0EDC: 31F6 XOR ESI, ESI
;; 0EDE: 49896D28 MOV [R13+40], RBP ; thread.pseudo-atomic-bits
;; 0EE2: 4D8B5D68 MOV R11, [R13+104] ; thread.alloc-region
;; 0EE6: 498D4310 LEA RAX, [R11+16]
;; 0EEA: 493B4570 CMP RAX, [R13+112]
;; 0EEE: 0F874D010000 JNBE L12
;; 0EF4: 49894568 MOV [R13+104], RAX ; thread.alloc-region
;; 0EF8: L0: 498D4307 LEA RAX, [R11+7]
;; 0EFC: 49316D28 XOR [R13+40], RBP ; thread.pseudo-atomic-bits
;; 0F00: 7402 JEQ L1
;; 0F02: CC09 BREAK 9 ; pending interrupt trap
;; 0F04: L1: C740F917001020 MOV DWORD PTR [RAX-7], #x20100017 ; NIL
;; 0F0B: C7400117001020 MOV DWORD PTR [RAX+1], #x20100017 ; NIL
;; 0F12: 488945E8 MOV [RBP-24], RAX
;; 0F16: 488945E0 MOV [RBP-32], RAX
;; 0F1A: E9BB000000 JMP L4
;; 0F1F: 90 NOP
;; 0F20: L2: 8B142524144A20 MOV EDX, [#x204A1424] ; tls_index: *S*
;; 0F27: 4A8B142A MOV RDX, [RDX+R13]
;; 0F2B: 83FA61 CMP EDX, 97
;; 0F2E: 480F44142528144A20 CMOVEQ RDX, [#x204A1428] ; *S*
;; 0F37: 83FA51 CMP EDX, 81
;; 0F3A: 0F840F010000 JEQ L13
;; 0F40: 488D7E10 LEA RDI, [RSI+16]
;; 0F44: 4883EC10 SUB RSP, 16
;; 0F48: 488975D8 MOV [RBP-40], RSI
;; 0F4C: B904000000 MOV ECX, 4
;; 0F51: 48892C24 MOV [RSP], RBP
;; 0F55: 488BEC MOV RBP, RSP
;; 0F58: B862934F22 MOV EAX, #x224F9362 ; #<FDEFN ELT>
;; 0F5D: FFD0 CALL RAX
;; 0F5F: 488B75D8 MOV RSI, [RBP-40]
;; 0F63: F6C201 TEST DL, 1
;; 0F66: 0F85CA000000 JNE L9
;; 0F6C: 4C8BC2 MOV R8, RDX
;; 0F6F: 49D1F8 SAR R8, 1
;; 0F72: 4C8945D0 MOV [RBP-48], R8
;; 0F76: 8B142524144A20 MOV EDX, [#x204A1424] ; tls_index: *S*
;; 0F7D: 4A8B142A MOV RDX, [RDX+R13]
;; 0F81: 83FA61 CMP EDX, 97
;; 0F84: 480F44142528144A20 CMOVEQ RDX, [#x204A1428] ; *S*
;; 0F8D: 83FA51 CMP EDX, 81
;; 0F90: 0F84BC000000 JEQ L14
;; 0F96: 488BFE MOV RDI, RSI
;; 0F99: 4883EC10 SUB RSP, 16
;; 0F9D: 488975D8 MOV [RBP-40], RSI
;; 0FA1: B904000000 MOV ECX, 4
;; 0FA6: 48892C24 MOV [RSP], RBP
;; 0FAA: 488BEC MOV RBP, RSP
;; 0FAD: B862934F22 MOV EAX, #x224F9362 ; #<FDEFN ELT>
;; 0FB2: FFD0 CALL RAX
;; 0FB4: 4C8B45D0 MOV R8, [RBP-48]
;; 0FB8: 488B75D8 MOV RSI, [RBP-40]
;; 0FBC: F6C201 TEST DL, 1
;; 0FBF: 7572 JNE L8
;; 0FC1: 488BC2 MOV RAX, RDX
;; 0FC4: 48D1F8 SAR RAX, 1
;; 0FC7: 498BD0 MOV RDX, R8
;; 0FCA: 4829C2 SUB RDX, RAX
;; 0FCD: 4881FAE8030000 CMP RDX, 1000
;; 0FD4: 7C22 JL L5
;; 0FD6: L3: 4883C602 ADD RSI, 2
;; 0FDA: L4: 488BC6 MOV RAX, RSI
;; 0FDD: 48D1F8 SAR RAX, 1
;; 0FE0: 483B45F0 CMP RAX, [RBP-16]
;; 0FE4: 0F8C36FFFFFF JL L2
;; 0FEA: 488B45E8 MOV RAX, [RBP-24]
;; 0FEE: 488B5001 MOV RDX, [RAX+1]
;; 0FF2: 488BE5 MOV RSP, RBP
;; 0FF5: F8 CLC
;; 0FF6: 5D POP RBP
;; 0FF7: C3 RET
;; 0FF8: L5: 4C8B45E0 MOV R8, [RBP-32]
;; 0FFC: 49896D28 MOV [R13+40], RBP ; thread.pseudo-atomic-bits
;; 1000: 4D8B5D68 MOV R11, [R13+104] ; thread.alloc-region
;; 1004: 498D4310 LEA RAX, [R11+16]
;; 1008: 493B4570 CMP RAX, [R13+112]
;; 100C: 7747 JNBE L15
;; 100E: 49894568 MOV [R13+104], RAX ; thread.alloc-region
;; 1012: L6: 498D4307 LEA RAX, [R11+7]
;; 1016: 49316D28 XOR [R13+40], RBP ; thread.pseudo-atomic-bits
;; 101A: 7402 JEQ L7
;; 101C: CC09 BREAK 9 ; pending interrupt trap
;; 101E: L7: 488970F9 MOV [RAX-7], RSI
;; 1022: C7400117001020 MOV DWORD PTR [RAX+1], #x20100017 ; NIL
;; 1029: 488945E0 MOV [RBP-32], RAX
;; 102D: 49894001 MOV [R8+1], RAX
;; 1031: EBA3 JMP L3
;; 1033: L8: CC48 BREAK 72 ; OBJECT-NOT-FIXNUM-ERROR
;; 1035: 08 BYTE #X08 ; RDX
;; 1036: L9: CC48 BREAK 72 ; OBJECT-NOT-FIXNUM-ERROR
;; 1038: 08 BYTE #X08 ; RDX
;; 1039: L10: CC48 BREAK 72 ; OBJECT-NOT-FIXNUM-ERROR
;; 103B: 18 BYTE #X18 ; RSI
;; 103C: CC10 BREAK 16 ; Invalid argument count trap
;; 103E: L11: CC18 BREAK 24 ; UNBOUND-SYMBOL-ERROR
;; 1040: 00 BYTE #X00 ; RAX
;; 1041: L12: 6A10 PUSH 16
;; 1043: FF14252800B021 CALL QWORD PTR [#x21B00028] ; ALLOC-TRAMP-R11
;; 104A: E9A9FEFFFF JMP L0
;; 104F: L13: CC18 BREAK 24 ; UNBOUND-SYMBOL-ERROR
;; 1051: 27 BYTE #X27 ; '*S*
;; 1052: L14: CC18 BREAK 24 ; UNBOUND-SYMBOL-ERROR
;; 1054: 27 BYTE #X27 ; '*S*
;; 1055: L15: 6A10 PUSH 16
;; 1057: FF14252800B021 CALL QWORD PTR [#x21B00028] ; ALLOC-TRAMP-R11
;; 105E: EBB2 JMP L6
… this is almost a manageable amount of assembler, though I don't claim to understand all of it. If I needed more speed, though, I could dig into the assembler as a way to try and ferret out even more efficiency.
Summary
We went from our initial POC in very functional-looking Clojure, running at 250 kHz, to optimized Common Lisp running at 100,000,000 / 1467 msec = 68.2 MHz, which is nearly 300x faster.
Checking our intuition about how fast this really is: my processor is 2.7 GHz, and our processing rate is 68.2 MHz, meaning we're spending about 40 clock ticks per event, which doesn't strike me as too bad.
In the end, we made the following changes to our initial attempt:
- Separated out the initial event time-sequence generation as a separate step;
- Rewrote the algorithm as a single loop;
declaim
'ed optimizations to the compiler;- Type-hinted the input array, the variables used in the time difference calculation, and the loop upper bound.
We also touched on profiling (which didn't provide much benefit this time, but can be very useful generally), and showed how to inspect the generated assembler code.
Conclusion
Common Lisp can be made pretty fast! I should write a C-language version of this toy problem to compare. The final code is not nearly as beautiful as the original, but without too much effort we got much faster code. I'll conclude with this line from ANSI Common Lisp: "Lisp is really two languages: a language for writing fast programs and a language for writing programs fast." It seems quite powerful to me to have both of these on hand.