From Elegance to Speed

Making Things Fast

A while back, we used a toy particle physics trigger to explore concepts in lazy evaluation and functional programming in Clojure. We came up with the following function which selected groups of eight events clustered close together in time:

(defn smt-8 [times]
  (->> times
       (partition 8 1)
       (map (juxt identity
                  (comp (partial apply -)
                        (juxt last first))))
       (filter (comp (partial > 1000) second))))

Though elegant, this function processed events at a rate of 250 kHz, which I found slightly disappointing, performance-wise.

I have been interested in writing faster programs lately, which is one of the reasons I've been learning Common Lisp. Common Lisp lets you create extremely high-level programming constructs, but also lets you get closer to the metal when you need to. In this post we'll look at some ways to improve performance and compare our results to the original 250 kHz event rate.

First Steps

The first step is to translate our code as is to Common Lisp. We're going to punt entirely on the question of laziness for this post, though we might take it up in the future.

Here's my first attempt. We have to start with our source of random times. In Clojure, we had:

(def times (iterate #(+ % (rand-int 1000)) 0))

Since I'm going to benchmark performance, we'll need to set a limit to the number of events (times) we'll process. So, in Common Lisp:

(defun time-sequence (n)
  (loop repeat n
     for y = 0 then (+ y (random 1000))
     collect y))

where n is the number of event times to process. Here and throughout what follows, I'm using the loop macro, which is a Swiss army knife for iteration with a somewhat strange, non-Lispy, but readable syntax.
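
Since loop appears in every snippet below, here's a tiny standalone illustration of the clauses this post leans on (repeat, for … = … then, for … below, if, and collect):

;; repeat / for ... = ... then / collect, as in time-sequence above:
(loop repeat 5
   for y = 0 then (+ y 10)
   collect y)
;;=>
'(0 10 20 30 40)

;; for ... below / if ... collect, as in the trigger loops later on:
(loop for i below 10
   if (evenp i)
   collect i)
;;=>
'(0 2 4 6 8)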

We also want to set our random state so as to guarantee different results each time we execute our simulation:

(setf *random-state* (make-random-state t))

My first attempt in Common Lisp looked like this benchmarking snippet:

(timing
  (length
    (->> 1000000
         time-sequence
         (partition-n 8 1)
         (mapcar (juxt #'car (compose #'car #'last) #'identity))
         (mapcar #'(lambda (l) `(,(- (cadr l) (car l)) ,(caddr l))))
         (remove-if-not #'(lambda (l) (< (car l) 1000))))))

This is less lovely than the Clojure version, partly because of all the hash-quoting of functions (Common Lisp is a Lisp-2, Clojure a Lisp-1). ->> is from the arrow-macros library, and compose is from the cl-utilities library. The other Clojure-ish functions, partition-n (a replacement for Clojure's partition, whose name collides with an entirely different function in Common Lisp) and juxt, are from cl-oju, a small library I've been writing for those still-frequent times when I want a Clojure function or idiom in Common Lisp. (The library ignores laziness for now, since I haven't needed it yet.)
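
In case partition-n and juxt are unfamiliar, here are minimal sketches of what they do. These are illustrative stand-ins only, not the cl-oju source:

(defun partition-n (n step xs)
  "Return successive sublists of XS of length N, advancing by STEP."
  (loop for tail on xs by (lambda (l) (nthcdr step l))
     while (>= (length tail) n)
     collect (subseq tail 0 n)))

(defun juxt (&rest fns)
  "Return a function that applies each of FNS to its arguments and
collects the results in a list, like Clojure's juxt."
  (lambda (&rest args)
    (mapcar (lambda (f) (apply f args)) fns)))

(partition-n 3 1 '(1 2 3 4 5))
;;=>
'((1 2 3) (2 3 4) (3 4 5))

(funcall (juxt #'car #'length) '(9 8 7))
;;=>
'(9 3)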

timing is a macro I adapted from the Common Lisp Cookbook, which captures both the results of a computation and the elapsed CPU time used (not wall clock time) in msec:

(defmacro timing (&body forms)
  (let ((run1 (gensym))
	(run2 (gensym))
	(result (gensym)))
    `(let ((,run1 (get-internal-run-time))
	   (,result (progn ,@forms))
	   (,run2 (get-internal-run-time)))
       `(duration ,(- ,run2 ,run1) msec... result ,,result))))

For now, result is simply the number of eight-fold time clusters occurring within 1000 units of time. The execution time was roughly a second for the first few times I ran this:

 '(DURATION 1045 MSEC... RESULT 235)
 '(DURATION 1554 MSEC... RESULT 201)
 '(DURATION 827 MSEC... RESULT 164)

This already processed events at about 1 MHz, roughly 4x the Clojure rate of 250 kHz.

Later, however, when I revisited the code, I noticed it ran significantly faster:

 '(DURATION 435 MSEC... RESULT 193)
 '(DURATION 279 MSEC... RESULT 189)
 '(DURATION 205 MSEC... RESULT 177)
 '(DURATION 601 MSEC... RESULT 180)

This is 2.6 MHz, roughly 10x the Clojure rate. It took me a while to figure out that I had changed the SBCL heap size in order to handle larger arrays for testing (this is one area where laziness would help!). The larger heap size was making the code run faster.
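
For reference (a note about my setup rather than something measured above): the heap in question is SBCL's dynamic space, which is set at startup and can be inspected from a running image:

;; Start SBCL with a larger heap; the size is in megabytes, and 8192
;; here is just an example value:
;;   sbcl --dynamic-space-size 8192

;; Check the current limit (in bytes) from the REPL:
(sb-ext:dynamic-space-size)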

It should be said at this point that I made no attempt in my original blog post to optimize the Clojure code for performance. Nevertheless, I think it's interesting that my first attempt to write the same algorithm in Common Lisp, using roughly the same idioms, performed so much faster.

Interlude

Looking back over my career, it seems I have been moving gradually from fast, lower-level languages to slower, more expressive, high-level languages: FORTRAN and C, to Perl and Python, then to Clojure (which is faster than Python for most long-running programs, while being arguably more expressive[fn::Reasonable people can disagree on the relative expressiveness of languages; for me, Lisps win on expressiveness because I can bend the language to my will using macros in a way that I cannot with non-Lisps.]). I've enjoyed this trajectory, because higher level languages let you implement so much, so quickly.

My feelings on this are shifting, however. I wrote some code a decade ago for a system that has been running constantly for a decade on thousands of devices. My code is not particularly CPU efficient and, therefore, not as energy-efficient as it could be. As the planet warms, I think it's important to try and find small efficiencies where we can… not necessarily under the assumption that it will make a massive difference, but because it's just the right thing to do as conscientious professionals. (The fact that the sins of my past are executing on processors running on top of and within Antarctic ice adds an extra bit of irony here).

Common Lisp is particularly interesting to me of late because, having been designed largely for Good Old-Fashioned AI, it allows you to work at a very high level (you can, for example, add Prolog-like functionality to a Lisp in a few hundred lines of code[fn::See Paul Graham, On Lisp, p. vii]) – but you can also generate very fast (i.e., more energy-efficient) code, as we shall see shortly.

Improvements

The next thing I tried was to separate the event generation from the trigger calculation and to use a vector for the input times, instead of a list:

(defun time-sequence-vec (n)
  (make-array `(,n)
              :initial-contents
              (loop repeat n
                 for y = 0 then (+ y (random 1000))
                 collect y)))

(defparameter array-size (* 100 1000 1000))

(defparameter *s* (time-sequence-vec array-size))

I also make the test sample much larger, so that the runs last long enough to give better timing statistics.

Both Graham[1] and Norvig[fn::Peter Norvig, Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp] emphasize tuning your algorithm before turning to other optimizations. In our case the problem is O(n), which is the best time complexity we can hope for, but our use of function composition, while elegant, can be turned into something more efficient. To improve the algorithm proper, I perform the entire computation in a single loop statement. I also change the output slightly to show, for each matching 8-fold event, its index, start and end times, and duration:

(defun second-try ()
  (loop for x below (- array-size 8)
     if (< (- (elt *s* (+ x 8))
              (elt *s* x))
           1000)
     collect `(index ,x
                     startval ,(elt *s* x)
                     endval   ,(elt *s* (+ x 8))
                     diff     ,(- (elt *s* (+ x 8))
                                  (elt *s* x)))))
(take 4 (second-try))
;;=>
'((INDEX 69087 STARTVAL 34504666 ENDVAL 34505615 DIFF 949)
  (INDEX 153990 STARTVAL 76991815 ENDVAL 76992630 DIFF 815)
  (INDEX 237070 STARTVAL 118617491 ENDVAL 118618358 DIFF 867)
  (INDEX 294951 STARTVAL 147513479 ENDVAL 147514346 DIFF 867))

;; Time it, a few times:
(timing (length (second-try)))
;;=>
'(DURATION 2158 MSEC... RESULT 2427)
;;=>
'(DURATION 2167 MSEC... RESULT 2427)
;;=>
'(DURATION 2162 MSEC... RESULT 2427)

Note that the starting and ending time values increase monotonically, and the reported time differences (DIFF) are under 1000, as desired[2].

The new trigger function improves the processing rate to 46.2 MHz. At this point I'm starting to wonder what the assembly language looks like:

(disassemble 'second-try)
;;=>
; disassembly for SECOND-TRY
; Size: 1061 bytes. Origin: #x228FE0B0                        ; SECOND-TRY
; 0B0:       498B7510         MOV RSI, [R13+16]               ; thread.binding-stack-pointer
; 0B4:       488975F8         MOV [RBP-8], RSI
; 0B8:       488B3591FFFFFF   MOV RSI, [RIP-111]              ; 'ARRAY-SIZE
; 0BF:       8B56F5           MOV EDX, [RSI-11]
; 0C2:       4A8B142A         MOV RDX, [RDX+R13]
; 0C6:       83FA61           CMP EDX, 97
; 0C9:       480F4456F9       CMOVEQ RDX, [RSI-7]
; 0CE:       83FA51           CMP EDX, 81
; 0D1:       0F84BF030000     JEQ L17
; 0D7:       BF10000000       MOV EDI, 16
; 0DC:       FF1425B000B021   CALL QWORD PTR [#x21B000B0]     ; GENERIC--

;; .................. many lines omitted ................

; 4B6: L24:  CC18             BREAK 24                        ; UNBOUND-SYMBOL-ERROR
; 4B8:       27               BYTE #X27                       ; '*S*
; 4B9: L25:  6880000000       PUSH 128
; 4BE:       FF14252800B021   CALL QWORD PTR [#x21B00028]     ; ALLOC-TRAMP-R11
; 4C5:       E9FFFEFFFF       JMP L12
; 4CA: L26:  6A10             PUSH 16
; 4CC:       FF14252800B021   CALL QWORD PTR [#x21B00028]     ; ALLOC-TRAMP-R11
; 4D3:       EB94             JMP L14

(258 lines in total.)

My assembler-fu is not strong, but I notice the call to GENERIC--, SBCL's generic (type-dispatching) subtraction routine, which we shouldn't need here: there are many opportunities to introduce type declarations.

After declaring to the compiler that we want more speed and less type-checking ((declaim (optimize (speed 3) (debug 0) (safety 0)))), let's give our array the element type fixnum (which on 64-bit SBCL is an integer from -4611686018427387904 to 4611686018427387903).
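
Spelled out as a form (the same settings quoted above, placed near the top of the file or evaluated before compiling the code that follows):

(declaim (optimize (speed 3) (debug 0) (safety 0)))

;; For reference, the bounds quoted above are simply SBCL's 62-bit fixnums:
;;   most-negative-fixnum  ;=> -4611686018427387904 on 64-bit SBCL
;;   most-positive-fixnum  ;=>  4611686018427387903

The fixnum-typed array and its initialization: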

(defparameter *s* (make-array array-size
                              :element-type 'fixnum))
;; Initialize array:
(loop
   for y = 0 then (+ y (random 1000))
   for i below array-size
   do (setf (elt *s* i) y))

(timing
  (length
   (loop for x below (- array-size 8)
      if (< (- (elt *s* (+ x 8))
               (elt *s* x))
            1000)
      collect `(index ,x
                      startval ,(elt *s* x)
                      endval   ,(elt *s* (+ x 8))
                      diff     ,(- (elt *s* (+ x 8))
                                   (elt *s* x))))))
;;=>
'(DURATION 1964 MSEC... RESULT 2435)
;;=>
'(DURATION 1939 MSEC... RESULT 2435)
;;=>
'(DURATION 2010 MSEC... RESULT 2435)

… maybe a slight improvement. Let's now type-hint where we can throughout the tight loop:

(timing
  (length
   (loop for x below (- array-size 8)
      if (< (- (the fixnum (elt *s* (the fixnum (+ x 8))))
               (the fixnum (elt *s* (the fixnum x))))
            1000)
      collect `(index ,x
                      startval ,(elt *s* x)
                      endval   ,(elt *s* (+ x 8))
                      diff     ,(- (elt *s* (+ x 8))
                                   (elt *s* x))))))
;;=>
'(DURATION 1740 MSEC... RESULT 2435)
;;=>
'(DURATION 1738 MSEC... RESULT 2435)
;;=>
'(DURATION 1753 MSEC... RESULT 2435)

A 10% or so improvement… not too bad.

At this point I'm tempted to look at the macroexpansion of my loop expression and see what it's doing. For simplicity, I collect just the starting time index for times satisfying the trigger:

(macroexpand-1
 '(loop for x below (- array-size 8)
     if (< (- (the fixnum (elt *s* (the fixnum (+ x 8))))
            (the fixnum (elt *s* (the fixnum x))))
         1000)
     collect x))
;;=>
'(block nil
  (let ((#:loop-limit-711 (- array-size 8)) (x 0))
    (declare (type (and number real) x)
             (type (and number real) #:loop-limit-711))
    (sb-loop::with-loop-list-collection-head (#:loop-list-head-712
                                              #:loop-list-tail-713)
      (tagbody
       sb-loop::next-loop
         (when (>= x #:loop-limit-711) (go sb-loop::end-loop))
         (if (<
              (- (the fixnum (elt *s* (the fixnum (+ x 8))))
                 (the fixnum (elt *s* (the fixnum x))))
              1000)
             (sb-loop::loop-collect-rplacd
              (#:loop-list-head-712 #:loop-list-tail-713) (list x)))
         (sb-loop::loop-desetq x (1+ x))
         (go sb-loop::next-loop)
       sb-loop::end-loop
         (return-from nil
           (sb-loop::loop-collect-answer #:loop-list-head-712))))))

I notice that x and #:loop-limit-711 get an overly general type declaration ((and number real), i.e. just real, rather than fixnum). I'm not sure how to type-hint x itself, but I can type-hint the upper bound:

(timing
  (length
   (loop for x below (- (the fixnum array-size) 8)
      if (< (- (the fixnum (elt *s* (+ (the fixnum x) 8)))
               (the fixnum (elt *s* (the fixnum x))))
            1000)
      collect `(index ,x
                      startval ,(elt *s* x)
                      endval   ,(elt *s* (+ x 8))
                      diff     ,(- (elt *s* (+ x 8))
                                   (elt *s* x))))))
;;=>
'(DURATION 1460 MSEC... RESULT 2513)
;;=>
'(DURATION 1455 MSEC... RESULT 2513)
;;=>
'(DURATION 1486 MSEC... RESULT 2513)

Another modest but significant improvement.

At this point I reached for the SBCL profiler to see what I could learn from it. Most of the processing time is spent in array lookups, which I don't think we can avoid.
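
I won't reproduce the profiler output here, but for the record, this is roughly how the statistical profiler bundled with SBCL can be invoked (the options shown are one reasonable choice, not necessarily exactly what I used):

(require :sb-sprof)

;; Sample the trigger calculation and print a flat report of where
;; the samples landed:
(sb-sprof:with-profiling (:report :flat :reset t)
  (length (second-try)))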

Let's look at the assembler again. Since trigger matches are rare, most of our time is spent finding the events rather than collecting the detailed results for the matches. So let's switch back to collect x, avoiding the extra assembly associated with building the `(index ,x startval ...) list:

(defun trigger ()
  (loop for x below (- (the fixnum array-size) 8)
     if (< (- (the fixnum (elt *s* (+ (the fixnum x) 8)))
              (the fixnum (elt *s* (the fixnum x))))
           1000)
     collect x))

(disassemble 'trigger)
;;=>
;; disassembly for TRIGGER
;; Size: 448 bytes. Origin: #x22610EA0                         ; TRIGGER
;; 0EA0:       498B4510         MOV RAX, [R13+16]              ; thread.binding-stack-pointer
;; 0EA4:       488945F8         MOV [RBP-8], RAX
;; 0EA8:       488B05B1FFFFFF   MOV RAX, [RIP-79]              ; 'ARRAY-SIZE
;; 0EAF:       8B70F5           MOV ESI, [RAX-11]
;; 0EB2:       4A8B342E         MOV RSI, [RSI+R13]
;; 0EB6:       83FE61           CMP ESI, 97
;; 0EB9:       480F4470F9       CMOVEQ RSI, [RAX-7]
;; 0EBE:       83FE51           CMP ESI, 81
;; 0EC1:       0F8477010000     JEQ L11
;; 0EC7:       40F6C601         TEST SIL, 1
;; 0ECB:       0F8568010000     JNE L10
;; 0ED1:       48D1FE           SAR RSI, 1
;; 0ED4:       4883EE08         SUB RSI, 8
;; 0ED8:       488975F0         MOV [RBP-16], RSI
;; 0EDC:       31F6             XOR ESI, ESI
;; 0EDE:       49896D28         MOV [R13+40], RBP              ; thread.pseudo-atomic-bits
;; 0EE2:       4D8B5D68         MOV R11, [R13+104]             ; thread.alloc-region
;; 0EE6:       498D4310         LEA RAX, [R11+16]
;; 0EEA:       493B4570         CMP RAX, [R13+112]
;; 0EEE:       0F874D010000     JNBE L12
;; 0EF4:       49894568         MOV [R13+104], RAX             ; thread.alloc-region
;; 0EF8: L0:   498D4307         LEA RAX, [R11+7]
;; 0EFC:       49316D28         XOR [R13+40], RBP              ; thread.pseudo-atomic-bits
;; 0F00:       7402             JEQ L1
;; 0F02:       CC09             BREAK 9                        ; pending interrupt trap
;; 0F04: L1:   C740F917001020   MOV DWORD PTR [RAX-7], #x20100017  ; NIL
;; 0F0B:       C7400117001020   MOV DWORD PTR [RAX+1], #x20100017  ; NIL
;; 0F12:       488945E8         MOV [RBP-24], RAX
;; 0F16:       488945E0         MOV [RBP-32], RAX
;; 0F1A:       E9BB000000       JMP L4
;; 0F1F:       90               NOP
;; 0F20: L2:   8B142524144A20   MOV EDX, [#x204A1424]          ; tls_index: *S*
;; 0F27:       4A8B142A         MOV RDX, [RDX+R13]
;; 0F2B:       83FA61           CMP EDX, 97
;; 0F2E:       480F44142528144A20 CMOVEQ RDX, [#x204A1428]     ; *S*
;; 0F37:       83FA51           CMP EDX, 81
;; 0F3A:       0F840F010000     JEQ L13
;; 0F40:       488D7E10         LEA RDI, [RSI+16]
;; 0F44:       4883EC10         SUB RSP, 16
;; 0F48:       488975D8         MOV [RBP-40], RSI
;; 0F4C:       B904000000       MOV ECX, 4
;; 0F51:       48892C24         MOV [RSP], RBP
;; 0F55:       488BEC           MOV RBP, RSP
;; 0F58:       B862934F22       MOV EAX, #x224F9362            ; #<FDEFN ELT>
;; 0F5D:       FFD0             CALL RAX
;; 0F5F:       488B75D8         MOV RSI, [RBP-40]
;; 0F63:       F6C201           TEST DL, 1
;; 0F66:       0F85CA000000     JNE L9
;; 0F6C:       4C8BC2           MOV R8, RDX
;; 0F6F:       49D1F8           SAR R8, 1
;; 0F72:       4C8945D0         MOV [RBP-48], R8
;; 0F76:       8B142524144A20   MOV EDX, [#x204A1424]          ; tls_index: *S*
;; 0F7D:       4A8B142A         MOV RDX, [RDX+R13]
;; 0F81:       83FA61           CMP EDX, 97
;; 0F84:       480F44142528144A20 CMOVEQ RDX, [#x204A1428]     ; *S*
;; 0F8D:       83FA51           CMP EDX, 81
;; 0F90:       0F84BC000000     JEQ L14
;; 0F96:       488BFE           MOV RDI, RSI
;; 0F99:       4883EC10         SUB RSP, 16
;; 0F9D:       488975D8         MOV [RBP-40], RSI
;; 0FA1:       B904000000       MOV ECX, 4
;; 0FA6:       48892C24         MOV [RSP], RBP
;; 0FAA:       488BEC           MOV RBP, RSP
;; 0FAD:       B862934F22       MOV EAX, #x224F9362            ; #<FDEFN ELT>
;; 0FB2:       FFD0             CALL RAX
;; 0FB4:       4C8B45D0         MOV R8, [RBP-48]
;; 0FB8:       488B75D8         MOV RSI, [RBP-40]
;; 0FBC:       F6C201           TEST DL, 1
;; 0FBF:       7572             JNE L8
;; 0FC1:       488BC2           MOV RAX, RDX
;; 0FC4:       48D1F8           SAR RAX, 1
;; 0FC7:       498BD0           MOV RDX, R8
;; 0FCA:       4829C2           SUB RDX, RAX
;; 0FCD:       4881FAE8030000   CMP RDX, 1000
;; 0FD4:       7C22             JL L5
;; 0FD6: L3:   4883C602         ADD RSI, 2
;; 0FDA: L4:   488BC6           MOV RAX, RSI
;; 0FDD:       48D1F8           SAR RAX, 1
;; 0FE0:       483B45F0         CMP RAX, [RBP-16]
;; 0FE4:       0F8C36FFFFFF     JL L2
;; 0FEA:       488B45E8         MOV RAX, [RBP-24]
;; 0FEE:       488B5001         MOV RDX, [RAX+1]
;; 0FF2:       488BE5           MOV RSP, RBP
;; 0FF5:       F8               CLC
;; 0FF6:       5D               POP RBP
;; 0FF7:       C3               RET
;; 0FF8: L5:   4C8B45E0         MOV R8, [RBP-32]
;; 0FFC:       49896D28         MOV [R13+40], RBP              ; thread.pseudo-atomic-bits
;; 1000:       4D8B5D68         MOV R11, [R13+104]             ; thread.alloc-region
;; 1004:       498D4310         LEA RAX, [R11+16]
;; 1008:       493B4570         CMP RAX, [R13+112]
;; 100C:       7747             JNBE L15
;; 100E:       49894568         MOV [R13+104], RAX             ; thread.alloc-region
;; 1012: L6:   498D4307         LEA RAX, [R11+7]
;; 1016:       49316D28         XOR [R13+40], RBP              ; thread.pseudo-atomic-bits
;; 101A:       7402             JEQ L7
;; 101C:       CC09             BREAK 9                        ; pending interrupt trap
;; 101E: L7:   488970F9         MOV [RAX-7], RSI
;; 1022:       C7400117001020   MOV DWORD PTR [RAX+1], #x20100017  ; NIL
;; 1029:       488945E0         MOV [RBP-32], RAX
;; 102D:       49894001         MOV [R8+1], RAX
;; 1031:       EBA3             JMP L3
;; 1033: L8:   CC48             BREAK 72                       ; OBJECT-NOT-FIXNUM-ERROR
;; 1035:       08               BYTE #X08                      ; RDX
;; 1036: L9:   CC48             BREAK 72                       ; OBJECT-NOT-FIXNUM-ERROR
;; 1038:       08               BYTE #X08                      ; RDX
;; 1039: L10:  CC48             BREAK 72                       ; OBJECT-NOT-FIXNUM-ERROR
;; 103B:       18               BYTE #X18                      ; RSI
;; 103C:       CC10             BREAK 16                       ; Invalid argument count trap
;; 103E: L11:  CC18             BREAK 24                       ; UNBOUND-SYMBOL-ERROR
;; 1040:       00               BYTE #X00                      ; RAX
;; 1041: L12:  6A10             PUSH 16
;; 1043:       FF14252800B021   CALL QWORD PTR [#x21B00028]    ; ALLOC-TRAMP-R11
;; 104A:       E9A9FEFFFF       JMP L0
;; 104F: L13:  CC18             BREAK 24                       ; UNBOUND-SYMBOL-ERROR
;; 1051:       27               BYTE #X27                      ; '*S*
;; 1052: L14:  CC18             BREAK 24                       ; UNBOUND-SYMBOL-ERROR
;; 1054:       27               BYTE #X27                      ; '*S*
;; 1055: L15:  6A10             PUSH 16
;; 1057:       FF14252800B021   CALL QWORD PTR [#x21B00028]    ; ALLOC-TRAMP-R11
;; 105E:       EBB2             JMP L6

… this is almost a manageable amount of assembler, though I don't claim to understand all of it. If I needed more speed, though, I could dig into the assembler as a way to try and ferret out even more efficiency.
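
For instance, the listing above still shows full calls through #<FDEFN ELT>, the generic sequence accessor. One plausible next step, which I haven't benchmarked here, would be to declare *s* as a specialized vector and use aref; it also turns out that the loop variable itself can be declared with loop's of-type syntax:

;; Untimed sketch of a possible further step (trigger-aref is a
;; hypothetical name, not code from the measurements above):
(declaim (type (simple-array fixnum (*)) *s*))

(defun trigger-aref ()
  (loop for x of-type fixnum below (- (the fixnum array-size) 8)
     if (< (- (aref *s* (+ x 8))
              (aref *s* x))
           1000)
     collect x))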

Summary

We went from our initial POC in very functional-looking Clojure, running at 250 kHz, to optimized Common Lisp running at 100,000,000 / 1467 msec = 68.2 MHz, roughly 270x faster.

Checking our intuition about how fast this really is: my processor is 2.7 GHz, and our processing rate is 68.2 MHz, meaning we're spending about 40 clock ticks per event, which doesn't strike me as too bad.

In the end, we made the following changes to our initial attempt:

  • Separated out the initial event time-sequence generation as a separate step;
  • Rewrote the algorithm as a single loop;
  • declaim'ed optimizations to the compiler;
  • Type-hinted the input array, the variables used in the time difference calculation, and the loop upper bound.

We also touched on profiling (which didn't provide much benefit this time, but can be very useful generally), and showed how to inspect the generated assembler code.

Conclusion

Common Lisp can be made pretty fast! I should write a C-language version of this toy problem to compare. The final code is not nearly as beautiful as the original, but without too much effort we got much faster code. I'll conclude with this line from ANSI Common Lisp: "Lisp is really two languages: a language for writing fast programs and a language for writing programs fast." It seems quite powerful to me to have both of these on hand.


[1] Paul Graham, ANSI Common Lisp

[2] The original Clojure code reported the entire sequence of eight times for each triggered event, but we omit that level of detail for clarity. Since the triggered clusters are rare, this doesn't significantly impact the timing measurements.