読者です 読者をやめる 読者になる 読者になる

Common Lispでdisassembleを使ったりチューニングしたり

Common Lispはdisassembleも付いてきてチューニングできるらしいとゆー話をよく聞くので、逆アセンブルしてみた。 題材としてはこの辺を参考に

caching-gemmがバグってたので厳密には同じコードではなくなっている。

処理系のバージョンは次の通り

  • SBCL 1.3.2
  • CCL 1.11

simple-gemm

単純なGEMM。

(defun simple-gemm (ma mb)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (declare (type (simple-array single-float (* *)) ma mb))
  (let ((rows (array-dimension ma 0))
        (cols (array-dimension mb 1)))
    (declare (type fixnum rows cols))
    (let ((result (make-matrix rows cols)))
      (declare (type (simple-array single-float (* *)) result))
      (dotimes (row rows)
        (dotimes (col cols)
          (dotimes (k cols)
            (incf (aref result row col)
                  (* (aref ma row k) (aref mb k col))))))
      result)))

SBCLの場合。 普通っぽい。

CL-USER> (disassemble 'simple-gemm)
; disassembly for SIMPLE-GEMM
; Size: 283 bytes. Origin: #x1004C5CA65
; A65:       4C894DE8         MOV [RBP-24], R9                ; no-arg-parsing entry point
; A69:       4C8945E0         MOV [RBP-32], R8
; A6D:       4D8B6831         MOV R13, [R8+49]
; A71:       4D8B5139         MOV R10, [R9+57]
; A75:       4C896DF8         MOV [RBP-8], R13
; A79:       4C8955F0         MOV [RBP-16], R10
; A7D:       488D5C24F0       LEA RBX, [RSP-16]
; A82:       4883EC18         SUB RSP, 24
; A86:       498BD5           MOV RDX, R13
; A89:       498BFA           MOV RDI, R10
; A8C:       488B057DFFFFFF   MOV RAX, [RIP-131]              ; #<FDEFINITION for MAKE-MATRIX>
; A93:       B904000000       MOV ECX, 4
; A98:       48892B           MOV [RBX], RBP
; A9B:       488BEB           MOV RBP, RBX
; A9E:       FF5009           CALL QWORD PTR [RAX+9]
; AA1:       480F42E3         CMOVB RSP, RBX
; AA5:       4C8B45E0         MOV R8, [RBP-32]
; AA9:       4C8B4DE8         MOV R9, [RBP-24]
; AAD:       4C8B55F0         MOV R10, [RBP-16]
; AB1:       4C8B6DF8         MOV R13, [RBP-8]
; AB5:       488BF2           MOV RSI, RDX
; AB8:       31DB             XOR EBX, EBX
; ABA:       E9AF000000       JMP L5
; ABF:       90               NOP
; AC0: L0:   31C0             XOR EAX, EAX
; AC2:       E99A000000       JMP L4
; AC7:       660F1F840000000000 NOP
; AD0: L1:   31C9             XOR ECX, ECX
; AD2:       E981000000       JMP L3
; AD7:       660F1F840000000000 NOP
; AE0: L2:   498B5039         MOV RDX, [R8+57]
; AE4:       488BFB           MOV RDI, RBX
; AE7:       48D1FF           SAR RDI, 1
; AEA:       480FAFFA         IMUL RDI, RDX
; AEE:       4801CF           ADD RDI, RCX
; AF1:       498B5011         MOV RDX, [R8+17]
; AF5:       F30F104C7A01     MOVSS XMM1, [RDX+RDI*2+1]
; AFB:       498B5139         MOV RDX, [R9+57]
; AFF:       488BF9           MOV RDI, RCX
; B02:       48D1FF           SAR RDI, 1
; B05:       480FAFFA         IMUL RDI, RDX
; B09:       4801C7           ADD RDI, RAX
; B0C:       498B5111         MOV RDX, [R9+17]
; B10:       F30F10547A01     MOVSS XMM2, [RDX+RDI*2+1]
; B16:       F30F59D1         MULSS XMM2, XMM1
; B1A:       488B5639         MOV RDX, [RSI+57]
; B1E:       488BFB           MOV RDI, RBX
; B21:       48D1FF           SAR RDI, 1
; B24:       480FAFFA         IMUL RDI, RDX
; B28:       4801C7           ADD RDI, RAX
; B2B:       488B5611         MOV RDX, [RSI+17]
; B2F:       F30F104C7A01     MOVSS XMM1, [RDX+RDI*2+1]
; B35:       F30F58CA         ADDSS XMM1, XMM2
; B39:       488B5639         MOV RDX, [RSI+57]
; B3D:       488BFB           MOV RDI, RBX
; B40:       48D1FF           SAR RDI, 1
; B43:       480FAFFA         IMUL RDI, RDX
; B47:       4801C7           ADD RDI, RAX
; B4A:       488B5611         MOV RDX, [RSI+17]
; B4E:       F30F114C7A01     MOVSS [RDX+RDI*2+1], XMM1
; B54:       4883C102         ADD RCX, 2
; B58: L3:   4C39D1           CMP RCX, R10
; B5B:       7C83             JL L2
; B5D:       4883C002         ADD RAX, 2
; B61: L4:   4C39D0           CMP RAX, R10
; B64:       0F8C66FFFFFF     JL L1
; B6A:       4883C302         ADD RBX, 2
; B6E: L5:   4C39EB           CMP RBX, R13
; B71:       0F8C49FFFFFF     JL L0
; B77:       488BD6           MOV RDX, RSI
; B7A:       488BE5           MOV RSP, RBP
; B7D:       F8               CLC
; B7E:       5D               POP RBP
; B7F:       C3               RET

CCLの場合。 対応するLispコードがコメントに付いてて分かりやすい。 S式風謎アセンブラよく見るとよく分からないの色々あるな。 $が即値で%が変数かな? CCLのドキュメントちゃんと読まないと無理そう

CL-USER> (disassemble 'caching-gemm)
;; "home:sandbox;common-lisp;simple-gemm.lisp.newest":1039-1729
    (recover-fn-from-rip)                   ;     [7]
    (pushq (% rbp))                         ;    [14]
    (movq (% rsp) (% rbp))                  ;    [15]
    (pushq (% arg_y))                       ;    [18]
    (pushq (% arg_z))                       ;    [19]
    (movq (@ (% gs) 80) (% stack-temp))     ;    [20]
    (subq ($ 64) (@ (% gs) 80))             ;    [29]
    (movq (@ (% gs) 80) (% imm0))           ;    [39]
    (movq (% stack-temp) (@ (% imm0)))      ;    [48]
    (movq (@ (% gs) #x178) (% stack-temp))  ;    [51]
    (movq (% stack-temp) (@ 8 (% imm0)))    ;    [60]
    (movq (% imm0) (@ (% gs) #x178))        ;    [64]
    (pushq (% save0))                       ;    [73]
    (pushq (% save1))                       ;    [75]
    (pushq (% save2))                       ;    [77]
    (pushq (% save3))                       ;    [79]

;;; (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols))
    (xorl (% arg_z.l) (% arg_z.l))          ;    [81]
    (movl ($ 16) (% nargs))                 ;    [83]
    (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ;    [88]
    (lisp-call (@ 10 (% temp0)))            ;    [97]
    (recover-fn-from-rip)                   ;   [100]
    (movq (% arg_z) (% arg_x))              ;   [107]
    (pushq (% arg_x))                       ;   [110]
    (movq (@ -16 (% rbp)) (% arg_y))        ;   [112]
    (movl ($ 8) (% arg_z.l))                ;   [116]
    (movl ($ 16) (% nargs))                 ;   [121]
    (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ;   [126]
    (lisp-call (@ 10 (% temp0)))            ;   [137]
    (recover-fn-from-rip)                   ;   [140]
    (movq (% arg_z) (% save3))              ;   [147]

;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d
    (movq (@ -56 (% rbp)) (% arg_y))        ;   [150]
    (movq (% save3) (% arg_z))              ;   [154]
    (movl ($ 16) (% nargs))                 ;   [157]
    (movq (@ 'MAKE-MATRIX (% fn)) (% temp0)) ;   [162]
    (lisp-call (@ 10 (% temp0)))            ;   [169]
    (recover-fn-from-rip)                   ;   [172]
    (pushq (% arg_z))                       ;   [179]

;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f
    (xorl (% save2.l) (% save2.l))          ;   [180]
    (jmpq L562)                             ;   [183]

;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime
L181
    (xorl (% save1.l) (% save1.l))          ;   [188]
    (jmpq L549)                             ;   [191]

;;; (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotimes (k cols) (incf cel
L189
    (movq (@ -64 (% rbp)) (% arg_x))        ;   [196]
    (movq (@ 43 (% arg_x)) (% imm0))        ;   [200]
    (sarq ($ 3) (% imm0))                   ;   [204]
    (imulq (% save2) (% imm0))              ;   [208]
    (leaq (@ (% save1) (% imm0)) (% temp0)) ;   [212]
    (movq (@ 11 (% arg_x)) (% arg_z))       ;   [216]
    (movq (% temp0) (% imm0))               ;   [220]
    (shrq (% imm0))                         ;   [223]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ;   [226]
    (movd (% imm0) (% stack-temp))          ;   [232]
    (movq (@ (% gs) #x178) (% imm0))        ;   [236]
    (movss (% fp0) (@ 16 (% imm0)))         ;   [245]
    (movd (% stack-temp) (% imm0))          ;   [250]

;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col))))
    (xorl (% save0.l) (% save0.l))          ;   [254]
    (jmpq L471)                             ;   [257]

;;; (aref ma row k)
L255
    (movq (@ -8 (% rbp)) (% arg_x))         ;   [262]
    (movq (@ 43 (% arg_x)) (% imm0))        ;   [266]
    (sarq ($ 3) (% imm0))                   ;   [270]
    (imulq (% save2) (% imm0))              ;   [274]
    (leaq (@ (% save0) (% imm0)) (% temp0)) ;   [278]
    (movq (@ 11 (% arg_x)) (% arg_z))       ;   [282]
    (movq (% temp0) (% imm0))               ;   [286]
    (shrq (% imm0))                         ;   [289]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ;   [292]

;;; (aref mb k col)
    (movq (@ -16 (% rbp)) (% arg_z))        ;   [298]
    (movq (@ 43 (% arg_z)) (% imm0))        ;   [302]
    (sarq ($ 3) (% imm0))                   ;   [306]
    (imulq (% save0) (% imm0))              ;   [310]
    (leaq (@ (% save1) (% imm0)) (% temp0)) ;   [314]
    (movq (@ 11 (% arg_z)) (% arg_z))       ;   [318]
    (movq (% temp0) (% imm0))               ;   [322]
    (shrq (% imm0))                         ;   [325]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp1)) ;   [328]

;;; (* (aref ma row k) (aref mb k col))
    (mulss (% fp1) (% fp0))                 ;   [334]

;;; (incf cell (* (aref ma row k) (aref mb k col)))
    (movd (% imm0) (% stack-temp))          ;   [338]
    (movq (@ (% gs) #x178) (% imm0))        ;   [342]
    (movss (% fp0) (@ 32 (% imm0)))         ;   [351]
    (movd (% stack-temp) (% imm0))          ;   [356]
    (movd (% imm0) (% stack-temp))          ;   [360]
    (movq (@ (% gs) #x178) (% imm0))        ;   [364]
    (movss (@ 32 (% imm0)) (% fp1))         ;   [373]
    (movd (% stack-temp) (% imm0))          ;   [378]
    (movd (% imm0) (% stack-temp))          ;   [382]
    (movq (@ (% gs) #x178) (% imm0))        ;   [386]
    (movss (@ 16 (% imm0)) (% fp0))         ;   [395]
    (movd (% stack-temp) (% imm0))          ;   [400]
    (addss (% fp1) (% fp0))                 ;   [404]
    (movd (% imm0) (% stack-temp))          ;   [408]
    (movq (@ (% gs) #x178) (% imm0))        ;   [412]
    (movss (% fp0) (@ 48 (% imm0)))         ;   [421]
    (movd (% stack-temp) (% imm0))          ;   [426]
    (movd (% imm0) (% stack-temp))          ;   [430]
    (movq (@ (% gs) #x178) (% imm0))        ;   [434]
    (movss (@ 48 (% imm0)) (% fp0))         ;   [443]
    (movd (% stack-temp) (% imm0))          ;   [448]
    (movd (% imm0) (% stack-temp))          ;   [452]
    (movq (@ (% gs) #x178) (% imm0))        ;   [456]
    (movss (% fp0) (@ 16 (% imm0)))         ;   [465]
    (movd (% stack-temp) (% imm0))          ;   [470]

;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col))))
    (addq ($ 8) (% save0))                  ;   [474]
L471
    (cmpq (% save3) (% save0))              ;   [478]
    (jl L255)                               ;   [481]

;;; (setf (aref result row col) cell)
    (movd (% imm0) (% stack-temp))          ;   [487]
    (movq (@ (% gs) #x178) (% imm0))        ;   [491]
    (movss (@ 16 (% imm0)) (% fp0))         ;   [500]
    (movd (% stack-temp) (% imm0))          ;   [505]
    (movq (% save1) (% arg_y))              ;   [509]
    (movq (% save2) (% arg_x))              ;   [512]
    (movq (@ -64 (% rbp)) (% temp0))        ;   [515]
    (movq (@ 43 (% temp0)) (% imm0))        ;   [519]
    (sarq ($ 3) (% imm0))                   ;   [523]
    (imulq (% arg_x) (% imm0))              ;   [527]
    (leaq (@ (% arg_y) (% imm0)) (% arg_y)) ;   [531]
    (movq (@ 11 (% temp0)) (% arg_x))       ;   [535]
    (movq (% arg_y) (% imm2))               ;   [539]
    (shrq (% imm2))                         ;   [542]
    (movss (% fp0) (@ -5 (% arg_x) (% imm2))) ;   [545]

;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime
    (addq ($ 8) (% save1))                  ;   [552]
L549
    (cmpq (% save3) (% save1))              ;   [556]
    (jl L189)                               ;   [559]

;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f
    (addq ($ 8) (% save2))                  ;   [565]
L562
    (movq (@ -56 (% rbp)) (% arg_z))        ;   [569]
    (cmpq (% arg_z) (% save2))              ;   [573]
    (jl L181)                               ;   [576]

;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d
    (movq (@ -64 (% rbp)) (% arg_z))        ;   [582]
    (addq ($ 16) (% rsp))                   ;   [586]
    (popq (% save3))                        ;   [590]
    (popq (% save2))                        ;   [592]
    (popq (% save1))                        ;   [594]
    (popq (% save0))                        ;   [596]
    (movq (@ (% gs) #x178) (% imm0))        ;   [598]
    (movq (@ 8 (% imm0)) (% stack-temp))    ;   [607]
    (movq (@ (% imm0)) (% imm0))            ;   [611]
    (movq (% imm0) (@ (% gs) 80))           ;   [614]
    (movq (% stack-temp) (@ (% gs) #x178))  ;   [623]
    (leaveq)                                ;   [632]
    (retq)                                  ;   [633]
CL-USER> (disassemble 'simple-gemm)
;; "home:sandbox;common-lisp;simple-gemm.lisp.newest":479-1037
    (recover-fn-from-rip)                   ;     [7]
    (pushq (% rbp))                         ;    [14]
    (movq (% rsp) (% rbp))                  ;    [15]
    (pushq (% arg_y))                       ;    [18]
    (pushq (% arg_z))                       ;    [19]
    (movq (@ (% gs) 80) (% stack-temp))     ;    [20]
    (subq ($ 48) (@ (% gs) 80))             ;    [29]
    (movq (@ (% gs) 80) (% imm0))           ;    [39]
    (movq (% stack-temp) (@ (% imm0)))      ;    [48]
    (movq (@ (% gs) #x178) (% stack-temp))  ;    [51]
    (movq (% stack-temp) (@ 8 (% imm0)))    ;    [60]
    (movq (% imm0) (@ (% gs) #x178))        ;    [64]
    (pushq (% save0))                       ;    [73]
    (pushq (% save1))                       ;    [75]
    (pushq (% save2))                       ;    [77]
    (pushq (% save3))                       ;    [79]

;;; (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols))
    (xorl (% arg_z.l) (% arg_z.l))          ;    [81]
    (movl ($ 16) (% nargs))                 ;    [83]
    (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ;    [88]
    (lisp-call (@ 10 (% temp0)))            ;    [97]
    (recover-fn-from-rip)                   ;   [100]
    (movq (% arg_z) (% arg_x))              ;   [107]
    (pushq (% arg_x))                       ;   [110]
    (movq (@ -16 (% rbp)) (% arg_y))        ;   [112]
    (movl ($ 8) (% arg_z.l))                ;   [116]
    (movl ($ 16) (% nargs))                 ;   [121]
    (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ;   [126]
    (lisp-call (@ 10 (% temp0)))            ;   [137]
    (recover-fn-from-rip)                   ;   [140]
    (pushq (% arg_z))                       ;   [147]

;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d
    (movq (@ -56 (% rbp)) (% arg_y))        ;   [148]
    (movl ($ 16) (% nargs))                 ;   [152]
    (movq (@ 'MAKE-MATRIX (% fn)) (% temp0)) ;   [157]
    (lisp-call (@ 10 (% temp0)))            ;   [169]
    (recover-fn-from-rip)                   ;   [172]
    (movq (% arg_z) (% save3))              ;   [179]

;;; (dotimes (row rows) (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma ro
    (xorl (% save2.l) (% save2.l))          ;   [182]
    (jmpq L479)                             ;   [185]

;;; (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col)
L183
    (xorl (% save1.l) (% save1.l))          ;   [190]
    (jmpq L462)                             ;   [193]

;;; (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col))))
L191
    (xorl (% save0.l) (% save0.l))          ;   [198]
    (jmpq L445)                             ;   [201]

;;; (aref ma row k)
L199
    (movq (@ -8 (% rbp)) (% arg_x))         ;   [206]
    (movq (@ 43 (% arg_x)) (% imm0))        ;   [210]
    (sarq ($ 3) (% imm0))                   ;   [214]
    (imulq (% save2) (% imm0))              ;   [218]
    (leaq (@ (% save0) (% imm0)) (% temp0)) ;   [222]
    (movq (@ 11 (% arg_x)) (% arg_z))       ;   [226]
    (movq (% temp0) (% imm0))               ;   [230]
    (shrq (% imm0))                         ;   [233]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ;   [236]

;;; (aref mb k col)
    (movq (@ -16 (% rbp)) (% arg_z))        ;   [242]
    (movq (@ 43 (% arg_z)) (% imm0))        ;   [246]
    (sarq ($ 3) (% imm0))                   ;   [250]
    (imulq (% save0) (% imm0))              ;   [254]
    (leaq (@ (% save1) (% imm0)) (% temp0)) ;   [258]
    (movq (@ 11 (% arg_z)) (% arg_z))       ;   [262]
    (movq (% temp0) (% imm0))               ;   [266]
    (shrq (% imm0))                         ;   [269]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp1)) ;   [272]

;;; (* (aref ma row k) (aref mb k col))
    (mulss (% fp1) (% fp0))                 ;   [278]

;;; (incf (aref result row col) (* (aref ma row k) (aref mb k col)))
    (movd (% imm0) (% stack-temp))          ;   [282]
    (movq (@ (% gs) #x178) (% imm0))        ;   [286]
    (movss (% fp0) (@ 16 (% imm0)))         ;   [295]
    (movd (% stack-temp) (% imm0))          ;   [300]
    (movq (@ 43 (% save3)) (% imm0))        ;   [304]
    (sarq ($ 3) (% imm0))                   ;   [308]
    (imulq (% save2) (% imm0))              ;   [312]
    (leaq (@ (% save1) (% imm0)) (% temp0)) ;   [316]
    (movq (@ 11 (% save3)) (% arg_z))       ;   [320]
    (movq (% temp0) (% imm0))               ;   [324]
    (shrq (% imm0))                         ;   [327]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ;   [330]
    (movd (% imm0) (% stack-temp))          ;   [336]
    (movq (@ (% gs) #x178) (% imm0))        ;   [340]
    (movss (@ 16 (% imm0)) (% fp1))         ;   [349]
    (movd (% stack-temp) (% imm0))          ;   [354]
    (addss (% fp1) (% fp0))                 ;   [358]
    (movd (% imm0) (% stack-temp))          ;   [362]
    (movq (@ (% gs) #x178) (% imm0))        ;   [366]
    (movss (% fp0) (@ 32 (% imm0)))         ;   [375]
    (movd (% stack-temp) (% imm0))          ;   [380]
    (movd (% imm0) (% stack-temp))          ;   [384]
    (movq (@ (% gs) #x178) (% imm0))        ;   [388]
    (movss (@ 32 (% imm0)) (% fp0))         ;   [397]
    (movd (% stack-temp) (% imm0))          ;   [402]
    (movq (% save1) (% arg_y))              ;   [406]
    (movq (% save2) (% arg_x))              ;   [409]
    (movq (% save3) (% temp0))              ;   [412]
    (movq (@ 43 (% temp0)) (% imm0))        ;   [415]
    (sarq ($ 3) (% imm0))                   ;   [419]
    (imulq (% arg_x) (% imm0))              ;   [423]
    (leaq (@ (% arg_y) (% imm0)) (% arg_y)) ;   [427]
    (movq (@ 11 (% temp0)) (% arg_x))       ;   [431]
    (movq (% arg_y) (% imm2))               ;   [435]
    (shrq (% imm2))                         ;   [438]
    (movss (% fp0) (@ -5 (% arg_x) (% imm2))) ;   [441]

;;; (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col))))
    (addq ($ 8) (% save0))                  ;   [448]
L445
    (movq (@ -64 (% rbp)) (% arg_z))        ;   [452]
    (cmpq (% arg_z) (% save0))              ;   [456]
    (jl L199)                               ;   [459]

;;; (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col)
    (addq ($ 8) (% save1))                  ;   [465]
L462
    (movq (@ -64 (% rbp)) (% arg_z))        ;   [469]
    (cmpq (% arg_z) (% save1))              ;   [473]
    (jl L191)                               ;   [476]

;;; (dotimes (row rows) (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma ro
    (addq ($ 8) (% save2))                  ;   [482]
L479
    (movq (@ -56 (% rbp)) (% arg_z))        ;   [486]
    (cmpq (% arg_z) (% save2))              ;   [490]
    (jl L183)                               ;   [493]

;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d
    (movq (% save3) (% arg_z))              ;   [499]
    (addq ($ 16) (% rsp))                   ;   [502]
    (popq (% save3))                        ;   [506]
    (popq (% save2))                        ;   [508]
    (popq (% save1))                        ;   [510]
    (popq (% save0))                        ;   [512]
    (movq (@ (% gs) #x178) (% imm0))        ;   [514]
    (movq (@ 8 (% imm0)) (% stack-temp))    ;   [523]
    (movq (@ (% imm0)) (% imm0))            ;   [527]
    (movq (% imm0) (@ (% gs) 80))           ;   [530]
    (movq (% stack-temp) (@ (% gs) #x178))  ;   [539]
    (leaveq)                                ;   [548]
    (retq)                                  ;   [549]

これアセンブラ部分が意図的にコメントになってないけど、このままREPLに入れると動くんだろうか?

CL-USER> (recover-fn-from-rip) 
; Evaluation aborted on #<CCL::UNDEFINED-FUNCTION-CALL #x3020018BF73D>.

ダメらしい。

caching-gemm

中間結果を変数にキャッシュするGEMM。

(defun caching-gemm (ma mb)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (declare (type (simple-array single-float (* *)) ma mb))
  (let ((rows (array-dimension ma 0))
        (cols (array-dimension mb 1)))
    (declare (type fixnum rows cols))
    (let ((result (make-matrix rows cols)))
      (declare (type (simple-array single-float (* *)) result))
      (dotimes (row rows)
        (dotimes (col cols)
          (let ((cell (aref result row col)))
            (declare (type (single-float) cell))
            (dotimes (k cols)
              (incf cell
                    (* (aref ma row k) (aref mb k col))))
            (setf (aref result row col) cell))))
      result)))

SBCLの場合。 L2が多分最内ループで、SARしてるのなんだ? 後、幅とか先頭アドレスかを毎回ロードして計算しなおしてる?

CL-USER> (disassemble 'caching-gemm)
; disassembly for CACHING-GEMM
; Size: 272 bytes. Origin: #x1006D8D8F5
; 8F5:       4C894DE8         MOV [RBP-24], R9                ; no-arg-parsing entry point
; 8F9:       4C8945E0         MOV [RBP-32], R8
; 8FD:       4D8B6831         MOV R13, [R8+49]
; 901:       4D8B5139         MOV R10, [R9+57]
; 905:       4C896DF8         MOV [RBP-8], R13
; 909:       4C8955F0         MOV [RBP-16], R10
; 90D:       488D5C24F0       LEA RBX, [RSP-16]
; 912:       4883EC18         SUB RSP, 24
; 916:       498BD5           MOV RDX, R13
; 919:       498BFA           MOV RDI, R10
; 91C:       488B057DFFFFFF   MOV RAX, [RIP-131]              ; #<FDEFINITION for MAKE-MATRIX>
; 923:       B904000000       MOV ECX, 4
; 928:       48892B           MOV [RBX], RBP
; 92B:       488BEB           MOV RBP, RBX
; 92E:       FF5009           CALL QWORD PTR [RAX+9]
; 931:       480F42E3         CMOVB RSP, RBX
; 935:       4C8B45E0         MOV R8, [RBP-32]
; 939:       4C8B4DE8         MOV R9, [RBP-24]
; 93D:       4C8B55F0         MOV R10, [RBP-16]
; 941:       4C8B6DF8         MOV R13, [RBP-8]
; 945:       488BF2           MOV RSI, RDX
; 948:       31DB             XOR EBX, EBX
; 94A:       E9A4000000       JMP L5
; 94F:       90               NOP
; 950: L0:   31C0             XOR EAX, EAX
; 952:       E98F000000       JMP L4
; 957:       660F1F840000000000 NOP
; 960: L1:   488B4E39         MOV RCX, [RSI+57]
; 964:       488BFB           MOV RDI, RBX
; 967:       48D1FF           SAR RDI, 1
; 96A:       480FAFF9         IMUL RDI, RCX
; 96E:       4801C7           ADD RDI, RAX
; 971:       488B4E11         MOV RCX, [RSI+17]
; 975:       F30F104C7901     MOVSS XMM1, [RCX+RDI*2+1]
; 97B:       31C9             XOR ECX, ECX
; 97D:       EB43             JMP L3
; 97F:       90               NOP
; 980: L2:   498B5039         MOV RDX, [R8+57]
; 984:       488BFB           MOV RDI, RBX
; 987:       48D1FF           SAR RDI, 1
; 98A:       480FAFFA         IMUL RDI, RDX
; 98E:       4801CF           ADD RDI, RCX
; 991:       498B5011         MOV RDX, [R8+17]
; 995:       F30F10547A01     MOVSS XMM2, [RDX+RDI*2+1]
; 99B:       498B5139         MOV RDX, [R9+57]
; 99F:       488BF9           MOV RDI, RCX
; 9A2:       48D1FF           SAR RDI, 1
; 9A5:       480FAFFA         IMUL RDI, RDX
; 9A9:       4801C7           ADD RDI, RAX
; 9AC:       498B5111         MOV RDX, [R9+17]
; 9B0:       F30F105C7A01     MOVSS XMM3, [RDX+RDI*2+1]
; 9B6:       F30F59DA         MULSS XMM3, XMM2
; 9BA:       F30F58CB         ADDSS XMM1, XMM3
; 9BE:       4883C102         ADD RCX, 2
; 9C2: L3:   4C39D1           CMP RCX, R10
; 9C5:       7CB9             JL L2
; 9C7:       488B4E39         MOV RCX, [RSI+57]
; 9CB:       488BFB           MOV RDI, RBX
; 9CE:       48D1FF           SAR RDI, 1
; 9D1:       480FAFF9         IMUL RDI, RCX
; 9D5:       4801C7           ADD RDI, RAX
; 9D8:       488B4E11         MOV RCX, [RSI+17]
; 9DC:       F30F114C7901     MOVSS [RCX+RDI*2+1], XMM1
; 9E2:       4883C002         ADD RAX, 2
; 9E6: L4:   4C39D0           CMP RAX, R10
; 9E9:       0F8C71FFFFFF     JL L1
; 9EF:       4883C302         ADD RBX, 2
; 9F3: L5:   4C39EB           CMP RBX, R13
; 9F6:       0F8C54FFFFFF     JL L0
; 9FC:       488BD6           MOV RDX, RSI
; 9FF:       488BE5           MOV RSP, RBP
; A02:       F8               CLC
; A03:       5D               POP RBP
; A04:       C3               RET

CCLの場合。 L255あたりが最内ループ。 途中にlisp-callとか入ってたり、stack-tempが多分レジスタスピルだと思う。

CL-USER> (disassemble 'caching-gemm)
;; "home:sandbox;common-lisp;simple-gemm.lisp.newest":1039-1729
    (recover-fn-from-rip)                   ;     [7]
    (pushq (% rbp))                         ;    [14]
    (movq (% rsp) (% rbp))                  ;    [15]
    (pushq (% arg_y))                       ;    [18]
    (pushq (% arg_z))                       ;    [19]
    (movq (@ (% gs) 80) (% stack-temp))     ;    [20]
    (subq ($ 64) (@ (% gs) 80))             ;    [29]
    (movq (@ (% gs) 80) (% imm0))           ;    [39]
    (movq (% stack-temp) (@ (% imm0)))      ;    [48]
    (movq (@ (% gs) #x178) (% stack-temp))  ;    [51]
    (movq (% stack-temp) (@ 8 (% imm0)))    ;    [60]
    (movq (% imm0) (@ (% gs) #x178))        ;    [64]
    (pushq (% save0))                       ;    [73]
    (pushq (% save1))                       ;    [75]
    (pushq (% save2))                       ;    [77]
    (pushq (% save3))                       ;    [79]

;;; (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols))
    (xorl (% arg_z.l) (% arg_z.l))          ;    [81]
    (movl ($ 16) (% nargs))                 ;    [83]
    (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ;    [88]
    (lisp-call (@ 10 (% temp0)))            ;    [97]
    (recover-fn-from-rip)                   ;   [100]
    (movq (% arg_z) (% arg_x))              ;   [107]
    (pushq (% arg_x))                       ;   [110]
    (movq (@ -16 (% rbp)) (% arg_y))        ;   [112]
    (movl ($ 8) (% arg_z.l))                ;   [116]
    (movl ($ 16) (% nargs))                 ;   [121]
    (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ;   [126]
    (lisp-call (@ 10 (% temp0)))            ;   [137]
    (recover-fn-from-rip)                   ;   [140]
    (movq (% arg_z) (% save3))              ;   [147]

;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d
    (movq (@ -56 (% rbp)) (% arg_y))        ;   [150]
    (movq (% save3) (% arg_z))              ;   [154]
    (movl ($ 16) (% nargs))                 ;   [157]
    (movq (@ 'MAKE-MATRIX (% fn)) (% temp0)) ;   [162]
    (lisp-call (@ 10 (% temp0)))            ;   [169]
    (recover-fn-from-rip)                   ;   [172]
    (pushq (% arg_z))                       ;   [179]

;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f
    (xorl (% save2.l) (% save2.l))          ;   [180]
    (jmpq L562)                             ;   [183]

;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime
L181
    (xorl (% save1.l) (% save1.l))          ;   [188]
    (jmpq L549)                             ;   [191]

;;; (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotimes (k cols) (incf cel
L189
    (movq (@ -64 (% rbp)) (% arg_x))        ;   [196]
    (movq (@ 43 (% arg_x)) (% imm0))        ;   [200]
    (sarq ($ 3) (% imm0))                   ;   [204]
    (imulq (% save2) (% imm0))              ;   [208]
    (leaq (@ (% save1) (% imm0)) (% temp0)) ;   [212]
    (movq (@ 11 (% arg_x)) (% arg_z))       ;   [216]
    (movq (% temp0) (% imm0))               ;   [220]
    (shrq (% imm0))                         ;   [223]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ;   [226]
    (movd (% imm0) (% stack-temp))          ;   [232]
    (movq (@ (% gs) #x178) (% imm0))        ;   [236]
    (movss (% fp0) (@ 16 (% imm0)))         ;   [245]
    (movd (% stack-temp) (% imm0))          ;   [250]

;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col))))
    (xorl (% save0.l) (% save0.l))          ;   [254]
    (jmpq L471)                             ;   [257]

;;; (aref ma row k)
L255
    (movq (@ -8 (% rbp)) (% arg_x))         ;   [262]
    (movq (@ 43 (% arg_x)) (% imm0))        ;   [266]
    (sarq ($ 3) (% imm0))                   ;   [270]
    (imulq (% save2) (% imm0))              ;   [274]
    (leaq (@ (% save0) (% imm0)) (% temp0)) ;   [278]
    (movq (@ 11 (% arg_x)) (% arg_z))       ;   [282]
    (movq (% temp0) (% imm0))               ;   [286]
    (shrq (% imm0))                         ;   [289]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ;   [292]

;;; (aref mb k col)
    (movq (@ -16 (% rbp)) (% arg_z))        ;   [298]
    (movq (@ 43 (% arg_z)) (% imm0))        ;   [302]
    (sarq ($ 3) (% imm0))                   ;   [306]
    (imulq (% save0) (% imm0))              ;   [310]
    (leaq (@ (% save1) (% imm0)) (% temp0)) ;   [314]
    (movq (@ 11 (% arg_z)) (% arg_z))       ;   [318]
    (movq (% temp0) (% imm0))               ;   [322]
    (shrq (% imm0))                         ;   [325]
    (movss (@ -5 (% arg_z) (% imm0)) (% fp1)) ;   [328]

;;; (* (aref ma row k) (aref mb k col))
    (mulss (% fp1) (% fp0))                 ;   [334]

;;; (incf cell (* (aref ma row k) (aref mb k col)))
    (movd (% imm0) (% stack-temp))          ;   [338]
    (movq (@ (% gs) #x178) (% imm0))        ;   [342]
    (movss (% fp0) (@ 32 (% imm0)))         ;   [351]
    (movd (% stack-temp) (% imm0))          ;   [356]
    (movd (% imm0) (% stack-temp))          ;   [360]
    (movq (@ (% gs) #x178) (% imm0))        ;   [364]
    (movss (@ 32 (% imm0)) (% fp1))         ;   [373]
    (movd (% stack-temp) (% imm0))          ;   [378]
    (movd (% imm0) (% stack-temp))          ;   [382]
    (movq (@ (% gs) #x178) (% imm0))        ;   [386]
    (movss (@ 16 (% imm0)) (% fp0))         ;   [395]
    (movd (% stack-temp) (% imm0))          ;   [400]
    (addss (% fp1) (% fp0))                 ;   [404]
    (movd (% imm0) (% stack-temp))          ;   [408]
    (movq (@ (% gs) #x178) (% imm0))        ;   [412]
    (movss (% fp0) (@ 48 (% imm0)))         ;   [421]
    (movd (% stack-temp) (% imm0))          ;   [426]
    (movd (% imm0) (% stack-temp))          ;   [430]
    (movq (@ (% gs) #x178) (% imm0))        ;   [434]
    (movss (@ 48 (% imm0)) (% fp0))         ;   [443]
    (movd (% stack-temp) (% imm0))          ;   [448]
    (movd (% imm0) (% stack-temp))          ;   [452]
    (movq (@ (% gs) #x178) (% imm0))        ;   [456]
    (movss (% fp0) (@ 16 (% imm0)))         ;   [465]
    (movd (% stack-temp) (% imm0))          ;   [470]

;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col))))
    (addq ($ 8) (% save0))                  ;   [474]
L471
    (cmpq (% save3) (% save0))              ;   [478]
    (jl L255)                               ;   [481]

;;; (setf (aref result row col) cell)
    (movd (% imm0) (% stack-temp))          ;   [487]
    (movq (@ (% gs) #x178) (% imm0))        ;   [491]
    (movss (@ 16 (% imm0)) (% fp0))         ;   [500]
    (movd (% stack-temp) (% imm0))          ;   [505]
    (movq (% save1) (% arg_y))              ;   [509]
    (movq (% save2) (% arg_x))              ;   [512]
    (movq (@ -64 (% rbp)) (% temp0))        ;   [515]
    (movq (@ 43 (% temp0)) (% imm0))        ;   [519]
    (sarq ($ 3) (% imm0))                   ;   [523]
    (imulq (% arg_x) (% imm0))              ;   [527]
    (leaq (@ (% arg_y) (% imm0)) (% arg_y)) ;   [531]
    (movq (@ 11 (% temp0)) (% arg_x))       ;   [535]
    (movq (% arg_y) (% imm2))               ;   [539]
    (shrq (% imm2))                         ;   [542]
    (movss (% fp0) (@ -5 (% arg_x) (% imm2))) ;   [545]

;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime
    (addq ($ 8) (% save1))                  ;   [552]
L549
    (cmpq (% save3) (% save1))              ;   [556]
    (jl L189)                               ;   [559]

;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f
    (addq ($ 8) (% save2))                  ;   [565]
L562
    (movq (@ -56 (% rbp)) (% arg_z))        ;   [569]
    (cmpq (% arg_z) (% save2))              ;   [573]
    (jl L181)                               ;   [576]

;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d
    (movq (@ -64 (% rbp)) (% arg_z))        ;   [582]
    (addq ($ 16) (% rsp))                   ;   [586]
    (popq (% save3))                        ;   [590]
    (popq (% save2))                        ;   [592]
    (popq (% save1))                        ;   [594]
    (popq (% save0))                        ;   [596]
    (movq (@ (% gs) #x178) (% imm0))        ;   [598]
    (movq (@ 8 (% imm0)) (% stack-temp))    ;   [607]
    (movq (@ (% imm0)) (% imm0))            ;   [611]
    (movq (% imm0) (@ (% gs) 80))           ;   [614]
    (movq (% stack-temp) (@ (% gs) #x178))  ;   [623]
    (leaveq)                                ;   [632]
    (retq)                                  ;   [633]

row-major-gemm

arefが何となく遅そうなので、row-major-arefで書き直した。

(defun row-major-gemm (ma mb)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (declare (type (simple-array single-float (* *)) ma mb))
  (let ((rows (array-dimension ma 0))
        (cols (array-dimension mb 1)))
    (declare (type fixnum rows cols))
    (let ((result (make-matrix rows cols)))
      (declare (type (simple-array single-float (* *)) result))
      (dotimes (row rows)
        (dotimes (col cols)
          (let ((cell (aref result row col))
                (ma-index (array-row-major-index ma row 0))
                (mb-index (array-row-major-index mb 0 col)))
            (declare (type (single-float) cell))
            (declare (type fixnum ma-index mb-index))
            (dotimes (k cols)
              (incf cell (* (row-major-aref ma ma-index)
                            (row-major-aref mb mb-index)))
              (incf ma-index)
              (incf mb-index cols))
            (setf (aref result row col) cell))))
      result)))

SBCLの場合。 L2が最内ループで、indexの掛け算を消せた気がするけど、先頭アドレスのロードらしきやつが消せない。

CL-USER> (disassemble 'row-major-gemm)
; disassembly for ROW-MAJOR-GEMM
; Size: 266 bytes. Origin: #x1003296715
; 715:       4C8975E8         MOV [RBP-24], R14               ; no-arg-parsing entry point
; 719:       4C896DE0         MOV [RBP-32], R13
; 71D:       4D8B7D31         MOV R15, [R13+49]
; 721:       4D8B4E39         MOV R9, [R14+57]
; 725:       4C897DF8         MOV [RBP-8], R15
; 729:       4C894DF0         MOV [RBP-16], R9
; 72D:       488D5C24F0       LEA RBX, [RSP-16]
; 732:       4883EC18         SUB RSP, 24
; 736:       498BD7           MOV RDX, R15
; 739:       498BF9           MOV RDI, R9
; 73C:       488B057DFFFFFF   MOV RAX, [RIP-131]              ; #<FDEFINITION for MAKE-MATRIX>
; 743:       B904000000       MOV ECX, 4
; 748:       48892B           MOV [RBX], RBP
; 74B:       488BEB           MOV RBP, RBX
; 74E:       FF5009           CALL QWORD PTR [RAX+9]
; 751:       480F42E3         CMOVB RSP, RBX
; 755:       4C8B6DE0         MOV R13, [RBP-32]
; 759:       4C8B75E8         MOV R14, [RBP-24]
; 75D:       4C8B4DF0         MOV R9, [RBP-16]
; 761:       4C8B7DF8         MOV R15, [RBP-8]
; 765:       488BDA           MOV RBX, RDX
; 768:       31C9             XOR ECX, ECX
; 76A:       E99E000000       JMP L5
; 76F:       90               NOP
; 770: L0:   31C0             XOR EAX, EAX
; 772:       E989000000       JMP L4
; 777:       660F1F840000000000 NOP
; 780: L1:   488B5339         MOV RDX, [RBX+57]
; 784:       488BF1           MOV RSI, RCX
; 787:       48D1FE           SAR RSI, 1
; 78A:       480FAFF2         IMUL RSI, RDX
; 78E:       4801C6           ADD RSI, RAX
; 791:       488B5311         MOV RDX, [RBX+17]
; 795:       F30F104C7201     MOVSS XMM1, [RDX+RSI*2+1]
; 79B:       498B5539         MOV RDX, [R13+57]
; 79F:       488BF9           MOV RDI, RCX
; 7A2:       48D1FF           SAR RDI, 1
; 7A5:       480FAFFA         IMUL RDI, RDX
; 7A9:       4C8BC0           MOV R8, RAX
; 7AC:       31F6             XOR ESI, ESI
; 7AE:       EB2C             JMP L3
; 7B0: L2:   498B5511         MOV RDX, [R13+17]
; 7B4:       F30F10547A01     MOVSS XMM2, [RDX+RDI*2+1]
; 7BA:       498B5611         MOV RDX, [R14+17]
; 7BE:       F3420F105C4201   MOVSS XMM3, [RDX+R8*2+1]
; 7C5:       F30F59DA         MULSS XMM3, XMM2
; 7C9:       F30F58CB         ADDSS XMM1, XMM3
; 7CD:       4883C702         ADD RDI, 2
; 7D1:       4F8D1401         LEA R10, [R9+R8]
; 7D5:       4D8BC2           MOV R8, R10
; 7D8:       4883C602         ADD RSI, 2
; 7DC: L3:   4C39CE           CMP RSI, R9
; 7DF:       7CCF             JL L2
; 7E1:       488B5339         MOV RDX, [RBX+57]
; 7E5:       488BF1           MOV RSI, RCX
; 7E8:       48D1FE           SAR RSI, 1
; 7EB:       480FAFF2         IMUL RSI, RDX
; 7EF:       4801C6           ADD RSI, RAX
; 7F2:       488B5311         MOV RDX, [RBX+17]
; 7F6:       F30F114C7201     MOVSS [RDX+RSI*2+1], XMM1
; 7FC:       4883C002         ADD RAX, 2
; 800: L4:   4C39C8           CMP RAX, R9
; 803:       0F8C77FFFFFF     JL L1
; 809:       4883C102         ADD RCX, 2
; 80D: L5:   4C39F9           CMP RCX, R15
; 810:       0F8C5AFFFFFF     JL L0
; 816:       488BD3           MOV RDX, RBX
; 819:       488BE5           MOV RSP, RBP
; 81C:       F8               CLC
; 81D:       5D               POP RBP
; 81E:       C3               RET

一応実測する。

CL-USER> (setf *N* 256) (run #'simple-gemm 100) (run #'on-register-gemm 100) (run #'row-major-gemm 100)
Evaluation took:
  6.159 seconds of real time
  6.164000 seconds of total run time (6.164000 user, 0.000000 system)
  [ Run times consist of 0.004 seconds GC time, and 6.160 seconds non-GC time. ]
  100.08% CPU
  14,747,074,008 processor cycles
  26,281,520 bytes consed
  
Evaluation took:
  3.955 seconds of real time
  3.960000 seconds of total run time (3.960000 user, 0.000000 system)
  100.13% CPU
  9,469,087,053 processor cycles
  26,216,000 bytes consed
  
Evaluation took:
  3.003 seconds of real time
  3.008000 seconds of total run time (3.008000 user, 0.000000 system)
  [ Run times consist of 0.004 seconds GC time, and 3.004 seconds non-GC time. ]
  100.17% CPU
  7,192,309,782 processor cycles
  26,216,000 bytes consed

unroll無しでon-register-gemmよりは速くなってるけど、やっぱりCと同レベルなコードにはならなそう。

参考ページ