Common Lispはdisassembleも付いてきてチューニングできるらしいとゆー話をよく聞くので、逆アセンブルしてみた。 題材としてはこの辺を参考に
- Common Lispで高速行列演算 | κeenのHappy Hacκing Blog
- Python (NumPy) と Common Lisp (LLA) で行列積の実行速度を比較する - 不確定特異点
caching-gemm
がバグってたので厳密には同じコードではなくなっている。
処理系のバージョンは次の通り
- SBCL 1.3.2
- CCL 1.11
simple-gemm
単純なGEMM。
(defun simple-gemm (ma mb) (declare (optimize (speed 3) (debug 0) (safety 0))) (declare (type (simple-array single-float (* *)) ma mb)) (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols)) (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (dotimes (row rows) (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col)))))) result)))
SBCLの場合。 普通っぽい。
CL-USER> (disassemble 'simple-gemm) ; disassembly for SIMPLE-GEMM ; Size: 283 bytes. Origin: #x1004C5CA65 ; A65: 4C894DE8 MOV [RBP-24], R9 ; no-arg-parsing entry point ; A69: 4C8945E0 MOV [RBP-32], R8 ; A6D: 4D8B6831 MOV R13, [R8+49] ; A71: 4D8B5139 MOV R10, [R9+57] ; A75: 4C896DF8 MOV [RBP-8], R13 ; A79: 4C8955F0 MOV [RBP-16], R10 ; A7D: 488D5C24F0 LEA RBX, [RSP-16] ; A82: 4883EC18 SUB RSP, 24 ; A86: 498BD5 MOV RDX, R13 ; A89: 498BFA MOV RDI, R10 ; A8C: 488B057DFFFFFF MOV RAX, [RIP-131] ; #<FDEFINITION for MAKE-MATRIX> ; A93: B904000000 MOV ECX, 4 ; A98: 48892B MOV [RBX], RBP ; A9B: 488BEB MOV RBP, RBX ; A9E: FF5009 CALL QWORD PTR [RAX+9] ; AA1: 480F42E3 CMOVB RSP, RBX ; AA5: 4C8B45E0 MOV R8, [RBP-32] ; AA9: 4C8B4DE8 MOV R9, [RBP-24] ; AAD: 4C8B55F0 MOV R10, [RBP-16] ; AB1: 4C8B6DF8 MOV R13, [RBP-8] ; AB5: 488BF2 MOV RSI, RDX ; AB8: 31DB XOR EBX, EBX ; ABA: E9AF000000 JMP L5 ; ABF: 90 NOP ; AC0: L0: 31C0 XOR EAX, EAX ; AC2: E99A000000 JMP L4 ; AC7: 660F1F840000000000 NOP ; AD0: L1: 31C9 XOR ECX, ECX ; AD2: E981000000 JMP L3 ; AD7: 660F1F840000000000 NOP ; AE0: L2: 498B5039 MOV RDX, [R8+57] ; AE4: 488BFB MOV RDI, RBX ; AE7: 48D1FF SAR RDI, 1 ; AEA: 480FAFFA IMUL RDI, RDX ; AEE: 4801CF ADD RDI, RCX ; AF1: 498B5011 MOV RDX, [R8+17] ; AF5: F30F104C7A01 MOVSS XMM1, [RDX+RDI*2+1] ; AFB: 498B5139 MOV RDX, [R9+57] ; AFF: 488BF9 MOV RDI, RCX ; B02: 48D1FF SAR RDI, 1 ; B05: 480FAFFA IMUL RDI, RDX ; B09: 4801C7 ADD RDI, RAX ; B0C: 498B5111 MOV RDX, [R9+17] ; B10: F30F10547A01 MOVSS XMM2, [RDX+RDI*2+1] ; B16: F30F59D1 MULSS XMM2, XMM1 ; B1A: 488B5639 MOV RDX, [RSI+57] ; B1E: 488BFB MOV RDI, RBX ; B21: 48D1FF SAR RDI, 1 ; B24: 480FAFFA IMUL RDI, RDX ; B28: 4801C7 ADD RDI, RAX ; B2B: 488B5611 MOV RDX, [RSI+17] ; B2F: F30F104C7A01 MOVSS XMM1, [RDX+RDI*2+1] ; B35: F30F58CA ADDSS XMM1, XMM2 ; B39: 488B5639 MOV RDX, [RSI+57] ; B3D: 488BFB MOV RDI, RBX ; B40: 48D1FF SAR RDI, 1 ; B43: 480FAFFA IMUL RDI, RDX ; B47: 4801C7 ADD RDI, RAX ; B4A: 488B5611 MOV RDX, [RSI+17] ; B4E: F30F114C7A01 MOVSS [RDX+RDI*2+1], XMM1 ; B54: 4883C102 ADD RCX, 2 ; B58: L3: 4C39D1 CMP RCX, R10 ; B5B: 7C83 JL L2 ; B5D: 4883C002 ADD RAX, 2 ; B61: L4: 4C39D0 CMP RAX, R10 ; B64: 0F8C66FFFFFF JL L1 ; B6A: 4883C302 ADD RBX, 2 ; B6E: L5: 4C39EB CMP RBX, R13 ; B71: 0F8C49FFFFFF JL L0 ; B77: 488BD6 MOV RDX, RSI ; B7A: 488BE5 MOV RSP, RBP ; B7D: F8 CLC ; B7E: 5D POP RBP ; B7F: C3 RET
CCLの場合。
対応するLispコードがコメントに付いてて分かりやすい。
S式風謎アセンブラよく見るとよく分からないの色々あるな。
$
が即値で%
が変数かな?
CCLのドキュメントちゃんと読まないと無理そう
CL-USER> (disassemble 'caching-gemm) ;; "home:sandbox;common-lisp;simple-gemm.lisp.newest":1039-1729 (recover-fn-from-rip) ; [7] (pushq (% rbp)) ; [14] (movq (% rsp) (% rbp)) ; [15] (pushq (% arg_y)) ; [18] (pushq (% arg_z)) ; [19] (movq (@ (% gs) 80) (% stack-temp)) ; [20] (subq ($ 64) (@ (% gs) 80)) ; [29] (movq (@ (% gs) 80) (% imm0)) ; [39] (movq (% stack-temp) (@ (% imm0))) ; [48] (movq (@ (% gs) #x178) (% stack-temp)) ; [51] (movq (% stack-temp) (@ 8 (% imm0))) ; [60] (movq (% imm0) (@ (% gs) #x178)) ; [64] (pushq (% save0)) ; [73] (pushq (% save1)) ; [75] (pushq (% save2)) ; [77] (pushq (% save3)) ; [79] ;;; (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols)) (xorl (% arg_z.l) (% arg_z.l)) ; [81] (movl ($ 16) (% nargs)) ; [83] (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ; [88] (lisp-call (@ 10 (% temp0))) ; [97] (recover-fn-from-rip) ; [100] (movq (% arg_z) (% arg_x)) ; [107] (pushq (% arg_x)) ; [110] (movq (@ -16 (% rbp)) (% arg_y)) ; [112] (movl ($ 8) (% arg_z.l)) ; [116] (movl ($ 16) (% nargs)) ; [121] (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ; [126] (lisp-call (@ 10 (% temp0))) ; [137] (recover-fn-from-rip) ; [140] (movq (% arg_z) (% save3)) ; [147] ;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d (movq (@ -56 (% rbp)) (% arg_y)) ; [150] (movq (% save3) (% arg_z)) ; [154] (movl ($ 16) (% nargs)) ; [157] (movq (@ 'MAKE-MATRIX (% fn)) (% temp0)) ; [162] (lisp-call (@ 10 (% temp0))) ; [169] (recover-fn-from-rip) ; [172] (pushq (% arg_z)) ; [179] ;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f (xorl (% save2.l) (% save2.l)) ; [180] (jmpq L562) ; [183] ;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime L181 (xorl (% save1.l) (% save1.l)) ; [188] (jmpq L549) ; [191] ;;; (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotimes (k cols) (incf cel L189 (movq (@ -64 (% rbp)) (% arg_x)) ; [196] (movq (@ 43 (% arg_x)) (% imm0)) ; [200] (sarq ($ 3) (% imm0)) ; [204] (imulq (% save2) (% imm0)) ; [208] (leaq (@ (% save1) (% imm0)) (% temp0)) ; [212] (movq (@ 11 (% arg_x)) (% arg_z)) ; [216] (movq (% temp0) (% imm0)) ; [220] (shrq (% imm0)) ; [223] (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ; [226] (movd (% imm0) (% stack-temp)) ; [232] (movq (@ (% gs) #x178) (% imm0)) ; [236] (movss (% fp0) (@ 16 (% imm0))) ; [245] (movd (% stack-temp) (% imm0)) ; [250] ;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col)))) (xorl (% save0.l) (% save0.l)) ; [254] (jmpq L471) ; [257] ;;; (aref ma row k) L255 (movq (@ -8 (% rbp)) (% arg_x)) ; [262] (movq (@ 43 (% arg_x)) (% imm0)) ; [266] (sarq ($ 3) (% imm0)) ; [270] (imulq (% save2) (% imm0)) ; [274] (leaq (@ (% save0) (% imm0)) (% temp0)) ; [278] (movq (@ 11 (% arg_x)) (% arg_z)) ; [282] (movq (% temp0) (% imm0)) ; [286] (shrq (% imm0)) ; [289] (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ; [292] ;;; (aref mb k col) (movq (@ -16 (% rbp)) (% arg_z)) ; [298] (movq (@ 43 (% arg_z)) (% imm0)) ; [302] (sarq ($ 3) (% imm0)) ; [306] (imulq (% save0) (% imm0)) ; [310] (leaq (@ (% save1) (% imm0)) (% temp0)) ; [314] (movq (@ 11 (% arg_z)) (% arg_z)) ; [318] (movq (% temp0) (% imm0)) ; [322] (shrq (% imm0)) ; [325] (movss (@ -5 (% arg_z) (% imm0)) (% fp1)) ; [328] ;;; (* (aref ma row k) (aref mb k col)) (mulss (% fp1) (% fp0)) ; [334] ;;; (incf cell (* (aref ma row k) (aref mb k col))) (movd (% imm0) (% stack-temp)) ; [338] (movq (@ (% gs) #x178) (% imm0)) ; [342] (movss (% fp0) (@ 32 (% imm0))) ; [351] (movd (% stack-temp) (% imm0)) ; [356] (movd (% imm0) (% stack-temp)) ; [360] (movq (@ (% gs) #x178) (% imm0)) ; [364] (movss (@ 32 (% imm0)) (% fp1)) ; [373] (movd (% stack-temp) (% imm0)) ; [378] (movd (% imm0) (% stack-temp)) ; [382] (movq (@ (% gs) #x178) (% imm0)) ; [386] (movss (@ 16 (% imm0)) (% fp0)) ; [395] (movd (% stack-temp) (% imm0)) ; [400] (addss (% fp1) (% fp0)) ; [404] (movd (% imm0) (% stack-temp)) ; [408] (movq (@ (% gs) #x178) (% imm0)) ; [412] (movss (% fp0) (@ 48 (% imm0))) ; [421] (movd (% stack-temp) (% imm0)) ; [426] (movd (% imm0) (% stack-temp)) ; [430] (movq (@ (% gs) #x178) (% imm0)) ; [434] (movss (@ 48 (% imm0)) (% fp0)) ; [443] (movd (% stack-temp) (% imm0)) ; [448] (movd (% imm0) (% stack-temp)) ; [452] (movq (@ (% gs) #x178) (% imm0)) ; [456] (movss (% fp0) (@ 16 (% imm0))) ; [465] (movd (% stack-temp) (% imm0)) ; [470] ;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col)))) (addq ($ 8) (% save0)) ; [474] L471 (cmpq (% save3) (% save0)) ; [478] (jl L255) ; [481] ;;; (setf (aref result row col) cell) (movd (% imm0) (% stack-temp)) ; [487] (movq (@ (% gs) #x178) (% imm0)) ; [491] (movss (@ 16 (% imm0)) (% fp0)) ; [500] (movd (% stack-temp) (% imm0)) ; [505] (movq (% save1) (% arg_y)) ; [509] (movq (% save2) (% arg_x)) ; [512] (movq (@ -64 (% rbp)) (% temp0)) ; [515] (movq (@ 43 (% temp0)) (% imm0)) ; [519] (sarq ($ 3) (% imm0)) ; [523] (imulq (% arg_x) (% imm0)) ; [527] (leaq (@ (% arg_y) (% imm0)) (% arg_y)) ; [531] (movq (@ 11 (% temp0)) (% arg_x)) ; [535] (movq (% arg_y) (% imm2)) ; [539] (shrq (% imm2)) ; [542] (movss (% fp0) (@ -5 (% arg_x) (% imm2))) ; [545] ;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime (addq ($ 8) (% save1)) ; [552] L549 (cmpq (% save3) (% save1)) ; [556] (jl L189) ; [559] ;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f (addq ($ 8) (% save2)) ; [565] L562 (movq (@ -56 (% rbp)) (% arg_z)) ; [569] (cmpq (% arg_z) (% save2)) ; [573] (jl L181) ; [576] ;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d (movq (@ -64 (% rbp)) (% arg_z)) ; [582] (addq ($ 16) (% rsp)) ; [586] (popq (% save3)) ; [590] (popq (% save2)) ; [592] (popq (% save1)) ; [594] (popq (% save0)) ; [596] (movq (@ (% gs) #x178) (% imm0)) ; [598] (movq (@ 8 (% imm0)) (% stack-temp)) ; [607] (movq (@ (% imm0)) (% imm0)) ; [611] (movq (% imm0) (@ (% gs) 80)) ; [614] (movq (% stack-temp) (@ (% gs) #x178)) ; [623] (leaveq) ; [632] (retq) ; [633] CL-USER> (disassemble 'simple-gemm) ;; "home:sandbox;common-lisp;simple-gemm.lisp.newest":479-1037 (recover-fn-from-rip) ; [7] (pushq (% rbp)) ; [14] (movq (% rsp) (% rbp)) ; [15] (pushq (% arg_y)) ; [18] (pushq (% arg_z)) ; [19] (movq (@ (% gs) 80) (% stack-temp)) ; [20] (subq ($ 48) (@ (% gs) 80)) ; [29] (movq (@ (% gs) 80) (% imm0)) ; [39] (movq (% stack-temp) (@ (% imm0))) ; [48] (movq (@ (% gs) #x178) (% stack-temp)) ; [51] (movq (% stack-temp) (@ 8 (% imm0))) ; [60] (movq (% imm0) (@ (% gs) #x178)) ; [64] (pushq (% save0)) ; [73] (pushq (% save1)) ; [75] (pushq (% save2)) ; [77] (pushq (% save3)) ; [79] ;;; (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols)) (xorl (% arg_z.l) (% arg_z.l)) ; [81] (movl ($ 16) (% nargs)) ; [83] (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ; [88] (lisp-call (@ 10 (% temp0))) ; [97] (recover-fn-from-rip) ; [100] (movq (% arg_z) (% arg_x)) ; [107] (pushq (% arg_x)) ; [110] (movq (@ -16 (% rbp)) (% arg_y)) ; [112] (movl ($ 8) (% arg_z.l)) ; [116] (movl ($ 16) (% nargs)) ; [121] (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ; [126] (lisp-call (@ 10 (% temp0))) ; [137] (recover-fn-from-rip) ; [140] (pushq (% arg_z)) ; [147] ;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d (movq (@ -56 (% rbp)) (% arg_y)) ; [148] (movl ($ 16) (% nargs)) ; [152] (movq (@ 'MAKE-MATRIX (% fn)) (% temp0)) ; [157] (lisp-call (@ 10 (% temp0))) ; [169] (recover-fn-from-rip) ; [172] (movq (% arg_z) (% save3)) ; [179] ;;; (dotimes (row rows) (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma ro (xorl (% save2.l) (% save2.l)) ; [182] (jmpq L479) ; [185] ;;; (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col) L183 (xorl (% save1.l) (% save1.l)) ; [190] (jmpq L462) ; [193] ;;; (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col)))) L191 (xorl (% save0.l) (% save0.l)) ; [198] (jmpq L445) ; [201] ;;; (aref ma row k) L199 (movq (@ -8 (% rbp)) (% arg_x)) ; [206] (movq (@ 43 (% arg_x)) (% imm0)) ; [210] (sarq ($ 3) (% imm0)) ; [214] (imulq (% save2) (% imm0)) ; [218] (leaq (@ (% save0) (% imm0)) (% temp0)) ; [222] (movq (@ 11 (% arg_x)) (% arg_z)) ; [226] (movq (% temp0) (% imm0)) ; [230] (shrq (% imm0)) ; [233] (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ; [236] ;;; (aref mb k col) (movq (@ -16 (% rbp)) (% arg_z)) ; [242] (movq (@ 43 (% arg_z)) (% imm0)) ; [246] (sarq ($ 3) (% imm0)) ; [250] (imulq (% save0) (% imm0)) ; [254] (leaq (@ (% save1) (% imm0)) (% temp0)) ; [258] (movq (@ 11 (% arg_z)) (% arg_z)) ; [262] (movq (% temp0) (% imm0)) ; [266] (shrq (% imm0)) ; [269] (movss (@ -5 (% arg_z) (% imm0)) (% fp1)) ; [272] ;;; (* (aref ma row k) (aref mb k col)) (mulss (% fp1) (% fp0)) ; [278] ;;; (incf (aref result row col) (* (aref ma row k) (aref mb k col))) (movd (% imm0) (% stack-temp)) ; [282] (movq (@ (% gs) #x178) (% imm0)) ; [286] (movss (% fp0) (@ 16 (% imm0))) ; [295] (movd (% stack-temp) (% imm0)) ; [300] (movq (@ 43 (% save3)) (% imm0)) ; [304] (sarq ($ 3) (% imm0)) ; [308] (imulq (% save2) (% imm0)) ; [312] (leaq (@ (% save1) (% imm0)) (% temp0)) ; [316] (movq (@ 11 (% save3)) (% arg_z)) ; [320] (movq (% temp0) (% imm0)) ; [324] (shrq (% imm0)) ; [327] (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ; [330] (movd (% imm0) (% stack-temp)) ; [336] (movq (@ (% gs) #x178) (% imm0)) ; [340] (movss (@ 16 (% imm0)) (% fp1)) ; [349] (movd (% stack-temp) (% imm0)) ; [354] (addss (% fp1) (% fp0)) ; [358] (movd (% imm0) (% stack-temp)) ; [362] (movq (@ (% gs) #x178) (% imm0)) ; [366] (movss (% fp0) (@ 32 (% imm0))) ; [375] (movd (% stack-temp) (% imm0)) ; [380] (movd (% imm0) (% stack-temp)) ; [384] (movq (@ (% gs) #x178) (% imm0)) ; [388] (movss (@ 32 (% imm0)) (% fp0)) ; [397] (movd (% stack-temp) (% imm0)) ; [402] (movq (% save1) (% arg_y)) ; [406] (movq (% save2) (% arg_x)) ; [409] (movq (% save3) (% temp0)) ; [412] (movq (@ 43 (% temp0)) (% imm0)) ; [415] (sarq ($ 3) (% imm0)) ; [419] (imulq (% arg_x) (% imm0)) ; [423] (leaq (@ (% arg_y) (% imm0)) (% arg_y)) ; [427] (movq (@ 11 (% temp0)) (% arg_x)) ; [431] (movq (% arg_y) (% imm2)) ; [435] (shrq (% imm2)) ; [438] (movss (% fp0) (@ -5 (% arg_x) (% imm2))) ; [441] ;;; (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col)))) (addq ($ 8) (% save0)) ; [448] L445 (movq (@ -64 (% rbp)) (% arg_z)) ; [452] (cmpq (% arg_z) (% save0)) ; [456] (jl L199) ; [459] ;;; (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma row k) (aref mb k col) (addq ($ 8) (% save1)) ; [465] L462 (movq (@ -64 (% rbp)) (% arg_z)) ; [469] (cmpq (% arg_z) (% save1)) ; [473] (jl L191) ; [476] ;;; (dotimes (row rows) (dotimes (col cols) (dotimes (k cols) (incf (aref result row col) (* (aref ma ro (addq ($ 8) (% save2)) ; [482] L479 (movq (@ -56 (% rbp)) (% arg_z)) ; [486] (cmpq (% arg_z) (% save2)) ; [490] (jl L183) ; [493] ;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d (movq (% save3) (% arg_z)) ; [499] (addq ($ 16) (% rsp)) ; [502] (popq (% save3)) ; [506] (popq (% save2)) ; [508] (popq (% save1)) ; [510] (popq (% save0)) ; [512] (movq (@ (% gs) #x178) (% imm0)) ; [514] (movq (@ 8 (% imm0)) (% stack-temp)) ; [523] (movq (@ (% imm0)) (% imm0)) ; [527] (movq (% imm0) (@ (% gs) 80)) ; [530] (movq (% stack-temp) (@ (% gs) #x178)) ; [539] (leaveq) ; [548] (retq) ; [549]
これアセンブラ部分が意図的にコメントになってないけど、このままREPLに入れると動くんだろうか?
CL-USER> (recover-fn-from-rip) ; Evaluation aborted on #<CCL::UNDEFINED-FUNCTION-CALL #x3020018BF73D>.
ダメらしい。
caching-gemm
中間結果を変数にキャッシュするGEMM。
(defun caching-gemm (ma mb) (declare (optimize (speed 3) (debug 0) (safety 0))) (declare (type (simple-array single-float (* *)) ma mb)) (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols)) (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col)))) (setf (aref result row col) cell)))) result)))
SBCLの場合。
L2
が多分最内ループで、SAR
してるのなんだ?
後、幅とか先頭アドレスかを毎回ロードして計算しなおしてる?
CL-USER> (disassemble 'caching-gemm) ; disassembly for CACHING-GEMM ; Size: 272 bytes. Origin: #x1006D8D8F5 ; 8F5: 4C894DE8 MOV [RBP-24], R9 ; no-arg-parsing entry point ; 8F9: 4C8945E0 MOV [RBP-32], R8 ; 8FD: 4D8B6831 MOV R13, [R8+49] ; 901: 4D8B5139 MOV R10, [R9+57] ; 905: 4C896DF8 MOV [RBP-8], R13 ; 909: 4C8955F0 MOV [RBP-16], R10 ; 90D: 488D5C24F0 LEA RBX, [RSP-16] ; 912: 4883EC18 SUB RSP, 24 ; 916: 498BD5 MOV RDX, R13 ; 919: 498BFA MOV RDI, R10 ; 91C: 488B057DFFFFFF MOV RAX, [RIP-131] ; #<FDEFINITION for MAKE-MATRIX> ; 923: B904000000 MOV ECX, 4 ; 928: 48892B MOV [RBX], RBP ; 92B: 488BEB MOV RBP, RBX ; 92E: FF5009 CALL QWORD PTR [RAX+9] ; 931: 480F42E3 CMOVB RSP, RBX ; 935: 4C8B45E0 MOV R8, [RBP-32] ; 939: 4C8B4DE8 MOV R9, [RBP-24] ; 93D: 4C8B55F0 MOV R10, [RBP-16] ; 941: 4C8B6DF8 MOV R13, [RBP-8] ; 945: 488BF2 MOV RSI, RDX ; 948: 31DB XOR EBX, EBX ; 94A: E9A4000000 JMP L5 ; 94F: 90 NOP ; 950: L0: 31C0 XOR EAX, EAX ; 952: E98F000000 JMP L4 ; 957: 660F1F840000000000 NOP ; 960: L1: 488B4E39 MOV RCX, [RSI+57] ; 964: 488BFB MOV RDI, RBX ; 967: 48D1FF SAR RDI, 1 ; 96A: 480FAFF9 IMUL RDI, RCX ; 96E: 4801C7 ADD RDI, RAX ; 971: 488B4E11 MOV RCX, [RSI+17] ; 975: F30F104C7901 MOVSS XMM1, [RCX+RDI*2+1] ; 97B: 31C9 XOR ECX, ECX ; 97D: EB43 JMP L3 ; 97F: 90 NOP ; 980: L2: 498B5039 MOV RDX, [R8+57] ; 984: 488BFB MOV RDI, RBX ; 987: 48D1FF SAR RDI, 1 ; 98A: 480FAFFA IMUL RDI, RDX ; 98E: 4801CF ADD RDI, RCX ; 991: 498B5011 MOV RDX, [R8+17] ; 995: F30F10547A01 MOVSS XMM2, [RDX+RDI*2+1] ; 99B: 498B5139 MOV RDX, [R9+57] ; 99F: 488BF9 MOV RDI, RCX ; 9A2: 48D1FF SAR RDI, 1 ; 9A5: 480FAFFA IMUL RDI, RDX ; 9A9: 4801C7 ADD RDI, RAX ; 9AC: 498B5111 MOV RDX, [R9+17] ; 9B0: F30F105C7A01 MOVSS XMM3, [RDX+RDI*2+1] ; 9B6: F30F59DA MULSS XMM3, XMM2 ; 9BA: F30F58CB ADDSS XMM1, XMM3 ; 9BE: 4883C102 ADD RCX, 2 ; 9C2: L3: 4C39D1 CMP RCX, R10 ; 9C5: 7CB9 JL L2 ; 9C7: 488B4E39 MOV RCX, [RSI+57] ; 9CB: 488BFB MOV RDI, RBX ; 9CE: 48D1FF SAR RDI, 1 ; 9D1: 480FAFF9 IMUL RDI, RCX ; 9D5: 4801C7 ADD RDI, RAX ; 9D8: 488B4E11 MOV RCX, [RSI+17] ; 9DC: F30F114C7901 MOVSS [RCX+RDI*2+1], XMM1 ; 9E2: 4883C002 ADD RAX, 2 ; 9E6: L4: 4C39D0 CMP RAX, R10 ; 9E9: 0F8C71FFFFFF JL L1 ; 9EF: 4883C302 ADD RBX, 2 ; 9F3: L5: 4C39EB CMP RBX, R13 ; 9F6: 0F8C54FFFFFF JL L0 ; 9FC: 488BD6 MOV RDX, RSI ; 9FF: 488BE5 MOV RSP, RBP ; A02: F8 CLC ; A03: 5D POP RBP ; A04: C3 RET
CCLの場合。
L255
あたりが最内ループ。
途中にlisp-call
とか入ってたり、stack-temp
が多分レジスタスピルだと思う。
CL-USER> (disassemble 'caching-gemm) ;; "home:sandbox;common-lisp;simple-gemm.lisp.newest":1039-1729 (recover-fn-from-rip) ; [7] (pushq (% rbp)) ; [14] (movq (% rsp) (% rbp)) ; [15] (pushq (% arg_y)) ; [18] (pushq (% arg_z)) ; [19] (movq (@ (% gs) 80) (% stack-temp)) ; [20] (subq ($ 64) (@ (% gs) 80)) ; [29] (movq (@ (% gs) 80) (% imm0)) ; [39] (movq (% stack-temp) (@ (% imm0))) ; [48] (movq (@ (% gs) #x178) (% stack-temp)) ; [51] (movq (% stack-temp) (@ 8 (% imm0))) ; [60] (movq (% imm0) (@ (% gs) #x178)) ; [64] (pushq (% save0)) ; [73] (pushq (% save1)) ; [75] (pushq (% save2)) ; [77] (pushq (% save3)) ; [79] ;;; (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols)) (xorl (% arg_z.l) (% arg_z.l)) ; [81] (movl ($ 16) (% nargs)) ; [83] (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ; [88] (lisp-call (@ 10 (% temp0))) ; [97] (recover-fn-from-rip) ; [100] (movq (% arg_z) (% arg_x)) ; [107] (pushq (% arg_x)) ; [110] (movq (@ -16 (% rbp)) (% arg_y)) ; [112] (movl ($ 8) (% arg_z.l)) ; [116] (movl ($ 16) (% nargs)) ; [121] (movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0)) ; [126] (lisp-call (@ 10 (% temp0))) ; [137] (recover-fn-from-rip) ; [140] (movq (% arg_z) (% save3)) ; [147] ;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d (movq (@ -56 (% rbp)) (% arg_y)) ; [150] (movq (% save3) (% arg_z)) ; [154] (movl ($ 16) (% nargs)) ; [157] (movq (@ 'MAKE-MATRIX (% fn)) (% temp0)) ; [162] (lisp-call (@ 10 (% temp0))) ; [169] (recover-fn-from-rip) ; [172] (pushq (% arg_z)) ; [179] ;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f (xorl (% save2.l) (% save2.l)) ; [180] (jmpq L562) ; [183] ;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime L181 (xorl (% save1.l) (% save1.l)) ; [188] (jmpq L549) ; [191] ;;; (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotimes (k cols) (incf cel L189 (movq (@ -64 (% rbp)) (% arg_x)) ; [196] (movq (@ 43 (% arg_x)) (% imm0)) ; [200] (sarq ($ 3) (% imm0)) ; [204] (imulq (% save2) (% imm0)) ; [208] (leaq (@ (% save1) (% imm0)) (% temp0)) ; [212] (movq (@ 11 (% arg_x)) (% arg_z)) ; [216] (movq (% temp0) (% imm0)) ; [220] (shrq (% imm0)) ; [223] (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ; [226] (movd (% imm0) (% stack-temp)) ; [232] (movq (@ (% gs) #x178) (% imm0)) ; [236] (movss (% fp0) (@ 16 (% imm0))) ; [245] (movd (% stack-temp) (% imm0)) ; [250] ;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col)))) (xorl (% save0.l) (% save0.l)) ; [254] (jmpq L471) ; [257] ;;; (aref ma row k) L255 (movq (@ -8 (% rbp)) (% arg_x)) ; [262] (movq (@ 43 (% arg_x)) (% imm0)) ; [266] (sarq ($ 3) (% imm0)) ; [270] (imulq (% save2) (% imm0)) ; [274] (leaq (@ (% save0) (% imm0)) (% temp0)) ; [278] (movq (@ 11 (% arg_x)) (% arg_z)) ; [282] (movq (% temp0) (% imm0)) ; [286] (shrq (% imm0)) ; [289] (movss (@ -5 (% arg_z) (% imm0)) (% fp0)) ; [292] ;;; (aref mb k col) (movq (@ -16 (% rbp)) (% arg_z)) ; [298] (movq (@ 43 (% arg_z)) (% imm0)) ; [302] (sarq ($ 3) (% imm0)) ; [306] (imulq (% save0) (% imm0)) ; [310] (leaq (@ (% save1) (% imm0)) (% temp0)) ; [314] (movq (@ 11 (% arg_z)) (% arg_z)) ; [318] (movq (% temp0) (% imm0)) ; [322] (shrq (% imm0)) ; [325] (movss (@ -5 (% arg_z) (% imm0)) (% fp1)) ; [328] ;;; (* (aref ma row k) (aref mb k col)) (mulss (% fp1) (% fp0)) ; [334] ;;; (incf cell (* (aref ma row k) (aref mb k col))) (movd (% imm0) (% stack-temp)) ; [338] (movq (@ (% gs) #x178) (% imm0)) ; [342] (movss (% fp0) (@ 32 (% imm0))) ; [351] (movd (% stack-temp) (% imm0)) ; [356] (movd (% imm0) (% stack-temp)) ; [360] (movq (@ (% gs) #x178) (% imm0)) ; [364] (movss (@ 32 (% imm0)) (% fp1)) ; [373] (movd (% stack-temp) (% imm0)) ; [378] (movd (% imm0) (% stack-temp)) ; [382] (movq (@ (% gs) #x178) (% imm0)) ; [386] (movss (@ 16 (% imm0)) (% fp0)) ; [395] (movd (% stack-temp) (% imm0)) ; [400] (addss (% fp1) (% fp0)) ; [404] (movd (% imm0) (% stack-temp)) ; [408] (movq (@ (% gs) #x178) (% imm0)) ; [412] (movss (% fp0) (@ 48 (% imm0))) ; [421] (movd (% stack-temp) (% imm0)) ; [426] (movd (% imm0) (% stack-temp)) ; [430] (movq (@ (% gs) #x178) (% imm0)) ; [434] (movss (@ 48 (% imm0)) (% fp0)) ; [443] (movd (% stack-temp) (% imm0)) ; [448] (movd (% imm0) (% stack-temp)) ; [452] (movq (@ (% gs) #x178) (% imm0)) ; [456] (movss (% fp0) (@ 16 (% imm0))) ; [465] (movd (% stack-temp) (% imm0)) ; [470] ;;; (dotimes (k cols) (incf cell (* (aref ma row k) (aref mb k col)))) (addq ($ 8) (% save0)) ; [474] L471 (cmpq (% save3) (% save0)) ; [478] (jl L255) ; [481] ;;; (setf (aref result row col) cell) (movd (% imm0) (% stack-temp)) ; [487] (movq (@ (% gs) #x178) (% imm0)) ; [491] (movss (@ 16 (% imm0)) (% fp0)) ; [500] (movd (% stack-temp) (% imm0)) ; [505] (movq (% save1) (% arg_y)) ; [509] (movq (% save2) (% arg_x)) ; [512] (movq (@ -64 (% rbp)) (% temp0)) ; [515] (movq (@ 43 (% temp0)) (% imm0)) ; [519] (sarq ($ 3) (% imm0)) ; [523] (imulq (% arg_x) (% imm0)) ; [527] (leaq (@ (% arg_y) (% imm0)) (% arg_y)) ; [531] (movq (@ 11 (% temp0)) (% arg_x)) ; [535] (movq (% arg_y) (% imm2)) ; [539] (shrq (% imm2)) ; [542] (movss (% fp0) (@ -5 (% arg_x) (% imm2))) ; [545] ;;; (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-float) cell)) (dotime (addq ($ 8) (% save1)) ; [552] L549 (cmpq (% save3) (% save1)) ; [556] (jl L189) ; [559] ;;; (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col))) (declare (type (single-f (addq ($ 8) (% save2)) ; [565] L562 (movq (@ -56 (% rbp)) (% arg_z)) ; [569] (cmpq (% arg_z) (% save2)) ; [573] (jl L181) ; [576] ;;; (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (d (movq (@ -64 (% rbp)) (% arg_z)) ; [582] (addq ($ 16) (% rsp)) ; [586] (popq (% save3)) ; [590] (popq (% save2)) ; [592] (popq (% save1)) ; [594] (popq (% save0)) ; [596] (movq (@ (% gs) #x178) (% imm0)) ; [598] (movq (@ 8 (% imm0)) (% stack-temp)) ; [607] (movq (@ (% imm0)) (% imm0)) ; [611] (movq (% imm0) (@ (% gs) 80)) ; [614] (movq (% stack-temp) (@ (% gs) #x178)) ; [623] (leaveq) ; [632] (retq) ; [633]
row-major-gemm
aref
が何となく遅そうなので、row-major-aref
で書き直した。
(defun row-major-gemm (ma mb) (declare (optimize (speed 3) (debug 0) (safety 0))) (declare (type (simple-array single-float (* *)) ma mb)) (let ((rows (array-dimension ma 0)) (cols (array-dimension mb 1))) (declare (type fixnum rows cols)) (let ((result (make-matrix rows cols))) (declare (type (simple-array single-float (* *)) result)) (dotimes (row rows) (dotimes (col cols) (let ((cell (aref result row col)) (ma-index (array-row-major-index ma row 0)) (mb-index (array-row-major-index mb 0 col))) (declare (type (single-float) cell)) (declare (type fixnum ma-index mb-index)) (dotimes (k cols) (incf cell (* (row-major-aref ma ma-index) (row-major-aref mb mb-index))) (incf ma-index) (incf mb-index cols)) (setf (aref result row col) cell)))) result)))
SBCLの場合。
L2
が最内ループで、indexの掛け算を消せた気がするけど、先頭アドレスのロードらしきやつが消せない。
CL-USER> (disassemble 'row-major-gemm) ; disassembly for ROW-MAJOR-GEMM ; Size: 266 bytes. Origin: #x1003296715 ; 715: 4C8975E8 MOV [RBP-24], R14 ; no-arg-parsing entry point ; 719: 4C896DE0 MOV [RBP-32], R13 ; 71D: 4D8B7D31 MOV R15, [R13+49] ; 721: 4D8B4E39 MOV R9, [R14+57] ; 725: 4C897DF8 MOV [RBP-8], R15 ; 729: 4C894DF0 MOV [RBP-16], R9 ; 72D: 488D5C24F0 LEA RBX, [RSP-16] ; 732: 4883EC18 SUB RSP, 24 ; 736: 498BD7 MOV RDX, R15 ; 739: 498BF9 MOV RDI, R9 ; 73C: 488B057DFFFFFF MOV RAX, [RIP-131] ; #<FDEFINITION for MAKE-MATRIX> ; 743: B904000000 MOV ECX, 4 ; 748: 48892B MOV [RBX], RBP ; 74B: 488BEB MOV RBP, RBX ; 74E: FF5009 CALL QWORD PTR [RAX+9] ; 751: 480F42E3 CMOVB RSP, RBX ; 755: 4C8B6DE0 MOV R13, [RBP-32] ; 759: 4C8B75E8 MOV R14, [RBP-24] ; 75D: 4C8B4DF0 MOV R9, [RBP-16] ; 761: 4C8B7DF8 MOV R15, [RBP-8] ; 765: 488BDA MOV RBX, RDX ; 768: 31C9 XOR ECX, ECX ; 76A: E99E000000 JMP L5 ; 76F: 90 NOP ; 770: L0: 31C0 XOR EAX, EAX ; 772: E989000000 JMP L4 ; 777: 660F1F840000000000 NOP ; 780: L1: 488B5339 MOV RDX, [RBX+57] ; 784: 488BF1 MOV RSI, RCX ; 787: 48D1FE SAR RSI, 1 ; 78A: 480FAFF2 IMUL RSI, RDX ; 78E: 4801C6 ADD RSI, RAX ; 791: 488B5311 MOV RDX, [RBX+17] ; 795: F30F104C7201 MOVSS XMM1, [RDX+RSI*2+1] ; 79B: 498B5539 MOV RDX, [R13+57] ; 79F: 488BF9 MOV RDI, RCX ; 7A2: 48D1FF SAR RDI, 1 ; 7A5: 480FAFFA IMUL RDI, RDX ; 7A9: 4C8BC0 MOV R8, RAX ; 7AC: 31F6 XOR ESI, ESI ; 7AE: EB2C JMP L3 ; 7B0: L2: 498B5511 MOV RDX, [R13+17] ; 7B4: F30F10547A01 MOVSS XMM2, [RDX+RDI*2+1] ; 7BA: 498B5611 MOV RDX, [R14+17] ; 7BE: F3420F105C4201 MOVSS XMM3, [RDX+R8*2+1] ; 7C5: F30F59DA MULSS XMM3, XMM2 ; 7C9: F30F58CB ADDSS XMM1, XMM3 ; 7CD: 4883C702 ADD RDI, 2 ; 7D1: 4F8D1401 LEA R10, [R9+R8] ; 7D5: 4D8BC2 MOV R8, R10 ; 7D8: 4883C602 ADD RSI, 2 ; 7DC: L3: 4C39CE CMP RSI, R9 ; 7DF: 7CCF JL L2 ; 7E1: 488B5339 MOV RDX, [RBX+57] ; 7E5: 488BF1 MOV RSI, RCX ; 7E8: 48D1FE SAR RSI, 1 ; 7EB: 480FAFF2 IMUL RSI, RDX ; 7EF: 4801C6 ADD RSI, RAX ; 7F2: 488B5311 MOV RDX, [RBX+17] ; 7F6: F30F114C7201 MOVSS [RDX+RSI*2+1], XMM1 ; 7FC: 4883C002 ADD RAX, 2 ; 800: L4: 4C39C8 CMP RAX, R9 ; 803: 0F8C77FFFFFF JL L1 ; 809: 4883C102 ADD RCX, 2 ; 80D: L5: 4C39F9 CMP RCX, R15 ; 810: 0F8C5AFFFFFF JL L0 ; 816: 488BD3 MOV RDX, RBX ; 819: 488BE5 MOV RSP, RBP ; 81C: F8 CLC ; 81D: 5D POP RBP ; 81E: C3 RET
一応実測する。
CL-USER> (setf *N* 256) (run #'simple-gemm 100) (run #'on-register-gemm 100) (run #'row-major-gemm 100) Evaluation took: 6.159 seconds of real time 6.164000 seconds of total run time (6.164000 user, 0.000000 system) [ Run times consist of 0.004 seconds GC time, and 6.160 seconds non-GC time. ] 100.08% CPU 14,747,074,008 processor cycles 26,281,520 bytes consed Evaluation took: 3.955 seconds of real time 3.960000 seconds of total run time (3.960000 user, 0.000000 system) 100.13% CPU 9,469,087,053 processor cycles 26,216,000 bytes consed Evaluation took: 3.003 seconds of real time 3.008000 seconds of total run time (3.008000 user, 0.000000 system) [ Run times consist of 0.004 seconds GC time, and 3.004 seconds non-GC time. ] 100.17% CPU 7,192,309,782 processor cycles 26,216,000 bytes consed
unroll無しでon-register-gemm
よりは速くなってるけど、やっぱりCと同レベルなコードにはならなそう。