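I keep hearing that Common Lisp ships with disassemble and can therefore be tuned at the machine-code level, so I tried disassembling some code.
The test code is based on the reference material. Its caching-gemm was buggy, so strictly speaking the code here is no longer identical to it.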
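As a minimal sketch of the workflow used throughout this post: disassemble takes a function name and prints the listing to standard output, so it can also be captured as a string. The names add-floats and disassembly-string are hypothetical helpers made up for this illustration, not part of the original code.

```lisp
;; Hypothetical example function (not from the article's code).
(defun add-floats (a b)
  (declare (optimize (speed 3) (safety 0))
           (type single-float a b))
  (+ a b))

;; DISASSEMBLE prints to *STANDARD-OUTPUT*, so rebinding that stream
;; lets us capture the listing as a string instead of reading it at the REPL.
(defun disassembly-string (name)
  (with-output-to-string (*standard-output*)
    (disassemble name)))

;; Usage at the REPL:
;; (disassemble 'add-floats)
;; (disassembly-string 'add-floats)
```

The exact listing format is implementation-specific, which is why the SBCL and CCL outputs below look so different.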
The versions of the implementations used are as follows.
simple-gemm
A plain GEMM.
(defun simple-gemm (ma mb)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (declare (type (simple-array single-float (* *)) ma mb))
  (let ((rows (array-dimension ma 0))
        (cols (array-dimension mb 1)))
    (declare (type fixnum rows cols))
    (let ((result (make-matrix rows cols)))
      (declare (type (simple-array single-float (* *)) result))
      (dotimes (row rows)
        (dotimes (col cols)
          (dotimes (k cols)
            (incf (aref result row col)
                  (* (aref ma row k) (aref mb k col))))))
      result)))
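make-matrix is called above but its definition is not shown in this post. A minimal sketch, assuming it simply allocates a zero-filled two-dimensional single-float array (the real definition in the referenced code may differ):

```lisp
;; Assumed definition of MAKE-MATRIX; the original is not shown in the post.
;; The zero fill matters: SIMPLE-GEMM accumulates into the result with INCF,
;; so every cell has to start at 0.0.
(defun make-matrix (rows cols)
  "Return a ROWS x COLS single-float matrix filled with zeros."
  (make-array (list rows cols)
              :element-type 'single-float
              :initial-element 0.0f0))
```

With this definition the result satisfies the (simple-array single-float (* *)) declaration used in the GEMM functions.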
SBCL first.
The output looks fairly ordinary.
CL-USER> (disassemble 'simple-gemm)
; disassembly for SIMPLE-GEMM
; Size: 283 bytes. Origin: #x1004C5CA65
; A65: 4C894DE8 MOV [RBP-24], R9 ; no-arg-parsing entry point
; A69: 4C8945E0 MOV [RBP-32], R8
; A6D: 4D8B6831 MOV R13, [R8+49]
; A71: 4D8B5139 MOV R10, [R9+57]
; A75: 4C896DF8 MOV [RBP-8], R13
; A79: 4C8955F0 MOV [RBP-16], R10
; A7D: 488D5C24F0 LEA RBX, [RSP-16]
; A82: 4883EC18 SUB RSP, 24
; A86: 498BD5 MOV RDX, R13
; A89: 498BFA MOV RDI, R10
; A8C: 488B057DFFFFFF MOV RAX, [RIP-131] ; #<FDEFINITION for MAKE-MATRIX>
; A93: B904000000 MOV ECX, 4
; A98: 48892B MOV [RBX], RBP
; A9B: 488BEB MOV RBP, RBX
; A9E: FF5009 CALL QWORD PTR [RAX+9]
; AA1: 480F42E3 CMOVB RSP, RBX
; AA5: 4C8B45E0 MOV R8, [RBP-32]
; AA9: 4C8B4DE8 MOV R9, [RBP-24]
; AAD: 4C8B55F0 MOV R10, [RBP-16]
; AB1: 4C8B6DF8 MOV R13, [RBP-8]
; AB5: 488BF2 MOV RSI, RDX
; AB8: 31DB XOR EBX, EBX
; ABA: E9AF000000 JMP L5
; ABF: 90 NOP
; AC0: L0: 31C0 XOR EAX, EAX
; AC2: E99A000000 JMP L4
; AC7: 660F1F840000000000 NOP
; AD0: L1: 31C9 XOR ECX, ECX
; AD2: E981000000 JMP L3
; AD7: 660F1F840000000000 NOP
; AE0: L2: 498B5039 MOV RDX, [R8+57]
; AE4: 488BFB MOV RDI, RBX
; AE7: 48D1FF SAR RDI, 1
; AEA: 480FAFFA IMUL RDI, RDX
; AEE: 4801CF ADD RDI, RCX
; AF1: 498B5011 MOV RDX, [R8+17]
; AF5: F30F104C7A01 MOVSS XMM1, [RDX+RDI*2+1]
; AFB: 498B5139 MOV RDX, [R9+57]
; AFF: 488BF9 MOV RDI, RCX
; B02: 48D1FF SAR RDI, 1
; B05: 480FAFFA IMUL RDI, RDX
; B09: 4801C7 ADD RDI, RAX
; B0C: 498B5111 MOV RDX, [R9+17]
; B10: F30F10547A01 MOVSS XMM2, [RDX+RDI*2+1]
; B16: F30F59D1 MULSS XMM2, XMM1
; B1A: 488B5639 MOV RDX, [RSI+57]
; B1E: 488BFB MOV RDI, RBX
; B21: 48D1FF SAR RDI, 1
; B24: 480FAFFA IMUL RDI, RDX
; B28: 4801C7 ADD RDI, RAX
; B2B: 488B5611 MOV RDX, [RSI+17]
; B2F: F30F104C7A01 MOVSS XMM1, [RDX+RDI*2+1]
; B35: F30F58CA ADDSS XMM1, XMM2
; B39: 488B5639 MOV RDX, [RSI+57]
; B3D: 488BFB MOV RDI, RBX
; B40: 48D1FF SAR RDI, 1
; B43: 480FAFFA IMUL RDI, RDX
; B47: 4801C7 ADD RDI, RAX
; B4A: 488B5611 MOV RDX, [RSI+17]
; B4E: F30F114C7A01 MOVSS [RDX+RDI*2+1], XMM1
; B54: 4883C102 ADD RCX, 2
; B58: L3: 4C39D1 CMP RCX, R10
; B5B: 7C83 JL L2
; B5D: 4883C002 ADD RAX, 2
; B61: L4: 4C39D0 CMP RAX, R10
; B64: 0F8C66FFFFFF JL L1
; B6A: 4883C302 ADD RBX, 2
; B6E: L5: 4C39EB CMP RBX, R13
; B71: 0F8C49FFFFFF JL L0
; B77: 488BD6 MOV RDX, RSI
; B7A: 488BE5 MOV RSP, RBP
; B7D: F8 CLC
; B7E: 5D POP RBP
; B7F: C3 RET
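Next, CCL.
The corresponding Lisp code is attached as comments, which makes it easy to follow.
Looking closely at this S-expression-style mystery assembler, though, there is a lot I don't understand.
$ seems to mark an immediate value and % a variable (a register?).
I probably won't get much further without reading the CCL documentation properly.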
CL-USER> (disassemble 'simple-gemm)
(recover-fn-from-rip)
(pushq (% rbp))
(movq (% rsp) (% rbp))
(pushq (% arg_y))
(pushq (% arg_z))
(movq (@ (% gs) 80) (% stack-temp))
(subq ($ 48) (@ (% gs) 80))
(movq (@ (% gs) 80) (% imm0))
(movq (% stack-temp) (@ (% imm0)))
(movq (@ (% gs) #x178) (% stack-temp))
(movq (% stack-temp) (@ 8 (% imm0)))
(movq (% imm0) (@ (% gs) #x178))
(pushq (% save0))
(pushq (% save1))
(pushq (% save2))
(pushq (% save3))
(xorl (% arg_z.l) (% arg_z.l))
(movl ($ 16) (% nargs))
(movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0))
(lisp-call (@ 10 (% temp0)))
(recover-fn-from-rip)
(movq (% arg_z) (% arg_x))
(pushq (% arg_x))
(movq (@ -16 (% rbp)) (% arg_y))
(movl ($ 8) (% arg_z.l))
(movl ($ 16) (% nargs))
(movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0))
(lisp-call (@ 10 (% temp0)))
(recover-fn-from-rip)
(pushq (% arg_z))
(movq (@ -56 (% rbp)) (% arg_y))
(movl ($ 16) (% nargs))
(movq (@ 'MAKE-MATRIX (% fn)) (% temp0))
(lisp-call (@ 10 (% temp0)))
(recover-fn-from-rip)
(movq (% arg_z) (% save3))
(xorl (% save2.l) (% save2.l))
(jmpq L479)
L183
(xorl (% save1.l) (% save1.l))
(jmpq L462)
L191
(xorl (% save0.l) (% save0.l))
(jmpq L445)
L199
(movq (@ -8 (% rbp)) (% arg_x))
(movq (@ 43 (% arg_x)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% save2) (% imm0))
(leaq (@ (% save0) (% imm0)) (% temp0))
(movq (@ 11 (% arg_x)) (% arg_z))
(movq (% temp0) (% imm0))
(shrq (% imm0))
(movss (@ -5 (% arg_z) (% imm0)) (% fp0))
(movq (@ -16 (% rbp)) (% arg_z))
(movq (@ 43 (% arg_z)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% save0) (% imm0))
(leaq (@ (% save1) (% imm0)) (% temp0))
(movq (@ 11 (% arg_z)) (% arg_z))
(movq (% temp0) (% imm0))
(shrq (% imm0))
(movss (@ -5 (% arg_z) (% imm0)) (% fp1))
(mulss (% fp1) (% fp0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (% fp0) (@ 16 (% imm0)))
(movd (% stack-temp) (% imm0))
(movq (@ 43 (% save3)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% save2) (% imm0))
(leaq (@ (% save1) (% imm0)) (% temp0))
(movq (@ 11 (% save3)) (% arg_z))
(movq (% temp0) (% imm0))
(shrq (% imm0))
(movss (@ -5 (% arg_z) (% imm0)) (% fp0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (@ 16 (% imm0)) (% fp1))
(movd (% stack-temp) (% imm0))
(addss (% fp1) (% fp0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (% fp0) (@ 32 (% imm0)))
(movd (% stack-temp) (% imm0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (@ 32 (% imm0)) (% fp0))
(movd (% stack-temp) (% imm0))
(movq (% save1) (% arg_y))
(movq (% save2) (% arg_x))
(movq (% save3) (% temp0))
(movq (@ 43 (% temp0)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% arg_x) (% imm0))
(leaq (@ (% arg_y) (% imm0)) (% arg_y))
(movq (@ 11 (% temp0)) (% arg_x))
(movq (% arg_y) (% imm2))
(shrq (% imm2))
(movss (% fp0) (@ -5 (% arg_x) (% imm2)))
(addq ($ 8) (% save0))
L445
(movq (@ -64 (% rbp)) (% arg_z))
(cmpq (% arg_z) (% save0))
(jl L199)
(addq ($ 8) (% save1))
L462
(movq (@ -64 (% rbp)) (% arg_z))
(cmpq (% arg_z) (% save1))
(jl L191)
(addq ($ 8) (% save2))
L479
(movq (@ -56 (% rbp)) (% arg_z))
(cmpq (% arg_z) (% save2))
(jl L183)
(movq (% save3) (% arg_z))
(addq ($ 16) (% rsp))
(popq (% save3))
(popq (% save2))
(popq (% save1))
(popq (% save0))
(movq (@ (% gs) #x178) (% imm0))
(movq (@ 8 (% imm0)) (% stack-temp))
(movq (@ (% imm0)) (% imm0))
(movq (% imm0) (@ (% gs) 80))
(movq (% stack-temp) (@ (% gs) #x178))
(leaveq)
(retq)
The assembly part is deliberately not commented out, so does it actually run if I type it into the REPL as is?
CL-USER> (recover-fn-from-rip)
; Evaluation aborted on #<CCL::UNDEFINED-FUNCTION-CALL #x3020018BF73D>.
Apparently not.
caching-gemm
A GEMM that caches the intermediate result in a variable.
(defun caching-gemm (ma mb)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (declare (type (simple-array single-float (* *)) ma mb))
  (let ((rows (array-dimension ma 0))
        (cols (array-dimension mb 1)))
    (declare (type fixnum rows cols))
    (let ((result (make-matrix rows cols)))
      (declare (type (simple-array single-float (* *)) result))
      (dotimes (row rows)
        (dotimes (col cols)
          (let ((cell (aref result row col)))
            (declare (type (single-float) cell))
            (dotimes (k cols)
              (incf cell
                    (* (aref ma row k) (aref mb k col))))
            (setf (aref result row col) cell))))
      result)))
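With SBCL.
L2 is probably the innermost loop, but what is the SAR there for?
Also, does it reload the width and the base address and recompute the index on every iteration?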
CL-USER> (disassemble 'caching-gemm)
; disassembly for CACHING-GEMM
; Size: 272 bytes. Origin: #x1006D8D8F5
; 8F5: 4C894DE8 MOV [RBP-24], R9 ; no-arg-parsing entry point
; 8F9: 4C8945E0 MOV [RBP-32], R8
; 8FD: 4D8B6831 MOV R13, [R8+49]
; 901: 4D8B5139 MOV R10, [R9+57]
; 905: 4C896DF8 MOV [RBP-8], R13
; 909: 4C8955F0 MOV [RBP-16], R10
; 90D: 488D5C24F0 LEA RBX, [RSP-16]
; 912: 4883EC18 SUB RSP, 24
; 916: 498BD5 MOV RDX, R13
; 919: 498BFA MOV RDI, R10
; 91C: 488B057DFFFFFF MOV RAX, [RIP-131] ; #<FDEFINITION for MAKE-MATRIX>
; 923: B904000000 MOV ECX, 4
; 928: 48892B MOV [RBX], RBP
; 92B: 488BEB MOV RBP, RBX
; 92E: FF5009 CALL QWORD PTR [RAX+9]
; 931: 480F42E3 CMOVB RSP, RBX
; 935: 4C8B45E0 MOV R8, [RBP-32]
; 939: 4C8B4DE8 MOV R9, [RBP-24]
; 93D: 4C8B55F0 MOV R10, [RBP-16]
; 941: 4C8B6DF8 MOV R13, [RBP-8]
; 945: 488BF2 MOV RSI, RDX
; 948: 31DB XOR EBX, EBX
; 94A: E9A4000000 JMP L5
; 94F: 90 NOP
; 950: L0: 31C0 XOR EAX, EAX
; 952: E98F000000 JMP L4
; 957: 660F1F840000000000 NOP
; 960: L1: 488B4E39 MOV RCX, [RSI+57]
; 964: 488BFB MOV RDI, RBX
; 967: 48D1FF SAR RDI, 1
; 96A: 480FAFF9 IMUL RDI, RCX
; 96E: 4801C7 ADD RDI, RAX
; 971: 488B4E11 MOV RCX, [RSI+17]
; 975: F30F104C7901 MOVSS XMM1, [RCX+RDI*2+1]
; 97B: 31C9 XOR ECX, ECX
; 97D: EB43 JMP L3
; 97F: 90 NOP
; 980: L2: 498B5039 MOV RDX, [R8+57]
; 984: 488BFB MOV RDI, RBX
; 987: 48D1FF SAR RDI, 1
; 98A: 480FAFFA IMUL RDI, RDX
; 98E: 4801CF ADD RDI, RCX
; 991: 498B5011 MOV RDX, [R8+17]
; 995: F30F10547A01 MOVSS XMM2, [RDX+RDI*2+1]
; 99B: 498B5139 MOV RDX, [R9+57]
; 99F: 488BF9 MOV RDI, RCX
; 9A2: 48D1FF SAR RDI, 1
; 9A5: 480FAFFA IMUL RDI, RDX
; 9A9: 4801C7 ADD RDI, RAX
; 9AC: 498B5111 MOV RDX, [R9+17]
; 9B0: F30F105C7A01 MOVSS XMM3, [RDX+RDI*2+1]
; 9B6: F30F59DA MULSS XMM3, XMM2
; 9BA: F30F58CB ADDSS XMM1, XMM3
; 9BE: 4883C102 ADD RCX, 2
; 9C2: L3: 4C39D1 CMP RCX, R10
; 9C5: 7CB9 JL L2
; 9C7: 488B4E39 MOV RCX, [RSI+57]
; 9CB: 488BFB MOV RDI, RBX
; 9CE: 48D1FF SAR RDI, 1
; 9D1: 480FAFF9 IMUL RDI, RCX
; 9D5: 4801C7 ADD RDI, RAX
; 9D8: 488B4E11 MOV RCX, [RSI+17]
; 9DC: F30F114C7901 MOVSS [RCX+RDI*2+1], XMM1
; 9E2: 4883C002 ADD RAX, 2
; 9E6: L4: 4C39D0 CMP RAX, R10
; 9E9: 0F8C71FFFFFF JL L1
; 9EF: 4883C302 ADD RBX, 2
; 9F3: L5: 4C39EB CMP RBX, R13
; 9F6: 0F8C54FFFFFF JL L0
; 9FC: 488BD6 MOV RDX, RSI
; 9FF: 488BE5 MOV RSP, RBP
; A02: F8 CLC
; A03: 5D POP RBP
; A04: C3 RET
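A likely explanation for the SAR, assuming this SBCL build represents fixnums with one low tag bit (so the integer n is stored as 2n): the loop counters in the listing step by 2 (ADD RCX, 2), which fits that assumption. Multiplying two tagged values would scale the product twice, so one operand is untagged with SAR first. The sketch below only emulates that arithmetic in Lisp; the function names are hypothetical, and the reading of [R8+57] as the width and [R8+17] as the data vector is my guess from the listing, not something I confirmed in the SBCL sources.

```lisp
;; Emulation of the index arithmetic seen in the SBCL listing.
;; Assumption: fixnums carry a single low tag bit, i.e. n is stored as 2n.
(defun tag-fixnum (n)
  (ash n 1))

(defun tagged-index (tagged-row tagged-col tagged-width)
  ;; SAR RDI, 1      -> untag the row
  ;; IMUL RDI, width -> row * (2 * width), which is again a tagged value
  ;; ADD RDI, col    -> add the tagged column
  (+ (* (ash tagged-row -1) tagged-width) tagged-col))

(defun element-byte-offset (tagged-index)
  ;; [RDX + RDI*2 + 1]: scaling the tagged index by 2 yields 4 bytes per
  ;; single-float element; the +1 is presumably the vector header size
  ;; minus the pointer tag, and is left out here.
  (* tagged-index 2))

;; Example: row 3, column 5 of a matrix that is 8 elements wide.
;; The untagged index is 3*8 + 5 = 29.
;; (tagged-index (tag-fixnum 3) (tag-fixnum 5) (tag-fixnum 8))  ; => 58
;; (element-byte-offset 58)                                     ; => 116
```

So the SAR is not extra work in the algorithmic sense; it is the cost of keeping indices in tagged form. The repeated loads of the width and data vector are a separate issue.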
With CCL.
The innermost loop is around L255.
There are things like lisp-call along the way, and I think the stack-temp traffic is probably register spilling.
CL-USER> (disassemble 'caching-gemm)
(recover-fn-from-rip)
(pushq (% rbp))
(movq (% rsp) (% rbp))
(pushq (% arg_y))
(pushq (% arg_z))
(movq (@ (% gs) 80) (% stack-temp))
(subq ($ 64) (@ (% gs) 80))
(movq (@ (% gs) 80) (% imm0))
(movq (% stack-temp) (@ (% imm0)))
(movq (@ (% gs) #x178) (% stack-temp))
(movq (% stack-temp) (@ 8 (% imm0)))
(movq (% imm0) (@ (% gs) #x178))
(pushq (% save0))
(pushq (% save1))
(pushq (% save2))
(pushq (% save3))
(xorl (% arg_z.l) (% arg_z.l))
(movl ($ 16) (% nargs))
(movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0))
(lisp-call (@ 10 (% temp0)))
(recover-fn-from-rip)
(movq (% arg_z) (% arg_x))
(pushq (% arg_x))
(movq (@ -16 (% rbp)) (% arg_y))
(movl ($ 8) (% arg_z.l))
(movl ($ 16) (% nargs))
(movq (@ 'ARRAY-DIMENSION (% fn)) (% temp0))
(lisp-call (@ 10 (% temp0)))
(recover-fn-from-rip)
(movq (% arg_z) (% save3))
(movq (@ -56 (% rbp)) (% arg_y))
(movq (% save3) (% arg_z))
(movl ($ 16) (% nargs))
(movq (@ 'MAKE-MATRIX (% fn)) (% temp0))
(lisp-call (@ 10 (% temp0)))
(recover-fn-from-rip)
(pushq (% arg_z))
(xorl (% save2.l) (% save2.l))
(jmpq L562)
L181
(xorl (% save1.l) (% save1.l))
(jmpq L549)
L189
(movq (@ -64 (% rbp)) (% arg_x))
(movq (@ 43 (% arg_x)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% save2) (% imm0))
(leaq (@ (% save1) (% imm0)) (% temp0))
(movq (@ 11 (% arg_x)) (% arg_z))
(movq (% temp0) (% imm0))
(shrq (% imm0))
(movss (@ -5 (% arg_z) (% imm0)) (% fp0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (% fp0) (@ 16 (% imm0)))
(movd (% stack-temp) (% imm0))
(xorl (% save0.l) (% save0.l))
(jmpq L471)
L255
(movq (@ -8 (% rbp)) (% arg_x))
(movq (@ 43 (% arg_x)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% save2) (% imm0))
(leaq (@ (% save0) (% imm0)) (% temp0))
(movq (@ 11 (% arg_x)) (% arg_z))
(movq (% temp0) (% imm0))
(shrq (% imm0))
(movss (@ -5 (% arg_z) (% imm0)) (% fp0))
(movq (@ -16 (% rbp)) (% arg_z))
(movq (@ 43 (% arg_z)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% save0) (% imm0))
(leaq (@ (% save1) (% imm0)) (% temp0))
(movq (@ 11 (% arg_z)) (% arg_z))
(movq (% temp0) (% imm0))
(shrq (% imm0))
(movss (@ -5 (% arg_z) (% imm0)) (% fp1))
(mulss (% fp1) (% fp0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (% fp0) (@ 32 (% imm0)))
(movd (% stack-temp) (% imm0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (@ 32 (% imm0)) (% fp1))
(movd (% stack-temp) (% imm0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (@ 16 (% imm0)) (% fp0))
(movd (% stack-temp) (% imm0))
(addss (% fp1) (% fp0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (% fp0) (@ 48 (% imm0)))
(movd (% stack-temp) (% imm0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (@ 48 (% imm0)) (% fp0))
(movd (% stack-temp) (% imm0))
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (% fp0) (@ 16 (% imm0)))
(movd (% stack-temp) (% imm0))
(addq ($ 8) (% save0))
L471
(cmpq (% save3) (% save0))
(jl L255)
(movd (% imm0) (% stack-temp))
(movq (@ (% gs) #x178) (% imm0))
(movss (@ 16 (% imm0)) (% fp0))
(movd (% stack-temp) (% imm0))
(movq (% save1) (% arg_y))
(movq (% save2) (% arg_x))
(movq (@ -64 (% rbp)) (% temp0))
(movq (@ 43 (% temp0)) (% imm0))
(sarq ($ 3) (% imm0))
(imulq (% arg_x) (% imm0))
(leaq (@ (% arg_y) (% imm0)) (% arg_y))
(movq (@ 11 (% temp0)) (% arg_x))
(movq (% arg_y) (% imm2))
(shrq (% imm2))
(movss (% fp0) (@ -5 (% arg_x) (% imm2)))
(addq ($ 8) (% save1))
L549
(cmpq (% save3) (% save1))
(jl L189)
(addq ($ 8) (% save2))
L562
(movq (@ -56 (% rbp)) (% arg_z))
(cmpq (% arg_z) (% save2))
(jl L181)
(movq (@ -64 (% rbp)) (% arg_z))
(addq ($ 16) (% rsp))
(popq (% save3))
(popq (% save2))
(popq (% save1))
(popq (% save0))
(movq (@ (% gs) #x178) (% imm0))
(movq (@ 8 (% imm0)) (% stack-temp))
(movq (@ (% imm0)) (% imm0))
(movq (% imm0) (@ (% gs) 80))
(movq (% stack-temp) (@ (% gs) #x178))
(leaveq)
(retq)
row-major-gemm
aref looked like it might be slow, so I rewrote the inner loop with row-major-aref.
(defun row-major-gemm (ma mb)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (declare (type (simple-array single-float (* *)) ma mb))
  (let ((rows (array-dimension ma 0))
        (cols (array-dimension mb 1)))
    (declare (type fixnum rows cols))
    (let ((result (make-matrix rows cols)))
      (declare (type (simple-array single-float (* *)) result))
      (dotimes (row rows)
        (dotimes (col cols)
          (let ((cell (aref result row col))
                (ma-index (array-row-major-index ma row 0))
                (mb-index (array-row-major-index mb 0 col)))
            (declare (type (single-float) cell))
            (declare (type fixnum ma-index mb-index))
            (dotimes (k cols)
              (incf cell (* (row-major-aref ma ma-index)
                            (row-major-aref mb mb-index)))
              (incf ma-index)
              (incf mb-index cols))
            (setf (aref result row col) cell))))
      result)))
With SBCL.
L2 is the innermost loop. I think I managed to get rid of the index multiplication, but what looks like the load of the base address is still there.
CL-USER> (disassemble 'row-major-gemm)
; disassembly for ROW-MAJOR-GEMM
; Size: 266 bytes. Origin: #x1003296715
; 715: 4C8975E8 MOV [RBP-24], R14 ; no-arg-parsing entry point
; 719: 4C896DE0 MOV [RBP-32], R13
; 71D: 4D8B7D31 MOV R15, [R13+49]
; 721: 4D8B4E39 MOV R9, [R14+57]
; 725: 4C897DF8 MOV [RBP-8], R15
; 729: 4C894DF0 MOV [RBP-16], R9
; 72D: 488D5C24F0 LEA RBX, [RSP-16]
; 732: 4883EC18 SUB RSP, 24
; 736: 498BD7 MOV RDX, R15
; 739: 498BF9 MOV RDI, R9
; 73C: 488B057DFFFFFF MOV RAX, [RIP-131] ; #<FDEFINITION for MAKE-MATRIX>
; 743: B904000000 MOV ECX, 4
; 748: 48892B MOV [RBX], RBP
; 74B: 488BEB MOV RBP, RBX
; 74E: FF5009 CALL QWORD PTR [RAX+9]
; 751: 480F42E3 CMOVB RSP, RBX
; 755: 4C8B6DE0 MOV R13, [RBP-32]
; 759: 4C8B75E8 MOV R14, [RBP-24]
; 75D: 4C8B4DF0 MOV R9, [RBP-16]
; 761: 4C8B7DF8 MOV R15, [RBP-8]
; 765: 488BDA MOV RBX, RDX
; 768: 31C9 XOR ECX, ECX
; 76A: E99E000000 JMP L5
; 76F: 90 NOP
; 770: L0: 31C0 XOR EAX, EAX
; 772: E989000000 JMP L4
; 777: 660F1F840000000000 NOP
; 780: L1: 488B5339 MOV RDX, [RBX+57]
; 784: 488BF1 MOV RSI, RCX
; 787: 48D1FE SAR RSI, 1
; 78A: 480FAFF2 IMUL RSI, RDX
; 78E: 4801C6 ADD RSI, RAX
; 791: 488B5311 MOV RDX, [RBX+17]
; 795: F30F104C7201 MOVSS XMM1, [RDX+RSI*2+1]
; 79B: 498B5539 MOV RDX, [R13+57]
; 79F: 488BF9 MOV RDI, RCX
; 7A2: 48D1FF SAR RDI, 1
; 7A5: 480FAFFA IMUL RDI, RDX
; 7A9: 4C8BC0 MOV R8, RAX
; 7AC: 31F6 XOR ESI, ESI
; 7AE: EB2C JMP L3
; 7B0: L2: 498B5511 MOV RDX, [R13+17]
; 7B4: F30F10547A01 MOVSS XMM2, [RDX+RDI*2+1]
; 7BA: 498B5611 MOV RDX, [R14+17]
; 7BE: F3420F105C4201 MOVSS XMM3, [RDX+R8*2+1]
; 7C5: F30F59DA MULSS XMM3, XMM2
; 7C9: F30F58CB ADDSS XMM1, XMM3
; 7CD: 4883C702 ADD RDI, 2
; 7D1: 4F8D1401 LEA R10, [R9+R8]
; 7D5: 4D8BC2 MOV R8, R10
; 7D8: 4883C602 ADD RSI, 2
; 7DC: L3: 4C39CE CMP RSI, R9
; 7DF: 7CCF JL L2
; 7E1: 488B5339 MOV RDX, [RBX+57]
; 7E5: 488BF1 MOV RSI, RCX
; 7E8: 48D1FE SAR RSI, 1
; 7EB: 480FAFF2 IMUL RSI, RDX
; 7EF: 4801C6 ADD RSI, RAX
; 7F2: 488B5311 MOV RDX, [RBX+17]
; 7F6: F30F114C7201 MOVSS [RDX+RSI*2+1], XMM1
; 7FC: 4883C002 ADD RAX, 2
; 800: L4: 4C39C8 CMP RAX, R9
; 803: 0F8C77FFFFFF JL L1
; 809: 4883C102 ADD RCX, 2
; 80D: L5: 4C39F9 CMP RCX, R15
; 810: 0F8C5AFFFFFF JL L0
; 816: 488BD3 MOV RDX, RBX
; 819: 488BE5 MOV RSP, RBP
; 81C: F8 CLC
; 81D: 5D POP RBP
; 81E: C3 RET
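One thing that might remove the remaining loads in L2 is to store the matrices as one-dimensional simple arrays and pass the dimensions explicitly, since a two-dimensional array appears to be reached through a header that points at a separate data vector. The following is an untested idea in that direction, not something measured in this post: flat-gemm and its rows/inner/cols parameters are hypothetical, and I have not checked what SBCL or CCL actually emit for it.

```lisp
;; Sketch: GEMM over flat, row-major 1-D storage.
;; MA is ROWS x INNER, MB is INNER x COLS, both stored row-major.
(defun flat-gemm (ma mb rows inner cols)
  (declare (optimize (speed 3) (debug 0) (safety 0))
           (type (simple-array single-float (*)) ma mb)
           (type fixnum rows inner cols))
  (let ((result (make-array (* rows cols)
                            :element-type 'single-float
                            :initial-element 0.0f0)))
    (declare (type (simple-array single-float (*)) result))
    (dotimes (row rows)
      (dotimes (col cols)
        (let ((cell 0.0f0)
              (ma-index (* row inner)) ; start of this row in MA
              (mb-index col))          ; top of this column in MB
          (declare (type single-float cell)
                   (type fixnum ma-index mb-index))
          (dotimes (k inner)
            (incf cell (* (aref ma ma-index) (aref mb mb-index)))
            (incf ma-index)            ; next element in the row
            (incf mb-index cols))      ; next element in the column
          (setf (aref result (+ (* row cols) col)) cell))))
    result))
```

Unlike the functions above, this version takes the shared dimension as a separate argument, so it does not rely on the matrices being square.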
Let's measure it, just to be sure.
CL-USER> (setf *N* 256) (run #'simple-gemm 100) (run #'on-register-gemm 100) (run #'row-major-gemm 100)
Evaluation took:
6.159 seconds of real time
6.164000 seconds of total run time (6.164000 user, 0.000000 system)
[ Run times consist of 0.004 seconds GC time, and 6.160 seconds non-GC time. ]
100.08% CPU
14,747,074,008 processor cycles
26,281,520 bytes consed
Evaluation took:
3.955 seconds of real time
3.960000 seconds of total run time (3.960000 user, 0.000000 system)
100.13% CPU
9,469,087,053 processor cycles
26,216,000 bytes consed
Evaluation took:
3.003 seconds of real time
3.008000 seconds of total run time (3.008000 user, 0.000000 system)
[ Run times consist of 0.004 seconds GC time, and 3.004 seconds non-GC time. ]
100.17% CPU
7,192,309,782 processor cycles
26,216,000 bytes consed
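The run function and *N* used above are not shown in this post. A minimal sketch of what such a harness could look like, assuming it builds two random N x N inputs once and times only the repeated calls; make-random-matrix is a hypothetical helper. That assumption is at least consistent with the output: 100 result matrices of 256 x 256 single-floats come to 26,214,400 bytes, close to the reported bytes consed.

```lisp
;; Assumed benchmark harness; the original RUN is not shown in the post.
(defvar *n* 256
  "Matrix size. The post sets this with (setf *N* 256).")

;; Hypothetical helper: an N x N single-float matrix with random entries.
(defun make-random-matrix (n)
  (let ((m (make-array (list n n) :element-type 'single-float)))
    (dotimes (i (array-total-size m) m)
      (setf (row-major-aref m i) (random 1.0f0)))))

(defun run (gemm times)
  "Call GEMM on two random *N* x *N* matrices TIMES times under TIME."
  (let ((ma (make-random-matrix *n*))
        (mb (make-random-matrix *n*)))
    ;; The inputs are allocated outside TIME, so only the result
    ;; matrices created by GEMM show up in the consing figure.
    (time (dotimes (i times)
            (funcall gemm ma mb)))))
```

Usage would match the REPL line above, for example (run #'simple-gemm 100).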
Even without unrolling this is faster than on-register-gemm, but it still doesn't look like it will get to code on par with C.
Reference pages