Tiled Matmul
While @tile and @parallelize are powerful tools for data manipulation, they can be cumbersome. TiledCopy and TiledMMA simplify this process.
Tiled Copy
TiledCopy streamlines data transfer between arrays. Consider an example where six threads copy a 4x9 src array to a dst array of the same shape. The mapping of logical coordinates to thread IDs is as follows:
1 1 1 2 2 2 3 3 3
1 1 1 2 2 2 3 3 3
4 4 4 5 5 5 6 6 6
4 4 4 5 5 5 6 6 6
Each thread is assigned a data segment defined by val_layout (2,3):(1,2), while the thread group operates within thr_layout (2,3):(3,1).
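To make the mapping concrete, here is a plain-arithmetic sketch (no MoYe API; the helper owner is hypothetical) that reproduces the table above from the two layouts:

# Which thread owns logical coordinate (i, j), and which of its values is it?
# thr_layout (2,3):(3,1) indexes the 2×3 grid of tiles with strides (3, 1);
# val_layout (2,3):(1,2) indexes positions inside a 2×3 tile with strides (1, 2).
function owner(i, j)                      # 1-based coordinates in the 4x9 array
    ti, tj = (i - 1) ÷ 2, (j - 1) ÷ 3     # which tile
    vi, vj = (i - 1) % 2, (j - 1) % 3     # position inside the tile
    thread = ti * 3 + tj * 1 + 1          # thread ID via thr_layout strides (3, 1)
    value  = vi * 1 + vj * 2 + 1          # value ID via val_layout strides (1, 2)
    return thread, value
end

For example, owner(1, 4) returns (2, 1): element (1, 4) is thread 2's first value, matching the table.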
First, initialize the arrays:
julia> using MoYe
julia> src_buffer = collect(1:36) .* 0.1;
julia> src = MoYeArray(src_buffer, @Layout((4,9)))
4×9 MoYeArray{Float64, 2, ViewEngine{Float64, Ptr{Float64}}, StaticLayout{2, Tuple{Static.StaticInt{4}, Static.StaticInt{9}}, Tuple{Static.StaticInt{1}, Static.StaticInt{4}}}} with indices _1:_4×_1:_9:
 0.1  0.5  0.9  1.3  1.7  2.1  2.5  2.9  3.3
 0.2  0.6  1.0  1.4  1.8  2.2  2.6  3.0  3.4
 0.3  0.7  1.1  1.5  1.9  2.3  2.7  3.1  3.5
 0.4  0.8  1.2  1.6  2.0  2.4  2.8  3.2  3.6
julia> dst_buffer = zeros(36);
julia> dst = MoYeArray(dst_buffer, make_layout((_4,_9)));
Next, set up the TiledCopy:
julia> thr_layout = @Layout (2, 3) (3, 1)
(_2, _3):(_3, _1)
julia> val_layout = @Layout (2, 3) (1, 2)
(_2, _3):(_1, _2)
julia> tiled_copy = make_tiled_copy(
           CopyAtom{UniversalCopy{Float64}, Float64}(),
           thr_layout,
           val_layout)
TiledCopy
    Tiler_MN:       (_4, _9)
    TiledLayout_TV: ((_3, _2), (_2, _3)):((_12, _2), (_1, _4))
  CopyAtom
    Thread ID:    _1:_0
    ValLayoutSrc: (_1, _1):(_0, _1)
    ValLayoutDst: (_1, _1):(_0, _1)
    ValLayoutRef: (_1, _1):(_0, _1)
    ValueType:    64b
The Float64 in CopyAtom specifies the data type. UniversalCopy{Float64} indicates a non-vectorized copy. For vectorized copies, use a type like UInt128:
tiled_copy_vec = make_tiled_copy(
    CopyAtom{UniversalCopy{UInt128}, Float64}(),
    thr_layout,
    val_layout)
Note that for vectorized copies, the number of elements in val_layout must be divisible by the vector width (two Float64 values per UInt128 here).
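The vector width is simply how many array elements fit into one transfer; for Float64 data moved as UInt128 it is:

julia> sizeof(UInt128) ÷ sizeof(Float64)
2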
Visualize the tiled_copy using print_typst(tiled_copy) in the Typst web app:
The two tables show the thread distribution for src and dst. Some PTX instructions redistribute each thread's data, so the src and dst distributions can differ. For example:
print_typst(make_tiled_copy(MoYe.CopyAtom{LDSM_U32x4_N, UInt16}(),
                            @Layout((16,2)), @Layout((2,4))));
As shown, thr_layout and val_layout are defined on dst. We will revisit ldmatrix when discussing Tensor Cores.
After creating the tiled_copy, partition the data:
julia> thr_idx = 2;
julia> thr_copy = get_slice(tiled_copy, thr_idx);
julia> dst_t = partition_D(thr_copy, dst);
julia> dst_t.layout
((_1, (_2, _3)), _1, _1):((_0, (_1, _4)), _0, _0)
julia> src_t = partition_S(thr_copy, src);
julia> src_t.layout
((_1, (_2, _3)), _1, _1):((_0, (_1, _4)), _0, _0)
julia> copyto!(tiled_copy, dst_t, src_t);
julia> dst
4×9 MoYeArray{Float64, 2, ViewEngine{Float64, Ptr{Float64}}, StaticLayout{2, Tuple{Static.StaticInt{4}, Static.StaticInt{9}}, Tuple{Static.StaticInt{1}, Static.StaticInt{4}}}} with indices _1:_4×_1:_9:
 0.0  0.0  0.0  1.3  1.7  2.1  0.0  0.0  0.0
 0.0  0.0  0.0  1.4  1.8  2.2  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
The second thread has now completed its copy. The shape of dst_t is (CPY, CPY_M, CPY_K), where the leading sub-mode of CPY is the number of values moved by a single vectorized copy instruction; here it is 1. Changing to UniversalCopy{UInt128} would alter this.
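As a sketch (not run as a doctest here), repeating the partition with the tiled_copy_vec defined earlier groups two Float64 values into each UInt128 transfer:

thr_copy_vec = get_slice(tiled_copy_vec, 2);
dst_v = partition_D(thr_copy_vec, dst);    # leading CPY sub-mode becomes 2
src_v = partition_S(thr_copy_vec, src);
copyto!(tiled_copy_vec, dst_v, src_v);     # same copy, two values per instruction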
The NVIDIA Ampere architecture supports cuda::memcpy_async for asynchronous data copies between global and shared memory. In older architectures, this required intermediate registers:
julia> thr_idx = 3;
julia> thr_copy = get_slice(tiled_copy, thr_idx);
julia> dst_t = partition_D(thr_copy, dst);
julia> src_t = partition_S(thr_copy, src);
julia> dst_r = make_fragment_like(dst_t);
julia> copyto!(tiled_copy, dst_r, src_t);
julia> copyto!(tiled_copy, dst_t, dst_r);
julia> dst
4×9 MoYeArray{Float64, 2, ViewEngine{Float64, Ptr{Float64}}, StaticLayout{2, Tuple{Static.StaticInt{4}, Static.StaticInt{9}}, Tuple{Static.StaticInt{1}, Static.StaticInt{4}}}} with indices _1:_4×_1:_9:
 0.0  0.0  0.0  1.3  1.7  2.1  2.5  2.9  3.3
 0.0  0.0  0.0  1.4  1.8  2.2  2.6  3.0  3.4
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
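Looping the same calls over all six thread indices completes the copy of the whole array (a host-side sketch; on the GPU each thread executes only its own slice, with thr_idx = threadIdx().x):

for thr_idx in 1:6
    thr_copy = get_slice(tiled_copy, thr_idx)
    copyto!(tiled_copy, partition_D(thr_copy, dst), partition_S(thr_copy, src))
end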
TiledMMA
TiledMMA simplifies MMA partitions. Invoke make_tiled_mma as follows:
mma_C = make_tiled_mma(UniversalFMA{TA, TB, TC}(), # MMA operation
                       @Layout((16, 16)))          # Atom layout

Here TA, TB, and TC are the element types of A, B, and C, so this snippet is meant to run where they are defined (see the matmul host function below).
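For instance, with Float32 inputs and accumulator (a minimal concrete sketch):

mma_C = make_tiled_mma(UniversalFMA{Float32, Float32, Float32}(),
                       @Layout((16, 16)))

This builds a tiled MMA over 16 × 16 = 256 threads, each performing scalar fused multiply-adds.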
You can replace UniversalFMA with other MMAOp types. View the predefined MMAOps with:
julia> MoYe.mma_ops_list
51-element Vector{Any}:
 "MMAOP_8x8x4_F64F64F64F64_TN" => "llvm.nvvm.mma.m8n8k4.row.col.f64"
 "MMAOP_16x8x4_F32TF32TF32F32_TN" => "llvm.nvvm.mma.m16n8k4.row.col.tf32"
 "MMAOP_16x8x8_F32TF32TF32F32_TN" => "llvm.nvvm.mma.m16n8k8.row.col.tf32"
 "MMAOP_16x8x16_F32BF16BF16F32_TN" => "llvm.nvvm.mma.m16n8k16.row.col.bf16"
 "MMAOP_16x8x8_F32BF16BF16F32_TN" => "llvm.nvvm.mma.m16n8k8.row.col.bf16"
 "MMAOP_8x8x4_F16F16F16F16_TT" => "llvm.nvvm.mma.m8n8k4.row.row.f16.f16"
 "MMAOP_8x8x4_F16F16F16F16_NT" => "llvm.nvvm.mma.m8n8k4.col.row.f16.f16"
 "MMAOP_8x8x4_F16F16F16F16_TN" => "llvm.nvvm.mma.m8n8k4.row.col.f16.f16"
 "MMAOP_8x8x4_F16F16F16F16_NN" => "llvm.nvvm.mma.m8n8k4.col.col.f16.f16"
 "MMAOP_8x8x4_F32F16F16F16_TT" => "llvm.nvvm.mma.m8n8k4.row.row.f32.f16"
 ⋮
 "MMAOP_16x8x32_S32U8S8S32_TN_SATURATE" => "llvm.nvvm.mma.m16n8k32.row.col.satfinite.u8.s8"
 "MMAOP_16x8x32_S32U8U8S32_TN" => "llvm.nvvm.mma.m16n8k32.row.col.u8"
 "MMAOP_16x8x32_S32U8U8S32_TN_SATURATE" => "llvm.nvvm.mma.m16n8k32.row.col.satfinite.u8"
 "MMAOP_8x8x128_S32B1B1S32_TN_XORPOPC" => "llvm.nvvm.mma.xor.popc.m8n8k128.row.col.b1"
 "MMAOP_8x8x128_S32B1B1S32_TN_ANDPOPC" => "llvm.nvvm.mma.and.popc.m8n8k128.row.col.b1"
 "MMAOP_16x8x128_S32B1B1S32_TN_XORPOPC" => "llvm.nvvm.mma.xor.popc.m16n8k128.row.col.b1"
 "MMAOP_16x8x128_S32B1B1S32_TN_ANDPOPC" => "llvm.nvvm.mma.and.popc.m16n8k128.row.col.b1"
 "MMAOP_16x8x256_S32B1B1S32_TN_XORPOPC" => "llvm.nvvm.mma.xor.popc.m16n8k256.row.col.b1"
 "MMAOP_16x8x256_S32B1B1S32_TN_ANDPOPC" => "llvm.nvvm.mma.and.popc.m16n8k256.row.col.b1"
Inside a kernel, a TiledMMA partitions the operands in the same style as TiledCopy:

thr_mma = get_slice(mma_C, threadIdx().x);
tCsA = partition_A(thr_mma, sA);    # (MMA, MMA_M, MMA_K)
tCsB = partition_B(thr_mma, sB);    # (MMA, MMA_N, MMA_K)
tCgC = partition_C(thr_mma, gC);    # (MMA, MMA_M, MMA_N)
tCrC = make_fragment_like(tCgC)     # register accumulator
These instructions operate on Tensor Cores, which are covered in a later section.
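As a hedged preview (assuming the names listed in mma_ops_list are also defined as op types in MoYe, possibly unexported, hence the MoYe. qualifier), such a hardware op can be tiled across warps instead of individual threads:

# The m8n8k4 FP64 op is warp-wide; a (2, 2) atom layout tiles 2×2 warps = 128 threads.
mma_fp64 = make_tiled_mma(MoYe.MMAOP_8x8x4_F64F64F64F64_TN(),
                          @Layout((2, 2)))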
Matmul with Tiled Operations
Now, let's upgrade the matmul_kernel with TiledCopy and TiledMMA.
using MoYe, CUDA, Test

function matmul_kernel(A, sA_layout, copy_A,
                       B, sB_layout, copy_B,
                       C, mma_C)
    sA = MoYeSharedArray(eltype(A), sA_layout)
    sB = MoYeSharedArray(eltype(B), sB_layout)

    mA = MoYeArray(A)
    mB = MoYeArray(B)
    mC = MoYeArray(C)

    bM = size(sA_layout, 1)
    bN = size(sB_layout, 1)
    bK = size(sB_layout, 2)

    gA = @tile mA (bM, bK) (blockIdx().x, :)
    gB = @tile mB (bN, bK) (blockIdx().y, :)
    gC = @tile mC (bM, bN) (blockIdx().x, blockIdx().y)

    # Copy partition
    thr_copy_a = get_slice(copy_A, threadIdx().x)
    tAgA = partition_S(thr_copy_a, gA)  # (CPY, CPY_M, CPY_K, k)
    tAsA = partition_D(thr_copy_a, sA)  # (CPY, CPY_M, CPY_K)
    tArA = make_fragment_like(tAsA)     # (CPY, CPY_M, CPY_K)

    thr_copy_b = get_slice(copy_B, threadIdx().x)
    tBgB = partition_S(thr_copy_b, gB)  # (CPY, CPY_N, CPY_K, k)
    tBsB = partition_D(thr_copy_b, sB)  # (CPY, CPY_N, CPY_K)
    tBrB = make_fragment_like(tBsB)     # (CPY, CPY_N, CPY_K)

    # MMA partition
    thr_mma = get_slice(mma_C, threadIdx().x)
    tCsA = partition_A(thr_mma, sA)     # (MMA, MMA_M, MMA_K)
    tCsB = partition_B(thr_mma, sB)     # (MMA, MMA_N, MMA_K)
    tCgC = partition_C(thr_mma, gC)     # (MMA, MMA_M, MMA_N)

    # Overlap copy and compute
    copyto!(copy_A, tArA, view(tAgA, :, :, :, _1))
    copyto!(copy_B, tBrB, view(tBgB, :, :, :, _1))

    # Accumulator
    tCrC = make_fragment_C(thr_mma, tCgC)
    zeros!(tCrC)

    k_max = size(tAgA, 4)
    for k in 1:k_max
        sync_threads()
        copyto!(tAsA, tArA)
        copyto!(tBsB, tBrB)
        sync_threads()

        # Load the next tile
        k_next = k < k_max ? k + 1 : k
        copyto!(copy_A, tArA, view(tAgA, :, :, :, k_next))
        copyto!(copy_B, tBrB, view(tBgB, :, :, :, k_next))

        @gc_preserve gemm!(mma_C, tCrC, tCsA, tCsB, tCrC)
    end

    copyto!(tCgC, tCrC)
    return nothing
end
function matmul(A, B, C)
    bM = _128
    bN = _128
    bK = _8

    sA_layout = make_layout((bM, bK), (_1, bM + _1))
    sB_layout = make_layout((bN, bK), (_1, bN + _1))

    TA = eltype(A)
    TB = eltype(B)
    TC = eltype(C)

    copy_A = make_tiled_copy(CopyAtom{UniversalCopy{TA}, TA}(),
                             @Layout((32, 8)),
                             @Layout((1, 1)))
    copy_B = make_tiled_copy(CopyAtom{UniversalCopy{TB}, TB}(),
                             @Layout((32, 8)),
                             @Layout((1, 1)))

    mma_C = make_tiled_mma(UniversalFMA{TA, TB, TC}(), # MMA operation
                           @Layout((32, 8)))           # Atom layout

    threads = Int(size(mma_C))
    blocks = (cld(size(A, 1), bM), cld(size(B, 1), bN))

    @cuda threads=threads blocks=blocks matmul_kernel(A, sA_layout, copy_A,
                                                      B, sB_layout, copy_B,
                                                      C, mma_C)
end
function test()
    A = CUDA.randn(Float32, 2048, 256)
    B = CUDA.randn(Float32, 2048, 256)
    C = CUDA.randn(Float32, 2048, 2048)
    matmul(A, B, C)
    CUDA.synchronize()
    @test C ≈ A * B'

    CUDA.unsafe_free!(A)
    CUDA.unsafe_free!(B)
    CUDA.unsafe_free!(C)
end

test()
Vectorized Copy and Memory Coalescing
As mentioned, you can use UniversalCopy{Float64} or UniversalCopy{UInt128} for vectorized copies. However, it is crucial to ensure that memory accesses are coalesced.
An uncoalesced copy:
copy_A = make_tiled_copy(CopyAtom{UniversalCopy{Float64}, TA}(),
                         @Layout((32, 8)),
                         @Layout((4, 1)))
Here, thread 1 loads from [1], [2] and thread 2 loads from [5], [6]; consecutive threads do not access consecutive addresses, so the loads are not coalesced.
Coalesced copies:
copy_A = make_tiled_copy(CopyAtom{UniversalCopy{Float64}, TA}(),
                         @Layout((32, 8)),
                         @Layout((2, 1)))

copy_A = make_tiled_copy(CopyAtom{UniversalCopy{UInt128}, TA}(),
                         @Layout((32, 8)),
                         @Layout((4, 1)))
In these examples, threads access contiguous memory locations, leading to coalesced memory access and better performance.
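For example, to plug the vectorized, coalesced variant into the matmul host function above, copy_A could be built as follows (a sketch; copy_B would change analogously):

copy_A = make_tiled_copy(CopyAtom{UniversalCopy{UInt128}, TA}(),
                         @Layout((32, 8)),
                         @Layout((4, 1)))

With thr_layout (32, 8) and val_layout (4, 1) the tiler is still (128, 8), matching (bM, bK), and each thread's four contiguous Float32 values form a single 16-byte load. The kernel's register-to-shared copy (copyto!(tAsA, tArA)) is unchanged; the vectorized atom applies to the global-to-register load.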