Tiled Matmul
While @tile and @parallelize are powerful tools for data manipulation, composing them by hand quickly becomes cumbersome. TiledCopy and TiledMMA package these partitioning patterns into reusable objects.
Tiled Copy
TiledCopy streamlines data transfer between arrays. Consider an example where six threads copy a 4x9 src array to a dst array of the same shape. The mapping of logical coordinates to thread IDs is as follows:
1 1 1 2 2 2 3 3 3
1 1 1 2 2 2 3 3 3
4 4 4 5 5 5 6 6 6
4 4 4 5 5 5 6 6 6

Each thread is assigned a data segment defined by val_layout (2,3):(1,2), and the threads themselves are arranged by thr_layout (2,3):(3,1); together they cover the full (2·2, 3·3) = (4, 9) tile.
First, initialize the arrays:
julia> using MoYe

julia> src_buffer = collect(1:36) .* 0.1;

julia> src = MoYeArray(src_buffer, @Layout((4,9)))
4×9 MoYeArray{Float64, 2, ViewEngine{Float64, Ptr{Float64}}, StaticLayout{2, Tuple{Static.StaticInt{4}, Static.StaticInt{9}}, Tuple{Static.StaticInt{1}, Static.StaticInt{4}}}} with indices _1:_4×_1:_9:
 0.1  0.5  0.9  1.3  1.7  2.1  2.5  2.9  3.3
 0.2  0.6  1.0  1.4  1.8  2.2  2.6  3.0  3.4
 0.3  0.7  1.1  1.5  1.9  2.3  2.7  3.1  3.5
 0.4  0.8  1.2  1.6  2.0  2.4  2.8  3.2  3.6

julia> dst_buffer = zeros(36);

julia> dst = MoYeArray(dst_buffer, make_layout((_4,_9)));
Next, set up the TiledCopy:
julia> thr_layout = @Layout (2, 3) (3, 1)
(_2, _3):(_3, _1)

julia> val_layout = @Layout (2, 3) (1, 2)
(_2, _3):(_1, _2)

julia> tiled_copy = make_tiled_copy(
           CopyAtom{UniversalCopy{Float64}, Float64}(),
           thr_layout,
           val_layout)
TiledCopy
  Tiler_MN:       (_4, _9)
  TiledLayout_TV: ((_3, _2), (_2, _3)):((_12, _2), (_1, _4))
CopyAtom
  Thread ID: _1:_0
  ValLayoutSrc: (_1, _1):(_0, _1)
  ValLayoutDst: (_1, _1):(_0, _1)
  ValLayoutRef: (_1, _1):(_0, _1)
  ValueType: 64b
The Float64 in CopyAtom specifies the data type. UniversalCopy{Float64} indicates a non-vectorized copy. For vectorized copies, use a type like UInt128:
tiled_copy_vec = make_tiled_copy(
CopyAtom{UniversalCopy{UInt128}, Float64}(),
thr_layout,
    val_layout)

Note that for vectorized copies, the number of values each thread owns in val_layout must be divisible by the vector width; here one UInt128 transaction carries two Float64 elements, and the (2, 3) val_layout provides pairs of contiguous values along its stride-1 mode.
To visualize tiled_copy, call print_typst(tiled_copy) and paste the output into the Typst web app:
The two tables show how threads and values are distributed over src and dst. Some PTX copy instructions redistribute each thread's data, so the two assignments need not match. For example:
print_typst(make_tiled_copy(MoYe.CopyAtom{LDSM_U32x4_N, UInt16}(),
@Layout((16,2)), @Layout((2,4))));
As shown, thr_layout and val_layout are defined on dst. We will revisit ldmatrix when discussing Tensor Cores.
After creating the tiled_copy, partition the data:
julia> thr_idx = 2;

julia> thr_copy = get_slice(tiled_copy, thr_idx);

julia> dst_t = partition_D(thr_copy, dst);

julia> dst_t.layout
((_1, (_2, _3)), _1, _1):((_0, (_1, _4)), _0, _0)

julia> src_t = partition_S(thr_copy, src);

julia> src_t.layout
((_1, (_2, _3)), _1, _1):((_0, (_1, _4)), _0, _0)

julia> copyto!(tiled_copy, dst_t, src_t);

julia> dst
4×9 MoYeArray{Float64, 2, ViewEngine{Float64, Ptr{Float64}}, StaticLayout{2, Tuple{Static.StaticInt{4}, Static.StaticInt{9}}, Tuple{Static.StaticInt{1}, Static.StaticInt{4}}}} with indices _1:_4×_1:_9:
 0.0  0.0  0.0  1.3  1.7  2.1  0.0  0.0  0.0
 0.0  0.0  0.0  1.4  1.8  2.2  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
The second thread has now completed its copy. The shape of dst_t is (CPY, CPY_M, CPY_K), where the leading sub-mode of CPY is the number of values moved by a single (possibly vectorized) copy instruction; here it is 1, since UniversalCopy{Float64} transfers one element at a time. Changing to UniversalCopy{UInt128} would alter this.
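As a quick sketch (not output reproduced from a REPL run), the same layouts combined with a 128-bit atom show how the vector width enters the CPY mode; tiled_copy_vec, thr_copy_vec, and dst_v are illustrative names:

# Illustrative sketch, reusing thr_layout, val_layout, dst, and thr_idx from above.
tiled_copy_vec = make_tiled_copy(
    CopyAtom{UniversalCopy{UInt128}, Float64}(),
    thr_layout,
    val_layout)
thr_copy_vec = get_slice(tiled_copy_vec, thr_idx)
dst_v = partition_D(thr_copy_vec, dst)
dst_v.layout   # the leading sub-mode of CPY should now be _2: two Float64s per copy instruction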
The NVIDIA Ampere architecture supports cuda::memcpy_async for asynchronous copies from global to shared memory. On older architectures, the data has to be staged through registers:
julia> thr_idx = 3;

julia> thr_copy = get_slice(tiled_copy, thr_idx);

julia> dst_t = partition_D(thr_copy, dst);

julia> src_t = partition_S(thr_copy, src);

julia> dst_r = make_fragment_like(dst_t);

julia> copyto!(tiled_copy, dst_r, src_t);

julia> copyto!(tiled_copy, dst_t, dst_r);

julia> dst
4×9 MoYeArray{Float64, 2, ViewEngine{Float64, Ptr{Float64}}, StaticLayout{2, Tuple{Static.StaticInt{4}, Static.StaticInt{9}}, Tuple{Static.StaticInt{1}, Static.StaticInt{4}}}} with indices _1:_4×_1:_9:
 0.0  0.0  0.0  1.3  1.7  2.1  2.5  2.9  3.3
 0.0  0.0  0.0  1.4  1.8  2.2  2.6  3.0  3.4
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
TiledMMA
TiledMMA simplifies MMA partitions. Invoke make_tiled_mma as follows:
mma_C = make_tiled_mma(UniversalFMA{TA, TB, TC}(), # MMA operation
                       @Layout((16, 16)))          # Atom layout

Here TA, TB, and TC stand for the element types of A, B, and C.
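For instance, with Float32 operands and a Float32 accumulator as concrete stand-ins for TA, TB, and TC (a quick sketch; the printed TiledMMA summary is omitted):

# Concrete element types standing in for TA, TB, TC.
mma_C = make_tiled_mma(UniversalFMA{Float32, Float32, Float32}(),
                       @Layout((16, 16)))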
You can replace UniversalFMA with other MMAOp types. View the predefined MMAOps with:
julia> MoYe.mma_ops_list
51-element Vector{Any}:
 "MMAOP_8x8x4_F64F64F64F64_TN" => "llvm.nvvm.mma.m8n8k4.row.col.f64"
 "MMAOP_16x8x4_F32TF32TF32F32_TN" => "llvm.nvvm.mma.m16n8k4.row.col.tf32"
 "MMAOP_16x8x8_F32TF32TF32F32_TN" => "llvm.nvvm.mma.m16n8k8.row.col.tf32"
 "MMAOP_16x8x16_F32BF16BF16F32_TN" => "llvm.nvvm.mma.m16n8k16.row.col.bf16"
 "MMAOP_16x8x8_F32BF16BF16F32_TN" => "llvm.nvvm.mma.m16n8k8.row.col.bf16"
 "MMAOP_8x8x4_F16F16F16F16_TT" => "llvm.nvvm.mma.m8n8k4.row.row.f16.f16"
 "MMAOP_8x8x4_F16F16F16F16_NT" => "llvm.nvvm.mma.m8n8k4.col.row.f16.f16"
 "MMAOP_8x8x4_F16F16F16F16_TN" => "llvm.nvvm.mma.m8n8k4.row.col.f16.f16"
 "MMAOP_8x8x4_F16F16F16F16_NN" => "llvm.nvvm.mma.m8n8k4.col.col.f16.f16"
 "MMAOP_8x8x4_F32F16F16F16_TT" => "llvm.nvvm.mma.m8n8k4.row.row.f32.f16"
 ⋮
 "MMAOP_16x8x32_S32U8S8S32_TN_SATURATE" => "llvm.nvvm.mma.m16n8k32.row.col.satfinite.u8.s8"
 "MMAOP_16x8x32_S32U8U8S32_TN" => "llvm.nvvm.mma.m16n8k32.row.col.u8"
 "MMAOP_16x8x32_S32U8U8S32_TN_SATURATE" => "llvm.nvvm.mma.m16n8k32.row.col.satfinite.u8"
 "MMAOP_8x8x128_S32B1B1S32_TN_XORPOPC" => "llvm.nvvm.mma.xor.popc.m8n8k128.row.col.b1"
 "MMAOP_8x8x128_S32B1B1S32_TN_ANDPOPC" => "llvm.nvvm.mma.and.popc.m8n8k128.row.col.b1"
 "MMAOP_16x8x128_S32B1B1S32_TN_XORPOPC" => "llvm.nvvm.mma.xor.popc.m16n8k128.row.col.b1"
 "MMAOP_16x8x128_S32B1B1S32_TN_ANDPOPC" => "llvm.nvvm.mma.and.popc.m16n8k128.row.col.b1"
 "MMAOP_16x8x256_S32B1B1S32_TN_XORPOPC" => "llvm.nvvm.mma.xor.popc.m16n8k256.row.col.b1"
 "MMAOP_16x8x256_S32B1B1S32_TN_ANDPOPC" => "llvm.nvvm.mma.and.popc.m16n8k256.row.col.b1"
thr_mma = get_slice(mma_C, threadIdx().x);
tCsA = partition_A(thr_mma, sA);
tCsB = partition_B(thr_mma, sB);
tCgC = partition_C(thr_mma, gC);
tCrC = make_fragment_like(tCgC)

The MMA instructions listed above operate on Tensor Cores, which are covered in a later section.
Matmul with Tiled Operations
Now, let's upgrade the matmul_kernel with TiledCopy and TiledMMA.
function matmul_kernel(A, sA_layout, copy_A,
B, sB_layout, copy_B,
C, mma_C)
sA = MoYeSharedArray(eltype(A), sA_layout)
sB = MoYeSharedArray(eltype(B), sB_layout)
mA = MoYeArray(A)
mB = MoYeArray(B)
mC = MoYeArray(C)
bM = size(sA_layout, 1)
bN = size(sB_layout, 1)
bK = size(sB_layout, 2)
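# Tile the global arrays: gA is (bM, bK, K ÷ bK), gB is (bN, bK, K ÷ bK), gC is (bM, bN)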
gA = @tile mA (bM, bK) (blockIdx().x, :)
gB = @tile mB (bN, bK) (blockIdx().y, :)
gC = @tile mC (bM, bN) (blockIdx().x, blockIdx().y)
# Copy partition
thr_copy_a = get_slice(copy_A, threadIdx().x)
tAgA = partition_S(thr_copy_a, gA) # (CPY, CPY_M, CPY_K, k)
tAsA = partition_D(thr_copy_a, sA) # (CPY, CPY_M, CPY_K)
tArA = make_fragment_like(tAsA) # (CPY, CPY_M, CPY_K)
thr_copy_b = get_slice(copy_B, threadIdx().x)
tBgB = partition_S(thr_copy_b, gB) # (CPY, CPY_N, CPY_K, k)
tBsB = partition_D(thr_copy_b, sB) # (CPY, CPY_N, CPY_K)
tBrB = make_fragment_like(tBsB) # (CPY, CPY_N, CPY_K)
# MMA partition
thr_mma = get_slice(mma_C, threadIdx().x)
tCsA = partition_A(thr_mma, sA) # (MMA, MMA_M, MMA_K)
tCsB = partition_B(thr_mma, sB) # (MMA, MMA_N, MMA_K)
tCgC = partition_C(thr_mma, gC) # (MMA, MMA_M, MMA_N)
# Prefetch the first k-tile into registers so the copy overlaps with compute
copyto!(copy_A, tArA, view(tAgA, :, :, :, _1))
copyto!(copy_B, tBrB, view(tBgB, :, :, :, _1))
# Accumulator
tCrC = make_fragment_C(thr_mma, tCgC)
zeros!(tCrC)
k_max = size(tAgA, 4)
for k in 1:k_max
sync_threads()
copyto!(tAsA, tArA)
copyto!(tBsB, tBrB)
sync_threads()
# Load the next tile
k_next = k < k_max ? k+1 : k
copyto!(copy_A, tArA, view(tAgA, :, :, :, k_next))
copyto!(copy_B, tBrB, view(tBgB, :, :, :, k_next))
@gc_preserve gemm!(mma_C, tCrC, tCsA, tCsB, tCrC)
end
copyto!(tCgC, tCrC)
return nothing
end
function matmul(A, B, C)
bM = _128
bN = _128
bK = _8
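# Pad the leading dimension by one element (strides bM + 1 and bN + 1) to reduce shared-memory bank conflicts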
sA_layout = make_layout((bM, bK), (_1, bM + _1))
sB_layout = make_layout((bN, bK), (_1, bN + _1))
TA = eltype(A)
TB = eltype(B)
TC = eltype(C)
copy_A = make_tiled_copy(CopyAtom{UniversalCopy{TA}, TA}(),
@Layout((32, 8)),
@Layout((1, 1)))
copy_B = make_tiled_copy(CopyAtom{UniversalCopy{TB}, TB}(),
@Layout((32, 8)),
@Layout((1, 1)))
mma_C = make_tiled_mma(UniversalFMA{TA,TB, TC}(), # MMA operation
@Layout((32,8))) # Atom layout
threads = Int(size(mma_C))
blocks = (cld(size(A, 1), bM), cld(size(B, 1), bN))
@cuda threads=threads blocks=blocks matmul_kernel(A, sA_layout, copy_A,
B, sB_layout, copy_B,
C, mma_C)
end
function test()
A = CUDA.randn(Float32, 2048, 256)
B = CUDA.randn(Float32, 2048, 256)
C = CUDA.randn(Float32, 2048, 2048)
matmul(A, B, C)
CUDA.synchronize()
@test C ≈ A * B'
CUDA.unsafe_free!(A)
CUDA.unsafe_free!(B)
CUDA.unsafe_free!(C)
end
test()

Vectorized Copy and Memory Coalescing
As mentioned, you can use UniversalCopy{Float64} or UniversalCopy{UInt128} for vectorized copies. However, it is crucial to ensure that memory accesses are coalesced.
An uncoalesced copy:
copy_A = make_tiled_copy(CopyAtom{UniversalCopy{Float64}, TA}(),
@Layout((32, 8)),
@Layout((4, 1)))

Here thread 1's two 64-bit loads cover elements [1, 2] and [3, 4], while thread 2's cover [5, 6] and [7, 8]. Within each warp-wide load, consecutive threads leave gaps between the addresses they touch, so the accesses are not coalesced.
Coalesced copies:
copy_A = make_tiled_copy(CopyAtom{UniversalCopy{Float64}, TA}(),
@Layout((32, 8)),
@Layout((2, 1)))
copy_A = make_tiled_copy(CopyAtom{UniversalCopy{UInt128}, TA}(),
@Layout((32, 8)),
@Layout((4, 1)))

In these examples, adjacent threads access adjacent memory locations on every load, so the accesses are coalesced and memory bandwidth is used effectively.
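As a hedged sketch of how the vectorized, coalesced variant would slot into the matmul driver above (assuming Float32 inputs, so one UInt128 transaction carries four elements and 32 × 4 = 128 rows still cover the bM = 128 tile):

# Sketch: vectorized global loads for the matmul() driver above; TA and TB are the element types defined there.
copy_A = make_tiled_copy(CopyAtom{UniversalCopy{UInt128}, TA}(),
                         @Layout((32, 8)),  # 32 threads × 4 values = 128 rows = bM
                         @Layout((4, 1)))   # 4 Float32s = one 128-bit load
copy_B = make_tiled_copy(CopyAtom{UniversalCopy{UInt128}, TB}(),
                         @Layout((32, 8)),
                         @Layout((4, 1)))

The rest of the driver should be able to stay as-is, since the tiled copies are only used for the global-to-register loads inside the kernel.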