Matrix Transpose Tutorial
This tutorial illustrates the process copying data between global memory and shared memory using MoYe
.
In this tutorial, we will use the following configuration:
- Array size: 2048 x 2048
- Block size: 32 x 32
- Thread size: 32 x 8
Copy Kernel
We start with a copy kernel.
using MoYe, Test, CUDA
function copy_kernel(dest, src, smemlayout, blocklayout, threadlayout)
moye_smem = MoYeSharedArray(eltype(dest), smemlayout)
moye_dest = MoYeArray(dest)
moye_src = MoYeArray(src)
bM = size(blocklayout, 1)
bN = size(blocklayout, 2)
blocktile_dest = @tile moye_dest (bM, bN) (blockIdx().x, blockIdx().y)
blocktile_src = @tile moye_src (bM, bN) (blockIdx().x, blockIdx().y)
threadtile_dest = @parallelize blocktile_dest threadlayout threadIdx().x
threadtile_src = @parallelize blocktile_src threadlayout threadIdx().x
threadtile_smem = @parallelize moye_smem threadlayout threadIdx().x
for i in eachindex(threadtile_smem)
threadtile_smem[i] = threadtile_src[i]
end
for i in eachindex(threadtile_dest)
threadtile_dest[i] = threadtile_smem[i]
end
return nothing
end
function test_copy_async(M, N)
a = CUDA.rand(Float32, M, N)
b = CUDA.rand(Float32, M, N)
blocklayout = @Layout (32, 32) # 32 * 32 elements in a block
smemlayout = @Layout (32, 32) # 32 * 32 elements in shared memory
threadlayout = @Layout (32, 8) # 32 * 8 threads in a block
bM = size(blocklayout, 1)
bN = size(blocklayout, 2)
blocks = (cld(M, bM), cld(N, bN))
threads = Int(size(threadlayout))
@cuda blocks=blocks threads=threads copy_kernel(a, b, smemlayout, blocklayout, threadlayout)
CUDA.synchronize()
@test a == b
end
test_copy_async(2048, 2048)
Code Explanation
The device function follows these steps:
- Allocate shared memory using
MoYeSharedArray
with a static layout. - Wrap the destination and source arrays with dynamic layouts.
- Get the size of each block in the grid (bM and bN).
- Create local tiles for the destination and source arrays using
@tile
. - Partition the local tiles into thread tiles using
@parallelize
. - Copy data from the source thread tile to the shared memory thread tile.
- Synchronize threads.
- Copy data back from the shared memory thread tile to the destination thread tile.
The host function tests the copy_kernel function with the following steps:
- Define the dimensions M and N for the source and destination arrays.
- Create random GPU arrays a and b with the specified dimensions using CUDA.rand.
- Define the block and thread layouts using
@Layout
for creating static layouts. - Calculate the number of blocks in the grid using
cld
. Here we assume the divisibility.
A few things to notice here:
@tile
means that all of our blocks cover the entire array.- Each block contains 32 x 32 elements of the original array, but we have 32 x 8 threads per block, which means that each thread processes 4 elements. The code
@parallelize blocktile_dest threadlayout threadIdx().x
returns the set of elements that the thread corresponding to threadIdx().x is processing, which in this case is an array of length 4.
- Once we have completed all the tiling, we just perform computations as if we were dealing with a regular array:
for i in eachindex(threadtile_smem)
threadtile_smem[i] = threadtile_src[i]
end
You need not concern yourself with index bookkeeping, it is implicitly handled by the layout; instead, concentrate on the computation aspect, as it is a fundamental objective of MoYe.jl.
Additionally, you can use the copyto!
function for static MoYeArray
with two key feature: copying from global memory to shared memory automatically calls cp.async
(Requires sm_80
or higher), and automatic vectorization when possible.
Here is how it would look like using copyto!
.
function copy_kernel(dest, src, smemlayout, blocklayout, threadlayout)
moye_smem = MoYeSharedArray(eltype(dest), smemlayout)
moye_dest = MoYeArray(dest)
moye_src = MoYeArray(src)
bM = size(blocklayout, 1)
bN = size(blocklayout, 2)
blocktile_dest = @tile moye_dest (bM, bN) (blockIdx().x, blockIdx().y)
blocktile_src = @tile moye_src (bM, bN) (blockIdx().x, blockIdx().y)
threadtile_dest = @parallelize blocktile_dest threadlayout threadIdx().x
threadtile_src = @parallelize blocktile_src threadlayout threadIdx().x
threadtile_smem = @parallelize moye_smem threadlayout threadIdx().x
copyto!(threadtile_smem, threadtile_src)
cp_async_wait()
copyto!(threadtile_dest, threadtile_smem)
return nothing
end
Padding Shared Memory
Note that in the above code, the layout of the shared memory is the same as the block layout. However, we often need to pad the shared array to avoid bank conflicts. We just need to change one line of code:
smemlayout = @Layout (32, 32) (1, 31) # pad one row
Also note that our kernel will recompile for different static layout parameters.
Transpose kernel
Now we turn to the transpose kernel.
function transpose_kernel(dest, src, smemlayout, blocklayout, threadlayout)
moye_smem = MoYeSharedArray(eltype(dest), smemlayout)
moye_src = MoYeArray(src)
moye_dest = MoYeArray(dest)
bM = size(blocklayout, 1)
bN = size(blocklayout, 2)
blocktile_src = @tile moye_src (bM, bN) (blockIdx().x, blockIdx().y)
blocktile_dest = @tile moye_dest (bN, bM) (blockIdx().y, blockIdx().x)
threadtile_dest = @parallelize blocktile_dest threadlayout threadIdx().x
threadtile_src = @parallelize blocktile_src threadlayout threadIdx().x
threadtile_smem = @parallelize moye_smem threadlayout threadIdx().x
copyto!(threadtile_smem, threadtile_src)
cp_async_wait()
sync_threads()
moye_smem′ = MoYe.transpose(moye_smem)
threadtile_smem′ = @parallelize moye_smem′ threadlayout threadIdx().x
copyto!(threadtile_dest, threadtile_smem′)
return nothing
end
function test_transpose(M, N)
a = CUDA.rand(Float32, M, N)
b = CUDA.rand(Float32, N, M)
blocklayout = @Layout (32, 32)
smemlayout = @Layout (32, 32) (1, 33)
threadlayout = @Layout (32, 8)
bM = size(blocklayout, 1)
bN = size(blocklayout, 2)
blocks = (cld(M, bM), cld(N, bN))
threads = Int(size(threadlayout))
@cuda blocks=blocks threads=threads transpose_kernel(a, b, smemlayout, blocklayout, threadlayout)
CUDA.synchronize()
@test a == transpose(b)
end
test_transpose(2048, 2048)
It is almost identical to the copy kernel, but we would need to transpose the shared memory by simply transposing its layout
moye_smem′ = MoYe.transpose(moye_smem)
and then compute the new thread tiles. Note that each thread would work on different elements now so we need to call sync_threads()
.