Data Movement

Base.copyto! - Method
copyto!(dest::MoYeArray, src::MoYeArray)

Copy the contents of src to dest. The copy is vectorized automatically where possible. In particular, when transferring data from global memory to shared memory, it issues asynchronous copies automatically if the device supports them.

Base.copyto! - Method
copyto!(ldmatrix::AbstractLdMatrix, dest::MoYeArray{UInt32}, src::MoYeArray{UInt128})

Load data from shared memory to registers. The available AbstractLdMatrix types are:

# Type => LLVM intrinsic
"LDSM_U32x1_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.b16"
"LDSM_U32x2_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.b16"
"LDSM_U32x4_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16"
"LDSM_U16x2_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.trans.b16"
"LDSM_U16x4_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.trans.b16"
"LDSM_U16x8_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.trans.b16"

To inspect the number and type of registers used per thread, construct the type and query its DRegisters field:

julia> LDSM_U32x4_N()
LDSM_U32x4_N()

julia> ans.DRegisters
Registers{UInt32, 4}
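
The type names and intrinsic suffixes in the table above follow a regular pattern: the .x1, .x2, or .x4 suffix gives the number of 8x8 matrices of 16-bit elements loaded per warp, and each thread receives one 32-bit register per matrix, which is consistent with the Registers{UInt32, 4} shown for LDSM_U32x4_N. The following is a minimal sketch in plain Julia that does not call MoYe. The helper ldmatrix_info is hypothetical (it is not part of the library); it decodes a table entry and checks that the element width and count in the type name agree with the per-thread register size.

```julia
# Table copied from the docstring above: type name => LLVM intrinsic.
const LDMATRIX_INTRINSICS = Dict(
    "LDSM_U32x1_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.b16",
    "LDSM_U32x2_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.b16",
    "LDSM_U32x4_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16",
    "LDSM_U16x2_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.trans.b16",
    "LDSM_U16x4_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.trans.b16",
    "LDSM_U16x8_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.trans.b16",
)

# Hypothetical helper (not a MoYe function): decode one table entry.
function ldmatrix_info(name::AbstractString)
    intrinsic = LDMATRIX_INTRINSICS[name]
    # ".x1." / ".x2." / ".x4." is the number of 8x8 b16 matrices loaded.
    nmat = parse(Int, match(r"\.x(\d)\.", intrinsic).captures[1])
    # The type name encodes the element width in bits and the element count per thread.
    m = match(r"U(\d+)x(\d+)", name)
    elbits = parse(Int, m.captures[1])
    nelem = parse(Int, m.captures[2])
    return (matrices = nmat,
            transposed = occursin(".trans.", intrinsic),
            bytes_per_thread = 4 * nmat,              # one 32-bit register per matrix
            bytes_from_name = div(elbits, 8) * nelem)
end

# Print every entry and confirm the two byte counts agree.
for name in sort(collect(keys(LDMATRIX_INTRINSICS)))
    info = ldmatrix_info(name)
    @assert info.bytes_per_thread == info.bytes_from_name
    println(rpad(name, 14), info)
end
```

Read this way, the transposed variants differ only in how the result is described: LDSM_U16x8_T names eight 16-bit values, which occupy the same 16 bytes per thread as the four 32-bit registers of LDSM_U32x4_N.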

MoYe.cp_async_wait - Function
cp_async_wait(N::Int32)
cp_async_wait()

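cp_async_wait(N) corresponds to the PTX instruction cp.async.wait_group N, which waits until at most N of the most recently committed asynchronous copy groups are still pending. cp_async_wait() corresponds to cp.async.wait_all, which waits for all outstanding asynchronous copies to complete.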
