Data Movement
Base.copyto! — Method
copyto!(dest::MoYeArray, src::MoYeArray)
Copy the contents of src to dest. The copy is vectorized automatically where possible. In particular, when transferring data from global memory to shared memory, it issues asynchronous copies if the device supports them.
Base.copyto! — Method
copyto!(ldmatrix::AbstractLdMatrix, dest::MoYeArray{UInt32}, src::MoYeArray{UInt128})
Load data from shared memory to registers. The available AbstractLdMatrix types are:
# Type => LLVM intrinsic
"LDSM_U32x1_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.b16"
"LDSM_U32x2_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.b16"
"LDSM_U32x4_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16"
"LDSM_U16x2_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.trans.b16"
"LDSM_U16x4_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.trans.b16"
"LDSM_U16x8_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.trans.b16"
You can inspect the number and type of registers used per thread with:
julia> LDSM_U32x4_N()
LDSM_U32x4_N()
julia> ans.DRegisters
Registers{UInt32, 4}
MoYe.cp_async_wait — Function
cp_async_wait(N::Int32)
cp_async_wait()
cp_async_wait(N) is equivalent to cp.async.wait.group(N), and cp_async_wait() is equivalent to cp.async.wait.all in CUDA.
MoYe.cp_async_commit — Function
cp_async_commit()
Equivalent to cp.async.commit.group in CUDA.
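The functions on this page are typically used together inside a kernel. The following is a minimal sketch of one way to combine them, not a definitive pattern: stage_tile! is a hypothetical helper name, and smem_tile and gmem_tile are assumed to be MoYeArrays backed by shared and global memory with compatible layouts. sync_threads comes from CUDA.jl. The function is meant to be called from device code.

```julia
using MoYe, CUDA

# Hypothetical device-side helper (not part of MoYe).
# Assumes smem_tile lives in shared memory and gmem_tile in global memory.
function stage_tile!(smem_tile, gmem_tile)
    # May be lowered to asynchronous copies on devices that support them.
    copyto!(smem_tile, gmem_tile)
    # Close the current group of asynchronous copies.
    cp_async_commit()
    # Block until all outstanding asynchronous copies have completed.
    cp_async_wait()
    # Make the staged data visible to every thread in the block.
    sync_threads()
    return nothing
end
```

When several tiles are in flight at once, cp_async_wait(N) can replace cp_async_wait() so that up to N of the most recently committed groups are allowed to remain pending.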