Data Movement
Base.copyto! — Method
copyto!(dest::MoYeArray, src::MoYeArray)
Copy the contents of src to dest. The function vectorizes the copy automatically where possible. In particular, when transferring data from global memory to shared memory, it issues asynchronous copies if the device supports them.
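As a rough illustration of the global-to-shared path described above, the following sketch stages a tile through shared memory and copies it back out. It assumes a CUDA-capable device and CUDA.jl. The names MoYeSharedArray, @tile, @parallelize and @Layout, the MoYeArray(dest) constructor, and the helper run_copy are not defined in this section. They are assumed from MoYe's tiling API (run_copy is purely hypothetical) and may differ between package versions, so treat this as a sketch rather than the package's reference usage.

```julia
using MoYe, CUDA

# Sketch only: MoYeSharedArray, @tile, @parallelize and @Layout are assumed
# from MoYe's tiling API and may differ between package versions.
function copy_kernel(dest, src, smemlayout, blocklayout, threadlayout)
    smem = MoYeSharedArray(eltype(dest), smemlayout)  # staging buffer in shared memory

    moye_dest = MoYeArray(dest)
    moye_src  = MoYeArray(src)

    bM = size(blocklayout, 1)
    bN = size(blocklayout, 2)

    # Tile of the global arrays handled by this thread block.
    blocktile_dest = @tile moye_dest (bM, bN) (blockIdx().x, blockIdx().y)
    blocktile_src  = @tile moye_src  (bM, bN) (blockIdx().x, blockIdx().y)

    # Elements of each tile handled by this thread.
    threadtile_dest = @parallelize blocktile_dest threadlayout threadIdx().x
    threadtile_src  = @parallelize blocktile_src  threadlayout threadIdx().x
    threadtile_smem = @parallelize smem           threadlayout threadIdx().x

    copyto!(threadtile_smem, threadtile_src)   # global -> shared, may be asynchronous
    cp_async_wait()                            # wait before reading shared memory
    copyto!(threadtile_dest, threadtile_smem)  # shared -> global
    return nothing
end

# Hypothetical host-side launcher for the kernel above.
function run_copy(M, N)
    dest = CUDA.zeros(Float32, M, N)
    src  = CUDA.rand(Float32, M, N)
    blocklayout  = @Layout (32, 32)  # elements per block tile
    smemlayout   = @Layout (32, 32)  # shared-memory tile
    threadlayout = @Layout (32, 8)   # threads per block
    blocks  = (cld(M, size(blocklayout, 1)), cld(N, size(blocklayout, 2)))
    threads = Int(size(threadlayout))
    @cuda blocks=blocks threads=threads copy_kernel(dest, src, smemlayout, blocklayout, threadlayout)
    CUDA.synchronize()
    return dest, src
end
```

Each thread reads back only the shared-memory elements it wrote itself, so the sketch needs cp_async_wait() but no block-wide barrier. A kernel in which threads read each other's elements would need additional synchronization.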
Base.copyto! — Method
copyto!(ldmatrix::AbstractLdMatrix, dest::MoYeArray{UInt32}, src::MoYeArray{UInt128})
Load data from shared memory to registers. The available AbstractLdMatrix types are:
# Type => LLVM intrinsic
"LDSM_U32x1_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.b16"
"LDSM_U32x2_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.b16"
"LDSM_U32x4_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16"
"LDSM_U16x2_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.trans.b16"
"LDSM_U16x4_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.trans.b16"
"LDSM_U16x8_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.trans.b16"
You can inspect the number and type of registers used per thread as follows:
julia> LDSM_U32x4_N()
LDSM_U32x4_N()
julia> ans.DRegisters
Registers{UInt32, 4}
MoYe.cp_async_wait — Function
cp_async_wait(N::Int32)
cp_async_wait()
cp_async_wait(N) is equivalent to cp.async.wait.group(N), and cp_async_wait() is equivalent to cp.async.wait.all in CUDA.
MoYe.cp_async_commit — Function
cp_async_commit()
Equivalent to cp.async.commit.group in CUDA.
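To make the commit/wait bookkeeping concrete, here is a host-side toy model in plain Julia. It does not call MoYe or the GPU. AsyncCopyQueue, issue!, commit!, wait_group! and wait_all! are hypothetical names introduced only for this illustration. The model assumes the usual CUDA semantics: a commit closes the copies issued so far into a group, waiting on N blocks until at most N of the most recently committed groups are still pending, and waiting on all also covers copies that were never committed.

```julia
# Host-side toy model of asynchronous-copy groups; no GPU or MoYe required.
mutable struct AsyncCopyQueue
    staged::Vector{String}           # copies issued but not yet committed
    pending::Vector{Vector{String}}  # committed groups still in flight, oldest first
    completed::Vector{String}        # copies known to have finished
end
AsyncCopyQueue() = AsyncCopyQueue(String[], Vector{String}[], String[])

# Issue an asynchronous copy (models a global -> shared copy).
issue!(q::AsyncCopyQueue, name::String) = (push!(q.staged, name); q)

# Models cp_async_commit(): close all staged copies into one group.
function commit!(q::AsyncCopyQueue)
    push!(q.pending, copy(q.staged))
    empty!(q.staged)
    return q
end

# Models cp_async_wait(N): return once at most n groups are still pending.
function wait_group!(q::AsyncCopyQueue, n::Integer)
    while length(q.pending) > n
        append!(q.completed, popfirst!(q.pending))  # oldest group finishes first
    end
    return q
end

# Models cp_async_wait(): everything issued so far has finished.
wait_all!(q::AsyncCopyQueue) = wait_group!(commit!(q), 0)

q = AsyncCopyQueue()
issue!(q, "tile0"); commit!(q)
issue!(q, "tile1"); commit!(q)
wait_group!(q, 1)   # tile0 is ready, tile1 may still be in flight
```

Waiting with N = 1 after committing two groups is the pattern behind double buffering: the older tile can be consumed while the newer one is still loading.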