Data Movement
Base.copyto! — Method
copyto!(dest::MoYeArray, src::MoYeArray)
Copy the contents of src to dest. The function vectorizes the copy automatically where possible. In particular, when transferring data from global memory to shared memory, it automatically issues asynchronous copies if the device supports them.
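As an illustration, the following is a minimal sketch of a kernel that stages a tile through shared memory with copyto!. It assumes a CUDA-capable GPU and that MoYeSharedArray, @tile, @parallelize and @Layout (none of which are documented in this section) behave as in MoYe's copy-kernel tutorial. The names copy_kernel and run_copy are hypothetical, and M and N are assumed to be multiples of 32 because the sketch performs no bounds checks.

```julia
using MoYe, CUDA

# Hypothetical kernel: each block copies one (bM, bN) tile via shared memory.
function copy_kernel(dest, src, smemlayout, blocklayout, threadlayout)
    moye_smem = MoYeSharedArray(eltype(dest), smemlayout)
    moye_dest = MoYeArray(dest)
    moye_src  = MoYeArray(src)

    bM = size(blocklayout, 1)
    bN = size(blocklayout, 2)

    # Tile owned by this block (assumed @tile usage).
    blocktile_dest = @tile moye_dest (bM, bN) (blockIdx().x, blockIdx().y)
    blocktile_src  = @tile moye_src  (bM, bN) (blockIdx().x, blockIdx().y)

    # Elements owned by this thread (assumed @parallelize usage).
    threadtile_dest = @parallelize blocktile_dest threadlayout threadIdx().x
    threadtile_src  = @parallelize blocktile_src  threadlayout threadIdx().x
    threadtile_smem = @parallelize moye_smem      threadlayout threadIdx().x

    # Global -> shared: may be issued as an asynchronous copy.
    copyto!(threadtile_smem, threadtile_src)
    cp_async_wait()   # wait for all outstanding asynchronous copies
    sync_threads()    # make the shared-memory tile visible to the whole block

    # Shared -> global.
    copyto!(threadtile_dest, threadtile_smem)
    return nothing
end

# Hypothetical host-side driver; returns true if the copy round-trips.
function run_copy(M, N)
    src  = CUDA.rand(Float32, M, N)
    dest = CUDA.zeros(Float32, M, N)

    blocklayout  = @Layout (32, 32)  # elements per block tile
    smemlayout   = @Layout (32, 32)  # shared-memory tile
    threadlayout = @Layout (32, 8)   # 256 threads per block

    blocks  = (cld(M, size(blocklayout, 1)), cld(N, size(blocklayout, 2)))
    threads = Int(size(threadlayout))

    @cuda blocks=blocks threads=threads copy_kernel(dest, src, smemlayout, blocklayout, threadlayout)
    CUDA.synchronize()
    return dest == src
end
```

The cp_async_wait() call matters only when the global-to-shared copy was actually issued asynchronously; on devices without that support it should be harmless, since there is nothing pending to wait for.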
Base.copyto! — Method
copyto!(ldmatrix::AbstractLdMatrix, dest::MoYeArray{UInt32}, src::MoYeArray{UInt128})
Load data from shared memory to registers. The available AbstractLdMatrix types are:
# Type => LLVM intrinsic
"LDSM_U32x1_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.b16"
"LDSM_U32x2_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.b16"
"LDSM_U32x4_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16"
"LDSM_U16x2_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.trans.b16"
"LDSM_U16x4_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.trans.b16"
"LDSM_U16x8_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.trans.b16"
You can inspect the number and type of registers used per thread as follows:
julia> LDSM_U32x4_N()
LDSM_U32x4_N()
julia> ans.DRegisters
Registers{UInt32, 4}
MoYe.cp_async_wait — Function
cp_async_wait(N::Int32)
cp_async_wait()
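cp_async_wait(N) is equivalent to cp.async.wait.group(N) in CUDA: it blocks until at most N of the most recently committed asynchronous copy groups are still pending. cp_async_wait() is equivalent to cp.async.wait.all: it blocks until all outstanding asynchronous copies have completed.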
MoYe.cp_async_commit — Function
cp_async_commit()
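Equivalent to cp.async.commit.group in CUDA: commits all previously issued, not yet committed asynchronous copies into a single group.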