Data Movement · MoYe.jl

Index

Base.copyto!
Base.copyto!
MoYe.cp_async_commit
MoYe.cp_async_wait

Base.copyto! — Method

copyto!(dest::MoYeArray, src::MoYeArray)

Copy the contents of src to dest. The copy is vectorized automatically where possible. In particular, when transferring data from global memory to shared memory, it automatically uses asynchronous copies if your device supports them.

Base.copyto! — Method

copyto!(ldmatrix::AbstractLdMatrix, dest::MoYeArray{UInt32}, src::MoYeArray{UInt128})

Load data from shared memory into registers. The available AbstractLdMatrix types are:

# Type => LLVM intrinsic
"LDSM_U32x1_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.b16"
"LDSM_U32x2_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.b16"
"LDSM_U32x4_N" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16"
"LDSM_U16x2_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x1.trans.b16"
"LDSM_U16x4_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x2.trans.b16"
"LDSM_U16x8_T" => "llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.trans.b16"

You can inspect the number and type of registers used per thread as follows:

julia> LDSM_U32x4_N()
LDSM_U32x4_N()

julia> ans.DRegisters
Registers{UInt32, 4}
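The register count can also be checked by hand from the tile shape encoded in the intrinsic names (m8n8, b16). The following is a minimal sketch in plain Julia that needs neither MoYe nor a GPU. It assumes a warp of 32 threads cooperating on each load, and regs_per_thread is a hypothetical helper introduced only for this illustration.

```julia
# Size arithmetic for ldmatrix m8n8 b16 tiles (illustration only, not MoYe code).
const WARP_SIZE  = 32   # assumption: one warp of 32 threads shares each load
const TILE_ROWS  = 8    # "m8"
const TILE_COLS  = 8    # "n8"
const ELEM_BYTES = 2    # "b16" elements are 2 bytes wide

const ROW_BYTES  = TILE_COLS * ELEM_BYTES   # 16 bytes per row, the size of a UInt128
const TILE_BYTES = TILE_ROWS * ROW_BYTES    # 128 bytes per 8x8 tile

# Hypothetical helper: UInt32 registers each thread holds after loading `ntiles` tiles.
regs_per_thread(ntiles) = ntiles * TILE_BYTES ÷ (WARP_SIZE * sizeof(UInt32))

@assert ROW_BYTES == sizeof(UInt128)

for n in (1, 2, 4)
    println("x$n: ", regs_per_thread(n), " UInt32 register(s) per thread")
end
# prints:
# x1: 1 UInt32 register(s) per thread
# x2: 2 UInt32 register(s) per thread
# x4: 4 UInt32 register(s) per thread
```

Under these assumptions the x4 variant yields four UInt32 registers per thread, which agrees with the Registers{UInt32, 4} shown above. A 16-byte tile row also matches the UInt128 element type of src.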

MoYe.cp_async_wait — Function

cp_async_wait(N::Int32)
cp_async_wait()

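cp_async_wait(N) is equivalent to cp.async.wait.group(N) in CUDA: it waits until at most N of the most recently committed groups are still pending. cp_async_wait() is equivalent to cp.async.wait.all: it waits for all outstanding asynchronous copies to complete.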

MoYe.cp_async_commit — Function

cp_async_commit()

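Equivalent to cp.async.commit.group in CUDA: commits all previously issued, not yet committed asynchronous copies into a single group.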
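To make the interplay between committing and waiting concrete, here is a toy host-side model of the group bookkeeping. It is only a sketch: it is not how MoYe or the hardware implements these operations, it does not run on the GPU, and every name in it (AsyncCopyModel, issue!, commit!, wait_group!, wait_all!) is hypothetical.

```julia
# Toy model of cp.async group bookkeeping (illustration only, not MoYe code).
mutable struct AsyncCopyModel            # hypothetical type
    uncommitted::Vector{Symbol}          # copies issued since the last commit
    pending::Vector{Vector{Symbol}}      # committed groups, oldest first
    done::Vector{Symbol}                 # copies known to have completed
end
AsyncCopyModel() = AsyncCopyModel(Symbol[], Vector{Symbol}[], Symbol[])

# Record one asynchronous copy that has been issued but not yet committed.
issue!(m, name::Symbol) = (push!(m.uncommitted, name); m)

# Models cp_async_commit(): batch all uncommitted copies into one group.
function commit!(m)
    push!(m.pending, copy(m.uncommitted))
    empty!(m.uncommitted)
    return m
end

# Models cp_async_wait(N): complete the oldest groups until at most n remain pending.
function wait_group!(m, n::Integer)
    while length(m.pending) > n
        append!(m.done, popfirst!(m.pending))
    end
    return m
end

# Models cp_async_wait(): commit whatever is outstanding, then wait for everything.
wait_all!(m) = wait_group!(commit!(m), 0)

m = AsyncCopyModel()
issue!(m, :tile_A0); issue!(m, :tile_B0); commit!(m)   # first group
issue!(m, :tile_A1); issue!(m, :tile_B1); commit!(m)   # second group
wait_group!(m, 1)     # first group is complete; second may still be in flight
println(m.done)       # prints [:tile_A0, :tile_B0]
```

This pattern is what makes software pipelining possible: a kernel can commit the loads for the next tile and then wait with N = 1, so it only blocks on the tile it is about to consume.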
