CUDA Programming: 1. Matrix Multiplication
Notation
- Matrix M
- Matrix N
- Output matrix P
- row counter: i
- column counter: j
P(Row, Col) is the element at the Row-th position in the vertical direction and the Col-th position in the horizontal direction.
As shown in the picture below, P(Row, Col) (the small square in P) is the inner product of:
- the Row-th row of M
- the Col-th column of N
Indexing
The row index for the P element is:
Row = blockIdx.y * blockDim.y + threadIdx.y
The column index for the P element is:
Col = blockIdx.x * blockDim.x + threadIdx.x
With this one-to-one mapping, the Row and Col thread indexes are also the row and column indexes for the output array P.
Simple Kernel Using One Thread per Element
Note that the Width used below is the Width in the picture above.
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // calculate the row index of the P element and of M
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    // calculate the column index of the P element and of N
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; k++) {
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        }
        P[Row * Width + Col] = Pvalue;
    }
}
memory access of the for loop
In every iteration of the loop, two global memory accesses are performed for one floating-point multiplication and one floating-point addition.
memory accesses
- One global memory access fetches an M element
- the other fetches an N element
computation performed
- One floating-point operation multiplies the fetched M and N elements
- the other accumulates the product into Pvalue
compute-to-global-memory-access ratio
The compute-to-global-memory-access ratio of the loop is 1.0 (one floating-point operation per global memory access).
This ratio will likely result in less than 2% utilization of the peak execution speed of modern GPUs.
We need to increase the ratio by at least an order of magnitude for the computation throughput to achieve good utilization of modern devices.
Execution of the Matrix Multiplication Kernel Within a Block
setting
- size of P is 4×4
- BLOCK_WIDTH = 2
map threads to P
With:
- blocks that are 2×2 arrays of threads
- each thread calculating one P element
the P matrix is now divided into four tiles, and each block calculates one tile.
In the example:
- thread(0,0) of block(0,0) calculates P(0,0)
- thread(0,0) of block(1,0) calculates P(2,0)
actions of one thread block
The picture below illustrates the multiplication operations in each thread block.
Threads in block (0,0) produce four dot products.
the execution of the for-loop
Use thread(0,0) in block(0,0) as an example.

During the 0th iteration (k=0):
Row * Width + k = 0*4+0 = 0
k * Width + Col = 0*4+0 = 0
Therefore, we are accessing M[0] and N[0], which are the 1D equivalents of M(0,0) and N(0,0).

During the 1st iteration (k=1):
Row * Width + k = 0*4+1 = 1
k * Width + Col = 1*4+0 = 4
We are accessing M[1] and N[4], which are the 1D equivalents of M(0,1) and N(1,0).
Tiled Kernel
To see how the tiling (corner turning) technique overcomes the fact that row-wise access of matrix M cannot be coalesced, see the Corner Turning section in The CUDA Parallel Programming Model: 5. Memory Coalescing.