CUDA Programming

Prepared by Pengfei Zhang (pfzhang@cse.cuhk.edu.hk)

This tutorial introduces the CUDA compilation environment on the servers of the CSE department. We take GPU18 as an example.

Connect to GPU18 / GPU19

GPU18 and GPU19 are servers provided by the CSE department in which GPUs are installed. You can connect to them just as you connect to linux1~linux15. The following commands are provided in case you forget.

# if you are using cse vpn
ssh cse_account@gpu18

# if you are not using cse vpn, you can connect through gateway.
ssh cse_account@gw.cse.cuhk.edu.hk
# inside gateway machine, type
ssh cse_account@gpu18

The default shell on gpu18 and gpu19 should be bash. You can check it with the command echo "$SHELL". If it is not bash, you can start a bash session by running the command bash.

echo "$SHELL"
bash

Run a demo

First, you will probably want to know the hardware configuration of the GPU server, especially its GPUs. The following command shows the GPU information:

nvidia-smi

GPU18 is equipped with four GeForce GTX 1080 Ti GPUs. Although the GTX 1080 Ti is not the latest model, it is still very powerful: it has 3584 Pascal CUDA cores, which can deliver 10.6 TFLOPS of single-precision performance, and 11 GB of GDDR5X memory.

For more information about the GPUs installed on gpu18 and gpu19, you can use the command nvidia-smi -a. The command nvidia-smi -h lists the available options.

Now it is time to set up the software environment on the Linux server and make full use of the GPU's computing power. We have prepared a demo that you can git clone anywhere under your home directory on GPU18. Because the server is behind a proxy, you first have to set the network proxies as follows:

export http_proxy=http://proxy.cse.cuhk.edu.hk:8000
export https_proxy=http://proxy.cse.cuhk.edu.hk:8000

Then git clone the demo repo:

git clone https://github.com/kuafu1994/GPUDemo.git

In this repo, there is an executable file named matrixMulCUBLAS. Before running it, we have to run the following commands to give ourselves execute permission on the file.

cd GPUDemo/
chmod u+x ./matrixMulCUBLAS

If you are not familiar with Linux file permissions, we recommend doing our CSCI-3150 File System lab.

As its name suggests, the executable performs a matrix multiplication. The suffix CUBLAS denotes that it invokes the highly tuned cuBLAS library at runtime. Before running the program, we should set an environment variable called LD_LIBRARY_PATH to tell the program where to find the shared libraries it needs at runtime. Note that you may already have set this environment variable on other servers.

# Check whether LD_LIBRARY_PATH is set
echo $LD_LIBRARY_PATH

# If nothing is printed
export LD_LIBRARY_PATH=/usr/local/lib
echo $LD_LIBRARY_PATH

# Then add cuda lib to our environment
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64
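The check for an empty LD_LIBRARY_PATH matters because an empty entry in the list (for example, a leading colon) is interpreted by the dynamic linker as the current working directory. As a minimal sketch of an alternative, assuming bash or another POSIX shell, the ${VAR:+...} expansion adds the colon only when the variable already has a value. The variable name cuda_lib is made up for illustration; the CUDA path is the one used in this tutorial.

```shell
# Directory holding the CUDA shared libraries (path taken from this tutorial).
cuda_lib=/usr/local/cuda-10.2/lib64

# ${VAR:+text} expands to "text" only when VAR is set and non-empty,
# so no stray leading colon is produced when LD_LIBRARY_PATH is empty.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$cuda_lib"

echo "$LD_LIBRARY_PATH"
```

With this form, the separate "if nothing is printed" step is not needed, although setting /usr/local/lib first as shown above works just as well.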

First, we can use the help option to show the help information.

gpu18:~/GPUDemo> ./matrixMulCUBLAS help
[Matrix Multiply CUBLAS] - Starting...
Usage -device=n (n >= 0 for deviceID)
      -wA=WidthA -hA=HeightA (Width x Height of Matrix A)
      -wB=WidthB -hB=HeightB (Width x Height of Matrix B)
  Note: Outer matrix dimensions of A & B matrices must be equal.

The -device option sets the ID of the GPU on which you want to run the program. By default, matrixMulCUBLAS uses GPU 0. If an ID is provided on the command line, that GPU is used instead; for example, -device=1 selects GPU 1. Because many other users may be running GPU programs on the server, you need to know which GPUs are currently available. The nvidia-smi command introduced at the beginning can help. For more information about this tool, you can refer to the official documentation.
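Reading the nvidia-smi table by eye is enough for this lab, but the same check can be scripted. The following is only a sketch: the helper name pick_idle_gpu and the sample rows are made up for illustration, and "least memory in use" is just one possible definition of an idle GPU. It relies on the --query-gpu and --format options of nvidia-smi, which print one CSV row per GPU.

```shell
# Hypothetical helper: read "index, memory.used, utilization.gpu" rows on
# stdin and print the index of the GPU with the least memory in use.
pick_idle_gpu() {
    awk -F', *' 'NR == 1 || $2 + 0 < min { min = $2 + 0; idx = $1 } END { print idx }'
}

# Made-up sample rows (index, MiB used, % utilization), for illustration only.
sample='0, 10240, 97
1, 3, 0
2, 512, 15
3, 8000, 60'
picked=$(printf '%s\n' "$sample" | pick_idle_gpu)
echo "sample pick: GPU $picked"

# On the server itself, feed the helper with live data instead.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
               --format=csv,noheader,nounits | pick_idle_gpu
fi
```

The printed index can then be passed to the demo as -device=<index>. Another user may start a job between the check and your run, so treat the result as a hint.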

The nvidia-smi output in the above figure shows that all four GPUs are idle, according to the 'Memory-Usage' and 'Volatile GPU-Util' columns. Thus, in this lab, we set -device=1 and specify the dimensions of the input matrices A and B with -wA=1024 -hA=2048 -wB=2048 -hB=1024.

# In GPUDemo directory
gpu18:~/GPUDemo> ./matrixMulCUBLAS -device=1 -wA=1024 -hA=2048 -wB=2048 -hB=1024
[Matrix Multiply CUBLAS] - Starting...
gpuDeviceInit() CUDA Device [1]: "NVIDIA GeForce GTX 1080 Ti
GPU Device 1: "NVIDIA GeForce GTX 1080 Ti" with compute capability 6.1

MatrixA(2048,1024), MatrixB(1024,2048), MatrixC(2048,2048)
Computing result using CUBLAS...done.
Performance= 8398.15 GFlop/s, Time= 1.023 msec, Size= 8589934592 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The output above shows that this program achieves 8398.15 GFlop/s of single-precision performance on GPU 1. That is roughly 79% of the 10.6 TFLOPS peak quoted earlier for the GTX 1080 Ti.
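The Size and Performance numbers in the output can be reproduced by hand. The sketch below assumes the usual operation count for a matrix product, 2 x M x N x K (one multiplication and one addition per term), and uses the rounded time printed by the program. The variable names are chosen for illustration.

```shell
# Dimensions reported by the run: C(2048,2048) = A(2048,1024) x B(1024,2048).
m=2048   # rows of C
n=2048   # columns of C
k=1024   # shared inner dimension

# Each element of C needs k multiplications and k additions.
ops=$((2 * m * n * k))
echo "operations: $ops"

# Reported time (ms) and the 10.6 TFLOPS single-precision peak quoted earlier.
time_ms=1.023
peak_gflops=10600

# awk handles the floating-point part; the shell only does integer arithmetic.
gflops=$(awk -v ops="$ops" -v ms="$time_ms" 'BEGIN { printf "%.0f", ops / (ms / 1000) / 1e9 }')
percent=$(awk -v g="$gflops" -v p="$peak_gflops" 'BEGIN { printf "%.0f", 100 * g / p }')
echo "throughput: about $gflops GFlop/s, about $percent% of peak"
```

The operation count matches the Size field exactly. The throughput comes out slightly below the printed 8398.15 GFlop/s, presumably because the printed time is rounded to three decimals.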

Compile a GPU program

Now we will show how to compile a GPU program on our Linux server. In the GPUDemo repo, there is a source file named vectorAdd.cu, which adds two vectors.

To compile this source file, we use a compiler named nvcc. nvcc is provided by the GPU vendor NVIDIA and is installed in the directory /usr/local/cuda/bin/ on GPU18. However, this directory is not included in the PATH environment variable. In Linux, PATH specifies the set of directories that the shell searches for executable programs. Thus, we have to use the absolute path of nvcc on the command line to run it and compile vectorAdd.cu.

/usr/local/cuda/bin/nvcc vectorAdd.cu -o vectorAdd

We strongly suggest that you use the export command to add the directory /usr/local/cuda/bin/ to the PATH environment variable.

export PATH=$PATH:/usr/local/cuda/bin/

Then you can simply type nvcc instead of /usr/local/cuda/bin/nvcc to run the compiler. This also helps when you want to use other tools installed in this directory, such as nvprof. After obtaining the executable file vectorAdd, just run it and have fun.

gpu18:~/vectorAdd> ./vectorAdd 
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
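The launch configuration in this output (196 blocks of 256 threads for 50000 elements) is consistent with rounding the block count up. The sketch below reproduces the arithmetic under the assumption that vectorAdd.cu computes its grid size by integer ceiling division, as CUDA vector-addition examples commonly do; we have not quoted the source file here.

```shell
n=50000        # number of vector elements, from the output
threads=256    # threads per block, from the output

# Integer ceiling division: round up so a partially filled last block is still launched.
blocks=$(( (n + threads - 1) / threads ))

# Threads in the last block that have no element to process.
spare=$(( blocks * threads - n ))

echo "$blocks blocks of $threads threads, $spare threads with no element"
```

Because some threads in the last block have no element, kernels written this way usually compare the thread index against the vector length before touching memory.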

Kill your zombie jobs

Sometimes a job running on a GPU may become a zombie job because of bugs or for other reasons. If so, you should kill it so that it does not keep occupying resources. To do so, first list your jobs and find the process ID (PID) with nvidia-smi.

We can see from the above figure that I have a job with PID=5777 running on GPU 0. To kill it, you can use the following command:

kill -9 5777

Note that killing your zombie jobs is important, as leftover jobs may bring our servers down.
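kill -9 sends SIGKILL, which gives the program no chance to clean up. A gentler pattern, sketched here as a suggestion rather than a department policy, is to send the default SIGTERM first and fall back to SIGKILL only if the process is still there after a few seconds. The helper name stop_job and the 5-second default grace period are made up for illustration.

```shell
# Hypothetical helper: stop_job PID [GRACE_SECONDS]
# Sends SIGTERM, waits up to GRACE_SECONDS (default 5), then sends SIGKILL
# if the PID still exists.
stop_job() {
    job_pid=$1
    grace=${2:-5}

    # If the signal cannot be delivered, the process is gone or is not ours.
    kill -TERM "$job_pid" 2>/dev/null || return 0

    # kill -0 sends no signal; it only tests whether the PID still exists.
    waited=0
    while [ "$waited" -lt "$grace" ] && kill -0 "$job_pid" 2>/dev/null; do
        sleep 1
        waited=$((waited + 1))
    done

    if kill -0 "$job_pid" 2>/dev/null; then
        kill -KILL "$job_pid" 2>/dev/null || true
    fi
}

# Example (PID taken from the nvidia-smi listing):
# stop_job 5777
```

You can only signal your own processes, so the helper does nothing to jobs owned by other users.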

Set environment variables in bashrc

We have configured several environment variables using export in the sections above. However, whenever you log out and ssh to the GPU machine again, you need to set those environment variables again. To avoid this, you can set them in ~/.bashrc so that they are loaded automatically whenever you start bash.

Add the following lines to ~/.bashrc:

# setup proxy
export http_proxy=http://proxy.cse.cuhk.edu.hk:8000
export https_proxy=http://proxy.cse.cuhk.edu.hk:8000

# add cuda tools to PATH
export PATH=$PATH:/usr/local/cuda/bin

# add cuda lib to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64

# (optional) make your shell prompt look nicer
export PS1="[\[\e[32m\]\W\[\e[m\]]> "
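~/.bashrc is read every time a new bash session starts, including when you run bash inside an existing session, so plain appends make PATH and LD_LIBRARY_PATH grow with repeated entries. As an optional sketch, the helper below appends a directory only if it is not already listed. The function name append_once is made up for illustration; the directories are the ones used in this tutorial.

```shell
# Hypothetical helper: append_once VARNAME DIR
# Appends DIR to the colon-separated variable VARNAME unless it is already there.
append_once() {
    # Read the current value of the variable whose name is in $1.
    eval "current=\${$1:-}"
    case ":$current:" in
        *":$2:"*) ;;                                   # already listed, do nothing
        *) export "$1=${current:+$current:}$2" ;;      # append, no stray colon if empty
    esac
}

append_once PATH /usr/local/cuda/bin
append_once LD_LIBRARY_PATH /usr/local/cuda-10.2/lib64
```

If you use it, these two calls take the place of the two plain export lines for PATH and LD_LIBRARY_PATH in ~/.bashrc.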

CUDA Programming

Because CUDA has become so popular in recent years, there are numerous materials discussing it. I highly recommend reading the CUDA programming tutorials by Mark Harris. He introduces many CUDA code optimization tricks, which will help you a lot in Asgn1b.

Among those tutorials, you must learn the following:
