CUDA Programming
Prepared by Pengfei Zhang (pfzhang@cse.cuhk.edu.hk)
In this tutorial, we introduce the CUDA compilation environment on the servers of the CSE department, taking GPU18 as an example.
Connect to GPU18 / GPU19
GPU18 and GPU19 are servers provided by the CSE department with GPUs installed. You can connect to them just like you connect to linux1~linux15. The following commands are provided in case you forget.
# if you are using cse vpn
ssh cse_account@gpu18
# if you are not using cse vpn, you can connect through gateway.
ssh cse_account@gw.cse.cuhk.edu.hk
# inside gateway machine, type
ssh cse_account@gpu18
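If you connect through the gateway often, one convenient option (assuming your local OpenSSH is recent enough to support ProxyJump) is to add an entry like the following to ~/.ssh/config on your own machine, so that a single ssh gpu18 hops through the gateway automatically:
# ~/.ssh/config on your local machine (hostnames mirror the commands above)
Host gpu18
    HostName gpu18
    User cse_account
    ProxyJump cse_account@gw.cse.cuhk.edu.hk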
The default shell on gpu18 and gpu19 should be bash. You can check it with the command echo "$SHELL"; if it is not bash, you can switch by simply running bash.
# check your current shell
echo "$SHELL"
# if it is not bash, start a bash shell
bash
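If you would rather make bash your permanent login shell (assuming the department account system allows changing it), chsh is the usual way:
chsh -s /bin/bash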
Run a demo
First, you probably want to know the hardware configuration of the GPU server, especially the GPUs themselves. Use the following command to view the GPU information:
nvidia-smi

GPU18 is equipped with 4 GeForce GTX 1080Ti GPUs. Although the GTX 1080Ti is not the latest model, it is still very powerful: it has 3584 Pascal CUDA cores delivering 10.6 TFLOPS of single-precision performance, and 11 GB of GDDR5X memory.
For more information about the GPUs installed on gpu18 and gpu19, you can use the commands nvidia-smi -a and nvidia-smi -h.
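If you prefer a compact, machine-readable summary, nvidia-smi also supports query options; the field list below is just one possible selection:
nvidia-smi --query-gpu=index,name,memory.total,utilization.gpu --format=csv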
Now it is time to introduce how to set up the software environment on the Linux server and fully utilize the GPU computation power. We prepared a demo that you can git clone anywhere in your home directory on GPU18. Since the server is behind a proxy, you have to set the network proxies as follows:
export http_proxy=http://proxy.cse.cuhk.edu.hk:8000
export https_proxy=http://proxy.cse.cuhk.edu.hk:8000
Then git clone the demo repo:
git clone https://github.com/kuafu1994/GPUDemo.git
In this repo, there is an executable file named matrixMulCUBLAS. Before running it, we have to run the following commands to make it executable:
cd GPUDemo/
chmod u+x ./matrixMulCUBLAS
If you are not familiar with Linux file permissions, we recommend doing the File System lab of our CSCI-3150 course.
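As a quick check, you can list the permission bits and confirm that an x now appears in the owner's permissions:
ls -l ./matrixMulCUBLAS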
From the name of the executable, it is easy to tell that it evaluates a matrix multiplication. The suffix CUBLAS indicates that it invokes the highly tuned cuBLAS library at runtime. Before running the program, we should set an environment variable called LD_LIBRARY_PATH to tell the program where to find the shared libraries it needs at runtime. Notice that you may have set this environment variable on other servers before.
# Check whether LD_LIBRARY_PATH is set
echo $LD_LIBRARY_PATH
# If nothing is printed
export LD_LIBRARY_PATH=/usr/local/lib
echo $LD_LIBRARY_PATH
# Then add cuda lib to our environment
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64
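To verify that the dynamic loader can now find the CUDA libraries, you can list the executable's shared-library dependencies with ldd; every library, including the cuBLAS ones, should resolve to a path rather than showing "not found":
ldd ./matrixMulCUBLAS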
First, we can use the help option to show the usage information.
gpu18:~/GPUDemo> ./matrixMulCUBLAS help
[Matrix Multiply CUBLAS] - Starting...
Usage -device=n (n >= 0 for deviceID)
-wA=WidthA -hA=HeightA (Width x Height of Matrix A)
-wB=WidthB -hB=HeightB (Width x Height of Matrix B)
Note: Outer matrix dimensions of A & B matrices must be equal.
The -device option sets the ID of the GPU on which the program runs. By default, matrixMulCUBLAS uses GPU 0; if you specify, for example, -device=1, then GPU 1 will be chosen instead. Since there may be other users running GPU programs on the server, it is necessary to know which GPU is available. The command introduced at the beginning, nvidia-smi, can help. For more information about this tool, you can refer to the official documentation.
[Figure: nvidia-smi output showing all four GPUs idle]
The above figure shows that all four GPUs are idle, according to the 'Memory-Usage' and 'Volatile GPU-Util' columns. Thus, in this lab, we set -device=1 and also specify the dimensions of the input matrices A and B with -wA=1024 -hA=2048 -wB=2048 -hB=1024.
# In GPUDemo directory
gpu18:~/GPUDemo> ./matrixMulCUBLAS -device=1 -wA=1024 -hA=2048 -wB=2048 -hB=1024
[Matrix Multiply CUBLAS] - Starting...
gpuDeviceInit() CUDA Device [1]: "NVIDIA GeForce GTX 1080 Ti
GPU Device 1: "NVIDIA GeForce GTX 1080 Ti" with compute capability 6.1
MatrixA(2048,1024), MatrixB(1024,2048), MatrixC(2048,2048)
Computing result using CUBLAS...done.
Performance= 8398.15 GFlop/s, Time= 1.023 msec, Size= 8589934592 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
The above output shows that this program achieves 8398.15 GFlop/s of single-precision performance on GPU 1, close to the card's 10.6 TFLOPS peak.
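As a sanity check, the "Ops" figure equals the usual matrix-multiply operation count 2 x M x N x K = 2 x 2048 x 2048 x 1024 = 8,589,934,592, and dividing it by the 1.023 ms runtime gives about 8.4 x 10^12 FLOP/s, matching the reported GFlop/s number.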
Compile a GPU program
Now, we will introduce how to compile a GPU program on our Linux server. In the GPUDemo repo, there is a source file named vectorAdd.cu, which performs an element-wise addition of two vectors.
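To give you a feel for what such a program looks like, here is a minimal vector-addition sketch in CUDA; it is not necessarily identical to the vectorAdd.cu in the repo, but it illustrates the same kernel-launch and host/device memory-copy pattern:
// vector_add_sketch.cu -- illustrative only; the repo's vectorAdd.cu may differ
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element: C[i] = A[i] + B[i].
__global__ void vectorAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may contain extra threads
        C[i] = A[i] + B[i];
}

int main()
{
    const int n = 50000;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize the host (CPU) vectors.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Allocate the device (GPU) vectors and copy the inputs over.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    // Copy the result back and spot-check one element.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected 3.0)\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}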
To compile this source file, another compiler named nvcc is used. nvcc is provided by the GPU vendor NVIDIA and is installed in the directory /usr/local/cuda/bin/ on GPU18. However, this directory is not included in the PATH environment variable. In Linux, PATH specifies the set of directories where executable programs are located, so we have to use the absolute path of nvcc on the command line to execute it and compile vectorAdd.cu.
/usr/local/cuda/bin/nvcc vectorAdd.cu -o vectorAdd
We strongly suggest using the export command to add the directory /usr/local/cuda/bin/ to the PATH environment variable.
export PATH=$PATH:/usr/local/cuda/bin/
Then you can simply use nvcc instead of /usr/local/cuda/bin/nvcc to run the compiler. This also helps when you want to use other tools installed in the same directory, such as nvprof. After obtaining the executable file vectorAdd, just run it and have fun.
gpu18:~/vectorAdd> ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
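If you also want to see where the time goes, nvprof (one of the tools installed in /usr/local/cuda/bin/, as mentioned above) can profile the same executable; a basic invocation is simply:
nvprof ./vectorAdd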
Kill your zombie jobs
Sometimes a job running on the GPU may become a zombie because of bugs or other reasons. If so, you should kill the zombie job to avoid occupying too many resources. To kill your jobs, first check them and get the process ID with nvidia-smi.
[Figure: nvidia-smi output showing a job with PID 5777 on GPU 0]
We can see from the above figure that I have a job with PID=5777 running on GPU 0. To kill it, you can use the following command:
kill -9 5777
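Before running kill, you can also confirm that the process really belongs to you, for example with the PID 5777 from the figure above:
ps -o user,pid,cmd -p 5777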
Notice that killing your zombie jobs is important, as leaving them running may bring our servers down.
Set environment variables in bashrc
We have configured several environment variables using export in the sections above. However, whenever you log out and ssh to the GPU machine again, you need to set those environment variables again. To avoid this, you can set them in ~/.bashrc so that they are loaded automatically whenever you use bash.
Add the following lines to ~/.bashrc:
# setup proxy
export http_proxy=http://proxy.cse.cuhk.edu.hk:8000
export https_proxy=http://proxy.cse.cuhk.edu.hk:8000
# add cuda tools to PATH
export PATH=$PATH:/usr/local/cuda/bin
# add cuda lib to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64
# (optional) make your shell prompt look nicer
export PS1="[\[\e[32m\]\W\[\e[m\]]> "
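After editing ~/.bashrc, the settings take effect on your next login; to apply them to the current shell immediately, source the file:
source ~/.bashrc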
CUDA Programming
As CUDA has become very popular in recent years, there are numerous materials discussing it. I highly recommend reading the nice CUDA programming tutorials by Mark Harris. They introduce many CUDA code optimization tricks, which will help you a lot in Asgn1b.
Among those tutorials, you must learn the following: