# CUDA Programming

This tutorial introduces the CUDA compilation environment on the servers of the CSE department, using **GPU33** as the example.

### Connect to GPU33 / GPU34

GPU33 and GPU34 are GPU-equipped servers provided by the CSE department. You can connect to them the same way you connect to linux1\~linux15. The commands are repeated below in case you have forgotten them.

```bash
# if you are using cse vpn
ssh cse_account@gpu33

# if you are not using cse vpn, you can connect through gateway.
ssh cse_account@gw.cse.cuhk.edu.hk
# inside gateway machine, type
ssh linux<1-15>  # connect to any of linux1-15 machines
# and then, finally, inside the CSE public Linux machines
ssh cse_account@gpu33
```

The default shell on gpu33 and gpu34 should be `bash`. You can check it with the command `echo "$SHELL"`; if it prints something else, start a bash session by running `bash`.

```bash
# print your default shell
echo "$SHELL"
# if it is not bash, start a bash session
bash
```

### Run a demo

First, you will probably want to know the hardware configuration of the GPU server, especially its GPUs. The following command shows the GPU information:

```bash
nvidia-smi
```

![nvidia-smi output](/files/1DQLuw7ld3VZoEq7hlwZ)

**GPU33** is equipped with 4 [NVIDIA Titan Xp](https://www.nvidia.com/en-au/geforce/products/10series/star-wars-galactic-empire-titan-xp-collectors-edition) GPUs. Although the Titan Xp is not the latest model, it is still powerful: each GPU has 3840 Pascal CUDA cores, which can deliver 12.15 TFlops of single-precision performance, and 12 GB of GDDR5X memory.

For more information about the GPUs installed on gpu33 and gpu34, run `nvidia-smi -a`. To list the options of the tool itself, run `nvidia-smi -h`.

Now let us set up the software environment on the Linux server so that we can fully use the GPUs' computing power. We have prepared a demo that you can **git clone** anywhere under your home directory on **GPU33**. Because the server is behind a proxy, you first have to set the network proxies as follows:

```bash
export http_proxy=http://proxy.cse.cuhk.edu.hk:8000
export https_proxy=http://proxy.cse.cuhk.edu.hk:8000
```

Then git clone the demo repo:

```bash
git clone https://github.com/kuafu1994/GPUDemo.git
```

In this repo, there is an executable file named *matrixMulCUBLAS*. Before running it, run the following commands to give yourself execute permission on it.

```bash
cd GPUDemo/
chmod u+x ./matrixMulCUBLAS
```

If you are not familiar with Linux file permissions, we recommend our [CSCI-3150 lab](https://eric-lo.gitbooks.io/lab9-filesystem/content/) on the file system.

As its name suggests, the executable performs a matrix multiplication. The suffix *CUBLAS* denotes that it invokes the highly tuned [cuBLAS](https://developer.nvidia.com/cublas) library at runtime. Before running the program, we should set an environment variable called `LD_LIBRARY_PATH` to tell the program where to find the shared libraries it needs at runtime. *Note that you may already have set this variable on other servers.*

```bash
# Check whether LD_LIBRARY_PATH is set
echo $LD_LIBRARY_PATH

# If nothing is printed
export LD_LIBRARY_PATH=/usr/local/lib
echo $LD_LIBRARY_PATH

# Then add cuda lib to our environment
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64
```
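The two-step setup above (set a default if the variable is empty, then append) can also be written as one small helper. This is only a sketch: `append_dir` is a hypothetical function name, not something provided by the server or by CUDA. It appends a directory to a colon-separated list only if it is not there yet, and it avoids producing a leading colon when the list is empty, since an empty entry in `LD_LIBRARY_PATH` is treated as the current directory.

```bash
# Append a directory to a colon-separated list unless it is already there.
# Usage: new_value=$(append_dir "$current_value" /some/dir)
# (append_dir is a hypothetical helper written for this tutorial.)
append_dir() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # already present: leave unchanged
        "::")     printf '%s\n' "$2" ;;      # list was empty: no leading colon
        *)        printf '%s\n' "$1:$2" ;;   # otherwise append at the end
    esac
}

# Same directory as in the commands above.
LD_LIBRARY_PATH=$(append_dir "${LD_LIBRARY_PATH:-}" /usr/local/cuda-10.2/lib64)
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

Running the snippet twice leaves the variable unchanged the second time, so it is safe to keep in a startup file.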

First, we can use the `help` option to show the help information.

```
gpu18:~/GPUDemo> ./matrixMulCUBLAS help
[Matrix Multiply CUBLAS] - Starting...
Usage -device=n (n >= 0 for deviceID)
      -wA=WidthA -hA=HeightA (Width x Height of Matrix A)
      -wB=WidthB -hB=HeightB (Width x Height of Matrix B)
  Note: Outer matrix dimensions of A & B matrices must be equal.
```

The `-device` option sets the ID of the GPU on which you want to run the program. By default, matrixMulCUBLAS uses GPU 0; if you pass, for example, `-device=1`, GPU 1 is chosen instead. Because many other users may be running GPU programs on the server, you need to know which GPUs are currently available. The `nvidia-smi` command introduced at the beginning can help. For more information about this tool, refer to the [official documentation](https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf).
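When the server is busy, reading the table by eye gets tedious. As a rough sketch, the choice can be scripted with the CSV query mode of `nvidia-smi`. The helper name `pick_idle_gpu` is made up for this example, and "least memory in use" is only one possible definition of "available"; you may prefer to look at utilization instead.

```bash
# Pick the GPU index with the least memory in use.
# Input on stdin: CSV lines "index, memory.used" (no header, no units).
# (pick_idle_gpu is a hypothetical helper written for this tutorial.)
pick_idle_gpu() {
    sort -t, -k2,2n | head -n 1 | cut -d, -f1 | tr -d ' '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_id=$(nvidia-smi --query-gpu=index,memory.used \
                        --format=csv,noheader,nounits | pick_idle_gpu)
    echo "least-loaded GPU: ${gpu_id}"
else
    echo "nvidia-smi not found; run this on the GPU server"
fi
```

The result could then be passed on as `-device="$gpu_id"`. Keep in mind that another user may start a job on the same GPU right after you check.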

![nvidia-smi output](/files/1AKrK8nJ8snqpLfCtgq3)

The above figure shows that all four GPUs are idle according to the 'Memory-Usage' and 'Volatile GPU-Util' columns. Thus, in this lab, we set `-device=1` and we also specify the dimensions of input matrices A and B with `-wA=1024 -hA=2048 -wB=2048 -hB=1024`.

```bash
# In GPUDemo directory
gpu34:~/GPUDemo> ./matrixMulCUBLAS -device=1 -wA=1024 -hA=2048 -wB=2048 -hB=1024
[Matrix Multiply CUBLAS] - Starting...
gpuDeviceInit() CUDA Device [1]: "NVIDIA TITAN Xp
GPU Device 1: "NVIDIA TITAN Xp" with compute capability 6.1

MatrixA(2048,1024), MatrixB(1024,2048), MatrixC(2048,2048)
Computing result using CUBLAS...done.
Performance= 8789.40 GFlop/s, Time= 0.977 msec, Size= 8589934592 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```

The output above shows that this program achieves 8789.40 GFlop/s of single-precision performance on GPU 1. That is roughly 72% of the 12.15 TFlops peak quoted earlier for the Titan Xp.
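The numbers in this output can be checked by hand. Multiplying an M x K matrix by a K x N matrix takes K multiplications and K additions for each of the M x N results, so 2·M·N·K operations in total. The sketch below redoes that arithmetic; `matmul_ops` and `gflops` are hypothetical helper names, not part of the demo.

```bash
# Operation count for C = A x B, with A of size M x K and B of size K x N.
# Arguments: M N K
matmul_ops() {
    echo $(( 2 * $1 * $2 * $3 ))
}

# GFlop/s from an operation count and a time in milliseconds.
gflops() {
    awk -v ops="$1" -v ms="$2" \
        'BEGIN { printf "%.0f\n", ops / (ms / 1000) / 1e9 }'
}

# MatrixA(2048,1024) x MatrixB(1024,2048): M = 2048, N = 2048, K = 1024
ops=$(matmul_ops 2048 2048 1024)
echo "ops     = $ops"    # 8589934592, the "Size=" value in the output
echo "GFlop/s ~ $(gflops "$ops" 0.977)"
```

The computed rate differs slightly from the 8789.40 reported by the program because the time printed in the output is rounded to three decimals.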

### Compile a GPU program

Now we introduce how to compile a GPU program on our Linux server. In the GPUDemo repo, there is a source file named [vectorAdd.cu](https://github.com/kuafu1994/GPUDemo/blob/master/vectorAdd.cu). It adds two vectors.

To compile this source file, we use another compiler named `nvcc`. `nvcc` is provided by the GPU vendor NVIDIA, and on **GPU33** it is installed in the directory `/usr/local/cuda-11.8/bin/`. However, this directory is not included in the `PATH` environment variable. In Linux, `PATH` specifies the set of directories that are searched for executable programs. Thus, we have to use the absolute path of `nvcc` on the command line to run it and compile `vectorAdd.cu`.

```bash
/usr/local/cuda-11.8/bin/nvcc vectorAdd.cu -o vectorAdd
```

We strongly suggest that you use the `export` command to add the directory `/usr/local/cuda-11.8/bin/` to the `PATH` environment variable.

```bash
export PATH=$PATH:/usr/local/cuda-11.8/bin/
```

Then you can simply type `nvcc` instead of `/usr/local/cuda-11.8/bin/nvcc` to run the compiler. This also helps when you want to use other tools installed in this directory, such as `nvprof`. After obtaining the executable file *vectorAdd*, just run it and have fun.

```
gpu33:~/vectorAdd> ./vectorAdd 
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
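The "196 blocks of 256 threads" line follows from the element count. Assuming the program rounds the number of blocks up so that every element gets a thread (the usual ceiling division, which matches the printed numbers), the arithmetic looks like this. `grid_blocks` is a hypothetical helper name used only for this illustration.

```bash
# Number of blocks needed so that blocks * threads_per_block >= n.
# Arguments: n threads_per_block
grid_blocks() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

grid_blocks 50000 256   # prints 196
```

196 blocks of 256 threads give 50176 threads, so the last 176 threads have no element to process. This is why a kernel like this normally checks its index against the vector length before touching memory.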

### Kill your zombie jobs

Sometimes a job running on a GPU becomes a zombie job because of a bug or for some other reason. If so, you should kill it so that it does not keep occupying resources. To do this, first find your job and its process ID (PID) with `nvidia-smi`:

![nvidia-smi to show your jobs](/files/-MXFoGf_iNYkfPI9XSkx)

The figure above shows a job with PID=5777 running on GPU 0. To kill it, use the following command:

```bash
kill -9 5777
```
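`kill -9` sends SIGKILL, which gives the program no chance to clean up. A gentler variant, sketched below, sends SIGTERM first and falls back to SIGKILL only if the process is still alive after a grace period. `kill_gracefully` is a hypothetical helper name, and the 5-second default is an arbitrary choice.

```bash
# Ask a process to exit with SIGTERM; if it is still alive after the
# grace period (default 5 seconds), send SIGKILL.
# Arguments: pid [grace_seconds]
kill_gracefully() {
    pid=$1
    grace=${2:-5}
    kill -TERM "$pid" 2>/dev/null || return 0    # already gone, or not ours
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# List the processes currently using the GPUs, if the tool is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
fi
```

For the job in the figure, the call would be `kill_gracefully 5777`. You can only signal your own processes, so this has no effect on other users' jobs.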

**Note that killing your zombie jobs is important, because leftover zombie jobs may bring our servers down.**

### Set environment variables in bashrc

We have configured several environment variables using `export` in the sections above. However, whenever you log out and `ssh` to the GPU machine again, you need to set them again. To avoid this, set them in `~/.bashrc` so that they are loaded automatically whenever you start `bash`.

Add the following lines to `~/.bashrc`:

```bash
# setup proxy
export http_proxy=http://proxy.cse.cuhk.edu.hk:8000
export https_proxy=http://proxy.cse.cuhk.edu.hk:8000

# add cuda tools to PATH
export PATH=$PATH:/usr/local/cuda-11.8/bin

# add cuda 10.2 lib to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64

# (optional) just to make your shell prompt look nicer
export PS1="[\[\e[32m\]\W\[\e[m\]]> "
```

### CUDA Programming

CUDA has become very popular in recent years, so there is plenty of material discussing it. I highly recommend the [CUDA programming tutorials](https://devblogs.nvidia.com/even-easier-introduction-cuda/) by Mark Harris. They introduce many CUDA code optimization tricks, which will help you a lot in Asgn1b.

Among those tutorials, you must learn the following:

* [**An Even Easier Introduction to CUDA**](https://devblogs.nvidia.com/even-easier-introduction-cuda/)
* [**How to Implement Performance Metrics in CUDA C/C++**](https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/)
* [**How to Query Device Properties and Handle Errors in CUDA C/C++**](https://devblogs.nvidia.com/how-query-device-properties-and-handle-errors-cuda-cc/)
* [**How to Optimize Data Transfers in CUDA C/C++**](https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/)
* [**How to Overlap Data Transfers in CUDA C/C++**](https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/)
* [**How to Access Global Memory Efficiently in CUDA C/C++ Kernels**](https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/)
* [**Using Shared Memory in CUDA C/C++**](https://devblogs.nvidia.com/using-shared-memory-cuda-cc/)

