Updated April 6, 2023
Introduction to PyTorch CUDA
Compute Unified Device Architecture, or CUDA, enables parallel computing in PyTorch through a set of APIs, with a graphics processing unit doing the processing for the models. Calculations can run on both the CPU and the GPU under the CUDA architecture, which is the advantage of using CUDA in any system. Developers can use C, C++, Fortran, MATLAB, and Python to write programs that run on the CUDA architecture.
What is PyTorch CUDA?
CUDA operations are set up and run through the torch.cuda module, which keeps track of the currently selected GPU; all CUDA tensors we create are allocated on that device by default. It is best to allocate a tensor to a device once, after which operations run without any further device bookkeeping, because PyTorch looks only at where the tensor lives. Note that if tensors live on different devices, we should not run operations between them unless peer-to-peer memory access is enabled between the devices.
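As a minimal sketch of this device model (assuming at least one CUDA GPU is present), allocating a tensor on a device makes subsequent operations run there and keeps the result on the same device:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.ones(3, 3, device=device)  # allocated directly on the selected device
y = x * 2                            # runs on the same device as x
print(y.device)                      # the result stays on that device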
Using CUDA with PyTorch
- We must first check whether CUDA is available on the system. The call returns a Boolean, and if the result is False, make sure a GPU is switched on in the system.
torch.cuda.is_available()
It is also good to know about the CUDA devices in the system, and the commands below help with that (device index 0 is used as an example).
torch.cuda.current_device()
torch.cuda.get_device_name(0)
torch.cuda.memory_allocated(0)
torch.cuda.memory_reserved(0)
- Cached memory can be released from CUDA using the following command.
torch.cuda.empty_cache()
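As a rough illustration of the effect (assuming a CUDA device is present), the reserved-memory counter can be inspected before and after the call:
import torch

x = torch.empty(1024, 1024, device='cuda')  # allocate ~4 MB on the GPU
del x                                       # the block returns to PyTorch's caching allocator
print(torch.cuda.memory_reserved(0))        # still reserved by the allocator
torch.cuda.empty_cache()                    # hand unused cached blocks back to the driver
print(torch.cuda.memory_reserved(0))        # reserved memory drops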
- If we have several CUDA devices and plan to allocate tasks to a particular device, we must mention the device's ID in the operation. The three assignments below are equivalent ways of placing a tensor on cuda:1.
cuda1 = torch.device('cuda:1')
tensor = torch.tensor([0., 0.], device=cuda1)
tensor = torch.tensor([0., 0.]).to(cuda1)
tensor = torch.tensor([0., 0.]).cuda(cuda1)
- We can change the default CUDA device easily by specifying the ID.
torch.cuda.set_device(1)
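As an alternative to changing the default globally, torch.cuda.device also works as a context manager that switches the current device only inside a block; a brief sketch (it assumes at least two GPUs):
import torch

with torch.cuda.device(1):                     # cuda:1 is current inside this block
    a = torch.tensor([1., 2.], device='cuda')  # therefore allocated on cuda:1
print(a.device)                                # the tensor stays on cuda:1 afterwards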
- It is easy to make a few GPU devices invisible by setting an environment variable; note that it must be set before CUDA is initialized in the process.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"
PyTorch Model on GPU
There are three steps involved in training a PyTorch model on a GPU using CUDA methods: code the neural network, move the model to the GPU, and start training on the system. Initially, we can check whether the model is on the GPU or not by running the code below.
next(net.parameters()).is_cuda
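If the check returns False, the model first has to be moved to the GPU; a minimal sketch with a stand-in module in place of net:
import torch
import torch.nn as nn

net = nn.Linear(10, 2)                 # stand-in for any nn.Module
if torch.cuda.is_available():
    net = net.cuda()                   # moves all parameters and buffers to the GPU
print(next(net.parameters()).is_cuda)  # True once the model sits on the GPU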
Assuming the result is True, the next step is parallelization. We must split the given data into batches that can be handed to the different CUDA devices in the system, as in the snippet below.
import os
import torch
from torch.nn import DataParallel

GPU = 0, 1  # a single int selects one GPU; a tuple selects several
gpu_list = ''
multi_gpus = False
if isinstance(GPU, int):
    gpu_list = str(GPU)
else:
    multi_gpus = True
    for i, gpu_id in enumerate(GPU):
        gpu_list += str(gpu_id)
        if i != len(GPU) - 1:
            gpu_list += ','
os.environ['CUDA_VISIBLE_DEVICES'] = gpu_list

net = net.cuda()  # net is the network defined earlier
if multi_gpus:
    # After CUDA_VISIBLE_DEVICES is set, the visible GPUs are renumbered 0..n-1.
    net = DataParallel(net, device_ids=list(range(len(GPU))))
The next step is to load a trained PyTorch model from a checkpoint with this code.
cuda = torch.cuda.is_available()
net = MobileNetV3()  # the network class defined for this model
checkpoint = torch.load('path/to/checkpoint/')
net.load_state_dict(checkpoint['net_state_dict'])
if cuda:
    net = net.cuda()
net.eval()
result = net(image)  # image is a preprocessed input tensor
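One caveat: a checkpoint saved from a GPU fails to load on a CPU-only machine unless map_location is given; a short sketch reusing the placeholder path and the net from above:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = torch.load('path/to/checkpoint/', map_location=device)  # remap storages to the available device
net.load_state_dict(checkpoint['net_state_dict'])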
PyTorch CUDA Support
CUDA lets PyTorch do all of this work with the help of tensors, parallelization, and streams.
- CUDA helps manage tensors: it tracks which GPU is currently selected and allocates tensors of the matching type on it. The device that holds a tensor is where its operations run, and the results are saved to the same device. Cross-device operations are not performed implicitly, so there is no chance of mixing devices and losing results.
- The parallelization approach of CUDA helps compute several operations within a short span of time. PyTorch copies data to the devices and, by default, launches GPU operations asynchronously, queuing them for the device to execute. For debugging, execution can be forced to be synchronous (for example, by setting the environment variable CUDA_LAUNCH_BLOCKING=1), which makes it easier to identify which operation raised an error before proceeding to the next step.
- We also have streams in CUDA, which impose a linear order on the operations submitted to a device. A default stream is present on every device, so we need not create one ourselves; all operations within a stream are serialized in submission order. When several streams or devices are used at once, proper synchronization methods should be used to avoid reading results before they are ready; PyTorch provides these, and a sketch follows this list.
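A minimal sketch of streams and synchronization (it assumes a CUDA device is present; the tensor size is arbitrary):
import torch

s = torch.cuda.Stream()                     # a new stream on the current device
a = torch.randn(1000, 1000, device='cuda')  # created on the default stream
s.wait_stream(torch.cuda.current_stream())  # make stream s wait for the work above
with torch.cuda.stream(s):
    b = a @ a                               # queued on stream s; the call returns immediately
torch.cuda.synchronize()                    # block the CPU until all queued GPU work finishes
print(b.sum().item())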
PyTorch CUDA Step-by-Step Example
We can check the CUDA version and the identity of the current CUDA device using this code.
import torch

print(f"Is CUDA present in this system? {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
cuda_id = torch.cuda.current_device()
print(f"Identity of current CUDA device: {cuda_id}")
print(f"Name of CUDA device: {torch.cuda.get_device_name(cuda_id)}")
The attribute and methods below are what we use to inspect a tensor's device and move it between devices.
Tensor.device
Tensor.to(device_name)
Tensor.cpu()
We can handle tensors using CUDA devices.
import torch

a = torch.randint(1, 10, (10, 10))  # created on the CPU by default
print(a.device)                     # cpu
res_cpu = a ** 2                    # computed on the CPU
a = a.to(torch.device('cuda'))      # move the tensor to the GPU
print(a.device)                     # cuda:0
res_gpu = a ** 2                    # the same operation now runs on the GPU
assert torch.equal(res_cpu, res_gpu.cpu())  # both devices give the same result
Machine learning models can be handled using CUDA.
import torch
import torchvision.models as models
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = models.resnet18(pretrained=True)
model = model.to(device)
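Inputs must live on the same device as the model before running inference; a short usage sketch continuing the snippet above with a dummy image batch:
model.eval()                                     # switch to inference mode
x = torch.randn(1, 3, 224, 224, device=device)   # dummy input on the model's device
with torch.no_grad():
    out = model(x)
print(out.shape)                                 # torch.Size([1, 1000])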
PyTorch CUDA Methods
Various methods in deep learning and neural networks are simplified using CUDA.
- We can store tensors of various kinds and run the same models on the GPU using CUDA.
- If we have several GPUs, we can select any one of them to work with by giving the proper ID to the system. Alternatively, we can use all the GPUs in the system for different applications, with each device computing its calculations separately and storing the results on the GPU itself.
- Data parallelism can be done faster in PyTorch with the help of CUDA; a minimal sketch follows this list. Multiprocessing can also be done effectively using CUDA, where several worker processes are each set up with a data loader; of course, the model copies should then be set up with connections between them so that their parameters stay in step.
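A minimal sketch of data parallelism with nn.DataParallel (the linear model is a stand-in; it assumes at least one CUDA device):
import torch
import torch.nn as nn

model = nn.Linear(10, 2)            # any nn.Module works the same way
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the module on every visible GPU
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(64, 10, device=next(model.parameters()).device)
out = model(x)    # the batch is scattered across the GPUs and gathered back
print(out.shape)  # torch.Size([64, 2])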
Conclusion
Copies to the GPU happen faster from page-locked (pinned) memory, where pinning returns a copy of an object whose data is placed in a pinned region. From pinned memory, asynchronous GPU copies can be used to overlap data transfers with computation. A DataLoader can be asked to return batches placed in pinned memory by passing pin_memory=True.
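A hedged sketch of that pattern (the dataset here is a stand-in TensorDataset):
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, pin_memory=True)  # batches land in pinned memory

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for x, y in loader:
    # Copies from pinned memory can run asynchronously with respect to the host.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward and backward passes would go here ...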