Updated April 4, 2023
Introduction to PyTorch DataLoader
PyTorch DataLoader automates the collection of samples into batches and can load data in parallel, which makes data loading faster and more memory efficient. A DataLoader combines a dataset and a sampler to provide an iterable over the dataset. Depending on the amount of data and the speed required, loading can be single-process or multi-process, and it works with both map-style and iterable-style datasets, with a customizable loading order.
What is PyTorch DataLoader?
DataLoader can load batched or non-batched data, and batching can be done automatically. Map-style datasets implement the __getitem__() and __len__() protocols and represent a map from indices (or keys) to data samples: given an index, a map-style dataset looks up the corresponding sample, which may be read from disk. Iterable-style datasets instead implement the __iter__() protocol and represent an iterable stream of samples; calling iter(dataset) returns that stream, and the data can come from disks or folders just as with map-style datasets.
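As a minimal sketch of the two styles (the class names and toy data here are only illustrative):

from torch.utils.data import Dataset, IterableDataset

# Map-style: implements __getitem__() and __len__()
class SquaresMapDataset(Dataset):
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return idx * idx

# Iterable-style: implements __iter__()
class SquaresIterableDataset(IterableDataset):
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return iter(i * i for i in range(self.n))

print(SquaresMapDataset(5)[3])          # 9, fetched by index
print(list(SquaresIterableDataset(5)))  # [0, 1, 4, 9, 16], consumed as a stream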
The data loading order can also be customized by the user, which makes dynamic batch sizes possible along with batched reading of files. Any Sampler can be used to define the order in which samples are drawn; for example, a RandomSampler draws samples in random order, which is what shuffled training with stochastic gradient descent typically relies on. The shuffle argument switches between random and sequential ordering.
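For example, a plain list can stand in for a map-style dataset to show the ordering options (note that shuffle and sampler are mutually exclusive arguments):

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

data = list(range(10))  # any map-style dataset works here

# shuffle=True internally uses a RandomSampler
loader = DataLoader(data, batch_size=4, shuffle=True)

# Equivalently, a sampler can be passed explicitly
loader = DataLoader(data, batch_size=4, sampler=RandomSampler(data))

# SequentialSampler yields indices in order (the default when shuffle=False)
loader = DataLoader(data, batch_size=4, sampler=SequentialSampler(data))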
Complete guide to PyTorch DataLoader
If the data to be loaded is unstructured, we should be careful to use the proper libraries for loading it. DataLoader helps in loading and iterating over data of any kind, which is why it is so widely used in PyTorch. The first step is to import DataLoader from the utilities module.
from torch.utils.data import DataLoader
We should pass the dataset along with the batch size and several other arguments, as shown below.
DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
)
The batch_size argument sets how many samples each batch contains, and the shuffle parameter controls whether the data is reshuffled at every epoch. Multi-process loading is enabled by setting num_workers to any value greater than 0; this is the number of worker processes used to load the dataset.
num_workers=0 means the data is loaded in the main process, while num_workers=1 spawns a single additional worker process, which can be slow. The collate_fn argument specifies how a list of samples is merged into a batch, which matters mainly for map-style datasets. If we want to move the data to CUDA tensors quickly, we can set the pin_memory argument to True: the loader then copies tensors into pinned (page-locked) host memory, which speeds up transfer to the GPU.
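As a small sketch of how these arguments fit together (the toy data and the collate function here are only illustrative):

import torch
from torch.utils.data import DataLoader

data = [(torch.randn(7), i % 2) for i in range(20)]  # toy (features, label) pairs

def my_collate(samples):
    # Merge a list of (features, label) samples into one batch
    xs = torch.stack([x for x, _ in samples])
    ys = torch.tensor([y for _, y in samples])
    return xs, ys

loader = DataLoader(data,
                    batch_size=4,
                    shuffle=True,
                    num_workers=2,         # two worker processes load batches
                    collate_fn=my_collate,
                    pin_memory=True)       # pin host memory for faster GPU copies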
Custom Text Dataset Class
We can create custom datasets using PyTorch.
import torch
from torch.utils.data import Dataset, DataLoader
We can create datasets of our choice in any class. Here the class is called TextDataset.
class TextDataset(Dataset):
    def __init__(self, content, titles):
        self.titles = titles
        self.content = content
Two variables are needed in the dataset, here titles and content, and they are stored in __init__ as shown above. The __len__ method below returns the length of the dataset, and __getitem__ returns the sample at a given index.
    def __len__(self):
        return len(self.titles)

    def __getitem__(self, idx):
        label = self.titles[idx]
        text = self.content[idx]
        sample = {"Text": text, "Class": label}
        return sample
The code above assembles one sample per index. Next, the data for the dataset is constructed.
content = ['India', 'China', 'SriLanka', 'Nepal', 'Afghanistan']
titles = ['Peninsula', 'Country', 'Island', 'Country', 'Country']
Now we initialize the dataset class with the data we have just created.
TextData = TextDataset(content, titles)
The data is ready for use with all the given details.
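For instance, the dataset can now be handed to a DataLoader and iterated (the batch size below is chosen arbitrarily):

TextLoader = DataLoader(TextData, batch_size=2, shuffle=True)
for batch in TextLoader:
    print(batch['Text'], batch['Class'])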
How to create a PyTorch DataLoader?
You should first create a dataset class, as in the skeleton below.
import torch as T

class Mynewdata(T.utils.data.Dataset):
    # code to load the data from my_train_data.txt goes here
    ...

my_datas = Mynewdata("my_train_data.txt")
my_loadr = T.utils.data.DataLoader(my_datas, batch_size=10, shuffle=True)
for (idx, batch) in enumerate(my_loadr):
    ...  # process one batch per iteration
Here the batch size is 10 and shuffle is set to True, so batches arrive in random order. A fuller version of such a dataset class might look like this, assuming a tab-separated data file with seven predictor columns followed by one integer label column:
import numpy as np
import torch as T

device = T.device("cpu")

class mypeople(T.utils.data.Dataset):
    def __init__(self, src_file, num_rows=None):
        # Assumes a tab-separated file whose first 7 columns are
        # float predictors and whose 8th column is an integer label
        x_tmp = np.loadtxt(src_file, usecols=range(0, 7),
                           delimiter="\t", max_rows=num_rows,
                           dtype=np.float32)
        y_tmp = np.loadtxt(src_file, usecols=7,
                           delimiter="\t", max_rows=num_rows,
                           dtype=np.int64)
        self.x_data = T.tensor(x_tmp, dtype=T.float32).to(device)
        self.y_data = T.tensor(y_tmp, dtype=T.long).to(device)

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        if T.is_tensor(idx):
            idx = idx.tolist()
        preds = self.x_data[idx, 0:7]
        pol = self.y_data[idx]
        sample = {'predictors': preds, 'political': pol}
        return sample
Now, the dataset and dataloader must be created using code.
train_file = ".\\people_train.txt"
train_datas = mypeople(train_file, num_rows=8)
bat_size = 3
train_loadr = T.utils.data.DataLoader(train_datas,
                                      batch_size=bat_size, shuffle=True)
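Each element yielded by train_loadr is a dictionary of batched tensors, so a loop over it might look like this (a sketch; the last batch may be smaller than bat_size):

for (batch_idx, batch) in enumerate(train_loadr):
    X = batch['predictors']   # shape: [bat_size, 7]
    Y = batch['political']    # shape: [bat_size]
    print(batch_idx, X.shape, Y.shape)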
PyTorch DataLoader examples
1. We can use built-in datasets with PyTorch DataLoader. The MNIST dataset is used here, and the digit images are normalized as part of the transform. The iter() function then creates an iterator over the DataLoader so that a batch of images can be fetched for further processing.
import torch
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                                ])
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

dataiter = iter(trainloader)
images, labels = next(dataiter)
print(images.shape)
print(labels.shape)
plt.imshow(images[1].numpy().squeeze(), cmap='Greys_r')
2. We can also use a custom dataset with DataLoader. Here, a dataset of random numbers stands in for the custom data and is loaded through the DataLoader.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(75, 4))  # toy stand-in dataset
loader = DataLoader(dataset, batch_size=15, shuffle=True, num_workers=5)
for i, batch in enumerate(loader):
    print(i, batch)
The data is served in batches of 15 samples each, and 5 worker processes are assigned to load them in parallel. This gives us the dataset output batch by batch.
Conclusion
DataLoader arranges data into batches so that it can be analyzed easily with PyTorch. Moreover, custom datasets are easy to create, and they are a good choice whenever the data must be read or transformed to fit our requirements. These are the fundamentals of using DataLoader in PyTorch.