Updated March 15, 2023
Definition of DataSet in R
A dataset in R is a collection of data, typically held in a data frame, that is either bundled with a package or loaded from an external source so that it can be stored, managed and used in an analysis. In today’s world of big data, it remains a challenge to find data that is clean and reliable and whose metadata is easy to interpret. RStudio is an Integrated Development Environment (IDE) for R, the language developers use for statistical computing, modelling and graphics.
Datasets in R are available regardless of which edition of RStudio is in use. RStudio comes in 2 editions, one being RStudio Desktop and the other being RStudio Server. The description of datasets given here is edition agnostic and hence applies to whichever one you are using.
How to Read DataSet into R?
A dataset can be of 2 types, each with its own way of being read. The first is a dataset that is bundled with a package, which the developer can access directly. The second is a dataset held in a raw format such as Excel, CSV or a database. Here we will look at each in turn. For packaged datasets, we will cover a limited number of examples without restricting ourselves to one domain: the examples include datasets for both classification and regression problems.
From the pre-defined dataset in the package:
Many of the datasets that ship with R packages also appear in the repository named “UCI Machine Learning”. These datasets are popular because of the following properties:
- One can download the dataset fast.
- The datasets are small and hence can fit into memory.
- The datasets are mostly cleaned, so the data cleaning step can be skipped and one can move straight to running algorithms on them.
These datasets are distributed in packages through the Comprehensive R Archive Network (CRAN), from which developers can conveniently download third-party libraries and keep them installed in the local R library for use in their projects.
Let us look at some of the datasets that are most popular among data science practitioners.
1. Datasets Library
This library comes loaded with the base version of R, so there is no need to install or load it separately. Various datasets come as a part of this bundle. One way to see which datasets are available in this library is to execute the following command.
Code:
library(help = "datasets")
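As a complementary sketch, the base function data() can return the same listing in a form that can be inspected programmatically. The variable names ds and items are illustrative only.

```r
# Query the datasets bundled in the 'datasets' package.
ds <- data(package = "datasets")

# ds$results is a character matrix; keep the dataset name and its title.
items <- as.data.frame(ds$results[, c("Item", "Title")])

# Show the first few entries and the total number of listed datasets.
print(head(items))
print(nrow(items))
```

This form is handy when the listing needs to be filtered or searched, for example with grepl() on the Title column.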
2. Iris Dataset
This dataset records the species of Iris flowers together with measurements of each flower. There are 3 species, described by 4 features, namely Sepal length, Sepal width, Petal length and Petal width. Loading the dataset can be performed by executing the following command.
Code:
data(iris)
This data is widely used for trying algorithms for multi-class classification problems.
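Once loaded, the dataset can be inspected before any algorithm is tried on it. The following is a minimal sketch that uses only base R functions.

```r
data(iris)

# Structure: 150 observations of 5 variables (4 numeric features + Species).
str(iris)

# Class balance: number of rows per species.
print(table(iris$Species))

# Mean of each numeric feature per species.
print(aggregate(. ~ Species, data = iris, FUN = mean))
```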
3. Longley’s Economic Dataset
This dataset contains the number of people employed in a particular year alongside various economic indicators. There are 6 attributes that explain the figure in the column named “Employed”, so one can predict the number of people that might be employed in some future year on the basis of the economic indicators. Loading the dataset can be performed by executing the following command.
Code:
data(longley)
This data is widely used for trying algorithms for regression problems.
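As an illustration of the regression use case, the sketch below fits an ordinary least-squares model with lm(). The choice of lm() and the names fit and pred are assumptions made for this example, not something the dataset prescribes.

```r
data(longley)

# 16 yearly observations of 7 variables (6 indicators plus Employed).
print(dim(longley))

# Fit a linear model: Employed explained by all other columns.
fit <- lm(Employed ~ ., data = longley)
print(summary(fit)$r.squared)

# In-sample predictions for the same rows used for fitting.
pred <- predict(fit, newdata = longley)
print(head(pred))
```

The indicators in this dataset are strongly correlated with each other, so the individual coefficients should be read with caution even when the overall fit looks good.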
4. mlbench Library
This library comprises data for various real-world benchmark problems. One can install the library by executing the following command.
Code:
install.packages("mlbench")
Loading the library can be done by executing the command.
Code:
library(mlbench)
Similar to the datasets library, one can execute the following code to get a list of all the datasets in the mlbench library.
Code:
library(help = "mlbench")
5. Boston Housing Dataset
This dataset contains house prices in the Boston area together with 13 features that can be used to predict them. The dataset is part of the mlbench library, so that library must be loaded first. Loading the dataset can be performed by executing the following command.
Code:
data(BostonHousing)
This data is widely used for trying algorithms for regression problems.
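The sketch below loads the dataset directly from mlbench and prints its size and a summary of the price column. It assumes the mlbench package is installed and skips the work otherwise.

```r
# Assumes mlbench is installed; the block does nothing if it is not.
if (requireNamespace("mlbench", quietly = TRUE)) {
  # Load the dataset from mlbench without attaching the whole package.
  data(BostonHousing, package = "mlbench")

  # Number of rows and columns (13 features plus the target column).
  print(dim(BostonHousing))

  # medv holds the median home value and is the usual regression target.
  print(summary(BostonHousing$medv))
}
```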
6. Diabetes Dataset for Pima Indians (Female)
This dataset records the presence of diabetes in female Pima Indians along with 8 personal attributes such as glucose, pressure, etc. It is also part of the mlbench library. Loading the dataset can be performed by executing the following command.
Code:
data(PimaIndiansDiabetes)
This data is widely used for trying algorithms for binary classification problems.
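To illustrate the binary classification use case, the sketch below checks the class balance and fits a logistic regression with glm() on a single predictor. The choice of model and predictor is an assumption made for this example, and the block is skipped if mlbench is not installed.

```r
# Assumes mlbench is installed; the block does nothing if it is not.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(PimaIndiansDiabetes, package = "mlbench")

  # The target column 'diabetes' is a factor with two levels.
  print(table(PimaIndiansDiabetes$diabetes))

  # Logistic regression of the diabetes outcome on glucose only.
  fit <- glm(diabetes ~ glucose, data = PimaIndiansDiabetes,
             family = binomial)
  print(coef(fit))
}
```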
7. AppliedPredictiveModeling Library
This library comprises data used in the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. One can install the library by executing the following command.
Code:
install.packages("AppliedPredictiveModeling")
Loading the library can be done by executing the command:
Code:
library(AppliedPredictiveModeling)
Similar to the datasets library, one can execute the following code to get a list of all the datasets in the AppliedPredictiveModeling library:
Code:
library(help = "AppliedPredictiveModeling")
From Raw Format Data File
Datasets are mostly present in some raw format such as CSV or Excel.
Below we will see how to load a dataset from each of these formats.
CSV File:
df_csv <- read.csv("<name and extension of file>")
Excel files (Most popular way):
library(xlsx)  # read.xlsx() with the sheetIndex argument comes from the xlsx package
df_excel <- read.xlsx("<name and extension of file>", sheetIndex = <index of the sheet that needs to be loaded>)
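Both readers can be tried end to end without an external file by first writing a small data frame to a temporary location and reading it back. The sketch below does this for CSV with base R, and for Excel only when the xlsx package (which needs Java) is available. The paths and the names csv_path, xlsx_path, df_csv and df_excel are illustrative.

```r
# --- CSV round trip (base R only) ---
csv_path <- tempfile(fileext = ".csv")
write.csv(head(iris), csv_path, row.names = FALSE)

# stringsAsFactors = TRUE turns text columns such as Species into factors.
df_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
str(df_csv)
unlink(csv_path)

# --- Excel round trip (assumes the xlsx package and Java are installed) ---
if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx_path <- tempfile(fileext = ".xlsx")
  xlsx::write.xlsx(head(iris), xlsx_path, row.names = FALSE)

  # sheetIndex = 1 reads the first sheet of the workbook.
  df_excel <- xlsx::read.xlsx(xlsx_path, sheetIndex = 1)
  str(df_excel)
  unlink(xlsx_path)
}
```

Calling the functions as xlsx::read.xlsx() avoids attaching the package and makes it explicit which package supplies the function, since other packages export a function of the same name with different arguments.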
Conclusion
In this article we have looked at the most popular datasets available in R and RStudio. One can explore the other datasets in these libraries by consulting the documentation of the corresponding library.