Updated March 8, 2023
Introduction to dataset preprocessing
In the real world, data is frequently incomplete: attribute values are missing, specific attributes of interest are absent, or only aggregate data is available. Errors and outliers make the data noisy, and discrepancies in codes or names make it inconsistent. The Keras dataset pre-processing utilities help us convert raw data on disk into a tf.data.Dataset. A dataset is a collection of data that may be used to train a model. In this topic, we are going to learn about dataset preprocessing.
Why use dataset pre-processing?
By pre-processing data, we can:
- Improve the accuracy of our data. We remove values that are wrong or missing as a consequence of human error or technical problems.
- Improve consistency. Discrepancies and duplicates in the data harm the accuracy of the results.
- Make the dataset as complete as possible. If necessary, we can fill in the missing attributes.
- Smooth the data. This makes it easier to use and interpret.
Keras provides a few dataset pre-processing utilities (a short image example follows the list):
- Image
- Text
- Time series
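For example, the image utility can build a tf.data.Dataset straight from a directory of images. A minimal sketch, assuming the images live in a hypothetical images/ folder with one sub-folder per class:

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "images/",              # hypothetical path with one sub-folder per class
    image_size=(180, 180),  # resize every image on load
    batch_size=32,
)

The text and time-series utilities work the same way, returning a tf.data.Dataset that can be fed directly to model.fit().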
Importing datasets for pre-processing
Steps for Importing a dataset in Python:
- Importing appropriate Libraries
We’ll need to import NumPy and Pandas every time we construct a new model. Pandas is used to import and handle data sets, while NumPy is a library that contains mathematical operations and is used for scientific computing.
import numpy as np
import pandas as pd
import matplotlib.pyplot as mpt
- Import Datasets
The datasets are in the .csv format. A CSV file is a plain text file containing tabular data, where each line represents one data record.
dataset = pd.read_csv('Data.csv')
We’ll use pandas’ iloc (used to select rows and columns by position) to read the columns; it takes two parameters: [row selection, column selection].
x = dataset.iloc[:, :-1].values
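Assuming the last column of Data.csv holds the target variable (an assumption; the file layout is not shown here), the dependent variable can be selected in the same way:

y = dataset.iloc[:, -1].values   # only the last column: the target values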
Consider the following incomplete dataset:
Name | Pay   | Managers
AAA  | 40000 | Yes
BBB  | 90000 |
     | 60000 | No
CCC  |       | Yes
DDD  | 30000 | Yes
As we can see, a few cells in the table are missing. To fill them we need to follow a few steps:
- Import the imputer class. In current versions of scikit-learn, the old Imputer class has been replaced by SimpleImputer:
from sklearn.impute import SimpleImputer
- Represent the missing cells as "not a number" (np.nan) when building the DataFrame, for example A = pd.DataFrame(...), so the imputer can detect them.
- Fit the imputer on the data and transform it to fill in the missing values, as in the sketch below.
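A minimal sketch of filling the missing cells from the table above with scikit-learn's SimpleImputer (the DataFrame A mirrors the table; the chosen strategies are just one reasonable option):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

A = pd.DataFrame({
    'Name': ['AAA', 'BBB', np.nan, 'CCC', 'DDD'],
    'Pay': [40000, 90000, 60000, np.nan, 30000],
    'Managers': ['Yes', np.nan, 'No', 'Yes', 'Yes'],
})

# numeric column: replace the missing Pay value with the column mean
pay_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
A[['Pay']] = pay_imputer.fit_transform(A[['Pay']])

# categorical column: replace the missing Managers value with the most frequent one
cat_imputer = SimpleImputer(strategy='most_frequent')
A[['Managers']] = cat_imputer.fit_transform(A[['Managers']])

print(A)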
Splitting the dataset into a training set and a test set.
Importing the library:
from sklearn.model_selection import train_test_split

# A holds the input features and B the target values; 20% of the rows are held out for testing
A_train, A_test, B_train, B_test = train_test_split(A, B, test_size = 0.2)
Feature Scaling
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training data only ...
scale_A = StandardScaler()
A_train = scale_A.fit_transform(A_train)
# ... and reuse the same scaling parameters for the test data
A_test = scale_A.transform(A_test)
Example #1
# importing libraries
from sklearn.preprocessing import StandardScaler
import pandas
import numpy

# link for the required dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# parameters
names = ['sno', 'sname', 'age', 'Type', 'diagnosis', 'in', 'out', 'consultant', 'class']
dataframe = pandas.read_csv(url, names = names)
array = dataframe.values
# the first eight columns are the input features, the last column is the class
X = array[:, 0:8]
Y = array[:, 8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
numpy.set_printoptions(precision = 3)
print(rescaledX[0:6,:])
Explanation
The code above loads the dataset, separates the input features from the target column, and standardizes the features with StandardScaler.
Output: the first six rows of the standardized feature matrix are printed.
Feature dataset pre-processing
- Outliers are removed during pre-processing, and the features are scaled to an equivalent range.
Steps Involved in Data Pre-processing
- Data cleaning: Data can contain a lot of useless and missing information, and data cleaning is carried out to handle this. It involves dealing with missing data, noisy data, and so on. The purpose of data cleaning is to give the machine learning algorithm simple, complete, and unambiguous collections of examples.
a) Missing Data: This occurs when some values in the data are missing. It can be handled in several ways.
Here are a few examples:
Ignore the tuples: This method is only appropriate when the dataset is huge and many values are missing within a tuple.
Fill in the blanks: There are several options for filling the gaps. You can fill the missing values manually, use the attribute mean, or use the most probable value, as in the sketch below.
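A minimal sketch of both options with pandas, using a made-up DataFrame df with a numeric age column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 35, np.nan], 'class': [0, 1, 0, 1, 1]})

# Option 1: ignore the tuples - drop every row that contains a missing value
df_dropped = df.dropna()

# Option 2: fill in the blanks with the attribute mean
df_filled = df.fillna({'age': df['age'].mean()})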
b) Noisy Data: The term “noise” refers to a large volume of additional, meaningless data. Examples include duplicate or near-duplicate records, data segments that have no value for a particular analysis, and unnecessary information fields for individual variables.
It can be dealt with in the following ways:
Method of Binning:
This approach smooths data that has been sorted. The sorted values are divided into equal-sized bins, and each bin is then smoothed, for example by replacing its values with the bin mean or the bin boundaries, as in the sketch below.
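A minimal sketch of smoothing by bin means, using a small made-up series of sorted prices split into three equal-sized bins:

import numpy as np
import pandas as pd

prices = pd.Series(sorted([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# divide the sorted values into three equal-sized bins
bins = np.array_split(prices, 3)

# smoothing by bin means: replace every value in a bin with that bin's mean
smoothed = pd.concat([pd.Series([b.mean()] * len(b), index=b.index) for b in bins])
print(smoothed.values)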
Regression:
Regression analysis helps determine which variables actually have an impact. Fitting a regression function to the data is a way to smooth large amounts of data, which lets you focus on the most important attributes instead of examining a large number of variables (see the sketch below).
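A minimal sketch of regression-based smoothing with scikit-learn, where the predictions of a fitted line replace a noisy variable (the data here is synthetic):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(20).reshape(-1, 1)          # explanatory variable
y = 3 * x.ravel() + rng.normal(0, 5, 20)  # noisy dependent variable

# fit a regression line and use its predictions as the smoothed values
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)
print(y_smoothed[:5])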
Clustering: In this method, similar data points are grouped into clusters. Values that fall outside every cluster can be treated as outliers and handled separately (see the sketch below).
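A minimal sketch of using clustering to flag outliers, here with DBSCAN from scikit-learn (the points are made up; DBSCAN labels points that belong to no cluster as -1):

import numpy as np
from sklearn.cluster import DBSCAN

# two tight groups of points plus one far-away point
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [9, 8],
              [25, 80]])

labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)            # the far-away point is labelled -1, i.e. noise
print(X[labels == -1])   # the points that fall outside every cluster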
- Data Transformation
We’ve already started modifying our data with data cleaning, but data transformation begins the process of converting the data into the right format(s) for analysis and other downstream operations. It usually involves one or more of the following (a short normalization sketch follows the list):
- Aggregation
- Normalization
- Selection of features
- Discretization
- The creation of a concept hierarchy
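A minimal sketch of normalization, rescaling each feature to the range [0, 1] with scikit-learn's MinMaxScaler (the feature matrix here is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[40000.0, 3], [90000.0, 10], [60000.0, 5], [30000.0, 1]])

scaler = MinMaxScaler()                 # maps every column to the [0, 1] range
X_normalized = scaler.fit_transform(X)
print(X_normalized)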
- Data Reduction:
When the volume of data is very large, analysis becomes complicated and expensive. We employ data reduction techniques to overcome this problem. The goal is to improve storage efficiency and reduce the cost of analysis: data reduction not only simplifies and speeds up analysis but also reduces the amount of data that has to be stored.
The following are the steps involved in data reduction:
Attribute selection: Like discretization, this can help us fit the data into smaller groups. It essentially combines attributes, for example merging gender (male/female) with the manager attribute to create a single male manager/female manager attribute.
Numerosity reduction: This reduces the volume of data that has to be stored and transmitted. For example, a regression model can be kept in place of the raw data, using only the variables that are relevant to the analysis at hand.
Dimensionality reduction: This also improves analysis and downstream processing by reducing the number of features. Techniques such as principal component analysis (PCA) project the data onto a smaller number of dimensions while preserving most of the information (see the sketch below).
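A minimal sketch of dimensionality reduction with PCA from scikit-learn, projecting a made-up four-feature matrix down to two components:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 1.1, 0.9],
              [2.2, 2.9, 0.4, 1.2],
              [1.9, 2.2, 0.6, 0.8],
              [3.1, 3.0, 0.3, 1.1]])

pca = PCA(n_components=2)        # keep the two directions with the most variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (5, 2): same rows, fewer features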
Conclusion – dataset preprocessing
To conclude, we have seen dataset pre-processing techniques and their libraries in detail. The dataset should be organized in such a way that many machine learning and deep learning algorithms can be run on it in parallel and the best one chosen.
Recommended Articles
This is a guide to dataset preprocessing. Here we discuss the Dataset processing techniques and their libraries in detail. You may also have a look at the following articles to learn more –