Introduction to Data Analytics
Data analytics is the science of analyzing raw data to draw conclusions from it. It refers to the techniques used to analyze data in order to improve the productivity and profit of a business. Data is extracted from different sources and cleaned so that various patterns can be analyzed. Many data analytics techniques and processes have been automated into algorithms that turn raw data into insights fit for human consumption.
Types of Data Analytics
The data analytics process is broadly categorized into three types, based on the purpose of the analysis:
- Descriptive Analytics
- Predictive Analytics
- Prescriptive Analytics
The features of each of these types are described below:
1. Descriptive Analytics
Descriptive Analytics focuses on summarizing past data to derive inferences.
The most commonly used measures to characterize a historical data distribution quantitatively include the following (a short code sketch follows the list):
- Measures of Central Tendency: Mean, Median, Mode
- Measures of Variability or Spread: Range, Quartiles, Inter-Quartile Range, Percentiles
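As a short illustration, these measures can be computed with NumPy on a hypothetical sample:

```python
import numpy as np

# Hypothetical sample of daily sales figures (illustrative values)
data = np.array([12, 15, 11, 18, 15, 22, 15, 9, 30, 14])

# Measures of central tendency
mean = data.mean()
median = np.median(data)
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]                 # most frequent value

# Measures of variability or spread
value_range = np.ptp(data)                       # max - min
q1, q2, q3 = np.percentile(data, [25, 50, 75])   # quartiles
iqr = q3 - q1                                    # inter-quartile range

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, quartiles=({q1}, {q2}, {q3}), IQR={iqr}")
```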
In recent times, the difficulties and limitations involved in collecting, storing, and comprehending massive data heaps have been overcome with the statistical inference process. Generalized inferences about population statistics are deduced using sampling methods together with the central limit theorem. For example, a leading news broadcaster gathers the vote choices of randomly chosen voters at the exit of a polling station on election day to derive statistical inferences about the preferences of the entire population.
Repeated sampling of the population dataset yields chunks of samples of sufficiently large size. Stratified sampling is generally preferred because it generates well-balanced, unbiased representatives of the population. The statistical measure of interest is calculated on each sampled chunk to obtain a distribution of sample statistic values, called the sampling distribution. The characteristics of the sampling distribution are related to those of the population dataset through the central limit theorem.
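A minimal simulation makes the idea concrete: drawing repeated samples from an assumed skewed population, the distribution of sample means behaves as the central limit theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed population: a skewed (exponential) distribution
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples and record each sample's mean
sample_size, n_samples = 50, 2_000
sample_means = np.array([
    rng.choice(population, size=sample_size).mean()
    for _ in range(n_samples)
])

# Central limit theorem: the sampling distribution of the mean is
# approximately normal, centered on the population mean, with
# standard deviation close to population std / sqrt(sample_size)
print("population mean:     ", population.mean())
print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std())
print("predicted std (CLT): ", population.std() / np.sqrt(sample_size))
```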
2. Predictive Analytics
Predictive Analytics exploits patterns in historical data to estimate future outcomes, identify trends, uncover potential risks and opportunities, or forecast the behavior of a process. As prediction use-cases are probabilistic in nature, these approaches employ probabilistic models to measure the likelihood of all possible outcomes. For example, the chatbot in a financial firm's customer service portal proactively learns a customer's intent or needs based on his/her past activities in its web domain. With the predicted context, the chatbot interactively converses with the customer to deliver apt services quickly and achieve better customer satisfaction.
In addition to extrapolation scenarios that predict what happens in the future based on available past data, a few applications estimate missing data entries with the help of the available data samples. This approximation of missing values within the range of the given data samples is technically referred to as interpolation. For example, a powerful image editor can reconstruct parts of a texture lost to super-imposed text by interpolating a feature function over the missing block. The feature function can be interpreted as a mathematical notation of the patterns in the texture of the distorted image.
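The idea can be sketched with NumPy's one-dimensional linear interpolation, here estimating an assumed missing reading in a small sensor series:

```python
import numpy as np

# Known samples (e.g., hourly sensor readings); the hour-3 reading is missing
known_hours = np.array([0, 1, 2, 4, 5])
known_values = np.array([10.0, 12.5, 14.0, 15.5, 15.0])

# Estimate the missing value at hour 3 by linear interpolation
estimated = np.interp(3, known_hours, known_values)
print(f"interpolated value at hour 3: {estimated}")  # lies between 14.0 and 15.5
```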
The significant factors that influence the choice of predictive models/strategies are given below (a minimal sketch follows the list):
- Prediction Accuracy: the degree of closeness between a predicted value and the actual value. A lower variance of the difference between predicted and actual values implies a higher accuracy of the predictive model.
- Speed of Prediction: prioritized highly in real-time tracking applications.
- Model Learning Rate: depends on the model's complexity and the computations involved in calculating the model parameters.
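As a minimal illustration of these trade-offs, the sketch below fits an ordinary least-squares trend line to assumed historical values with NumPy, extrapolates one step ahead, and reports the residual variance that the accuracy criterion above refers to:

```python
import numpy as np

# Assumed historical observations (e.g., monthly sales, illustrative values)
months = np.arange(12)
sales = np.array([100, 104, 109, 112, 119, 121, 127, 131, 136, 139, 145, 150])

# Fit a first-degree (linear) trend: sales ≈ slope * month + intercept
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate to forecast the next month
forecast = slope * 12 + intercept
print(f"forecast for month 12: {forecast:.1f}")

# Lower variance of (actual - predicted) implies higher model accuracy
residuals = sales - (slope * months + intercept)
print(f"residual variance: {residuals.var():.2f}")
```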
3. Prescriptive Analytics
Prescriptive Analytics uses the knowledge discovered during descriptive and predictive analysis to recommend a context-aware course of action. Advanced statistical techniques and computationally intensive optimization methods are applied to understand the distribution of the estimated predictions.
In precise terms, the impact and benefit of each outcome estimated during predictive analytics is evaluated to make heuristic and time-sensitive decisions for a given set of conditions. For example, a stock market consultancy firm performs SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis on the predicted prices of the stocks in an investor's portfolio and recommends the best buy-sell options to its clients.
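A toy illustration of this decision step (all probabilities and payoffs are invented): given predicted outcome scenarios for a hypothetical stock, recommend the action with the highest expected payoff.

```python
# Hypothetical predicted outcomes for a stock over the next week:
# (price change in %, probability) - purely illustrative numbers
scenarios = [(+5.0, 0.3), (+1.0, 0.4), (-4.0, 0.3)]

# Payoff of each candidate action per scenario, per unit invested
def payoff(action: str, change: float) -> float:
    if action == "buy":
        return change          # gain/lose with the price
    if action == "sell":
        return -change         # profit if the price falls (short position)
    return 0.0                 # hold: no exposure

# Prescriptive step: recommend the action with the best expected payoff
actions = ["buy", "sell", "hold"]
expected = {a: sum(p * payoff(a, c) for c, p in scenarios) for a in actions}
best = max(expected, key=expected.get)
print(expected, "-> recommended action:", best)
```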
Process Flow in Data Analytics
The process of data analytics comprises various stages of data processing, as given below:
1. Data Extraction
Data is ingested from multiple data sources of various types, including web pages, databases, and legacy applications, resulting in input datasets of different formats.
The data formats fed into the data analytics flow can be broadly classified as:
- Structured data has a clear definition of data types along with associated field lengths or field delimiters. This type of data can be easily queried, like the content stored in a relational database (RDBMS).
- Semi-structured data lacks a precise layout definition, but its data elements can be identified, separated, and grouped based on a standard schema or other metadata rules. An XML file employs tags to hold data, whereas a JavaScript Object Notation (JSON) file holds data in name-value pairs. NoSQL (Not only SQL) databases like MongoDB and Couchbase are also used to store semi-structured data.
- Unstructured data includes social media conversations, images, audio clips, etc. Traditional data parsing methods fail to understand this data, so it is typically stored in data lakes.
Data parsing for structured and semi-structured data is implemented in various ETL tools like Ab Initio, Informatica, DataStage, and open-source alternatives like Talend.
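The difference between the two parseable formats is easy to see in code. Below is a minimal Python sketch that reads the same hypothetical record as a CSV row (structured) and as a JSON document (semi-structured):

```python
import csv
import io
import json

# Structured: fixed fields separated by a delimiter
csv_text = "id,name,balance\n101,Alice,2500.75\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["name"], float(row["balance"]))

# Semi-structured: name-value pairs, nesting allowed, no fixed layout
json_text = '{"id": 101, "name": "Alice", "accounts": [{"type": "savings", "balance": 2500.75}]}'
record = json.loads(json_text)
print(record["name"], record["accounts"][0]["balance"])
```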
2. Data Cleaning and Transformation
The parsed data is cleaned to ensure data consistency and the availability of relevant data for the later stages of the process flow.
The major cleansing operations in data analytics are listed below (a short Pandas sketch follows the list):
- Detection and elimination of outliers in the data volumes.
- Removing duplicates in the dataset.
- Handling missing entries in data records with the understanding of functionality or use-cases.
- Validating permissible field values in data records; for example, “31-February” cannot be a valid value in any date field.
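A minimal Pandas sketch of these cleansing operations on a made-up set of records (all column names and values are illustrative):

```python
import pandas as pd

# Made-up raw records exhibiting the quality problems listed above
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "amount": [120.0, 95.5, 95.5, None, 110.0, 9_999_999.0],
    "signup_date": ["2023-01-15", "2023-02-31", "2023-02-31",
                    "2023-03-10", "2023-04-01", "2023-04-20"],
})

df = raw.drop_duplicates()                                  # remove duplicate records
df["amount"] = df["amount"].fillna(df["amount"].median())   # handle missing entries
# Invalid dates such as "2023-02-31" become NaT (not-a-time)
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Detect and drop outliers using the inter-quartile-range rule
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)
```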
The cleansed data is then transformed into a format suitable for analysis.
Data transformations include the following (sketched in code after the list):
- Filtering out unwanted data records.
- Joining the data fetched from different sources.
- Aggregation or grouping of data.
- Data typecasting.
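Continuing the sketch, the same transformations can be expressed in Pandas (table and column names are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 1, 3],
                       "amount": ["120", "95", "60", "200"]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["North", "South", "North"]})

orders["amount"] = orders["amount"].astype(float)      # typecasting
orders = orders[orders["amount"] > 50]                 # filtering unwanted records
merged = orders.merge(customers, on="customer_id")     # joining two sources
by_region = merged.groupby("region")["amount"].sum()   # aggregation/grouping
print(by_region)
```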
3. KPI/Insight Derivation
Data mining and deep learning methods are used to evaluate Key Performance Indicators (KPIs) or derive valuable insights from the cleaned and transformed data. Based on the objective of the analytics, data analysis is performed using various pattern recognition techniques like k-means clustering, SVM classification, Bayesian classifiers, etc., and machine learning models like Markov models, Gaussian Mixture Models (GMMs), etc.
Probabilistic models learn optimal model parameters in the training phase; in the validation phase, the model is tested using k-fold cross-validation to avoid over-fitting and under-fitting errors. The most commonly used programming languages for data analysis are R and Python. Both have rich sets of open-source libraries (SciPy, NumPy, Pandas) for performing complex data analysis.
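For instance, SVM classification scored with k-fold cross-validation can be sketched with scikit-learn (assuming the library is installed; the standard iris toy dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # toy dataset: 150 samples, 3 classes

# 5-fold cross-validation: each fold is held out once for testing,
# exposing over-fitting that a single train/test split could hide
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```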
4. Data Visualization
Data visualization is the clear and effective presentation of the uncovered patterns and derived conclusions using graphs, plots, dashboards, and graphics.
- Data reporting tools like QlikView, Tableau, etc., display KPIs and other derived metrics at various levels of granularity.
- Reporting tools enable end-users to create customized reports with pivot, drill-down options using user-friendly drag and drop interfaces.
- Interactive data visualization libraries like D3.js (Data-Driven Documents), HTML5 Anycharts, etc., are used to increase the ability to explore the analyzed data.
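As a minimal Python counterpart to these tools, a Matplotlib sketch plotting an illustrative KPI trend (all values are made up):

```python
import matplotlib.pyplot as plt

# Illustrative monthly KPI values
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 150, 162, 171]

fig, ax = plt.subplots()
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly Revenue KPI")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (in $1,000s)")
plt.show()
```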