Differences Between Spark SQL vs Presto
Presto, in simple terms, is the ‘SQL Query Engine,’ initially developed for Apache Hadoop. It’s an open-source distributed SQL query engine designed for running interactive analytic queries against data sets of all sizes. Spark SQL is a distributed in-memory computation engine with a SQL layer on top of structured and semi-structured data sets. Since it’s in-memory processing, the processing will be fast in Spark SQL.
Head to Head Comparison Between Spark SQL and Presto (Infographics)
Below are the Top 7 comparisons between Spark SQL and Presto:
Key Differences Between Spark SQL and Presto
Below is the list of the critical difference between Presto and Spark SQL:
- Apache Spark introduces a programming module for processing structured data called Spark SQL. Spark SQL includes an encoding abstraction called Data Frame which can act as distributed SQL query engine.
- The motive behind the beginning of Presto was to enable interactive analytics and approaches to the speed of commercial data warehouses with the power to scale the size of organizations matching Facebook.
- Whereas Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (Resilient Distributed Datasets), it supports structured/semi-structured data.
- Presto was designed as an alternative to tools that query HDFS data using MapReduce jobs such as Hive or Pig, but Presto is not limited to HDFS.
- Spark SQL follows in-memory processing that increases the processing speed. Spark is designed to process various workloads, such as batch queries, iterative algorithms, interactive queries, streaming, etc.
- Presto is capable of executing federative queries. Below is an example of Presto Federated Queries
Let us assume any RDBMS with table sample1
And HIVE with table sample2,
‘Testdb’ is the database in both hive and MYSQL. Using Presto, we can evaluate data using a single query once their connectors are configured correctly, as shown below-
presto> <Function (select/Group by ..etc.)> hive.Testdb.sample2
Function (select/Group by ..etc.)>mysql.Testdb.sample1
- Spark SQL architecture consists of Spark SQL, Schema RDD, and Data Frame.
-
- A Data Frame is a collection of data; the data is organized into named columns. Technically, it is the same as relational database tables.
- Schema RDD: Spark Core contains a unique data structure called RDD. Spark SQL works on schemas, tables, and records. Therefore, a user can use the Schema RDD as a temporary table. So that user can call this Schema RDD a Data Frame
- Data Frame Capabilities: Data frame process the data in the size of Kilobytes to Petabytes on a single node cluster to multiple node clusters,
- Data Frame supports different data formats ( CSV, elastic search, Cassandra, etc.) and storage systems (HDFS, HIVE tables, MySQL, etc.); it can be integrated with all Big Data tools/frameworks via Spark-Core and provides API for languages such as Python, Java, Scala, and R Programming.
- Whereas Presto is a distributed engine that works on a cluster setup. Presto architecture is simple to understand and extensible. Presto client (CLI) submits SQL statements to a master daemon coordinator, who manages the processing.
- Companies using Presto: Facebook, Netflix, Airbnb, Dropbox,, etc.
- Apache Spark Use Cases can be found in Industries like Finance, Retail, Healthcare, Travel,, etc. Many e-commerce websites like eBay, Alibaba, and Pinterest use Spark SQL to analyze hundreds of petabytes of data on their e-commerce platform.
Comparison Table of Spark SQL vs Presto
Below is the topmost comparison between SQL vs Presto.
Basis of comparison | Presto | Spark SQL |
Eco-Systems / Platforms | Hadoop, Big Data Processing, etc | Spark Framework, Big Data Processing, etc |
Purpose | Presto is designed for running SQL queries over Big Data (Huge workloads). It was designed by Facebook to process their huge workloads. |
Spark SQL is one of the components of Apache Spark Core. Spark Core is the fundamental execution engine for the spark platform |
Set up |
|
|
Capabilities/Features | Presto allows data querying over many data sources; For example, Data might be residing in data stores: Hive, Cassandra, RDBMS, and some other proprietary data stores. | Spark SQL gives flexibility in integration with other data sources using the data frames and JDBC connectors. |
Support for Connectors | Presto supports pluggable connectors. These connectors provide data sets for queries.
Below are several pre-existing connectors available in Presto, while Presto provides the ability to connect with custom connectors, as well.
|
A Data Frame interface allows different Data Sources to work on Spark SQL. Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity. |
Federated Queries | Presto supports the Federated Queries. Presto can be configured to connect with different DBs, and once configured, its CLI can be used to launch ‘Federated Queries’. In one Presto query user can combine data from multiple data sources and run the query. |
Spark SQL comes with an inbuilt feature to connect with other databases using JDBC, that is, “JDBC to other Databases,” which aids in the federation feature. Spark creates the data frames using the JDBC: database feature by leveraging scala/python API. Still, it also works directly with the Spark SQL Thrift server and allows users to query external JDBC tables effortlessly like other hive/spark tables. |
Who Uses? | Data Analysts, Data Engineers, Data Scientists, etc | Data Analysts, Data Engineers, Data Scientists, Spark Developer, etc |
Conclusions
Presto is very helpful regarding BI-type queries, and Spark SQL leads performance-wise in large analytics queries. When comparing with respect to configuration, Presto set up easy than Spark SQL. Both Spark SQL and Presto are standing equally in the market and solving different kinds of business problems.
Recommended Articles
We hope that this EDUCBA information on “Spark SQL vs Presto” was beneficial to you. You can view EDUCBA’s recommended articles for more information.