Updated February 18, 2023
Introduction to Avro File
Avro is a row-oriented data serialization framework developed within the Apache Hadoop project. An Avro file is a data file that stores serialized data in a compact binary format. The schema is defined in JSON. Avro files also include markers that allow huge datasets to be split into subsets suitable for Apache MapReduce processing. Avro also provides a container file for storing persistent data that can easily be read and written, with no extra configuration needed.
Overview of Avro File
Avro is a data serialization system that provides rich data structures and a compact, fast, binary data format. It also provides a container file for storing persistent data, and support for remote procedure calls (RPC). Furthermore, it integrates simply with dynamic languages: code generation is not required to read or write data files. Code generation is an optional optimization that is only worth implementing for statically typed languages.
An Avro file normally has two segments: the first is the schema, which may be omitted when data is written as raw Avro binary rather than as a container file, and the second is the binary data. For example, if we open an Avro file in a text editor, we can see both segments: the first begins with the characters Obj and contains the schema as readable JSON, and the second holds binary data that is not human-readable. We also need to confirm the file type that Boomi will be able to read and write.
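The first segment can also be recognised programmatically. The Python sketch below assumes the file follows the Avro object container format, whose header starts with the three bytes Obj followed by the byte 1. The function name looks_like_avro_container and the sample bytes are made up for illustration.

```python
import io

AVRO_MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the format version byte 1

def looks_like_avro_container(stream):
    """Return True if the stream starts with the Avro container magic bytes."""
    return stream.read(len(AVRO_MAGIC)) == AVRO_MAGIC

# Illustrative in-memory data; a real file would be opened with open(path, "rb").
sample = io.BytesIO(AVRO_MAGIC + b"...rest of header and data blocks...")
not_avro = io.BytesIO(b"PK\x03\x04 some other format")
print(looks_like_avro_container(sample))
print(looks_like_avro_container(not_avro))
```

A check like this only says that the file claims to be an Avro container; it does not validate the schema or the data blocks that follow.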
Avro File Configuration
Let us see how to configure Avro files. The behavior of Avro data files can be changed through several configuration parameters.
When using Hadoop:
- To ignore files that do not have the '.avro' extension when reading, set the parameter 'avro.mapred.ignore.inputs.without.extension' in the Hadoop configuration. Its default value is false.
- To set it, go from spark to sparkContext to hadoopConfiguration, and then call set("avro.mapred.ignore.inputs.without.extension", "true").
To configure compression, set the following properties:
- The compression codec: spark.sql.avro.compression.codec accepts codecs such as snappy and deflate, with snappy as the default.
- If the compression codec is set to deflate, the compression level can be adjusted with spark.sql.avro.deflate.level, which has -1 as the default level.
- These properties can be set in the Spark cluster configuration or at runtime, for example:
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "4")
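To get a feel for what the deflate level changes, the following Python sketch compresses the same bytes at several levels with the standard zlib module. It does not use Spark. It only assumes, as the Avro specification describes, that the deflate codec is raw DEFLATE (RFC 1951) without a zlib header or checksum. The helper names and the sample payload are illustrative.

```python
import zlib

def raw_deflate(data, level=-1):
    """Compress with raw DEFLATE (no zlib header); -1 means the library default."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)  # wbits=-15: raw stream
    return compressor.compress(data) + compressor.flush()

def raw_inflate(data):
    """Reverse raw_deflate."""
    return zlib.decompress(data, -15)

# Illustrative payload: repetitive text, similar to many rows with equal values.
payload = b'{"title": "physics", "teacher": 7}\n' * 500

for level in (-1, 1, 4, 9):
    packed = raw_deflate(payload, level)
    assert raw_inflate(packed) == payload  # lossless at every level
    print(level, len(packed))
```

Higher levels generally trade CPU time for smaller output, and the level only matters when the codec is deflate.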
Types of Avro File
Avro schemas are built from two kinds of types:
1. Primitive Types
The primitive types are null, boolean, int, long, float, double, bytes, and string.
Example schema: {"type": "null"}
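As an illustration of how compact these primitives are in the binary format, the Python sketch below encodes an int or long using the variable-length zig-zag scheme from the Avro specification, and a string as its byte length followed by UTF-8 data. The function names are illustrative and not part of any Avro library.

```python
def encode_long(n):
    """Encode an Avro int/long: zig-zag, then 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)           # zig-zag: small magnitudes get small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Encode an Avro string: length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

print(encode_long(0).hex())        # 00
print(encode_long(-1).hex())       # 01
print(encode_long(1).hex())        # 02
print(encode_long(64).hex())       # 8001
print(encode_string("foo").hex())  # 06666f6f
```

Small values, positive or negative, take a single byte, which is one reason Avro data stays compact.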
2. Complex Types
- array:
{
  "type": "array",
  "items": "long"
}
- map: keys are always strings
{
  "type": "map",
  "values": "long"
}
- record:
{
  "type": "record",
  "name": "---",
  "doc": "---",
  "fields": [
    {"name": "--", "type": "int"},
    ---
  ]
}
- enum:
{
  "type": "enum",
  "name": "---",
  "doc": "---",
  "symbols": ["--", "--"]
}
- fixed: a fixed number of 8-bit unsigned bytes
{
  "type": "fixed",
  "name": "---",
  "size": <size in bytes>
}
- union: a value must match one of the listed schemas
[
  "null",
  "string",
  --
]
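Putting the complex types together, the sketch below defines one record schema that uses an enum, an array, a map, and a union, then loads it with Python's json module to list its fields. The record and field names (Student, grades, and so on) are invented for illustration; only the schema keywords come from the Avro specification.

```python
import json

schema_json = """
{
  "type": "record",
  "name": "Student",
  "doc": "Illustrative record; all names here are made up",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "level", "type": {"type": "enum", "name": "Level",
                               "symbols": ["JUNIOR", "SENIOR"]}},
    {"name": "grades", "type": {"type": "array", "items": "int"}},
    {"name": "contacts", "type": {"type": "map", "values": "string"}},
    {"name": "nickname", "type": ["null", "string"], "default": null}
  ]
}
"""

def describe_type(t):
    """Return a short label: a primitive name, a union listing, or a complex kind."""
    if isinstance(t, str):
        return t
    if isinstance(t, list):
        return "union of " + ", ".join(describe_type(x) for x in t)
    return t["type"]

schema = json.loads(schema_json)
summary = {f["name"]: describe_type(f["type"]) for f in schema["fields"]}
for name, label in summary.items():
    print(name, "->", label)
```

Because an Avro schema is ordinary JSON, any JSON parser can inspect it. Validating it against the Avro rules still requires an Avro library.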
Examples of Avro File
Let us see examples of reading Avro files with a schema and without a schema.
Example #1
Avro file using schema:
import java.util.Properties
import java.io.InputStream
import java.io.ByteArrayInputStream
import com.boomi.execution.ExecutionUtil
import org.apache.avro.Schema
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DatumReader
logger = ExecutionUtil.getBaseLogger()
for (int j = 0; j < dataContext.getDataCount(); j++) {
    InputStream istm = dataContext.getStream(j)
    Properties prop = dataContext.getProperties(j)
    // No schema is passed in: the reader takes it from the file header
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>()
    DataFileStream<GenericRecord> dataFileStream = new DataFileStream<GenericRecord>(istm, datumReader)
    Schema sche = dataFileStream.getSchema()
    logger.info("Schema used: " + sche)
    GenericRecord rec = null
    while (dataFileStream.hasNext()) {
        // Reuse the same record object for each row
        rec = dataFileStream.next(rec)
        System.out.println(rec)
        // Store each record as its own JSON document
        InputStream jsonStream = new ByteArrayInputStream(rec.toString().getBytes('UTF-8'))
        dataContext.storeStream(jsonStream, prop)
    }
    dataFileStream.close()
}
The script above reads an Avro file that carries its own schema and generates one JSON document per record, so a single file can produce more than one JSON document. We import the required packages, obtain the schema from the file through the DataFileStream object, and write each record out as JSON.
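To show where that embedded schema lives, here is a minimal Python sketch that reads the header of an Avro container file by hand and extracts the avro.schema entry. It follows the header layout in the Avro specification (magic bytes, a metadata map, then a 16-byte sync marker). The function names and the hand-built header are assumptions for illustration, and a real project would normally use an Avro library instead.

```python
import io
import json

MAGIC = b"Obj\x01"

def read_long(stream):
    """Decode one zig-zag, variable-length Avro long."""
    acc, shift = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of data")
        acc |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:          # high bit clear: last byte
            return (acc >> 1) ^ -(acc & 1)
        shift += 7

def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))

def read_header(stream):
    """Return (metadata dict, 16-byte sync marker) from a container header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:                   # a zero count ends the map
            break
        if count < 0:                    # negative count: a byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta, stream.read(16)

# Hand-built header for demonstration only.
schema_text = b'{"type": "string"}'
header = (MAGIC
          + b"\x02"                                   # one metadata entry
          + bytes([2 * 11]) + b"avro.schema"          # key, 11 bytes long
          + bytes([2 * len(schema_text)]) + schema_text
          + b"\x00"                                   # end of metadata map
          + bytes(16))                                # sync marker (zeros here)

meta, sync = read_header(io.BytesIO(header))
schema = json.loads(meta["avro.schema"])
print(schema)
```

This mirrors what the script obtains through getSchema(): the schema travels with the file, so the reader does not have to be told about it in advance.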
Example #2
Avro file without a schema:
import java.util.Properties
import java.io.InputStream
import java.io.ByteArrayInputStream
import com.boomi.execution.ExecutionUtil
import org.apache.avro.Schema
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DatumReader
logger = ExecutionUtil.getBaseLogger()
// Schema supplied by the script instead of being taken from the file
String schemaString = '{"type":"record","name":"college","namespace":"student.avro",' +
    '"fields":[{"name":"title","type":"string","doc":"college title"},{"name":"exam_date","type":"string","doc":"start date"},{"name":"teacher","type":"int","doc":"main character is the teacher in college"}]}'
Schema sche = new Schema.Parser().parse(schemaString)
logger.info("Schema used: " + sche)
for (int k = 0; k < dataContext.getDataCount(); k++) {
    InputStream istm = dataContext.getStream(k)
    Properties prop = dataContext.getProperties(k)
    // Pass the schema to the reader so it knows how to interpret the records.
    // Note: DataFileStream still expects the Avro container format; the
    // supplied schema is then used as the reader schema.
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(sche)
    DataFileStream<GenericRecord> dataFileStream = new DataFileStream<GenericRecord>(istm, datumReader)
    GenericRecord rec = null
    while (dataFileStream.hasNext()) {
        rec = dataFileStream.next(rec)
        System.out.println(rec)
        // Store each record as its own JSON document
        InputStream jsonStream = new ByteArrayInputStream(rec.toString().getBytes('UTF-8'))
        dataContext.storeStream(jsonStream, prop)
    }
    dataFileStream.close()
}
The example above reads a file for which the schema is supplied separately. If the schema is not included in the Avro file, we have to tell the reader how to interpret the binary Avro data, so we must provide the schema that was used to write it. Here the schema is defined as a JSON string inside the script. The schema can have a different name, and it can also be kept at another path.
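Why the schema is indispensable becomes clear when decoding raw Avro bytes by hand. The Python sketch below decodes one record of the college schema from bytes that contain no field names or type tags, only values in schema order. It handles just the string, int, and long types. The helper names and the hand-encoded sample bytes are assumptions for illustration.

```python
import io
import json

def read_long(stream):
    """Decode one zig-zag, variable-length Avro int/long."""
    acc, shift = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of data")
        acc |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return (acc >> 1) ^ -(acc & 1)
        shift += 7

def read_string(stream):
    """Decode a length-prefixed UTF-8 string."""
    return stream.read(read_long(stream)).decode("utf-8")

# Only the primitive types needed for this sketch.
READERS = {"int": read_long, "long": read_long, "string": read_string}

def read_record(schema, stream):
    """Decode one record: values appear in schema field order, untagged."""
    return {f["name"]: READERS[f["type"]](stream) for f in schema["fields"]}

schema = json.loads("""
{"type": "record", "name": "college", "namespace": "student.avro",
 "fields": [{"name": "title", "type": "string"},
            {"name": "exam_date", "type": "string"},
            {"name": "teacher", "type": "int"}]}
""")

# Hand-encoded sample: "math", "2023-02-18", 7
raw = b"\x08math" + b"\x14" + b"2023-02-18" + b"\x0e"
record = read_record(schema, io.BytesIO(raw))
print(record)
```

Without the schema, the same bytes could not be split into fields at all, which is why the reader must be given the schema when the file does not carry it.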
Conclusion
In this article, we have seen that an Avro file is a data file produced by Avro, the open-source data serialization system used with Apache Hadoop. We have also seen how to configure Avro data files, the types an Avro schema can contain, and examples of reading files with and without a schema, which help in understanding the concept.