In this article, Big Data Hadoop professionals introduce the Avro data format and its use in Hadoop and big data development. Read on to learn the concepts behind the Avro data format as the professionals explain them.
Introduction to the Avro data format
Avro is a data serialization system. It provides a fast, compact binary data format based on schemas. When Avro data is read, the schema that was used to write it is always present. Because the schema is known, no type information has to be stored with each value, which keeps serialization fast and the output small.
Input data for Avro has explicitly defined field types, which helps other developers verify the input format. When data is stored in an Avro file, the schema is stored along with it, so the file can later be processed by programs written by other developers.
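To make the "no per-value overhead" point concrete, here is a minimal Python sketch of how Avro's binary encoding lays out a record of two strings and a long. It is not the Avro library: the helper names and the FIELDS list are assumptions made for illustration. It follows the rules in the Avro specification: a long is zigzag-encoded and written as a variable-length integer, a string is its byte length followed by UTF-8 bytes, and a record is just its fields in schema order.

```python
import io

# Hypothetical field list for illustration: two strings and a long.
FIELDS = [("id", "string"), ("name", "string"), ("timestamp", "long")]


def encode_long(n: int) -> bytes:
    """Zigzag-encode a 64-bit integer, then write it 7 bits at a time."""
    z = (n << 1) ^ (n >> 63)  # maps small negative and positive values to small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: io.BytesIO) -> int:
    """Inverse of encode_long."""
    z, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data  # length prefix, then the bytes


def decode_string(buf: io.BytesIO) -> str:
    return buf.read(decode_long(buf)).decode("utf-8")


def encode_record(record: dict) -> bytes:
    # No field names or type tags are written: the schema supplies both.
    return b"".join(
        encode_string(record[name]) if typ == "string" else encode_long(record[name])
        for name, typ in FIELDS
    )


def decode_record(data: bytes) -> dict:
    buf = io.BytesIO(data)
    return {
        name: decode_string(buf) if typ == "string" else decode_long(buf)
        for name, typ in FIELDS
    }


record = {"id": "1", "name": "Jean", "timestamp": 1366123856}
encoded = encode_record(record)
print(len(encoded), "bytes;", "round trip ok:", decode_record(encoded) == record)
```

The encoded bytes carry no field names at all, so a reader without the schema cannot tell where one field ends and the next begins. That is why Avro always keeps the writer's schema with the data.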
Environment
Java: JDK 1.7
Avro tools: http://www.us.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar
Cloudera version: CDH4.6.0
Initial steps
1. Download the Avro tools jar from the link above.
2. Prepare some input data in JSON format, to convert to Avro binary data and then back from Avro binary data to JSON.
schemaFile.avsc is the schema file for Avro. This is a sample avsc file:
{
"type" : "record",
"name" : "twitter_schema",
"namespace" : "com.hn.avro",
"fields" : [ {
"name" : "id",
"type" : "string",
"doc" : "id of user"
}, {
"name" : "name",
"type" : "string",
"doc" : "name of user"
}, {
"name" : "timestamp",
"type" : "long",
"doc" : "Unix epoch time in seconds"
} ],
"doc" : "Basic information of user"
}
basicInfor.json is the actual data file that will be converted to Avro with the Avro tools:
{"id":"1","name":"Jean","timestamp": 1366123856 }
{"id":"2","name":"Jang","timestamp": 1366489544 }
3. Put the files into HDFS with the following commands:
hadoop fs -mkdir -p /data/mysample/avroTable
hadoop fs -put basicInfor.json /data/mysample/avroTable
hadoop fs -put schemaFile.avsc /data/mysample/avroTableFormat
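Before converting, it can help to check that every JSON line really matches the schema. The following is a minimal Python sketch of such a check, not part of the Avro tools. The function name check_record and the type mapping are assumptions for illustration, and only the two primitive types used in the sample schema are handled.

```python
import json

# A trimmed copy of the sample schema (doc strings omitted for brevity).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.hn.avro",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# Assumption: only the primitive types that appear in the sample schema.
PYTHON_TYPES = {"string": str, "long": int}


def check_record(schema: dict, record: dict) -> list:
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    expected = {field["name"]: field["type"] for field in schema["fields"]}
    for name, avro_type in expected.items():
        if name not in record:
            problems.append("missing field: " + name)
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[avro_type]):
            problems.append("field %s should be %s" % (name, avro_type))
    for name in record:
        if name not in expected:
            problems.append("unexpected field: " + name)
    return problems


schema = json.loads(SCHEMA_JSON)
lines = [
    '{"id":"1","name":"Jean","timestamp": 1366123856 }',
    '{"id":"2","name":"Jang","timestamp": 1366489544 }',
]
for number, line in enumerate(lines, start=1):
    print(number, check_record(schema, json.loads(line)) or "ok")
```

To check a real file, read basicInfor.json line by line instead of using the inline list. The sketch may be stricter or looser than the converter itself, so treat it as a pre-flight check only.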
Code walk through
This command converts the data from JSON to Avro format:
java -jar avro-tools-1.7.4.jar fromjson --schema-file schemaFile.avsc basicInfor.json > basicInfor.avro
These commands convert the data from Avro format back to JSON, for both uncompressed data and compressed data (here basicInfor.snappy.avro is a Snappy-compressed Avro file):
java -jar avro-tools-1.7.4.jar tojson basicInfor.avro > basicInfor.json
java -jar avro-tools-1.7.4.jar tojson basicInfor.snappy.avro > basicInfor.json
These commands extract the schema from an Avro file into an avsc file, for both uncompressed and compressed data:
java -jar avro-tools-1.7.4.jar getschema basicInfor.avro > schemaFile.avsc
java -jar avro-tools-1.7.4.jar getschema basicInfor.snappy.avro > schemaFile2.avsc
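The schema can be recovered from either file because an Avro container file stores it in its header, whatever codec is used for the data blocks. The Python sketch below illustrates that header layout as described in the Avro specification: the magic bytes Obj followed by byte 1, a metadata map with the keys avro.schema and avro.codec, then a 16-byte sync marker. It is not how the Avro tools are implemented. The helper names are assumptions, and the header is built in memory so that the example runs without a real file.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of every Avro container file


def write_long(n: int) -> bytes:
    """Avro long: zigzag encoding followed by a variable-length integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf) -> int:
    z, shift = 0, 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("unexpected end of data")
        z |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return (z >> 1) ^ -(z & 1)
        shift += 7


def write_bytes(data: bytes) -> bytes:
    return write_long(len(data)) + data


def read_bytes(buf) -> bytes:
    return buf.read(read_long(buf))


def read_header(buf):
    """Return (metadata dict, sync marker) from an Avro container header."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(buf)
            count = -count
        for _ in range(count):
            key = read_bytes(buf).decode("utf-8")
            meta[key] = read_bytes(buf)
    return meta, buf.read(16)


# Build a synthetic header in memory (assumption: stands in for a real file).
sample_schema = {"type": "record", "name": "twitter_schema", "fields": []}
header = (
    MAGIC
    + write_long(2)
    + write_bytes(b"avro.schema") + write_bytes(json.dumps(sample_schema).encode("utf-8"))
    + write_bytes(b"avro.codec") + write_bytes(b"null")
    + write_long(0)
    + bytes(16)
)
meta, sync = read_header(io.BytesIO(header))
print(json.loads(meta["avro.schema"])["name"], meta["avro.codec"].decode("utf-8"))
```

On a real file, open it in binary mode and pass the file object to read_header. The avro.codec entry is then typically null for uncompressed data or snappy for Snappy-compressed data, while avro.schema holds the same schema in both cases.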
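This command creates a Hive table that maps onto the Avro data in HDFS. Hive reads the files under the table location as Avro data files, so that directory should contain the converted Avro files. After running this command, we can query the Avro data: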
CREATE EXTERNAL TABLE avroHive
COMMENT "Mapping avro table in Hive"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/mysample/avroTable'
TBLPROPERTIES (
'avro.schema.url'='/data/mysample/avroTableFormat'
);
Verify the result
We can inspect the local files to check the data: the avsc file for the schema, the JSON file for the JSON data, and the Avro file for the data in Avro format:
cat filename
We can also check the table in Hive. Start the Hive client with the first command, then list the tables with the second:
hive
show tables;
We hope this blog helps you understand how to work with Avro files in Hadoop and in big data applications.
In this post, Big Data Hadoop architects have introduced the Avro data format. If you want to know more about the format or about Hadoop development, feel free to ask in the comments.