How to Read and Write JSON-formatted Data With Apache Pig
16 Apr 2014

In this post, I will explain how to use the `JsonStorage` and `JsonLoader` objects in Apache Pig to read and write JSON-formatted data.
Reading JSON-Formatted Data With JsonLoader
Apache Pig can read JSON-formatted data if it is in a particular format. Each row in the file has to be a JSON dictionary where the keys specify the column names and the values specify the table content.
For example, suppose our data has three columns called `food`, `person`, and `amount`. We can store this data in `second_table.json` as:
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}
We can then load the file using `JsonLoader` as:
second_table = LOAD 'second_table.json'
USING JsonLoader('food:chararray, person:chararray, amount:int');
Here, `'food:chararray, person:chararray, amount:int'` is the Pig schema for the data.
This creates the expected table:
food | person | amount |
---|---|---|
Tacos | Alice | 3 |
Tomato Soup | Sarah | 2 |
Grilled Cheese | Alex | 5 |
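To double-check the load, we can `DESCRIBE` and `DUMP` the relation. Here is a minimal sketch, assuming the `second_table` relation from above (the commented lines show the kind of output Pig prints for this data):
DESCRIBE second_table;
-- second_table: {food: chararray,person: chararray,amount: int}
DUMP second_table;
-- (Tacos,Alice,3)
-- (Tomato Soup,Sarah,2)
-- (Grilled Cheese,Alex,5)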
Reading Nested Data
Conveniently, JSON and Pig both support nested data: we can store both bags of data and tuples in JSON and have them read into Pig. Pig expects tuples to be stored in JSON as dictionaries and bags as lists of dictionaries. In our next example, `third_table.json` contains rows with both a bag and a tuple:
{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}],"inventor":{"name":"Alex","age":25}}
{"recipe":"TomatoSoup","ingredients":[{"name":"Tomatoes"},{"name":"Milk"}],"inventor":{"name":"Steve","age":23}}
Notice that for the first row, the `ingredients` bag is stored as a list of dictionaries (`[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}]`). Similarly, the `inventor` tuple is stored as a dictionary (`{"name":"Alex","age":25}`).
We can read this data in Pig by specifying a more complicated schema:
third_table = LOAD 'third_table.json'
USING JsonLoader('recipe:chararray,
ingredients: {(name:chararray)},
inventor: (name:chararray, age:int)');
We can `DUMP` this data using Pig to ensure that it is loaded correctly:
(Tacos,{(Beef),(Lettuce),(Cheese)},(Alex,25))
(Tomato Soup,{(Tomatoes),(Milk)},(Steve,23))
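Once loaded, the nested fields can be used like any other Pig values. Here is a minimal sketch, assuming the `third_table` relation from above (the `by_ingredient` and `inventors` aliases are just illustrative names), that flattens the bag and reaches into the tuple with the dot operator:
-- One output row per (recipe, ingredient) pair.
by_ingredient = FOREACH third_table GENERATE recipe, FLATTEN(ingredients) AS ingredient;
-- Project the name field out of the inventor tuple.
inventors = FOREACH third_table GENERATE recipe, inventor.name AS inventor_name;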
Writing JSON-Formatted Data With JsonStorage
Finally, we can write JSON-formatted data using `JsonStorage`. Imagine we had a simple text file `first_table.dat`:
cat > first_table.dat
Tacos
Tomato Soup
Grilled Cheese
We can read it into Pig using `PigStorage` and then save it out using `JsonStorage`:
first_table = LOAD 'first_table.dat'
USING PigStorage()
AS (col1:chararray);
...
STORE first_table
INTO 'first_table.json'
USING JsonStorage();
As is the convention in HDFS, the output is a folder called `first_table.json`. Inside the folder is a file called `part-m-00000` that contains the data in JSON format:
{"col1":"Tacos"}
{"col1":"Tomato Soup"}
{"col1":"Grilled Cheese"}
If the job had lots of output data, it would be spread across additional files like `part-m-00001`.
Pig also wrote out a hidden file in the folder called `.pig_schema` that explicitly specifies the schema of the output data:
{"fields":[{"name":"col1","type":55,"description":"autogenerated from Pig Field Schema","schema":null}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
This file allows the table to be read in by subsequent Pig jobs without explicitly specifying the schema.
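For example, assuming the `first_table.json` folder produced above, a later script should be able to reload the table without repeating the schema, since `JsonLoader` falls back to the `.pig_schema` file when no schema argument is given:
-- No schema argument: JsonLoader reads it from .pig_schema instead.
reloaded = LOAD 'first_table.json' USING JsonLoader();
DESCRIBE reloaded;
-- reloaded: {col1: chararray}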
If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!