Data Engineering Cookbook

This site is hosted by Helmut Zechmann. I use it to publish data engineering related HOWTOs and code snippets.

Processing Nested Data In Hadoop

In traditional relational database systems, data structures should follow the first normal form, which demands that each attribute of an entity contain only atomic values. This is usually achieved by distributing data among multiple tables; joins are then used to combine information from those tables.

While relational databases have excellent support for joining multiple datasets efficiently, in big data systems such as Apache Hadoop joins are usually expensive, inefficient operations. Because of this limitation it often makes sense to store data in a semi-structured form that does not follow the first normal form.

This article shows how to store and process semi-structured data using attributes of the types map and list in the Hadoop ecosystem. For illustration purposes we use a data structure that contains annotations about apps. The data structure has the following attributes:

  • name: the name of the app
  • attributes: a map for storing arbitrary key value pairs
  • tags: a list of tags

A Schema For Semi-structured Data

The Apache Avro project provides a data format for storing semi-structured data. The Avro format is supported by all major projects in the Hadoop ecosystem and offers support for both list and map types.

The following snippet defines an avro schema for our example data structure:

{
  "namespace": "example.avro",
  "type": "record",
  "name": "app",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "attributes", "type": {"type": "map", "values": "string"}},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}

The keys of an Avro map always have the type string; in our example the values are strings as well. The Avro data type for lists is called array. Our tags array contains strings and may have arbitrary length.

Nested Data In Hive

The Apache Hive project supports mapping Avro data to tables (see the Hive Avro docs). We can simply declare a table that uses our Avro schema to define the table structure.

Create Table

The following Hive statement creates the table for our app data:

CREATE TABLE apps
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs:///demo/app.avsc');
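
Note that the schema file referenced by avro.schema.url must be present in HDFS before the table can be used. Assuming the schema shown above was saved locally as app.avsc, it can be uploaded as follows:

hdfs dfs -mkdir -p /demo
hdfs dfs -put app.avsc /demo/app.avsc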

Once the table is created, we can inspect the table structure:

DESCRIBE apps;

+-----------------------+-----------------------+-----------------------+
|       col_name        |       data_type       |        comment        |
+-----------------------+-----------------------+-----------------------+
| name                  | string                | from deserializer     |
| attributes            | map<string,string>    | from deserializer     |
| tags                  | array<string>         | from deserializer     |
+-----------------------+-----------------------+-----------------------+

Put Data Into The Table

For demonstration purposes we use an existing Avro file created with a small Java application (see the sketch below) and put it into the HDFS folder that holds the data of our new Hive table:

hdfs dfs -put avro-demo/nested.avro /user/hive/warehouse/demo.db/apps;
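
For reference, such a file can be written with the Avro generic API along the following lines. This is a minimal sketch, not the exact application used here; the class name, file paths, and the single sample record are illustrative:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class NestedAvroDemo {
    public static void main(String[] args) throws Exception {
        // Parse the schema defined above.
        Schema schema = new Schema.Parser().parse(new File("app.avsc"));

        // Build one sample record with a map attribute and an array attribute.
        GenericRecord app = new GenericData.Record(schema);
        app.put("name", "Angry Birds");
        Map<String, String> attributes = new HashMap<>();
        attributes.put("category", "game");
        attributes.put("publisher", "rovio");
        app.put("attributes", attributes);
        app.put("tags", Arrays.asList("entertainment", "casual"));

        // Write the record to an Avro container file.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("nested.avro"));
            writer.append(app);
        }
    }
}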

To inspect the data we can issue a simple SELECT statement:

SELECT * FROM apps;
+------------------+------------------------------------------+------------------------------------------+
|       name       |                attributes                |                   tags                   |
+------------------+------------------------------------------+------------------------------------------+
| Angry Birds      | {"category":"game","publisher":"rovio"}  | ["entertainment","casual"]               |
| Microsoft Excel  | {"category":"office"}                    | ["office","spreadsheet","productivity"]  |
+------------------+------------------------------------------+------------------------------------------+

As you can see, the sample data contains two records. The first record has two attributes and two tags; the second record has only one attribute and three tags.

Query Nested Data

Hive provides some special functions for working with complex data types. The following table of functions is taken from the Hive documentation:

Return Type  Name(Signature)                  Description
array<K>     map_keys(Map<K, V>)              Returns an unordered array containing the keys of the input map.
array<T>     sort_array(Array<T>)             Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).
array<V>     map_values(Map<K, V>)            Returns an unordered array containing the values of the input map.
boolean      array_contains(Array<T>, value)  Returns TRUE if the array contains value.
int          size(Array<T>)                   Returns the number of elements in the array type.
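
As a small illustration, the following query uses size and map_keys to compute the number of tags and the set of attribute keys for each app (note that map_keys returns the keys in no particular order):

SELECT name,
       size(tags) AS tag_count,
       map_keys(attributes) AS attribute_keys
FROM apps;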

Query Arrays In Hive

The following example demonstrates how to query our data for records with the tag office using the array_contains function:

SELECT *
FROM apps
WHERE array_contains(tags, "office");

+------------------+------------------------+------------------------------------------+
|       name       |       attributes       |                   tags                   |
+------------------+------------------------+------------------------------------------+
| Microsoft Excel  | {"category":"office"}  | ["office","spreadsheet","productivity"]  |
+------------------+------------------------+------------------------------------------+

Query Maps In Hive

The second example finds all apps that are known to be published by a publisher called rovio. For this we use the []-operator to access map entries. Records that do not contain a publisher key yield NULL for the lookup and therefore do not match the filter.

SELECT *
FROM apps
WHERE attributes['publisher'] = 'rovio';

+--------------+------------------------------------------+-----------------------------+
|     name     |                attributes                |            tags             |
+--------------+------------------------------------------+-----------------------------+
| Angry Birds  | {"category":"game","publisher":"rovio"}  | ["entertainment","casual"]  |
+--------------+------------------------------------------+-----------------------------+

Nested Data In Pig

Access to the Hive-mapped data is not limited to Hive. This section shows how to access our data using Pig. For this we leverage the Pig HCatalog loader, in particular its support for handling complex types. Our Avro arrays are loaded into Pig bags, Avro maps into Pig maps.

The following interactive Pig session illustrates this. First we start a Pig session with HCatalog access enabled:

pig -useHCatalog

In the next step we load our example data and inspect it:

grunt> APPS = LOAD 'demo.apps' USING org.apache.hcatalog.pig.HCatLoader();
grunt> describe APPS;
APPS: {name: chararray,attributes: map[],tags: {innertuple: (innerfield: chararray)}}
grunt> dump APPS;
(Angry Birds,[category#game,publisher#rovio],{(entertainment),(casual)})
(Microsoft Excel,[category#office],{(office),(spreadsheet),(productivity)})

Query Arrays In Pig

In our first Pig-based analysis we again find all apps that are tagged with office. For this we use the FLATTEN operator to convert the tags bag into tuples:

grunt> APPS_FLAT = FOREACH APPS GENERATE name, attributes, FLATTEN(tags) AS tag;
grunt> OFFICE_APPS = FILTER APPS_FLAT BY tag == 'office';
grunt> DUMP OFFICE_APPS;
(Microsoft Excel,[category#office],office)

Query Maps In Pig

In the second Pig example we again query our data for apps published by rovio:

grunt> ROVIO_APPS = FILTER APPS BY attributes#'publisher' == 'rovio';
grunt> dump ROVIO_APPS;
(Angry Birds,[category#game,publisher#rovio],{(entertainment),(casual)})

Conclusion

This article showed the basic concepts of processing nested data stored in the Avro file format with Hive and Pig. Of course there are other file formats (e.g. Apache Parquet) that also support nested data.

On the processing side there are also many other tools (e.g. Apache Drill) with built-in support for nested data structures.

Inspect Docker Containers With JQ

The docker inspect command provides useful information about Docker containers, but its huge output can be quite confusing. Since the output comes in JSON format, the jq tool can be used to get an overview of the output and to pick out interesting parts.

Show all keys in the output:

The following command shows all available keys in the inspect output for the container with id 89db758135a4:

docker inspect 89db758135a4 | jq .[0] | jq keys

[
  "AppArmorProfile",
  "Args",
  "Config",
  "Created",
  "Driver",
  "ExecDriver",
  "ExecIDs",
  "HostConfig",
  "HostnamePath",
  "HostsPath",
  "Id",
  "Image",
  "LogPath",
  "MountLabel",
  "Name",
  "NetworkSettings",
  "Path",
  "ProcessLabel",
  "ResolvConfPath",
  "RestartCount",
  "State",
  "Volumes",
  "VolumesRW"
]
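
The two jq invocations can also be combined into a single filter:

docker inspect 89db758135a4 | jq '.[0] | keys'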

Show values for a specific key:

The following command can be used to show information about the state of the container:

docker inspect 89db758135a4 | jq .[0] | jq .State
{
  "Running": false,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 0,
  "ExitCode": 127,
  "Error": "",
  "StartedAt": "2015-07-30T07:46:55.819253566Z",
  "FinishedAt": "2015-07-30T07:46:55.967718855Z"
}
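
A single value can be extracted by drilling further into the structure, for example the exit code of the container:

docker inspect 89db758135a4 | jq '.[0].State.ExitCode'
127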

Principles Of Big Data - Part 1: Your Input Data Is Immutable

Data storage systems are used to store information. This information may change over time. You have two basic options to reflect these changes in your storage system:

  1. Update the information
  2. Store the new information in addition to the existing information

Consider the following example:

In a social network a user has a list of followers. This list may be modified by two events:

  • follow event - A new follower is added to the list of followers.
  • unfollow event - A follower chooses to unfollow the user.

One possibility to store this information is to always store and update a list of current followers for each user. Each time a new follower is added or removed you update this list in your storage system. The second possibility is to store all follow and unfollow events. The current list of followers for a user is derived from this information.
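
To make solution two more concrete, here is a minimal Java sketch of such an event log. The code is hypothetical and purely illustrative; names and structure are not taken from any real system:

import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of solution two: follow/unfollow events are only ever appended;
// current or historical follower lists are derived by replaying the log.
public class FollowerLog {

    public enum Type { FOLLOW, UNFOLLOW }

    public static class Event {
        final Type type;
        final String follower;
        final Instant at;

        Event(Type type, String follower, Instant at) {
            this.type = type;
            this.follower = follower;
            this.at = at;
        }
    }

    private final List<Event> events = new ArrayList<>();

    // Events are appended in chronological order and never updated or deleted.
    public void append(Event event) {
        events.add(event);
    }

    // Derive the set of followers at an arbitrary point in time.
    public Set<String> followersAt(Instant when) {
        Set<String> followers = new LinkedHashSet<>();
        for (Event e : events) {
            if (e.at.isAfter(when)) {
                break; // later events do not affect the state at 'when'
            }
            if (e.type == Type.FOLLOW) {
                followers.add(e.follower);
            } else {
                followers.remove(e.follower);
            }
        }
        return followers;
    }
}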

The following comparison shows how the two solutions handle a number of scenarios:

  • Get the current list of followers for the user Arthur.
    Solution one: read the list of current followers.
    Solution two: derive the list of followers from the sequence of follow and unfollow events.
  • Get the number of followers for the user Arthur.
    Solution one: compute the length of the list of current followers.
    Solution two: derive the number of followers from the sequence of follow and unfollow events.
  • Get all users that were following Arthur two years ago.
    Solution one: not possible.
    Solution two: derive the list of followers from the sequence of follow and unfollow events.
  • Get a list of all users the user Ford has been following on 2000-01-01.
    Solution one: not possible.
    Solution two: derive the list of followed users from the sequence of follow and unfollow events.

As you can see, solution one offers a simple and efficient way to answer the questions you had in mind while implementing the system. But there are many possible questions you cannot answer with this data model.

Solution two requires much more storage and additional computation effort to answer even simple questions. This is a high price to pay, but you get a great reward: since you do not lose information due to data updates, you can answer questions that arise later in time, questions that you did not even think of when you were implementing your application.

Solution two provides another big advantage: since you never update your raw data, the danger of data corruption due to an application error is much lower!