hive_tables

Synopsis

Provide a list of HIVE tables required by the application. They will be created (or destroyed) by HADeploy.

Attributes

Each item of the list has the following attributes:

Name req? Description
name yes The name of the table
database yes The database this table belong to
external no Boolean. Equivalent to the EXTERNAL Hive DDL clause
fields no A list of fields definition describing the table's column. Refer to Field definition below
comment no An optional comment associated to the table
location no Equivalent to the LOCATION Hive DDL clause
properties no A map of properties. Equivalent to TBLPROPERTIES Hive DDL clause
stored_as no Specify the file format, such as SEQUENCEFILE, TEXTEFILE, RCFILE, ....
Equivalent to STORED AS Hive DDL clause [1]
input_format no Equivalent to STORED AS INPUT FORMAT '....' Hive DDL clause [1][2]
output_format no Equivalent to STORED AS OUTPUT FORMAT '....' Hive DDL clause [1][2]
delimited no A map of delimiter character. Equivalent to ROW FORMAT DELIMITED Hive DDL clause. Refer to Delimited row format below [3]
serde no Allow explicit definition of a serde'. Equivalent to ROW FORMAT SERDE Hive DDL clause [3]
serde_properties no A map of properties associated to the serde. Equivalent to WITH SERDEPROPERTIES Hive DDL clause
storage_handler no Allow definition of the storage handler. Equivalent to STORED BY Hive DDL clause
partitions no A list of fields definition describing the table's partitioning. Refer to Field definition below
clustered_by no Allow definition of a CLUSTERED BY Hive DDL clause. Refer to Table clustering below
skewed_by no Allow definition of a SKEWED BY Hive DDL Clause. Refer to Skewed values
alterable no Boolean. Allow most of ALTER TABLE commands to be automatically issued for table modification. Refer to Altering table below.
Default: no.
droppable no Boolean. Allow this table to be dropped and recreated if definition is modified.
Default value is yes if the table is external, nofor all other cases
no_remove no Boolean: Prevent this database to be removed when HADeploy will be used in REMOVE mode.
Default: no
ranger_policy no Definition of Apache Ranger policy bound to this table.
Parameters are same as hive_ranger_policies except than database, tables and columns should not be defined. The policy will apply on all columns of this table.
The policy name can be explicitly defined. Otherwise, a name will be generated as "_<database.table>_".
See example below for more information
when no Boolean. Allow conditional deployment of this item.
Default True

[1]: Storage format can be defined using two methods:

The two approaches are exclusive. Defining both stored_by and input/output_format will generate an error.

[2] input_format and output_format must be defined both if used.

[3] delimited and serde are exclusive and can't be defined both.

Field definition:

Here is the definition of a field element:

Field Attribute Req. Description
name yes The name of the field
type yes The type of the field. Note this file if used as-is, without interpretation from HADeploy.
This will allow to define Hive Complex Types here (arrays, maps, structs,union). See examples below.
comment no An optional comment

Delimited row format

The delimited map can hold the following values:

fields_terminated_by:
fields_escaped_by:
collection_items_terminated_by:
map_keys_terminated_by:
lines_terminated_by:
null_defined_as:

The characters must be expressed between single quote, and can be a regular character, an usual backslash escape sequence, or a unicode value. For example:

...
delimited:
  fields_terminated_by: ','
  map_keys_terminated_by: '\u0009'  # Same as '\t'
  lines_terminated_by: '\n'
  null_defined_as: '\u0001'

using octal notation (i.e. '\001') is not supported.

Table clustering

Here is the definition of a clustered_by element:

Attribute Req. Description
columns yes This list of columns to CLUSTER BY
nbr_buckets yes The number of buckets
sorted_by no A list of sort item element, as defined just below

Inner sort item element:

Attribute Req. Description
columns yes A list of column
direction no The direction: ASC or DESC. Default is ASC

Example:

  ...
  clustered_by:
    columns:
    - userid
    - page_url
    sorted_by:
    - { column: userid, direction: asc }
    - { column: page_url, direction: desc }
    nbr_buckets: 16

Will be interpreted as:

 CLUSTERED BY(userid,page_url) SORTED BY (userid asc, page_url desc) INTO 16 BUCKETS

Skewed values

Here is the definition of the skewed_by element:

Attribute Req. Description
columns yes A list of column
values yes A list of list of values
stored_as_directories no Boolean. Is skewed value stored as directory.
Default no

Example:

  ...
  skewed_by:
    columns:
    - key
    values:
    - [ 1 ]
    - [ 5 ]
    - [ 6 ]
    stored_as_directories: true

will be interpreted as:

SKEWED BY(key) ON(('1'),('5'),('6')) STORED AS DIRECTORIES

And:

  skewed_by:
    columns:
    - col1
    - col2
    values:
    - [ "s1", 1 ]
    - [ "s3", 3 ]
    - [ "s13", 13 ]
    - [ "s78", 78 ]

will be interpreted as:

SKEWED BY(col1,col2) ON(('s1', 1),('s3', 3),('s13', 13),('s78',78))

Altering table

The must tricky operation with tools like HADeploy is not table creation, but how it must behave on existing table evolution, specially if theses table already contains some data.

In case of table schema update, operation can be classified in several categories:

Most if not all ALTER TABLE commands will only modify Hive's metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the new metadata definition.

Database migration.

Database migration is a complex subject, as it involve not only modifying the database schema, but also adjusting existing data to comply for the new schema.

In some case, like adding a new column, it could be quite simple. But it is generally more involving.

Just think of a simple use case: Breaking a full_name field (content: i.e. 'Sylvester STALLONE') in two fields: first_name ('Sylvester') and last_name ('STALLONE'). It is obvious you will need some application specific code to be executed to perform this.

This problem has been addressed by some tools in the RDBMS fields. But there is no miracle. These tools are in fact more frameworks which globally act the following way:

Currently, HADeploy do not provide such solution yet. But, under the hood, all HIVE operation are performed by a special tools: jdchive. This tool has been designed with this migraton pattern in mind. Check here

Table ownership

As there is no command such as 'ALTER TABLE SET USER...' the database owner will be the account under which the table creation commands was launched during database creation.

This account can be set be the userattribute of the hive_relay. If not, it will be the ssh_user attached to the host used for relaying (hive_relay.host).

Once the table is created, there is no way to change this table ownership.

Example:

hive_tables:
- name: testSimple
  database: jdctest1
  comment: "A first, simple test table"
  location: "/tmp/xxx"
  fields:
  - name: fname
    type: string
    comment: "First name"
  - name: lname
    type: string
    comment: "Last name"
- name: testRcFile
  database: jdctest1
  comment: "A RCFILE table"
  fields: [ { name: fname, type: string, comment: 'The first name' }, { name: lname, type: string } ]
  stored_as: RCFILE

Will be interpreted, for creation as:

CREATE  TABLE jdctest1.testSimple ( fname string COMMENT 'First name', lname string COMMENT 'Last name' ) COMMENT 'A first, simple test table' LOCATION 'hdfs://mycluster/tmp/xxx'
CREATE  TABLE jdctest1.testRcFile ( fname string COMMENT 'The first name', lname string ) COMMENT 'A RCFILE table' STORED AS RCFILE

And:

hive_tables:
- name: testPartitions
  database: jdctest1
  comment: "A table with partitions"
  fields:
  - name: viewTime
    type: INT
  - name: userid
    type: BIGINT
  - name: page_url
    type: STRING
  - name: referrer_url
    type: STRING
  - name: ip
    type: STRING
    comment: "IP Address of the User"
  partitions:
  - name: dt
    type: STRING
  - name: country
    type: STRING
  stored_as: SEQUENCEFILE
  state: present

Will be interpreted, for creation as:

CREATE  TABLE jdctest1.testPartitions ( viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User' ) COMMENT 'A table with partitions' PARTITIONED BY ( dt STRING, country STRING ) STORED AS SEQUENCEFILE

And:

hive_tables:
- name: testSerde
  database: jdctest1
  comment: "Serde test"
  fields:
  - { name: host, type: STRING }
  - { name: identity, type: STRING }
  - { name: theuser, type: STRING }
  serde: "org.apache.hadoop.hive.serde2.RegexSerDe"
  serde_properties:
    input.regex: "([^ ]*) ([^ ]*) ([^ ]*)"
  state: present
  alterable: true

Will be interpreted, for creation as:

CREATE  TABLE jdctest1.testSerde ( host STRING, identity STRING, theuser STRING) COMMENT 'Serde test' ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex'='([^ ]*) ([^ ]*) ([^ ]*)')

Complex Types example

Here, the field value is not a simple scalar one:

- name: testComplex
  database: jdctest1
  external: true
  fields:
  - { name: compressed, type: boolean }
  - { name: value, type: "struct<contentType:string, message:string, sender:string, properties: array<struct<key:string,value:int>>, type:string>" }
  - { name: timestamp, type: string }

And same example, but expressed differently:

- name: testComplex
  database: jdctest1
  external: true
  fields:
  - { name: compressed, type: boolean }
  - name: value
    type: |
      struct<
        contentType:string, 
        message:string, 
        sender:string, 
        properties: array<
          struct<
            key:string,
            value:int
          >
        >, 
        type:string
      >
  - { name: timestamp, type: string }

Apache Ranger example

Following is an illustration of Apache Ranger policy association: The table is created with select and update permissions for all users of the 'users' group. And user 'sa' can also create new indexes.

hive_tables:
- name: testranger
  database: jdctest1
  comment: "A first, simple test table"
  location: "/tmp/xxx"
  fields:
  - name: fname
    type: string
    comment: "First name"
  - name: lname
    type: string
    comment: "Last name"
  ranger_policy:
    permissions:
    - groups:
      - users
      accesses: 
      - select
      - update
    - users:
      - sa
      accesses:
      - index

HBase table mapping

A frequent use case is the mapping of HBase table by HIVE, to ease querying. This can be achieved using HADeploy.

Example:

Provided the following HBase table definition:

hbase_namespaces:
- name: test2

hbase_tables:
  - name: test2a
    namespace: test2
    column_families:
    - name: id
    - name: job

(Refer to hbase_namespaces and hbase_tables for more informations

One can easely map an external HIVE table:

hive_tables
- name: testHBase
  database: jdctest1
  fields:
  - { name: rowkey, type: string, comment: "The row key" }
  - { name: number, type: int }
  - { name: prefix, type: string }
  - { name: fname, type: string }
  - { name: lname, type: string }
  - { name: company, type: string }
  - { name: title, type: string }
  external: true
  storage_handler: "org.apache.hadoop.hive.hbase.HBaseStorageHandler"
  properties:
    hbase.table.name: "test2:test2a"
  serde_properties:
    hbase.columns.mapping: ":key,id:reg,id:prefix,id:fname,id:lname,job:cpny,job:title"
  state: present

Will be interpreted, for creation as:

CREATE EXTERNAL TABLE jdctest1.testHBase ( rowkey string COMMENT 'The row key', number int, prefix string, fname string, lname string, company string, title string ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key,id:reg,id:prefix,id:fname,id:lname,job:cpny,job:title') TBLPROPERTIES ( 'hbase.table.name'='test2:test2a')