# Learning Hive (Pt. 4): Data Types

This post covers different data types and file formats supported by Hive.

### Data Types

The following primitive data types are supported:

• TINYINT: 1 byte signed integer
• SMALLINT: 2 bytes
• INT: 4 bytes
• BIGINT: 8 bytes
• BOOLEAN
• FLOAT
• DOUBLE
• STRING: Single or double quotes can be used for literals.
• TIMESTAMP: Integer, float, or string.
• Integer: For seconds from Unix epoch.
• Float: Seconds from Unix epoch and nanoseconds.
• String: JDBC-compliant java.sql.Timestamp format convention, i.e. YYYY-MM-DD hh:mm:ss.fffffffff
• BINARY: array of bytes. Used to include arbitrary bytes and prevent Hive from attempting to parse them.

As you can see, Hive supports most basic primitive data types conventionally found in relational databases. Moreover, it helps to remember that these data types are implemented in Java, so their behaviors will be similar to their Java counterparts.

NOTE: Not mentioned in the Programming Hive book, but the types DECIMAL and DATE are introduced since Hive 0.13.0. In addition, the book claimed “Hive does not support character arrays with maximum-allowed lengths, as is common in other SQL dialects” but VARCHAR type, introduced in Hive 0.12.0, does exactly that.

Besides primitive data types, Hive supports the following collection data types:

• STRUCT: Analogous to a C struct or POJO (Plain Old Java Object). The elements can be accessed using the DOT (.) notation.
• Example: Declaration -> struct<name:string,id:int>. Literal -> struct('John',1).
• MAP: A collection of key-value tuples. The elements can be accessed using array notation, e.g. persons[‘John’].
• Example: Declaration -> map<string,int>. Literal -> map('John',1).
• ARRAY: Ordered sequences of the same type. The elements can be accessed using array notation, e.g. person[2].
• Example: Declaration -> array<string>. Literal -> array('John','Peter').

Relational databases don’t usually support such collection types because they tend to break normal form. In Hive/Hadoop, sacrificing normal form is pretty common as it can give benefit of higher processing throughput, especially with large amount of data (tens of terabytes).

### Text File Formats

Hive can use comma-separated values (CSV) or tab-separated values (TSV) text file format. A Hive table declaration with all row format specified (with default values, however) looks like this: