Use Hive Serde for Fixed Length (index based) strings

Hive fixed length serde can be used in scenarios where we do not have any delimiters in out data file. Using RegexSerDe for fixed length strings is pretty straight:

CREATE EXTERNAL TABLE customers (userid STRING, fb_id STRING, twitter_id STRING, status STRING)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (“input.regex” = “(.{10})(.{10})(.{10})(.2})” )
LOCATION ‘path/to/data’;

The above query only expects exactly 32 characters in a line of text (10+10+10+2). The query can be customized to Ignore any characters at end after the useful data is read:

CREATE EXTERNAL TABLE customers ((userid STRING, fb_id STRING, twitter_id STRING, status STRING)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (“input.regex” = “(.{10})(.{10})(.{10})(.{2}).*” )
LOCATION ‘path/to/data’;

Thats all. Have fun. Cheers \m/

2 thoughts on “Use Hive Serde for Fixed Length (index based) strings

  1. Hi..thanks for such a nice explanation..!
    I have a scenario in which I’ll be receiving the fixed length record in a file but the records vary in length and they have to be extracted out from different positions.
    For eg. if I get a record like AAAABBBCCCCCCDD, I just need AAAA (from position 1 to 4) as my first column and then DD (from position 13 to 14) as my second column. I need only few columns, not all.
    Is there any way to work out the same?
    I know I can write the substring function to extract the same but I wanted to know whether this can be handled while creating external table only?
    Thanks in advance..!

Leave a Reply

Your email address will not be published. Required fields are marked *