How to view the contents of Parquet files on S3/HDFS from a Hadoop cluster using parquet-tools

Sometimes we need to quickly check the schema of a Parquet file, or head the file for some sample records.

Here are some straightforward ways to check the contents of a Parquet file, whether it lives on local disk or on S3/HDFS.

Get the parquet-tools jar

  • Download jar, or
  • Build jar

Download jar

Download the jar from the Maven repo, or any location of your choice. At the time of this post, the parquet-tools jar is available here.

If you’re logged in to the Hadoop box:

wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.9.0/parquet-tools-1.9.0.jar

This link may stop working at some point, so grab the latest link from the Maven repo.
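If the direct link goes stale, you can rebuild it yourself: Maven Central lays artifacts out in a predictable group/artifact/version structure, so only the version number changes between releases. A small sketch (the `VERSION` variable and the commented-out `wget` are just illustrations):

```shell
# Maven Central layout: <group path>/<artifact>/<version>/<artifact>-<version>.jar
VERSION=1.9.0
URL="https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/${VERSION}/parquet-tools-${VERSION}.jar"
echo "$URL"
# wget "$URL"   # uncomment to actually download the jar
```

Bump `VERSION` to whatever release is current and the URL stays valid.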

Build jar

If you are unable to download the jar, you can also build it from source. Clone the parquet-mr repo and build the jar from source:

git clone https://github.com/apache/parquet-mr

cd parquet-mr

mvn clean package

Note: you need Maven on your box to build from source. The tools jar typically ends up under the parquet-tools/target/ directory of the repo.

 

Read parquet file

You can use the following commands to view the contents of a Parquet file.

Check the schema of an S3/HDFS file:

hadoop jar parquet-tools-1.9.0.jar schema s3://path/to/file.snappy.parquet

hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet

Head file contents:

hadoop jar parquet-tools-1.9.0.jar head -n5 s3://path/to/file.snappy.parquet

Check the contents of a local file:

java -jar parquet-tools-1.9.0.jar head -n5 /tmp/path/to/file.snappy.parquet

java -jar parquet-tools-1.9.0.jar schema /tmp/path/to/file.snappy.parquet
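The only difference between the remote and local invocations above is the launcher: `hadoop jar` for S3/HDFS paths, plain `java -jar` for local files. A tiny wrapper can pick the right one from the path scheme. This is a sketch, not part of parquet-tools itself: the `pt` function is hypothetical, it assumes the jar sits in the current directory, and it echoes the command instead of running it so you can see what it would do (drop the `echo`s to execute for real):

```shell
# Hypothetical helper: dispatch to `hadoop jar` or `java -jar`
# based on whether the file path looks remote or local.
JAR=parquet-tools-1.9.0.jar

pt() {
  cmd=$1
  file=$2
  case "$file" in
    s3://*|s3a://*|hdfs://*) echo hadoop jar "$JAR" "$cmd" "$file" ;;   # remote: run through Hadoop
    *)                       echo java -jar "$JAR" "$cmd" "$file" ;;    # local: plain JVM is enough
  esac
}

pt schema hdfs://path/to/file.snappy.parquet
pt head /tmp/path/to/file.snappy.parquet
```

This keeps the subcommands (`schema`, `head`, etc.) identical regardless of where the file lives.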

More commands (such as cat, meta, and dump) are listed in the built-in help:

hadoop jar parquet-tools-1.9.0.jar --help

Hope it is helpful.
