Usually source data come as compressed text files into Hadoop and we often run SQL queries on top of them without any transformations.
Sometimes these queries are simple single-table search queries returning a few rows based on the specified predicates, and people often complain about their performance.
Compressed Text Files
Consider the following sample table:
CREATE TABLE clicks ( id STRING, name STRING, ... referral_id STRING ) STORED AS TEXTFILE LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';
In my case s3://cloudsqale/hive/dmtolpeko.db/clicks
contains single file data.txt.gz
that has 27.3M rows and relatively small size of 5.3 GB.