Usually in a Data Lake we get source data as compressed JSON payloads (.gz
files). Additionally, the first level of JSON objects is often parsed into a map<string, string>
structure to speed up access to the first-level keys/values; the get_json_object
function can then be used to parse deeper JSON levels whenever required.
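For instance, a query against such a table might look like this (the `events` table and its column names here are illustrative, not from the actual schema):

```sql
-- Hypothetical source table: the first JSON level is pre-parsed
-- into a map<string, string> column named `data`
SELECT
  data['event_name']                         AS event_name,
  -- deeper levels are still raw JSON strings, so parse them on demand
  get_json_object(data['geo'], '$.country')  AS country
FROM events
WHERE data['event_name'] = 'purchase';
```

Accessing `data['event_name']` is a cheap map lookup, while `get_json_object` is only invoked for the nested payloads that actually need it.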
But it still makes sense to convert the data into the ORC format: it lets you distribute processing evenly across smaller chunks, and use indexes to optimize query execution on complementary columns such as event names, geo information, and other system attributes.
In this example we will load the source data, stored in a single 2.5 GB .gz
file, into the following ORC table: