• Hive,  Pig,  Tez

    Apache Hive/Pig on Tez – Long Running Tasks and Their Failed Attempts – Analyzing Performance and Finding Bottlenecks (Insufficient Parallelism) using Application Master Logs

    Apache Tez is the main distributed execution engine for Apache Hive and Apache Pig jobs.

    Tez represents a data flow as a DAG (directed acyclic graph) that consists of a set of vertices connected by edges. Vertices represent data transformations, while edges represent the movement of data between vertices.

    For example, the following Hive SQL query:

    SELECT u.name, sum_items 
    FROM
     (
       SELECT user_id, SUM(items) sum_items
       FROM sales
       GROUP BY user_id
     ) s
     JOIN
       users u
     ON s.user_id = u.id
    

    and its corresponding Apache Pig script:

    sales = LOAD 'sales' USING org.apache.hive.hcatalog.pig.HCatLoader();
    users = LOAD 'users' USING org.apache.hive.hcatalog.pig.HCatLoader();
    
    sales_agg = FOREACH (GROUP sales BY user_id)
      GENERATE 
        group.user_id as user_id,
        SUM(sales.items) as sum_items;
        
    data = JOIN sales_agg BY user_id, users BY id;
    

    can be represented as the following DAG in Tez:
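
    A rough sketch of how the plan for this query might look (the vertex names and layout below are illustrative assumptions, not taken from the actual application):

    Map 1 (scan sales) ---> Reducer 2 (GROUP BY user_id) ---+
                                                            +---> Reducer 3 (JOIN)
    Map 4 (scan users) -------------------------------------+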

    In my case, the job ran for almost 5 hours:

    Why did it take so long to run the job? Is there any way to improve its performance?

  • Data Skew,  Distributed,  Pig,  Tez

    Reduce Number of Output Files for Skewed Data – ORDER in Apache Pig – Sampler and Weighted Range Partitioner to Balance Reducers

    One of our event tables is very large: it contains billions of rows per day. Since the analytics team is often interested in specific events only, it makes sense to process the raw events and generate a partition for every event type. Then, when a data analyst runs an ad-hoc query, it reads data for the required event type only, which improves query performance.

    The problem is that 3,000 map tasks are launched to read the daily data, and there are 250 distinct event types, so the mappers can produce 3,000 * 250 = 750,000 files per day. That’s too many files.
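
    A minimal Pig sketch of this layout, with assumed table, column, and parallelism values ('events', 'events_by_type', event_type, PARALLEL 100). With a plain map-only STORE, every map task writes its own file into each event type partition it touches; adding an ORDER BY on the partition column, as the post title suggests, inserts a sampling job and a weighted range partitioner, so each event type is concentrated in only a few reducers:

    raw_events = LOAD 'events' USING org.apache.hive.hcatalog.pig.HCatLoader();

    -- Map-only store: each of the 3,000 map tasks writes a separate file into every
    -- event_type partition it touches, up to 3,000 * 250 = 750,000 files per day
    -- STORE raw_events INTO 'events_by_type' USING org.apache.hive.hcatalog.pig.HCatStorer();

    -- ORDER BY the partition column adds a reduce stage: Pig first runs a sampling
    -- job and then uses a weighted range partitioner, so rows of one event type land
    -- in a few balanced reducers even when the data is skewed, and the output file
    -- count drops to roughly the number of reducers
    sorted_events = ORDER raw_events BY event_type PARALLEL 100;
    STORE sorted_events INTO 'events_by_type' USING org.apache.hive.hcatalog.pig.HCatStorer();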