• Amazon,  AWS,  I/O,  Monitoring,  S3,  Storage

    S3 Monitoring #4 – Read Operations and Tables

    Knowing how Hive table storage is organized can help us extract some additional information for S3 read operations for each table.

    In most cases (and you can easily adapt this for your specific table storage pattern), tables are stored in a S3 bucket under the following key structure:


    For example, hourly data for orders table can be stored as follows:

  • Amazon,  AWS,  I/O,  Monitoring,  S3,  Storage

    S3 Monitoring Step #3 – Read Operations and File Types

    After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. Analyzing the object keys you can easily summarize information about compressed files such as .gz files.

    Later I will use the Hive metadata information to define whether files named like 00000_0 are uncompressed text or ORC files.

    select type, count(*) keys, count(distinct key) dist_keys, 
      sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
      sum(total_time_ms/1000) time_spent,
      sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
    from (
        when key like '%.gz' then 'Compressed .gz'
        else 'Other'
      end type,
    from s3_access_logs 
    where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
    group by type;

    Here is my sample output:

    type keys dist_keys rate_mb_sec time_spent terabytes_read
    Compressed .gz 21,535,003 7,411,981 3.8 504,318,631 1,812.8
    Other 6,345,354 647,040 18.5 1,465,848 25.9

    File Types and Object Size Bins

    Now let’s see the distribution of file types for each size bin:

    select type, size_type, count(*) keys, count(distinct key) dist_keys, 
      sum(bytes_sent)/sum(total_time_ms/1000)/(1024*1024) rate_mb_sec, 
      sum(bytes_sent)/(cast(1024 as bigint)*1024*1024*1024) terabytes_read
    from (
        when key like '%.gz' then 'Compressed .gz'
        else 'Other'
      end type,
        when total_size <= 1024*1024 then '<= 1 MB'
        when total_size <= 30*1024*1024 then '<= 30 MB'
        when total_size <= 100*1024*1024 then '<= 100 MB'
        else '> 100 MB'
      end size_type,
    from s3_access_logs 
    where event_dt ='{$EVENT_DT}' and operation='REST.GET.OBJECT') t
    group by type, size_type;

    Sample output:

    type size_type keys dist_keys rate_mb_sec terabytes_read
    Compressed .gz <= 1 MB 7,759,230 3,579,785 5.2 2.4
    Compressed .gz <= 30 MB 6,927,405 2,456,010 4.6 47.3
    Compressed .gz <= 100 MB 1,136,926 436,463 3.7 71.1
    Compressed .gz > 100 MB 5,711,442 939,723 3.7 1,691.9
    Other <= 1 MB 2,535,108 496,286 3.2 0.2
    Other <= 30 MB 1,152,742 90,472 22.7 1.7
    Other <= 100 MB 150,521 7,119 14.7 1.0
    Other > 100 MB 2,506,983 53,191 19.4 23.0

    See also, S3 Monitoring Step #2 – Read Operations.

  • Amazon,  AWS,  I/O,  Monitoring,  S3,  Storage

    S3 Monitoring Step #1 – Bucket Size and Number of Objects

    The first step in Amazon S3 monitoring is to check the current state of your S3 buckets and how fast they grow. You can easily get this information from the CloudWatch Management console, running a AWS CLI command or AWS SDK script.

    Bucket Size

    Here is an example of AWS CLI command to get the size of a bucket for every day within --start-time and --end-time date range:

    aws cloudwatch get-metric-statistics \
      --metric-name BucketSizeBytes --namespace AWS/S3 \
      --start-time 2018-10-01T00:00:00Z --end-time 2018-10-08T00:00:00Z \
      --statistics Maximum --unit Bytes --region us-east-1 \
      --dimensions Name=BucketName,Value=cloudsqale Name=StorageType,Value=StandardStorage \
      --period 86400 --query 'Datapoints[*].[Timestamp, Maximum]' \
      --output text | sort  | python cloudwatch_s3_metrics.py
  • Hive,  I/O,  ORC,  S3,  Storage

    Simple Hive Queries with Predicates – Compressed Text vs ORC Files

    Usually source data come as compressed text files into Hadoop and we often run SQL queries on top of them without any transformations.

    Sometimes these queries are simple single-table search queries returning a few rows based on the specified predicates, and people often complain about their performance.

    Compressed Text Files

    Consider the following sample table:

    CREATE TABLE clicks
       id STRING, 
       name STRING,
       referral_id STRING
    LOCATION 's3://cloudsqale/hive/dmtolpeko.db/clicks/';

    In my case s3://cloudsqale/hive/dmtolpeko.db/clicks contains single file data.txt.gz that has 27.3M rows and relatively small size of 5.3 GB.

  • Hive,  Hue,  Qubole

    Operation Timed Out for Hive Query in Hue and Qubole

    Sometimes when you execute a very simple query in Hive you can get “The operation timed out. Do you want to retry?” error in Hue or “Error: java.io.IOException: Time limit exceeded for local fetch task with SimpleFetchOptimization” error in Qubole.

    TL;DR: Set this before your query to resolve this:

    set hive.fetch.task.conversion = none;

    Consider the following query that times out in my case:

    SELECT * 
    FROM clicks
    WHERE event_dt >= '2018-09-17' AND event_name = 'WEB_EVENT' AND 
      application = 'CLOUD_APP' AND env = 'PRD' AND id = 5
    LIMIT 100;

    Instead of executing a MapReduce or Tez job Hive just decides that it can read the data directly from the storage, it takes too long time so a time out happens.

  • Amazon,  AWS,  I/O,  Logs,  S3,  Storage

    Collecting S3 Access Logs

    Amazon allows you to enable S3 access logging that you can use to monitor S3 performance: request rate, I/O workload, user and compute node level statistics, service delays and outages, and much more.

    S3 log files are quite small, uncompressed text files that in case of intensive S3 usage can be generated almost every second:

    2018-09-20 16:20:27     323567 2018-09-20-16-20-26-DE17FAE504462084
    2018-09-20 16:20:28     598192 2018-09-20-16-20-27-5F17C98DFA22DA31
    2018-09-20 16:20:29     618862 2018-09-20-16-20-28-4660E2CBCB0FB2C5
    2018-09-20 16:20:32     381675 2018-09-20-16-20-31-16549B7BABDA06AE
    2018-09-20 16:20:33     405131 2018-09-20-16-20-32-14AB46312C254397
    2018-09-20 16:20:34     587042 2018-09-20-16-20-33-385E799AFCEBAEE3
    2018-09-20 16:20:35     358275 2018-09-20-16-20-34-FA52E601A410E529
    2018-09-20 16:20:36     604080 2018-09-20-16-20-35-C02066EDF9026EF9

    So you can have 35K+ files generated per day (and there is no a sub-directory for each day), and if you are going to analyze S3 statistics for a long period of time (weeks, months), the performance of your Hive or Presto queries can be very low.

    Additionally there is often a lifecycle rule defined to keep logs only for 1-2 days.