hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-15682) Eliminate per-row based dummy iterator creation
Date Wed, 08 Feb 2017 01:22:41 GMT


Xuefu Zhang commented on HIVE-15682:

Hi [~Ferd], when I ran the query, I had two days' worth of data, which is about 25m rows. I just ran
the query again with about ten days' worth of data (roughly 130m rows), and the runtime was about
600s. I have 32 executors, each with 4 cores. The query spends most of its time on the second stage,
where sorting via a single reducer occurs.

I don't think the scale matters much as long as the query runs for some time (at least a few
minutes). Thus, you should be able to use TPC-DS data for this exercise.

> Eliminate per-row based dummy iterator creation
> -----------------------------------------------
>                 Key: HIVE-15682
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.2.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>             Fix For: 2.2.0
>         Attachments: HIVE-15682.patch
> HIVE-15580 introduced a dummy iterator per input row which can be eliminated. This is
> because {{SparkReduceRecordHandler}} is able to handle single key-value pairs. We can refactor
> this part of the code 1. to remove the need for an iterator and 2. to optimize the code path
> for per-(key, value) processing (instead of (key, value iterator) processing). It would also
> be great if we could measure the performance after the optimizations and compare it to the
> performance prior to HIVE-15580.
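The refactoring described above can be sketched as follows. This is a minimal, hypothetical illustration, not Hive's actual API: the `RowProcessor` interface and `CountingProcessor` class are invented names standing in for the role `SparkReduceRecordHandler` plays. The point is the allocation difference between wrapping every row's value in a single-element iterator and passing the pair directly.

```java
import java.util.Collections;
import java.util.Iterator;

// Hypothetical sketch of the two processing paths; names are illustrative,
// not taken from the Hive codebase.
public class DummyIteratorSketch {
    interface RowProcessor {
        // Iterator-based path: needed when a key maps to many values.
        void processRow(Object key, Iterator<Object> values);
        // Direct path: handles a single (key, value) pair without
        // allocating a per-row singleton iterator.
        void processRow(Object key, Object value);
    }

    static class CountingProcessor implements RowProcessor {
        long rows = 0;
        public void processRow(Object key, Iterator<Object> values) {
            while (values.hasNext()) { values.next(); rows++; }
        }
        public void processRow(Object key, Object value) { rows++; }
    }

    public static void main(String[] args) {
        CountingProcessor p = new CountingProcessor();
        // Before: a dummy single-element iterator is created for every row.
        p.processRow("k1", Collections.singletonList((Object) "v1").iterator());
        // After: the single pair is handed over directly, no iterator allocation.
        p.processRow("k2", "v2");
        System.out.println(p.rows); // prints 2
    }
}
```

With one such iterator created per input row, the saved allocations add up at the 130m-row scale mentioned above, which is why the issue also asks for a before/after measurement against HIVE-15580.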

This message was sent by Atlassian JIRA
