drill-user mailing list archives

From sreeparna bhabani <bhabani.sreepa...@gmail.com>
Subject Re: Suggestion needed for UNION ALL performance in Apache drill
Date Tue, 28 Apr 2020 15:16:56 GMT
Hi Paul and Team,

As you suggested, I have created a Jira ticket:
https://issues.apache.org/jira/browse/DRILL-7720.
I have added the details you asked for to the ticket. Please have a look.
As the data is sensitive, I am trying to create a dummy dataset and will
provide it once it is ready.

Thanks,
Sreeparna Bhabani

On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
bhabani.sreeparna@gmail.com> wrote:

>
> ---------- Forwarded message ---------
> From: Paul Rogers <par0328@yahoo.com>
> Date: Thu, 23 Apr 2020, 23:59
> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
> To: <user@drill.apache.org>, sreeparna bhabani <
> bhabani.sreeparna@gmail.com>
> Cc: <arun.ns@gmail.com>, <navin.bhawsar@gmail.com>
>
>
> Hi Sreeparna,
>
>
> As suggested in the earlier e-mail, we would not expect to see different
> performance in UNION ALL than in a simple scan. Clearly you've found some
> kind of issue. The next step is to investigate that issue, which is a bit
> hard to do over e-mail.
>
>
> Please file a JIRA ticket to describe the issue and provide a reproducible
> test case including query and data. If your data is sensitive, please
> create a dummy data set, or use the provided TPC-H data set to recreate the
> issue. We can then take a look to see what might be happening.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>
> Hi Team,
>
> In addition to the mail below, I have another finding. Please consider the
> scenarios below. The first 2 scenarios give the expected performance, but
> we are not getting the expected performance for the 3rd scenario, which is
> a UNION ALL over 2 different types of datasets.
>
> *Scenario 1 - Parquet UNION ALL Parquet*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 5 secs
> UNION ALL of both queries execution time - 10 secs
>
> *Scenario 2 - DB query UNION ALL DB query*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 5 secs
> UNION ALL of both queries execution time - 10 secs
>
> *Scenario 3 - Parquet UNION ALL DB query*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 1 sec
> UNION ALL execution time - 20 secs
> Ideally the execution time should not be more than 6 secs.
>
> May I request you to check whether the UNION ALL performance of 3rd
> scenario is expected with different dataset types.
>
> Please suggest if there is any specific way to bring down the execution
> time of 3rd scenario.
>
> Thanks in advance.
>
> Sreeparna Bhabani
>
>
>
> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <bhabani.sreeparna@gmail.com>
> wrote:
>
> Hi Team,
>
> Apart from the below issue I have another question.
>
> Is there any relation between the number of row groups and performance?
>
> In the query below, the number of files is 13 and numRowGroups is 69. Does
> the UNION ALL take more time when the number of row groups is that high?
>
> Please note that the individual Parquet query takes 6 secs, but the UNION
> ALL takes 20 secs. Details are given in the trailing mail.
>
> Thanks,
> Sreeparna Bhabani
>
> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <dishari.5681@gmail.com>
> wrote:
>
> Hi Paul,
>
> Please find the details below. We are using 2 drillbits. Heap memory 16 G,
> Max direct memory 32 G. One query selects from Parquet. Another one selects
> from JDBC. The Parquet file size is 849 MB. The query is a UNION ALL. There
> is no sorting.
>
> Single parquet query-
> Total execution time - 6.6 sec
> Scan time - 0.152 sec
> Screen wait time - 5.3 sec
>
> Single JDBC query-
> Total execution time - 0.261 sec
> JDBC scan - 0.152 sec
> Screen wait - 0.004 sec
>
>
> Union all query -
> Execution time - 21.118 sec
> Screen wait time - 5.351 sec
> Parquet scan - 15.368 sec
> Unordered receiver wait time - 14.41 sec
>
> Thanks,
> Sreeparna Bhabani
>
>
> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <par0328@yahoo.com> wrote:
>
> Hi Sreeparna,
>
>
> The short answer is it *should* work: a UNION ALL is simply an append. (Be
> sure you are not using a plain UNION as that needs to do more work to
> remove duplicates.)
>
>
> Since you are seeing unexpected behavior, we may have some kind of issue
> to investigate and perhaps fix. Always hard to do over e-mail, but let's
> see what we can do.
>
>
> The first question is to understand the full query: are you doing more
> than a simple scan of two files and a UNION ALL? Are there sorts or joins
> involved?
>
>
> The best place to start to investigate performance issues is the query
> profile, which it looks like you are doing. What is the time for the scans
> if you run each of the two scans separately? You said that they take 8 and
> 1 seconds. Is that for the whole query or just the scan operators?
>
>
> Then, when you run the UNION ALL, again looking at the scan operators, is
> there any difference in run times? If the scans take longer, that is one
> thing to investigate. If the scans take the same amount of time, what other
> operator(s) are taking the rest of the time? Your note suggests that it is
> the scan taking the time. But, there should be two scan operators: one for
> each file. How is the time divided between them?
>
>
> How large are the data files? Using what storage system? How many
> Drillbits? How much memory?
>
>
> Thanks,
>
> - Paul
>
>
>
> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>
> Hi Team,
>
> I am reaching out about a specific problem with UNION ALL. We have one
> UNION ALL statement that combines 2 queries. The individual queries take
> 8 secs and 1 sec respectively, but the UNION ALL takes 30 secs.
> PARQUET_SCAN_ROW_GROUP takes the most time. The Apache Drill version is
> 1.17.
>
> Please help to suggest how to improve this UNION ALL performance. We are
> using parquet file.
>
> Thanks,
> Sreeparna Bhabani
>
>
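
For readers following the thread, Paul's point that a UNION ALL is simply an
append, while a plain UNION does extra work to remove duplicates, can be
illustrated with a small Python sketch. This is only a model of the semantics,
not Drill code: the function names and the tuple-based row format are
hypothetical, and keeping first-seen order in the UNION case is a choice made
for readability (SQL does not promise any order).

```python
from typing import Iterable, List, Tuple

Row = Tuple  # hypothetical row representation: one tuple per record


def union_all(left: Iterable[Row], right: Iterable[Row]) -> List[Row]:
    """Append semantics: every row from both inputs, duplicates kept."""
    return list(left) + list(right)


def union_distinct(left: Iterable[Row], right: Iterable[Row]) -> List[Row]:
    """Plain UNION semantics: the same rows with duplicates removed.

    The 'seen' set is the extra work (memory plus one lookup per row)
    that UNION ALL does not need.
    """
    seen = set()
    result = []
    for row in list(left) + list(right):
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Made-up sample rows standing in for the Parquet and JDBC inputs.
parquet_rows = [(1, "a"), (2, "b")]
jdbc_rows = [(2, "b"), (3, "c")]

print(union_all(parquet_rows, jdbc_rows))       # keeps the repeated (2, "b")
print(union_distinct(parquet_rows, jdbc_rows))  # drops the repeated (2, "b")
```

Because the append path does no per-row bookkeeping, it is reasonable to
expect a UNION ALL to cost about as much as reading its two inputs.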

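The comparison Paul asks for (scan time when each query runs alone versus scan
time inside the UNION ALL) can be worked through with the profile figures
quoted in the thread. The sketch below is plain Python arithmetic, not a Drill
API: the class and function names are hypothetical, and the baseline that a
UNION ALL should cost roughly the sum of its inputs is an assumption drawn
from scenarios 1 and 2 above (5 secs + 5 secs = 10 secs).

```python
from dataclasses import dataclass


@dataclass
class ProfileTiming:
    """Timings in seconds, copied by hand from a query profile."""
    total: float
    scan: float


def expected_union_all_total(*inputs: ProfileTiming) -> float:
    """Baseline assumption: UNION ALL costs about the sum of its inputs."""
    return sum(p.total for p in inputs)


def scan_slowdown(standalone: ProfileTiming, in_union: ProfileTiming) -> float:
    """How many times longer the scan takes inside the UNION ALL."""
    return in_union.scan / standalone.scan


# Figures reported in the thread for the three runs.
parquet_alone = ProfileTiming(total=6.6, scan=0.152)
jdbc_alone = ProfileTiming(total=0.261, scan=0.152)
# For the UNION ALL run, 'scan' holds the reported Parquet scan time.
union_all_run = ProfileTiming(total=21.118, scan=15.368)

expected = expected_union_all_total(parquet_alone, jdbc_alone)
unexplained = union_all_run.total - expected
slowdown = scan_slowdown(parquet_alone, union_all_run)

print(f"expected total under the baseline: {expected:.3f} s")
print(f"time not explained by the inputs:  {unexplained:.3f} s")
print(f"Parquet scan slowdown factor:      {slowdown:.1f}x")
```

With these figures, most of the gap sits in the Parquet scan rather than in
the JDBC side, which is the kind of detail a reproducible test case in the
Jira ticket would help explain.
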
-- 

Thanks n Regards,
*Sreeparna Bhabani*
