spark-user mailing list archives

From nguyen duc Tuan <newvalu...@gmail.com>
Subject Re: parquet late column materialization
Date Sun, 18 Mar 2018 23:56:26 GMT
You can use the EXPLAIN statement to see the optimized plan for each query (
https://stackoverflow.com/questions/35883620/spark-how-can-get-the-logical-physical-query-execution-using-thirft-hive
).
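
A minimal sketch of checking the plan (assuming mytable is already
registered as a table, as in your setup; the app name is just a
placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("explain-demo").getOrCreate()

    // explain(true) prints the parsed, analyzed, optimized and physical
    // plans. In the physical plan, look at the FileScan parquet node:
    // PushedFilters shows the predicates handed down to the reader, and
    // ReadSchema shows the columns it will actually read.
    spark.sql("select transactionname from mytable where businesskey='key1'")
      .explain(true)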

2018-03-19 0:52 GMT+07:00 CPC <achalil@gmail.com>:

> Hi nguyen,
>
> Thank you for the quick response. But what I am trying to understand is
> that in both queries the predicate evaluation requires only one column. So
> Spark does not actually need to read all of the columns in the projection
> if they are not used in the filter predicate. Just to give an example,
> Amazon Redshift has this kind of optimization:
> https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/
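>
> To illustrate the idea, a self-contained conceptual sketch of such a
> two-phase scan (toy arrays standing in for parquet column chunks; this is
> not Spark's actual reader):
>
>     // Phase 1: evaluate the predicate using only the filter column.
>     val businesskey = Array("key1", "key2", "key3", "key1")
>     val transactionname = Array("t1", "t2", "t3", "t4")
>     val matching = businesskey.indices.filter(i => businesskey(i) == "key1")
>
>     // Phase 2: materialize the projected column only for matching rows;
>     // the huge request/response columns are never touched.
>     val result = matching.map(i => transactionname(i))  // Vector(t1, t4)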
>
> Thanks..
>
>
> On Mar 18, 2018 8:09 PM, "nguyen duc Tuan" <newvalue92@gmail.com> wrote:
>
>> Hi @CPC,
>> Parquet is a columnar storage format, so if you want to read data from
>> only one column, you can do that without accessing all of your data.
>> Spark SQL includes a query optimizer (see
>> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html),
>> so it will optimize your query and produce an optimized plan to execute
>> it. Since your second query only needs data from 2 columns (businesskey
>> and transactionname), it reads less data, as you observed.
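>>
>> A quick way to see the column pruning (a sketch; the path is a
>> placeholder for your table's location, and spark is an existing
>> SparkSession):
>>
>>     import org.apache.spark.sql.functions.col
>>
>>     // The FileScan node in the printed plan lists ReadSchema, which for
>>     // this query should contain only businesskey and transactionname.
>>     val df = spark.read.parquet("/path/to/mytable")  // placeholder path
>>     df.filter(col("businesskey") === "key1")
>>       .select("transactionname")
>>       .explain()
>>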
>> Hope it helps.
>>
>> 2018-03-19 0:02 GMT+07:00 CPC <achalil@gmail.com>:
>>
>>> Hi everybody,
>>>
>>> I am trying to understand how Spark reads parquet files, but I am a
>>> little confused. I have a table with 4 columns named businesskey,
>>> transactionname, request, and response. The request and response columns
>>> are huge (10-50 KB each). When I execute a query like
>>> "select * from mytable where businesskey='key1'"
>>> it reads the whole table (2.4 TB) even though it returns 1 row. If I
>>> execute
>>> "select transactionname from mytable where businesskey='key1'"
>>> it reads 390 GB. I expected the two queries to read the same amount of
>>> data, since both filter on businesskey. In some databases this is called
>>> late materialization (don't read the whole row if the predicate
>>> eliminates it). Why is the first query reading the whole data? Do you
>>> have any idea? The Spark version is 2.2 on Cloudera 5.12.
>>>
>>> Thanks in advance...
>>>
>>
>>
