spark-user mailing list archives

From CPC <>
Subject parquet late column materialization
Date Sun, 18 Mar 2018 17:02:27 GMT
Hi everybody,

I am trying to understand how Spark reads Parquet files, but I am a little
confused. I have a table with 4 columns named businesskey, transactionname,
request and response. The request and response columns are huge (10-50 KB).
When I execute a query like
"select * from mytable where businesskey='key1'"
it reads the whole table (2.4 TB) even though it returns 1 row. If I execute
"select transactionname from mytable where businesskey='key1'"
it reads 390 GB. I expected the two queries to read the same amount of data,
since both filter on businesskey. In some databases this is called late
materialization (don't read the whole row if the predicate eliminates it).
Why does the first query read all of the data? Do you have any idea? The
Spark version is 2.2 on Cloudera 5.12.
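
To make clear what I mean by late materialization, here is a toy sketch in
Python. It is not Spark code and says nothing about how Spark actually reads
Parquet; the per-value sizes, the row count and the function names are all
made up for illustration. It only compares the bytes touched when every
projected column is read for every row against reading the filter column
first and fetching the other columns only for matching rows.

```python
# Toy model of eager vs. late materialization over a columnar table.
# Not Spark code: every name and size below is a hypothetical assumption.

FILTER_COL = "businesskey"
ALL_COLS = ["businesskey", "transactionname", "request", "response"]

# Assumed average size in bytes of one value in each column.
SIZES = {"businesskey": 20, "transactionname": 30,
         "request": 30_000, "response": 30_000}

# 1000 rows with distinct keys, so exactly one row matches 'key1'.
ROWS = [{"businesskey": f"key{i}"} for i in range(1000)]


def eager_bytes(rows, projected):
    """Read the filter column and every projected column for every row."""
    cols = set(projected) | {FILTER_COL}
    per_row = sum(SIZES[c] for c in cols)
    return per_row * len(rows)


def late_bytes(rows, projected, key):
    """Read the filter column for every row, the rest only for matches."""
    rest = sum(SIZES[c] for c in projected if c != FILTER_COL)
    total = 0
    for row in rows:
        total += SIZES[FILTER_COL]      # predicate column is always read
        if row[FILTER_COL] == key:
            total += rest               # other columns only on a match
    return total


if __name__ == "__main__":
    for name, cols in [("select *", ALL_COLS),
                       ("select transactionname", ["transactionname"])]:
        print(name, eager_bytes(ROWS, cols), late_bytes(ROWS, cols, "key1"))
```

In this model the two queries differ hugely under the eager strategy but
only by the size of one matching row under the late one, which is the
behaviour I expected.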

Thanks in advance...
