spark-user mailing list archives

From java8964 <>
Subject From Spark web UI, how to prove the Parquet column pruning is working
Date Mon, 09 Mar 2015 19:15:15 GMT
Hi,

Currently most of the data in our production is stored as Avro + Snappy. I want to test the benefits of storing the data in Parquet format instead. I changed our ETL to generate Parquet output instead of Avro, and I want to run a simple SQL query in Spark SQL to verify the benefit from Parquet.
I generated the same dataset in both Avro and Parquet in HDFS, and loaded both into Spark SQL. Now I run the same query, like "select column1 from src_table_avro/parquet where column2 = xxx", and I can see that the job on the Parquet data runs much faster. The test files for both formats are around 930M. The Avro job generated 8 tasks to read the data, with a median duration of 21s, vs. the Parquet job, which generated 7 tasks with a median duration of 0.4s.
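For reference, this is roughly how I set up the Parquet side of the comparison (a Spark 1.2/1.3-era sketch for the spark-shell; the HDFS path and column names are illustrative):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // Load the Parquet copy of the dataset and register it for SQL.
  val parquetData = sqlContext.parquetFile("hdfs:///data/src_table_parquet")
  parquetData.registerTempTable("src_table_parquet")

  // The test query: one projected column, one filtered column,
  // out of a schema with 100+ columns.
  val result = sqlContext.sql(
    "SELECT column1 FROM src_table_parquet WHERE column2 = 'xxx'")
  result.collect()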
Since the dataset has more than 100 columns, I can see that the Parquet file really gives fast reads. But my question is: on the Spark UI, both jobs show about 900M as the input size (and 0 for the rest), so how do I know that column pruning really works? I assume that is why the Parquet file can be read so fast, but is there any statistic on the Spark UI that can prove it to me? Something like: the total input file size is 900M, but only 10M was actually read, due to column pruning? That way, if column pruning does not kick in for some kind of SQL query against Parquet, I can identify it right away.
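One plan-level check (not the byte-level statistic I am asking about) would be to print the physical plan; if pruning applies, the Parquet scan node should list only the requested columns (column1, column2) rather than all 100+. A sketch, with the same illustrative names as above:

  sqlContext.sql(
    "SELECT column1 FROM src_table_parquet WHERE column2 = 'xxx'").explain()

But that only shows what the planner intends, not how many bytes were actually read at runtime.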