spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: Spark input size when filtering on parquet files
Date Fri, 27 May 2016 03:25:53 GMT
Hi,

The #bytes Spark prints in the web UI is accumulated from
InputSplit#getLength, which is just the length of each split (file).
Therefore, I'm afraid this metric does not reflect the actual number of
bytes read from parquet.
If you need that metric, you have to use other tools such as iostat or
something similar.
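To illustrate what the UI is summing, here is a minimal sketch (the Split
record and the numbers are hypothetical, modeled on your four 2.4 MB parts):
each task reports its split's length as input size, so the stage total is the
sum of split lengths regardless of how many bytes the parquet reader actually
decodes after row-group filtering.

```java
import java.util.List;

public class InputSizeSketch {
    // Hypothetical stand-in for an InputSplit: only the length matters here.
    record Split(String path, long length) {}

    public static void main(String[] args) {
        // Four parquet parts of ~2.5 MB each, as in the example stage.
        List<Split> splits = List.of(
            new Split("part-00000.parquet", 2_500_000L),
            new Split("part-00001.parquet", 2_500_000L),
            new Split("part-00002.parquet", 2_500_000L),
            new Split("part-00003.parquet", 2_500_000L));

        // The UI's stage input size is the sum of getLength() over all splits,
        // even when min/max statistics let 3 of the 4 tasks skip every row.
        long totalBytes = splits.stream().mapToLong(Split::length).sum();
        System.out.println(totalBytes);
    }
}
```

So filtering with row-group statistics changes how many records come out,
but not the reported input size.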

// maropu


On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker <dennis.hunziker@gmail.com>
wrote:

> Hi all
>
>
>
> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to
> find out about the improvements made in filtering/scanning parquet files
> when querying for tables using SparkSQL and how these changes relate to the
> new filter API introduced in Parquet 1.7.0.
>
>
>
> After checking the usual sources, I still can’t make sense of some of the
> numbers shown on the Spark UI. As an example, I’m looking at the collect
> stage for a query that’s selecting a single row from a table containing 1
> million numbers using a simple where clause (i.e. col1 = 500000) and this
> is what I see on the UI:
>
>
>
> 0 SUCCESS ... 2.4 MB (hadoop) / 0
>
> 1 SUCCESS ... 2.4 MB (hadoop) / 250000
>
> 2 SUCCESS ... 2.4 MB (hadoop) / 0
>
> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>
>
>
> Based on the min/max statistics of each of the parquet parts, it makes
> sense not to expect any records for 3 out of the 4, because the record I’m
> looking for can only be in a single file. But why is the input size above
> shown as 2.4 MB, totaling up to an overall input size of 9.7 MB for the
> whole stage? Isn't it just meant to read the metadata and ignore the
> content of the file?
>
>
>
> Regards,
>
> Dennis
>



-- 
---
Takeshi Yamamuro
