spark-user mailing list archives

From "Zheng, Xudong" <dong...@gmail.com>
Subject Re: Parquet Hive table become very slow on 1.3?
Date Tue, 31 Mar 2015 15:49:24 GMT
Thanks Cheng!

Setting 'spark.sql.parquet.useDataSourceApi' to false resolves my issue, but
PR 5231 does not seem to. Not sure whether I did something else wrong ...
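
For reference, a minimal sketch of applying that setting from spark-shell. It
assumes a Spark 1.3 session where `sqlContext` is already defined; the table
name is hypothetical and only for illustration.

```scala
// Assumes spark-shell on Spark 1.3, where `sqlContext` is predefined.
// Fall back to the old Parquet code path for this session.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

// The same setting expressed as a SQL command.
sqlContext.sql("SET spark.sql.parquet.useDataSourceApi=false")

// Re-run the slow query; `my_parquet_table` is a hypothetical table name.
sqlContext.sql("SELECT COUNT(*) FROM my_parquet_table").collect()
```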

BTW, we are actually very interested in the schema merging feature in
Spark 1.3, so both of these solutions will disable this feature, right? It
seems that Parquet metadata is stored in a file named _metadata in the
Parquet file folder (each folder is a partition, as we use a partitioned
table), so why do we need to scan all Parquet part-files? Is there any other
solution that could keep the schema merging feature at the same time? We
really like this feature :)

On Tue, Mar 31, 2015 at 3:19 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

>  Hi Xudong,
>
> This is probably because Parquet schema merging is turned on by
> default. This is generally useful for Parquet files with different but
> compatible schemas. But it needs to read metadata from all Parquet
> part-files. This can be problematic when reading Parquet files with lots of
> part-files, especially when the user doesn't need schema merging.
>
> This issue is tracked by SPARK-6575, and here is a PR for it:
> https://github.com/apache/spark/pull/5231. This PR adds a configuration
> to disable schema merging by default when doing Hive metastore Parquet
> table conversion.
>
> Another workaround is to fall back to the old Parquet code by setting
> spark.sql.parquet.useDataSourceApi to false.
>
> Cheng
>
>
> On 3/31/15 2:47 PM, Zheng, Xudong wrote:
>
> Hi all,
>
>  We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But
> we find that even a simple COUNT(*) query is much slower (100x) than on
> Spark 1.2.
>
>  I find that most of the time is spent on the driver fetching HDFS blocks.
> I see a large number of log lines like the ones below:
>
>  15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
> 15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
>   fileLength=77153436
>   underConstruction=false
>   blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
>   lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
>   isLastBlockComplete=true}
> 15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010
>
>
>  I compared the printed log with that of Spark 1.2. Although the number of
> getBlockLocations calls is similar, each such call cost only 20~30 ms there
> (it is 2000~3000 ms now), and Spark 1.2 didn't print the detailed
> LocatedBlocks info.
>
>  Another finding is that if I read the Parquet file via Scala code from
> spark-shell as below, it looks fine: the computation returns the result
> as quickly as before.
>
>  sqlContext.parquetFile("data/myparquettable")
>
>
>  Any idea about it? Thank you!
>
>
>  --
>   郑旭东
> Zheng, Xudong
>
>
>


-- 
郑旭东
Zheng, Xudong
