spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Parquet Hive table become very slow on 1.3?
Date Tue, 31 Mar 2015 07:19:31 GMT
Hi Xudong,

This is probably because Parquet schema merging is turned on by 
default. Schema merging is generally useful for Parquet files with 
different but compatible schemas, but it needs to read metadata from 
all Parquet part-files. This can be problematic when reading a Parquet 
table with lots of part-files, especially when the user doesn't need 
schema merging.

This issue is tracked by SPARK-6575, and here is a PR for it: 
https://github.com/apache/spark/pull/5231. This PR adds a configuration 
to disable schema merging by default when doing Hive metastore Parquet 
table conversion.

Another workaround is to fall back to the old Parquet code path by 
setting spark.sql.parquet.useDataSourceApi to false.
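
As a rough sketch of that workaround, assuming a Spark 1.3 spark-shell 
session where sqlContext is already defined, the setting can be applied 
before running the query. The table name my_parquet_table is 
hypothetical and only stands in for the affected Hive table.

```scala
// Run inside spark-shell on Spark 1.3, where `sqlContext` is predefined.
// Fall back to the old Parquet code path instead of the data source API.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

// Re-run the slow query; `my_parquet_table` is a placeholder table name.
val counts = sqlContext.sql("SELECT COUNT(*) FROM my_parquet_table").collect()
counts.foreach(println)
```

The same property could also be passed at launch time with 
--conf spark.sql.parquet.useDataSourceApi=false, which avoids changing 
any application code.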

Cheng

On 3/31/15 2:47 PM, Zheng, Xudong wrote:
> Hi all,
>
> We are using Parquet Hive tables, and we are upgrading to Spark 1.3. 
> But we find that even a simple COUNT(*) query is much slower (100x) 
> than on Spark 1.2.
>
> I find that most of the time is spent on the driver getting HDFS 
> block locations. A large number of log entries like the following 
> are printed:
>
> 15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
> 15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
>    fileLength=77153436
>    underConstruction=false
>    blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
>    lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
>    isLastBlockComplete=true}
> 15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010
>
> Comparing these logs with those from Spark 1.2: the number of 
> getBlockLocations calls is similar, but each call took only 20~30 ms 
> there (versus 2000~3000 ms now), and Spark 1.2 did not print the 
> detailed LocatedBlocks info.
>
> Another finding: if I read the Parquet files with Scala code from 
> spark-shell, as below, it works fine and the computation returns the 
> result as quickly as before.
>
>     sqlContext.parquetFile("data/myparquettable")
>
> Any idea about it? Thank you!
>
>
> -- 
> 郑旭东
> Zheng, Xudong
>

