spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: RDD Partitions on HDFS file in Hive on Spark Query
Date Mon, 21 Nov 2016 22:59:51 GMT
Use ORC, Parquet, or Avro as the format, because they support any compression type while still being processed in parallel (the files stay splittable). Alternatively, split your file into several smaller ones. Another alternative would be bzip2 (but generally slower) or LZO (usually not included by default in many distributions).
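
A minimal sketch of such a conversion in Spark/Scala (paths and options here are hypothetical placeholders, not your actual setup): read the Snappy-compressed CSV once, which still happens in a single task, and rewrite it as Parquet so every later query can scan the data in parallel.

import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    // Spark 2.x session; enableHiveSupport so the result can be used from Hive as well.
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .enableHiveSupport()
      .getOrCreate()

    // The Snappy-compressed text file is not splittable, so this read runs as one task.
    val df = spark.read
      .option("header", "true")                 // adjust to the actual CSV layout
      .csv("/data/external/my_table")           // hypothetical location of the CSV file

    // Parquet remains splittable even when compressed, so downstream scans parallelize.
    df.write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("/warehouse/my_table_parquet")   // hypothetical target location

    spark.stop()
  }
}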

> On 21 Nov 2016, at 23:17, yeshwanth kumar <yeshwanth43@gmail.com> wrote:
> 
> Hi,
> 
> We are running Hive on Spark. We have an external table over a Snappy-compressed CSV file of size 917.4 MB.
> The HDFS block size is set to 256 MB.
> 
> As per my understanding, if I run a query over that external table, it should launch 4 tasks, one for each block.
> But I am seeing one executor and one task processing the whole file.
> 
> Trying to understand the reason behind this, I went one step further and looked at the block locality.
> When I got the block locations for that file, I found:
> 
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-bf39d33d-48e1-4a8f-be48-b0953fdaad37,DISK],
>  DatanodeInfoWithStorage[10.11.0.227:50010,DS-a760c1c8-ce0c-4eb8-8183-8d8ff5f24115,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-0e5427e2-b030-43f8-91c9-d8517e68414a,DISK]]
> 
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-f50ddf2f-b827-4845-b043-8b91ae4017c0,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-e8c9785f-c352-489b-8209-4307f3296211,DISK],
>  DatanodeInfoWithStorage[10.11.0.225:50010,DS-6f6a3ffd-334b-45fd-ae0f-cc6eb268b0d2,DISK]]
> 
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-f8bea6a8-a433-4601-8070-f6c5da840e09,DISK],
>  DatanodeInfoWithStorage[10.11.0.227:50010,DS-8aa3f249-790e-494d-87ee-bcfff2182a96,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-d06714f4-2fbb-48d3-b858-a023b5c44e9c,DISK]]
> 
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-b3a00781-c6bd-498c-a487-5ce6aaa66f48,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-fa5aa339-e266-4e20-a360-e7cdad5dacc3,DISK],
>  DatanodeInfoWithStorage[10.11.0.225:50010,DS-9d597d3f-cd4f-4c8f-8a13-7be37ce769c9,DISK]]
> 
> In the Spark UI, I see the locality level for that task is RACK_LOCAL.
> 
> If it is about locality, the task should run on either node 10.11.0.226 or 10.11.0.228, because these two nodes hold replicas of all four blocks needed for the computation.
> But the executor is running on 10.11.0.225.
> 
> My theory does not seem to apply anywhere.
> 
> Please help me understand how Spark/YARN calculates the number of executors/tasks.
> 
> Thanks,
> -Yeshwanth
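
For reference, a minimal sketch (Scala, with a hypothetical path) of how block locations like the ones quoted above can be listed with the Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()   // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs   = FileSystem.get(conf)
    val path = new Path("/data/external/my_table/file.csv.snappy")   // hypothetical path

    val status = fs.getFileStatus(path)
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)

    // One entry per HDFS block, with the datanodes that hold a replica.
    blocks.zipWithIndex.foreach { case (b, i) =>
      println(s"block $i offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(", ")}")
    }
  }
}

Note, however, that the number of tasks is driven by the input splits rather than by the block layout: a Snappy-compressed text file yields a single split no matter how many blocks it spans, which is why one task (and hence one executor) processes the whole file.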
