spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From yeshwanth kumar <yeshwant...@gmail.com>
Subject Re: RDD Partitions on HDFS file in Hive on Spark Query
Date Tue, 22 Nov 2016 23:00:14 GMT
Hi Ayan,

we have  default rack topology.



-Yeshwanth
Can you Imagine what I would do if I could do all I can - Art of War

On Tue, Nov 22, 2016 at 6:37 AM, ayan guha <guha.ayan@gmail.com> wrote:

> Because snappy is not splittable, so single task makes sense.
>
> Are sure about rack topology? Ie 225 is in a different rack than 227 or
> 228? What does your topology file says?
> On 22 Nov 2016 10:14, "yeshwanth kumar" <yeshwanth43@gmail.com> wrote:
>
>> Thanks for your reply,
>>
>> i can definitely change the underlying compression format.
>> but i am trying to understand the Locality Level,
>> why executor ran on a different node, where the blocks are not present,
>> when Locality Level is RACK_LOCAL
>>
>> can you shed some light on this.
>>
>>
>> Thanks,
>> Yesh
>>
>>
>> -Yeshwanth
>> Can you Imagine what I would do if I could do all I can - Art of War
>>
>> On Mon, Nov 21, 2016 at 4:59 PM, Jörn Franke <jornfranke@gmail.com>
>> wrote:
>>
>>> Use as a format orc, parquet or avro because they support any
>>> compression type with parallel processing. Alternatively split your file in
>>> several smaller ones. Another alternative would be bzip2 (but slower in
>>> general) or Lzo (usually it is not included by default in many
>>> distributions).
>>>
>>> On 21 Nov 2016, at 23:17, yeshwanth kumar <yeshwanth43@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> we are running Hive on Spark, we have an external table over snappy
>>> compressed csv file of size 917.4 M
>>> HDFS block size is set to 256 MB
>>>
>>> as per my Understanding, if i run a query over that external table , it
>>> should launch 4 tasks. one for each block.
>>> but i am seeing one executor and one task processing all the file.
>>>
>>> trying to understand the reason behind,
>>>
>>> i went one step further to understand the block locality
>>> when i get the block locations for that file, i found
>>>
>>> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-bf39d33d-48e1-
>>> 4a8f-be48-b0953fdaad37,DISK],
>>>  DatanodeInfoWithStorage[10.11.0.227:50010,DS-a760c1c8-ce0c-
>>> 4eb8-8183-8d8ff5f24115,DISK],
>>>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-0e5427e2-b030-
>>> 43f8-91c9-d8517e68414a,DISK]]
>>>
>>> DatanodeInfoWithStorage[10.11.0.226:50010,DS-f50ddf2f-b827-4
>>> 845-b043-8b91ae4017c0,DISK],
>>> DatanodeInfoWithStorage[10.11.0.228:50010,DS-e8c9785f-c352-4
>>> 89b-8209-4307f3296211,DISK],
>>> DatanodeInfoWithStorage[10.11.0.225:50010,DS-6f6a3ffd-334b-4
>>> 5fd-ae0f-cc6eb268b0d2,DISK]]
>>>
>>> DatanodeInfoWithStorage[10.11.0.226:50010,DS-f8bea6a8-a433-4
>>> 601-8070-f6c5da840e09,DISK],
>>> DatanodeInfoWithStorage[10.11.0.227:50010,DS-8aa3f249-790e-4
>>> 94d-87ee-bcfff2182a96,DISK],
>>> DatanodeInfoWithStorage[10.11.0.228:50010,DS-d06714f4-2fbb-4
>>> 8d3-b858-a023b5c44e9c,DISK]
>>>
>>> DatanodeInfoWithStorage[10.11.0.226:50010,DS-b3a00781-c6bd-4
>>> 98c-a487-5ce6aaa66f48,DISK],
>>> DatanodeInfoWithStorage[10.11.0.228:50010,DS-fa5aa339-e266-4
>>> e20-a360-e7cdad5dacc3,DISK],
>>> DatanodeInfoWithStorage[10.11.0.225:50010,DS-9d597d3f-cd4f-4
>>> c8f-8a13-7be37ce769c9,DISK]]
>>>
>>> and in the spark UI i see the Locality Level is  RACK_LOCAL. for that
>>> task
>>>
>>> if it is RACK_LOCAL then it should run either in node 10.11.0.226 or
>>> 10.11.0.228, because these 2 nodes has all the four blocks needed for
>>> computation
>>> but the executor is running in 10.11.0.225
>>>
>>> my theory is not applying anywhere.
>>>
>>> please help me in understanding how spark/yarn calculates number of
>>> executors/tasks.
>>>
>>> Thanks,
>>> -Yeshwanth
>>>
>>>
>>

Mime
View raw message