spark-user mailing list archives

From Abhinay Mehta <abhinay.me...@gmail.com>
Subject Re: Spark S3
Date Tue, 11 Oct 2016 09:22:16 GMT
Hi Selvam,

Is your 35 GB Parquet file split up into multiple S3 objects, or is it just one
big Parquet file?

If it's just one big file, then I believe only one executor will be able to
work on it until some job action partitions the data into smaller chunks.
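
In case it helps, here is a minimal Scala sketch (the s3a:// path and the
repartition count of 200 are made-up placeholders) that reads the Parquet
file from S3 and checks how many partitions the read produced, which also
answers the partitions.size question further down:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-parquet-partition-check")
  .getOrCreate()

// Hypothetical path -- substitute your own bucket and key. The s3a://
// scheme assumes the hadoop-aws jar and valid AWS credentials are
// available to the cluster.
val df = spark.read.parquet("s3a://my-bucket/my-35gb-file.parquet")

// Same idea as rdd.partitions.size on a plain RDD: this reports how
// many input splits Spark planned for the S3 read.
println(s"partitions after read: ${df.rdd.getNumPartitions}")

// If the read produced too few partitions to keep all executors busy,
// repartition() shuffles the rows into smaller, evenly sized chunks
// (200 here is an arbitrary example value).
val repartitioned = df.repartition(200)
println(s"partitions after repartition: ${repartitioned.rdd.getNumPartitions}")

df.rdd.getNumPartitions works the same whether the source is HDFS or S3, so
checking it right after the read is the quickest way to confirm how the
object was actually split.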



On 11 October 2016 at 06:03, Selvam Raman <selmna@gmail.com> wrote:

> I mentioned Parquet as the input format.
> On Oct 10, 2016 11:06 PM, "ayan guha" <guha.ayan@gmail.com> wrote:
>
>> It really depends on the input format used.
>> On 11 Oct 2016 08:46, "Selvam Raman" <selmna@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How does Spark read data from S3 and run parallel tasks?
>>>
>>> Assume I have an S3 bucket with a 35 GB Parquet file.
>>>
>>> How will the SparkSession read the data and process it in parallel?
>>> How does it split the S3 data and assign it to each executor's tasks?
>>>
>>> Please share your thoughts.
>>>
>>> Note:
>>> If we have an RDD, we can look at partitions.size or length to check how
>>> many partitions a file has. But how is this accomplished for an S3 bucket?
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>
