spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Selvam Raman <sel...@gmail.com>
Subject Spark S3
Date Mon, 10 Oct 2016 21:46:38 GMT
Hi,

How spark reads data from s3 and runs parallel task.

Assume I have a s3 bucket size of 35 GB( parquet file).

How the sparksession will read the data and process the data parallel. How
it splits the s3 data and assign to each executor task.

​Please share me your points.

Note:
if we have RDD , then we can look at the partitions.size or length to check
how many partition for a file. But how this will be accomplished in terms
of S3 bucket.​

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"

Mime
View raw message