spark-user mailing list archives

From: Steve Loughran <ste...@hortonworks.com>
Subject: Re: S3A Creating Task Per Byte (pyspark / 1.6.1)
Date: Fri, 13 May 2016 10:19:45 GMT

On 12 May 2016, at 18:35, Aaron Jackson <ajackson@pobox.com> wrote:

I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in S3. I've done
this previously with Spark 1.5 with no issue. Attempting to load and count a single file
as follows:

dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
dataFrame.count()

But when it attempts to load, it creates 279K tasks.  When I look at the tasks, the # of tasks
is identical to the # of bytes in the file.  Has anyone seen anything like this or have any
ideas why it's getting that granular?

Yeah, seen that. The block size being returned by the FS is coming back as 0, which then
triggers a split on every byte. Which, as you have noticed, doesn't work.
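
To see why a zero block size degenerates into one task per byte: Hadoop's FileInputFormat
clamps the filesystem's reported block size between the configured min and max split sizes
when working out the split size. A rough Python sketch of that arithmetic (illustrative
numbers only, not the actual Hadoop code):

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # FileInputFormat-style clamping: max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

file_length = 279_000                                       # hypothetical file size in bytes
healthy = compute_split_size(block_size=32 * 1024 * 1024)   # a sane FS reporting 32 MB blocks
broken  = compute_split_size(block_size=0)                  # S3A on Hadoop 2.6.0 (HADOOP-11584)

print(file_length // healthy + 1)   # 1 split  -> 1 task
print(file_length // broken)        # 279,000 splits -> one task per byte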

You've hit https://issues.apache.org/jira/browse/HADOOP-11584, fixed in Hadoop 2.7.0.

You need to consider S3A not usable in production in the 2.6.0 release; things surfaced in
the field which only got caught later. HADOOP-11571
(https://issues.apache.org/jira/browse/HADOOP-11571) covers the issues that surfaced. Stay
on S3N for a 2.6.x-based release; move to Hadoop 2.7.1+ for S3A to be ready to play.
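
For what it's worth, a minimal PySpark sketch of that interim step on a 2.6-based build,
reading the same file through the s3n:// connector instead; the bucket/path and credential
values below are placeholders, not taken from the original post:

# Hedged sketch: same read as above, but via the older s3n:// connector,
# which doesn't hit the zero-block-size problem. Credentials and paths are placeholders.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

dataFrame = sqlContext.read.text('s3n://bucket/path-to-file.csv')
dataFrame.count()   # should now give a sane number of tasks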


