spark-user mailing list archives

From Aaron Jackson <ajack...@pobox.com>
Subject S3A Creating Task Per Byte (pyspark / 1.6.1)
Date Thu, 12 May 2016 17:35:31 GMT
I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's
in S3.  I've done this previously with Spark 1.5 with no issue.  I'm
attempting to load and count a single file as follows:

dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
dataFrame.count()

But when it attempts to load, it creates 279K tasks.  When I look at the
tasks, the number of tasks is identical to the number of bytes in the
file.  Has anyone seen anything like this, or any ideas why it's getting
that granular?
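For what it's worth, Hadoop's FileInputFormat derives the split count (and hence Spark's task count) from the block size the file system reports, so a pathological block size of 1 byte would produce exactly this symptom.  A rough sketch of the arithmetic (the sizes are hypothetical and `estimate_num_splits` is my own helper, not a Spark or Hadoop API):

```python
import math

def estimate_num_splits(file_size_bytes, block_size_bytes):
    """FileInputFormat-style estimate: one split per block-sized chunk."""
    return math.ceil(file_size_bytes / block_size_bytes)

# With a sane block size (e.g. 64 MB), a ~279 KB file is a single split,
# hence a single task:
print(estimate_num_splits(279_000, 64 * 1024 * 1024))

# But if the filesystem reports a block size of 1 byte, you get one
# split -- and therefore one task -- per byte of input:
print(estimate_num_splits(279_000, 1))
```

If that's what is happening here, it might be worth checking what `fs.s3a.block.size` is set to in the Hadoop configuration.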
