spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Readin from Amazon S3 behaves inconsistently: return different number of lines...
Date Fri, 01 Aug 2014 11:00:48 GMT
See https://issues.apache.org/jira/browse/SPARK-2579

It was also mentioned on the mailing list a while ago, and I have heard
of it from customers. I am trying to get to the bottom of it
too.

What version are you using, to start? I am wondering if it was fixed
in 1.0.x since I was not able to reproduce it in my example.

On Fri, Aug 1, 2014 at 12:37 AM, nit <nitinpanj@gmail.com> wrote:
> *First Question:*
>
> On Amazon S3 I have a directory with 1024 files, each about 9 MB in size;
> each line in a file has two entries separated by '\t'.
>
> Here is my program, which calculates the total number of entries in the
> dataset:
>
> --
>     val inputId = sc.textFile(inputPath, noParts).flatMap { line =>
>       val lineArray = line.split("\\t")
>       Iterator(lineArray(0).toLong, lineArray(1).toLong)
>     }.distinct(noParts)
>     println("######input-cnt = %s;  ".format(inputId.count))
> --
> where inputPath =
> "s3n://my-AWS_ACCESS_KEY_ID:myAWS_ACCESS_KEY_SECRET@bucket-id/directory"
>
> When I run this program multiple times on EC2, "input-cnt" is not
> consistent across runs. FYI, I uploaded the data to S3 two days ago, so I
> assume that by now the data is properly replicated (eventual consistency).
>
> *Is this a known issue with S3? What is the solution?*
>
> Note: When I ran the same experiment on my YARN cluster, where inputPath is
> an HDFS path, I got the expected results.
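
One way to narrow this down, as a rough sketch rather than a fix: count the
raw lines per partition on two passes over the same path and see which
partitions disagree. That would separate the S3 read from the parsing and
distinct steps. The object name S3CountCheck, the helpers changedPartitions
and linesPerPartition, and the command-line arguments are made up for
illustration; the only Spark calls used are textFile, mapPartitionsWithIndex
and collect. It also assumes both passes produce the same input splits, so
that partition indices line up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical diagnostic, not part of the original program.
object S3CountCheck {

  // Partition indices whose line counts differ between two passes
  // (a partition missing from one pass also counts as a difference).
  def changedPartitions(a: Map[Int, Long], b: Map[Int, Long]): Seq[Int] =
    (a.keySet ++ b.keySet).toSeq.sorted.filter(i => a.get(i) != b.get(i))

  // One pass over the input: number of lines seen in each partition.
  def linesPerPartition(sc: SparkContext, path: String, parts: Int): Map[Int, Long] =
    sc.textFile(path, parts)
      .mapPartitionsWithIndex { (idx, lines) =>
        Iterator((idx, lines.foldLeft(0L)((n, _) => n + 1L)))
      }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val inputPath = args(0)       // assumed: the same s3n:// path as in the quoted program
    val noParts   = args(1).toInt // assumed: the same partition count
    val sc = new SparkContext(new SparkConf().setAppName("S3CountCheck"))
    try {
      val first  = linesPerPartition(sc, inputPath, noParts)
      val second = linesPerPartition(sc, inputPath, noParts)
      println(s"total lines: ${first.values.sum} vs ${second.values.sum}")
      changedPartitions(first, second).foreach { i =>
        println(s"partition $i: ${first.get(i)} vs ${second.get(i)}")
      }
    } finally sc.stop()
  }
}
```

If the raw line counts already differ between passes, the problem is in the
read itself; if they are stable and only the distinct count moves, the
parsing or shuffle stage is the more likely place to look.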
