spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: CSV write to S3 failing silently with partial completion
Date Fri, 08 Sep 2017 20:09:32 GMT

On 7 Sep 2017, at 18:36, Mcclintic, Abbi <abbim@amazon.com> wrote:

Thanks all – a couple of notes below.

Generally all our partitions are of equal size (i.e., on a normal day in this particular case
I see 10 equally sized partitions of 2.8 GB each). We see the problem both with and without
repartitioning: in this example we repartition to 10, but we also see the problem without any
repartitioning, when the default partition count is 200. We know that data loss is occurring
because we have a final quality gate that counts the number of rows and halts the process if
we see too large a drop.
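
Roughly, that gate looks like the following (a simplified sketch only; the paths, names and
threshold below are illustrative rather than our actual code):

    // Simplified sketch of the row-count gate; assumes an existing SparkSession
    // `spark` and that `inputDf` is the DataFrame that was written out.
    val expectedCount = inputDf.count()
    val writtenCount  = spark.read
      .option("header", "true")
      .csv("s3://my-bucket/output/path/")          // illustrative output path
      .count()

    val maxAllowedDrop = 0.01                      // illustrative threshold: 1%
    if (writtenCount < expectedCount * (1 - maxAllowedDrop)) {
      throw new IllegalStateException(
        s"Output has $writtenCount rows, expected about $expectedCount; halting")
    }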

We have one use case where the data needs to be read on a local machine after processing and
one use case where we copy the data to Redshift. Regarding the Redshift copy, it gets a bit
complicated owing to VPC and encryption requirements, so we haven't looked into using the JDBC
driver yet.
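
For reference, I'd expect the plain JDBC route to look something like the sketch below
(connection details are placeholders, and it assumes the Redshift JDBC driver jar is on the
classpath):

    // Sketch of a plain JDBC write to Redshift; URL, table and credentials are placeholders.
    resultDf.write
      .format("jdbc")
      .option("url", "jdbc:redshift://example-cluster.us-east-1.redshift.amazonaws.com:5439/dev")
      .option("dbtable", "public.example_table")
      .option("user", "example_user")
      .option("password", "example_password")
      .option("driver", "com.amazon.redshift.jdbc42.Driver")
      .mode("append")
      .save()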

My understanding was that Amazon EMR does not support s3a
(https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/), but it may be
worth looking into.

1. No, it doesn't.
2. You can't currently use s3a as a direct destination of work, because S3 is not consistent;
not without a consistency layer on top (S3Guard, etc.). A rough sketch of what enabling that
involves is below.
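
As a rough idea only (property names as in the Hadoop 3.x s3a/S3Guard docs; the table name and
region are placeholders, so check the docs for your Hadoop version):

    // Rough sketch of pointing s3a at an S3Guard DynamoDB metadata store.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.metadatastore.impl",
      "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
    hadoopConf.set("fs.s3a.s3guard.ddb.table", "example-s3guard-table")  // placeholder table
    hadoopConf.set("fs.s3a.s3guard.ddb.region", "us-east-1")             // placeholder region
    hadoopConf.set("fs.s3a.s3guard.ddb.table.create", "true")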

We may also try a combination of writing to HDFS combined with s3distcp.


+1
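
The pattern there is roughly: write to HDFS from the Spark job, then copy the completed output
to S3 as a separate step. Something like this (paths are illustrative, `resultDf` assumed):

    // Sketch: have the Spark job write to HDFS, then copy the finished output to S3 afterwards.
    resultDf.write
      .option("header", "true")
      .csv("hdfs:///tmp/job-output/")

    // Afterwards, outside Spark (e.g. as an EMR step):
    //   s3-dist-cp --src hdfs:///tmp/job-output/ --dest s3://my-bucket/job-output/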

