spark-user mailing list archives

From "Mcclintic, Abbi" <ab...@amazon.com>
Subject Re: CSV write to S3 failing silently with partial completion
Date Wed, 27 Sep 2017 19:25:27 GMT
Hi folks,
We appear to have mitigated the issue by adding the following configurations to our jobs,
which significantly improved S3 consistency for both CSV and JSON output (JSON initially
turned out to be worse than CSV):

spark.speculation=false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1

Still not really sure of the root cause, but this has at least stopped the bleeding for my
team and so far hasn’t caused any large degradation in runtime for our jobs.
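
For anyone applying the same workaround from the command line, here is a minimal sketch of how the two settings could be rendered as spark-submit arguments. Only the two property names and values come from the message above; the class name CommitterWorkaround and the helper toSubmitArgs are made up for illustration, and "--conf key=value" is the usual spark-submit form.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CommitterWorkaround {
    // The two settings quoted in the message above; nothing else is assumed.
    static Map<String, String> settings() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.speculation", "false");
        conf.put("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1");
        return conf;
    }

    // Hypothetical helper: renders each setting as a "--conf", "key=value" pair.
    static List<String> toSubmitArgs(Map<String, String> conf) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            args.add("--conf");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // Prints the arguments on one line, ready to paste after "spark-submit".
        System.out.println(String.join(" ", toSubmitArgs(settings())));
    }
}
```

The same key/value pairs can also be passed to SparkSession.builder().config(key, value) when the session is created in the Java API.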

I’ve looked into the spark-redshift connector, but I don’t think it supports client-side
encryption, which is a requirement for our data, and it wouldn’t solve the problem for the
data we use outside of Redshift.

Hope that helps someone else out if you hit the same issue.

-Abbi


From: Gourav Sengupta <gourav.sengupta@gmail.com>
Date: Monday, September 11, 2017 at 6:32 AM
To: "Mcclintic, Abbi" <abbim@amazon.com>
Cc: user <user@spark.apache.org>
Subject: Re: CSV write to S3 failing silently with partial completion

Hi,

Can you please let me know the following:
1. Why are you using JAVA?
2. The way you are creating the SPARK cluster
3. The way you are initiating SPARK session or context
4. Are you able to query the data that is written to S3 using a SPARK dataframe and validate
that the number of rows in the source is the same as the number written to the target?
5. How are you loading the data to Redshift (cluster size, version, command, compression,
manifest file)?
6. Using Redshift JDBC (https://github.com/databricks/spark-redshift): you will have to play
around with it a bit to understand how it works (be careful that it does not drop the table
in the target Redshift database)

Regards,
Gourav

On Thu, Sep 7, 2017 at 7:02 AM, abbim <abbim@amazon.com> wrote:
Hi all,
My team has been experiencing a recurring, unpredictable bug in which one
partition of our Dataset is only partially written to CSV in S3. For
example, in a Dataset of 10 partitions written to CSV in S3, we might see 9
of the partitions at 2.8 GB in size, but one of them at 1.6 GB. However, the
job does not exit with an error code.
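
Since the job exits cleanly, one way to notice a short partition before loading it downstream is to compare the part-file sizes after the write. The sketch below is only an illustration under stated assumptions: the class PartSizeCheck, the method undersized, and the 0.8 threshold are invented here, the sizes are passed in as plain numbers (gathering them from an S3 listing is left out), and it only makes sense when partitions are expected to be roughly equal in size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartSizeCheck {
    // Returns the indexes of parts whose size is below ratio * median size.
    // For an even number of parts, the upper of the two middle values is used.
    static List<Integer> undersized(long[] sizes, double ratio) {
        List<Integer> flagged = new ArrayList<>();
        if (sizes.length == 0) {
            return flagged;
        }
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        long median = sorted[sorted.length / 2];
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] < ratio * median) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // The example from the message: nine parts at 2.8 GB, one at 1.6 GB.
        long[] sizes = new long[10];
        Arrays.fill(sizes, 2_800_000_000L);
        sizes[9] = 1_600_000_000L;
        System.out.println(undersized(sizes, 0.8)); // prints [9]
    }
}
```

A check like this only detects the symptom; it says nothing about why the write was cut short.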

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the
partial file, suggesting that the failure occurred at a weird byte in the
file.
2. We lose data - sometimes as much as 10%.

We don't see this problem with the Parquet format, which we also use, but moving
all of our data to Parquet is not currently feasible. We're using the Java
API with Spark 2.2 and Amazon EMR 5.8, and the code is as simple as this:
df.write().csv("s3://some-bucket/some_location"). We're experiencing the
issue 1-3x/week on a daily job and are unable to reliably reproduce the
problem.

Any thoughts on why we might be seeing this and how to resolve it?
Thanks in advance.


