spark-user mailing list archives

From "Mcclintic, Abbi" <ab...@amazon.com>
Subject Re: CSV write to S3 failing silently with partial completion
Date Thu, 07 Sep 2017 17:36:16 GMT
Thanks all – a couple of notes below.



Generally all our partitions are of equal size (i.e., on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB each). We see the problem both with and without repartitioning – in this example we are repartitioning to 10, but we also see the problem without any repartitioning, when the default partition count is 200. We know that data loss is occurring because we have a final quality gate that counts the number of rows and halts the process if we see too large a drop.
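
For context, the gate is roughly the following shape (an illustrative sketch, not our exact code; the path, threshold, and variable names are placeholders):

    // count the rows we intend to write, write the CSV, then re-read and re-count
    long expectedRows = df.count();
    df.write().csv("s3://some-bucket/some_location");
    long writtenRows = df.sparkSession().read().csv("s3://some-bucket/some_location").count();

    // halt the pipeline if the drop is larger than we can tolerate (threshold is illustrative)
    if (writtenRows < expectedRows * 0.99) {
        throw new RuntimeException("Row count dropped from " + expectedRows + " to " + writtenRows);
    }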



We have one use case where the data needs to be read on a local machine after processing, and one use case of copying to Redshift. Regarding the Redshift copy, it gets a bit complicated owing to VPC and encryption requirements, so we haven't looked into using the JDBC driver yet.
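
If we do eventually try the JDBC route, I'd expect it to look roughly like the sketch below (untested on our side; the URL, table, and credentials are placeholders, and the Redshift JDBC driver jar would need to be on the classpath):

    // hypothetical direct write to Redshift via Spark's generic JDBC data source
    df.write()
      .format("jdbc")
      .option("url", "jdbc:redshift://example-cluster.example.com:5439/dev")
      .option("dbtable", "some_schema.some_table")
      .option("user", "placeholder_user")
      .option("password", "placeholder_password")
      .mode("append")
      .save();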



My understanding was that Amazon EMR does not support s3a <https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/>, but it may be worth looking into. We may also try writing to HDFS first and then copying the output to S3 with s3distcp.
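
Roughly, that fallback would look like this (paths are just examples). First, write the CSV to the cluster's HDFS:

    // write to local HDFS first; copying to S3 then happens as a separate, verifiable step
    df.write().csv("hdfs:///tmp/some_location");

and then copy the finished output up to S3 with something like: s3-dist-cp --src hdfs:///tmp/some_location --dest s3://some-bucket/some_location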



Thanks,



Abbi





On 9/7/17, 7:50 AM, "Patrick Alwell" <palwell@hortonworks.com> wrote:



    Sounds like an S3 bug. Can you replicate locally with HDFS?



    Try using the s3a protocol too; there are jars you can leverage like so: spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py
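
    The write itself would then just target the s3a scheme instead of s3, something like this (bucket and path are placeholders):

        df.write().csv("s3a://some-bucket/some_location");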



    EMR can sometimes be buggy. :/



    You could also try leveraging EC2 nodes and manually creating a cluster with passwordless SSH.



    But I feel your pain man, I’ve had weird issues with Redshift and EMR as well.



    Let me know whether or not you can replicate it locally; I can bring it up with our S3 team for the next release of HDP, and we can file a bug with AWS.



    -Pat



    On 9/7/17, 2:59 AM, "JG Perrin" <jperrin@lumeris.com> wrote:



        Are you assuming that all partitions are of equal size? Did you try with more partitions (e.g., by repartitioning)? Does the error always happen with the last (or smallest) file? If you are sending to Redshift, why not use the JDBC driver?
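
        For instance, something roughly like this before the write (the partition count is just an example):

            df.repartition(100).write().csv("s3://some-bucket/some_location");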



        -----Original Message-----

        From: abbim [mailto:abbim@amazon.com]

        Sent: Thursday, September 07, 2017 1:02 AM

        To: user@spark.apache.org

        Subject: CSV write to S3 failing silently with partial completion



        Hi all,

        My team has been experiencing a recurring, unpredictable bug in which one partition of our Dataset is only partially written to CSV in S3. For example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of the partitions at 2.8 GB in size, but one of them at only 1.6 GB. However, the job does not exit with an error code.



        This becomes problematic in the following ways:

        1. When we copy the data to Redshift, we get a bad decrypt error on the partial file, suggesting that the failure occurred at an arbitrary byte offset in the file.

        2. We lose data - sometimes as much as 10%.



        We don't see this problem with the Parquet format, which we also use, but moving all of our data to Parquet is not currently feasible. We're using the Java API with Spark 2.2 and Amazon EMR 5.8, and the code is as simple as this: df.write().csv("s3://some-bucket/some_location"). We're experiencing the issue 1-3x/week on a daily job and are unable to reliably reproduce the problem.
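
        (For reference, the Parquet writes that behave fine are the equivalent one-liner, roughly df.write().parquet(...) with an analogous S3 path; only the output format differs.)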



        Any thoughts on why we might be seeing this and how to resolve it?

        Thanks in advance.







        --

        Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/















