spark-user mailing list archives

From Mingyu Kim <m...@palantir.com>
Subject Re: Which OutputCommitter to use for S3?
Date Fri, 20 Feb 2015 23:52:07 GMT
I haven’t received any responses yet. I’d really appreciate it if anyone using a special OutputCommitter
for S3 could comment on this!

Thanks,
Mingyu

From: Mingyu Kim <mkim@palantir.com>
Date: Monday, February 16, 2015 at 1:15 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Which OutputCommitter to use for S3?

Hi all,

The default OutputCommitter used by RDD, FileOutputCommitter, seems to require moving
files at the commit step, which is not a constant-time operation in S3, as discussed in
http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C543E33FA.2000802@entropy.be%3E.
People seem to either develop their own NullOutputCommitter implementation or use DirectFileOutputCommitter
(as mentioned in SPARK-3595, https://issues.apache.org/jira/browse/SPARK-3595),
but I wanted to check whether there is a de facto standard, publicly available OutputCommitter
to use for S3 in conjunction with Spark.
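
For illustration, here is a minimal Scala sketch of the no-op ("direct") committer approach described above. It is not a standard or endorsed implementation. The class name NoOpDirectOutputCommitter and the application name are hypothetical. The sketch assumes the old org.apache.hadoop.mapred API, where FileOutputFormat writes straight to the final output path when the configured committer is not a FileOutputCommitter, and it assumes a Spark version that respects the configured committer (the subject of SPARK-3595).

```scala
import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}
import org.apache.spark.SparkConf

// Hypothetical class: not shipped with Spark or Hadoop.
// Every hook is a no-op, so nothing is moved or renamed at commit time.
class NoOpDirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  // Returning false tells the framework there is no per-task commit step.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}

object CommitterConfig {
  // Keys prefixed with "spark.hadoop." are copied into the Hadoop configuration,
  // where "mapred.output.committer.class" selects the committer for the old API.
  def sparkConf(): SparkConf =
    new SparkConf()
      .setAppName("s3-direct-commit-sketch") // hypothetical application name
      .set("spark.hadoop.mapred.output.committer.class",
           classOf[NoOpDirectOutputCommitter].getName)
}
```

The trade-off with a committer like this is that output is visible as soon as it is written, so failed or speculative task attempts may leave partial or duplicate files behind. It is probably only reasonable with speculation disabled and when the job can simply be rerun on failure.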

Thanks,
Mingyu
