spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Which OutputCommitter to use for S3?
Date Sat, 21 Feb 2015 20:12:51 GMT
Josh, is that class something you guys would consider open sourcing, or
would you rather the community step up and create an OutputCommitter
implementation optimized for S3?

On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen <rosenville@gmail.com> wrote:

> We (Databricks) use our own DirectOutputCommitter implementation, which is
> a couple tens of lines of Scala code.  The class would almost entirely be a
> no-op except we took some care to properly handle the _SUCCESS file.
>
> On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim <mkim@palantir.com> wrote:
>
>>  I didn’t get any response. It’d be really appreciated if anyone using a
>> special OutputCommitter for S3 can comment on this!
>>
>>  Thanks,
>> Mingyu
>>
>>   From: Mingyu Kim <mkim@palantir.com>
>> Date: Monday, February 16, 2015 at 1:15 AM
>> To: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Which OutputCommitter to use for S3?
>>
>>  Hi all,
>>
>>  The default OutputCommitter used by RDD, which is FileOutputCommitter,
>> seems to require moving files at the commit step, which is not a
>> constant-time operation in S3, as discussed in
>> http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C543E33FA.2000802@entropy.be%3E.
>> People seem to develop their own NullOutputCommitter implementation or use
>> DirectFileOutputCommitter (as mentioned in SPARK-3595
>> <https://issues.apache.org/jira/browse/SPARK-3595>),
>> but I wanted to check if there is a de facto standard, publicly available
>> OutputCommitter to use for S3 in conjunction with Spark.
>>
>>  Thanks,
>> Mingyu
>>
>
>
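
For readers of this archive, here is a minimal sketch of what a "direct" committer along the lines Josh describes might look like against the old `org.apache.hadoop.mapred` API. It is not the Databricks class, which is not shown in this thread. The class name `SketchDirectOutputCommitter` is hypothetical, and the assumption is that the output format writes straight to the final location when the committer is not a FileOutputCommitter, so every task-level hook can be a no-op and only the _SUCCESS marker needs handling.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred._

// Hypothetical name; a sketch, not the implementation discussed in the thread.
class SketchDirectOutputCommitter extends OutputCommitter {

  // No temporary directories are created, so there is nothing to set up.
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}

  // Tasks are assumed to have written to the final path already,
  // so there is nothing to move at commit time.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}

  // Caveat: output of a failed or aborted task is NOT cleaned up here.
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}

  // The only real work: write the _SUCCESS marker when the job finishes,
  // honoring the same switch that FileOutputCommitter uses.
  override def commitJob(jobContext: JobContext): Unit = {
    val conf = jobContext.getJobConf
    val markSuccess =
      conf.getBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", true)
    val outputPath = FileOutputFormat.getOutputPath(conf)
    if (markSuccess && outputPath != null) {
      val fs = outputPath.getFileSystem(conf)
      val marker = new Path(outputPath, FileOutputCommitter.SUCCEEDED_FILE_NAME)
      fs.create(marker).close()
    }
  }
}
```

With the old mapred API, a committer like this would typically be selected through the Hadoop property `mapred.output.committer.class`. Because nothing is staged or rolled back, a sketch like this is only reasonable when speculative execution is off and partial output from failed tasks is acceptable or cleaned up separately.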
