spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: Which OutputCommitter to use for S3?
Date Sun, 22 Feb 2015 00:01:54 GMT
Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec

You can use it by setting "mapred.output.committer.class" in the Hadoop
configuration (or "spark.hadoop.mapred.output.committer.class" in the Spark
configuration). Note that this only works for the old Hadoop APIs. I
believe the new Hadoop APIs strongly tie the committer to the output format
(so FileOutputFormat always uses FileOutputCommitter), which makes this fix
more difficult to apply.
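
As a rough illustration of the Spark-side setting, here is a minimal sketch.
The class name com.example.DirectOutputCommitter, the application name, and
the S3 path are placeholders I made up for the example, not the names used in
the gist. Substitute the fully qualified name of whatever committer class is
on your classpath. It assumes the job is launched with spark-submit, which
supplies the master URL.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DirectCommitterExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical class name: replace with the committer you actually deploy.
    val committerClass = "com.example.DirectOutputCommitter"

    // Properties prefixed with "spark.hadoop." are copied into the Hadoop
    // configuration that Spark creates, so this sets
    // "mapred.output.committer.class" for the old (mapred) API.
    val conf = new SparkConf()
      .setAppName("direct-committer-example")
      .set("spark.hadoop.mapred.output.committer.class", committerClass)

    val sc = new SparkContext(conf)
    try {
      // saveAsTextFile goes through the old Hadoop API, so the committer
      // configured above is the one that gets used.
      sc.parallelize(1 to 100)
        .map(_.toString)
        .saveAsTextFile("s3n://my-bucket/output") // placeholder path
    } finally {
      sc.stop()
    }
  }
}
```

Setting the key on sc.hadoopConfiguration directly (without the
"spark.hadoop." prefix) should have the same effect for jobs submitted from
that context.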

On Sat, Feb 21, 2015 at 12:12 PM, Andrew Ash <andrew@andrewash.com> wrote:

> Josh is that class something you guys would consider open sourcing, or
> would you rather the community step up and create an OutputCommitter
> implementation optimized for S3?
>
> On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen <rosenville@gmail.com> wrote:
>
>> We (Databricks) use our own DirectOutputCommitter implementation, which
>> is a couple tens of lines of Scala code.  The class would almost entirely
>> be a no-op except we took some care to properly handle the _SUCCESS file.
>>
>> On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim <mkim@palantir.com> wrote:
>>
>>>  I didn’t get any response. It’d be really appreciated if anyone using
>>> a special OutputCommitter for S3 can comment on this!
>>>
>>>  Thanks,
>>> Mingyu
>>>
>>>   From: Mingyu Kim <mkim@palantir.com>
>>> Date: Monday, February 16, 2015 at 1:15 AM
>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: Which OutputCommitter to use for S3?
>>>
>>>   Hi all,
>>>
>>>  The default OutputCommitter used by RDD, which is FileOutputCommitter,
>>> seems to require moving files at the commit step, which is not a
>>> constant-time operation in S3, as discussed in
>>> <http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C543E33FA.2000802@entropy.be%3E>.
>>> People seem to develop their own NullOutputCommitter implementation or use
>>> DirectFileOutputCommitter (as mentioned in SPARK-3595
>>> <https://issues.apache.org/jira/browse/SPARK-3595>),
>>> but I wanted to check if there is a de facto standard, publicly available
>>> OutputCommitter to use for S3 in conjunction with Spark.
>>>
>>>  Thanks,
>>> Mingyu
>>>
>>
>>
>
