crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-306) MultipleOutput Targets
Date Wed, 11 Dec 2013 03:06:08 GMT


Micah Whitacre commented on CRUNCH-306:

I think my/Bryan's use case is slightly different from Jeremy's in that we don't expect the
files to be named "key.avro"; instead we were thinking of /<basePath>/<some key derived
path>/part-*-*.avro. This would eliminate the thread contention if a key existed in multiple

Jeremy, would that work for you? Since the AvroFileSource would support reading from a directory,
you could still consume it in a similar fashion without it being a single file.
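
To make the proposed layout concrete, here is a rough sketch of how the per-key paths could be
built. PerKeyLayout and partFile are hypothetical names for illustration only, not part of Crunch
or of the attached patches, and the part-m-00000 style of file name is an assumption based on
the usual Hadoop part file naming. The point is only that two tasks writing the same key get
distinct files under the same key directory.

```java
// Hypothetical helper (not part of Crunch): illustrates a directory-per-key layout.
final class PerKeyLayout {

    // Builds <basePath>/<keyPath>/part-<taskType>-<taskId>.avro.
    // The file name pattern is an assumption modelled on Hadoop part files.
    static String partFile(String basePath, String keyPath, char taskType, int taskId) {
        return String.format("%s/%s/part-%c-%05d.avro", basePath, keyPath, taskType, taskId);
    }

    public static void main(String[] args) {
        // Two tasks writing the same key end up in different files of one directory.
        System.out.println(partFile("/out", "2013/12/11", 'm', 0));
        System.out.println(partFile("/out", "2013/12/11", 'm', 1));
    }
}
```

A reader pointed at /out/2013/12/11 would then pick up both part files.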

Looking at the AvroFilePerKeyTarget/AvroFilePerKeyOutputFormat, should we also document the
hint that sorting by key would improve performance (less opening and closing of files)? I'd
guess most will be doing a GBK to ensure a single partition and then would get this naturally
as part of the ungroup(), but this wouldn't be the case if they are doing it in a map-only
job.
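
A small sketch of why the sorting hint matters, assuming the output format keeps only one file
open at a time and switches files whenever the key changes (I have not checked that this is
exactly what the patch does). WriterOpenCount and countOpens are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model (not Crunch code): counts file opens for a writer that
// holds a single open file and reopens whenever the incoming key changes.
final class WriterOpenCount {

    static int countOpens(List<String> keys) {
        int opens = 0;
        String current = null;
        for (String key : keys) {
            if (!key.equals(current)) {
                opens++;          // key changed: close the old file, open a new one
                current = key;
            }
        }
        return opens;
    }

    public static void main(String[] args) {
        List<String> unsorted = Arrays.asList("a", "b", "a", "b", "a");
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);

        System.out.println("unsorted opens: " + countOpens(unsorted)); // 5
        System.out.println("sorted opens:   " + countOpens(sorted));   // 2
    }
}
```

With sorted input the number of opens equals the number of distinct keys, which is what the
ungroup() after a GBK would give for free.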

> MultipleOutput Targets
> ----------------------
>                 Key: CRUNCH-306
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Josh Wills
>         Attachments: CRUNCH-306.patch, CRUNCH-306b.patch
> A commonly desired feature for Crunch is the ability to write an output file for each
> key in a PTable/PGroupedTable containing the values associated with that key. We should find
> a way to support that one-output-per-key model.

This message was sent by Atlassian JIRA
