crunch-user mailing list archives

From Josh Wills
Subject Re: Splitting a PCollection
Date Tue, 26 Nov 2013 20:46:47 GMT
JIRA is here--

The question I have right off the bat is whether we should restrict these
outputs to PGroupedTable types, where we know that all of the records for
the same key will be in the same partition. For arbitrary PTable types, we
might have multiple partitions containing the same key, and we might need
to keep a large number of output record writers open at the same time,
which probably isn't a great idea.
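
To make the trade-off above concrete, here is a minimal plain-Java sketch. It does not use the Crunch API; the class and method names are hypothetical, and a list of String keys stands in for the records of one partition. It compares how many writers must stay open at once if every key keeps its writer until the partition ends with how many times a writer must be opened if only one is open at a time and it is closed whenever the key changes.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; none of these names come from Crunch.
class OpenWriterCount {

    // If a writer is kept open for every key until the partition ends,
    // the number of simultaneously open writers equals the number of
    // distinct keys seen in the partition.
    static int writersIfKeptOpen(List<String> keys) {
        return new HashSet<>(keys).size();
    }

    // If only one writer is open at a time and it is closed whenever the
    // key changes, this counts how many times a writer has to be opened.
    // For grouped input this equals the number of distinct keys; for
    // arbitrary ordering the same key's output may be opened repeatedly.
    static int opensIfClosedOnKeyChange(List<String> keys) {
        int opens = 0;
        String current = null;
        for (String key : keys) {
            if (!key.equals(current)) {
                opens++;
                current = key;
            }
        }
        return opens;
    }
}
```

For keys that arrive grouped, both counts match, so a single open writer is enough. For interleaved keys, either many writers stay open or the same key's output has to be reopened several times.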

On Tue, Nov 26, 2013 at 11:50 AM, Josh Wills wrote:

> Hey Bryan,
> This comes up often enough that we need to prioritize the use case-- what
> we really want is a Target that would take in a PTable<String, T> and would
> be able to write an output file/directory for each String key. I'll create
> a JIRA to track this.
> Josh
> On Tue, Nov 26, 2013 at 11:25 AM, Bryan Baugher wrote:
>> Hi everyone,
>> I have a PCollection of Avro-based objects and I want to categorize these
>> Avro objects by a certain property, writing each category into a
>> different Avro file. The number of distinct categories should be small
>> (hundreds) and the property I am categorizing on is a String. I was hoping
>> there was some way to end up with a Map<String, PCollection> but there
>> didn't seem to be any obvious choice. For now I have gone with a simple
>> approach of
>>    - Find all categories (DoFn that returns PCollection<String>)
>>    - Materialize and iterate over this collection
>>       - For each category use a FilterFn to create the desired categorized
>>       PCollection
>>       - Write this to an Avro file
>> This works but it seems like there should be a better way to do it. Any
>> thoughts?
>> -Bryan
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
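
The approach in Bryan's quoted message (find the categories, then filter once per category) can be sketched in plain Java alongside a single-pass alternative. This is an illustration under assumptions, not Crunch code: `Rec` is a hypothetical stand-in for an Avro record, in-memory lists stand in for PCollections, and both method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Crunch.
class SplitByCategory {

    // Stand-in for an Avro record: a category property plus a payload.
    static final class Rec {
        final String category;
        final String payload;

        Rec(String category, String payload) {
            this.category = category;
            this.payload = payload;
        }
    }

    // Mirrors the steps in the message: collect the distinct categories,
    // then scan the full input once per category and keep the matches.
    static Map<String, List<String>> splitByFiltering(List<Rec> recs) {
        Set<String> categories = new TreeSet<>();
        for (Rec r : recs) {
            categories.add(r.category);
        }
        Map<String, List<String>> out = new TreeMap<>();
        for (String c : categories) {
            List<String> matches = new ArrayList<>();
            for (Rec r : recs) {
                if (r.category.equals(c)) {
                    matches.add(r.payload);
                }
            }
            out.put(c, matches);
        }
        return out;
    }

    // Single pass: route each record to its category's bucket as it is read.
    static Map<String, List<String>> splitInOnePass(List<Rec> recs) {
        Map<String, List<String>> out = new TreeMap<>();
        for (Rec r : recs) {
            out.computeIfAbsent(r.category, k -> new ArrayList<>()).add(r.payload);
        }
        return out;
    }
}
```

Both methods produce the same buckets. The filtering version reads the input once per category plus once to find the categories, while the single-pass version reads it once, which is the behavior a per-key output Target would aim for.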

Director of Data Science
Cloudera
Twitter: @josh_wills
