crunch-user mailing list archives

From Josh Wills
Subject Re: Splitting a PCollection
Date Tue, 26 Nov 2013 19:50:45 GMT
Hey Bryan,

This comes up often enough that we need to prioritize the use case-- what
we really want is a Target that would take in a PTable<String, T> and would
be able to write an output file/directory for each String key. I'll create
a JIRA to track this.
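
To make the idea concrete, here is a minimal plain-Java sketch of the "one output per String key" behavior. It is not Crunch API and not the eventual Target; `SplitByKeySketch`, `splitByKey`, and `outputPathFor` are hypothetical names, the records are plain strings standing in for Avro objects, and the `<category>.avro` naming scheme is an assumption. It only shows the single-pass grouping that would replace running one FilterFn per category.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only; none of this is Crunch API.
final class SplitByKeySketch {

    // Groups records by a String category in a single pass, keeping
    // first-seen category order and the original order within each category.
    static <T> Map<String, List<T>> splitByKey(
            List<T> records, Function<? super T, String> categoryOf) {
        Map<String, List<T>> byCategory = new LinkedHashMap<>();
        for (T record : records) {
            byCategory
                .computeIfAbsent(categoryOf.apply(record), k -> new ArrayList<>())
                .add(record);
        }
        return byCategory;
    }

    // Hypothetical naming scheme: one output file per category.
    static String outputPathFor(String baseDir, String category) {
        return baseDir + "/" + category + ".avro";
    }

    public static void main(String[] args) {
        // Strings of the form "category:value" stand in for Avro records.
        List<String> records = Arrays.asList("fruit:apple", "veg:kale", "fruit:pear");
        Map<String, List<String>> split =
            splitByKey(records, r -> r.substring(0, r.indexOf(':')));
        // A real job would write each group to its path; here we only print it.
        split.forEach((category, rows) ->
            System.out.println(outputPathFor("out", category) + " <- " + rows));
    }
}
```

The point of the sketch is that every record is visited once and routed by its key, whereas the filter-per-category approach rescans the whole collection once for each distinct category.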


On Tue, Nov 26, 2013 at 11:25 AM, Bryan Baugher wrote:

> Hi everyone,
> I have a PCollection of Avro-based objects and I want to categorize them
> by a certain property, writing each category to a different Avro file.
> The number of distinct categories should be small
> (hundreds) and the property I am categorizing on is a String. I was hoping
> there was some way to end up with a Map<String, PCollection> but there
> didn't seem to be any obvious choice. For now I have gone with a simple
> approach of
>    - Find all categories (DoFn that returns PCollection<String>)
>    - Materialize and iterate over this collection
>       - For each category, use a FilterFn to create the desired
>       categorized PCollection
>       - Write this to an Avro file
> This works but it seems like there should be a better way to do it. Any
> thoughts?
> -Bryan

Director of Data Science
Cloudera
Twitter: @josh_wills
