That sounds great, thanks.
On Tue, Nov 26, 2013 at 2:46 PM, Josh Wills <jwills@cloudera.com> wrote:
> JIRA is here-- https://issues.apache.org/jira/browse/CRUNCH-306
>
> The question I have right off the bat is whether we should restrict these
> outputs to PGroupedTable types, where we know that all of the records for
> the same key will be in the same partition. For arbitrary PTable types, we
> might have multiple partitions containing the same key, and we might need
> to keep a large number of output record writers open at the same time,
> which probably isn't a great idea.
>
>
> On Tue, Nov 26, 2013 at 11:50 AM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Bryan,
>>
>> This comes up often enough that we need to prioritize the use case-- what
>> we really want is a Target that would take in a PTable<String, T> and would
>> be able to write an output file/directory for each String key. I'll create
>> a JIRA to track this.
>>
>> Josh
>>
>>
>> On Tue, Nov 26, 2013 at 11:25 AM, Bryan Baugher <bjbq4d@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a PCollection of Avro-based objects and I want to categorize
>>> these objects by a certain property, writing each category into a
>>> different Avro file. The number of distinct categories should be small
>>> (hundreds) and the property I am categorizing on is a String. I was hoping
>>> there was some way to end up with a Map<String, PCollection> but there
>>> didn't seem to be any obvious choice. For now I have gone with a simple
>>> approach of
>>>
>>> - Find all categories (DoFn that returns PCollection<String>)
>>> - Materialize and iterate over this collection
>>> - For each category use a FilterFn to create desired categorized
>>> PCollection
>>> - Write this to an Avro file
>>>
>>> This works but it seems like there should be a better way to do it. Any
>>> thoughts?
>>>
>>> -Bryan
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
--
-Bryan
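
To make the concern about open record writers concrete, here is a minimal plain-Java sketch. It is not Crunch code: the class and method names are hypothetical, and a "writer" is only simulated by a counter. It assumes that grouped input presents all records for a key contiguously within a partition, while ungrouped input may revisit a key at any point.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Crunch.
class KeyedWriterSketch {

    // Grouped input: records for one key are contiguous, so the writer for
    // the previous key can be closed as soon as the key changes.
    public static int peakOpenWritersGrouped(List<String> keys) {
        String current = null;
        int open = 0;
        int peak = 0;
        for (String key : keys) {
            if (!key.equals(current)) {
                if (current != null) {
                    open--; // close the writer for the previous key
                }
                open++; // open a writer for the new key
                current = key;
                peak = Math.max(peak, open);
            }
        }
        return peak;
    }

    // Ungrouped input: any key may reappear later, so every writer has to
    // stay open until the input is exhausted. The peak therefore equals the
    // number of distinct keys seen.
    public static int peakOpenWritersUngrouped(List<String> keys) {
        Set<String> open = new HashSet<>();
        for (String key : keys) {
            open.add(key);
        }
        return open.size();
    }

    public static void main(String[] args) {
        List<String> grouped = Arrays.asList("a", "a", "b", "b", "c");
        List<String> interleaved = Arrays.asList("a", "b", "c", "a", "b");
        System.out.println(peakOpenWritersGrouped(grouped));        // prints 1
        System.out.println(peakOpenWritersUngrouped(interleaved)); // prints 3
    }
}
```

With a few hundred categories, as in the original question, the ungrouped case would mean a few hundred writers open per task, while the grouped case needs only one at a time.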