[ https://issues.apache.org/jira/browse/CRUNCH-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562730#comment-13562730 ]
Gabriel Reid commented on CRUNCH-144:
-------------------------------------
Looks fine to me, and I wouldn't say it's as ugly as you made it out to be (and sorry for
not taking a look earlier).
One small (cosmetic) change I would make: on line 166 of MRPipeline, there's an instanceof
check against Source followed by a cast to SourceTarget. It works out all the same, but I think
it would be more readable if the instanceof check were against SourceTarget.
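To illustrate the suggestion, here is a minimal sketch of the pattern. It is not the actual MRPipeline code: the interfaces and classes below (Source, Target, SourceTarget, FileSourceTarget, PlainTarget) are hypothetical stand-ins declared only for this example, as is the helper method asSourceTarget.

```java
import java.util.Optional;

// Hypothetical stand-in types, declared here only for illustration.
// They are NOT the real Crunch interfaces.
class InstanceofSketch {
    interface Source {}
    interface Target {}
    interface SourceTarget extends Source, Target {}

    static final class FileSourceTarget implements SourceTarget {}
    static final class PlainTarget implements Target {}

    // Less readable form: the check names Source, but the cast names SourceTarget.
    //   if (target instanceof Source) { SourceTarget st = (SourceTarget) target; ... }

    // Suggested form: check against the same type that the cast uses,
    // so the check and the cast visibly agree.
    static Optional<SourceTarget> asSourceTarget(Target target) {
        if (target instanceof SourceTarget) {
            return Optional.of((SourceTarget) target);
        }
        return Optional.empty();
    }
}
```

With these stand-in types, asSourceTarget(new FileSourceTarget()) yields a present Optional and asSourceTarget(new PlainTarget()) yields an empty one.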
> Ability to re-use PCollections after a write without having to recompute them
> -----------------------------------------------------------------------------
>
> Key: CRUNCH-144
> URL: https://issues.apache.org/jira/browse/CRUNCH-144
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.4.0
> Reporter: Dave Beech
> Assignee: Josh Wills
> Attachments: CRUNCH-144b.patch, CRUNCH-144.patch
>
>
> I have a pipeline that consists of several stages to process and filter a dataset. I would like to persist this dataset to HDFS and then perform further computation on it.
> Example:
> 1) Load text data A and convert to Avro -> A'
> 2) Load text data B and convert to Avro -> B'
> 3) Union A' and B' -> C
> 4) Filter C -> D
> 5) Write D to HDFS
> 6a) Use DoFn to extract strings from D -> E
> 6b) Aggregate E (count strings) -> F
> 6c) Convert F to HBase puts -> G
> 6d) Write G to HBase
> Running this pipeline code generates two MapReduce jobs, which run in parallel:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs steps 1, 2, 3, 4, 6abcd
> If a "pipeline.run()" call is included after step 5, the same two jobs are run, but sequentially.
> What I would like is to be able to hold on to the PCollection reference to "D", so that steps 6* can be run without going back to the start and re-doing all the work needed to generate it.
> --
> Ref to original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E