crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dave Beech (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-144) Ability to re-use PCollections after a write without having to recompute them
Date Wed, 16 Jan 2013 11:02:13 GMT
Dave Beech created CRUNCH-144:
---------------------------------

             Summary: Ability to re-use PCollections after a write without having to recompute
them
                 Key: CRUNCH-144
                 URL: https://issues.apache.org/jira/browse/CRUNCH-144
             Project: Crunch
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.4.0
            Reporter: Dave Beech
            Assignee: Josh Wills


I have a pipeline that consists of several stages to process and filter a dataset. I would
like to persist this dataset to HDFS and then perform further computation on it. 

Example:
1. ) Load text data A and convert to avro -> A'
2. ) Load text data B and convert to avro -> B'
3. ) Union A' and B' -> C
4. ) Filter C -> D

5. ) Write D to HDFS

6a. ) Use DoFn to extract strings from D -> E
6b. ) Aggregate E ( count strings ) -> F
6c. ) Convert F to HBase puts -> G
6d. ) Write G to HBase

Running this pipeline code generates two mapreduce jobs which run in parallel:
job A) runs steps 1, 2, 3, 4, 5
job B) runs steps 1, 2, 3, 4, 6abcd

If a "pipeline.run()" call is included after step 5, the same two jobs are run but sequentially.


What I would like is to be able to hold on to the PCollection reference to "D", so that steps
6* can be run without going back to the start and re-doing all the work needed to generate
it.

-- 

Ref to original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message