crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-470) Add hdfs/yarn minicluster crunch pipeline
Date Thu, 11 Sep 2014 21:27:34 GMT


Micah Whitacre commented on CRUNCH-470:

I don't understand what the minicluster pipeline would actually be doing differently.  The
pipeline code itself would be the same; the only change would be the values inside the Configuration
object passed to the pipeline.

So the workflow would be:

1. Set up the minicluster (either YARN or MR, depending on what you need)
2. Retrieve the configuration from the minicluster
3. Pass the configuration to MRPipeline and run

Are you proposing that a single pipeline would do all of that?  If so, that would only be useful
for testing, and in that case, for faster and more stable tests, you would actually want
to share a single minicluster across all your tests.  Having a pipeline spin one up and tear
it down for each run would make your tests run considerably slower.
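
A minimal sketch of the shared-minicluster idea, using only the Java standard library so it runs on its own. Everything in it is a hypothetical stand-in: FakeMiniCluster plays the role of a Hadoop minicluster (for example MiniDFSCluster or MiniMRYarnCluster), the Map plays the role of the Hadoop Configuration, runPipeline plays the role of constructing an MRPipeline with that Configuration and running it, and the "hdfs://localhost:0" address is a placeholder value.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedMiniClusterSketch {

    /** Hypothetical stand-in for a Hadoop minicluster; not a real Hadoop class. */
    static final class FakeMiniCluster {
        /** Counts how many times a cluster has been started. */
        static int startCount = 0;

        private final Map<String, String> conf = new HashMap<>();

        FakeMiniCluster() {
            // Step 1: "start" the cluster. A real minicluster start is the slow part.
            startCount++;
            // Placeholder address; a real cluster would report its own.
            conf.put("fs.defaultFS", "hdfs://localhost:0");
        }

        /** Step 2: hand out a copy of the configuration the cluster was started with. */
        Map<String, String> getConfiguration() {
            return new HashMap<>(conf);
        }
    }

    /** Lazily starts one cluster and reuses it for every caller. */
    static final class SharedCluster {
        private static FakeMiniCluster instance;

        static synchronized FakeMiniCluster get() {
            if (instance == null) {
                instance = new FakeMiniCluster();
            }
            return instance;
        }
    }

    /** Step 3: stand-in for passing the configuration to a pipeline and running it. */
    static String runPipeline(Map<String, String> conf) {
        return "ran against " + conf.get("fs.defaultFS");
    }

    public static void main(String[] args) {
        // Two "tests" share the same cluster instead of each starting its own.
        String first = runPipeline(SharedCluster.get().getConfiguration());
        String second = runPipeline(SharedCluster.get().getConfiguration());
        System.out.println(first);
        System.out.println(second);
        System.out.println("cluster starts: " + FakeMiniCluster.startCount);
    }
}
```

The pipeline code is identical for both runs; only the configuration it receives comes from the cluster, and the cluster is started once no matter how many runs use it.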

> Add hdfs/yarn minicluster crunch pipeline
> -----------------------------------------
>                 Key: CRUNCH-470
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: Rafal Wojdyla
>            Assignee: Josh Wills
>            Priority: Minor
> Crunch currently has two pipelines:
> * MemPipeline
> * MRPipeline
> MemPipeline is an in-memory pipeline based on local in-memory MapReduce mode.
> MRPipeline is a distributed pipeline based on distributed MapReduce.
> Using an HDFS/YARN minicluster, it's possible to better emulate a Hadoop cluster, and it could
> be a 'final test' before running on the cluster.

This message was sent by Atlassian JIRA