crunch-dev mailing list archives

From "Attila Sasvari (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-636) Make replication factor for temporary files configurable
Date Mon, 20 Feb 2017 20:45:45 GMT


Attila Sasvari commented on CRUNCH-636:

I have a PoC that suggests the approach I previously recommended is fragile: I ran a sample
dataflow three times, and the replication settings were not applied deterministically.

[~joshwills] What is your opinion about this ticket/feature? If we allow users to set a different
replication factor for intermediate files and they set it to 1, then a failure of a disk that
stores the data before the pipeline finishes would crash the whole Crunch pipeline. If a
job has both temporary and non-temporary outputs, then the replication factor should be the
one used for the non-temporary output. I don't know all the possible cases, but it doesn't seem
trivial to me.
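
To make the mixed-output rule above concrete, here is a minimal sketch in plain Java. It is not Crunch code: the Output class, the replicationFor helper and the example paths are hypothetical names used only for illustration, and the rule for an empty output list is an assumption.

```java
import java.util.Arrays;
import java.util.List;

final class ReplicationChoice {

    /** Hypothetical stand-in for one job output; only the temporary flag matters here. */
    static final class Output {
        final String path;
        final boolean temporary;

        Output(String path, boolean temporary) {
            this.path = path;
            this.temporary = temporary;
        }
    }

    /**
     * Returns tmpReplication only when every output of the job is temporary.
     * As soon as one output is non-temporary, the default factor is used.
     * Assumption: a job with no outputs also gets the default factor.
     */
    public static int replicationFor(List<Output> outputs, int tmpReplication, int defaultReplication) {
        if (outputs.isEmpty()) {
            return defaultReplication;
        }
        for (Output output : outputs) {
            if (!output.temporary) {
                return defaultReplication;
            }
        }
        return tmpReplication;
    }

    public static void main(String[] args) {
        // Example paths are made up for this sketch.
        List<Output> onlyTemporary = Arrays.asList(new Output("/tmp/crunch/p1", true));
        List<Output> mixed = Arrays.asList(
                new Output("/tmp/crunch/p2", true),
                new Output("/data/final", false));

        System.out.println(replicationFor(onlyTemporary, 1, 3)); // prints 1
        System.out.println(replicationFor(mixed, 1, 3));         // prints 3
    }
}
```

This covers only the one case described above; other cases would need their own rules.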

> Make replication factor for temporary files configurable
> --------------------------------------------------------
>                 Key: CRUNCH-636
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>            Assignee: Attila Sasvari
> As of now, Crunch does not allow temporary files and non-temporary files (e.g. final output
> data of leaf nodes) to have different replication factors at the same time. A user with a
> large amount of data to process (say, hundreds of gigabytes) might want a lower replication
> factor for the large temporary files written between Crunch jobs.
> We could make this configurable via a new setting (e.g. {{crunch.tmp.dir.replication}}).
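
One way such a setting could be resolved is sketched below in plain Java. A Map stands in for the Hadoop Configuration object, crunch.tmp.dir.replication is only the key proposed in the description above and not an existing Crunch setting, and the fallback order (proposed key, then dfs.replication, then the HDFS default of 3) is an assumption. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class TmpReplicationSetting {

    // Key proposed in the ticket description; hypothetical, not an existing setting.
    static final String TMP_REPLICATION_KEY = "crunch.tmp.dir.replication";
    // Standard HDFS property for the default block replication.
    static final String DFS_REPLICATION_KEY = "dfs.replication";
    // Default value of dfs.replication in HDFS.
    static final int HDFS_DEFAULT_REPLICATION = 3;

    /** Resolves the replication factor for temporary files, falling back when unset or invalid. */
    public static int tmpReplication(Map<String, String> conf) {
        int fallback = parsePositive(conf.get(DFS_REPLICATION_KEY), HDFS_DEFAULT_REPLICATION);
        return parsePositive(conf.get(TMP_REPLICATION_KEY), fallback);
    }

    /** Parses a positive integer; returns fallback for null, non-numeric, or non-positive input. */
    private static int parsePositive(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(tmpReplication(conf)); // prints 3 (nothing configured)

        conf.put(TMP_REPLICATION_KEY, "1");
        System.out.println(tmpReplication(conf)); // prints 1
    }
}
```

Rejecting zero and negative values keeps a misconfigured job on the cluster default instead of writing files with an invalid replication factor.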

This message was sent by Atlassian JIRA
