crunch-dev mailing list archives

From "Ben Roling (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-663) Expose Record-level File Path to Processing Functions
Date Tue, 30 Jan 2018 20:15:00 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345731#comment-16345731 ]

Ben Roling commented on CRUNCH-663:
-----------------------------------

My initial quick-and-dirty solution to this problem is for the CrunchRecordReader to publish
the path of the current file via a property on the Configuration.  The property might be
called something like "crunch.split.file"; each time initNextRecordReader() is invoked
to move to the next chunk of the CombineFile input split, the property would be updated
to point to the new file.

DoFns that want to know which file they are working on would read that property.
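
As a rough illustration of the idea (not the actual patch), the sketch below models both
sides in plain Java.  java.util.Properties stands in for the Hadoop Configuration, the loop
in run() stands in for what initNextRecordReader() would do on each chunk, and tagWithPath()
plays the role of a DoFn.  The class and method names are hypothetical, and
"crunch.split.file" is only the name floated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

class SplitFileSketch {
    // Property name suggested in this comment; an assumption, not an existing Crunch constant.
    static final String SPLIT_FILE_KEY = "crunch.split.file";

    // Stand-in for a DoFn: looks up the current file path and prefixes the record with it.
    static String tagWithPath(Properties conf, String record) {
        String path = conf.getProperty(SPLIT_FILE_KEY, "<unknown>");
        return path + "\t" + record;
    }

    // Stand-in for the record reader walking the chunks of a combined split.
    // Keys are file paths, values are the records read from that file.
    static List<String> run(Map<String, List<String>> chunks) {
        Properties conf = new Properties(); // stand-in for the Hadoop Configuration
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> chunk : chunks.entrySet()) {
            // Moving to the next chunk: update the property to point to the new file.
            conf.setProperty(SPLIT_FILE_KEY, chunk.getKey());
            for (String record : chunk.getValue()) {
                out.add(tagWithPath(conf, record));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> chunks = new LinkedHashMap<>();
        chunks.put("/data/a.txt", Arrays.asList("r1", "r2"));
        chunks.put("/data/b.txt", Arrays.asList("r3"));
        for (String line : run(chunks)) {
            System.out.println(line);
        }
    }
}
```

The sketch only works because the reader and the function share the same configuration
object and records are processed one at a time, in order; whether that holds in every
Crunch execution path is part of what I would like feedback on.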

I will share a proof-of-concept patch.  I'd welcome feedback on whether Crunch would find
such a solution acceptable.  Obviously, any DoFn that chooses to use this property
for access to the file path is bound to the assumption that it is actually processing a
file-based source.

This solution was somewhat inspired by this thread on Stack Overflow:
[https://stackoverflow.com/questions/17105173/hadoop-how-to-get-each-file-path-in-combinefileinputformat]

That thread revealed to me that the native org.apache.hadoop.mapred.lib.CombineFileRecordReader
sets a config property named "map.input.file".

> Expose Record-level File Path to Processing Functions
> -----------------------------------------------------
>
>                 Key: CRUNCH-663
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-663
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>            Priority: Major
>
> We have some processing pipelines where we want to know the file path that each record
> being processed came from.  It would be nice if this could be exposed to the DoFns in our
> pipelines.
>  
> This same desire was expressed a little over 1 year ago on the mailing list:
> [http://mail-archives.apache.org/mod_mbox/crunch-user/201611.mbox/%3CCAG-tO+Y42KRFiocg1RJT4qFcyvkPjFSfZa4z=wk34AriP4weTw@mail.gmail.com%3E]
>  
> Unfortunately, that thread dead-ended.
>  
> I will use the comments section and a patch to propose a simple, albeit slightly hacky
> solution.  Another alternative would be to create a new Source that provides a
> PCollection<Pair<Path, Record>>, but I'm not sure of the effort it would take to create that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
