falcon-dev mailing list archives

From "Venkatesan Ramachandran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-2030) Enforce time partition pattern in the data location path in feed definition
Date Fri, 17 Jun 2016 19:45:05 GMT

    [ https://issues.apache.org/jira/browse/FALCON-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336792#comment-15336792 ]

Venkatesan Ramachandran commented on FALCON-2030:


Let's assume that feed A points to the dir /basedir/feedA.
Metadata gets exported and stored as a file (for simplicity we assume 1 file) in that dir as
/basedir/feedA/datafile-t1

The consumer (a Pig or MR job) takes feed A as input, reads from the dir /basedir/feedA/,
and so reads the file /basedir/feedA/datafile-t1

After some days, the metadata changes and a new export happens that produces a new file under
the feed dir as /basedir/feedA/datafile-t2

Now there are two files under the feed dir: datafile-t1 with slightly older data and
datafile-t2 with the updated data.

Let's assume that the customer has implemented a custom retention that retires all the files
except the latest one (and the retention job runs once a day).

At this point, 

a) the workflow (Pig/MR etc.) will consume both files (duplicate data).
    If I read your comment above correctly, you are suggesting that it consume only the latest file.
    This would require developing custom Pig loaders, input formats etc., which is uncommon
    and error-prone.

b) In the absence of (a), when the workflow consumes both files under the feed dir and
the retention deletes the older one in the meantime, the Pig or MR task will try to read
the deleted file and fail.
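
To make (a) and (b) concrete, here is a minimal Python sketch on a local temp dir. It is only an illustration: the helper names (plan_inputs, retain_latest_only, read_all, demo) are hypothetical, and a local filesystem stands in for HDFS and for the Pig/MR input planning.

```python
import os
import tempfile


def plan_inputs(feed_dir):
    """Consumer side: list every file under the feed dir as job input."""
    return sorted(os.path.join(feed_dir, n) for n in os.listdir(feed_dir))


def retain_latest_only(feed_dir):
    """Custom retention: retire all files except the last one (by name)."""
    for path in plan_inputs(feed_dir)[:-1]:
        os.remove(path)


def read_all(paths):
    """Read each planned input; raises FileNotFoundError if one is gone."""
    contents = []
    for path in paths:
        with open(path) as f:
            contents.append(f.read())
    return contents


def demo():
    feed_dir = tempfile.mkdtemp(prefix="feedA-")
    for name, payload in (("datafile-t1", "older"), ("datafile-t2", "updated")):
        with open(os.path.join(feed_dir, name), "w") as f:
            f.write(payload)

    planned = plan_inputs(feed_dir)   # both files are planned as input
    first_read = read_all(planned)    # case (a): older and updated data together

    retain_latest_only(feed_dir)      # retention runs after inputs were planned
    try:
        read_all(planned)             # case (b): datafile-t1 no longer exists
        failed = False
    except FileNotFoundError:
        failed = True
    return first_read, failed
```

Calling demo() shows the first read returning both payloads, and the second read failing once retention has removed the older file.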

It is better to write the files under a <version or pattern> subdir and apply custom
retention (based on access time etc) to retire that dir. 
The workflow can easily use the LATEST EL to safely access the latest <pattern dir>.
This seems to be a more plausible use-case IMO.
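
A rough Python sketch of the subdir layout, again on a local temp dir. The names latest_version_dir and retire_old_versions are hypothetical, the date-like dir names are made up for the example, and picking the greatest dir name is only a stand-in for what the LATEST EL resolves. Retention here keeps the newest dir by name rather than by access time, to keep the sketch short.

```python
import os
import shutil
import tempfile


def latest_version_dir(feed_dir):
    """Stand-in for resolving the latest <pattern dir>: greatest dir name wins."""
    versions = sorted(
        n for n in os.listdir(feed_dir)
        if os.path.isdir(os.path.join(feed_dir, n))
    )
    return os.path.join(feed_dir, versions[-1])


def retire_old_versions(feed_dir, keep=1):
    """Retention at dir granularity: remove all but the newest `keep` dirs."""
    versions = sorted(
        n for n in os.listdir(feed_dir)
        if os.path.isdir(os.path.join(feed_dir, n))
    )
    for name in versions[:-keep]:
        shutil.rmtree(os.path.join(feed_dir, name))


def demo_versioned():
    feed_dir = tempfile.mkdtemp(prefix="feedA-")
    # Hypothetical zero-padded pattern dirs, so name order matches time order
    for version, payload in (("2016-06-10", "older"), ("2016-06-17", "updated")):
        os.makedirs(os.path.join(feed_dir, version))
        with open(os.path.join(feed_dir, version, "datafile"), "w") as f:
            f.write(payload)

    latest = latest_version_dir(feed_dir)  # consumer resolves one dir only
    retire_old_versions(feed_dir, keep=1)  # retention never touches that dir

    with open(os.path.join(latest, "datafile")) as f:
        data = f.read()
    return os.path.basename(latest), data, sorted(os.listdir(feed_dir))
```

The consumer reads exactly one export, and retention removing the older dir does not affect the dir the consumer resolved.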

In this regard, I do not believe this validation restricts any use cases. In fact,
I think it helps users avoid pitfalls.


> Enforce time partition pattern in the data location path in feed definition 
> ----------------------------------------------------------------------------
>                 Key: FALCON-2030
>                 URL: https://issues.apache.org/jira/browse/FALCON-2030
>             Project: Falcon
>          Issue Type: Improvement
>          Components: feed
>            Reporter: Venkatesan Ramachandran
>            Assignee: Venkatesan Ramachandran
> In feed definition, data location can be specified without time series pattern like below:
>    <locations>
>         <location type="data" path="/tmp/falcon-regression/RetentionTest/testFolders/"/>
>         <location type="stats" path="/projects/falcon/clicksStats"/>
>         <location type="meta" path="/projects/falcon/clicksMetaData"/>
>     </locations>

