datafu-dev mailing list archives

From "Russell Jurney (JIRA)" <>
Subject [jira] [Commented] (DATAFU-85) Add SPRINTF to provide this functionality to Pig < 0.14.0
Date Thu, 01 Jan 2015 16:34:13 GMT


Russell Jurney commented on DATAFU-85:

I've emailed the list. No replies yet. Pasted below:


I think this raises an issue that merits discussion.


Pig and DataFu follow two very different release schedules:

1) Pig is released about twice a year (14 major releases in 6 years). Getting UDF code into
Pig (builtin) or Piggybank is a major undertaking. What is more, the popular Hadoop distributions
(Cloudera, Hortonworks, MapR, Pivotal) lag behind the current Apache version of Pig by a year
or more. In other words: adding a simple UDF to Pig can take a year and a half to actually
reach users.

2) DataFu releases every month or two, as new features are added. Using DataFu is as simple
as grabbing a jar file, so it isn't tied to a distribution (although several include it).
One needn't upgrade Pig to use new features of DataFu.

This leads to an interesting situation. Take PIG-3939, which added SPRINTF as a Pig builtin
in Pig 0.14, released in November 2014. In practice, Pig users wanting SPRINTF must wait
for the distributions to include Pig 0.14, which could take a year or more. When you factor
in the six months between the patch's submission (June 2014) and its release (November 2014),
it could take two or more years for most users to get the SPRINTF feature.


For me, this raises the question: why don't we add SPRINTF to DataFu, so that older versions
of Pig (before 0.14) can have this feature? I happen to be in a situation where we're using
CDH 5.2/Pig 0.12, and we need SPRINTF. I think this is a common situation.

So the question I'm raising is: Is it appropriate to implement UDF/builtin features of Pig
in DataFu, to enable older versions of Pig to use them and dramatically decrease the delay
until users can start using them?

In the case of SPRINTF, I believe we should add it to DataFu. I've created DATAFU-85 to track
this issue. The Hadoop distributions won't ship Pig 0.14 for some time, and the majority of
Hadoop users will be using Pig 0.12 for several years. Adding this kind of feature will benefit
users in the meantime.

> Add SPRINTF to provide this functionality to Pig < 0.14.0
> ---------------------------------------------------------
>                 Key: DATAFU-85
>                 URL:
>             Project: DataFu
>          Issue Type: Bug
>            Reporter: Russell Jurney
>            Assignee: Russell Jurney
> I need SPRINTF in DataFu for a book I'm working on. I'd like to add this to DataFu so
> that CDH, HDP, MapR, etc. users can use SPRINTF as soon as DataFu cuts a new release.
> See PIG-3939
> Thoughts?

