drill-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Re: Dealing with bad data when trying to do date computations
Date Tue, 28 Feb 2017 16:48:29 GMT
Thanks Charles, that worked even on my 1.8.
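
For reference, a minimal sketch of the kind of filter REGEXP_MATCHES allows
in place of the long LIKE chain quoted below. It is an illustration, not the
exact query that was run: the table name dfs.tmp.people is a hypothetical
placeholder, and the pattern only guards the month field that triggered the
IllegalFieldValueException, so other bad values (a day of 00, for example)
would still need their own handling.

```sql
-- Sketch only: dfs.tmp.people is a placeholder table name.
-- Keep rows whose dob looks like YYYY-MM-DD with a month of 1 through 12
-- (with or without a leading zero), then compute the age on those rows.
SELECT dob,
       EXTRACT(year FROM age(dob)) AS age_years
FROM dfs.tmp.people
WHERE REGEXP_MATCHES(dob, '^[0-9]{4}-(0?[1-9]|1[0-2])-[0-9]{1,2}$');
```

The pattern is anchored at both ends so it behaves the same whether the
function matches the whole string or only part of it.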

Drill folks: We need to do some documentation updates.  We've added
functions like REGEXP_MATCHES (it's in 1.8, so I am not sure when it was
added) and SPLIT, and yet there is no mention of them in
https://drill.apache.org/docs/string-manipulation/

So, yes, this is "meh" work compared to programming all the cool things in
Drill.  But there are a number of reasons that this needs to be done,
beyond it simply being common practice.

1.  Users, and more importantly POTENTIAL users, get frustrated when trying
to use Drill for the first time. Coming from other Big Data systems like
Hive, not having regex, split, and other functions is frustrating. But what
is more frustrating is to find that they actually exist and are just not
documented.  Nothing will turn people off faster.

2.  Without knowledge of these functions, people try "hacky" workarounds
like what I did, killing performance and putting Drill in a bad light.

3.  It gives an overall impression of lack of effort by the community.  I
know that resources are not unlimited, and these things need to be
addressed by "someone", but issues like this are really important for
getting more people into the community who may be able to help contribute!

4.  I think that, as part of developer review, pull requests that add
functions/functionality should be required to also provide a
documentation update. This helps to ensure that the docs stay up to date,
as well as keeping users apprised of what is happening... i.e. it's a good
"feeling" to see a great tool like Drill "improving" with new
functionality.

Please, folks, we need to do a one-time cleanup (go back through pull
requests to ensure all functions are documented up to now) and then
get processes in place to ensure ongoing updates.

Thanks

John Omernik


On Tue, Feb 28, 2017 at 10:15 AM, Charles Givre <cgivre@gmail.com> wrote:

> Hi John,
> I believe that Drill 1.9 includes a REGEXP_MATCHES( <source>, <pattern> )
> function which does what you'd expect it to.  I'm not sure when this was
> introduced, so it may be in earlier versions of Drill.
> Best,
> -- C
>
> On Tue, Feb 28, 2017 at 11:03 AM, John Omernik <john@omernik.com> wrote:
>
> > I have a data set that has birthdays in YYYY-MM-DD format.
> >
> > Most of this data is great. I am trying to compute the age using
> >
> > EXTRACT(year from age(dob))
> >
> >
> > But some of my data is crapola... let's call it alternative data...
> >
> >
> > When I try to run the Extract function, I get
> >
> > Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
> > must be in the range [1,12]
> >
> > Fragment 5:17
> >
> > [Error Id: 62f90784-c9f4-4362-9710-a37464fc801a on drillnode:20005]
> >
> >
> > I've tried an ugly where clause, and this works:
> >
> > where
> >
> > (dob LIKE '%-01-%' or dob LIKE '%-02-%' or dob LIKE '%-03-%' or dob LIKE
> > '%-04-%' or dob LIKE '%-05-%' or dob LIKE '%-06-%' or dob LIKE '%-07-%' or
> > dob LIKE '%-08-%' or dob LIKE '%-09-%' or
> >
> > dob LIKE '%-1-%' or dob LIKE '%-2-%' or dob LIKE '%-3-%' or dob LIKE
> > '%-4-%' or dob LIKE '%-5-%' or dob LIKE '%-6-%' or dob LIKE '%-7-%' or dob
> > LIKE '%-8-%' or dob LIKE '%-9-%' or
> >
> > dob LIKE '%-10-%' or dob LIKE '%-11-%' or dob LIKE '%-12-%')
> >
> >
> > But WOW is that ugly. I could add the jar for regex contains, and make it
> > much easier (do we have a regex search function built into drill? I think
> > we should at this point...)
> >
> >
> > Is there another way to, say, try the extract function, catch a failure,
> > and ignore it on failure? What if we had a cast function that returned
> > NULL on failure so we could use it in the where clause?  Any other more
> > elegant ways to handle this?
> >
> >
> > Thanks!
> >
> >
> > John
> >
>
