drill-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Re: REGEX search Operator
Date Tue, 09 Feb 2016 16:19:21 GMT
I copied both files and it appears to work, but after some testing I am
getting inconsistent results; see below. I ran three queries. First, a LIKE
looking for domain names that end in .com (domain_name like '%.com'), which
returned a count of 9.8 million. Then I tried contains with '\.com$',
which means "ends in dot com". This failed (this goes back to my earlier
comments about really wishing we did not require double escaping... for
users, double escaping is NOT normal, so doing it to accommodate Java's
issues is hard. I'm not sure how to handle it; it may be a tough issue, but
it really seems like something worth exploring).

I then ran contains(domain_name, '\\.com$'). This took quite a bit longer
and returned 0, so I am not really sure how the function is working at this
point. Thoughts?

John



> select count(1) from view_mydata where srcday = '2016-02-05' and
domain_name like '%.com';
+----------+
|  EXPR$0  |
+----------+
| 9810609  |
+----------+
1 row selected (123.673 seconds)


> select count(1) from view_mydata where srcday = '2016-02-05' and
contains(domain_name, '\.com$');
Error: SYSTEM ERROR: ExpressionParsingException: Expression has syntax
error! line 1:79:mismatched input '<EOF>' expecting CParen

Fragment 1:13

[Error Id: 8e46bed4-f9ba-444f-a3aa-2f57db5ae34f on node3:31010]
(state=,code=0)

> select count(1) from view_mydata where srcday = '2016-02-05' and
contains(domain_name, '\\.com$');
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (201.391 seconds)
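
For reference, the two pattern spellings can be checked against java.util.regex directly, outside Drill. The sketch below assumes the UDF applies Pattern.compile(...).matcher(...).find() to the column value, which this thread does not confirm. The class name ContainsEscapeDemo and the helper contains are hypothetical, and how Drill unescapes the SQL literal before the function sees it is not established here.

```java
import java.util.regex.Pattern;

class ContainsEscapeDemo {
    // Hypothetical stand-in for the UDF's core check (an assumption, not the
    // actual SimpleContains code): true if the regex is found anywhere in value.
    public static boolean contains(String value, String regex) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        String domain = "example.com";

        // Java source "\\.com$" is the 6-character runtime string \.com$
        String single = "\\.com$";
        // Java source "\\\\.com$" is the 7-character runtime string \\.com$
        String doubled = "\\\\.com$";

        System.out.println(single.length());   // 6
        System.out.println(doubled.length());  // 7

        // LIKE '%.com' corresponds to a plain suffix test.
        System.out.println(domain.endsWith(".com"));       // true

        // \.com$ means: a literal dot, then "com", then end of input.
        System.out.println(contains(domain, single));      // true

        // \\.com$ means: a literal backslash, then any one character,
        // then "com", then end of input. No ordinary domain name matches.
        System.out.println(contains(domain, doubled));     // false
        System.out.println(contains("example\\xcom", doubled)); // true
    }
}
```

If the pattern reached the regex engine with both backslashes intact, a count of 0 would be expected for ordinary domain names, since none of them contain a literal backslash. That is only one possible explanation for the third query's result.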



On Tue, Feb 9, 2016 at 9:34 AM, Nicolas Paris <niparisco@gmail.com> wrote:

> Hi John,
>
> There are actually two jars to put in the folder (generated in /target).
> Have you restarted Drill afterwards?
>
>
>
>
>
> 2016-02-09 16:20 GMT+01:00 John Omernik <john@omernik.com>:
>
> > Nicolas, not really sure what's happening here. It compiled fine, but when
> > I run it I get this error. The jar is distributed to my bits, I validated
> > that... it's in the DRILL_HOME/jars/3rdparty folder on every bit... do I
> > need to do something more than that?
> >
> >
> >
> > select count(1) from view_myview where srcday = '2016-02-05' and
> > contains(domain_name, 'com');
> > Error: SYSTEM ERROR: IllegalArgumentException: resource
> > /org/apache/drill/contrib/function/SimpleContains.java relative to
> > org.apache.drill.contrib.function.SimpleContains not found.
> >
> > Fragment 1:44
> >
> > [Error Id: 30c11047-9896-4e16-a207-e3cce79c9db5 on node1:31010]
> >
> >   (java.lang.IllegalArgumentException) resource
> > /org/apache/drill/contrib/function/SimpleContains.java relative to
> > org.apache.drill.contrib.function.SimpleContains not found.
> >     com.google.common.base.Preconditions.checkArgument():119
> >     com.google.common.io.Resources.getResource():203
> >     org.apache.drill.exec.expr.fn.FunctionInitializer.get():127
> >     org.apache.drill.exec.expr.fn.FunctionInitializer.checkInit():99
> >     org.apache.drill.exec.expr.fn.FunctionInitializer.getMethod():81
> >     org.apache.drill.exec.expr.fn.DrillFuncHolder.meth():94
> >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.setupBody():50
> >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.renderEnd():80
> >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitFunctionHolderExpression():203
> >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitFunctionHolderExpression():1078
> >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():816
> >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():796
> >     org.apache.drill.common.expression.FunctionHolderExpression.accept():47
> >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanAnd():690
> >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanOperator():172
> >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitBooleanOperator():1092
> >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():836
> >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():796
> >     org.apache.drill.common.expression.BooleanOperator.accept():36
> >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitReturnValueExpression():551
> >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitUnknown():344
> >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitUnknown():1328
> >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():1027
> >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():796
> >     org.apache.drill.exec.physical.impl.filter.ReturnValueExpression.accept():56
> >     org.apache.drill.exec.expr.EvaluationVisitor.addExpr():105
> >     org.apache.drill.exec.expr.ClassGenerator.addExpr():227
> >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():187
> >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():109
> >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
> >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> >     org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
> >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
> >     org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.buildSchema():100
> >     org.apache.drill.exec.record.AbstractRecordBatch.next():142
> >     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> >     org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> >     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
> >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
> >     java.security.AccessController.doPrivileged():-2
> >     javax.security.auth.Subject.doAs():415
> >     org.apache.hadoop.security.UserGroupInformation.doAs():1595
> >     org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
> >     org.apache.drill.common.SelfCleaningRunnable.run():38
> >     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
> >     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
> >     java.lang.Thread.run():745 (state=,code=0)
> >
> > On Fri, Feb 5, 2016 at 2:39 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> >
> > > John,
> > >
> > > Sorry for that, this already works as expected.
> > > Give it a try, it is easy to deploy.
> > >
> > > SELECT first_name FROM cp.`employee.json` WHERE contains(first_name,'\w+')
> > > LIMIT 5;
> > > first_name |
> > > -----------|
> > > Sheri      |
> > > Derrick    |
> > > Michael    |
> > > Maya       |
> > > Roberta    |
> > >
> > >
> > > 2016-02-04 20:41 GMT+01:00 John Omernik <john@omernik.com>:
> > >
> > > > Ya, do you see where I am coming from here? Let's let the users submit
> > > > regex in the pure form if possible, and code the nuances of Java regex
> > > > behind the scenes. I think it would be a great way to make Drill very
> > > > accessible and desirable. I think what happened in Hive is that the
> > > > regex commands started with the users having to escape, and now there
> > > > are just too many things using the escaped regex and the project
> > > > doesn't want to adjust.
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Feb 4, 2016 at 1:38 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > >
> > > > > You mean:
> > > > > userRegex => javaRegex
> > > > > "\d" => "\\d"
> > > > > "\w" => "\\w"
> > > > > "\n" => "\n"
> > > > > I can do that with a regex, I guess.
> > > > > I will give it a try.
> > > > >
> > > > >
> > > > > 2016-02-04 19:37 GMT+01:00 John Omernik <john@omernik.com>:
> > > > >
> > > > > > So my question on the double escape: is there no way to handle that so
> > > > > > the user can use single-escaped regex? I know many folks who use big
> > > > > > data platforms to test large, complex regexes for things like security
> > > > > > appliances, and having to convert the regex seems like a lot of work if
> > > > > > you consider that every user has to do that. If there were a way to do
> > > > > > it in Drill, that would save countless people hours and prevent many
> > > > > > mistakes.
> > > > > >
> > > > > > On Thu, Feb 4, 2016 at 12:03 PM, Nicolas Paris <niparisco@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > John, Jason,
> > > > > > >
> > > > > > > 2016-02-04 18:47 GMT+01:00 John Omernik <john@omernik.com>:
> > > > > > >
> > > > > > > > I'd be curious how you are implementing the regex... using Java's
> > > > > > > > regex libraries? etc.
> > > > > > > >
> > > > > > > Yeah, I use java.util.regex.
> > > > > > >
> > > > > > > > I know one thing with Hive that always bothered me was the need to
> > > > > > > > double escape things.
> > > > > > > >
> > > > > > > > '\d\d\d\d-\d\d-\d\d' needed to be '\\d\\d\\d\\d-\\d\\d-\\d\\d'. If
> > > > > > > > we can avoid that it would be AWESOME.
> > > > > > > >
> > > > > > > My guess is this comes from the way Java handles strings. All
> > > > > > > languages I have used need double escaping.
> > > > > > >
> > > > > > > > On Thu, Feb 4, 2016 at 11:37 AM, Jason Altekruse
> > > > > > > > <altekrusejason@gmail.com> wrote:
> > > > > > >
> > > > > > > The code is here: https://github.com/parisni/drill-simple-contains
> > > > > > > It's disturbing how simple it is...
> > > > > > >
> > > > > > > > > I think you should actually just put the function in Drill itself.
> > > > > > > > > System native functions are implemented in the same interface as
> > > > > > > > > UDFs, because our mechanism for evaluating them is very efficient
> > > > > > > > > (we code generate code blocks by linking together the bodies of
> > > > > > > > > the individual functions to evaluate a complete expression).
> > > > > > > >
> > > > > > > Well, the folder tree is quite impressive
> > > > > > > (https://github.com/apache/drill).
> > > > > > >
> > > > > > > Which folder is supposed to be "Drill itself"?
> > > > > > >
> > > > > > > > > You can open a JIRA, marking it a feature request. You can open a
> > > > > > > > > pull request against the Apache GitHub repo, making sure you
> > > > > > > > > follow the standard format for your commit message, prefixing
> > > > > > > > > with the JIRA number in the format
> > > > > > > > > Example:
> > > > > > > > > DRILL-XXXX: Feature description
> > > > > > > > >
> > > > > > > > > This will automatically link the PR to your JIRA.
> > > > > > > >
> > > > > > > OK, I will try. Thanks a lot.
> > > > > > >
> > > > > > > > > - Jason
> > > > > > > > >
> > > > > > > > > On Thu, Feb 4, 2016 at 8:44 AM, Nicolas Paris <niparisco@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Jason, I have it working.
> > > > > > > > > >
> > > > > > > > > > Just tell me how to proceed with the PR.
> > > > > > > > > > 1. Where do I put my Maven project? Which folder in my Drill
> > > > > > > > > > GitHub fork?
> > > > > > > > > > 2. Do I need a JIRA? How do I proceed?
> > > > > > > > > >
> > > > > > > > > > For now, I have only published it on my GitHub account as a
> > > > > > > > > > separate project.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > 2016-02-04 16:52 GMT+01:00 Jason Altekruse <altekrusejason@gmail.com>:
> > > > > > > > > >
> > > > > > > > > > > Awesome, thanks!
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 4, 2016 at 7:44 AM, Nicolas Paris <niparisco@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Well, I am creating a UDF.
> > > > > > > > > > > > Good exercise.
> > > > > > > > > > > > I hope to have a PR soon.
> > > > > > > > > > > >
> > > > > > > > > > > > 2016-02-04 16:37 GMT+01:00 Jason Altekruse <altekrusejason@gmail.com>:
> > > > > > > > > > > >
> > > > > > > > > > > > > I didn't realize that we were lacking this functionality. As the
> > > > > > > > > > > > > repeated_contains operator handles wildcards, it makes sense to
> > > > > > > > > > > > > add such a function to Drill.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It should be simple to implement. Would someone like to open a
> > > > > > > > > > > > > JIRA and submit a PR for this?
> > > > > > > > > > > > >
> > > > > > > > > > > > > - Jason
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Feb 2, 2016 at 8:56 AM, John Omernik <john@omernik.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I would like to see something like this as well, even if it's an
> > > > > > > > > > > > > > included UDF like REGEX(field, pattern) using Java's library for
> > > > > > > > > > > > > > regex like Hive does. That would be EXTREMELY helpful.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Feb 2, 2016 at 6:55 AM, Nicolas Paris <niparisco@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ANSI SQL doesn't define a regex operator.
> > > > > > > > > > > > > > > > Neither does Drill.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Drill has SQL function extensions like "REPEATED_CONTAINS" that
> > > > > > > > > > > > > > > look like they handle regex. Could the regex operator be replaced
> > > > > > > > > > > > > > > with one new SQL extension?
> > > > > > > > > > > > > > > I guess I could create my own functions in Java, right? Maybe
> > > > > > > > > > > > > > > push them to GitHub then?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Isn't the 'LIKE' operator enough?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Sadly not, I am looking for complex pattern matching.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Miura, Masahide
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > From: Nicolas Paris [mailto:niparisco@gmail.com]
> > > > > > > > > > > > > > > > Sent: Tuesday, February 02, 2016 9:04 PM
> > > > > > > > > > > > > > > > To: user@drill.apache.org
> > > > > > > > > > > > > > > > Subject: REGEX search Operator
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I can't find any reference in the documentation about a regex
> > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would like to be able to query this way:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > SELECT *
> > > > > > > > > > > > > > > > FROM xxx
> > > > > > > > > > > > > > > > WHERE text_field regexOperator 'regex_pattern';
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for helping,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
