drill-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: REGEX search Operator
Date Tue, 09 Feb 2016 18:22:33 GMT
John,

I will look into the escaping question.
For your query, you could try this pattern:
select count(1) from view_mydata where srcday = '2016-02-05' and
contains(domain_name, '.*\\.com$');
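
As a rough illustration of why the pattern above adds a leading .*, here is a minimal java.util.regex sketch. It assumes the contains UDF requires the whole value to match (as Matcher.matches() does) rather than searching for a substring (as Matcher.find() does); nothing in this thread confirms that. The class name, method names and sample value are made up for the example.

```java
import java.util.regex.Pattern;

public class ContainsSketch {

    // Whole-string semantics: the pattern must cover the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Substring semantics: the pattern may match anywhere in the input.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String domain = "example.com"; // hypothetical sample value

        // In Java source, the literal "\\.com$" is the regex \.com$
        System.out.println(fullMatch("\\.com$", domain));    // false
        System.out.println(partialMatch("\\.com$", domain)); // true
        System.out.println(fullMatch(".*\\.com$", domain));  // true
    }
}
```

If the UDF does match the whole string, that would be consistent with '\\.com$' returning 0 rows while '.*\\.com$' finds the domains ending in .com.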


2016-02-09 17:19 GMT+01:00 John Omernik <john@omernik.com>:

> I copied both files and it appears to work, but after some testing I am
> getting inconsistent results; see below. I ran three queries. The first was a
> LIKE looking for domain names that end in .com (domain_name like '%.com'),
> which returned a count of 9.8 million. Then I tried contains with '\.com$',
> which means "ends in dot com". This failed (this goes back to my earlier
> comments about really wishing we did not require double escaping... for users,
> double escaping is NOT normal, so doing it to satisfy Java's rules is
> hard... I am not sure how to handle it, and it may be a tough issue, but it
> really seems like something worth exploring).
>
> I then did contains(domain_name, '\\.com$'). This took quite a bit longer
> and returned 0, so I am not really sure how the function is working at this
> point. Thoughts?
>
> John
>
>
>
> > select count(1) from view_mydata where srcday = '2016-02-05' and
> domain_name like '%.com';
> +----------+
> |  EXPR$0  |
> +----------+
> | 9810609  |
> +----------+
> 1 row selected (123.673 seconds)
>
>
> > select count(1) from view_mydata where srcday = '2016-02-05' and
> contains(domain_name, '\.com$');
> Error: SYSTEM ERROR: ExpressionParsingException: Expression has syntax
> error! line 1:79:mismatched input '<EOF>' expecting CParen
>
> Fragment 1:13
>
> [Error Id: 8e46bed4-f9ba-444f-a3aa-2f57db5ae34f on node3:31010]
> (state=,code=0)
>
> > select count(1) from view_mydata where srcday = '2016-02-05' and
> contains(domain_name, '\\.com$');
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (201.391 seconds)
>
>
>
> On Tue, Feb 9, 2016 at 9:34 AM, Nicolas Paris <niparisco@gmail.com> wrote:
>
> > Hi John,
> >
> > There are actually two jars to put in the folder (both generated in /target).
> > Have you restarted Drill afterwards?
> >
> >
> >
> >
> >
> > 2016-02-09 16:20 GMT+01:00 John Omernik <john@omernik.com>:
> >
> > > Nicolas, not really sure what's happening here. It compiled fine, but when
> > > I run it I get this error. The jar is distributed to my bits, I validated
> > > that... it's in the DRILL_HOME/jars/3rdparty folder on every bit... do I
> > > need to do something more than that?
> > >
> > >
> > >
> > > select count(1) from view_myview where srcday = '2016-02-05' and
> > > contains(domain_name, 'com');
> > > Error: SYSTEM ERROR: IllegalArgumentException: resource
> > > /org/apache/drill/contrib/function/SimpleContains.java relative to
> > > org.apache.drill.contrib.function.SimpleContains not found.
> > >
> > > Fragment 1:44
> > >
> > > [Error Id: 30c11047-9896-4e16-a207-e3cce79c9db5 on node1:31010]
> > >
> > >   (java.lang.IllegalArgumentException) resource
> > > /org/apache/drill/contrib/function/SimpleContains.java relative to
> > > org.apache.drill.contrib.function.SimpleContains not found.
> > >     com.google.common.base.Preconditions.checkArgument():119
> > >     com.google.common.io.Resources.getResource():203
> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.get():127
> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.checkInit():99
> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.getMethod():81
> > >     org.apache.drill.exec.expr.fn.DrillFuncHolder.meth():94
> > >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.setupBody():50
> > >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.renderEnd():80
> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitFunctionHolderExpression():203
> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitFunctionHolderExpression():1078
> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():816
> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():796
> > >     org.apache.drill.common.expression.FunctionHolderExpression.accept():47
> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanAnd():690
> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanOperator():172
> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitBooleanOperator():1092
> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():836
> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():796
> > >     org.apache.drill.common.expression.BooleanOperator.accept():36
> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitReturnValueExpression():551
> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitUnknown():344
> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitUnknown():1328
> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():1027
> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():796
> > >     org.apache.drill.exec.physical.impl.filter.ReturnValueExpression.accept():56
> > >     org.apache.drill.exec.expr.EvaluationVisitor.addExpr():105
> > >     org.apache.drill.exec.expr.ClassGenerator.addExpr():227
> > >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():187
> > >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():109
> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> > >     org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
> > >     org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.buildSchema():100
> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():142
> > >     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> > >     org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> > >     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> > >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
> > >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
> > >     java.security.AccessController.doPrivileged():-2
> > >     javax.security.auth.Subject.doAs():415
> > >     org.apache.hadoop.security.UserGroupInformation.doAs():1595
> > >     org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
> > >     org.apache.drill.common.SelfCleaningRunnable.run():38
> > >     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
> > >     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
> > >     java.lang.Thread.run():745 (state=,code=0)
> > >
> > > On Fri, Feb 5, 2016 at 2:39 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > >
> > > > John,
> > > >
> > > > Sorry for that; this already works as expected.
> > > > Give it a try, it is easy to deploy.
> > > >
> > > > SELECT first_name FROM cp.`employee.json` WHERE contains(first_name,'\w+')
> > > > LIMIT 5;
> > > > first_name |
> > > > -----------|
> > > > Sheri      |
> > > > Derrick    |
> > > > Michael    |
> > > > Maya       |
> > > > Roberta    |
> > > >
> > > >
> > > > 2016-02-04 20:41 GMT+01:00 John Omernik <john@omernik.com>:
> > > >
> > > > > Ya, do you see where I am coming from here? Let's let the users submit
> > > > > regex in the pure form if possible, and code the nuances of Java regex
> > > > > behind the scenes. I think it would be a great way to make Drill very
> > > > > accessible and desirable. I think what happened in Hive is that the regex
> > > > > commands started with the users having to escape, and now there are just
> > > > > too many things using the escaped regex, so the project doesn't want to
> > > > > adjust.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Feb 4, 2016 at 1:38 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > >
> > > > > > You mean:
> > > > > > userRegex => javaRegex
> > > > > > "\d" => "\\d"
> > > > > > "\w" => "\\w"
> > > > > > "\n" => "\n"
> > > > > > I can probably do that with a regex.
> > > > > > I will give it a try.
> > > > > >
> > > > > >
> > > > > > 2016-02-04 19:37 GMT+01:00 John Omernik <john@omernik.com>:
> > > > > >
> > > > > > > So my question on the double escape: is there no way to handle that so
> > > > > > > the user can use single-escaped regex? I know many folks who use big data
> > > > > > > platforms to test large, complex regexes for things like security
> > > > > > > appliances, and having to convert the regex seems like a lot of work if
> > > > > > > you consider that every user has to do it. If there was a way to do it in
> > > > > > > Drill, that would save countless people hours and avoid many mistakes.
> > > > > > >
> > > > > > > On Thu, Feb 4, 2016 at 12:03 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > > > >
> > > > > > > > John, Jason,
> > > > > > > >
> > > > > > > > 2016-02-04 18:47 GMT+01:00 John Omernik <john@omernik.com>:
> > > > > > > >
> > > > > > > > > I'd be curious how you are implementing the regex... using Java's regex
> > > > > > > > > libraries? etc.
> > > > > > > > >
> > > > > > > > Yeah, I use java.util.regex.
> > > > > > > >
> > > > > > > > > I know one thing with Hive that always bothered me was the need to double
> > > > > > > > > escape things.
> > > > > > > > >
> > > > > > > > > '\d\d\d\d-\d\d-\d\d' needed to be '\\d\\d\\d\\d-\\d\\d-\\d\\d'. If we can
> > > > > > > > > avoid that it would be AWESOME.
> > > > > > > > >
> > > > > > > > My guess is that this comes from the way Java handles strings. All
> > > > > > > > languages I have used need double escaping.
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Thu, Feb 4, 2016 at 11:37 AM, Jason Altekruse <altekrusejason@gmail.com> wrote:
> > > > > > > >
> > > > > > > > The code is here: https://github.com/parisni/drill-simple-contains
> > > > > > > > It's disturbing how simple it is...
> > > > > > > >
> > > > > > > >
> > > > > > > > > > I think you should actually just put the function in Drill itself. System
> > > > > > > > > > native functions are implemented in the same interface as UDFs, because our
> > > > > > > > > > mechanism for evaluating them is very efficient (we code generate code
> > > > > > > > > > blocks by linking together the bodies of the individual functions to
> > > > > > > > > > evaluate a complete expression).
> > > > > > > > >
> > > > > > > > Well, the folder tree is quite impressive (https://github.com/apache/drill).
> > > > > > > >
> > > > > > > > Which folder is supposed to be "Drill itself"?
> > > > > > > >
> > > > > > > > > > You can open a JIRA, marking it a feature request. You can open a pull
> > > > > > > > > > request against the Apache GitHub repo, making sure you follow the standard
> > > > > > > > > > format for your commit message, prefixing it with the JIRA number.
> > > > > > > > > > Example:
> > > > > > > > > > DRILL-XXXX: Feature description
> > > > > > > > > >
> > > > > > > > > > This will automatically link the PR to your JIRA.
> > > > > > > > >
> > > > > > > > OK, I will try. Thanks a lot.
> > > > > > > >
> > > > > > > > > > - Jason
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 4, 2016 at 8:44 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Jason, I have it working.
> > > > > > > > > > >
> > > > > > > > > > > Just tell me how to proceed with the PR:
> > > > > > > > > > > 1. Where do I put my Maven project? Which folder in my Drill GitHub fork?
> > > > > > > > > > > 2. Do I need a JIRA? How do I proceed?
> > > > > > > > > > >
> > > > > > > > > > > For now, I have only published it on my GitHub account as a separate project.
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > > > 2016-02-04 16:52 GMT+01:00 Jason Altekruse <altekrusejason@gmail.com>:
> > > > > > > > > > >
> > > > > > > > > > > > Awesome, thanks!
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Feb 4, 2016 at 7:44 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Well, I am creating a UDF.
> > > > > > > > > > > > > Good exercise.
> > > > > > > > > > > > > I hope to have a PR soon.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2016-02-04 16:37 GMT+01:00 Jason Altekruse <altekrusejason@gmail.com>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I didn't realize that we were lacking this functionality. As the
> > > > > > > > > > > > > > repeated_contains operator handles wildcards, it makes sense to add such a
> > > > > > > > > > > > > > function to Drill.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It should be simple to implement. Would someone like to open a JIRA and
> > > > > > > > > > > > > > submit a PR for this?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - Jason
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Feb 2, 2016 at 8:56 AM, John Omernik <john@omernik.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would like to see something like this as well, even if it's an included
> > > > > > > > > > > > > > > UDF like REGEX(field, pattern) using Java's library for regex like Hive
> > > > > > > > > > > > > > > does. That would be EXTREMELY helpful.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Feb 2, 2016 at 6:55 AM, Nicolas Paris <niparisco@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ANSI SQL doesn't define a regex operator.
> > > > > > > > > > > > > > > > > Neither does Drill.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Drill has SQL function extensions like "REPEATED_CONTAINS" that appear to
> > > > > > > > > > > > > > > > handle regex. Could the regex operator be replaced with a new SQL extension?
> > > > > > > > > > > > > > > > I guess I could create my own functions in Java, right? Maybe push them
> > > > > > > > > > > > > > > > to GitHub then?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Isn't the 'LIKE' operator enough?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Sadly not, I am looking for complex pattern matching.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Miura, Masahide
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > From: Nicolas Paris [mailto:niparisco@gmail.com]
> > > > > > > > > > > > > > > > > Sent: Tuesday, February 02, 2016 9:04 PM
> > > > > > > > > > > > > > > > > To: user@drill.apache.org
> > > > > > > > > > > > > > > > > Subject: REGEX search Operator
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I can't find any reference in the documentation about a regex operator.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I would like to be able to query this way:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > SELECT *
> > > > > > > > > > > > > > > > > FROM xxx
> > > > > > > > > > > > > > > > > WHERE text_field regexOperator 'regex_pattern';
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for helping,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
