drill-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: REGEX search Operator
Date Tue, 09 Feb 2016 18:54:12 GMT
John,
I realized I needed to make a change for your query to work, so I updated
the GitHub project.
select count(1) from view_mydata where srcday = '2016-02-05' and
contains(domain_name, '\\.com$'); will work now (just redeploy the jars).

I will also try to make
select count(1) from view_mydata where srcday = '2016-02-05' and
contains(domain_name, '\.com$'); work.
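
For readers following the escaping question: the thread states that the UDF uses java.util.regex, and java.util.regex itself accepts a single-backslash pattern such as \.com$ when it arrives as a runtime string. The doubling is only needed in Java source literals, or when a layer above consumes one level of backslashes. The sketch below illustrates that point. The class name and the doubleBackslashes helper are hypothetical and are not taken from the drill-simple-contains project, and whether Drill needs such a step is an assumption.

```java
import java.util.regex.Pattern;

class EscapeLayers {
    // Hypothetical helper: doubles every backslash, which is what a wrapper
    // could do if the layer below it strips one level of escaping.
    // Assumption for illustration only, not documented Drill behavior.
    static String doubleBackslashes(String userRegex) {
        // String.replace works on literal text, so "\\" here is one backslash.
        return userRegex.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        // At runtime this string holds \d{4}-\d{2}-\d{2}; the doubled
        // backslashes exist only in the Java source literal.
        String userRegex = "\\d{4}-\\d{2}-\\d{2}";
        System.out.println(userRegex);

        // java.util.regex compiles the single-backslash pattern as is.
        boolean ok = Pattern.compile(userRegex).matcher("2016-02-05").matches();
        System.out.println(ok); // true

        // What the pattern looks like after one extra level of escaping.
        System.out.println(doubleBackslashes(userRegex));
    }
}
```

If a conversion like this were done inside the function, users could write the pattern once, in its single-escaped form, which is what John asks for further down the thread.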

I will keep you posted on the new version.
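
As background to the two patterns in the quoted messages below ('\\.com$' returning 0, and the suggested '.*\\.com$'): one possible explanation, assuming the function was doing a whole-string match, is the difference between Matcher.matches() and Matcher.find() in java.util.regex. The thread does not say which call the UDF uses or what the fix was, so this is only a sketch of the standard Java behavior. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // Hypothetical helper: true when the pattern occurs anywhere in the input.
    static boolean containsRegex(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Hypothetical helper: true only when the pattern covers the whole input.
    static boolean matchesWhole(String input, String regex) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String domain = "example.com";
        // In Java source, "\\.com$" is the regex \.com$
        System.out.println(containsRegex(domain, "\\.com$"));  // true
        System.out.println(matchesWhole(domain, "\\.com$"));   // false
        System.out.println(matchesWhole(domain, ".*\\.com$")); // true
    }
}
```

With find(), a suffix pattern such as \.com$ is enough; with matches(), the pattern has to account for the whole value, which is why a leading .* changes the result.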


2016-02-09 19:22 GMT+01:00 Nicolas Paris <niparisco@gmail.com>:

> John,
>
> As for the escaping, I will look into that question.
> As for your query, you could try this pattern:
> select count(1) from view_mydata where srcday = '2016-02-05' and
> contains(domain_name, '.*\\.com$');
>
>
> 2016-02-09 17:19 GMT+01:00 John Omernik <john@omernik.com>:
>
>> I copied both files and it appears to work, but after some testing I am
>> getting inconsistent results; see below. I ran three queries. First, a LIKE
>> looking for domain names that end in .com (domain_name like '%.com'), which
>> returned a count of 9.8 million. Then I tried contains with '\.com$',
>> which means "ends in dot com". This failed (this goes back to my earlier
>> comments about really wishing we did not treat double escaping as normal.
>> For users, double escaping is NOT normal, so doing it to work around Java's
>> issues is hard. I am not sure how to handle it; it may be a tough issue, but
>> it really seems like something worth exploring).
>>
>> I then did contains(domain_name, '\\.com$'). This took quite a bit longer
>> and returned 0, so I am not really sure how the function is working at
>> this point. Thoughts?
>>
>> John
>>
>>
>>
>> > select count(1) from view_mydata where srcday = '2016-02-05' and
>> domain_name like '%.com';
>> +----------+
>> |  EXPR$0  |
>> +----------+
>> | 9810609  |
>> +----------+
>> 1 row selected (123.673 seconds)
>>
>>
>> > select count(1) from view_mydata where srcday = '2016-02-05' and
>> contains(domain_name, '\.com$');
>> Error: SYSTEM ERROR: ExpressionParsingException: Expression has syntax
>> error! line 1:79:mismatched input '<EOF>' expecting CParen
>>
>> Fragment 1:13
>>
>> [Error Id: 8e46bed4-f9ba-444f-a3aa-2f57db5ae34f on node3:31010]
>> (state=,code=0)
>>
>> > select count(1) from view_mydata where srcday = '2016-02-05' and
>> contains(domain_name, '\\.com$');
>> +---------+
>> | EXPR$0  |
>> +---------+
>> | 0       |
>> +---------+
>> 1 row selected (201.391 seconds)
>>
>>
>>
>> On Tue, Feb 9, 2016 at 9:34 AM, Nicolas Paris <niparisco@gmail.com>
>> wrote:
>>
>> > Hi John,
>> >
>> > There are actually two jars to put in the folder (generated in /target).
>> > Have you restarted Drill afterwards?
>> >
>> >
>> >
>> >
>> >
>> > 2016-02-09 16:20 GMT+01:00 John Omernik <john@omernik.com>:
>> >
>> > > Nicolas, not really sure what's happening here. It compiled fine, but when
>> > > I run it I get this error. The jar is distributed to my bits, I validated
>> > > that... it's in the DRILL_HOME/jars/3rdparty folder on every bit... do I
>> > > need to do something more than that?
>> > >
>> > >
>> > >
>> > > select count(1) from view_myview where srcday = '2016-02-05' and
>> > > contains(domain_name, 'com');
>> > > Error: SYSTEM ERROR: IllegalArgumentException: resource
>> > > /org/apache/drill/contrib/function/SimpleContains.java relative to
>> > > org.apache.drill.contrib.function.SimpleContains not found.
>> > >
>> > > Fragment 1:44
>> > >
>> > > [Error Id: 30c11047-9896-4e16-a207-e3cce79c9db5 on node1:31010]
>> > >
>> > >   (java.lang.IllegalArgumentException) resource
>> > > /org/apache/drill/contrib/function/SimpleContains.java relative to
>> > > org.apache.drill.contrib.function.SimpleContains not found.
>> > >     com.google.common.base.Preconditions.checkArgument():119
>> > >     com.google.common.io.Resources.getResource():203
>> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.get():127
>> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.checkInit():99
>> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.getMethod():81
>> > >     org.apache.drill.exec.expr.fn.DrillFuncHolder.meth():94
>> > >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.setupBody():50
>> > >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.renderEnd():80
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitFunctionHolderExpression():203
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitFunctionHolderExpression():1078
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():816
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():796
>> > >     org.apache.drill.common.expression.FunctionHolderExpression.accept():47
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanAnd():690
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanOperator():172
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitBooleanOperator():1092
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():836
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():796
>> > >     org.apache.drill.common.expression.BooleanOperator.accept():36
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitReturnValueExpression():551
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitUnknown():344
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitUnknown():1328
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():1027
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():796
>> > >     org.apache.drill.exec.physical.impl.filter.ReturnValueExpression.accept():56
>> > >     org.apache.drill.exec.expr.EvaluationVisitor.addExpr():105
>> > >     org.apache.drill.exec.expr.ClassGenerator.addExpr():227
>> > >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():187
>> > >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():109
>> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> > >     org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> > >     org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.buildSchema():100
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():142
>> > >     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>> > >     org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
>> > >     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> > >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
>> > >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
>> > >     java.security.AccessController.doPrivileged():-2
>> > >     javax.security.auth.Subject.doAs():415
>> > >     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>> > >     org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
>> > >     org.apache.drill.common.SelfCleaningRunnable.run():38
>> > >     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>> > >     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>> > >     java.lang.Thread.run():745 (state=,code=0)
>> > >
>> > > On Fri, Feb 5, 2016 at 2:39 AM, Nicolas Paris <niparisco@gmail.com>
>> > > wrote:
>> > >
>> > > > John,
>> > > >
>> > > > Sorry about that; this already works as expected.
>> > > > Give it a try, it is very easy to deploy.
>> > > >
>> > > > SELECT first_name FROM cp.`employee.json` WHERE contains(first_name,'\w+')
>> > > > LIMIT 5;
>> > > > first_name |
>> > > > -----------|
>> > > > Sheri      |
>> > > > Derrick    |
>> > > > Michael    |
>> > > > Maya       |
>> > > > Roberta    |
>> > > >
>> > > >
>> > > > 2016-02-04 20:41 GMT+01:00 John Omernik <john@omernik.com>:
>> > > >
>> > > > > Ya, do you see where I am coming from here? Let's let the users submit
>> > > > > regex in the pure form if possible, and code the nuances of Java regex
>> > > > > behind the scenes. I think it would be a great way to make Drill very
>> > > > > accessible and desirable. I think what happened in Hive is that the regex
>> > > > > commands started with the users having to escape, and now there are just
>> > > > > too many things using the escaped regex, so the project doesn't want to
>> > > > > adjust.
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Feb 4, 2016 at 1:38 PM, Nicolas Paris <niparisco@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > You mean:
>> > > > > > userRegex=>javaRegex
>> > > > > > "\d" => "\\d"
>> > > > > > "\w" => "\\w"
>> > > > > > "\n" => "\n"
>> > > > > > I can do that with a regex, I guess.
>> > > > > > I will give it a try.
>> > > > > >
>> > > > > >
>> > > > > > 2016-02-04 19:37 GMT+01:00 John Omernik <john@omernik.com>:
>> > > > > >
>> > > > > > > So my question on the double escape: is there no way to handle that so
>> > > > > > > the user can use single-escaped regex? I know many folks who use big data
>> > > > > > > platforms to test large, complex regexes for things like security
>> > > > > > > appliances, and having to convert the regex seems like a lot of work if
>> > > > > > > you consider that every user has to do that. If there was a way to do it
>> > > > > > > in Drill, that would save countless people hours and prevent many mistakes.
>> > > > > > >
>> > > > > > > On Thu, Feb 4, 2016 at 12:03 PM, Nicolas Paris <niparisco@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > John, Jason,
>> > > > > > > >
>> > > > > > > > 2016-02-04 18:47 GMT+01:00 John Omernik <john@omernik.com>:
>> > > > > > > >
>> > > > > > > > > I'd be curious how you are implementing the regex... using Java's
>> > > > > > > > > regex libraries? etc.
>> > > > > > > > >
>> > > > > > > > Yeah, I use java.util.regex
>> > > > > > > >
>> > > > > > > > > I know one thing with Hive that always bothered me was the need to
>> > > > > > > > > double escape things.
>> > > > > > > > >
>> > > > > > > > > '\d\d\d\d-\d\d-\d\d' needed to be '\\d\\d\\d\\d-\\d\\d-\\d\\d'. If we
>> > > > > > > > > can avoid that it would be AWESOME.
>> > > > > > > > >
>> > > > > > > > My guess is this comes from the way Java handles strings. All the
>> > > > > > > > languages I have used need double escaping.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > > On Thu, Feb 4, 2016 at 11:37 AM, Jason Altekruse <altekrusejason@gmail.com>
>> > > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > The code is here: https://github.com/parisni/drill-simple-contains
>> > > > > > > > It's disturbing how simple it is...
>> > > > > > > >
>> > > > > > > > > > I think you should actually just put the function in Drill itself. System
>> > > > > > > > > > native functions are implemented in the same interface as UDFs, because
>> > > > > > > > > > our mechanism for evaluating them is very efficient (we code generate code
>> > > > > > > > > > blocks by linking together the bodies of the individual functions to
>> > > > > > > > > > evaluate a complete expression).
>> > > > > > > > >
>> > > > > > > > Well, the folder tree is quite impressive (https://github.com/apache/drill).
>> > > > > > > >
>> > > > > > > > What folder is supposed to be "Drill itself"?
>> > > > > > > >
>> > > > > > > > > > You can open a JIRA, marking it a feature request. You can open a pull
>> > > > > > > > > > request against the Apache GitHub repo, making sure you follow the
>> > > > > > > > > > standard format for your commit message, prefixing with the JIRA number
>> > > > > > > > > > in the format
>> > > > > > > > > > Example:
>> > > > > > > > > > DRILL-XXXX: Feature description
>> > > > > > > > > >
>> > > > > > > > > > This will automatically link the PR to your JIRA.
>> > > > > > > > >
>> > > > > > > > OK, I will try. Thanks a lot.
>> > > > > > > >
>> > > > > > > > > > - Jason
>> > > > > > > > > >
>> > > > > > > > > > On Thu, Feb 4, 2016 at 8:44 AM, Nicolas Paris <niparisco@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Jason, I have it working.
>> > > > > > > > > > >
>> > > > > > > > > > > Just tell me how to proceed with the PR.
>> > > > > > > > > > > 1. Where do I put my Maven project? Which folder in my Drill GitHub
>> > > > > > > > > > > fork?
>> > > > > > > > > > > 2. Do I need a JIRA? How do I proceed?
>> > > > > > > > > > >
>> > > > > > > > > > > For now, I have only published it on my GitHub account as a separate
>> > > > > > > > > > > project.
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks
>> > > > > > > > > > >
>> > > > > > > > > > > 2016-02-04 16:52 GMT+01:00 Jason Altekruse <altekrusejason@gmail.com>:
>> > > > > > > > > > >
>> > > > > > > > > > > > Awesome, thanks!
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Thu, Feb 4, 2016 at 7:44 AM, Nicolas Paris <niparisco@gmail.com>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Well, I am creating a UDF.
>> > > > > > > > > > > > > Good exercise.
>> > > > > > > > > > > > > I hope to submit a PR soon.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 2016-02-04 16:37 GMT+01:00 Jason Altekruse <altekrusejason@gmail.com>:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > I didn't realize that we were lacking this functionality. As the
>> > > > > > > > > > > > > > repeated_contains operator handles wildcards, it makes sense to add
>> > > > > > > > > > > > > > such a function to Drill.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > It should be simple to implement. Would someone like to open a JIRA
>> > > > > > > > > > > > > > and submit a PR for this?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > - Jason
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Tue, Feb 2, 2016 at 8:56 AM, John Omernik <john@omernik.com>
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > I would like to see something like this as well, even if it's an
>> > > > > > > > > > > > > > > included UDF like REGEX(field, pattern) using Java's library for
>> > > > > > > > > > > > > > > regex like Hive does. That would be EXTREMELY helpful.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Tue, Feb 2, 2016 at 6:55 AM, Nicolas Paris <niparisco@gmail.com>
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > ANSI SQL doesn't define a regex operator.
>> > > > > > > > > > > > > > > > > Drill doesn't either.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Drill has SQL function extensions like "REPEATED_CONTAINS" that look
>> > > > > > > > > > > > > > > > like they handle regex. Could the regex operator be replaced with a new
>> > > > > > > > > > > > > > > > SQL extension? I guess I could create my own functions in Java, right?
>> > > > > > > > > > > > > > > > Maybe push them to GitHub then?
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Isn't the 'LIKE' operator enough?
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Sadly not, I am looking for complex pattern matching.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > Miura, Masahide
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > -----Original Message-----
>> > > > > > > > > > > > > > > > > From: Nicolas Paris [mailto:niparisco@gmail.com]
>> > > > > > > > > > > > > > > > > Sent: Tuesday, February 02, 2016 9:04 PM
>> > > > > > > > > > > > > > > > > To: user@drill.apache.org
>> > > > > > > > > > > > > > > > > Subject: REGEX search Operator
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hello,
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I can't find any reference in the documentation about a regex operator.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I would like to be able to query this way:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > SELECT *
>> > > > > > > > > > > > > > > > > FROM xxx
>> > > > > > > > > > > > > > > > > WHERE text_field regexOperator 'regex_pattern';
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks for helping,
>> > > > > > > > > > > > > > > > >
