lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7256) PatternReplaceCharFilter can make Lucene hang
Date Thu, 09 Jun 2016 13:28:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322492#comment-15322492
] 

Robert Muir commented on LUCENE-7256:
-------------------------------------

I don't think we should do that; it won't help. Nor should we offer APIs in Lucene that pretend
to take timeouts (like ExitableDirectoryReader), especially in this case, where it will not
work. I hate that its tests are time-based and fail sporadically.

There is nothing we can do to fix things like this with Java's regex stuff. We can't protect
against the user being stupid: http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html

> PatternReplaceCharFilter can make Lucene hang
> ---------------------------------------------
>
>                 Key: LUCENE-7256
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7256
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.4.1
>         Environment: alpine linux v3.3
>            Reporter: Tom Fotherby
>            Priority: Minor
>
> I'm using ElasticSearch (v2.2.0, Lucene v5.4.1) and its [Pattern Replace Char Filter|https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html] (Lucene's PatternReplaceCharFilter). I need to filter out URLs from my query text before it is tokenised. But I found that some input strings cause ElasticSearch to "hang" (slowly eating more CPU and memory) until the system crashes.
> ----
> *Example*
> {code}
> // Character filters are used to "tidy up" a string *before* it is tokenized.
> 'char_filter' => [
>     'url_removal_pattern' => [
>         'type'        => 'pattern_replace',
>         'pattern'     => '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»""'']))',
>         'replacement' => '',
>     ],
> ],
> {code}
> This filter was working fine for some weeks until suddenly ElasticSearch started crashing. We found someone was trying to do a JavaScript injection attack in our search box.
> I pasted the regex and the attack string into https://regex101.com 
> * Regexp: 
>  * {code}(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s!()\[\]{};:\'".,<>?«»""''])){code}
> * Test string: 
>  * {code}twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"{code}
> https://regex101.com shows the problem to be "Catastrophic backtracking"
> bq. Catastrophic backtracking has been detected and the execution of your expression has been halted. To find out more what this is, please read the following article: [Runaway Regular Expressions|http://www.regular-expressions.info/catastrophic.html].
> It would be great if Lucene could detect "Catastrophic backtracking" and throw an error or return null.
> ----
> As an aside, I created a unit test for our PHP application that uses the same regexp and test string. (PHP can understand the same regexp, even though it's obviously for Java in the ElasticSearch case.) Interestingly, in PHP the regex results in `null`, which is the documented response of [preg_replace|http://php.net/manual/en/function.preg-replace.php] when an error occurs. If PHP can return an error rather than crashing - surely Lucene / Java can too :trollface: ?
> {code}
> namespace app\tests\unit;
> use \yii\codeception\TestCase;
> class TagsControllerTest extends TestCase
> {
>     public function testRegexForURLDetection()
>     {
>         $regex = '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»""'']))';
>         // Test the Catastrophic backtracking problem
>         $testString = "twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"";
>         // This shows the regex is not working for our test string: it gives null but should give 'hello '
>         $this->assertEquals(null, preg_replace("/$regex/", '', "hello $testString"));
>     }
> }
> {code}
> ----
> (I originally [opened a ticket|https://github.com/elastic/elasticsearch/issues/17934] with the ElasticSearch project but was told that opening it here would be more appropriate. Sorry if I'm wrong.)
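
For readers who want to experiment with the behaviour discussed in the quoted issue, here is a minimal sketch of one application-side guard. It is not Lucene or Elasticsearch code, and it is not something proposed in this thread. It assumes that java.util.regex reads its input through CharSequence.charAt (true of the OpenJDK implementation as far as I know, but an implementation detail rather than a documented guarantee), and it counts those reads against a fixed budget instead of using a wall-clock timeout. The names BacktrackingBudgetDemo, BudgetedCharSequence and replaceAllWithBudget are hypothetical. The pattern (a+)+b is a textbook nested-quantifier example, not the URL regex from the issue; how badly it backtracks depends on the JDK version, so no output is claimed.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Lucene or Elasticsearch.
class BacktrackingBudgetDemo {

    /** Wraps the input and counts every charAt call made while matching. */
    static final class BudgetedCharSequence implements CharSequence {
        private final CharSequence delegate;
        private final long budget;
        private final long[] used; // shared with subSequence views so they draw on one budget

        BudgetedCharSequence(CharSequence delegate, long budget) {
            this(delegate, budget, new long[1]);
        }

        private BudgetedCharSequence(CharSequence delegate, long budget, long[] used) {
            this.delegate = delegate;
            this.budget = budget;
            this.used = used;
        }

        @Override
        public char charAt(int index) {
            // Assumption: the regex engine reads the input only through this method.
            if (++used[0] > budget) {
                throw new IllegalStateException(
                    "regex exceeded budget of " + budget + " character reads");
            }
            return delegate.charAt(index);
        }

        @Override
        public int length() {
            return delegate.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new BudgetedCharSequence(delegate.subSequence(start, end), budget, used);
        }

        @Override
        public String toString() {
            return delegate.toString();
        }
    }

    /** Like Matcher.replaceAll, but aborts with IllegalStateException once the budget is spent. */
    static String replaceAllWithBudget(Pattern pattern, String input,
                                       String replacement, long budget) {
        return pattern.matcher(new BudgetedCharSequence(input, budget)).replaceAll(replacement);
    }

    public static void main(String[] args) {
        Pattern nested = Pattern.compile("(a+)+b");
        // Thirty 'a' characters and no 'b', so the pattern can never match.
        String input = new String(new char[30]).replace('\0', 'a');
        try {
            System.out.println("result: " + replaceAllWithBudget(nested, input, "", 100_000));
        } catch (IllegalStateException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

The budget is deterministic, so a test built on it does not depend on timing. It bounds character reads only; it does not bound memory or recursion depth, and it does nothing to repair the pattern itself, which remains the real fix.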



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

