tika-dev mailing list archives

From "Niall Pemberton (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-106) Remove dependency on Jakarta ORO - use JDK 1.4 Regex
Date Tue, 27 Nov 2007 03:40:43 GMT

     [ https://issues.apache.org/jira/browse/TIKA-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niall Pemberton updated TIKA-106:
---------------------------------

    Attachment: TIKA-106-remove-ORO-dependency-v1.patch

Attaching a patch to remove the Jakarta ORO dependency. It also changes the RegexUtils method from
extract() to extractLinks(), which uses a pre-compiled regular expression Pattern (thread
safe according to the Javadocs). I've added a test case, which is basically copied from Nutch's
OutlinkExtractor test case[1]. Note that the "utils" directory needs to be added to the test source
tree before the patch can be applied.

[1] http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/test/org/apache/nutch/parse/TestOutlinkExtractor.java
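
For readers without the patch to hand, the idea of a pre-compiled, shared Pattern can be sketched roughly as below. This is not the code in the attached patch: the class name LinkExtractorSketch, the URL_PATTERN constant and the regular expression itself are assumptions made purely for illustration. Only the extractLinks() method name comes from the description above, and its signature here is a guess.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the RegexUtils class from the TIKA-106 patch.
class LinkExtractorSketch {

    // Compiled once and shared. Pattern instances are immutable and safe to
    // use from multiple threads. The expression is a simplified assumption,
    // not the one used by Tika or Nutch.
    private static final Pattern URL_PATTERN =
            Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"']+");

    // Returns every URL-like substring found in the given text.
    static List<String> extractLinks(String content) {
        List<String> links = new ArrayList<String>();
        if (content == null) {
            return links;
        }
        // A Matcher is NOT thread safe, so a new one is created per call.
        Matcher matcher = URL_PATTERN.matcher(content);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "See http://example.org/a and https://example.org/b?x=1 now";
        for (String link : extractLinks(text)) {
            System.out.println(link);
        }
    }
}
```

The point of the design is that the (relatively expensive) compile step happens once, while each call gets its own Matcher, so no synchronization is needed around the shared Pattern.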

> Remove dependency on Jakarta ORO - use JDK 1.4 Regex
> ----------------------------------------------------
>
>                 Key: TIKA-106
>                 URL: https://issues.apache.org/jira/browse/TIKA-106
>             Project: Tika
>          Issue Type: Task
>          Components: general
>            Reporter: Niall Pemberton
>            Priority: Minor
>         Attachments: TIKA-106-remove-ORO-dependency-v1.patch
>
>
> Jakarta ORO is only used in one place in Tika: the extract() method of RegexUtils (which
> is only called in one place, in ParserPostProcessor). JDK 1.4 introduced built-in regular expression
> support, and changing RegexUtils to use it would remove the need for Jakarta ORO as a
> dependency.
> From the comments in RegexUtils it appears that this code was copied from Nutch's OutlinkExtractor[1].
> There seems to have been a similar move in Nutch back in March in r516754[2]; however, it
> was reverted the next day in r517015[3]. I couldn't really see anything on the Nutch dev
> list to explain this, except possibly this post: http://tinyurl.com/2s2y9r
> [1] http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java
> [2] http://svn.apache.org/viewvc?view=rev&revision=516754
> [3] http://svn.apache.org/viewvc?view=rev&revision=517015

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

