nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
Date Fri, 21 Nov 2008 14:20:46 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649679#action_12649679
] 

Dennis Kubes commented on NUTCH-662:
------------------------------------

The upgrade to Lucene 2.4 causes a problem that might need some discussion.  The
o.a.n.indexer.FsDirectory$DfsIndexOutput class is used to write an index stored on DFS.
In Lucene 2.4, the ChecksumIndexOutput.prepareCommit and finalizeCommit methods perform a
pseudo two-phase commit.  To do this, Lucene writes an intentionally mismatched checksum
(long = checksum - 1), flushes, then seeks back and writes the correct checksum in the same
spot.  They say this is to ensure the commit.  Because DFS doesn't have append functionality,
we can't write to it, seek back to a position, and write again.  DFS output can only be
written once, sequentially.
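
To make the write/seek-back pattern concrete, here is a minimal sketch of it against a
local file.  This is not Lucene's code: the class and method names (TwoPhaseChecksumSketch,
prepare, finish, isCommitted) are hypothetical, and CRC32 plus RandomAccessFile are stand-ins
chosen for illustration.  It only shows why the output has to support seek.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Hypothetical illustration of a pseudo two-phase commit on a seekable file.
final class TwoPhaseChecksumSketch {

    // Phase 1: write the payload, then a deliberately wrong checksum
    // (checksum - 1), sync, and seek back to where the checksum starts.
    static long prepare(RandomAccessFile out, byte[] payload) throws IOException {
        out.write(payload);
        long checksumPos = out.getFilePointer();
        out.writeLong(checksumOf(payload) - 1);
        out.getFD().sync();
        out.seek(checksumPos); // this is the step a forward-only stream cannot do
        return checksumPos;
    }

    // Phase 2: overwrite the placeholder with the correct checksum.
    static void finish(RandomAccessFile out, byte[] payload) throws IOException {
        out.writeLong(checksumOf(payload));
        out.getFD().sync();
    }

    static long checksumOf(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    // A reader treats the file as committed only if the stored checksum matches.
    static boolean isCommitted(File file, int payloadLength) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
            byte[] payload = new byte[payloadLength];
            in.readFully(payload);
            return in.readLong() == checksumOf(payload);
        }
    }
}
```

If the writer dies between prepare and finish, a reader sees the mismatched checksum and
can treat the file as uncommitted.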

To handle this problem in the attached patch, I first write out to a local temporary file
that is deleted upon exit.  When close is called on the IndexOutput, that file is written
out to DFS all at once.  I don't know if this is the best way to do this or if there is a
better way, but it does handle the new write and seek functionality of Lucene 2.4.  The previous
implementation of DfsIndexOutput simply threw an UnsupportedOperationException when the seek
method was called.  This was fine before 2.4, as Lucene wasn't calling that method while writing
to DFS.  In 2.4 it does, and unit tests were failing because of it.  What does everybody think
about this implementation?
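
For anyone who wants the idea without reading the patch, a rough sketch of the approach
follows.  It is not the patch itself: LocalBufferedOutput is a hypothetical name, and a
plain java.io.OutputStream stands in for the DFS output stream so the example has no Hadoop
or Lucene dependencies.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;

// Hypothetical sketch: buffer all writes in a seekable local temp file,
// then copy the finished file to a forward-only sink when close is called.
final class LocalBufferedOutput implements AutoCloseable {
    private final File tmp;
    private final RandomAccessFile local;
    private final OutputStream remote; // assumption: stands in for the DFS stream

    LocalBufferedOutput(OutputStream remote) throws IOException {
        this.remote = remote;
        this.tmp = File.createTempFile("dfs-index-output", ".tmp");
        this.tmp.deleteOnExit(); // safety net if close is never called
        this.local = new RandomAccessFile(tmp, "rw");
    }

    void write(byte[] bytes) throws IOException {
        local.write(bytes);
    }

    void writeLong(long value) throws IOException {
        local.writeLong(value);
    }

    // Seeking is cheap here because it only touches the local file.
    void seek(long pos) throws IOException {
        local.seek(pos);
    }

    long getFilePointer() throws IOException {
        return local.getFilePointer();
    }

    @Override
    public void close() throws IOException {
        local.close();
        try {
            // The remote side sees one sequential write of the final bytes.
            Files.copy(tmp.toPath(), remote);
        } finally {
            remote.close();
            Files.deleteIfExists(tmp.toPath());
        }
    }
}
```

The trade-off is that nothing reaches the remote side until close, and the local disk
needs room for the whole file.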

Other than that, I don't see any major issues in upgrading to 2.4.  Some people have said performance
went down in 2.4.  My thoughts are that this might be the case, but those problems will be fixed, and it would
be good to be on the most recent Lucene version as we move to a 1.0 release for Nutch.  Also,
we have been using 2.4 in production for a month now without any issues.

> Upgrade Nutch to use Lucene 2.4
> -------------------------------
>
>                 Key: NUTCH-662
>                 URL: https://issues.apache.org/jira/browse/NUTCH-662
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch
>
>
> Upgrade Nutch to use Lucene 2.4.  This release changes the Lucene file format.  New indexes
> created by this Lucene version will NOT be readable by older versions.  Lucene 2.4 can read
> and update older index formats, although updating an older format will convert it to the new
> format.  There are also some performance and functionality improvements.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

