nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-466) Flexible segment format
Date Mon, 02 Apr 2007 12:50:32 GMT


Andrzej Bialecki  commented on NUTCH-466:

> I thought that the map will be from class names to directory names.

Well, then you would have to pass the whole class name in an RPC call - I think we should
come up with a way that uses at most one byte to select the right part.
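
One way to picture the one-byte selector is a small registry that assigns each part name a numeric id, so that an RPC request carries a single byte instead of a class name. This is only a sketch: the class SegmentPartRegistry and its methods are hypothetical and not part of Nutch, and it assumes that client and server register the same part names in the same order (for example from a shared configuration).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical registry mapping segment part names to one-byte ids; not a Nutch class. */
public class SegmentPartRegistry {
    private final Map<String, Byte> idsByName = new LinkedHashMap<String, Byte>();
    private final String[] namesById = new String[256];
    private int next = 0;

    /** Registers a part name and returns its id; registering the same name again returns the same id. */
    public synchronized byte register(String partName) {
        Byte existing = idsByName.get(partName);
        if (existing != null) {
            return existing.byteValue();
        }
        if (next > 255) {
            throw new IllegalStateException("a single byte can select at most 256 parts");
        }
        byte id = (byte) next; // ids 128..255 wrap to negative byte values
        namesById[next] = partName;
        idsByName.put(partName, Byte.valueOf(id));
        next++;
        return id;
    }

    /** Resolves the byte received over the wire back to a part name. */
    public synchronized String nameOf(byte id) {
        String name = namesById[id & 0xFF]; // treat the byte as unsigned
        if (name == null) {
            throw new IllegalArgumentException("unknown part id: " + (id & 0xFF));
        }
        return name;
    }
}
```

With such a table the server only needs nameOf(id) to find the directory to read from; how the table itself is distributed is left open here.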

> Do you think that we should also move HitDetailer, HitSummarizer, HitContent and Searcher
> to this plugin system?

Yes, that was my plan - the same way we did it with indexing plugins - although I intend to
create a separate issue regarding the use of separate index / page / summary servers, to avoid
complicating this patch too much.

> Flexible segment format
> -----------------------
>                 Key: NUTCH-466
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
> In many situations it is necessary to store more data associated with pages than is
> possible with the current segment format. Quite often this is binary data. There are two
> common workarounds for this: one is to use per-page metadata, either in Content or ParseData,
> the other is to use an external independent database using page IDs as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch,
> crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is
> a natural extension of this existing segment format, i.e. to introduce the ability to add
> arbitrarily named segment "parts", with the only requirement that they should be MapFile-s
> that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader
> to accommodate even more sophisticated scenarios.
> The existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should
> be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.
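
The issue description mentions a possible SegmentPart.Writer/Reader but does not define it. The following is a guess at a minimal shape, not the API from the patch: the interface methods, the InMemoryPart class and the "thumbnails" part name are all assumptions made for illustration, and a sorted in-memory map stands in for a MapFile so that the sketch has no Hadoop dependency.

```java
import java.util.TreeMap;

/** Hypothetical sketch of a pluggable segment part; names and signatures are assumptions. */
public final class SegmentPart {
    private SegmentPart() {}

    /** Writes key/value records into one named part of a segment. */
    public interface Writer<K extends Comparable<K>, V> {
        void append(K key, V value);
    }

    /** Looks up the value stored for a key in one named part of a segment. */
    public interface Reader<K extends Comparable<K>, V> {
        V get(K key);
    }

    /** In-memory stand-in for a MapFile-backed part, for illustration only. */
    public static final class InMemoryPart<K extends Comparable<K>, V>
            implements Writer<K, V>, Reader<K, V> {
        private final String name;
        private final TreeMap<K, V> entries = new TreeMap<K, V>();
        private K lastKey;

        public InMemoryPart(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }

        /** Rejects keys that sort before the previous one, since MapFile-s expect sorted appends. */
        public void append(K key, V value) {
            if (lastKey != null && key.compareTo(lastKey) < 0) {
                throw new IllegalArgumentException("key out of order in part " + name);
            }
            entries.put(key, value);
            lastKey = key;
        }

        /** Returns the stored value, or null if the part has no record for the key. */
        public V get(K key) {
            return entries.get(key);
        }
    }
}
```

A thumbnail part, for instance, could then be created as new SegmentPart.InMemoryPart<String, byte[]>("thumbnails") and filled with one record per page URL; a real implementation would presumably use Writable keys and values and read from the segment directory.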

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
