nutch-dev mailing list archives

From "Shay Lawless" <seamuslawl...@gmail.com>
Subject Full List of Metadata Fields
Date Wed, 06 Dec 2006 15:31:39 GMT
Hi all,

I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version
0.5.0-200611082313) to index a collection of ARC files generated by a web
crawl using the Heritrix web crawler (Version 1.4.0).

When I check the metadata tag on the Wera front-end, the following list of
tags is displayed:

ARC Identifier
URL
Time of Archival
Last Modified Time
Mime-Type
File Status
Content Checksum
HTTP Header

When I click on the explain link in the NutchWax front-end, the following
list of tags is displayed:

Segment
Digest
Date
ARCDate
Encoding
Collection
ARCName
ARCOffset
ContentLength
PrimaryType
subType
URL
Title
Boost

Is there a full list of the metadata fields that NutchWax/Nutch creates when
indexing? I'm particularly interested in tags relating to the actual content
of each page, i.e. content type, description, etc.

When searching, does NutchWax/Nutch search across such tags, or just across
the parsed text of each page for occurrences of keywords?

Any help you can provide would be greatly appreciated!

Shay
