nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: does nutch follow HEAD <link> element?
Date Fri, 16 Jun 2006 23:07:26 GMT
AJ Chen wrote:
> I'm about to use nutch to crawl semantic data. Links to semantic data
> files (RDF, OWL, etc.) can be placed in two places: (1) HEAD <link>;
> (2) BODY <a href...>.  Does the nutch crawler follow the HEAD <link>?

Yes. Please see parse-html/..../ for details.
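
To illustrate the idea only (this is not the parse-html plugin code, just
a minimal Python sketch using the standard library), an outlink extractor
can treat <link> in HEAD and <a> in BODY the same way. The class name
OutlinkCollector, the helper collect_outlinks and the example.org URLs
are made up for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkCollector(HTMLParser):
    """Collect href targets from both <link> and <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag in ("link", "a"):
            href = dict(attrs).get("href")
            if href:
                # Resolve relative references against the page URL.
                self.outlinks.append((tag, urljoin(self.base_url, href)))


def collect_outlinks(html, base_url):
    parser = OutlinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks


page = """<html><head>
<link rel="meta" type="application/rdf+xml" href="/data/foaf.rdf">
</head><body>
<a href="ontology.owl">ontology</a>
</body></html>"""

for tag, url in collect_outlinks(page, "http://example.org/people/index.html"):
    print(tag, url)
```

Both the RDF file referenced from HEAD and the OWL file referenced from
BODY end up in the same outlink list, so a crawler built this way would
schedule both for fetching.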

> I'm also creating a semantic data publishing tool, and I would
> appreciate any suggestions regarding the best way to make RDF files
> visible to the nutch crawler.

Well, Nutch is certainly not a competitor to an RDF triple-store ;) It 
may be used to collect RDF files, and then the map-reduce jobs can be 
used to massively process these files to annotate large numbers of 
target resources (e.g. add metadata to pages in the crawldb). You could 
also load them into a triple store and use that to annotate resources in 
Nutch, to provide a better searching experience (e.g. searching by 
concept, by semantic relationships, finding similar concepts in other 
ontologies, etc.).

In the end, the model that Nutch supports the best is the Lucene model, 
which is an unordered bag of documents with multiple fields 
(properties). If you can translate your required model into this, then 
you're all set. Nutch/Hadoop also provides a scalable processing 
framework, which is quite useful for enhancing the existing data with 
data from external sources (e.g. databases, triple stores, ontologies, 
semantic nets and such).
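
As a rough illustration of translating a graph model into a bag of
documents with fields, here is a minimal Python sketch. It assumes the
RDF has already been parsed into (subject, predicate, object) tuples;
the function name triples_to_documents and the sample data are
hypothetical, and a real setup would hand the resulting fields to the
indexer rather than keep them in a dict:

```python
from collections import defaultdict


def triples_to_documents(triples):
    """Group (subject, predicate, object) triples into one flat document per subject."""
    docs = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        # Each predicate becomes a multi-valued field of the subject's document.
        docs[subject][predicate].append(obj)
    # Convert to plain dicts so the result is easy to inspect or serialize.
    return {s: dict(fields) for s, fields in docs.items()}


triples = [
    ("http://example.org/page1", "dc:title", "Nutch overview"),
    ("http://example.org/page1", "dc:subject", "crawling"),
    ("http://example.org/page1", "dc:subject", "search"),
    ("http://example.org/page2", "dc:title", "RDF primer"),
]

docs = triples_to_documents(triples)
for url, fields in docs.items():
    print(url, fields)
```

Note what is lost in this flattening: links between resources become
plain field values, so anything that needs to traverse relationships
still belongs in the triple store.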

In some cases, when this external infrastructure is efficient enough, 
it's possible to combine it on the fly (I have successfully used this 
approach with WordNet, Wikipedia and DMOZ). In other cases you will need 
to do some batch pre-processing to make this external metadata available 
as a part of Nutch documents. Again, the framework of map/reduce and 
DFS is very useful for that (I have used this approach too, even with 
the same data as above).
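
The two approaches can be contrasted with a small Python sketch. It is
only an in-memory illustration under stated assumptions: the dict
EXTERNAL_SYNONYMS stands in for a fast external resource, and all
function and field names here are hypothetical, not Nutch or Hadoop
APIs:

```python
from functools import lru_cache

# Hypothetical stand-in for a fast external resource such as a thesaurus.
EXTERNAL_SYNONYMS = {"car": ["automobile"], "crawl": ["spider"]}


@lru_cache(maxsize=10000)
def lookup_synonyms(term):
    # Cached so that repeated terms do not hit the external source again.
    return tuple(EXTERNAL_SYNONYMS.get(term, ()))


def enrich_on_the_fly(doc):
    """Query the external source per document, while it is being processed."""
    extra = []
    for term in doc["terms"]:
        extra.extend(lookup_synonyms(term))
    return {**doc, "synonyms": extra}


def enrich_in_batch(docs, metadata_records):
    """Map/reduce-style join: key both inputs by URL, then merge per key."""
    grouped = {}
    for doc in docs:  # "map" over the documents
        grouped.setdefault(doc["url"], {"doc": None, "meta": []})["doc"] = doc
    for url, value in metadata_records:  # "map" over the external metadata
        grouped.setdefault(url, {"doc": None, "meta": []})["meta"].append(value)
    # "reduce": one enriched document per URL that actually has a document.
    return [
        {**g["doc"], "metadata": g["meta"]}
        for g in grouped.values()
        if g["doc"] is not None
    ]


print(enrich_on_the_fly({"url": "http://example.org/a", "terms": ["car", "road"]}))
print(enrich_in_batch(
    [{"url": "http://example.org/a", "terms": []}],
    [("http://example.org/a", "topic:Transport"), ("http://example.org/b", "topic:Other")],
))
```

The on-the-fly variant only pays off when each lookup is cheap; the
batch variant touches every record once and scales with the size of
the job rather than with the latency of the external source.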

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
