nutch-dev mailing list archives

From "chee wu " <chee...@gmail.com>
Subject Re: 'RegexIndexingFilter'
Date Tue, 30 Jan 2007 03:04:06 GMT
I have had the same question. I think there should be a field in the "Document" object
that tells the indexer to skip indexing, but I didn't find one, so I used a very crude
workaround. I hope the others can suggest a better method.
1. Return "null" from the method "filter(Document doc...)" in your own
IndexingFilter.
2. In the method "Indexer.reduce()", add some statements to handle a null doc right after
the statements where the filters are called. The modified code fragment might look like this:
  try {
    // run indexing filters
    doc = this.filters.filter(doc, parse, (UTF8) key, fetchDatum, inlinks);
  } catch (IndexingException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Error indexing " + key + ": " + e);
    }
    return;
  }
  // a filter returned null: skip this document instead of indexing it
  if (doc == null) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Skip indexing: " + key);
    }
    return;
  }
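
For step 1, here is a rough, self-contained sketch of the idea rather than real Nutch code. The types "Doc" and "SimpleIndexingFilter" are hypothetical stand-ins for the Lucene Document and the Nutch IndexingFilter interface, and the class name "RegexIndexingFilter" is just the name Tobias proposed. It only shows the "return null to skip" convention together with the null check from step 2:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class SkipIndexingSketch {

    /** Hypothetical stand-in for the Document object handed to the filters. */
    static class Doc {
        final Map<String, String> fields = new HashMap<>();
    }

    /** Hypothetical, simplified filter interface: returning null means "do not index". */
    interface SimpleIndexingFilter {
        Doc filter(Doc doc, String url);
    }

    /** Keeps documents whose URL matches the regex and returns null for all others. */
    static class RegexIndexingFilter implements SimpleIndexingFilter {
        private final Pattern allowed;

        RegexIndexingFilter(String regex) {
            this.allowed = Pattern.compile(regex);
        }

        @Override
        public Doc filter(Doc doc, String url) {
            // find() matches anywhere in the URL, so "\\.pdf$" means "ends with .pdf"
            return allowed.matcher(url).find() ? doc : null;
        }
    }

    /** Mimics the patched reduce(): a null result from the filter skips the document. */
    static List<String> index(List<String> urls, SimpleIndexingFilter filter) {
        List<String> indexed = new ArrayList<>();
        for (String url : urls) {
            Doc doc = new Doc();
            doc.fields.put("url", url);
            doc = filter.filter(doc, url);
            if (doc == null) {
                continue; // same role as the "Skip indexing" branch above
            }
            indexed.add(doc.fields.get("url"));
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/a.pdf",
                "http://example.com/b.html",
                "http://example.com/c.pdf");
        // Only the two PDF URLs should survive the filter.
        System.out.println(index(urls, new RegexIndexingFilter("\\.pdf$")));
    }
}
```

If several indexing filters are chained, a filter that runs after this one would presumably receive the null as well, so it may be safest to stop at the first null or to run this filter last.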


----- Original Message ----- 
From: "Tobias Zahn" <Tobias-Zahn@arcor.de>
To: <nutch-dev@lucene.apache.org>
Sent: Tuesday, January 30, 2007 2:57 AM
Subject: 'RegexIndexingFilter'


> Good evening!
> I have found out that it is impossible to index only some specific file
> types with nutch. Needing this feature, I thought of implementing an
> 'RegexIndexingFilter', if that would be the right thing to do so.
> I have read some sourcecode, but I couldn't find out how to tell the
> indexer that he shouldn't index a file.
> 
> Hoping that I am on the right way I hope for your opinions, ideas and
> your help.
> 
> TIA,
> Tobias Zahn
>