lucene-java-user mailing list archives

From Michael Wechner <michael.wech...@wyona.org>
Subject Re: <no-index> or <index>
Date Thu, 30 Jan 2003 23:59:03 GMT
Erik Hatcher wrote:

> If you look at the contributions/ant area of the Lucene sandbox in
> CVS you'll see my HtmlDocument class which uses JTidy.
>
> Rather than making up some invalid HTML tag, I'd recommend you
> separate your navigation section with a <div> or <span> with a
> special class="navigation" or something like that. Then use JTidy to
> ignore such tags that have that class. Then you get valid, clean
> HTML and the ability to filter it for indexing.


Well, I haven't found out how to make JTidy ignore tags that have such
a class. So I just added some code to the getBodyText method of your
HtmlDocument class:

                  if (child.getNodeName().equals("span")) {
                      org.w3c.dom.Attr attribute =
                          ((Element) child).getAttributeNode("class");
                      if (attribute != null) {
                          if (attribute.getValue().equals("lucene-no-index")) {
                              System.out.println("HtmlDocument.getBodyText(): ignore span!");
                              break;
                          }
                      }
                      System.out.println("HtmlDocument.getBodyText(): accept span!");
                  }

This way, text within <span class="lucene-no-index">...</span> will be
ignored. It's not "perfect", but it's working very well for the moment.
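
For anyone who wants to try the idea outside of the sandbox class, here
is a minimal, self-contained sketch of the same filtering. It is not the
actual HtmlDocument code: the class name NoIndexTextExtractor and the
helpers isExcluded and extract are made up for illustration, and it uses
the JDK's XML parser instead of JTidy, so the input has to be well-formed
XHTML. Unlike the snippet above, it skips any element (not only <span>)
and it also matches when the class attribute lists several class names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone class; not part of Lucene or its sandbox.
public class NoIndexTextExtractor {

    // Marker class name taken from the snippet above.
    static final String NO_INDEX_CLASS = "lucene-no-index";

    /** Returns true if the element's class attribute lists the marker class. */
    static boolean isExcluded(Element element) {
        // getAttribute returns "" when the attribute is absent.
        String classes = element.getAttribute("class");
        for (String name : classes.trim().split("\\s+")) {
            if (name.equals(NO_INDEX_CLASS)) {
                return true;
            }
        }
        return false;
    }

    /** Collects the text below a node, skipping excluded subtrees. */
    static String getBodyText(Node node) {
        StringBuilder buffer = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (isExcluded((Element) child)) {
                    continue; // drop this element and everything inside it
                }
                buffer.append(getBodyText(child)).append(' ');
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                buffer.append(child.getNodeValue());
            }
        }
        return buffer.toString();
    }

    /** Parses well-formed XHTML and returns the indexable text. */
    static String extract(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Collapse runs of whitespace so the result is easy to compare.
            return getBodyText(dom.getDocumentElement())
                    .replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<body>"
                + "<span class=\"lucene-no-index\">International Business</span>"
                + "<p>Real <span class=\"hl\">content</span> here</p>"
                + "</body>";
        System.out.println(extract(page));
    }
}
```

The navigation text never reaches the buffer, so it would not end up in
whatever field the returned string is indexed into.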

Two remarks:

1) I noticed that demo/HTMLDocument (or rather demo/html/HTMLParser) sets:

      contents = title + body

   whereas your class HtmlDocument sets:

      contents = body


2) I got two Javadoc warnings, because the @return tag is empty in
   HtmlDocument (for getDocument() and Document())


Thanks very much for your help

Michael





>
>
>     Erik
>
>
>
> On Thursday, January 30, 2003, at 04:56  AM, Michael Wechner wrote:
>
>> Hi
>>
>> I am looking for an HTMLParser which skips text tagged by
>>
>> <no-index>  or something similar. This way I could exclude for
>> instance a "global navigation section" within the HTML
>>
>> <no-index>
>> International<br>
>> Business<br>
>> Science<br>
>> ...
>> </no-index>
>>
>> It seems that the current demo/HTMLParser
>> (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
>> is not capable of doing something like that.
>>
>> Any pointers are very welcome.
>>
>> Thanks a lot
>>
>> Michael
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>
>


