lucene-java-user mailing list archives

From Michael Wechner <michael.wech...@wyona.org>
Subject Re: <no-index> or <index>
Date Fri, 31 Jan 2003 00:04:26 GMT
Ronnie Kolehmainen wrote:

>Michael,
>
>the HtmlDocument class supports ignoring tags, i.e. all text inside the
>specified tag names is ignored. Look at the setIgnoreTags(String[] ignoredtags)
>method. Remember to also include "script" and "style" in this array along
>with your custom tag names.
>

I am not able to find the method setIgnoreTags() (I have updated my
jakarta-lucene and jakarta-lucene-sandbox). Or would that have been within
the attachment? I guess attachments are skipped by the mailing list server.

I am now using Erik's code from sandbox.

Anyway, thanks a lot for your help

Michael
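
For readers without the attachment, the behaviour Ronnie describes (dropping
all text inside a configurable set of tags) can be sketched roughly as below.
This is neither Ronnie's HtmlDocument class nor Erik's sandbox code: the class
name IgnoreTagsSketch and the method extractText are made up for illustration,
and the scanner is deliberately naive (it does not handle comments, CDATA, or
a "<" inside a script body).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: strips markup and skips all text inside ignored tags.
class IgnoreTagsSketch {

    /** Returns the visible text of html, minus everything inside ignoredTags. */
    static String extractText(String html, String[] ignoredTags) {
        Set<String> ignored = new HashSet<>();
        for (String tag : ignoredTags) {
            ignored.add(tag.toLowerCase(Locale.ROOT));
        }

        StringBuilder text = new StringBuilder();
        int depth = 0; // greater than 0 while inside an ignored element
        int pos = 0;
        while (pos < html.length()) {
            int open = html.indexOf('<', pos);
            if (open < 0) {
                open = html.length();
            }
            if (depth == 0) {
                text.append(html, pos, open); // keep text outside ignored tags
            }
            if (open == html.length()) {
                break;
            }
            int close = html.indexOf('>', open);
            if (close < 0) {
                break; // unterminated tag: stop rather than guess
            }
            String tag = html.substring(open + 1, close).trim();
            boolean closing = tag.startsWith("/");
            boolean selfClosing = tag.endsWith("/");
            String name = tagName(closing ? tag.substring(1) : tag);
            if (ignored.contains(name)) {
                if (closing) {
                    if (depth > 0) {
                        depth--;
                    }
                } else if (!selfClosing) {
                    depth++;
                }
            }
            text.append(' '); // a tag always separates words
            pos = close + 1;
        }
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Extracts the lower-cased element name from the inside of a tag. */
    private static String tagName(String tag) {
        String t = tag.trim();
        int end = 0;
        while (end < t.length()
                && !Character.isWhitespace(t.charAt(end))
                && t.charAt(end) != '/') {
            end++;
        }
        return t.substring(0, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String html = "<p>News</p><no-index>International<br>Business<br></no-index>"
                + "<script>var x = 1;</script><p>Article body</p>";
        System.out.println(extractText(html, new String[] {"no-index", "script", "style"}));
    }
}
```

The ignored-tag list is passed in as an array, mirroring the advice in the
thread to list "script" and "style" next to any custom tag names. The
extracted string would then be what gets handed to the indexer.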

>
>Hope this is of some help to you.
>
>See below for the message from an old thread.
>
>/Ronnie
>
>>Hi
>>
>>I am looking for an HTMLParser which skips text tagged by
>><no-index> or something similar. This way I could exclude, for
>>instance, a "global navigation section" within the HTML:
>>
>><no-index>
>>International<br>
>>Business<br>
>>Science<br>
>>...
>></no-index>
>>
>>It seems that the current demo/HTMLParser
>>(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
>>is not capable of doing something like that.
>>
>>Any pointers are very welcome.
>>
>>Thanks a lot
>>
>>Michael
>
>Message sent on Dec 9, 2002:
>
>
>Hi,
>
>these are the classes I use. I only use them to extract the text, so
>they don't have methods for getting the document title and such. However,
>text extraction has worked fine for me.
>
>The HtmlParser main method takes a file path as argument and outputs the
>contents to a file named html.txt - useful when testing.
>
>/Ronnie
>
>
>  
>
>>-----Original message-----
>>From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>>Sent: 7 December 2002 17:12
>>To: Lucene Users List
>>Subject: Re: SV: Indexing HTML
>>
>>
>>I have had good experiences with nekoHTML parser.
>>
>>Otis
>>
>>--- Leo Galambos <galambos@com-os2.ms.mff.cuni.cz> wrote:
>>>>I'm not sure this is a solution to your problem. However, it seems that
>>>>the HTMLParser used by the IndexHTML class has problems parsing the
>>>>document (there is a test class included in the jar):
>>>>
>>>>>java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
>>>>Title: Webcz.cz - Power of search
>>>>Parse Aborted: Encountered "\'" at line 106, column 27.
>>>>Was expecting one of:
>>>>    <ArgName> ...
>>>>    <TagEnd> ...
>>>>/Ronnie
>>>>
>>>Hi Ronnie!
>>>
>>>I know about it and the exception is handled well (see log file below).
>>>I have found a better example than 1529, try this:
>>>http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot go through
>>>the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
>>>specific, i.e. it has two titles, two base tags etc.
>>>
>>>I have no debugger here, so I cannot find the line where the bug is.
>>>If you try your magic, please let me know about the patch. :) THX
>>>
>>>-g-
>>>
>>>
>>>
>>>adding save/d00320/f01516.html
>>>Parse Aborted: Lexical error at line 68, column 11.  Encountered: "\u0178" (376), after : ""
>>>:
>>>adding save/d00320/f01527.html
>>>Parse Aborted: Encountered "=" at line 83, column 48.
>>>Was expecting one of:
>>>    <ArgName> ...
>>>    <TagEnd> ...
>>>
>>>adding save/d00320/f01528.html
>>>
>>>
>>>
>>>--
>>>To unsubscribe, e-mail:
>>><mailto:lucene-user-unsubscribe@jakarta.apache.org>
>>>For additional commands, e-mail:
>>><mailto:lucene-user-help@jakarta.apache.org>
>>>


