lucene-java-user mailing list archives

From Vaijanathrao <>
Subject Re: REPOST from another list: Question related to improving search results
Date Sat, 02 May 2009 11:07:22 GMT
Hi Aditya,

You can use any HTML parser if you are fetching or crawling pages from Wikipedia, and ignore
the sections that are repetitive.
If you are using the Jericho parser, here is what you can do.

import java.net.URL;
import net.htmlparser.jericho.*;

URL u = new URL("any english wikipedia page");
Source src = new Source(u.openConnection().getInputStream());
// Skip <head> and the template elements that repeat on every page.
TextExtractor textExtractor = new TextExtractor(src) {
	public boolean excludeElement(StartTag startTag) {
		return HTMLElementName.HEAD.equals(startTag.getName())
			|| "printfooter".equalsIgnoreCase(startTag.getAttributeValue("class"))
			|| "footer".equalsIgnoreCase(startTag.getAttributeValue("id"))
			|| "references".equalsIgnoreCase(startTag.getAttributeValue("class"))
			|| "infobox sisterproject".equalsIgnoreCase(startTag.getAttributeValue("class"))
			|| "siteSub".equalsIgnoreCase(startTag.getAttributeValue("id"))
			|| "dablink".equalsIgnoreCase(startTag.getAttributeValue("class"))
			|| "portlet".equalsIgnoreCase(startTag.getAttributeValue("class"))
			|| "jump-to-nav".equalsIgnoreCase(startTag.getAttributeValue("id"))
			|| "mw-hidden-cats-hidden".equalsIgnoreCase(startTag.getAttributeValue("class"))
			|| "generated-sidebar portlet".equalsIgnoreCase(startTag.getAttributeValue("class"));
	}
};
// Extract the remaining text, leaving out attribute values such as title and alt.
String parsedText = textExtractor.setIncludeAttributes(false).toString();

The code above does not remove all of the repetitive content, so you will need to dig a little
deeper into the page markup to find the rest. If you are not crawling the wiki pages and are
using the XML dump instead, take any MediaWiki parser that produces HTML and then apply the
code above, although that does duplicate some effort.
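
One reason the checks above can miss elements is that they compare the whole class attribute
as a single string, so a tag with extra or reordered classes slips through. A minimal sketch
of a token-based check follows, in plain Java with no Jericho dependency. The class name
BoilerplateFilter and the method isBoilerplate are hypothetical helpers, not part of any
library, and the id and class values are simply the ones from the excludeElement() conditions
above. Note that it matches on the single class "sisterproject" rather than the full string
"infobox sisterproject", which is an assumption about the intent.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether an element belongs to the page template.
final class BoilerplateFilter {
    // id values taken from the excludeElement() conditions, lower-cased.
    private static final Set<String> EXCLUDED_IDS = new HashSet<>(Arrays.asList(
            "footer", "sitesub", "jump-to-nav"));

    // Individual class tokens taken from the same conditions, lower-cased.
    private static final Set<String> EXCLUDED_CLASSES = new HashSet<>(Arrays.asList(
            "printfooter", "references", "sisterproject", "dablink",
            "portlet", "mw-hidden-cats-hidden"));

    private BoilerplateFilter() {}

    // Returns true if the id matches, or if any whitespace-separated class token matches.
    // Either argument may be null, as it is when the attribute is absent.
    static boolean isBoilerplate(String id, String classAttr) {
        if (id != null && EXCLUDED_IDS.contains(id.trim().toLowerCase(Locale.ROOT))) {
            return true;
        }
        if (classAttr == null) {
            return false;
        }
        for (String token : classAttr.trim().split("\\s+")) {
            if (EXCLUDED_CLASSES.contains(token.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Inside excludeElement() this could be called with startTag.getAttributeValue("id") and
startTag.getAttributeValue("class"), keeping the HEAD check as it is. New template sections
then only need a new entry in one of the two sets.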

--Thanks and Regards
Vaijanath N. Rao

----- Original Message -----
From: "Aditya" <>
Sent: Saturday, May 2, 2009 4:19:33 PM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: REPOST from another list: Question related to improving search results



New to this group.

Generally, sites like Wikipedia have a template that every page follows.
These templates contain words that occur on every page.


For example, the Wikipedia template has the list of languages in the left panel.
These words get indexed for every page, since they are not (and cannot be) stop
words.

If a user searches for "Galego", for example, every Wikipedia page will be in the
search results, which is wrong because not every Wikipedia page talks about
Galego.


Any thoughts on how to solve this problem?


Best Regards,


