lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: WikipediaTokenizer for Removing Unnecessary Parts
Date Tue, 23 Jul 2013 15:37:24 GMT
Are you actually seeing that output from the WikipediaTokenizerFactory?? 
Really? Even if you use the Solr Admin UI analysis page?

You should just see the text tokens plus the URLs for links.
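For reference, a minimal schema.xml field type wired to that tokenizer might
look like the sketch below; the field type name and the trailing filter are
illustrative, not taken from this thread:

<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- the Wikipedia tokenizer strips the wiki markup and emits the plain
         text tokens (plus link targets), each tagged with a token type -->
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Pointing the Admin UI analysis page at a field of that type shows exactly
which tokens (and token types) come out for a given snippet of wiki text.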

-- Jack Krupansky

-----Original Message----- 
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 10:53 AM
To: solr-user@lucene.apache.org
Subject: WikipediaTokenizer for Removing Unnecessary Parts

Hi;

I have indexed Wikipedia data with Solr DIH. However, when I look at the data
indexed in Solr, I also see content like this:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

I want to remove this markup before indexing. I know that there is a
WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts (such
as links, styles, etc.) with Solr?
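One possible approach, sketched below: since the WikipediaTokenizer tags every
token with a type, a solr.TypeTokenFilterFactory could drop the unwanted types
before indexing. The field type name, the stoptypes.txt file, and the idea of
which types to list in it are assumptions for illustration; the actual type
strings the tokenizer assigns can be checked on the Admin UI analysis page.

<fieldType name="text_wiki_clean" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <!-- stoptypes.txt would list the token types to discard, one per line,
         e.g. the types assigned to link/markup tokens; the file name and
         its contents are placeholders here -->
    <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>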

