lucene-solr-user mailing list archives

From Furkan KAMACI <>
Subject WikipediaTokenizer for Removing Unnecessary Parts
Date Tue, 23 Jul 2013 14:53:32 GMT

I have indexed Wikipedia data with Solr DIH. However, when I look at the
data that is indexed in Solr, I see wiki markup like this as well:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Linux Mint]]
*[[Red Hat]]
*[[Arch Linux]]

I want to remove this markup before indexing. I know that there is a
WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts
(such as links, styles, etc.) with Solr?
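
To make clear what kind of cleanup I mean, here is a rough sketch in
Python of the result I am after. It is only an illustration and not
something Solr provides: the function name strip_wiki_markup and the
regular expressions are my own assumptions, and it crudely drops every
table line (including any cell text) rather than parsing the table.

```python
import re

# [[target]] or [[target|label]]: keep the label if present, else the target.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# Leading list/indent markers such as ":*", "*", "#", ";".
LIST_PREFIX = re.compile(r"^[:*#;]+\s*")
# Lines that open, close, or continue a wiki table ("{|", "|}", "|-", "|", "!").
TABLE_LINE = re.compile(r"^\s*(\{\||\|\}|\|-|\||!)")


def strip_wiki_markup(text):
    """Hypothetical helper: return text with simple wiki markup removed."""
    cleaned = []
    for line in text.splitlines():
        if TABLE_LINE.match(line):
            continue  # assumption: table rows carry only layout, so drop them
        line = LINK.sub(r"\1", line)      # unwrap links
        line = LIST_PREFIX.sub("", line)  # drop list markers
        if line.strip():
            cleaned.append(line.strip())
    return "\n".join(cleaned)


SAMPLE = """{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Linux Mint]]
*[[Red Hat]]
*[[Arch Linux]]"""

if __name__ == "__main__":
    print(strip_wiki_markup(SAMPLE))
```

For the sample above I would want only the three names (Linux Mint,
Red Hat, Arch Linux) to end up in the index.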
