manifoldcf-user mailing list archives

From "Wunderlich, Tobias" <>
Subject Indexing Wikipedia/MediaWiki
Date Fri, 16 Sep 2011 07:53:29 GMT
Hey folks,

I am currently working on a project to build a basic search platform using Solr and ManifoldCF.
One of the content repositories I need to index is a wiki (MediaWiki), and that's where I ran
into a wall. I tried the web connector, but simply crawling the pages pulled in a lot of
content I don't need (navigation links, ...) and missed some of the information I wanted
(author, last modified, ...). The only metadata I got was what the pages include in head/meta,
which wasn't relevant.

Is there another way to get at the wiki's data and, more importantly, is there a way to get the
right data into the right field? I know the wiki pages can be exported as XML containing
wiki syntax, but I don't know how that would help me. I could simply use Solr's DataImportHandler
to index a complete wiki dump, but it would be nice to use the same framework for every repository,
especially since ManifoldCF manages all the recrawling.
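
To make the XML export idea concrete, here is a rough Python sketch of how an export could
be turned into per-page records with separate fields. It assumes the usual MediaWiki export
layout (page/title, revision/timestamp, revision/contributor/username or ip, revision/text).
The sample document, the function names and the output field names (title, author,
last_modified, content) are made up for illustration and are not part of ManifoldCF or Solr.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the shape of a MediaWiki XML export.
SAMPLE_EXPORT = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example page</title>
    <revision>
      <timestamp>2011-09-15T12:00:00Z</timestamp>
      <contributor><username>ExampleUser</username></contributor>
      <text>'''Example''' wiki text.</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix so the export schema version does not matter."""
    return tag.rsplit("}", 1)[-1]


def find_child(elem, name):
    """Return the first direct child with the given local name, or None."""
    if elem is None:
        return None
    for child in elem:
        if local(child.tag) == name:
            return child
    return None


def child_text(elem, name):
    """Return the text of the named direct child, or None if it is missing."""
    child = find_child(elem, name)
    return child.text if child is not None else None


def pages_to_docs(xml_string):
    """Map each <page> to a dict; the key names are assumed index field names."""
    root = ET.fromstring(xml_string)
    docs = []
    for page in root.iter():
        if local(page.tag) != "page":
            continue
        # Only the first <revision> is read; a full-history dump has several.
        revision = find_child(page, "revision")
        contributor = find_child(revision, "contributor")
        # Anonymous edits carry <ip> instead of <username>.
        author = child_text(contributor, "username") or child_text(contributor, "ip")
        docs.append({
            "title": child_text(page, "title"),
            "author": author,
            "last_modified": child_text(revision, "timestamp"),
            "content": child_text(revision, "text"),  # still raw wiki syntax
        })
    return docs


if __name__ == "__main__":
    for doc in pages_to_docs(SAMPLE_EXPORT):
        print(doc)
```

The content field would still hold raw wiki syntax, so the markup would need to be stripped
or rendered before indexing, and this alone does nothing about scheduling the recrawls.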

Does anybody have experience with this, or an idea for a solution?

Thanks in advance,
