manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: Indexing Wikipedia/MediaWiki
Date Fri, 16 Sep 2011 09:31:45 GMT
It might be worth exploring sitemaps.

It may be possible to create a connector, much like the RSS connector,
that you could point at a sitemap and it would just pick up the pages.
In fact, I think it would be straightforward to modify the RSS
connector to understand the sitemap format.

If you can do a little research to figure out if this might work for
you, I'd be willing to do some work and try to implement it.
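The sitemap format mentioned above is a small, well-specified XML schema (sitemaps.org), so a connector only needs to enumerate `<url>` entries and their optional `<lastmod>` dates. A minimal parsing sketch, using an illustrative sample document (the URLs are made up):

```python
# Sketch: enumerate page URLs from a sitemap, the way a sitemap-aware
# connector would discover documents to fetch.
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return a list of (url, lastmod) tuples from sitemap XML.

    lastmod is None when the optional <lastmod> element is absent.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc")
        lastmod = url.findtext(SITEMAP_NS + "lastmod")  # optional per spec
        entries.append((loc, lastmod))
    return entries

# Hand-written example document in the standard sitemap shape.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/wiki/Main_Page</loc>
    <lastmod>2011-09-15</lastmod>
  </url>
  <url>
    <loc>http://example.com/wiki/Help</loc>
  </url>
</urlset>"""

for loc, lastmod in parse_sitemap(sample):
    print(loc, lastmod)
```

The `<lastmod>` value is what makes sitemaps attractive for recrawling: a connector can skip pages that have not changed since the last crawl.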


On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias
<> wrote:
> Hey folks,
> I am currently working on a project to create a basic search platform using
> Solr and ManifoldCF. One of the content-repositories I need to index is a
> wiki (MediaWiki), and that’s where I ran into a wall. I tried using the
> web connector, but simply crawling the pages resulted in a lot of content I
> don’t need (navigation links, …), and not all the information I wanted was
> gathered (author, last modified, …). The only metadata I got was what is
> included in head/meta, which wasn’t relevant.
> Is there another way to get the wiki’s data, and more importantly, is there
> a way to get the right data into the right field? I know that there is a way
> to export wiki pages as XML with wiki syntax, but I don’t know how that
> would help me. I could simply use Solr’s DataImportHandler to index a
> complete wiki dump, but it would be nice to use the same framework for every
> repository, especially since ManifoldCF manages all the recrawling.
> Does anybody have some experience in this direction, or any idea for a
> solution?
> Thanks in advance,
> Tobias
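The author and last-modified metadata Tobias is after is exposed by the standard MediaWiki API (`api.php?action=query&prop=revisions&rvprop=user|timestamp&format=json`), independently of what appears in head/meta. A sketch of extracting those fields from such a response; the JSON below is a hand-written example of the documented response shape, not a live fetch:

```python
# Sketch: pull (author, last-modified) per page out of a MediaWiki
# action=query / prop=revisions response.
import json

def extract_revision_metadata(api_json):
    """Map page title -> (last author, last-modified timestamp)."""
    result = {}
    for page in api_json["query"]["pages"].values():
        rev = page["revisions"][0]  # most recent revision is listed first
        result[page["title"]] = (rev["user"], rev["timestamp"])
    return result

# Illustrative response in the documented shape (titles/users are made up).
sample_response = json.loads("""{
  "query": {
    "pages": {
      "1": {
        "title": "Main Page",
        "revisions": [
          {"user": "Alice", "timestamp": "2011-09-15T12:00:00Z"}
        ]
      }
    }
  }
}""")

print(extract_revision_metadata(sample_response))
```

Values extracted this way could then be mapped onto dedicated Solr fields rather than relying on whatever the crawled HTML happens to expose.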
