manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Indexing Solr with the web crawler
Date Thu, 20 Jan 2011 14:21:41 GMT
Hi Erlend,

(1) The best way to find out what ManifoldCF thinks it is doing is to
look at the Simple History report in the UI.

(2) The Web Connector in ManifoldCF does not have the ability, at this
time, to extract links from Word documents, PDFs, etc., but Solr can
extract *content* from these documents if you configure it to use
Tika.  The document is sent to Solr in binary form, and Tika extracts
the content and whatever metadata it can find.  ManifoldCF does not get
involved in that at all.  Usually, setting up Solr with anonymous fields
is the way to go in this case.
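
To illustrate the Solr side of this, here is a minimal sketch of the
kind of request that delivers a binary document to Solr's Tika-based
extracting handler (Solr Cell). It is not how ManifoldCF's Solr
connector is implemented. The helper name build_extract_url, the Solr
base URL and the document URL are hypothetical, and it assumes that
"anonymous fields" means dynamic fields such as the attr_* pattern from
Solr's example schema. The /update/extract path and the literal.* and
uprefix parameters are standard Solr Cell options.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, extra_literals=None, unknown_prefix="attr_"):
    """Build a URL for Solr's extracting request handler (Solr Cell).

    solr_base      -- base URL of the Solr core (hypothetical value in the example)
    doc_id         -- value for the unique key field, passed as literal.id
    extra_literals -- optional dict of extra field values to set verbatim
    unknown_prefix -- prefix for Tika metadata names that are not in the schema,
                      so they land in a matching dynamic field (assumed: attr_*)
    """
    params = [("literal.id", doc_id), ("uprefix", unknown_prefix)]
    for name, value in sorted((extra_literals or {}).items()):
        params.append(("literal." + name, value))
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # The document bytes themselves would be sent as the POST body;
    # this sketch only builds the target URL.
    print(build_extract_url("http://localhost:8983/solr",
                            "http://ridder.uio.no/example.pdf"))
```

With uprefix set, any metadata name Tika produces that is not declared
in schema.xml is indexed under the prefixed name, so you do not need to
know the field names in advance.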

If this is an open site, I'll crawl it here myself momentarily and let
you know what I find.

Karl



On Thu, Jan 20, 2011 at 9:08 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
> I have started the Jetty server, configured the web crawler, a Solr
> connector and created a job. First I try to crawl the following site:
> http://ridder.uio.no/
> which contains nothing but an index.html with links to different kinds of
> document types (pdf, html, doc etc.).
>
> I have three questions.
>
> 1. Why do I now have a lot of these lines in the above host's access_log
> after the crawler has been started?
> 193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
> 193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588
> "-" "ApacheManifoldCFWebCrawler;"
>
> What is the crawler trying to do which it probably cannot do? Why is it
> fetching the same URL over and over again?
>
> 2. How can I index Solr when I don't know which fields ManifoldCF's web
> crawler collects? There is a field mapper in the job configuration, but I
> only know about the fields I have configured in Solr's schema.xml.
>
> 3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If
> it does not use Apache Tika, is it possible to configure the web crawler to
> use Tika for document parsing and language detection?
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
