manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: Indexing Solr with the web crawler
Date: Thu, 20 Jan 2011 14:27:43 GMT

Hmm, right now I'm behind a firewall, unfortunately, so I won't be
able to try this myself until this evening.  But if you post the
output of your simple history report I can help interpret it for you.

Karl

On Thu, Jan 20, 2011 at 9:21 AM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Erlend,
>
> (1) The best way to find out what ManifoldCF thinks it is doing is to
> look at the Simple History report in the UI.
>
> (2) The Web Connector in ManifoldCF does not have the ability, at this
> time, to extract links from Word docs, pdfs, etc., but Solr can
> extract *content* from these documents if you configure it to use
> Tika.  The document is sent to Solr in binary form, and Tika extracts
> whatever metadata it can find.  ManifoldCF does not get involved in
> that at all.  Usually, setting up Solr with anonymous fields is the
> way to go in this case.
>
> If this is an open site, I'll crawl it here myself momentarily and let
> you know what I find.
>
> Karl
>
>
>
> On Thu, Jan 20, 2011 at 9:08 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>
>> I have started the Jetty server, configured the web crawler, a Solr
>> connector and created a job. First I try to crawl the following site:
>> http://ridder.uio.no/
>> which contains nothing but an index.html with links to different kinds of
>> document types (pdf, html, doc etc.).
>>
>> I have three questions.
>>
>> 1. Why do I now have a lot of these lines in the above host's access_log
>> after the crawler has been started?
>> 193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>> 193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588
>> "-" "ApacheManifoldCFWebCrawler;"
>>
>> What is the crawler trying to do which it probably cannot do? Why is it
>> fetching the same URL over and over again?
>>
>> 2. How can I index Solr when I don't know which fields ManifoldCF's web
>> crawler collects? There is a field mapper in the job configuration, but I
>> only know about the fields I have configured in Solr's schema.xml.
>>
>> 3. Will the web crawler parse document types such as PDF, doc, rtf etc.? If
>> it does not use Apache Tika, is it possible to configure the web crawler to
>> use Tika for document parsing and language detection?
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
>
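
Editorial note on question 1: the four access_log entries quoted above are spaced almost exactly one minute apart. The sketch below only measures that spacing from the quoted timestamps; it does not explain why the crawler refetches the page, which is what the Simple History report is meant to show. The helper names (fetch_times, intervals_seconds) are illustrative and are not part of ManifoldCF.

```python
import re
from datetime import datetime

# Access-log entries copied from the quoted message, with each wrapped
# entry rejoined onto a single line.
LOG_LINES = [
    '193.157.137.137 - - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '193.157.137.137 - - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
]

# The timestamp is the bracketed field, e.g. [20/Jan/2011:14:28:54 +0100].
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def fetch_times(lines):
    """Return the timezone-aware timestamp of every log line that has one."""
    times = []
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            # Assumes English month abbreviations (the default C locale).
            times.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z"))
    return times


def intervals_seconds(times):
    """Return the gap, in whole seconds, between consecutive timestamps."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(intervals_seconds(fetch_times(LOG_LINES)))  # prints [60, 61, 60]
```

A steady gap of about 60 seconds suggests a timed retry or reschedule rather than a tight loop, but that is only an inference from the log; the Simple History report remains the authoritative record of what the job did with the URL.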
