manifoldcf-user mailing list archives

From Erlend Garåsen
Subject Indexing Solr with the web crawler
Date Thu, 20 Jan 2011 14:08:22 GMT

I have started the Jetty server, configured the web crawler, a Solr 
connector and created a job. First I try to crawl the following site:
which contains nothing but an index.html with links to different kinds 
of document types (pdf, html, doc etc.).

I have three questions.

1. Why do I now have a lot of these lines in the above host's access_log 
after the crawler has been started?

- - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
- - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
- - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"
- - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"

What is the crawler trying to do, and is it failing at it? Why is it 
fetching the same URL over and over again?

2. How can I index into Solr when I don't know which fields ManifoldCF's 
web crawler collects? There is a field mapper in the job configuration, 
but I only know about the fields I have configured in Solr's schema.xml.

3. Will the web crawler parse document types such as PDF, doc, rtf etc.? 
If it does not use Apache Tika, is it possible to configure the web 
crawler to use Tika for document parsing and language detection?
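
To make the pattern in question 1 concrete, here is a minimal sketch in 
plain Python (not part of ManifoldCF; the function name fetch_intervals 
is made up for illustration) that computes the spacing between the 
access_log entries quoted above. The host field is left out, as in the 
quoted lines.

```python
import re
from datetime import datetime

# The four access_log entries quoted in question 1 (host field omitted).
LOG_LINES = [
    '- - [20/Jan/2011:14:28:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '- - [20/Jan/2011:14:29:54 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '- - [20/Jan/2011:14:30:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
    '- - [20/Jan/2011:14:31:55 +0100] "GET / HTTP/1.1" 200 588 "-" "ApacheManifoldCFWebCrawler;"',
]

# Matches the bracketed timestamp of a common/combined log format entry.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def fetch_intervals(lines):
    """Return the number of seconds between consecutive log entries."""
    times = [
        datetime.strptime(TIMESTAMP.search(line).group(1), "%d/%b/%Y:%H:%M:%S %z")
        for line in lines
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


print(fetch_intervals(LOG_LINES))  # [60.0, 61.0, 60.0]
```

So the same URL is requested about once a minute, always with a 200 
response of 588 bytes.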


Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
