manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Indexing Solr with the web crawler
Date Thu, 20 Jan 2011 14:50:30 GMT
On 20.01.11 15.21, Karl Wright wrote:
> Hi Erlend,

Hi Karl,

Thank you for replying and for your comments. It's much appreciated.

> (1) The best way to find out what ManifoldCF thinks it is doing is to
> look at the Simple History report in the UI.

It says:

01-20-2011 15:14:18.914 	document ingest (solr_indexer) 	http://ridder.uio.no/ 	500 	588 	9 	lazy loading error
01-20-2011 15:14:18.800 	fetch 	http://ridder.uio.no/ 	200 	588 	103
01-20-2011 15:13:18.581 	document ingest (solr_indexer) 	http://ridder.uio.no/ 	500 	588 	16 	lazy loading error
01-20-2011 15:13:18.448 	fetch 	http://ridder.uio.no/ 	200 	588 	111


> (2) The Web Connector in ManifoldCF does not have the ability, at this
> time, to extract links from Word docs, pdfs, etc., but Solr can
> extract *content* from these documents if you configure it to use
> Tika.  The document is sent to Solr in binary form, and Tika extracts
> whatever metadata it can find.  ManifoldCF does not get involved in
> that at all.  Usually, setting up Solr with anonymous fields is the
> way to go in this case.

Thanks for clarifying. I can try to configure Solr to parse these 
documents. Nutch did a good job, except that it cannot detect whether a 
document has been modified and so cannot send an update/delete command 
to Solr. That function is crucial for us.

I'm unsure what you mean by anonymous fields in Solr. Can I not 
define the fields I need in schema.xml as I want? I have created 
duplicate fields for title and content in order to use different 
stemmers (I need to support English and Norwegian). In Nutch there is a 
simple configuration file for mapping fields from Nutch to Solr.

> If this is an open site, I'll crawl it here myself momentarily and let
> you know what I find.

Please do that. It's just my workstation with an Apache server running. 
It's open.

BTW, I think I have set things up correctly for the crawler:
Seeds: http://ridder.uio.no/
Inclusions: ^http://ridder.uio.no/.* (checked for "include only hosts 
matching seeds")
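
As a quick sanity check of the inclusion pattern above, here is a minimal sketch that tests it against a few URLs. The candidate URLs other than the seed are hypothetical, and the sketch assumes the pattern is applied as an ordinary regular expression search; ManifoldCF evaluates it with Java's regex engine, which should behave the same way for a pattern this simple.

```python
import re

# Inclusion pattern exactly as entered in the job configuration.
INCLUSION = re.compile(r"^http://ridder.uio.no/.*")

# Stricter variant with the dots escaped (an illustration, not required).
STRICT = re.compile(r"^http://ridder\.uio\.no/.*")


def is_included(url: str, pattern: re.Pattern = INCLUSION) -> bool:
    """Return True if the URL matches the given inclusion pattern."""
    return pattern.search(url) is not None


# Hypothetical URLs, used only to illustrate the matching behaviour.
candidates = [
    "http://ridder.uio.no/",
    "http://ridder.uio.no/docs/report.pdf",
    "http://www.uio.no/",
    "http://ridderXuio.no/",  # accepted by INCLUSION because "." is unescaped
]

for url in candidates:
    print(url, is_included(url), is_included(url, STRICT))
```

The unescaped dots match any character, so the pattern is slightly more permissive than intended, but it does cover the seed and everything beneath it.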

I haven't filled out the "expiration interval (if continuous)" field 
under the scheduling tab. Is this the reason why ManifoldCF is 
recrawling the page every minute?
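
The "every minute" figure can be checked against the two fetch timestamps in the history report above. This is only a small sketch; it assumes the timestamps are in month-day-year order, which the value 20 in the second position implies.

```python
from datetime import datetime

# Timestamp layout as shown in the Simple History report (assumed MM-DD-YYYY).
FORMAT = "%m-%d-%Y %H:%M:%S.%f"

# The two "fetch" entries for http://ridder.uio.no/ from the report.
fetch_times = ["01-20-2011 15:13:18.448", "01-20-2011 15:14:18.800"]

parsed = [datetime.strptime(t, FORMAT) for t in fetch_times]

# Seconds elapsed between the two consecutive fetches.
interval = (parsed[1] - parsed[0]).total_seconds()
print(f"{interval:.3f} s between fetches")
```

The two fetches are roughly 60 seconds apart.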

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
