manifoldcf-user mailing list archives

From Michael Kelleher <mj.kelle...@gmail.com>
Subject Document Processing
Date Mon, 05 Dec 2011 18:45:29 GMT
I am crawling a set of HTML pages within a site; the pages will be sent
to Solr for indexing.  I want to extract several pieces of content from
each page and store each piece as its own field BEFORE it is indexed in
Solr.

My guess is that I should use a document processing pipeline in Solr,
such as UIMA or something similar.

However, to limit the load on Solr, I was wondering whether there is a
way to "hook" into the Solr connector to create these additional fields
and handle this processing.  Maybe this would be an "extended" Solr
connector that I would create.

Or should this really be done within Solr, because Solr already handles 
this kind of processing?

Any guidance / help would be great.

Thanks.
