manifoldcf-user mailing list archives

From "阿部 慎一朗" <>
Subject Support for content extraction?
Date Wed, 02 Mar 2011 11:23:35 GMT
I want to use ManifoldCF to send crawl output to Solr.
My crawl target is files in a Windows shares repository.
I think that this framework can obtain paths, security, and metadata of those files by executing
But it probably cannot extract the text content of the crawled files, so that content cannot
become fields in the Solr output. For example, the text of MS Excel or PDF documents.
To implement text content extraction in ManifoldCF, it would need to include a framework like Tika.
Is this idea correct? Or, if you have any other ideas, please let me know. Thanks.
