Hi Paul,

That ticket applies only to the JCIFS connector and other connectors that have to map file extensions to mime types.  The Web connector does not have to do that.

The Web connector has certain mime types it knows it can extract links from, but as far as indexable content is concerned, it leaves that decision up to the output connection.  Here's the code:

>>>>>>
    // There are presumably mime types we can extract links from that we can't index?
    if (interestingMimeTypeMap.get(contentType) != null)
      return true;
   
    boolean rval = activities.checkMimeTypeIndexable(contentType);
    if (rval == false && Logging.connectors.isDebugEnabled())
      Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not fetching because output connector does not want mimetype '"+contentType+"'");
    return rval;
<<<<<<

You can tell whether this is what is happening to your document by turning on connector debug logging (in properties.xml: <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>).  If you are using the Solr connector, though, you can select the desired mime types on one of the job tabs.
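
To make the order of the two checks in the snippet above concrete, here is a stand-alone sketch.  It is not the actual connector code: the class name, the two map entries, and the Predicate standing in for activities.checkMimeTypeIndexable() are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-alone sketch; these are not the real ManifoldCF classes.
class MimeCheckSketch {
  // Stand-in for the connector's interestingMimeTypeMap: mime types links can be
  // extracted from.  The entries below are illustrative assumptions.
  static final Map<String, String> interestingMimeTypeMap = new HashMap<>();
  static {
    interestingMimeTypeMap.put("text/html", "text/html");
    interestingMimeTypeMap.put("application/xhtml+xml", "application/xhtml+xml");
  }

  // outputAccepts stands in for activities.checkMimeTypeIndexable(contentType),
  // i.e. the output connection's opinion about the mime type.
  static boolean shouldFetch(String contentType, Predicate<String> outputAccepts) {
    // Link-extractable types are always fetched, even if they are not indexable.
    if (interestingMimeTypeMap.get(contentType) != null)
      return true;
    // Everything else, including a null content type, is up to the output connection.
    return outputAccepts.test(contentType);
  }

  public static void main(String[] args) {
    // An output connection configured to accept only PDF (assumed for the demo).
    Predicate<String> pdfOnly = ct -> "application/pdf".equals(ct);
    System.out.println(shouldFetch("text/html", pdfOnly));
    System.out.println(shouldFetch("application/vnd.google-earth.kml+xml", pdfOnly));
    System.out.println(shouldFetch(null, pdfOnly));
  }
}
```

In this sketch a KML document is skipped unless the output side accepts its mime type, which mirrors the case the debug message reports.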

Karl




On Wed, Jan 15, 2014 at 5:39 AM, Paul Bieles <paulbieles@hotmail.com> wrote:
Many thanks for the reply Karl...
 
I discovered the following issue: https://issues.apache.org/jira/i#browse/CONNECTORS-768.  Extending this might help us resolve the problem.  Would it be a good idea to have this list in a config file, so that it could be extended more easily?
 
Paul
 

Date: Tue, 14 Jan 2014 12:36:20 -0500
Subject: Re: ManifoldCF SOLR request default Content-Type
From: daddywri@gmail.com
To: user@manifoldcf.apache.org


Hi Paul,

When there is no content type on a web crawl, the ManifoldCF web connector does not default anything -- it sets null as the content type.

The Solr output connector also does not default anything; it returns null to SolrJ when SolrJ requests the content type.  What SolrJ does under those conditions is anyone's guess, but I suspect that is where the application/octet-stream content type is getting set.  I'd have to look at that code to be sure.

Karl



On Tue, Jan 14, 2014 at 12:29 PM, Paul Bieles <paulbieles@hotmail.com> wrote:
Does ManifoldCF default Content-Type to application/octet-stream for file types that it doesn't know? If so, is there a way to set it to something else? The reason I ask is that I've got a load of KML files that I'm pushing into Solr.
 
Cheers,
 
Paul