nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception
Date Sat, 09 Jan 2016 13:17:40 GMT
Sebastian Nagel created NUTCH-2198:
--------------------------------------

             Summary: Indexing binary content by index-html causes Solr Exception
                 Key: NUTCH-2198
                 URL: https://issues.apache.org/jira/browse/NUTCH-2198
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 2.3.1
            Reporter: Sebastian Nagel
             Fix For: 2.4


(reported by [~kalanya] in NUTCH-2168)
If raw binary content is indexed using the index-html plugin, Solr may reject the document with an exception:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317,
byte #139263)
{noformat}

The index-html plugin treats any raw content as readable text and converts it to
a String using the platform-dependent default charset (cf. [Scanner API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
            Scanner scanner = new Scanner(arrayInputStream);
            scanner.useDelimiter("\\Z");//To read all scanner content in one String
            String data = "";
            if (scanner.hasNext()) {
                data = scanner.next();
            }
            doc.add("rawcontent", StringUtil.cleanField(data));
{code}
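
A minimal sketch of one possible mitigation, not the fix adopted for this issue: decode the bytes with an explicit charset instead of the platform default, and drop code points that the XML 1.0 Char production does not allow (this includes 0xfffe from the log above). The names RawContentSketch, decodeAndClean and isXmlChar are hypothetical, and the choice of UTF-8 is an assumption. Binary content would still end up as meaningless text, so skipping non-text documents altogether may be the better option.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: illustrates explicit-charset decoding
// followed by removal of characters that are not legal in XML 1.0.
final class RawContentSketch {

    private RawContentSketch() {
    }

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Decodes raw bytes as UTF-8 (an assumption; the real charset may differ)
     * and removes every code point that XML 1.0 does not permit.
     */
    static String decodeAndClean(byte[] raw) {
        // Explicit charset: the result no longer depends on the platform default.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        StringBuilder cleaned = new StringBuilder(decoded.length());
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            if (isXmlChar(cp)) {
                cleaned.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return cleaned.toString();
    }
}
```

With such a helper, the value added to "rawcontent" would be RawContentSketch.decodeAndClean(bytes) rather than the String read through the Scanner.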

The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
    <!-- fields for index-html plugin
         Note: although raw document content may be binary,
               index-html adds a String to the index field -->
    <field name="rawcontent" type="string" stored="true" indexed="false"/>
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
