lucene-dev mailing list archives

From "Gary Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-7174) DIH using BinFileDataSource, FileListEntityProcessor and TikaEntityProcessor only reads first document
Date Fri, 27 Feb 2015 12:26:04 GMT
Gary Taylor created SOLR-7174:
---------------------------------

             Summary: DIH using BinFileDataSource, FileListEntityProcessor and TikaEntityProcessor only reads first document
                 Key: SOLR-7174
                 URL: https://issues.apache.org/jira/browse/SOLR-7174
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler
    Affects Versions: 5.0
         Environment: Windows 7.  Ubuntu 14.04.
            Reporter: Gary Taylor


I downloaded Solr 5.0.0 on a Windows 7 PC, ran "solr start" and then "solr create -c hn2"
to create a new core.

I want to index a load of epub files that I've got in a directory. So I created a data-import.xml
(in solr\hn2\conf):

<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip"
            recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
          <str name="config">data-import.xml</str>
      </lst>
  </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
      <field name="fileName" type="string" indexed="true" stored="true" />
      <field name="author" type="string" indexed="true" stored="true" />
      <field name="title" type="string" indexed="true" stored="true" />

      <field name="size" type="long" indexed="true" stored="true" />
      <field name="lastModified" type="date" indexed="true" stored="true" />

      <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
      <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

    <copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and renames schema.xml
to schema.xml.back

All good so far.

Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport)
and try to execute a full import.
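
For anyone reproducing this without the admin UI, the same import can be requested over HTTP. The following is a minimal sketch that only builds the request URL. It assumes the host, port and core name from the admin URL above and the /dataimport handler name from solrconfig.xml; full_import_url is a hypothetical helper name, and the verbose and debug parameters correspond to the options used for the debug output later in this report.

```python
from urllib.parse import urlencode

# Assumption: host, port and core name are taken from the admin URL in this report.
SOLR_CORE_URL = "http://localhost:8983/solr/hn2"


def full_import_url(base_url: str = SOLR_CORE_URL, handler: str = "/dataimport") -> str:
    """Build a DIH full-import URL with verbose and debug output enabled."""
    params = {"command": "full-import", "verbose": "true", "debug": "true"}
    return f"{base_url}{handler}?{urlencode(params)}"


print(full_import_url())
```

Fetching that URL with any HTTP client should trigger the same import as the admin page.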

But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1" - i.e. it only adds
one document (the very first one) even though it has iterated over 58!
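
The mismatch in that summary line can be checked mechanically. This is a small sketch, assuming the summary always has the "Name: number" shape quoted above; parse_dih_status and fetched_minus_processed are hypothetical helper names, and treating Fetched as the number of files iterated follows the reading in this report rather than any documented guarantee.

```python
import re


def parse_dih_status(summary: str) -> dict:
    """Parse a summary like 'Requests: 0, Fetched: 58, Skipped: 0, Processed:1'."""
    # Whitespace after the colon is optional, because "Processed:1" has none.
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", summary)}


def fetched_minus_processed(summary: str) -> int:
    """Return how many fetched rows did not end up as processed documents."""
    counts = parse_dih_status(summary)
    return counts["Fetched"] - counts["Processed"]


status = "Requests: 0, Fetched: 58, Skipped: 0, Processed:1"
print(parse_dih_status(status))
print(fetched_minus_processed(status))  # 57
```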

No errors are reported in the logs. 

I can repeat this on Ubuntu 14.04 using the same steps, so it's not Windows specific.

-----------------

If I change the data-import.xml to use FileDataSource and PlainTextEntityProcessor and parse
txt files, e.g.:

<dataConfig>  
	<dataSource type="FileDataSource" name="bin" />
	<document>
		<entity name="files" dataSource="null" rootEntity="false"
			processor="FileListEntityProcessor"
			baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
			<field column="fileAbsolutePath" name="id" />
			<field column="fileSize" name="size" />
			<field column="fileLastModified" name="lastModified" />

			<entity name="documentImport" processor="PlainTextEntityProcessor"
				url="${files.fileAbsolutePath}" format="text" dataSource="bin">
				<field column="plainText" name="content"/>
			</entity>
		</entity>
	</document> 
</dataConfig> 

This works.  So it's a combo of BinFileDataSource and TikaEntityProcessor that is failing.


On Windows, I ran Process Monitor, and spotted that only the very first epub file is actually
being read (repeatedly).


With verbose and debug on when running the DIH, I get the following response:

....
  "verbose-output": [
    "entity:files",
    [
      null,
      "----------- row #1-------------",
      "fileSize",
      2609004,
      "fileLastModified",
      "2015-02-25T11:37:25.217Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue018.epub",
      null,
      "---------------------------------------------",
      "entity:documentImport",
      [
        "document#1",
        [
          "query",
          "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
          "time-taken",
          "0:0:0.0",
          null,
          "----------- row #1-------------",
          "text",
          "< ... parsed epub text - snip ... >",
          "title",
          "Issue 18 title",
          "Author",
          "Author text",
          null,
          "---------------------------------------------"
        ],
        "document#2",
        []
      ],
      null,
      "----------- row #2-------------",
      "fileSize",
      4428804,
      "fileLastModified",
      "2015-02-25T11:37:36.399Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue019.epub",
      null,
      "---------------------------------------------",
      "entity:documentImport",
      [
        "document#2",
        []
      ],
      null,
      "----------- row #3-------------",
      "fileSize",
      2580266,
      "fileLastModified",
      "2015-02-25T11:37:41.188Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue020.epub",
      null,
      "---------------------------------------------",
      "entity:documentImport",
      [
        "document#2",
        []
      ],
....
....
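
To list which files produced no rows from the nested entity, the debug output can be walked programmatically. This is a rough sketch that assumes the list following "entity:files" has exactly the shape shown in the excerpt above (keys and values alternating, with each "entity:documentImport" label followed by a list of "document#N" labels and row lists). The function name rows_without_documents and the shortened sample data are illustrative only.

```python
def rows_without_documents(files_rows: list) -> list:
    """Return fileAbsolutePath values whose nested documentImport yielded no rows."""
    empty = []
    current_path = None
    for i, item in enumerate(files_rows[:-1]):
        if item == "fileAbsolutePath":
            # The value follows its key in the flat list.
            current_path = files_rows[i + 1]
        elif item == "entity:documentImport":
            nested = files_rows[i + 1]
            # nested alternates labels ("document#1") with lists of row data.
            row_lists = [x for x in nested if isinstance(x, list)]
            if not any(row_lists):
                empty.append(current_path)
    return empty


# Shortened sample modelled on the excerpt above (only the relevant keys kept).
sample = [
    None, "----------- row #1-------------",
    "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
    "entity:documentImport", ["document#1", ["query", "...", "text", "..."], "document#2", []],
    None, "----------- row #2-------------",
    "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
    "entity:documentImport", ["document#2", []],
]
print(rows_without_documents(sample))
```

On the sample this reports only the issue019.epub path, matching the empty "document#2" entries seen for every file after the first.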





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


