lucene-solr-user mailing list archives

From Anton Shokhrin <anton.shokh...@me.com>
Subject Can't get the DIH to recurse to index messages in Outlook PST file
Date Tue, 20 Jan 2015 03:36:24 GMT
Hi List,

My Solr instance is set up to index PST files with DIH, TikaEntityProcessor and OutlookPSTParser.
After running an import, I can see that the index contains the top-level information of the PST
file (e.g. unique id of each message, header, PST file size), but the messages themselves are
missing. I suspect that I need to instruct Solr to recurse to the next level during indexing
in the DIH config file, but I don’t know how. My DIH config file looks like this:

<dataSource name="bin" type="BinFileDataSource" />
<document>
	<entity name="files" dataSource="bin" rootEntity="false" processor="FileListEntityProcessor"
			baseDir="/PST_Path" fileName=".*" onError="abort" recursive="true">
		<entity pk="uri" name="file" dataSource="bin" processor="TikaEntityProcessor" url="${files.fileAbsolutePath}"
				format="xml" rootEntity="true" onError="skip" recursive="true" parser="org.apache.tika.parser.mbox.OutlookPSTParser">
			<!-- I think I need to insert another entity here to parse/index the actual messages
				 but I don't know how to craft one -->
		</entity>
	</entity>
</document>

Any ideas?

Thank you,
Anton