manifoldcf-user mailing list archives

From Steph van Schalkwyk <st...@remcam.net>
Subject Re: How to determine the set of all possible fields in MCF output?
Date Sat, 14 Oct 2017 23:50:02 GMT
When you run Tika standalone on a file, you can see all the fields it emits
for that particular document type, along with any added metadata.
<code>

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ParserExtraction {

   public static void main(final String[] args)
         throws IOException, SAXException, TikaException {

      // Assume sample.txt is in your current directory
      File file = new File("sample.txt");

      // parse method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      // parsing the file; try-with-resources closes the stream afterwards
      try (FileInputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }
      System.out.println("File content : " + handler.toString());

      // print every metadata field Tika emitted for this document
      for (String name : metadata.names()) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}

</code>

https://www.tutorialspoint.com/tika/tika_content_extraction.htm
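
One standalone run only shows the fields for a single document, so to build a working list for a Solr schema you would merge the field names from several runs over different document types. A minimal sketch of that merge, assuming each run's metadata has already been copied into a plain Map; the class FieldUnion, its method, and the sample values are hypothetical and are not part of Tika or ManifoldCF:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FieldUnion {

    // Returns the sorted union of field names seen across all documents.
    public static Set<String> unionOfFieldNames(List<Map<String, String>> documents) {
        Set<String> names = new TreeSet<>();
        for (Map<String, String> doc : documents) {
            names.addAll(doc.keySet());
        }
        return names;
    }

    public static void main(String[] args) {
        // Illustrative field maps only; real ones would come from your own
        // Tika runs (assumption: one map per parsed document).
        Map<String, String> pdf =
                Map.of("Content-Type", "application/pdf", "title", "Report");
        Map<String, String> txt =
                Map.of("Content-Type", "text/plain", "Content-Encoding", "UTF-8");

        // Print each distinct field name once
        for (String name : unionOfFieldNames(List.of(pdf, txt))) {
            System.out.println(name);
        }
    }
}
```

The result is only as complete as the sample of documents you feed it, so treat it as a starting point for the schema rather than a full list.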




*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    steph@remcam.net   http://remcam.net   Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Sat, Oct 14, 2017 at 6:17 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Phil,
>
> You are correct in asserting that in MCF it is the sum total of all the
> connections that the document passes through that determines its attribute
> set.  That includes transformation connections as well as the repository
> connection.
>
> Tika is one connection that does add a lot of fields and these depend not
> only on the configuration of the Tika connection, but also on the kind of
> document being extracted.  If you want to figure out the sum total of
> what's possible, you will need to consult the Tika documentation.  And yes,
> the field names Tika generates are created based on what Tika finds in the
> document.
>
> Alternatively, you can configure your job to send output to a null output
> connection.  This connection records all attribute information for each
> document in the simple history, so you can get an idea what to expect.
>
> I'm a little confused about your statement that Tika runs even when it's
> not in a job's pipeline.  That's not actually true, so I'm wondering what
> you are seeing.
>
> Thanks,
> Karl
>
>
> On Sat, Oct 14, 2017 at 6:39 PM, Phillip Rhodes <motley.crue.fan@gmail.com> wrote:
>
>> Hi all, I've been working with MCF the past few days and am very happy
>> with what it lets me do, and I have a pipeline going from my
>> repository to Solr which works fine.  But there is one point I clearly
>> don't understand, which is:
>>
>> How do you know exactly what fields are going to be output in a given
>> configuration?  I found that I had to resort to trial and error to
>> tweak my Solr schema to avoid "undefined field xxxxx" errors from
>> Manifold when trying to write to Solr.  Now to be fair, clearly I
>> could just ignore any fields I don't specifically know I want, but I'd
>> like to understand how this works.
>>
>> Is it the case that the initial set of fields depends on the
>> repository connector?  I found that I seemed to get some Alfresco
>> specific stuff when reading from Alfresco, as opposed to what I got
>> from a simple dummy file-system repo I was initially experimenting
>> with.
>>
>> It also seems that Tika adds some fields (actually a lot of fields)
>> even when you don't have a Tika transform wired in explicitly?  Is it
>> the case that you need to put in an explicit Tika transform if you
>> want to control which fields are contributed by Tika?
>>
>> And on that point, is there a master list of possible fields that Tika
>> will emit, or is Tika just transforming the names of metadata fields
>> in the documents it encounters, and programmatically generating a
>> field name?
>>
>>
>> Any and all help on understanding how this works is greatly appreciated...
>>
>>
>> Phil
>> ~~~~
>> This message optimized for indexing by NSA PRISM
>>
>
>
