tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: TIKA-1164
Date Mon, 11 Jul 2016 12:36:08 GMT
Right.  Use Path instead of File.

From: scatherine.ext@gouv.mc [mailto:scatherine.ext@gouv.mc]
Sent: Monday, July 11, 2016 3:42 AM
To: Allison, Timothy B. <tallison@mitre.org>
Cc: dev@tika.apache.org
Subject: RE: TIKA-1164


Hi Timothy,

Thanks

When I use TikaInputStream.get() directly on the file, it works fine, but this method is
deprecated in Tika 1.13 and it seems it will be removed in Tika 2.0.

Regards

Samuel Catherine
Working on behalf of the IT Department (Direction Informatique)
scatherine.ext@gouv.mc
+377 98 98 48 93



From: "Allison, Timothy B." <tallison@mitre.org>
To: "dev@tika.apache.org" <dev@tika.apache.org>, "scatherine.ext@gouv.mc" <scatherine.ext@gouv.mc>
Date: 08/07/2016 21:26
Subject: RE: TIKA-1164

________________________________



Y, this makes sense.

       Detector detector = TikaConfig.getDefaultConfig().getDetector();
       File file = new File("testPDFVarious.pdf");
       try (FileInputStream is = new FileInputStream(file)) {
           try (InputStream tis = TikaInputStream.get(is)) {
               System.out.println("length: " + file.length());
               System.out.println("avail before: " + tis.available());
               System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
               System.out.println("avail after tis: " + tis.available());
               System.out.println("avail after is: " + is.available());
           }
       }

length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491
avail after is: 139955

The original input stream is not buffered, so there is no way to reset it. So yes, the detector
has to read quite a few bytes to do detection, and those bytes are consumed from the original
stream.

Note, though, that the TikaInputStream or even a BufferedInputStream will be correctly reset
and will have all bytes available.
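
The difference between a raw and a wrapped stream can be reproduced with the JDK alone. The following is a minimal sketch, not Tika code: `peek` is a hypothetical stand-in for `detector.detect()`, `BufferedInputStream` stands in for the wrapper, and the peek size of 65536 is an assumption chosen only because it matches the gap in the output above (205491 - 139955 = 65536).

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;

class MarkResetSketch {

    // Hypothetical stand-in for a detector: marks the stream, reads up to
    // `limit` bytes, then resets so the caller sees the wrapper unchanged.
    static int peek(InputStream in, int limit) throws IOException {
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        int n;
        while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
            total += n;
        }
        in.reset();
        return total;
    }

    // Returns { 1 if the raw stream supports mark, else 0;
    //           bytes available on the wrapper after peek;
    //           bytes available on the raw stream after peek }.
    static int[] demo(int fileSize, int peekSize) {
        File file = null;
        try {
            file = File.createTempFile("mark-reset", ".bin");
            Files.write(file.toPath(), new byte[fileSize]);
            try (FileInputStream raw = new FileInputStream(file);
                 InputStream buffered = new BufferedInputStream(raw)) {
                peek(buffered, peekSize);
                return new int[] {
                    raw.markSupported() ? 1 : 0,
                    buffered.available(),
                    raw.available()
                };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                file.delete();
            }
        }
    }

    public static void main(String[] args) {
        int[] r = demo(205491, 65536);
        System.out.println("raw supports mark: " + (r[0] == 1));
        System.out.println("avail after, wrapper: " + r[1]);
        System.out.println("avail after, raw: " + r[2]);
    }
}
```

The wrapper reports the full length again after the reset, while the raw FileInputStream stays short by whatever the wrapper pulled into its buffer. Code that needs the full content after detection should therefore keep reading from the wrapper, not from the original stream.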

Btw, it is better to call TikaInputStream.get() directly on the file.  If a parser needs to
copy the original input stream to a temp file, it can avoid that copy if you've created your
TikaInputStream directly from the file.

TikaInputStream tis = TikaInputStream.get(file);

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org]
Sent: Friday, July 8, 2016 10:14 AM
To: scatherine.ext@gouv.mc; dev@tika.apache.org
Subject: Re: TIKA-1164

Hi Samuel,

I myself haven’t had a chance to look into this yet - maybe someone else on the dev list?

Cheers,
Chris




On 7/8/16, 5:33 AM, "scatherine.ext@gouv.mc" <scatherine.ext@gouv.mc> wrote:

>Hi,
>
>Sorry for this extra email, but have you seen my problem?
>
>Regards,
>
>Samuel Catherine
>
>
>
>
>From: Samuel CATHERINE/Monaco-Gouvernement/MC
>To: "Mattmann, Chris A (3980)" <chris.a.mattmann@jpl.nasa.gov>@MCGOUV
>Cc: "dev@tika.apache.org" <dev@tika.apache.org>
>Date: 05/07/2016 10:31
>Subject: Re: TIKA-1164
>
>________________________________________
>
>
>Hi Chris,
>
>Ok thanks for the forward.
>To help you: when I work only with an InputStream (as in a REST service), I don't have the
>problem.
>The problem occurs when I use a File converted to a FileInputStream.
>
>FileInputStream content = new FileInputStream(file);
>
>content.available()
>// is OK right after creation, but no longer OK after
>// detector.detect(TikaInputStream.get(content), md)
>
>Regards,
>
>Samuel Catherine
>
>
>
>
>
>From: "Mattmann, Chris A (3980)" <chris.a.mattmann@jpl.nasa.gov>
>To: "scatherine.ext@gouv.mc" <scatherine.ext@gouv.mc>
>Cc: "dev@tika.apache.org" <dev@tika.apache.org>
>Date: 04/07/2016 17:45
>Subject: Re: TIKA-1164
>________________________________________
>
>
>
>Hi Samuel, I am forwarding your email to dev@tika.a.o and moving
>dev-owner@t.a.o to BCC.
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398) NASA Jet
>Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Director, Information Retrieval and Data Science Group (IRDS) Adjunct
>Associate Professor, Computer Science Department University of Southern
>California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
>
>
>
>
>On 7/4/16, 8:41 AM, "scatherine.ext@gouv.mc" <scatherine.ext@gouv.mc> wrote:
>
>>Hi,
>>
>>I use Tika to detect the MediaType and I have the same problem as
>>JIRA TIKA-1164:
>>https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>But I use version 1.13. How can I solve this problem, please?
>>
>>MediaType mediaType = null;
>>Metadata md = new Metadata();
>>md.set(Metadata.RESOURCE_NAME_KEY, fileName);
>>Detector detector = TikaConfig.getDefaultConfig().getDetector();
>>
>>try {
>>    mediaType = detector.detect(TikaInputStream.get(content), md);
>>} catch (IOException e) {
>>    mediaType = null;
>>}
>>
>>The content size (content.available()) changes between before and after the detect call.
>>
>>Regards,
>>
>>Samuel Catherine
>>
>>
>
