tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: TIKA-1164
Date Fri, 08 Jul 2016 19:26:19 GMT
Yes, this makes sense.

        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        File file = new File("testPDFVarious.pdf");
        try (FileInputStream is = new FileInputStream(file)) {
            try (InputStream tis = TikaInputStream.get(is)) {
                System.out.println("length: " + file.length());
                System.out.println("avail before: " + tis.available());
                System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
                System.out.println("avail after tis: " + tis.available());
                System.out.println("avail after is: " + is.available());
            }
        }

length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491
avail after is: 139955

The original input stream is not buffered, so there is no way to reset it. So yes, the detector
has to read quite a few bytes to do detection: in the run above, 205491 - 139955 = 65536 bytes
were consumed from the underlying FileInputStream.

Note, though, that the TikaInputStream or even a BufferedInputStream will be correctly reset
and will have all bytes available.
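
To illustrate the difference with plain JDK classes only, here is a minimal sketch. The class name PeekDemo and the helper peek() are made up for this example and are not part of Tika; the helper only imitates what a detector does, namely read some leading bytes and rewind if the stream allows it.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class PeekDemo {

    /**
     * Hypothetical helper: reads up to n leading bytes and rewinds the
     * stream afterwards, but only if the stream supports mark/reset.
     */
    public static byte[] peek(InputStream in, int n) throws IOException {
        boolean canRewind = in.markSupported();
        if (canRewind) {
            in.mark(n); // remember the current position
        }
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) {
                break; // end of stream
            }
            total += read;
        }
        if (canRewind) {
            in.reset(); // go back to the marked position
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("peek-demo", ".bin");
        try {
            Files.write(tmp, new byte[1000]);

            // FileInputStream does not support mark/reset,
            // so the peeked bytes stay consumed.
            try (InputStream raw = new FileInputStream(tmp.toFile())) {
                peek(raw, 100);
                System.out.println("raw, avail after peek: " + raw.available());
            }

            // BufferedInputStream supports mark/reset,
            // so the stream is rewound after the peek.
            try (InputStream buffered =
                     new BufferedInputStream(new FileInputStream(tmp.toFile()))) {
                peek(buffered, 100);
                System.out.println("buffered, avail after peek: " + buffered.available());
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Only the stream that supports mark/reset should report the full length again after the peek. Reading through a wrapper still consumes bytes from the stream underneath it, so keep using the wrapper after detection rather than the raw stream.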

Btw, it is better to call TikaInputStream.get() directly on the file.  If a parser needs to
copy the original InputStream to a temp file, it can avoid that copy if you've created your
TikaInputStream directly from the file.

TikaInputStream tis = TikaInputStream.get(file);

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org] 
Sent: Friday, July 8, 2016 10:14 AM
To: scatherine.ext@gouv.mc; dev@tika.apache.org
Subject: Re: TIKA-1164

Hi Samuel,

I myself haven’t had a chance to look into this yet - maybe someone else on the dev list?

Cheers,
Chris




On 7/8/16, 5:33 AM, "scatherine.ext@gouv.mc" <scatherine.ext@gouv.mc> wrote:

>Hi,
>
>Excuse me for this mail, but have you seen my problem?
>
>Regards,
>
>Samuel Catherine
>
>
>
>
>From: Samuel CATHERINE/Monaco-Gouvernement/MC
>To: "Mattmann, Chris A (3980)" <chris.a.mattmann@jpl.nasa.gov>@MCGOUV
>Cc: "dev@tika.apache.org" <dev@tika.apache.org>
>Date: 05/07/2016 10:31
>Subject: Re: TIKA-1164
>
>________________________________________
>
>
>Hi Chris,
>
>Ok thanks for the forward.
>To help you: when I work only with an InputStream (as in a REST service), I don't have the
>problem.
>It occurs when I use a File converted to a FileInputStream.
>
>FileInputStream content=new FileInputStream(file);
>
>content.available()
>// is OK right after construction, but not after
>// detector.detect(TikaInputStream.get(content), md)
>
>Regards,
>
>Samuel Catherine
>
>
>
>
>
>From: "Mattmann, Chris A (3980)" <chris.a.mattmann@jpl.nasa.gov>
>To: "scatherine.ext@gouv.mc" <scatherine.ext@gouv.mc>
>Cc: "dev@tika.apache.org" <dev@tika.apache.org>
>Date: 04/07/2016 17:45
>Subject: Re: TIKA-1164
>
>________________________________________
>
>
>
>Hi Samuel I am forwarding your email to dev@tika.a.o and moving 
>dev-owner@t.a.o to BCC.
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398) NASA Jet 
>Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Director, Information Retrieval and Data Science Group (IRDS) Adjunct 
>Associate Professor, Computer Science Department University of Southern 
>California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
>
>
>
>
>On 7/4/16, 8:41 AM, "scatherine.ext@gouv.mc" <scatherine.ext@gouv.mc> wrote:
>
>>Hi,
>>
>>I use Tika to detect the MediaType and I have the same problem as in
>>JIRA TIKA-1164:
>>https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>But I use version 1.13. How can I solve this problem, please?
>>
>>MediaType mediaType = null;
>>Metadata md = new Metadata();
>>md.set(Metadata.RESOURCE_NAME_KEY, fileName);
>>Detector detector = TikaConfig.getDefaultConfig().getDetector();
>>
>>try {
>>    mediaType = detector.detect(TikaInputStream.get(content), md);
>>} catch (IOException e) {
>>    mediaType = null;
>>}
>>
>>The content size (content.available()) changes between before and after the detect call.
>>
>>Regards,
>>
>>Samuel Catherine
>>
>>
>
