tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: TikaInputStream parse the content and write to OutputStream
Date Thu, 18 May 2017 16:27:58 GMT
Please DO NOT use Apache Tika for malware scanning.  Please use a package that is designed
for malware detection.


From: Prateek Agarwal [mailto:pra.agar@gmail.com]
Sent: Thursday, May 18, 2017 8:17 AM
To: Allison, Timothy B. <tallison@mitre.org>; dev@tika.apache.org
Subject: Re: TikaInputStream parse the content and write to OutputStream

Thanks Allison,
The requirement is to upload a file to a remote directory, and we are supposed to provide an
upload API that internally performs malware scanning by reading the content. The upload API
is implemented as shown in the code below, and I'm trying to get the content using Apache Tika,
but I'm not sure whether we can identify malware based on content.

Point 2)
Apart from the above requirement: I pass the BufferedInputStream to a TikaInputStream, and the
same BufferedInputStream is then read again for the OutputStream, but the stream is closed after
the Tika parser has completed its task. Do we need to clone the InputStream, or does Tika
handle it?

Code:

try (final BufferedInputStream input = new BufferedInputStream(pInputStream, bytesSize);
     final BufferedOutputStream output = new BufferedOutputStream(new FileOutputStream(pObjectFile), bytesSize);
     final TikaInputStream stream = TikaInputStream.get(input)) {
    try {
        // parsing the file
        parser.parse(stream, handler, metadata, context);
        LOGGER.log(Level.INFO, "File content - {0}", handler.toString());
    } catch (IOException | SAXException | TikaException ex) {
        LOGGER.log(Level.SEVERE, null, ex);
    }
    byte[] buffer = new byte[bytesSize];
    // Tried input.read as well as stream.read; neither works
    for (int length = 0; ((length = stream.read(buffer)) > 0);) {
        output.write(buffer, 0, length);
        bytesWritten += length;
    }
}
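
On Point 2: an InputStream can generally be read only once, so after one consumer has read it to the end, a second read hits end-of-stream right away. One common workaround is to spool the upload to a temporary file first and open a fresh stream for each consumer. The sketch below is a minimal illustration under stated assumptions, not Tika's API: it uses only java.nio, and `SpoolThenStore`, `consumeFully`, and `inspectAndStore` are hypothetical names, with `consumeFully` standing in for the `parser.parse(...)` call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SpoolThenStore {

    /** Hypothetical stand-in for the parse step: reads the stream to its end, returns the byte count. */
    static long consumeFully(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    /** Spools the upload to a temp file, inspects the copy, then stores it. Returns bytes stored. */
    static long inspectAndStore(InputStream upload, Path target) throws IOException {
        Path temp = Files.createTempFile("upload-", ".tmp");
        try {
            // First and only pass over the one-shot upload stream.
            Files.copy(upload, temp, StandardCopyOption.REPLACE_EXISTING);

            // Fresh stream over the spooled copy for the inspection step.
            try (InputStream forInspection = Files.newInputStream(temp)) {
                consumeFully(forInspection);
            }

            // The spooled copy is unaffected by the inspection pass, so it can be stored as-is.
            Files.copy(temp, target, StandardCopyOption.REPLACE_EXISTING);
            return Files.size(target);
        } finally {
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "sample upload".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempFile("stored-", ".bin");
        long stored = inspectAndStore(new ByteArrayInputStream(data), target);
        System.out.println("stored " + stored + " bytes at " + target);
    }
}
```

With this layout, the parse step would receive a stream opened on the temp file and the stored file is copied from that same temp file, so neither step depends on the other's read position.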




On Thu, May 18, 2017 at 4:06 PM Allison, Timothy B. <tallison@mitre.org> wrote:
While Apache Tika can be used to support forensic analysis/malware detection, it is NOT designed
to identify malware. DO NOT rely on Apache Tika to identify malware.

I'd recommend using clamav or a commercial antivirus program.

If you want to use Tika for another reason (text/metadata extraction/file type detection),
I'll be happy to answer your use question.  Let me know.

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org]
Sent: Wednesday, May 17, 2017 10:12 PM
To: dev@tika.apache.org
Cc: Prateek Agarwal <pra.agar@gmail.com>
Subject: Re: TikaInputStream parse the content and write to OutputStream

[moving dev-owner@ to BCC]

Forwarding to the Tika list.

From: Prateek Agarwal <pra.agar@gmail.com>
Date: Tuesday, May 16, 2017 at 6:35 AM
To: "dev-owner@tika.apache.org" <dev-owner@tika.apache.org>
Subject: TikaInputStream parse the content and write to OutputStream



Hi,

We have an upload API that uploads a file to a server. Under a new requirement, I have to scan
the content for malware and store the file on the server only if none is found. The basic
upload works fine. The problem arises when I use Apache Tika.

1.    How do we find out whether the file is malware?

2.    I'm able to get the content from the Tika parser, but the file that's stored on the server
has zero size. Do I have to clone the InputStream, one for the Tika parser and one for the
OutputStream?

Code:
try (final BufferedInputStream input = new BufferedInputStream(pInputStream, bytesSize);
     final BufferedOutputStream output = new BufferedOutputStream(new FileOutputStream(pObjectFile), bytesSize);
     final TikaInputStream stream = TikaInputStream.get(input)) {
    try {
        //parsing the file
        parser.parse(stream, handler, metadata, context);
        LOGGER.log(Level.INFO, "File content - {0}", handler.toString());
    } catch (IOException | SAXException | TikaException ex) {
        LOGGER.log(Level.SEVERE, null, ex);
    }
    byte[] buffer = new byte[bytesSize];
    // Tried input.read as well as stream.read; neither works
    for (int length = 0; ((length = stream.read(buffer)) > 0);) {
        output.write(buffer, 0, length);
        bytesWritten += length;
    }
}
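
The zero-size file in question 2 is consistent with how Java streams behave in general: once one consumer has read an InputStream to its end, the next read returns -1 immediately, so a copy loop writes nothing. Below is a minimal sketch of that symptom, plus an in-memory workaround that is only reasonable for small uploads. It uses plain java.io; `ConsumedStreamDemo`, `copy`, and `readAll` are hypothetical names, and the first `copy` call merely stands in for the parse step. No Tika classes are involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class ConsumedStreamDemo {

    /** Copies everything left in the stream and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    /** Reads the whole stream into memory; an assumption that the upload is small enough to fit. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        copy(in, sink);
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] upload = "sample upload".getBytes(StandardCharsets.UTF_8);

        // Symptom: a second pass over the same stream finds nothing left to read.
        InputStream oneShot = new ByteArrayInputStream(upload);
        copy(oneShot, new ByteArrayOutputStream()); // stands in for the parse step
        long secondPass = copy(oneShot, new ByteArrayOutputStream());
        System.out.println("second pass copied " + secondPass + " bytes");

        // Workaround: buffer once, then give each consumer its own stream over the bytes.
        byte[] bytes = readAll(new ByteArrayInputStream(upload));
        copy(new ByteArrayInputStream(bytes), new ByteArrayOutputStream()); // parse step stand-in
        long stored = copy(new ByteArrayInputStream(bytes), new ByteArrayOutputStream());
        System.out.println("stored " + stored + " bytes");
    }
}
```

The second pass over `oneShot` copies 0 bytes, while each stream created from the buffered array starts from the beginning again.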


I've also asked the same question on Stack Overflow.

~

Prateek Agarwal


