jmeter-dev mailing list archives

From sebb <seb...@gmail.com>
Subject Re: Add Apache Tika in JMeter to extract text from various file type
Date Tue, 06 Nov 2012 02:59:10 GMT
On 5 November 2012 20:05, Philippe Mouawad <philippe.mouawad@gmail.com> wrote:
> But wouldn't this make setup more complex and error-prone?
> See the nightly build experience: a lot of people miss the fact that they
> must copy the lib folder in the first zip.
>
> It would not work out of the box anymore as it does for now.

JMeter would work, provided that the missing features were not used.

> Isn't it too much work just for a size concern?

I don't think so, otherwise I would not have raised the issue.

> Sebb, what do you mean by catching the exception?

Exactly that.

AIUI only two jars are needed to use Tika; I assume that the other
jars are referenced automatically from tika-core or tika-parsers.
We just need to catch whatever error is generated when Tika cannot
load the required jars.
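
A minimal sketch of that idea, not actual JMeter or Tika code: the
OptionalFeature class and extractOrExplain method are hypothetical names,
and the missing jar is simulated by throwing NoClassDefFoundError
directly, which is the error the JVM would normally raise when a
statically referenced class cannot be found at run time.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of JMeter or Tika.
final class OptionalFeature {

    // Runs code that statically references optional classes and turns a
    // linkage failure (e.g. a missing jar) into a readable message.
    static String extractOrExplain(Supplier<String> extraction, String feature) {
        try {
            return extraction.get();
        } catch (LinkageError e) {
            // NoClassDefFoundError is a subclass of LinkageError.
            return "[" + feature + " unavailable, missing class: " + e.getMessage() + "]";
        }
    }

    public static void main(String[] args) {
        // Normal case: the extraction code runs without any extra check.
        System.out.println(extractOrExplain(() -> "extracted text", "Tika"));
        // Simulated missing-jar case (assumption: thrown by hand here).
        System.out.println(extractOrExplain(() -> {
            throw new NoClassDefFoundError("org/apache/tika/Tika");
        }, "Tika"));
    }
}
```

The try block costs nothing extra when the classes load normally; only
the failure path does additional work.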

> Is it only the first time or on every call? If so, wouldn't it
> negatively impact performance?

There would be no performance impact if the required jars are present,
unlike if we used dynamic loading.
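
For comparison, a rough sketch of what a dynamic-loading style check
could look like. ClassProbe and isPresent are hypothetical names; the
only Tika-specific assumption is the class name org.apache.tika.Tika
(the facade class in tika-core), which is looked up by name so the
sketch compiles without any Tika jar.

```java
// Hypothetical helper for illustration; not part of JMeter or Tika.
final class ClassProbe {

    // Returns true if the named class can be located by this class's
    // class loader. The class is not initialized by the probe.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.lang.String present: " + isPresent("java.lang.String"));
        // The result depends on whether tika-core is on the classpath.
        System.out.println("Tika present: " + isPresent("org.apache.tika.Tika"));
    }
}
```

A check like this has to run (or have its result cached) before the
feature is used, which is the extra step the catch-based approach avoids.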

If some jars are missing, then some functionality would not work.
This is similar to what already happens if someone uses a 3rd party
add-on and forgets to install the jar.
However, hopefully we could improve the error reporting in the Tika case.

> Regards
> Philippe
>
>
> On Monday, November 5, 2012, sebb wrote:
>
>> On 5 November 2012 14:00, Milamber <milamber@apache.org> wrote:
>> >
>> >
>> > On 05/11/2012 11:26, sebb wrote:
>> >
>> >> On 3 November 2012 19:23, Milamber <milamber@apache.org> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> I am currently working on adding Apache Tika 1.2 [1] to JMeter to
>> >>> improve functional tests.
>> >>>
>> >>> With Tika, you can extract the text from various documents, such as
>> >>> MS Office (Word, Excel, PowerPoint; 97-2003 and 2007-2010 (OpenXML)),
>> >>> OpenOffice (Writer, Calc, Impress), HTML, gz, jar/zip files (list of
>> >>> contents), and some "multimedia" files like mp3, mp4, flv, etc.
>> >>>
>> >>> In JMeter, Tika can be used by the View Results Tree to view the text
>> >>> data of these files, by the Regular Expression Extractor to capture
>> >>> text from these files, and by the Response Assertion to assert on the
>> >>> data.
>> >>>
>> >>> The drawback is that Apache Tika requires either one big jar (25 MB)
>> >>> or a lot of jar files (see below). With all the jars in the binary
>> >>> package, the new size (for tgz) is 45 MB (JMeter 2.8 tgz: 23 MB).
>> >>>
>> >>> The question: do you agree to add Tika (and the new capability to
>> >>> "extract text from Document") to JMeter with the new binary size?
>> >>>
>> >>> Secondary question: which is the better way? 1/ Add only tika-app.jar
>> >>> (which includes all dependencies) [2], or 2/ add several jar files
>> >>> (tika-core, tika-parsers, etc. + dependencies) [3]
>> >>
>> >> I'm concerned that using Tika would double the size of JMeter.
>> >> Although the extra features would be useful, I suspect that most test
>> >> cases won't need the extra functionality.
>> >>
>> >> Would it be possible to make the Tika jars optional?
>> >> i.e. add the functionality, but if the jars are not present it is
>> >> disabled.
>> >
>> >
>> > Yes, that seems possible via dynamic class checking / loading
>> >
>> >
>> >
>> >>
>> >> If we accept that developers must download Tika, then it should be
>> >> easy enough to structure the add-on so that JMeter can fail gracefully
>> >> if the jars are missing.
>> >> But ideally developers would not need to download all the jars either.
>> >
>> >
>> > Currently, to compile the "tika" elements, we need only these jars:
>> > tika-core.jar
>> > tika-parsers.jar
>>
>> That would be fine.
>>
>> > For the binary release, we need to add these jars (full list):
>> > apache-mime4j-core.jar
>> > apache-mime4j-dom.jar
>> > asm.jar
>> > aspectjrt.jar
>> > boilerpipe.jar
>> > commons-compress.jar
>> > dom4j.jar
>> > fontbox.jar
>> > geronimo-stax-api_1.0_spec.jar
>> > gson.jar
>> > isoparser.jar
>> > jempbox.jar
>> > juniversalchardet.jar
>> > log4j.jar
>> > metadata-extractor.jar
>> > netcdf.jar
>> > pdfbox.jar
>> > poi-ooxml-schemas.jar
>> > poi-ooxml.jar
>> > poi-scratchpad.jar
>> > poi.jar
>> > rome.jar
>> > slf4j-api.jar
>> > slf4j-log4j12.jar
>> > tagsoup.jar
>> > tika-core.jar
>> > tika-parsers.jar
>> > tika-xmp.jar
>> > vorbis-java-core.jar
>> > vorbis-java-tika.jar
>> > xmlbeans.jar
>> > xmpcore.jar
>> > xz.jar
>> >
>> > Or only the tika-app.jar (25Mb)
>> >
>> >
>> > So, we can add the "tika" functionality with dynamic class loading, and
>> > add some warning messages telling the user to download tika-app.jar if
>> > they want the Tika behavior.
>> >
>> > For View Results Tree, when "Document" is chosen in the combo list: a
>> > message in Response data to indicate the missing tika-app.jar (with some
>> > indication of where to download it).
>> >
>> > For RegExp and Response Assertion, if tika-app.jar is missing, a warning
>> > dialog to show the message when the radio button "Response as a Document"
>> > is selected.
>> >
>> > And in all cases, a warning message in jmeter.log.
>>
>> Rather than use dynamic class loading, would it not be possible to
>> just catch the Exceptions that are thrown when the jars are missing?
>>
>> If the code builds OK with just tika-core.jar and tika-parsers.jar
>> this should be sufficient.
>>
>> >
>> >
>> >
>> >>
>> >
>>
>
>
> --
> Cordialement.
> Philippe Mouawad.
