tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: 1.20?
Date Tue, 18 Dec 2018 18:09:38 GMT
Reports on mp4s, junrar, msaccess and a random subset of the
regression corpus are available here:
http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz


On Thu, Dec 13, 2018 at 5:34 PM Tim Allison <tallison@apache.org> wrote:
>
> Let me actually take a look before answering. Sorry!
>
> On Thu, Dec 13, 2018 at 5:30 PM Tim Allison <tallison@apache.org> wrote:
>>
>>  Thank you for reading the reports!!!
>>
>> The files are very likely broken.  I can take a look.  The change was
>> probably because of an "upgrade" to junrar.  Should I revert to the
>> version we used in 1.19.1?
>> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif <lfcnassif@gmail.com> wrote:
>> >
>> > Hi Tim,
>> >
>> > Reading your great reports, I also saw some new exceptions with RAR files
>> > in the "likely broken" folder, but it seems Tika was able to extract some
>> > text from them before. Do you know if those files are really broken, and
>> > why Tika extracted text from them before?
>> >
>> > Thank you,
>> > Luis
>> >
>> > On Thu, Dec 13, 2018 at 13:02, Tim Allison <tallison@apache.org>
>> > wrote:
>> >
>> > > Reports are here:
>> > >
>> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>> > >
>> > > I'm going to revert the mp4 parser, and commit the few dependency
>> > > upgrades I ran.
>> > >
>> > > The _major_ difference in content for ppt is explained by the
>> > > duplication of header/footer info.  To confirm this, note that the
>> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
>> > > identical for nearly all ppt->ppt, but there are far more tokens in
>> > > "num_tokens_a" vs "num_tokens_b".
>> > >
>> > > I also see that we're losing content in x-java and x-groovy, etc., but
>> > > that's because we're now suppressing the style markup that our parser
>> > > was (incorrectly, IMHO) inserting -- check the values in
>> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
>> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
>> > > weight: 3 | family: 2
>> > >
>> > > In short, I think we're good to go.  Will roll rc1 later today or
>> > > (more likely) tomorrow unless there are objections.
>> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <tallison@apache.org> wrote:
>> > > >
>> > > > Any blockers on 1.20?  I'm going to kick off the regression tests
>> > > > shortly.
>> > > > On Fri, Nov 30, 2018 at 7:39 PM <loompa@gmail.com> wrote:
>> > > > >
>> > > > > Hi,
>> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison <tallison@apache.org> wrote:
>> > > > >
>> > > > > > Dave,
>> > > > > >   Should I try to get the Docker plugin working again?
>> > > > > >
>> > > > >
>> > > > > That would be great. I think I may have gone down the wrong path
>> > > > > building an image at package time, as there doesn't seem to be an
>> > > > > easy way to publish it under an Apache-labelled org on Dockerhub
>> > > > > unless it builds from source.
>> > > > >
>> > > > > I have some time over the weekend, so could update to where I got to
>> > > > > and see what you think.
>> > > > >
>> > > > > Cheers,
>> > > > > Dave
>> > >
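
For anyone who wants to repeat the ppt check described in the quoted 13 Dec
message, here is a minimal sketch of the comparison. The four column names are
the ones quoted from the reports; the function name and the sample rows are
hypothetical, made up for illustration, and this is not how the reports
themselves are generated.

```python
# Column names come from the quoted report description; the sample rows
# below are hypothetical values, not taken from the actual reports.

def looks_like_duplication(row):
    """Return True when both extracts have the same number of unique tokens
    but extract A has more total tokens than extract B."""
    same_unique = row["num_unique_tokens_a"] == row["num_unique_tokens_b"]
    more_in_a = row["num_tokens_a"] > row["num_tokens_b"]
    return same_unique and more_in_a


rows = [
    # Same vocabulary, more tokens in A: consistent with repeated text.
    {"num_tokens_a": 500, "num_tokens_b": 320,
     "num_unique_tokens_a": 140, "num_unique_tokens_b": 140},
    # Identical counts: no difference to explain.
    {"num_tokens_a": 200, "num_tokens_b": 200,
     "num_unique_tokens_a": 90, "num_unique_tokens_b": 90},
    # Vocabulary differs too: content actually changed, not just repeated.
    {"num_tokens_a": 410, "num_tokens_b": 300,
     "num_unique_tokens_a": 150, "num_unique_tokens_b": 120},
]

flagged = [looks_like_duplication(r) for r in rows]
print(flagged)
```

Only the first row is flagged: an equal unique-token count with a higher total
in A suggests the same words appearing more often rather than new content.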
