tika-dev mailing list archives

From Lewis John McGibbney <lewi...@apache.org>
Subject Re: Resource Sharing Tika Corpus with Any23
Date Sat, 01 Dec 2018 01:14:13 GMT
Hi Tim,
Thanks for the reply. Answers inline.

On 2018/11/30 19:22:23, Tim Allison <tallison@apache.org> wrote: 
> I think that'd be great.  Some questions:
> 1) Would you use the same input docs that we're using or would you
> need/want a new TB drive for your input/output?  

The same docs, I suspect. We *could* contribute the documents we use in our test suite as well;
however, this is not really necessary for us to run Any23. Any23 will only attempt extractions
on a small subset of the documents in the corpus.

> How much space will
> you need for your eval framework including outputs?

I wouldn't imagine any more than maybe 5GB disk space in all. Any23 has the ability to run
Open Information Extraction (smart relationship extraction from text) and this tends to generate
more triples. If we decided to turn this on, then it would probably get towards the 5GB mark.
I wouldn't imagine any more than that though, Tim.

> 2) Would you be willing to coordinate with us and PDFBox and POI
> around release times?

I think so, yes. If anything, this would be an excellent thing for Any23. I think improved coordination
and communication between the communities would be a very positive step.

> 3) Would you be running your processing every so often (around your
> releases) or would it be constant aside from our releases? 

Most likely the former. I am aware that the service is billed to someone's (your) card, so
we would be looking to do only what is polite and acceptable. Running prior to releases, e.g. during
review of a release candidate, would be really cool.

>  I ask
> because I'd like @Tobias Ospelt to have cycles for his fuzzing work
> when we're not getting ready for a release.

That sounds fine to me. 
Thank you for the response. 
