tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
Date Fri, 23 May 2014 14:32:02 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007199#comment-14007199 ]

Chris A. Mattmann commented on TIKA-1302:
-----------------------------------------

[~tallison@apache.org] this is a good question. I believe the VM that Lewis set up is there
so that anyone can try out Tika via the JAX-RS service. If we do the nightly large-batch
test (which I think would be awesome, btw), we'll need to figure out the specs we would
need and then compare them to the VM that Lewis just had set up. How much RAM, CPU,
disk, etc. do you think we'll need, Tim?

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again,
> it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
