ctakes-dev mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: files vs strings in collection reader
Date Wed, 29 May 2013 17:25:43 GMT
This collection reader latency issue was harder to test than expected: 
the first run took ~20 minutes to load and the second took a negligible 
amount of time, presumably due to caching effects. But given our other 
conversation on a "big data" direction using UIMA-AS, there is a 
potential solution out there.

UIMA-AS doesn't require Collection Readers. You just deploy some 
number of pipelines, and then write a bit of code that creates CASes 
and adds them to a queue, asynchronously if desired. So when we get 
something like that up and running, we can give users/devs a rule of 
thumb: if you're regularly processing more than ~10k documents, it's 
probably better to use UIMA-AS anyway, and then you'll get the 
benefits of the asynchronous methods.
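
To make the shape of that idea concrete, here is a rough sketch of "some 
number of pipelines" draining a queue that client code fills. It does NOT 
use the UIMA-AS client API (which works through a JMS broker); it only 
mimics the pattern with java.util.concurrent. The class name 
QueueFedPipelines, the method processAll, and the use of Strings in place 
of CASes are all made up for illustration.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: plain Java threads standing in for
// deployed pipelines, and Strings standing in for CASes.
class QueueFedPipelines {
    // Sentinel telling a worker to stop; compared by identity, so a real
    // document with the same text is never mistaken for it.
    private static final String STOP = new String("<stop>");

    /** Feeds every document to a bounded queue and returns how many were processed. */
    static int processAll(List<String> documents, int pipelines) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(pipelines);

        for (int i = 0; i < pipelines; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc == STOP) {
                            return;
                        }
                        // Placeholder for running an analysis pipeline on doc.
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            // The producer never needs the full document list up front;
            // it blocks only when the queue is full.
            for (String doc : documents) {
                queue.put(doc);
            }
            for (int i = 0; i < pipelines; i++) {
                queue.put(STOP); // one stop marker per worker
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while feeding queue", e);
        }
        return processed.get();
    }
}
```

The point of the pattern is that nothing has to enumerate the whole 
collection before processing starts, which is where the collection reader 
start-up cost came from.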


On 05/07/2013 03:49 PM, Tim Miller wrote:
> This sounds like a job for... science! I'll try some experiments and 
> see if it makes a difference.
> Tim
> On 05/07/2013 03:42 PM, Masanz, James J. wrote:
>> do you have any numbers of what sort of impact this will actually 
>> have?  Not clear to me what the savings would be from. Instantiating 
>> objects either way.  Should we be just initializing the ArrayList to 
>> something other than the default size?
>> -- James
>>> -----Original Message-----
>>> From: dev-return-1580-Masanz.James=mayo.edu@ctakes.apache.org
>>> [mailto:dev-return-1580-Masanz.James=mayo.edu@ctakes.apache.org]
>>> On Behalf Of Tim Miller
>>> Sent: Tuesday, May 07, 2013 2:18 PM
>>> To: dev@ctakes.apache.org
>>> Subject: files vs strings in collection reader
>>> The FilesInDirectoryCollectionReader creates an arraylist of
>>> java.io.File objects when it is initialized. For large datasets
>>> (~50k files) this is substantial time overhead and probably memory
>>> as well. Seems like it would be more efficient to use Strings
>>> instead of Files there and just open the File object when getNext()
>>> is called. It is pretty easy to implement, any downside to making
>>> this switch?
>>> Tim
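
For reference, the change proposed in the original message could look 
roughly like the sketch below: keep only file names as Strings and build 
the java.io.File when getNext() is called. This is not the actual 
FilesInDirectoryCollectionReader code; the class LazyFileList and its 
methods are hypothetical, and it ignores subdirectories and filtering. It 
also presizes the ArrayList, per James's suggestion. Whether any of this 
saves measurable time is exactly what the thread leaves open.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of cTAKES: stores names, creates Files lazily.
class LazyFileList {
    private final String directory;
    private final List<String> names;
    private int next = 0;

    LazyFileList(String directory, String[] names) {
        this.directory = directory;
        // Presize the list so it never has to grow while being filled.
        this.names = new ArrayList<>(names.length);
        for (String name : names) {
            this.names.add(name);
        }
        Collections.sort(this.names); // deterministic order
    }

    /** Lists a directory by name only; no File object per entry is created here. */
    static LazyFileList fromDirectory(File dir) {
        String[] listed = dir.list(); // null if dir is not a readable directory
        return new LazyFileList(dir.getPath(), listed == null ? new String[0] : listed);
    }

    int size() {
        return names.size();
    }

    boolean hasNext() {
        return next < names.size();
    }

    /** Creates the File object only at the moment it is needed. */
    File getNext() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more files");
        }
        return new File(directory, names.get(next++));
    }
}
```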
