From: Thilo Goetz
Date: Wed, 09 Jul 2008 18:05:46 +0200
To: uima-dev@incubator.apache.org
Subject: Re: Delta CAS
Message-ID: <4874E1DA.2030408@gmx.de>
In-Reply-To: <4874CA5D.9080502@schor.com>

Marshall Schor wrote:
> Some intermediate approach might help here - such as an application or
> annotator being able to provide performance tuning hints to the
> framework. For instance, a tokenizer might be able to guesstimate the
> number of tokens, based on some average token size estimate divided into
> the size of the document, and provide that as a hint.

Tell me about it. We've built a whole framework to try and figure
out ahead of time how much memory processing a certain document is
going to take, so we know how many threads we can run in parallel
before crashing the JVM. This turns out to be quite difficult if
you don't know what kinds of documents you'll be getting, and you
work with many different languages.

--Thilo
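
For illustration only, here is a minimal sketch of the arithmetic discussed above: a token-count guess from document size and an assumed average token size, and a thread count derived from a per-document memory estimate. The class name, method names, and all numbers (average token size, bytes per token, heap budget) are hypothetical. None of this is UIMA API or the framework mentioned in the message.

```java
// Hypothetical helper for illustration; not part of the UIMA API.
final class CapacityHints {

    private CapacityHints() {}

    /**
     * Guesses the number of tokens in a document by dividing the document
     * size by an assumed average token size (both measured in characters).
     */
    static long estimateTokenCount(long documentChars, double avgTokenChars) {
        if (documentChars < 0 || avgTokenChars <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (long) Math.ceil(documentChars / avgTokenChars);
    }

    /**
     * Returns how many documents could be processed in parallel within a
     * heap budget, given an estimated memory cost per document. Always
     * returns at least 1 so that processing can proceed single-threaded.
     */
    static int maxParallelThreads(long heapBudgetBytes, long estimatedBytesPerDocument) {
        if (estimatedBytesPerDocument <= 0) {
            throw new IllegalArgumentException("per-document estimate must be positive");
        }
        long fits = heapBudgetBytes / estimatedBytesPerDocument;
        return (int) Math.max(1L, Math.min(fits, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // Assumed inputs: a 60,000-character document, 6 characters per token.
        long tokens = estimateTokenCount(60_000, 6.0);

        // Assumed cost of 200 bytes of heap per token; purely illustrative.
        long bytesPerDocument = tokens * 200;

        // Assumed heap budget of 64 MiB reserved for document processing.
        int threads = maxParallelThreads(64L * 1024 * 1024, bytesPerDocument);

        System.out.println(tokens + " tokens, up to " + threads + " threads");
    }
}
```

The weak point is the one raised in the message: the average token size and the per-token memory cost are guesses that vary with document type and language, so a hint computed this way is only a rough starting value.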