uima-dev mailing list archives

From "Bhavani Iyer" <bhavan...@gmail.com>
Subject Re: Delta CAS
Date Mon, 14 Jul 2008 16:25:37 GMT
OK, I agree.  What's required is something like the following:

  /** sets the high water mark and returns the marker object. */
  Marker getHighWaterMark();

  /** Defaults to false (disabled); enabled when the high water mark is set
      via the above API. */
  boolean isDeltaCasJournalingEnabled();

  public interface Marker {
     boolean isAboveHighWaterMark(FeatureStructure fs);
  }

  The only overhead is then the call to isDeltaCasJournalingEnabled().
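
To make the intended use concrete, here is a rough sketch of how a caller
might drive the proposed API.  The createAnnotation/addFsToIndexes calls are
existing CAS methods, but getHighWaterMark(), isDeltaCasJournalingEnabled()
and Marker exist only in the proposal above, so this is an illustration
rather than working code:

  // CAS, Type, FeatureStructure, AnnotationFS are org.apache.uima.cas types.
  // getHighWaterMark()/Marker are only the proposed API; names are assumptions.
  void deltaExample(CAS cas, Type annotType, FeatureStructure preExisting) {
      Marker mark = cas.getHighWaterMark();            // enables delta CAS journaling
      assert cas.isDeltaCasJournalingEnabled();

      AnnotationFS added = cas.createAnnotation(annotType, 0, 5);
      cas.addFsToIndexes(added);

      assert mark.isAboveHighWaterMark(added);         // created after the mark
      assert !mark.isAboveHighWaterMark(preExisting);  // existed before the mark
  }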

On Mon, Jul 14, 2008 at 10:40 AM, Thilo Goetz <twgoetz@gmx.de> wrote:

> Bhavani Iyer wrote:
>
>> OK, sounds like the suggested improvements to the CAS heap design would
>> still preserve the high-water-mark mechanism for identifying new FSs as
>> those added after the mark.  Is this correct?
>>
>
> No.  My conclusion was that we'll create a CAS API that returns
> a marker object which may later be used to query the
> CAS about certain FSs and when they were created.  This object
> will be opaque to CAS users and transient in nature.  Please feel
> free to make a suggestion for such an API to make sure your
> requirements are covered.
>
>> If so, implementation can start.  Should there be a branch
>> created for this work?
>>
>
> I don't see why we need a branch for this.
>
>
>> The other main concern discussed was the overhead for core UIMA use
>> without remoting.  There should be no measurable overhead, since there
>> will be one int compare on calls to set a feature value and to add to an
>> index, and no impact on accessing FS values.
>>
>
> Please explain your design.  I expect that there'll be a
> global setting, so at most a boolean is checked?
>
>
>
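
For the record, the check being discussed might look roughly like the sketch
below inside the CAS heap code.  Everything here (class, field, and method
names, the journaling hook) is an illustrative assumption, not actual UIMA
internals; it only shows where the single boolean check and int compare would
sit on the write path:

  // Hedged sketch only; names and layout are assumptions, not UIMA internals.
  class JournalingHeap {
      private final int[] heap = new int[1 << 20];     // simplified flat heap
      private boolean deltaJournalingEnabled = false;  // the global setting
      private int markAddr = Integer.MAX_VALUE;        // heap address of the high water mark

      void setFeatureValue(int fsAddr, int featOffset, int value) {
          heap[fsAddr + featOffset] = value;           // existing work, unchanged
          // One boolean check plus one int compare per write; reads are untouched.
          if (deltaJournalingEnabled && fsAddr < markAddr) {
              journalModification(fsAddr);             // a pre-existing FS was modified
          }
      }

      private void journalModification(int fsAddr) {
          // record fsAddr for later delta CAS serialization (details elided)
      }
  }
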
>> If the overhead turns out to be an issue, we could still work around it
>> with a separate class implementing CAS with journaling, or a wrapper
>> class as suggested before.
>>
>> Bhavani
>>
>> On Thu, Jul 10, 2008 at 12:57 PM, Marshall Schor <msa@schor.com> wrote:
>>
>>  Thilo Goetz wrote:
>>>
>>>  Eddie Epstein wrote:
>>>>
>>>>  No opinions, but a few observations:
>>>>>
>>>>> 1M is way too big for some applications that need very small, but
>>>>> very many CASes.
>>>>>
>>>>>  I agree.
>>>>
>>> How about treating the 1st 1 MB segment with the same approach as the
>>> heap is now - providing the ability to start small, and expanding it
>>> (by reallocating and copying) until it gets to 1 MB?
>>>
>>> -Marshall
>>>
>>>
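
A minimal sketch of that "start small, grow to 1 MB" idea for the first
segment might look like the following; all names, the doubling growth factor,
and the overflow handling are assumptions for illustration, not actual UIMA
code:

  // Rough sketch: the first segment starts small and grows by reallocating
  // and copying until it reaches the full 1M-entry segment size.
  class FirstSegment {
      static final int MAX = 1 << 20;           // 1M entries, the full segment size
      private int[] data = new int[1024];       // start small
      private int used = 1;                     // cell 0 unused; FS refs start at 1

      int reserve(int nCells) {
          while (used + nCells > data.length && data.length < MAX) {
              int newLen = Math.min(data.length * 2, MAX);
              data = java.util.Arrays.copyOf(data, newLen);  // reallocate and copy
          }
          if (used + nCells > MAX) {
              // a later segment would take over here (not shown)
              throw new IllegalStateException("first segment full");
          }
          int addr = used;
          used += nCells;
          return addr;                          // address of the reserved cells
      }
  }
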
>>>>> Large arrays may be bigger than whatever segment size is chosen,
>>>>> making segment management a bit more complicated.
>>>>>
>>>>> There will be holes at the top of every segment when the next FS
>>>>> doesn't fit.
>>>>>
>>>>>  Not necessarily.  Why couldn't you spread FSs and arrays
>>>> across segments?
>>>>
>>>>
>>>>  Eddie
>>>>>
>>>>> On Wed, Jul 9, 2008 at 2:37 PM, Marshall Schor <msa@schor.com> wrote:
>>>>>
>>>>>> Here's a suggestion, prompted by previous posts and by common
>>>>>> hardware designs for segmented memory.
>>>>>>
>>>>>> Take the int values that represent feature structure (fs) references.
>>>>>> Today, these are positive numbers from 1 (I think) to around 4
>>>>>> billion.  These values are used directly as an index into the heap.
>>>>>>
>>>>>> Change this to split the bits in these int values into two parts,
>>>>>> let's call them upper and lower.  For example
>>>>>> xxxx xxxx xxxx yyyy yyyy yyyy yyyy yyyy
>>>>>>
>>>>>> where the xxx's are the upper bits (each x represents a hex digit),
>>>>>> and the y's the lower bits.  The y's in this case can represent
>>>>>> numbers up to 1 million (approx), and the xxx's represent 4096 values.
>>>>>>
>>>>>> Then allocate the heap using multiple 1-meg-entry tables, and store
>>>>>> each one in the 4096-entry reference array.  A heap reference would
>>>>>> then be some bit-wise shifting and an indexed lookup in addition to
>>>>>> what we have now; it would probably be very fast, and could be
>>>>>> optimized for the xxx=0 case to be even faster.
>>>>>>
>>>>>> This breaks heaps of over 1 meg into separate parts, which would make
>>>>>> them more manageable, I think, and keeps the high-water-mark method
>>>>>> viable, too.
>>>>>>
>>>>>> Opinions?
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
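
As a concrete reading of Marshall's segmented-heap suggestion above, the
address decode might look something like the sketch below.  Only the 20/12
bit split and the 4096-segment reference table come from his description; the
class name, lazy allocation policy, and method names are illustrative
assumptions:

  // Illustrative sketch of the segmented heap addressing described above;
  // the constants follow Marshall's example, nothing here is actual UIMA code.
  class SegmentedHeap {
      static final int SEG_BITS = 20;                  // low bits: offset within a segment
      static final int SEG_SIZE = 1 << SEG_BITS;       // ~1M entries per segment
      static final int SEG_MASK = SEG_SIZE - 1;

      private final int[][] segments = new int[4096][];  // high 12 bits select a segment

      int get(int fsRef) {
          int seg = fsRef >>> SEG_BITS;                // upper 12 bits: which segment
          int off = fsRef & SEG_MASK;                  // lower 20 bits: index within it
          // (a fast path could special-case seg == 0, as Marshall notes)
          return segments[seg][off];
      }

      void set(int fsRef, int value) {
          int seg = fsRef >>> SEG_BITS;
          int off = fsRef & SEG_MASK;
          if (segments[seg] == null) {
              segments[seg] = new int[SEG_SIZE];       // allocate 1M-entry tables lazily
          }
          segments[seg][off] = value;
      }
  }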
