samoa-dev mailing list archives

From Matthieu Morel <mmo...@apache.org>
Subject Re: New Instances
Date Mon, 26 Jan 2015 11:33:36 GMT
- discrete attributes are eventually mapped to double values, and
that's the appropriate input to instances, in my understanding. My
idea was to maintain the mapping in the feature extraction step, and
share it in some way with the processing topology.

- regarding performance in sparse instances, I haven't done any sort
of benchmark yet. The implementation can be changed while keeping the
same API.
From what I see, on the one hand, the current approach using an
index array has two extra constraints: 1/ the index array must be
sorted (which adds building time), and 2/ we have to do a binary
search for the index value (O(log n)).
On the other hand, there are some very efficient map implementations
that we could reuse. One example is CERN's Colt package, which is
already imported in the ASF mahout-collections package.
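
To make the two points above concrete, here is a minimal sketch, not
code from the new-instances branch. All names (SparseInstanceSketch,
getFromArrays, getFromMap, LabelMapping) are hypothetical, missing
attributes are assumed to default to 0.0, and java.util.HashMap only
stands in for a primitive-keyed map such as the ones in Colt, which
would avoid boxing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: none of these names exist in SAMOA or in the draft API.
final class SparseInstanceSketch {

    // Layout 1: two parallel arrays. The index array must be sorted,
    // otherwise Arrays.binarySearch gives undefined results.
    static double getFromArrays(int[] sortedIndices, double[] values, int attributeIndex) {
        int pos = Arrays.binarySearch(sortedIndices, attributeIndex); // O(log n)
        return pos >= 0 ? values[pos] : 0.0; // assumption: absent means 0.0
    }

    // Layout 2: a map from attribute index to value. No sorting is needed
    // when building, and lookup is expected constant time.
    static double getFromMap(Map<Integer, Double> values, int attributeIndex) {
        return values.getOrDefault(attributeIndex, 0.0);
    }

    // Discrete values are mapped to doubles outside the instance, for
    // example in the feature extraction step. Codes are assigned in
    // order of first appearance, so new labels can show up at any time.
    static final class LabelMapping {
        private final Map<String, Double> codes = new HashMap<>();

        double encode(String label) {
            Double code = codes.get(label);
            if (code == null) {
                code = (double) codes.size();
                codes.put(label, code);
            }
            return code;
        }
    }

    public static void main(String[] args) {
        int[] indices = {2, 7, 19};
        double[] vals = {1.5, 3.0, 4.25};

        Map<Integer, Double> asMap = new HashMap<>();
        for (int i = 0; i < indices.length; i++) {
            asMap.put(indices[i], vals[i]);
        }

        // Both layouts answer the same query behind the same accessor shape.
        System.out.println(getFromArrays(indices, vals, 7));
        System.out.println(getFromMap(asMap, 7));

        LabelMapping classLabels = new LabelMapping();
        System.out.println(classLabels.encode("+"));
        System.out.println(classLabels.encode("-"));
    }
}
```

Since both lookups have the same signature shape, the storage could be
swapped later without touching callers, which is what I mean by keeping
the same API. Which one is actually faster still needs a benchmark.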

I hope this answers your questions,

Matthieu


On Mon, Jan 26, 2015 at 7:30 AM, Albert Bifet <abifet@waikato.ac.nz> wrote:
> Nice and simple API! Some things to comment:
>
> - how can we manage discrete attributes, for example attribute class:
> "+","-"?
>
> - In sparse instances, is the performance of a map similar to the
> performance of two arrays, one for indices and one for values?
>
> Albert
>
> On Sat, Jan 24, 2015 at 1:38 AM, Matthieu Morel <matthieu.morel@gmail.com>
> wrote:
>
>> I took a shot at drafting a simplified API for instances.
>> https://github.com/matthieumorel/samoa/tree/new-instances
>>
>> As pointed out in this thread, the current API is too exhaustive, too
>> tied to a specific implementation, and too tied to WEKA/MOA APIs.
>>
>> In addition, I feel the header/information does not belong to the
>> instance. This is something which is used when parsing arff files
>> where static information about the stream is available from the start.
>> But for a real streaming use case, we should not make such an assumption.
>> Whatever is known at the beginning should be loaded by the topology,
>> but not included in the instances. Arff files can still be loaded and
>> generate instances in the new format. Only the headers should be
>> parsed separately.
>>
>> This proposal is a draft and single label only. It should be easy to
>> add functionality suggested by Albert for multi labels.
>>
>> Feel free to comment!
>>
>> Matthieu
>>
>>
>>
>>
>> On Wed, Jan 21, 2015 at 2:31 AM, Albert Bifet <abifet@waikato.ac.nz>
>> wrote:
>> > 1/ Learners such as decision trees can deal with new instances that arrive
>> > with more label classes. New instances can arrive with new headers.
>> >
>> > 2/ To change class labels dynamically, we need to add a method
>> > "setValue(int, string)" in the Attribute class to add dynamically new
>> > attribute values.
>> >
>> > 3/ The current design is compatible with the methods in weka
>> > instances. It could be nice to have a fresher design. I will need some
>> > help to have a simplified and fresher design of the instances as I'm a
>> > bit conditioned by the previous instance usage :)
>> >
>> > Thanks,
>> >
>> > Albert
>> >
>> >
>> >
>> > On Wed, Jan 21, 2015 at 2:33 AM, Olivier Van Laere
>> > <oliviervanlaere@gmail.com> wrote:
>> >> Hey Matthieu,
>> >>
>> >>> On Jan 20, 2015, at 1:47 AM, Matthieu Morel <matthieu.morel@gmail.com> wrote:
>> >>>
>> >>> I'm confused. From what I see the number of classes is currently fixed
>> >>> in the instance header. As if it was static. I suppose you can work
>> >>> around that limitation with some hacks but I want to use a clean API
>> >>> for that.
>> >>>
>> >>> Or is there a recommended way I'm missing?
>> >>
>> >> Ah, I think I remember now what happened. As far as I encountered this
>> >> situation, the data had say an .arff format with a header stating the
>> >> number of class values, and the instance header was read from that, while
>> >> the instances were then read by the line.
>> >>
>> >> I worked around that by just storing the class label seen in the
>> >> instances on the fly when building a model, and ignored that field of the
>> >> instance header. Sorry for the confusion!
>> >>
>> >> Cheers,
>> >> Olivier
>> >>
>> >>
>>
