samoa-dev mailing list archives

From Albert Bifet <abi...@waikato.ac.nz>
Subject Re: New Instances
Date Mon, 26 Jan 2015 14:20:29 GMT
Hi Matthieu,

Thanks for your answers! I agree with using double values to store
attribute information. I think we need to define how to maintain the
mapping, as some learners need to know whether attributes are discrete or
numeric, and how many values the discrete attributes have, in order to
learn and make predictions.

Cheers, Albert
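
A minimal sketch of the kind of mapping discussed above, assuming it is kept outside the instances and shared with the topology. The names `DiscreteAttributeMapping` and `StreamSchema` are hypothetical and are not part of the SAMOA API or of the proposed branch; the sketch only shows how discrete labels could be stored as double values while learners can still ask whether an attribute is discrete and how many values it has so far.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a SAMOA class): maps the labels of one discrete
// attribute to double codes and tracks how many distinct labels were seen.
final class DiscreteAttributeMapping {
    private final Map<String, Integer> codeByLabel = new HashMap<>();
    private final List<String> labelByCode = new ArrayList<>();

    // Returns the code for a label; a label seen for the first time gets
    // the next free code, so new values can appear while streaming.
    double encode(String label) {
        Integer code = codeByLabel.get(label);
        if (code == null) {
            code = labelByCode.size();
            codeByLabel.put(label, code);
            labelByCode.add(label);
        }
        return code.doubleValue();
    }

    // Returns the label for a code produced earlier by encode().
    String decode(double value) {
        return labelByCode.get((int) value);
    }

    int numValues() {
        return labelByCode.size();
    }
}

// Hypothetical schema held by the topology rather than by each instance.
// Attributes without a mapping are treated as numeric.
final class StreamSchema {
    private final Map<Integer, DiscreteAttributeMapping> discrete = new HashMap<>();

    DiscreteAttributeMapping declareDiscrete(int attributeIndex) {
        return discrete.computeIfAbsent(attributeIndex, i -> new DiscreteAttributeMapping());
    }

    boolean isDiscrete(int attributeIndex) {
        return discrete.containsKey(attributeIndex);
    }

    // Number of values seen so far for a discrete attribute, 0 for numeric ones.
    int numValues(int attributeIndex) {
        DiscreteAttributeMapping mapping = discrete.get(attributeIndex);
        return mapping == null ? 0 : mapping.numValues();
    }
}
```

With this sketch, a class attribute with labels "+" and "-" would be declared once via `declareDiscrete`, and each label would be passed through `encode` before being stored in an instance. How such a schema would be distributed across processing nodes is left open here.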

On Mon, Jan 26, 2015 at 7:33 PM, Matthieu Morel <mmorel@apache.org> wrote:

> - discrete attributes are eventually mapped to double values, and
> that's the appropriate input to instances, in my understanding. My
> idea was to maintain the mapping in the feature extraction step, and
> share it in some way with the processing topology.
>
> - regarding performance in sparse instances, I haven't done any sort
> of benchmark yet. The implementation can be changed while keeping the
> same API.
> From what I see, on the one hand, the current approach using an
> index array has two extra constraints: 1/ the index array must be
> sorted (which adds building time), and 2/ we have to do a binary
> search for the index value (log(n) per lookup).
> On the other hand, there are some very efficient map implementations
> that we could reuse. For example, CERN's colt package, which is
> already imported in the mahout-collections ASF package.
>
> I hope this answers your questions,
>
> Matthieu
>
>
> On Mon, Jan 26, 2015 at 7:30 AM, Albert Bifet <abifet@waikato.ac.nz>
> wrote:
> > Nice and simple API! Some things to comment:
> >
> > - how can we manage discrete attributes, for example attribute class:
> > "+","-"?
> >
> > - In sparse instances, is the performance of a map similar to the
> > performance of two arrays, one for indices and one for values?
> >
> > Albert
> >
> > On Sat, Jan 24, 2015 at 1:38 AM, Matthieu Morel <matthieu.morel@gmail.com>
> > wrote:
> >
> >> I took a shot at drafting a simplified API for instances.
> >> https://github.com/matthieumorel/samoa/tree/new-instances
> >>
> >> As pointed out in this thread, the current API is too exhaustive, too
> >> tied to a specific implementation, and too tied to WEKA/MOA APIs.
> >>
> >> In addition, I feel the header/information does not belong to the
> >> instance. This is something which is used when parsing arff files
> >> where static information about the stream is available from the start.
> >> But for a real streaming use case, we should not make such an assumption.
> >> Whatever is known at the beginning should be loaded by the topology,
> >> but not included in the instances. Arff files can still be loaded and
> >> generate instances in the new format. Only the headers should be
> >> parsed separately.
> >>
> >> This proposal is a draft and single label only. It should be easy to
> >> add functionality suggested by Albert for multi labels.
> >>
> >> Feel free to comment!
> >>
> >> Matthieu
> >>
> >>
> >>
> >>
> >> On Wed, Jan 21, 2015 at 2:31 AM, Albert Bifet <abifet@waikato.ac.nz>
> >> wrote:
> >> > 1/ Learners such as decision trees can deal with new instances that
> >> > arrive with more class labels. New instances can arrive with new headers.
> >> >
> >> > 2/ To change class labels dynamically, we need to add a method
> >> > "setValue(int, string)" in the Attribute class to add new attribute
> >> > values dynamically.
> >> >
> >> > 3/ The current design is compatible with the methods in weka
> >> > instances. It would be nice to have a fresher design. I will need some
> >> > help to have a simplified and fresher design of the instances as I'm a
> >> > bit conditioned by the previous instance usage :)
> >> >
> >> > Thanks,
> >> >
> >> > Albert
> >> >
> >> >
> >> >
> >> > On Wed, Jan 21, 2015 at 2:33 AM, Olivier Van Laere
> >> > <oliviervanlaere@gmail.com> wrote:
> >> >> Hey Matthieu,
> >> >>
> >> >>> On Jan 20, 2015, at 1:47 AM, Matthieu Morel <matthieu.morel@gmail.com> wrote:
> >> >>>
> >> >>> I'm confused. From what I see the number of classes is currently fixed
> >> >>> in the instance header. As if it was static. I suppose you can work
> >> >>> around that limitation with some hacks but I want to use a clean API
> >> >>> for that.
> >> >>>
> >> >>> Or is there a recommended way I'm missing?
> >> >>
> >> >> Ah, I think I remember now what happened. As far as I encountered this
> >> >> situation, the data had say an .arff format with a header stating the
> >> >> number of class values, and the instance header was read from that, while
> >> >> the instances were then read by the line.
> >> >>
> >> >> I worked around that by just storing the class label seen in the
> >> >> instances on the fly when building a model, and ignored that field of the
> >> >> instance header. Sorry for the confusion!
> >> >>
> >> >> Cheers,
> >> >> Olivier
> >> >>
> >> >>
> >>
>
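
The two sparse representations compared in the thread can be sketched as follows. Both class names are hypothetical and neither is taken from the proposed branch. The map variant uses `java.util.HashMap` purely as a stand-in: it boxes keys and values, which the primitive map implementations mentioned in the thread are meant to avoid, so this sketch illustrates the API trade-off and is not a basis for performance conclusions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse vector backed by two parallel arrays.
// Building requires a sort; each lookup is a binary search.
final class ArraySparseVector {
    private final int[] indices;   // sorted ascending
    private final double[] values; // values[i] belongs to indices[i]

    ArraySparseVector(int[] idx, double[] val) {
        if (idx.length != val.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        // Sort positions by attribute index so both arrays stay aligned.
        Integer[] order = new Integer[idx.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> idx[i]));
        indices = new int[idx.length];
        values = new double[val.length];
        for (int i = 0; i < order.length; i++) {
            indices[i] = idx[order[i]];
            values[i] = val[order[i]];
        }
    }

    // Returns the stored value, or 0.0 for an index that is not present.
    double get(int index) {
        int pos = Arrays.binarySearch(indices, index);
        return pos >= 0 ? values[pos] : 0.0;
    }
}

// Hypothetical sparse vector backed by a hash map.
// No sorting is needed and lookups are expected constant time.
final class MapSparseVector {
    private final Map<Integer, Double> valueByIndex = new HashMap<>();

    MapSparseVector(int[] idx, double[] val) {
        if (idx.length != val.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        for (int i = 0; i < idx.length; i++) {
            valueByIndex.put(idx[i], val[i]);
        }
    }

    // Returns the stored value, or 0.0 for an index that is not present.
    double get(int index) {
        return valueByIndex.getOrDefault(index, 0.0);
    }
}
```

Both classes expose the same `get(int)` method, which matches the point made in the thread that the implementation can be changed while keeping the same API.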
