samoa-dev mailing list archives

From Gianmarco De Francisci Morales <g...@apache.org>
Subject Re: SAMOA getting started help
Date Mon, 18 May 2015 09:24:04 GMT
Hi Ilias,

Your data does not have a class, but you are trying to train a classifier.
What are you trying to predict exactly?
Even in SVM you need a class label for your training set.
And if you want some accuracy figures, you need a labeled test set as well
(a ground truth).
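
To make the point about the missing class concrete, here is a rough sketch of how a sparse ARFF file with a class attribute could be generated. It is plain Python, not SAMOA code: the function name, the "tweets" relation, the w0..w2 attribute names and the pos/neg labels are all made up for illustration, and it assumes the class is declared as the last attribute.

```python
def to_sparse_arff(relation, vocabulary, class_values, rows):
    """Build a sparse ARFF document with a nominal class as the last attribute.

    rows is a list of (weights, label) pairs, where weights maps a 0-based
    attribute index to its tf-idf value. Indices that are absent are
    implicitly zero, which is what the sparse format is for.
    """
    lines = [f"@relation {relation}", ""]
    for term in vocabulary:
        lines.append(f"@attribute {term} numeric")
    # The class attribute is nominal: its allowed values go in braces.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]

    class_index = len(vocabulary)  # the class comes after all feature attributes
    for weights, label in rows:
        cells = [f"{index} {value}" for index, value in sorted(weights.items())]
        cells.append(f"{class_index} {label}")
        lines.append("{" + ", ".join(cells) + "}")
    return "\n".join(lines)


# Hypothetical data: two labeled documents over a three-term vocabulary.
arff = to_sparse_arff(
    "tweets",
    ["w0", "w1", "w2"],
    ["pos", "neg"],
    [({0: 0.5, 2: 0.25}, "pos"), ({1: 1.0}, "neg")],
)
print(arff)
```

Each data row lists only the non-zero features plus the class, for example {0 0.5, 2 0.25, 3 pos}.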

PrequentialEvaluation works by using unseen data as test data, and then
using it for training.
That is, each instance is first used as a test instance, then as a training
instance.
Therefore the n-th instance is tested on the model built on the previous
n-1 instances.
If the instance is unlabeled, it simply predicts its label (although it
doesn't store it anywhere as of now, but that's easy to fix).
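
The test-then-train loop described above can be sketched roughly as follows. This is an illustration in plain Python, not how SAMOA implements it: MajorityClassLearner, prequential_evaluation and the toy stream are hypothetical names, and the learner is a trivial stand-in for a real classifier. Unlike the current behaviour, the sketch keeps the predictions for unlabeled instances in a list.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        # No prediction is possible before the first labeled instance.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def train(self, features, label):
        self.counts[label] += 1


def prequential_evaluation(stream, learner):
    """Use each instance first for testing, then for training.

    Returns (accuracy over labeled instances, list of all predictions).
    """
    correct = 0
    tested = 0
    predictions = []
    for features, label in stream:
        # Test step: the model has only seen the previous instances.
        predicted = learner.predict(features)
        predictions.append(predicted)
        if label is not None:
            tested += 1
            if predicted == label:
                correct += 1
            # Train step: only labeled instances can update the model.
            learner.train(features, label)
    accuracy = correct / tested if tested else 0.0
    return accuracy, predictions


# Hypothetical stream: (sparse features, label); None marks an unlabeled instance.
stream = [
    ({"w1": 0.3}, "spam"),
    ({"w2": 0.7}, "spam"),
    ({"w1": 0.1}, "ham"),
    ({"w3": 0.9}, None),
    ({"w2": 0.4}, "spam"),
]
accuracy, predictions = prequential_evaluation(stream, MajorityClassLearner())
print(accuracy, predictions)
```

The unlabeled fourth instance still gets a prediction, but it contributes neither to the accuracy nor to training.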

Regarding other classifiers without docs, it's one of these 3 things:
1) they are not fully tested
2) they are not fully parallel (e.g. Naive Bayes right now)
3) simply, the doc is missing

Cheers,

--
Gianmarco

On 15 May 2015 at 16:38, Bertsimas Ilias <award100@gmail.com> wrote:

> Hi Gianmarco,
>
> I hope my reply doesn't create a separate thread. Sorry in advance
> if it does, I forgot to subscribe before sending the original message.
>
> Here's an excerpt from my dataset in sparse array ARFF format:
>
> https://drive.google.com/file/d/0B1WaPw_KXbfkaVJ6T0lnMDFBdmc/view?usp=sharing
>
> I am coming from an SVM classification paradigm where you first train your
> model with a labelled data-set and then test it with a separate unlabelled
> data-set.
> How would that translate to the streaming online processing paradigm of
> SAMOA?
>
> I noticed there are a lot of classification tasks available that are not
> listed in the documentation. Is there a reason for that?
>
> Kind Regards,
>
> Ilias Bertsimas.
>
>
> On 13 May 2015 at 14:03, Bertsimas Ilias <award100@gmail.com> wrote:
>
> > Hi all!
> >
> > I am in the process of running some tests for online machine learning in
> > data streams from social media. I came across Apache SAMOA and it seemed
> > like a very interesting framework.
> > However, I could not figure out how to get it to test and train
> > using a sparse array of tf-idf feature vectors. I provide the data in the
> > standard WEKA ARFF format and, although it ran, the output is something
> > along the lines of:
> >
> > 2015-05-12 22:58:58,993 [main] INFO
> >>  com.yahoo.labs.samoa.evaluation.EvaluatorProcessor
> >> (EvaluatorProcessor.java:189) -
> >> com.yahoo.labs.samoa.evaluation.EvaluatorProcessorid = 0
> >> evaluation instances,classified instances,classifications correct
> >> (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent)
> >> 100.0,100.0,100.0,100.0,?
> >> 200.0,200.0,100.0,100.0,?
> >> 300.0,300.0,100.0,100.0,?
> >> 400.0,400.0,100.0,100.0,?
> >> 500.0,500.0,100.0,100.0,?
> >> 600.0,600.0,100.0,100.0,?
> >> 700.0,700.0,100.0,100.0,?
> >> 800.0,800.0,100.0,100.0,?
> >> 900.0,900.0,100.0,100.0,?
> >> 1000.0,1000.0,100.0,100.0,?
> >> 1100.0,1100.0,100.0,100.0,?
> >> 1200.0,1200.0,100.0,100.0,?
> >> 1300.0,1300.0,100.0,100.0,?
> >> 1400.0,1400.0,100.0,100.0,?
> >> 1500.0,1500.0,100.0,100.0,?
> >> 1600.0,1600.0,100.0,100.0,?
> >> 1700.0,1700.0,100.0,100.0,?
> >> 1800.0,1800.0,100.0,100.0,?
> >> 1900.0,1900.0,100.0,100.0,?
> >
> >
> >
> > I have read the documentation on the SAMOA project page but I wasn't able
> > to figure out how to get classification results per instance.
> > Could you please point me in the right direction in terms of acceptable
> > formats SAMOA can use as stream input? Is there a need for a labeled
> > training set to be included in the data?
> >
> > Any examples you could provide me with that are not already in the
> > documentation would be most welcome!
> >
> >
> > Kind Regards,
> >
> > Ilias Bertsimas.
> >
>
