samoa-dev mailing list archives

From Gianmarco De Francisci Morales <g...@apache.org>
Subject Re: Samoa - Samza job execution
Date Wed, 02 Sep 2015 07:36:02 GMT
Hi Shekar,

I think the way to start is to define how the instance will be serialized
in JSON.
To do so, we need to answer a few questions:
- How are the attribute IDs represented?
Probably a simple int as the key is enough.
- How do we represent metadata?
I'd say that a single JSON instance at the beginning of the file could
contain the needed metadata.
For example, for each attribute we could have its domain (binary, nominal,
real, etc.).
For very large datasets, listing every attribute might be inefficient, so we
might want a default domain (real) and a way to express ranges (e.g.,
attributes from 0 to 10000 are all real).
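
To make the metadata idea concrete, here is a minimal sketch of a header
with a default domain and ranges. The field names ("default", "ranges",
"from", "to", "domain") are assumptions made up for illustration, not an
agreed SAMOA format.

```python
import json

# Hypothetical header layout: every field name here is an illustrative
# assumption, not part of any existing SAMOA format.
HEADER_LINE = json.dumps({
    "default": "real",
    "ranges": [
        {"from": 0, "to": 10000, "domain": "real"},
        {"from": 10001, "to": 10001, "domain": "binary"},
    ],
})


def attribute_domain(header, attribute_id):
    """Return the domain of an attribute ID, falling back to the default."""
    for r in header.get("ranges", []):
        # Ranges are treated as inclusive on both ends.
        if r["from"] <= attribute_id <= r["to"]:
            return r["domain"]
    return header.get("default", "real")


header = json.loads(HEADER_LINE)
print(attribute_domain(header, 42))     # real (covered by the first range)
print(attribute_domain(header, 10001))  # binary
print(attribute_domain(header, 99999))  # real (no range matches, default)
```

With this layout the header stays small even for datasets with very many
attributes, because only the exceptions to the default need to be listed.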

There might be other issues that I am overlooking now, but in practice we
need a 1:1 mapping from SAMOA instances to JSON.
Once this is set, implementing a reader should be straightforward.
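
As a rough sketch of what such a reader might look like, assuming one JSON
object per line, a metadata object on the first line, and instances stored
as sparse maps from attribute ID to value (all of these are assumptions, and
read_instances is a made-up name, not a SAMOA API):

```python
import io
import json


def read_instances(lines):
    """Yield (header, instance) pairs from an iterable of JSON lines.

    Assumes the first non-empty line is the metadata header and every
    following line is one instance.
    """
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if header is None:
            header = record
            continue
        # JSON object keys are always strings, so attribute IDs written
        # as ints must be converted back when reading.
        yield header, {int(key): value for key, value in record.items()}


# Tiny in-memory example standing in for a file.
sample = io.StringIO(
    '{"default": "real"}\n'
    '{"0": 1.5, "3": 0.25}\n'
    '{"1": 2.0}\n'
)
instances = [instance for _, instance in read_instances(sample)]
print(instances)
```

Turning each map into an actual SAMOA Instance is left out on purpose, since
that depends on the format decisions above.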

The best way to start, imho, is to create a document that describes the
format in full detail.
See, e.g., https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format
for VW.
A simple Google doc would be good to start.

Hope this helps!

--
Gianmarco

On 2 September 2015 at 09:31, Shekar Tippur <ctippur@gmail.com> wrote:

> Gianmarco,
>
> I would really like to take on JSON support in Samoa. Can you please point
> me to a place where I can start?
>
> - Shekar
>
> On Sun, Jul 12, 2015 at 12:20 AM, Gianmarco De Francisci Morales <
> gdfm@gdfm.me> wrote:
>
> > Hi,
> >
> > The only reason is that we inherited the format from MOA.
> > In practice, anything from which we can create an Instance would be
> > good enough.
> > For example, I'd like to support the VW and LIBSVM formats.
> >
> > One caveat is that some algorithms require knowledge of the dataset
> > metadata to preallocate some data structures.
> > I would like to remove this dependency in the future, by making the
> > algorithms fully adaptive.
> > Though it's not as easy as it sounds :)
> >
> > Cheers,
> >
> > --
> > Gianmarco
> >
> > On 11 July 2015 at 16:46, Shekar Tippur <ctippur@gmail.com> wrote:
> >
> > > Gianmarco
> > >
> > > Thanks for the response. Can you please specify the format? Can you
> > > please explain the reason for keeping it in a specific format?
> > > I would like to contribute to the Kafka enhancement. I will look into
> > > the code base you pointed out.
> > >
> > > Shekar
> > > On Jul 11, 2015 1:36 AM, "Gianmarco De Francisci Morales"
> > > <gdfm@apache.org> wrote:
> > >
> > > > Hi Shekar,
> > > >
> > > > At the moment we do not support JSON data.
> > > > The current readers support the ARFF format, which is CSV with a
> > > > header section.
> > > > http://www.cs.waikato.ac.nz/ml/weka/arff.html
> > > > Adding support for JSON is doable, but it should conform to a very
> > > > specific format.
> > > >
> > > > About Kafka, we support it as a transport via Samza, but we don't
> > > > have a reader for it right now.
> > > > Adding it would be very valuable. If you wanted to work on it I'd be
> > > > happy to help.
> > > > Have a look at org.apache.samoa.streams.fs.HDFSFileStreamSource,
> > > > and org.apache.samoa.streams.ArffFileStream for some examples.
> > > >
> > > > Cheers,
> > > >
> > > >
> > > > --
> > > > Gianmarco
> > > >
> > > > On 10 July 2015 at 01:18, Shekar Tippur <ctippur@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am trying to use the Samoa/Samza combination to apply ML to a
> > > > > dataset I have in JSON format.
> > > > >
> > > > > This is the document I am following:
> > > > >
> > > > > https://samoa.incubator.apache.org/documentation/Executing-SAMOA-with-Apache-Samza.html
> > > > >
> > > > > A couple of questions:
> > > > > 1. How do I point the input to a stream/topic in Kafka? The data
> > > > > is in JSON.
> > > > > 2. If I want to use historical data that is stored in a file, how
> > > > > do I point the job to read from that file and serialise it as JSON?
> > > > >
> > > > > bin/samoa samza target/SAMOA-Samza-0.3.0-SNAPSHOT.jar
> > > > > "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (??)"
> > > > >
> > > > > - Shekar
> > > > >
> > > >
> > >
> >
>