samoa-dev mailing list archives

From Jayadeep J <>
Subject Re: Avro Support for SAMOA
Date Wed, 21 Oct 2015 08:39:17 GMT
Hi Gianmarco,

Thanks for your reply. Regarding the points you mentioned,

1) W.r.t. Sparse & Dense instances, I am trying to understand what you
meant by "prototypes". Did you mean creating custom Avro data types like
'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes, the actual
data stored in the file (JSON encoded) may become heavy. For example, for the
iris data-set, if we decide to use a 'SparseNumeric' type for


the data may look like this,

Converting existing Avro data into 'SAMOA compatible Avro' may become
painful for the user. Wouldn't it be easier if we just distinguish the two
cases inside the code? Say, if at least one attribute in the metadata uses
the generic Avro optionality (e.g. ["null", "double"]), then we do
readInstanceSparse() in the Loader and map correspondingly. Or is there
some other complexity that I have not looked at?
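
The check described above could be sketched roughly as follows. This is only an illustration in Python using the standard library (an Avro schema is itself a JSON document, and an optional attribute is a union, i.e. a JSON array, that contains "null"). The iris field names and the functions is_optional and choose_reader are hypothetical, not part of SAMOA or Avro, and the sketch only handles "null" written as a plain string inside the union.

```python
import json

# Hypothetical schema for the iris data set; the field names are assumptions.
IRIS_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Iris",
  "fields": [
    {"name": "sepallength", "type": ["null", "double"]},
    {"name": "sepalwidth",  "type": "double"},
    {"name": "petallength", "type": "double"},
    {"name": "petalwidth",  "type": "double"}
  ]
}
""")


def is_optional(field_type):
    """True if the type is an Avro union (a JSON array) that includes "null"."""
    return isinstance(field_type, list) and "null" in field_type


def choose_reader(schema):
    """Pick the sparse path if at least one attribute is optional, else dense."""
    if any(is_optional(field["type"]) for field in schema["fields"]):
        return "sparse"
    return "dense"


print(choose_reader(IRIS_SCHEMA))
```

With this rule a user keeps an ordinary Avro schema, and the Loader alone decides whether to take the sparse or the dense path.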

2) Yes. Skipping the Date-type attributes will make it easier!

Regarding the engineering aspects,

We can have the Avro dependency in the deployable jar of SAMOA. In the code,
maybe

1) We could have an Avro equivalent of & ArffLoader
2) Maybe a different Reader altogether for handling the binary stream
3) A user option to switch between JSON/Binary encoding
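
To make the JSON/Binary switch concrete, here is a rough sketch in Python (standard library only) that decodes a single record whose fields are all ["null", "double"] unions from either encoding. The function names and the encoding option are hypothetical. It handles only a raw Avro datum, not the Avro container file format, and a real Reader would rely on the Avro library's own decoders rather than hand-written ones.

```python
import json
import struct


def read_long(buf, pos):
    """Decode an Avro long (zigzag varint); return (value, new_pos)."""
    shift, raw = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def decode_binary(buf, n_fields):
    """Decode one record made of n_fields ["null", "double"] unions."""
    values, pos = [], 0
    for _ in range(n_fields):
        branch, pos = read_long(buf, pos)  # position of the branch in the union
        if branch == 0:                    # "null": no payload bytes follow
            values.append(None)
        else:                              # "double": 8 bytes, little-endian
            values.append(struct.unpack_from("<d", buf, pos)[0])
            pos += 8
    return values


def decode_json(text, field_names):
    """Decode one JSON-encoded record; a non-null union looks like {"double": 5.1}."""
    record = json.loads(text)
    return [None if record[name] is None else record[name]["double"]
            for name in field_names]


def decode(payload, encoding, field_names):
    """Hypothetical user option: encoding is either "json" or "binary"."""
    if encoding == "json":
        return decode_json(payload, field_names)
    if encoding == "binary":
        return decode_binary(payload, len(field_names))
    raise ValueError("unknown encoding: " + encoding)
```

Both paths return the same list of values (None for a missing attribute), so the code that builds the instances would not need to know which encoding was used.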

If there is a better way to do it, kindly advise.


On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <> wrote:

> Hi Jayadeep,
> I think it's pretty cool!
> If we get both Avro and Kafka support right, we can connect to almost
> anything.
> The document looks very comprehensive, you seem to have given a lot of
> thought to it.
> I am not extremely familiar with Avro myself, I've just used it a couple
> of times, but I'll try to provide some suggestions.
> - The general idea of where and how to store data and meta-data seems
> right.
> - In general, all attributes in a sparse instance are optional, and all
> attributes in a dense instance are required. Maybe we want to be more
> granular than this in the future, but it seems that Avro supports a
> superset of these settings. We may want to have some default "prototypes"
> in order to make mapping the current dense/sparse instances easy.
> - Right now we are not making use of Date-type attributes in SAMOA (there
> is no such thing in samoa-instances), so if it makes it easier we could
> skip supporting it. Ideally we could have algorithms that respect
> event-time as provided by timestamps in the instances (as opposed to
> processing the event whenever it arrives), however we are not there yet :)
> All the rest seems pretty straightforward.
> Moving to the more software-engineering oriented aspects, where would we
> have dependencies for Avro? And how should they be deployed? Would they
> simply go inside the deployable uber-jar of SAMOA?
> Thanks,
> --
> Gianmarco
> On 19 October 2015 at 11:24, Jayadeep J <> wrote:
>> Hi Gianmarco / All,
>> I am working on an integration of SAMOA with Apache Avro. Basically I
>> want to use data stored in Avro files as input to SAMOA.
>> As I understand, current SAMOA readers only support the ARFF format. Do you
>> think such a feature would be useful to SAMOA in general? Avro allows two
>> encodings for the data: Binary & JSON. Hence Avro support may also allow
>> users with JSON data to use SAMOA.
>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
>> Format document in Google Docs.
>> Would it be possible for you to have a look and provide your valuable
>> suggestions?
>> Thanks
>> Jay


Jayadeep J
Mob: (+91) - 9176669142
