samoa-dev mailing list archives

From Jayadeep J <jayade...@gmail.com>
Subject Re: Avro Support for SAMOA
Date Wed, 21 Oct 2015 08:39:17 GMT
Hi Gianmarco,

Thanks for your reply. Regarding the points you mentioned,

1) W.r.t. sparse and dense instances, I am trying to understand what you
meant by "prototypes". Did you mean creating custom Avro data types such as
'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If so, the actual
data stored in the file (JSON-encoded) may become heavy. For example, for the
iris data set, if we decide to use a 'SparseNumeric' type for
'sepallength',

{"name":
"sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},

the data may look like this,
{"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
{"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
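
To make the nesting concrete, here is a minimal Python sketch (standard
library only, not the Avro API) that flattens a JSON-encoded record like the
second one above back into plain values. The helper name `unwrap` is
hypothetical, and treating every single-key object as a union or record
wrapper is an assumption that holds for this example but not for Avro
records or maps in general.

```python
import json

def unwrap(value):
    """Strip nested single-key wrappers down to the bare value.

    Assumption: any dict with exactly one key is a union/record wrapper,
    as in the iris example. Real Avro data would need the schema to decide.
    """
    while isinstance(value, dict) and len(value) == 1:
        value = next(iter(value.values()))
    return value

# Second record from the example, with 'sepallength' as a 'SparseNumeric'.
line = (
    '{"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":'
    '{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,'
    '"petalwidth":0.2,"class":"virginica"}'
)

# Flatten every attribute; plain values and nulls pass through unchanged.
record = {name: unwrap(value) for name, value in json.loads(line).items()}
print(record)
```

Three levels of wrapping are needed to carry one double, which is the
overhead I am worried about.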

Converting existing Avro data into a 'SAMOA-compatible Avro' format may
become painful for users. Wouldn't it be easier to make the distinction
inside the code? Say, if at least one attribute in the metadata uses the
generic Avro optionality (e.g. ["null", "double"]), then we call
readInstanceSparse() in the Loader and map accordingly. Or is there some
other complexity that I have not looked at?
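
The schema-based check suggested above could look roughly like this. It is a
Python sketch that uses only the json module to inspect the schema, not the
Avro Java API. The record name `Iris` and the helpers `is_nullable`,
`looks_sparse`, and `present_indices` are hypothetical names for
illustration; how this would hook into readInstanceSparse() is left open.

```python
import json

# Hypothetical iris schema: only 'sepallength' uses Avro optionality.
SCHEMA = json.loads("""
{"type": "record", "name": "Iris", "fields": [
  {"name": "sepallength", "type": ["null", "double"]},
  {"name": "sepalwidth",  "type": "double"},
  {"name": "class",       "type": "string"}
]}
""")

def is_nullable(field_type):
    """An Avro union is a JSON array; it is optional if 'null' is a branch."""
    return isinstance(field_type, list) and "null" in field_type

def looks_sparse(schema):
    """Proposed rule: treat the data as sparse if any attribute is optional."""
    return any(is_nullable(field["type"]) for field in schema["fields"])

def present_indices(record, schema):
    """Indices of the attributes that are non-null in one decoded record."""
    return [
        index
        for index, field in enumerate(schema["fields"])
        if record.get(field["name"]) is not None
    ]

row = {"sepallength": None, "sepalwidth": 3.5, "class": "setosa"}
mode = "sparse" if looks_sparse(SCHEMA) else "dense"
print(mode, present_indices(row, SCHEMA))
```

With this rule, existing Avro files need no custom types: the Loader picks
the sparse or dense path from the schema alone.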

2) Yes. Skipping the Date-type attributes will make it easier!

Regarding the engineering aspects,

We can include the Avro dependency in the deployable jar of SAMOA. In the
code, maybe:

1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
2) Maybe a different Reader altogether for handling the binary stream
3) A user option to switch between JSON and binary encoding

If there is a better way to do it, kindly advise.

Thanks
Jay
https://github.com/jayadeepj

On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:

> Hi Jayadeep,
>
> I think it's pretty cool!
> If we get both Avro and Kafka support right, we can connect to almost
> anything.
>
> The document looks very comprehensive, you seem to have given a lot of
> thought to it.
> I am not extremely familiar with Avro myself, I've just used it a couple
> of times, but I'll try to provide some suggestions.
>
> - The general idea of where and how to store data and meta-data seems
> right.
> - In general, all attributes in a sparse instance are optional, and all
> attributes in a dense instance are required. Maybe we want to be more
> granular than this in the future, but it seems that Avro supports a
> superset of these settings. We may want to have some default "prototypes"
> in order to make mapping the current dense/sparse instances easy.
> - Right now we are not making use of Date-type attributes in SAMOA (there
> is no such thing in samoa-instances), so if it makes it easier we could
> skip supporting it. Ideally we could have algorithms that respect
> event-time as provided by timestamps in the instances (as opposed to
> processing the event whenever it arrives), however we are not there yet :)
>
> All the rest seems pretty straightforward.
>
> Moving to the more software-engineering oriented aspects, where would we
> have dependencies for Avro? And how should they be deployed? Would they
> simply go inside the deployable uber-jar of SAMOA?
>
> Thanks,
>
> --
> Gianmarco
>
> On 19 October 2015 at 11:24, Jayadeep J <jayadeepj@gmail.com> wrote:
>
>> Hi Gianmarco / All,
>>
>> I am working on an integration of SAMOA with Apache Avro. Basically I
>> want to use data stored in Avro files as input to SAMOA.
>>
>> As I understand it, the current SAMOA readers only support the ARFF
>> format. Do you think such a feature would be useful to SAMOA in general?
>> Avro allows two encodings for the data: binary and JSON. Hence Avro
>> support may also allow users with JSON data to use SAMOA.
>>
>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
>> Format document in Google Docs.
>>
>>
>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
>>
>>
>> Would it be possible for you to have a look and provide your valuable
>> suggestions ? Thanks
>>
>>
>> Thanks
>> Jay
>> https://github.com/jayadeepj
>>
>
>


-- 
Thanks
Jay


Jayadeep J
Mob: (+91) - 9176669142
