samoa-dev mailing list archives

From Gianmarco De Francisci Morales <g...@apache.org>
Subject Re: Avro Support for SAMOA
Date Mon, 26 Oct 2015 07:09:05 GMT
Hi Jay,

1) I agree custom data types would be overkill.
I was thinking of the second option you mentioned, distinguishing it inside
the code.
So the parser code would expect either all values to be optional, or all
values to be required.
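
A minimal sketch of that all-optional-or-all-required check, written in Python for brevity rather than SAMOA's Java. It reads the Avro schema as plain JSON using only the standard library. The function names `is_optional` and `instance_kind` and the two-field iris schemas are hypothetical, made up for illustration; they are not part of SAMOA or Avro.

```python
import json


def is_optional(field_type):
    """Return True if an Avro field type is a union that includes "null"."""
    return isinstance(field_type, list) and "null" in field_type


def instance_kind(schema_json):
    """Classify a record schema as "sparse" or "dense" (hypothetical helper).

    Assumption: all fields optional means sparse, all fields required
    means dense, and a mix of the two is rejected.
    """
    schema = json.loads(schema_json)
    flags = [is_optional(field["type"]) for field in schema["fields"]]
    if flags and all(flags):
        return "sparse"
    if not any(flags):
        return "dense"
    raise ValueError("schema mixes optional and required attributes")


# Illustrative schemas only; field names borrowed from the iris example.
sparse_schema = json.dumps({
    "type": "record",
    "name": "Iris",
    "fields": [
        {"name": "sepallength", "type": ["null", "double"]},
        {"name": "sepalwidth", "type": ["null", "double"]},
    ],
})
dense_schema = json.dumps({
    "type": "record",
    "name": "Iris",
    "fields": [
        {"name": "sepallength", "type": "double"},
        {"name": "sepalwidth", "type": "double"},
    ],
})

print(instance_kind(sparse_schema))  # sparse
print(instance_kind(dense_schema))   # dense
```

Rejecting mixed schemas is only one possible choice; the looser rule from the quoted message would treat any schema with at least one optional attribute as sparse.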

I think the plan you have in mind is quite reasonable.
I don't have other suggestions right now.

Thanks,

--
Gianmarco

On 21 October 2015 at 11:39, Jayadeep J <jayadeepj@gmail.com> wrote:

> Hi Gianmarco,
>
> Thanks for your reply. Regarding the points you mentioned,
>
> 1) W.r.t. sparse and dense instances, I am trying to understand what you
> meant by "prototypes". Did you mean creating custom Avro data types like
> 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes, the actual
> data stored in the file (JSON encoded) may become heavy. For example, for
> the iris data-set, if we decide to use a 'SparseNumeric' type for
> 'sepallength',
>
> {"name":"sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
>
> the data may look like this,
>
> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
>
> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
>
> Converting existing Avro data into 'SAMOA-compatible Avro' may become
> painful for users. Wouldn't it be easier if we just distinguish it inside
> the code? Say, if at least one attribute in the metadata uses the generic
> Avro optionality (e.g. ["null", "double"]), then we call
> readInstanceSparse() in the Loader and map correspondingly. Or is
> there some other complexity that I have not looked at?
>
> 2) Yes. Skipping the Date-type attributes will make it easier!
>
> Regarding the engineering aspects,
>
> We can have the Avro dependency in the deployable jar of SAMOA. In the
> code, maybe
>
> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
> 2) Maybe a different Reader altogether for handling the binary stream
> 3) A user option to switch between JSON/Binary encoding
>
> If there is a better way to do it, kindly advise.
>
> Thanks
> Jay
> https://github.com/jayadeepj
>
> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
> gdfm@apache.org> wrote:
>
>> Hi Jayadeep,
>>
>> I think it's pretty cool!
>> If we get both Avro and Kafka support right, we can connect to almost
>> anything.
>>
>> The document looks very comprehensive, you seem to have given a lot of
>> thought to it.
>> I am not extremely familiar with Avro myself, I've just used it a couple
>> of times, but I'll try to provide some suggestions.
>>
>> - The general idea of where and how to store data and meta-data seems
>> right.
>> - In general, all attributes in a sparse instance are optional, and all
>> attributes in a dense instance are required. Maybe we want to be more
>> granular than this in the future, but it seems that Avro supports a
>> superset of these settings. We may want to have some default "prototypes"
>> in order to make mapping the current dense/sparse instances easy.
>> - Right now we are not making use of Date-type attributes in SAMOA (there
>> is no such thing in samoa-instances), so if it makes it easier we could
>> skip supporting it. Ideally we could have algorithms that respect
>> event-time as provided by timestamps in the instances (as opposed to
>> processing the event whenever it arrives), however we are not there yet :)
>>
>> All the rest seems pretty straightforward.
>>
>> Moving to the more software-engineering oriented aspects, where would we
>> have dependencies for Avro? And how should they be deployed? Would they
>> simply go inside the deployable uber-jar of SAMOA?
>>
>> Thanks,
>>
>> --
>> Gianmarco
>>
>> On 19 October 2015 at 11:24, Jayadeep J <jayadeepj@gmail.com> wrote:
>>
>>> Hi Gianmarco / All,
>>>
>>> I am working on an integration of SAMOA with Apache Avro. Basically, I
>>> want to use data stored in Avro files as input to SAMOA.
>>>
>>> As I understand it, current SAMOA readers only support the ARFF format.
>>> Do you think such a feature would be useful to SAMOA in general? Avro
>>> allows two encodings for the data: Binary & JSON. Hence Avro support may
>>> also allow users with JSON data to use SAMOA.
>>>
>>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
>>> Format document in Google Docs.
>>>
>>>
>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
>>>
>>>
>>> Would it be possible for you to have a look and provide your valuable
>>> suggestions?
>>>
>>>
>>> Thanks
>>> Jay
>>> https://github.com/jayadeepj
>>>
>>
>>
>
>
> --
> Thanks
> Jay
>
>
> Jayadeep J
> Mob: (+91) - 9176669142
>
