gora-dev mailing list archives

From John Mora <jhnmora...@gmail.com>
Subject Re: Kudu datastore reports
Date Wed, 10 Jul 2019 21:17:23 GMT
Hi Alfonso,

Thanks so much for your time and support for this project. I will work on
your comments. Responses inline :)


On Tue, Jul 9, 2019 at 16:38, Alfonso Nishikawa
(<alfonso.nishikawa@gmail.com>) wrote:

> Hi, John.
>
> Sorry for the delay, I am changing work and I have been very busy :( I
> will try to answer your questions :)
>
> *> In the Employee example there is a field called 'dateOfBirth'. I tried
> to map that field with the UNIXTIME_MICROS datatype of Kudu (I intuitively
> assumed this is a date.). However, in the java world the Employee field is
> a Long value and the kudu datatype is a Timestamp. So, I was wondering
> whether I should force the usage of the UNIXTIME_MICROS datatype for this
> field or just use a LONG datatype in Kudu.*
>
> "Logical Types" were introduced in Avro 1.8, so there is a "date" type with
> an underlying "int" [1]. It's the first time I read about them, because
> until the last Avro version upgrade they weren't there. I would suggest
> ignoring "dates" and mapping dateOfBirth as long, since in any case -in
> Avro- the value is counted from the unix epoch. After this first approach,
> a design improvement would be great, though :)
>
> - Would it be good to have a "timestamp" type in the mapping, so that
> KuduStore converts between the Entity long field <-> Kudu timestamp storage?
> - Is there any other approach?
>

I think the Entity long field <-> Kudu timestamp conversion is the best
alternative right now, because it would add more compatible datatypes to
the mapping parameters that users can use. This conversion should not be
difficult to implement, in my opinion. Support for the new Avro date
datatype could be implemented in later versions, because it would need
further analysis in the other datastores too. I will work on that.
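
A minimal sketch of that conversion, assuming the Entity field holds epoch
milliseconds (the thread does not confirm the unit) and that the Kudu column
stores microseconds since the epoch. The class and method names are
hypothetical and not part of Gora or the Kudu client:

```java
/** Hypothetical helper; not part of Gora or the Kudu client. */
final class EpochConversion {
  private EpochConversion() {}

  /** Entity value (assumed epoch milliseconds) -> microseconds for a UNIXTIME_MICROS column. */
  static long millisToMicros(long epochMillis) {
    // multiplyExact throws ArithmeticException on overflow instead of silently wrapping.
    return Math.multiplyExact(epochMillis, 1000L);
  }

  /** Stored microseconds -> epoch milliseconds, rounding toward negative infinity. */
  static long microsToMillis(long epochMicros) {
    return Math.floorDiv(epochMicros, 1000L);
  }

  public static void main(String[] args) {
    long dateOfBirth = 631152000000L; // 1990-01-01T00:00:00Z in epoch milliseconds
    long stored = millisToMicros(dateOfBirth);
    System.out.println(stored + " -> " + microsToMillis(stored));
  }
}
```

Reading back truncates any sub-millisecond part, so a value written by another
client with microsecond precision would not round-trip exactly.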


>
>
> *> What is Gora's policy regarding flush()?*
> *> KuduClient has multiple flushing modes
> <https://kudu.apache.org/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html> and
> can also set a time interval
> <https://kudu.apache.org/releases/1.2.0/apidocs/org/apache/kudu/client/KuduSession.html#setFlushInterval-int->
> for automatic flush.*
> *> Should these behaviors be configurable using the gora.properties file, or
> should I just use the default configurations?*
>
> What we do in HBase is configure an autoflush option in gora.properties
> [2], which is used when the Table is instantiated, but at the same time we
> implement the flush() method to force the flush [3]. I would suggest
> following that example, but adding the flushing options of Kudu. What
> flushing mode (and time interval, if it applies) do you suggest?
>

Well, IMHO the default flush mode (auto flush sync) will do the job for
most use cases. But I will add a configuration in gora.properties for
selecting the other modes and specifying an autoflush interval if the user
needs one.
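
As a rough sketch of how such a configuration could be read: the property
keys below are invented for illustration (no key names have been agreed on),
and the local enum only mirrors the constant names of Kudu's
SessionConfiguration.FlushMode so that the example runs without the Kudu
client on the classpath.

```java
import java.util.Locale;
import java.util.OptionalInt;
import java.util.Properties;

/** Hypothetical settings holder for the Kudu flush options in gora.properties. */
final class KuduFlushSettings {
  /** Local stand-in that mirrors the names in Kudu's SessionConfiguration.FlushMode. */
  enum FlushMode { AUTO_FLUSH_SYNC, AUTO_FLUSH_BACKGROUND, MANUAL_FLUSH }

  static final String MODE_KEY = "gora.datastore.kudu.flushMode";         // assumed key
  static final String INTERVAL_KEY = "gora.datastore.kudu.flushInterval"; // assumed key, in ms

  final FlushMode mode;
  final OptionalInt intervalMillis;

  private KuduFlushSettings(FlushMode mode, OptionalInt intervalMillis) {
    this.mode = mode;
    this.intervalMillis = intervalMillis;
  }

  /** Falls back to AUTO_FLUSH_SYNC and to "no interval set" when a key is missing. */
  static KuduFlushSettings from(Properties props) {
    String rawMode = props.getProperty(MODE_KEY, FlushMode.AUTO_FLUSH_SYNC.name());
    FlushMode mode = FlushMode.valueOf(rawMode.trim().toUpperCase(Locale.ROOT));
    String rawInterval = props.getProperty(INTERVAL_KEY);
    OptionalInt interval = rawInterval == null
        ? OptionalInt.empty()
        : OptionalInt.of(Integer.parseInt(rawInterval.trim()));
    return new KuduFlushSettings(mode, interval);
  }
}
```

In the store, the parsed values would then be handed to
KuduSession.setFlushMode(...) and, only when an interval is present, to
KuduSession.setFlushInterval(int), while flush() keeps forcing a flush
explicitly as in the HBase store.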


>
> *> Also, while reviewing the datastore interface I noticed this method
> 'getPartitions(Query<K, T> query)'. What is the expected behavior of this
> method? Should I use the partition definition in the xml mapping file for
> this?*
>
> The method getPartitions(Query) is related to Hadoop. Apache Gora
> integrates with Hadoop by implementing a custom Map and Reduce that allow
> getting/writing Entities directly.
> You can take a look at HBase's implementation [4], which relies on o.a.h.hbase.mapreduce.TableInputFormatBase
> [5] to compute the splits (start key---end key) with the location of each
> split to create a collection of partitions [6].
>
> So, if Kudu allows performing computation on local Kudu splits, then this
> method does the preparation needed to "send the computation to where the
> data is locally".
>
> In any case, you can see that:
>
>    - MongoDB store implementation does not implement splitting [7]
>    - Cassandra store implementation does not implement splitting [8]
>    - Aerospike store implementation does not implement splitting [9]
>    - Accumulo store implementation *does* implement splitting [10]
>
> If Kudu has a method to get the different splits for a table and its
> locations, then you will be able to implement the full feature.
>
> This is Hadoop related and it is not trivial. I haven't elaborated much,
> so if you find you need more information let me know :)
>
>
>
I will check whether Kudu has these features in order to implement this
method. If not, I will use the default implementation found in other
backends.
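
To make the two options concrete, here is a small self-contained model of the
idea. All types are hypothetical stand-ins (they are not Gora or Kudu
classes), keys are simplified to long values, and ranges are treated as
half-open. With no split information, the whole query becomes one partition;
otherwise the query range is intersected with each split and keeps that
split's location.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of split-based partitioning; none of these types exist in Gora or Kudu. */
final class PartitionSketch {

  /** A key range [startKey, endKey) plus the host that stores it (null when unknown). */
  static final class Split {
    final long startKey;
    final long endKey;
    final String location;

    Split(long startKey, long endKey, String location) {
      this.startKey = startKey;
      this.endKey = endKey;
      this.location = location;
    }

    @Override
    public String toString() {
      return "[" + startKey + ", " + endKey + ") @ " + location;
    }
  }

  /** Intersects the query range with every table split; falls back to one partition. */
  static List<Split> partitionsFor(long queryStart, long queryEnd, List<Split> tableSplits) {
    List<Split> partitions = new ArrayList<>();
    if (tableSplits.isEmpty()) {
      // No split information: a single partition covers the whole query.
      partitions.add(new Split(queryStart, queryEnd, null));
      return partitions;
    }
    for (Split split : tableSplits) {
      long start = Math.max(queryStart, split.startKey);
      long end = Math.min(queryEnd, split.endKey);
      if (start < end) { // keep only splits that overlap the query range
        partitions.add(new Split(start, end, split.location));
      }
    }
    return partitions;
  }
}
```

For example, a query over [50, 150) against splits [0, 100) on "node-a" and
[100, 200) on "node-b" yields two partitions, one per node, so each map task
can be scheduled next to its data.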


> About Queries, what I can tell you is that HBase only implements "Start
> key" + "End key" because it has only 2 operations: "get" and "scan", and
> the querying is for the "scan" operation, where you want an interval (or
> all) of the rows. Does Kudu have more querying functionality?
>
>
Yes, Kudu implements a Scanner for querying data, along with conditional
predicates for filtering. I am using those classes.
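
The semantics I am aiming for can be modeled in memory as a key-range scan
plus a filter. This sketch does not call the Kudu client at all; the class
name is hypothetical, and treating the end key as inclusive is an assumption
of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.function.Predicate;

/** In-memory model of "start key + end key" scanning with an extra filter predicate. */
final class ScanSketch {
  private ScanSketch() {}

  /** Returns the rows whose key lies in [startKey, endKey] and that pass the filter. */
  static <V> List<V> scan(NavigableMap<String, V> table, String startKey, String endKey,
                          Predicate<? super V> filter) {
    List<V> rows = new ArrayList<>();
    // subMap narrows the scan to the key interval, like an HBase-style range scan.
    for (V row : table.subMap(startKey, true, endKey, true).values()) {
      if (filter.test(row)) { // the predicate plays the role of a column filter
        rows.add(row);
      }
    }
    return rows;
  }
}
```

The difference with a real Kudu scan is where the filter runs: here it is
applied by the caller after reading each row, whereas scanner predicates are
meant to be evaluated by the storage layer.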


> On another topic, I am trying to install Kudu in standalone mode (all in 1
> node). Do you use a Cloudera installation or do you have a standalone
> installation? How do you do it? I found some instructions, but they talk
> about compiling Kudu [11]. I was looking for something like HBase, where
> it is just unzip + execute "hbase start".
>
>
I am using an embedded mini-cluster, which comes with compiled binaries and
can be used with maven [1], for testing my code. Once the code is mature
enough, I think I will test the datastore with a docker container [2]. I
could not find an unzip+execute bundle either, and I am kind of a noob at
compiling it myself.

[1]
https://kudu.apache.org/docs/developing.html#_jvm_based_integration_testing
[2] https://hub.docker.com/r/usuresearch/apache-kudu/


> Good job and thank you!! :)
>
> Regards,
>
> Alfonso Nishikawa
>
>
> [1] - https://avro.apache.org/docs/1.8.0/spec.html#Logical+Types
> [2] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L175
> [3] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L458
> [4] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L472
> [5] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L479
> [6] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L517
> [7] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-mongodb/src/main/java/org/apache/gora/mongodb/store/MongoStore.java#L533
> [8] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L292
> [9] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-aerospike/src/main/java/org/apache/gora/aerospike/store/AerospikeStore.java#L369
> [10] -
> https://github.com/apache/gora/blob/apache-gora-0.9/gora-accumulo/src/main/java/org/apache/gora/accumulo/store/AccumuloStore.java#L902
> [11] - https://kudu.apache.org/docs/installation.html
>
>
> On Mon, Jul 8, 2019 at 3:42, John Mora (<jhnmora000@gmail.com>)
> wrote:
>
>> Hi all.
>>
>> As every week, I updated my report in the Wiki [1]. Also, I pushed my
>> latest commits to my branch [2]. Please give it a look if you have time.
>>
>> This week, I will continue working on the Queries implementation.
>> Please reach out to me if you have any suggestions.
>>
>> Also, while reviewing the datastore interface I noticed this method
>> 'getPartitions(Query<K, T> query)'. What is the expected behavior of this
>> method? Should I use the partition definition in the xml mapping file for
>> this?
>>
>> Cheers,
>> John.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>
>>
>> On Sun, Jun 30, 2019 at 16:56, John Mora (<jhnmora000@gmail.com>)
>> wrote:
>>
>>> Hi all.
>>>
>>> I received my first evaluation from the Google Summer of Code program
>>> with a positive result. Thanks so much for your support and confidence in
>>> the project and me.
>>>
>>> I updated my report of this week in the Wiki [1]. Also, I pushed my latest
>>> commits to my branch [2].
>>>
>>> This week, I will be reviewing the serialization/deserialization
>>> process in order to identify optimizations specific to Kudu, because I
>>> used generic methods from other backends which could probably be better
>>> tuned for Kudu. Also, I will start working on the Queries implementation.
>>>
>>> BTW, I added a question to the wiki about Date types. Please give it a
>>> look if you have time.
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>>
>>> Cheers,
>>> John
>>>
>>> On Thu, Jun 27, 2019 at 21:02, John Mora (<jhnmora000@gmail.com>)
>>> wrote:
>>>
>>>> Hi Carlos.
>>>>
>>>> Thanks for the reminder. I submitted the form yesterday. :D
>>>>
>>>> Best,
>>>> John.
>>>>
>>>> On Thu, Jun 27, 2019 at 17:34, carlos muñoz (<carlosrmng@gmail.com>)
>>>> wrote:
>>>>
>>>>> Hi John
>>>>>
>>>>> The first Google Summer of Code evaluation is due on June 28th. Please
>>>>> make sure you submit your Mentors' evaluation on time.
>>>>>
>>>>> Regards,
>>>>> Carlos
>>>>>
>>>>> On Sun, Jun 23, 2019 at 18:29, John Mora (<jhnmora000@gmail.com>)
>>>>> wrote:
>>>>>
>>>>>> Hi all.
>>>>>>
>>>>>> FYI, I updated my report of this week on the Wiki [1]. Also, I pushed
>>>>>> my latest commits to my branch [2].
>>>>>>
>>>>>> As I mentioned in the reports, I would like to know how datastores
>>>>>> deal with flush(). Should it always be executed manually?
>>>>>>
>>>>>> Finally, this week I will be implementing object
>>>>>> serialization/deserialization in the methods put, get, delete, exists.
>>>>>> Do you have any suggestions on how to proceed with this task?
>>>>>>
>>>>>> Footnote: Thanks for the feedback Carlos, I fixed the problem.
>>>>>>
>>>>>> [1]
>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>>>>>
>>>>>> Cheers,
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 17, 2019 at 22:58, carlos muñoz
>>>>>> (<carlosrmng@gmail.com>) wrote:
>>>>>>
>>>>>>> Hi John
>>>>>>>
>>>>>>> Your last changes look good to me. Keep it up. But I noticed that
>>>>>>> you have created an Enumeration for datatypes [1], which is very
>>>>>>> similar to the kudu-client's [2]. You should probably replace [1]
>>>>>>> with [2] in order to avoid code duplication.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/mapping/Column.java#L76
>>>>>>> [2] https://kudu.apache.org/apidocs/org/apache/kudu/Type.html
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Carlos
>>>>>>>
>>>>>>> On Sat, Jun 15, 2019 at 12:01, John Mora (<jhnmora000@gmail.com>)
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all.
>>>>>>>>
>>>>>>>> I updated my report of this week on the Wiki [1]. I noticed that my
>>>>>>>> code is lacking some javadoc documentation, so I think I will be
>>>>>>>> working on that this week. I would also like to enable and check
>>>>>>>> the schema management tests (createSchema, existsSchema, etc.).
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> John.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 11, 2019 at 0:11, John Mora (<jhnmora000@gmail.com>)
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Alfonso.
>>>>>>>>>
>>>>>>>>> Thanks so much for your feedback. I am working on your comments.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>> On Mon, Jun 10, 2019 at 16:11, Alfonso Nishikawa
>>>>>>>>> (<alfonso.nishikawa@gmail.com>) wrote:
>>>>>>>>>
>>>>>>>>>> Hi, John.
>>>>>>>>>>
>>>>>>>>>> Regarding your questions at the report [1]:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - How to represent partitioning configurations on the mapping
>>>>>>>>>>    file.
>>>>>>>>>>
>>>>>>>>>> This was discussed in other emails, wasn't it? :)
>>>>>>>>>>
>>>>>>>>>>    - KuduTestHarness requires the Maven plugin os-maven-plugin,
>>>>>>>>>>    which needs Maven 3.1.1+, is it a problem for Apache Gora?
>>>>>>>>>>
>>>>>>>>>> I believe it is not a problem. My Ubuntu comes with 3.6.0, far
>>>>>>>>>> from 3.1.1, and I assume everyone uses a fairly recent version of
>>>>>>>>>> Maven 3 :)
>>>>>>>>>>
>>>>>>>>>> [1] -
>>>>>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Alfonso Nishikawa
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 10, 2019 at 21:07, Alfonso Nishikawa
>>>>>>>>>> (<alfonso.nishikawa@gmail.com>) wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, John.
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>> Things I have seen:
>>>>>>>>>>>
>>>>>>>>>>> - The version of a maven dependency [1] should go in the
>>>>>>>>>>> Dependency Management of the root pom [2]. The same applies to [3]
>>>>>>>>>>> and the dependencies after it; the version should not be set there.
>>>>>>>>>>> - Set test dependencies' scope to test, at [4] and the ones after it.
>>>>>>>>>>> - Set the indentation to 2 spaces for the pom [5]
>>>>>>>>>>> - Missing "t" in "localhost" at [6].
>>>>>>>>>>> - Port 13 for Kudu? That is "Daytime Protocol" RFC 867 and you
>>>>>>>>>>> will need root permission to run it. The default port for kudu is 7051,
>>>>>>>>>>> isn't it?
>>>>>>>>>>> - I would ask you to add the same functionality to load the
>>>>>>>>>>> mapping from configuration as in HBase's store [7] in your KuduStore [8].
>>>>>>>>>>> This will have implications on your readMapping at [9], so take a look at
>>>>>>>>>>> the one for HBase at [10]
>>>>>>>>>>> - I know it is in other backends, but avoid RuntimeExceptions
>>>>>>>>>>> (at least in Java, since we have the checked ones) like in [11]. You can
>>>>>>>>>>> wrap them in GoraException. An example is [12]
>>>>>>>>>>>
>>>>>>>>>>> And nothing more :)
>>>>>>>>>>> Keep going, good job.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [1] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml#L98
>>>>>>>>>>> [2] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/pom.xml#L890
>>>>>>>>>>> [3] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml#L121
>>>>>>>>>>> [4] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml#L180
>>>>>>>>>>> [5] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml
>>>>>>>>>>> [6] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/test/resources/gora.properties#L18
>>>>>>>>>>> [7] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L92
>>>>>>>>>>> [8] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/store/KuduStore.java#L53
>>>>>>>>>>> [9] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/mapping/KuduMappingBuilder.java#L81
>>>>>>>>>>> [10] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L822
>>>>>>>>>>> [11] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/mapping/KuduMappingBuilder.java#L141
>>>>>>>>>>> [12] -
>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L268
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Alfonso Nishikawa
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 8, 2019 at 20:26, John Mora
>>>>>>>>>>> (<jhnmora000@gmail.com>) wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all.
>>>>>>>>>>>>
>>>>>>>>>>>> I have just updated my weekly reports on Cwiki [1]. This next
>>>>>>>>>>>> week I think I should focus on the create schema operation and
>>>>>>>>>>>> on solving the issue of the partitioning configurations in the
>>>>>>>>>>>> mapping file.
>>>>>>>>>>>>
>>>>>>>>>>>> Please let me know if you have suggestions. My latest commits
>>>>>>>>>>>> are available here [2]
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>>>>>>>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> John
>>>>>>>>>>>>
>>>>>>>>>>>>
