asterixdb-dev mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: Metadata changes
Date Tue, 15 Dec 2015 23:41:17 GMT
It seems like we have at least one major "compat-buster" release that we 
might want to make before we ratchet ourselves up to the 
must-be-backwards-compatible level of releases...?  (It would fix the 
metadata issues that UCR has raised and lay the versioning foundation.)

On 12/14/15 7:30 PM, Murtadha Hubail wrote:
>> On Dec 14, 2015, at 7:09 PM, Till Westmann <tillw@apache.org> wrote:
>>
>> On 14 Dec 2015, at 18:55, Murtadha Hubail wrote:
>>
>>> I think the backward compatibility discussion goes beyond metadata indexes, and
>>> a complete plan that considers everything in storage should be developed to support
>>> upgrading and patching. Just as an example, when we did the repackaging from edu.uci
>>> to org.apache, all existing instances on edu.uci wouldn’t work on the new binaries
>>> due to Java serialization of edu.uci classes.
>> Good point. Do you know if we fixed that or did we just leave it as-is?
>>
> It is still as-is for the LocalResource class, but it is in my TODO queue. Even this change
> will break all existing instances and will require adding a serialization version attribute
> in the new serialization format to support backward compatibility.
>
>>> Having said that, I would go with the right long-term solution for metadata indexes,
>>> which would’ve been a result of the backward compatibility plan if we had one.
>> I tend to agree here. I think that we’ll need a backwards compatibility story,
>> even if we choose to be schema-less for all metadata.
>> 1) Even if the metadata is all flexible, we’ll be able to read the old metadata,
>> but we’ll need to keep code around to read all versions of the metadata.
>> 2) If we need to change the file format for the data, we’ll also need a way to realize
>> that (and that would probably affect the metadata as well).
>>
>> I think that it might be a good start to add version identifiers to persisted data
>> structures, so that we’d at least be able to distinguish different versions (and potentially
>> have the ability to provide some migration, if needed).
> Agreed.
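
The version-identifier idea above can be sketched roughly as follows. This is only an illustration in Python, not AsterixDB code: the record layout, the field names (`name`, `dataverse`), the `"Default"` fill-in value, and all function names are assumptions made up for the example. Each persisted record starts with a 2-byte format version, the reader keeps one decoder per known version, and old records are upgraded step by step after decoding.

```python
import struct

CURRENT_VERSION = 2
_HEADER = struct.Struct(">H")  # 2-byte big-endian format version (assumed layout)


def _write_str(value: str) -> bytes:
    """Length-prefixed UTF-8 string."""
    data = value.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def _read_str(buf: bytes, offset: int):
    """Return (string, offset just past it)."""
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    return buf[start:start + length].decode("utf-8"), start + length


def encode(record: dict) -> bytes:
    """Always write the current format, prefixed with its version."""
    return (_HEADER.pack(CURRENT_VERSION)
            + _write_str(record["name"])
            + _write_str(record["dataverse"]))


def _decode_v1(buf: bytes, offset: int) -> dict:
    # Hypothetical v1 layout: name only.
    name, _ = _read_str(buf, offset)
    return {"name": name}


def _decode_v2(buf: bytes, offset: int) -> dict:
    # Hypothetical v2 layout: name followed by dataverse.
    name, offset = _read_str(buf, offset)
    dataverse, _ = _read_str(buf, offset)
    return {"name": name, "dataverse": dataverse}


def _migrate_v1_to_v2(record: dict) -> dict:
    # Assumed default for the field that v1 did not store.
    return {**record, "dataverse": "Default"}


DECODERS = {1: _decode_v1, 2: _decode_v2}
MIGRATIONS = {1: _migrate_v1_to_v2}  # version n -> version n + 1


def decode(buf: bytes) -> dict:
    """Read any known version and upgrade it to the current shape."""
    (version,) = _HEADER.unpack_from(buf, 0)
    if version not in DECODERS:
        raise ValueError(f"unsupported format version {version}")
    record = DECODERS[version](buf, _HEADER.size)
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version += 1
    return record
```

The cost Till mentions in point 1 shows up directly: `_decode_v1` and `_migrate_v1_to_v2` have to stay in the code base for as long as v1 data may still exist on disk.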
>
>> Thoughts?
>>
>> Cheers,
>> Till
>>
>>>> On Dec 14, 2015, at 6:19 PM, Ildar Absalyamov <ildar.absalyamov@gmail.com> wrote:
>>>>
>>>> As for the general topic of backwards compatibility, I think going “fully open”
>>>> might be the best long-term solution.
>>>> Once in a while the topic of changing metadata keeps reappearing, and there
>>>> is no guarantee it will not strike back again. Opening up the metadata will release us
>>>> from the burden of producing migration tools and shipping them with the new version of
>>>> the binaries with a revised catalog.
>>>> The performance (mainly storage) impact of that solution will be tolerable,
>>>> especially considering how little data is usually stored in metadata.
>>>> Moreover, being big proponents of semi-structured data, it does make perfect
>>>> sense for us to eat our own dog food here.
>>>>
>>>>> On Dec 14, 2015, at 18:04, Ildar Absalyamov <ildar.absalyamov@gmail.com> wrote:
>>>>>
>>>>> I guess the main argument for 2 would be eliminating broken metadata
>>>>> records prior to the backwards compatibility cutoff.
>>>>> The last thing we want to do is to be stuck with the wrong implementation
>>>>> for compatibility reasons. Once the functionality needed for 3 is there, we can reintroduce
>>>>> those indexes without building a sophisticated migration subsystem.
>>>>>
>>>>>> On Dec 14, 2015, at 17:55, Mike Carey <dtabass@gmail.com> wrote:
>>>>>>
>>>>>> SO - it seems like 3 is the right long-term answer, but not doable now?
>>>>>> (If it was doable now, it would obviously be the ideal choice of the three.)
>>>>>> What would be the argument for doing 2 as opposed to 1 for now?
>>>>>> As for the question of backwards compatibility, I actually didn't
>>>>>> sense a consensus yet.
>>>>>> I would tentatively lean towards "right" over "backwards compatible"
>>>>>> for this change.
>>>>>> What are others' thoughts on that?
>>>>>> (Soon we won't have that luxury, but right now maybe we do?)
>>>>>>
>>>>>> On 12/14/15 3:43 PM, Steven Jacobs wrote:
>>>>>>> We just had a UCR discussion on this topic. The issue is really with the
>>>>>>> third "index" here. The code now is using one "index" to go in two
>>>>>>> directions:
>>>>>>> 1) To find datatypes that use datatype A
>>>>>>> 2) To find datatypes that are used by datatype A.
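
One way to picture the two directions as two separate lookup structures is the small Python sketch below. It is only an illustration: the class and method names are invented for this example and do not correspond to AsterixDB's metadata code, which is written in Java.

```python
from collections import defaultdict


class DatatypeDependencyIndex:
    """Hypothetical in-memory model: one map per lookup direction."""

    def __init__(self):
        self._uses = defaultdict(set)     # A -> datatypes that A refers to
        self._used_by = defaultdict(set)  # A -> datatypes that refer to A

    def add(self, datatype: str, nested: str) -> None:
        """Record that `datatype` has a field of type `nested`."""
        self._uses[datatype].add(nested)
        self._used_by[nested].add(datatype)

    def remove(self, datatype: str, nested: str) -> None:
        """Forget that `datatype` refers to `nested`."""
        self._uses[datatype].discard(nested)
        self._used_by[nested].discard(datatype)

    def used_by(self, datatype: str) -> set:
        """Direction 1: datatypes that use `datatype`."""
        return set(self._used_by.get(datatype, ()))

    def uses(self, datatype: str) -> set:
        """Direction 2: datatypes that `datatype` itself uses."""
        return set(self._uses.get(datatype, ()))
```

Keeping the two maps separate means each one has a single, well-defined key, at the price of updating both on every change.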
>>>>>>>
>>>>>>> The way that it works now is hacked together, but designed for performance.
>>>>>>> So we have three choices here:
>>>>>>>
>>>>>>> 1) Stick to the status quo, and leave the "indexes" as they are
>>>>>>> 2) Remove the Metadata secondary indexes, which will eliminate the hack but
>>>>>>> cost some performance on Metadata
>>>>>>> 3) Implement the Metadata secondary indexes correctly as Asterix indexes.
>>>>>>> For this solution to work with our dataset designs, we will need to have
>>>>>>> the ability to index homogeneous lists. In addition, we will have backward
>>>>>>> compatibility issues unless we plan things out for the transition.
>>>>>>>
>>>>>>> What are the thoughts?
>>>>>>>
>>>>>>>
>>>>>>> Orthogonally, it seems that the consensus for storing the datatype
>>>>>>> dataverse in the dataset Metadata is to just add it as an open field, at
>>>>>>> least for now. Is that correct?
>>>>>>>
>>>>>>> Steven
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 14, 2015 at 1:23 PM, Mike Carey <dtabass@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thoughts inlined:
>>>>>>>>
>>>>>>>> On 12/14/15 11:12 AM, Steven Jacobs wrote:
>>>>>>>>
>>>>>>>>> Here are the conclusions that Ildar and I have drawn from looking at the
>>>>>>>>> secondary indexes:
>>>>>>>>>
>>>>>>>>> First of all, it seems that datasets are local to node groups, but
>>>>>>>>> dataverses can span node groups, which seems a little odd to me.
>>>>>>>>>
>>>>>>>> Node groups are an undocumented but to-be-exploited-someday feature that
>>>>>>>> allows datasets to be stored on fewer than all nodes in a given cluster.  As
>>>>>>>> we face bigger clusters, we'll want to open up that possibility.  We will
>>>>>>>> hopefully use them internally w/o having to make users manage them manually
>>>>>>>> like parallel DB2 did/does.  Dataverses are really just a namespace thing,
>>>>>>>> not a storage thing at all, so they are orthogonal to (and unrelated to)
>>>>>>>> node groups.
>>>>>>>>
>>>>>>>>> There are three Metadata secondary indexes:  GROUPNAME_ON_DATASET_INDEX,
>>>>>>>>> DATATYPENAME_ON_DATASET_INDEX, DATATYPENAME_ON_DATATYPE_INDEX
>>>>>>>>>
>>>>>>>>> The first is used in only one case:
>>>>>>>>> When dropping a node group, check if there are any datasets using this
>>>>>>>>> node group. If so, don't allow the drop.
>>>>>>>>> BUT, this index has a field called "dataverse" which is not used at all.
>>>>>>>>>
>>>>>>>> This one seems like a waste of space since we do this almost never. (Not
>>>>>>>> much space, but unnecessary.)  If we keep it, it should become a proper
>>>>>>>> index.
>>>>>>>>
>>>>>>>>> The second is used when dropping a datatype. If there is a dataset using
>>>>>>>>> this datatype, don't allow the drop.
>>>>>>>>> Similarly, this index has a "dataverse" field which is never used.
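
The two drop checks described so far can be expressed without any secondary index as a plain scan over the dataset records, which is roughly what removing the indexes would amount to. The Python sketch below is only an illustration under that assumption: `Catalog`, `MetadataError`, and the record fields are invented for this example and are not AsterixDB's actual API.

```python
class MetadataError(Exception):
    """Raised when a drop would leave a dangling reference."""


class Catalog:
    """Hypothetical in-memory stand-in for the Dataset metadata."""

    def __init__(self):
        # (dataverse, dataset name) -> {"nodegroup": str, "datatype": (dataverse, name)}
        self.datasets = {}
        self.nodegroups = set()
        self.datatypes = set()  # of (dataverse, name) pairs

    def _datasets_where(self, field, value):
        # Full scan: acceptable only because metadata is small and drops are rare.
        return [key for key, info in self.datasets.items() if info[field] == value]

    def drop_nodegroup(self, name):
        users = self._datasets_where("nodegroup", name)
        if users:
            raise MetadataError(f"node group {name} is used by datasets {users}")
        self.nodegroups.discard(name)

    def drop_datatype(self, dataverse, name):
        # The datatype is identified by dataverse AND name, so a dataset in one
        # dataverse can safely reference a type defined in another.
        users = self._datasets_where("datatype", (dataverse, name))
        if users:
            raise MetadataError(f"datatype {dataverse}.{name} is used by datasets {users}")
        self.datatypes.discard((dataverse, name))
```

A proper secondary index would replace `_datasets_where` with a keyed lookup; the guard logic around it would stay the same.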
>>>>>>>>>
>>>>>>>> You're about to use the dataverse part, right?  :-)  This index seems like
>>>>>>>> it will be useful but should be a proper index.
>>>>>>>>
>>>>>>>>> The third index is used in two cases, using two different ideas of
>>>>>>>>> "keys".
>>>>>>>>> It seems like this should actually be two different indexes.
>>>>>>>>>
>>>>>>>> I don't think I understood this comment....
>>>>>>>>
>>>>>>>>
>>>>>>>>> This is my understanding so far. It would be good to discuss what the
>>>>>>>>> "correct" version should be.
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 14, 2015 at 10:12 AM, Steven Jacobs <sjaco002@ucr.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>> I'm implementing a change so that datasets can use datatypes from
>>>>>>>>>> alternate dataverses (previously the type and set had to be from the
>>>>>>>>>> same dataverse). Unfortunately this means another change for Dataset Metadata
>>>>>>>>>> (which will now store the dataverse for its type).
>>>>>>>>>>
>>>>>>>>>> As such, I had a couple of questions:
>>>>>>>>>>
>>>>>>>>>> 1) Should this change be thrown into the release branch, as it is another
>>>>>>>>>> Metadata change?
>>>>>>>>>>
>>>>>>>>>> 2) In implementing this change, I've been looking at the Metadata
>>>>>>>>>> secondary indexes. I had a discussion with Ildar, and it seems the thread
>>>>>>>>>> on Metadata secondary indexes being "hacked" has been lost. Is this also
>>>>>>>>>> something that should get into the release? Is there anyone currently
>>>>>>>>>> looking at it?
>>>>>>>>>>
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>>
>>>>> Best regards,
>>>>> Ildar
>>>>>
>>>> Best regards,
>>>> Ildar
>>>>

