gora-dev mailing list archives

From "Alfonso Nishikawa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GORA-401) Serialization and deserialization of Persistent does not hold the entity dirty state
Date Wed, 17 Dec 2014 21:51:13 GMT

    [ https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250632#comment-14250632 ]

Alfonso Nishikawa commented on GORA-401:

Hi, [~drazzib], your question is closely related, but not exactly the same. When you wrote that,
I didn't understand it because I was using an older version, but after upgrading I understand
you now, and I comment on it below in (1).
The problem I describe arises after GORA-326, applied on August 19th. I will answer [~renato2099]
at the same time :)

Hi, [~renato2099]. When StateManager was deleted and the {{__g__dirty}} field was introduced inside
the schema, Avro serialized it together with the rest of the fields, so the dirty
state traveled with the entity (although it wrongly lost the maps' k-v dirty state). That
was, in my opinion, a bad design. In GORA-326, {{__g__dirty}} was removed from the schema
fields and became an in-memory dirty state that is not serialized by Avro. In my opinion this is
a better design because it is not part of the fields in the schema (but it still has flaws).

When an entity is sent from the Map phase to the Reduce phase, it is serialized with the Avro
serializer, and losing the dirty state is a bad thing. Let's see why:
# You load an entity specifying only a few fields (a subset of all fields), as we know we
can do. Fields not loaded have a null value (or the default value for basic Java types).
# After serializing and deserializing, every field becomes dirty.
# When you write, *all* fields get persisted.
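
The three steps above can be sketched in plain Java. This is only an illustration under stated assumptions: {{Entity}}, {{serialize}} and {{deserialize}} are hypothetical stand-ins, not Gora or Avro classes, and the sketch assumes that the reader fills every field through the dirty-marking setter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical stand-in for a Gora entity; this is NOT the real Persistent API.
class DirtyStateLossSketch {
    static final int FIELD_COUNT = 3; // e.g. url, content, score (made-up fields)

    static class Entity {
        final String[] values = new String[FIELD_COUNT];
        final BitSet dirty = new BitSet(FIELD_COUNT);

        // The setter marks the field dirty, as a generated setter would.
        void set(int field, String value) {
            values[field] = value;
            dirty.set(field);
        }
    }

    // Writes field values only; the in-memory dirty bits are not part of the payload.
    static byte[] serialize(Entity e) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String v : e.values) {
                out.writeBoolean(v != null);
                if (v != null) out.writeUTF(v);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Assumption: the reader populates every field through the setter,
    // so every field ends up dirty, including the ones that were never loaded.
    static Entity deserialize(byte[] data) {
        Entity e = new Entity();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < FIELD_COUNT; i++) {
                e.set(i, in.readBoolean() ? in.readUTF() : null);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return e;
    }

    public static void main(String[] args) {
        Entity loaded = new Entity();
        loaded.values[2] = "1.0"; // step 1: only field 2 was loaded; 0 and 1 stay null
        loaded.set(2, "2.0");     // the Map phase modifies just that field
        Entity copy = deserialize(serialize(loaded));
        System.out.println("dirty before: " + loaded.dirty + ", after: " + copy.dirty);
    }
}
```

In this sketch, a store that writes every dirty field would overwrite fields 0 and 1 with null, which is the data loss described above.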

This, simply, was not the behavior when StateManager was in, nor before GORA-326. But there
are more important implications:

* Since every field will *eventually* be written with a null value, you will have to define
your schemas with all fields as "union null". Otherwise you will always have to read all the
* Nutch breaks horribly: after {{updatedb}}, all downloaded content is deleted because updatedb
does not load that field. I don't know why no one noticed it :P
* If you want to update only one field, you have to read all the fields *always*. Before this
point, you could just read the interesting fields, update the interesting field and persist.
* If you create a new entity interested only in 1 field, you will have to assign a value to
all fields or define all of them as nullable.
* etc...

About the "two mappers reading the same entity in different machines and modifying entity
differently", the answer is no different than before GORA-326: it depends on the situation,
and you can mess things up the same way as you can now.

Before GORA-326, the dirty fields were the ones being updated, and that is how I think it should
be now too. (Obviously, if you wanted to delete a field, you wrote it blank).

I took a deep look at Nutch and I wrote the effect in the description of this issue, but I
think it would be good if you took a look at Nutch yourself. Anyway, I feel a bit hurt noticing your
preconception that the problem probably lies elsewhere :(

What I suggest:

I find DirtyStateManager the best design approach, but since dirty state management has
been shifted to the fields' types, I find it OK to reintroduce the {{PersistentDatumWriter/Reader}}.
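
A minimal sketch of what such a writer/reader pair would need to do, assuming the dirty bits are simply written ahead of the field values. The names below are hypothetical and this is not the code of the old {{PersistentDatumWriter}}; it only illustrates carrying the internal state inside the serialized payload.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical sketch; the wire format here is an assumption, not Gora's.
class DirtyAwareSerdeSketch {
    static final int FIELD_COUNT = 3;

    static class Entity {
        final String[] values = new String[FIELD_COUNT];
        final BitSet dirty = new BitSet(FIELD_COUNT);

        void set(int field, String value) {
            values[field] = value;
            dirty.set(field);
        }
    }

    // Writes the dirty bits first, then the field values.
    static byte[] serialize(Entity e) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            byte[] dirtyBits = e.dirty.toByteArray();
            out.writeInt(dirtyBits.length);
            out.write(dirtyBits);
            for (String v : e.values) {
                out.writeBoolean(v != null);
                if (v != null) out.writeUTF(v);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Restores values without going through the setter, then restores the dirty bits.
    static Entity deserialize(byte[] data) {
        Entity e = new Entity();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] dirtyBits = new byte[in.readInt()];
            in.readFully(dirtyBits);
            for (int i = 0; i < FIELD_COUNT; i++) {
                e.values[i] = in.readBoolean() ? in.readUTF() : null;
            }
            e.dirty.or(BitSet.valueOf(dirtyBits));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return e;
    }

    public static void main(String[] args) {
        Entity e = new Entity();
        e.set(2, "2.0"); // only field 2 is modified
        Entity copy = deserialize(serialize(e));
        System.out.println("dirty after round trip: " + copy.dirty);
    }
}
```

With the dirty bits preserved, a store that persists only dirty fields would leave the unloaded fields untouched after the Map to Reduce transfer.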

(1) And about the question of [~drazzib]: before {{__g__dirty}} was introduced in the fields,
Maps tracked the key-values added and deleted. Now that incremental information is not
taken into account, which forces you to read and write all the key-values every time you read/write.
I find it wrong, since that information was useful to delete some key-values without having
to load the field (all k-v) (I used to do that), but well... by now there are too many changes
to roll back, so OK.
If I had to choose between the StateManager and the state managed in the instance of Maps,
I would vote for the StateManager, because each backend could use a state manager suited
to that backend. But well... maybe that will come some day.
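
The incremental map tracking mentioned in (1) can be illustrated with a small sketch. {{DeltaMap}} is a hypothetical class written for this comment, not Gora's former map implementation; it only shows how recording puts and removes lets a backend apply deltas without loading the whole field.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a map that remembers which keys were updated or deleted.
class DeltaTrackingMapSketch {
    static class DeltaMap {
        private final Map<String, String> data = new HashMap<>();
        private final Set<String> updated = new TreeSet<>();
        private final Set<String> deleted = new TreeSet<>();

        // A put cancels any pending delete for the same key.
        void put(String key, String value) {
            data.put(key, value);
            deleted.remove(key);
            updated.add(key);
        }

        // A remove is recorded even if the key was never loaded into memory.
        void remove(String key) {
            data.remove(key);
            updated.remove(key);
            deleted.add(key);
        }

        Set<String> updatedKeys() { return updated; }
        Set<String> deletedKeys() { return deleted; }
    }

    public static void main(String[] args) {
        DeltaMap outlinks = new DeltaMap(); // nothing loaded from the store
        outlinks.put("http://a.example/", "anchor A");
        outlinks.remove("http://b.example/");
        System.out.println("write: " + outlinks.updatedKeys() + ", delete: " + outlinks.deletedKeys());
    }
}
```

A backend could then write only the updated keys and delete only the deleted ones, instead of rewriting every key-value.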


> Serialization and deserialization of Persistent does not hold the entity dirty state
> ------------------------------------------------------------------------------------
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on gora-0.5
>            Reporter: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>   Original Estimate: 35h
>  Remaining Estimate: 35h
> After removing the __g__dirty field in GORA-326, the dirty field is not serialized. In GORA-321 the code
went from using {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty field to Avro
(but it is really not desirable to have that field as a main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which will serialize
the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its phases, serializes
entities (from Map to Reduce), and on deserialization finds all fields "dirty", independently
of which fields were modified in the Map, and overwrites all data in the datastore (deleting many
things: downloaded content, parsed content, etc).
> This effect can be seen in {{TestPersistentSerialization#testSerderEmployeeTwoFields}},
when debugging in {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections
show that entities are "equal" when their fields are equal. This is fine as a definition of "equal",
but another test must be added to check that serialization and deserialization keeps the dirty

This message was sent by Atlassian JIRA
