tinkerpop-dev mailing list archives

From Marko Rodriguez <okramma...@gmail.com>
Subject Re: [DISCUSS] Primitive Types, Complex Types, and their Entailments in TP4
Date Mon, 15 Apr 2019 19:07:34 GMT

> I think this does satisfy your requirements, though I don't think I
> understand all aspects of the approach, especially the need for
> TinkerPop-specific types *for basic scalar values* like booleans, strings,
> and numbers. Since we are committed to the native data types supported by
> the JVM.

TinkerPop4 will have VM implementations on various language platforms. For sure, Apache’s
distribution will have JVM and .NET implementations. The purpose of TinkerPop-specific types
(and not JVM, Mono, Python, etc. types) is that we know it’s the same type across all VMs.
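
To make the cross-VM point concrete, here is a minimal Python sketch. Only the name TLong and
its get() accessor come from the proposal; the constructor checks and the signed 64-bit bounds
are assumptions for illustration, not TP4 code.

```python
# Hypothetical sketch: a TLong that means "signed 64-bit integer" on every host language.
# Python ints are arbitrary precision, so the bound has to be enforced explicitly.

class TLong:
    MIN = -(2 ** 63)
    MAX = 2 ** 63 - 1

    def __init__(self, value):
        # bool is a subclass of int in Python; reject it so TLong and TBoolean stay distinct.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("TLong wraps an integer")
        if not (TLong.MIN <= value <= TLong.MAX):
            raise OverflowError("value does not fit in a signed 64-bit long")
        self._value = value

    def get(self):
        # Return the host-language value, as the proposal's get() does for primitives.
        return self._value
```

With a wrapper like this, a value a JVM-based VM could not represent as a long is rejected at
construction time on the Python side instead of surfacing later during serialization.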

> To my mind, your approach is headed in the direction of a
> TinkerPop-specific notion of a *type*, in general, which captures the
> structure and constraints of a logical data type
> <https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012/42>
> and which can be used for query planning and optimization. These include
> both scalar types as well as vertex, edge, and property types, as well as
> more generic constructs such as optionals, lists, records.

Yes — I’d like to be able to use some type of formal data type specification. You have
those skills. I don’t. My rudimentary (non-categorical) representation is just “common
useful data structures” — map, list, bool, string, etc. 

> Can a TList really only contain primitives? A list of vertices or edges
> would definitely be unusual, and typical PG implementations may not choose
> to support them, but language-agnostic VM possibly should. They would
> nicely capture RDF lists, in which list nodes typically do not have any
> properties (edges) other than rdf:first and rdf:rest.

A TList only supports primitives. However, a TRDFList could be a complex type for dealing
with RDF lists and would be contained within the TP4-VM. Adding complex types is okay; it
doesn’t break anything.

As a related concept — realize that TDocument has a TDocumentArray not a TList. This is
because TDocuments can have “lists” that contain primitives, documents, and lists.
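
A rough Python sketch of that distinction follows. The type names come from this thread;
treating only bool/int/float/str as primitives, and validating in the constructor, are
assumptions made for the example.

```python
# Hypothetical sketch of TList (primitives only) versus TDocumentArray (may nest complex values).

SCALARS = (bool, int, float, str)  # assumed stand-ins for the scalar T-types


class TDocument:
    """Opaque complex type: deliberately exposes no data accessors."""


class TList:
    def __init__(self, items):
        items = list(items)
        for item in items:
            if not isinstance(item, SCALARS):
                raise TypeError("TList can only contain primitives")
        self._items = items

    def get(self):
        # Primitive T-types hand back a host-language value.
        return list(self._items)


class TDocumentArray:
    """Complex type: accepts primitives, documents, and nested arrays; no get()."""

    def __init__(self, items):
        items = list(items)
        for item in items:
            if not isinstance(item, SCALARS + (TDocument, TDocumentArray)):
                raise TypeError("unsupported element in TDocumentArray")
        self._items = items
```

Note that TDocumentArray has no get(): in this sketch its contents would only be reachable
through VM instructions, in line with the "complex types have no methods" rule.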

> For hypergraphs, an inV and outV which may produce more than one vertex, is
> one way to go, but a labeled hypergraph should really have other projections
> <https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012/49>
> in addition to inV, outV. That suggests a more generic step than inV or
> outV, which takes as an argument the name of the projection as well as the
> in/out element. E.g. project("in", v1), project("out", v1),
> project("subject", v1).

Hm. Yea, I’m not too strong with hypergraph thinking.

	g.V(1) // vertex
	g.V(1).outE('family')  // hyperedges
	g.V(1).outE('family').inV('father') // ? perhaps inV/outV/bothV can take a String…
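
One way to picture a String-parameterized inV, or the project(name, ...) step suggested in
the quoted text, is a hyperedge that maps role names to vertices. The sketch below is purely
illustrative: HyperEdge, project, and the sample vertex ids are hypothetical names, not TP4 API.

```python
# Hypothetical sketch: a labeled hyperedge whose projections are looked up by role name.

class HyperEdge:
    def __init__(self, label, roles):
        self.label = label
        # role name -> list of vertex ids; a role may hold more than one vertex
        self.roles = {role: list(vertices) for role, vertices in roles.items()}


def project(edge, role):
    """Return the vertices filling `role` on `edge` (empty list if the role is absent)."""
    return list(edge.roles.get(role, []))


family = HyperEdge("family", {"father": ["v2"], "mother": ["v3"], "child": ["v1", "v4"]})
fathers = project(family, "father")   # the analogue of inV('father')
children = project(family, "child")   # a projection that yields more than one vertex
```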

We should talk to the GRAKN.AI guys and see what they think.
	https://grakn.ai/
	https://dev.grakn.ai/docs/general/quickstart

> For undirected graphs, we might as well just allow both in() and out()
> rather than throwing exceptions. You can think of an undirected edge as a
> pair of directed edges.


> Agreed that provider-specific structures (types) are OK, and should not be
> discouraged. Not only do different providers have their own data models,
> but specific applications have their own schemas. A structure like a
> metaproperty may be allowed in certain contexts and not others, and the
> same goes for instances of conventional structures like edges of a certain
> label.

Yes. I want to make sure we naturally/natively support property graphs, RDF graphs, hypergraphs,
tables, documents, etc. Property graphs (as specified by Neo4j) are not “special” in TP4.
Like Gremlin for languages, property graphs sit side-by-side w/ other data structures. If
we do this right, we will be heroes!

> For multi-properties, there is a distinction to be made between multiple
> properties with the same key and element, and single collection-valued
> properties. This is something the PG Working Group has been grappling with.
> I think both should be allowed.

Agreed. This all gets back to a way to specify what the data structure is:

	JanusGraph: a single-labeled property graph with multi/meta-properties.
	Neo4j: a multi-labeled property graph with singleton properties (w/ list values supported).
	RDF: an unlabeled 1-property graph (named graph property?) with vertex-based literals.
	… ?.

Like Graph.Features in TP3.
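
A small Python sketch of such a descriptor is below. The field names and the unsupported()
helper are hypothetical. The JanusGraph and Neo4j values are taken from the list above, and
anything the list does not state is left as None rather than guessed. RDF is omitted because
its entry above is still an open question.

```python
# Hypothetical sketch of a Graph.Features-style descriptor for a structure provider.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructureFeatures:
    multi_label: bool
    multi_properties: bool
    meta_properties: Optional[bool] = None  # None: not stated in this thread
    list_values: Optional[bool] = None      # None: not stated in this thread


JANUSGRAPH = StructureFeatures(multi_label=False, multi_properties=True, meta_properties=True)
NEO4J = StructureFeatures(multi_label=True, multi_properties=False, list_values=True)


def unsupported(features, required):
    """Return the names in `required` that the structure does not declare as supported."""
    return [name for name in required if getattr(features, name) is not True]
```

A strategy could consult a descriptor like this before deciding whether a traversal that
relies on, say, meta-properties is valid for a given structure.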

> IMO it's OK if URIs, in an RDF context, become Strings in a TP context. You
> can think of URI as a constraint on String, which should be enforced at the
> appropriate time, but does not require a vendor-specific class. Can you
> concatenate two URIs? Sure... just concatenate the Strings, but also be
> aware that the result is not a URI.


Thanks for reading and providing good ideas.



> On Mon, Apr 15, 2019 at 5:06 AM Marko Rodriguez <okrammarko@gmail.com>
> wrote:
>> Hello,
>> I have a consolidated approach to handling data structures in TP4. I would
>> appreciate any feedback you may have.
>>        1. Every object processed by TinkerPop has a TinkerPop-specific
>> type.
>>                - TLong, TInteger, TString, TMap, TVertex, TEdge, TPath,
>> TList, …
>>                - BENEFIT #1: A universal type system will protect us from
>> language platform peculiarities (e.g. Python long vs Java long).
>>                - BENEFIT #2: The serialization format is constrained and
>> consistent across all language platforms. (no more coming across a
>> MySpecialClass).
>>        2. All primitive T-type data can be directly accessed via get().
>>                - TBoolean.get() -> java.lang.Boolean | System.Boolean |
>> ...
>>                - TLong.get() -> java.lang.Long | System.Int64 | ...
>>                - TString.get() -> java.lang.String | System.String | …
>>                - TList.get() -> java.util.ArrayList | .. // can only
>> contain primitives
>>                - TMap.get() -> java.util.LinkedHashMap | .. // can only
>> contain primitives
>>                - ...
>>        3. All complex T-types have no methods! (except those afforded by
>> Object)
>>                - TVertex: no accessible methods.
>>                - TEdge: no accessible methods.
>>                - TRow: no accessible methods.
>>                - TDocument: no accessible methods.
>>                - TDocumentArray: no accessible methods. // a document
>> list field that can contain complex objects
>>                - ...
>> REQUIREMENT #1: We need to be able to support multiple graphdbs in the
>> same query.
>>                - e.g., read from JanusGraph and write to Neo4j.
>> REQUIREMENT #2: We need to make sure complex objects cannot be queried
>> client-side for properties/edges/etc. data.
>>                - e.g., vertices are universally assumed to be “detached.”
>> REQUIREMENT #3: We no longer want to maintain a structure test suite.
>> Operational semantics should be verified via Bytecode ->
>> Processor/Structure.
>>                - i.e., the only way to read/write vertices is via
>> Bytecode as complex T-types don’t have APIs.
>> REQUIREMENT #4: We should support other database data structures besides
>> graph.
>>                - e.g., reading from MySQL and writing to JanusGraph.
>> ———
>> Assume the following TraversalSource:
>> g.withStructure(JanusGraphStructure.class, config1).
>>  withStructure(Neo4jStructure.class, config2)
>> Now, assume the following traversal fragment:
>>        outE('knows').has('stars',5).inV()
>> This would initially be written to Bytecode as:
>>        [[outE,knows],[has,stars,5],[inV]]
>> A decoration strategy realizes that there are two structures registered in
>> the Bytecode source instructions and would rewrite the above as:
>>        [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]]]
>> A JanusGraph strategy would rewrite this as:
>> [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]],[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]]]
>> A Neo4j strategy would rewrite this as:
>> [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]],[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]],[[type,Neo4jVertex]],[[neo:outE,knows],[neo:has,stars,5],[neo:inV]]]
>> A finalization strategy would rewrite this as:
>> [choose,[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]],[[type,Neo4jVertex]],[[neo:outE,knows],[neo:has,stars,5],[neo:inV]]]
>> Now, when a TVertex gets to this CFunction, it will check its type. If it’s
>> a JanusVertex, it goes down the JanusGraph-specific instruction branch. If
>> the type is Neo4jVertex, it goes down the Neo4j-specific instruction branch.
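
The rewrite chain above can be sketched in Python with bytecode modeled as nested lists. The
function names (decorate, janus_strategy, neo4j_strategy, finalize, select_branch) and the
pattern matching are assumptions for illustration only. The instruction names and the final
bytecode shape are the ones shown in the proposal.

```python
# Hypothetical sketch of the strategy pipeline; bytecode is modeled as nested Python lists.

def decorate(bytecode):
    # Wrap the generic instructions in a choose keyed on the generic TVertex type.
    return ["choose", [["type", "TVertex"]], bytecode]


def janus_strategy(choose):
    generic = choose[2]
    # Recognize outE(label).has(key, value).inV() and add a vertex-centric branch for it.
    if (len(generic) == 3 and generic[0][0] == "outE"
            and generic[1][0] == "has" and generic[2] == ["inV"]):
        label, key, value = generic[0][1], generic[1][1], generic[1][2]
        return choose + [[["type", "JanusVertex"]],
                         [["jg:vertexCentric", "out", label, key, value]]]
    return choose


def neo4j_strategy(choose):
    # Map every generic instruction to its neo:-prefixed counterpart.
    native = [["neo:" + ins[0]] + ins[1:] for ins in choose[2]]
    return choose + [[["type", "Neo4jVertex"]], native]


def finalize(choose):
    # Drop the generic TVertex predicate/branch pair (positions 1 and 2).
    return [choose[0]] + choose[3:]


def select_branch(choose, vertex_type):
    # Walk the (predicate, branch) pairs and return the branch whose type matches.
    pairs = choose[1:]
    for i in range(0, len(pairs), 2):
        if pairs[i] == [["type", vertex_type]]:
            return pairs[i + 1]
    raise LookupError("no branch for " + vertex_type)


source = [["outE", "knows"], ["has", "stars", 5], ["inV"]]
final = finalize(neo4j_strategy(janus_strategy(decorate(source))))
```

Here `final` has the same shape as the finalized bytecode above, and
select_branch(final, "Neo4jVertex") picks out the neo:-prefixed instructions.
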
>> The last instruction of the root bytecode cannot return a complex object.
>> If it does, an exception is thrown. g.V() is illegal. g.V().id() is legal.
>> Complex objects do not exist outside the TP4-VM. Only primitives can leave
>> the VM-client barrier. If you want vertex property data (e.g.), you have to
>> access it and return it within the traversal — e.g., g.V().valueMap().
>>        BENEFIT #1: Language variant implementations are simple. Just
>> primitives.
>>        BENEFIT #2: The serialization specification is simple. Just
>> primitives. (also, note that Bytecode is just a TList of primitives! —
>> though TBytecode will exist.)
>>        BENEFIT #3: The concept of a “DetachedVertex” is universally
>> assumed.
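
As a rough illustration of the VM-client barrier, the Python sketch below checks a result
before it leaves the VM. The function name check_result, the choice of host types standing in
for primitives, and raising TypeError are all assumptions. The proposal only says that an
exception is thrown.

```python
# Hypothetical sketch: only primitives (including lists and maps of primitives) may leave the VM.

class TVertex:
    """Opaque complex type with no accessible methods."""


SCALARS = (bool, int, float, str)  # assumed stand-ins for the scalar T-types


def check_result(value):
    """Return `value` unchanged if it is primitive all the way down, else raise TypeError."""
    if isinstance(value, SCALARS):
        return value
    if isinstance(value, list):
        return [check_result(item) for item in value]
    if isinstance(value, dict):
        return {check_result(k): check_result(v) for k, v in value.items()}
    raise TypeError("complex object cannot cross the VM-client barrier: "
                    + type(value).__name__)
```

Under this sketch, the result of something like g.V().id() or g.V().valueMap() passes
through, while a raw TVertex from g.V() is rejected.
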
>> It is completely up to the structure provider to use structure-specific
>> instructions for dealing with their particular TVertex. They will have to
>> provide CFunction implementations for out, in, both, has, outE, inE, bothE,
>> drop, property, value, id, label … (seems like a lot, but out/in/both could
>> be one parameterized CFunction).
>>        BENEFIT #1: No more structure/ API and structure/ test suite.
>>        BENEFIT #2: The structure provider has full control of where the
>> vertex data is stored (cached in memory or fetched from the db or a cut
>> vertex or …). No assumptions are made by the TP4-VM.
>>        BENEFIT #3: The structure provider can safely assume their
>> vertices will not be accessed outside the TP4-VM (outside the processor).
>> We can support TRow for relational databases. A TRow’s data is accessible
>> via the instructions has, hasKey, value, property, id, ... The location of
>> the data in TRow is completely up to the structure provider and its
>> strategy analysis (if only 'name' is accessed, then SELECT name FROM...).
>> We can easily support TDocument for document databases. A TDocument’s data
>> is accessible via the instructions has, hasKey, value, property, id, … A
>> value() could return yet another TDocument (or a TDocumentArray containing
>> TDocuments).
>> Supporting a new complex type is simply a function of asking:
>>        “Does the TP4 VM instruction set have the requisite
>> instruction-types (semantically) to manipulate this structure?"
>> We are no longer playing the language-specific object API game. We are
>> playing the language-agnostic VM instruction game. The TP4-VM instruction
>> set is the sole determiner of what complex objects can be processed. (i.e.
>> what data structures can be processed without impedance mismatch).
>> ———
>> The TP4-VM (and, in turn, Gremlin) can naturally support:
>>        1. Property graphs: as currently supported in TP3.
>>        2. RDF graphs: id() is a URI | Literal. g.V(1).value('foaf:name')
>> returns multi/meta-properties *or* g.V(1).out('foaf:name') returns vertices
>> whose id()s are xsd:string literals.
>>        3. Hypergraphs: inV() can return more than one vertex.
>>        4. Undirected graphs: in() and out() throw exceptions. Only both()
>> works.
>>        5. Meta-properties: value('name') can return a TVertexProperty (a
>> special complex object that is structure provider specific — and that is
>> okay!).
>>        6. Multi-properties: value('name') can return a TPropertyArray of
>> TVertexProperty objects.
>> This means that the same instruction can behave differently for different
>> structures. This is okay as there can be property graph, RDF, hypergraph,
>> etc. test suites.
>> Since complex objects don’t leave the TP4-VM barrier, providers can create
>> any complex objects they want — they just have to have corresponding
>> strategies to create provider-unique bytecode instructions (and thus,
>> CFunctions) for those complex objects.
>> Finally, there are a few problems to work out:
>>        - There is no way to yield a “v[1]” or “e[3][v[1]-knows->v[2]]”
>> representation. Is that bad? Perhaps not.
>>        - What is the nature of a TPath? It’s complex, but we want to
>> return it.
>>        - g.V().id() on an RDF graph can return a URI. Is a URI “simple”?
>> No, the set of simple types should never grow!…. thus, URI => String. Is
>> that wack?
>>        - Do we add g.R() and g.D() to Gremlin to type-support TRow and
>> TDocument objects? g.V() would be weird :( … Hmmmm?
>>                - However, there are only so many data structures……. or
>> are there? TMatrix, TXML, …. whoa.
>> Thanks for reading,
>> Marko.
>> http://rredux.com
