tinkerpop-dev mailing list archives

From Marko Rodriguez <okramma...@gmail.com>
Subject Re: [DISCUSS] Primitive Types, Complex Types, and their Entailments in TP4
Date Mon, 15 Apr 2019 18:18:58 GMT
Hello Stephen,

> I'd also wonder about how we treat subgraph() and tree()? could those be a
> List<TPath> somehow??

Yes, Tree is List<TPath>. Subgraph….hmmmm….shooting from the hip: you don’t get
back a graph; it’s stored in:

g.withStructure(TinkerGraphStructure.class, config1)

That is, the subgraph is written to one of the registered structures. You can then query it
like any other registered structure. Remember, in TP4, we will support an arbitrary number
of structures associated with a Bytecode source.

> isn't a URI a complex type? that list is expected to grow? maybe all
> complex types have simple type representations?

The problem with every complex type having a simple type representation is that the serializer
will have to know about complex types (as objects). This is just more code for Python, JavaScript,
Java, etc. to maintain. If the serialization format is ONLY primitives, and primitives come
from a static set of ~10 types, then writing, testing, and maintaining serializers in other
languages will be trivial.

	Bytecode in [a nested list of primitives]
	Traversers out [a collection of coefficient wrapped primitives]

Everything communicated over the wire is primitive! Basic. (TTraverser will have to be primitive,
where get() returns a coefficient [bulk] and primitive [object] pair).
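A minimal sketch of the primitives-only wire idea, in Python. The primitive set, the helper names (is_primitive, encode_bytecode, encode_traversers), and the use of JSON as the carrier are all assumptions made for illustration; nothing in this thread fixes a concrete format.

```python
import json

# Assumed primitive set; the thread only says "a static set of ~10 types".
PRIMITIVES = (bool, int, float, str, type(None))

def is_primitive(obj):
    """True if obj is a primitive, or a list/dict built only from primitives."""
    if isinstance(obj, PRIMITIVES):
        return True
    if isinstance(obj, list):
        return all(is_primitive(item) for item in obj)
    if isinstance(obj, dict):
        return all(isinstance(k, str) and is_primitive(v) for k, v in obj.items())
    return False

def encode_bytecode(instructions):
    """Bytecode in: a nested list of primitives."""
    if not is_primitive(instructions):
        raise TypeError("bytecode must be a nested list of primitives")
    return json.dumps(instructions)

def encode_traversers(traversers):
    """Traversers out: a list of [bulk, primitive] pairs."""
    pairs = []
    for bulk, obj in traversers:
        if not is_primitive(obj):
            raise TypeError("only primitives may cross the wire")
        pairs.append([bulk, obj])
    return json.dumps(pairs)

wire_in = encode_bytecode([["outE", "knows"], ["has", "stars", 5], ["inV"]])
wire_out = encode_traversers([(2, "marko"), (1, 29)])
```

Because both directions carry only primitives, a client in another language would need nothing beyond a generic JSON-like decoder in this sketch.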

> sorry, if some of these questions/ideas are a bit half-cocked, but i read
> this really fast and won't be at my laptop for the rest of the day and
> wanted to get some thoughts out. i'm really really interested in seeing
> this aspect of TP done "right"….

No worries. Thanks for replying.

Some random ideas I was having.

	- TXML: Assume an XML database. out() would be the children tags. value() would be the tag
attribute value. label() would be the tag type. In other words, there is a clean mapping from
the instructions to XML.
	- TMatrix: Assume a database of n×m matrices. math() instruction will be augmented to support
matrix multiplication. A matrix is a table with rows and columns. We would need some nice
instructions for that.
	- TJPEG: Assume a database of graphics. Does our instruction set have instructions that are
useful for manipulating images? Probably need row/column type instructions like TMatrix.
	- TObject: Assume an object database. value() are primitive fields. out() is object fields.
id() is unique object identifier. label() is object class. has() is a primitive field filter.
	- TTimeSeries: ? I don’t know anything about time series databases, but the question remains…do
our instructions make sense for this data structure?
	- https://en.wikipedia.org/wiki/List_of_data_structures
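To make the TXML bullet concrete, here is a rough Python sketch of that mapping using the standard library's xml.etree.ElementTree. The function names mirror the instruction names, but the functions themselves and the sample document are assumptions for illustration, not part of any TP4 API.

```python
import xml.etree.ElementTree as ET

def out(element):
    """out(): the child tags of an element."""
    return list(element)

def label(element):
    """label(): the tag type."""
    return element.tag

def value(element, key):
    """value(): the value of a tag attribute (None if absent)."""
    return element.get(key)

# Hypothetical sample document.
doc = ET.fromstring(
    '<library><book title="Dune" year="1965"/><book title="Emma" year="1815"/></library>'
)

children = out(doc)
labels = [label(child) for child in children]
titles = [value(child, "title") for child in children]
```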

The point being: I’m trying to think of oddball data structures and then trying to see
if the TP4 instruction set is sufficiently general to encompass operations used by those structures.

The beautiful thing is that providers can create as many complex types as they want. These
types are always contained within the TP4-VM and thus require no changes to the serialization
format and respective objects in the deserializing language. Imagine, some XML database out
there is using the TP4-VM, with the XPath language compiling to TP4 bytecode, and is processing
their XML documents in real-time (Pipes/Rx), near-time (Flink/Akka), or batch-time (Spark/Hadoop).
The TP4-VM has a life beyond graph! What a wonderful asset to the entire space of data processing!

…now think of the RDF community using the TP4-VM. SPARQL will be W3C-compliant and can execute
in real-time, near-time, batch-time, etc. What a useful technology to adopt for your RDF triple-store.
I could see Stardog using TP4 for their batch processing. I could see Jena or OpenRDF importing
TP4 to provide different SPARQL execution engines to their triple-store providers.

The TP4 virtual machine may just turn out to be a technological masterpiece.

Marko.

http://rredux.com

> 
> On Mon, Apr 15, 2019 at 8:06 AM Marko Rodriguez <okrammarko@gmail.com>
> wrote:
> 
>> Hello,
>> 
>> I have a consolidated approach to handling data structures in TP4. I would
>> appreciate any feedback you may have.
>> 
>>        1. Every object processed by TinkerPop has a TinkerPop-specific
>> type.
>>                - TLong, TInteger, TString, TMap, TVertex, TEdge, TPath,
>> TList, …
>>                - BENEFIT #1: A universal type system will protect us from
>> language platform peculiarities (e.g. Python long vs Java long).
>>                - BENEFIT #2: The serialization format is constrained and
>> consistent across all language platforms. (no more coming across a
>> MySpecialClass).
>>        2. All primitive T-type data can be directly accessed via get().
>>                - TBoolean.get() -> java.lang.Boolean | System.Boolean |
>> ...
>>                - TLong.get() -> java.lang.Long | System.Int64 | ...
>>                - TString.get() -> java.lang.String | System.String | …
>>                - TList.get() -> java.util.ArrayList | .. // can only
>> contain primitives
>>                - TMap.get() -> java.util.LinkedHashMap | .. // can only
>> contain primitives
>>                - ...
>>        3. All complex T-types have no methods! (except those afforded by
>> Object)
>>                - TVertex: no accessible methods.
>>                - TEdge: no accessible methods.
>>                - TRow: no accessible methods.
>>                - TDocument: no accessible methods.
>>                - TDocumentArray: no accessible methods. // a document
>> list field that can contain complex objects
>>                - ...
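A small Python sketch of points 2 and 3, under stated assumptions: the class bodies, the 64-bit range check on TLong, and the use of __slots__ are illustrative choices, not a proposed implementation. It shows a primitive T-type exposing its value through get() and a complex T-type exposing nothing.

```python
class TPrimitive:
    """Base for primitive T-types: the wrapped value is reachable via get()."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


class TLong(TPrimitive):
    # Assumed 64-bit signed range, so a Python int behaves like a Java long.
    MIN, MAX = -2**63, 2**63 - 1

    def __init__(self, value):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("TLong wraps an integer")
        if not TLong.MIN <= value <= TLong.MAX:
            raise OverflowError("TLong is limited to 64 bits")
        super().__init__(value)


class TString(TPrimitive):
    def __init__(self, value):
        if not isinstance(value, str):
            raise TypeError("TString wraps a string")
        super().__init__(value)


class TVertex:
    """Complex T-type: deliberately exposes no data accessors."""
    __slots__ = ()
```

The range check is one way a universal type could guard against the Python long vs Java long mismatch mentioned under BENEFIT #1.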
>> 
>> REQUIREMENT #1: We need to be able to support multiple graphdbs in the
>> same query.
>>                - e.g., read from JanusGraph and write to Neo4j.
>> REQUIREMENT #2: We need to make sure complex objects can not be queried
>> client-side for properties/edges/etc. data.
>>                - e.g., vertices are universally assumed to be “detached."
>> REQUIREMENT #3: We no longer want to maintain a structure test suite.
>> Operational semantics should be verified via Bytecode ->
>> Processor/Structure.
>>                - i.e., the only way to read/write vertices is via
>> Bytecode as complex T-types don’t have APIs.
>> REQUIREMENT #4: We should support other database data structures besides
>> graph.
>>                - e.g., reading from MySQL and writing to JanusGraph.
>> 
>> ———
>> 
>> Assume the following TraversalSource:
>> 
>> g.withStructure(JanusGraphStructure.class, config1).
>>  withStructure(Neo4jStructure.class, config2)
>> 
>> Now, assume the following traversal fragment:
>> 
>>        outE('knows').has('stars',5).inV()
>> 
>> This would initially be written to Bytecode as:
>> 
>>        [[outE,knows],[has,stars,5],[inV]]
>> 
>> A decoration strategy realizes that there are two structures registered in
>> the Bytecode source instructions and would rewrite the above as:
>> 
>>        [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]]]
>> 
>> A JanusGraph strategy would rewrite this as:
>> 
>> 
>> [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]],[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]]]
>> 
>> A Neo4j strategy would rewrite this as:
>> 
>> 
>> [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]],[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]],[[type,Neo4jVertex]],[[neo:outE,knows],[neo:has,stars,5],[neo:inV]]]
>> 
>> A finalization strategy would rewrite this as:
>> 
>> 
>> [choose,[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]],[[type,Neo4jVertex]],[[neo:outE,knows],[neo:has,stars,5],[neo:inV]]]
>> 
>> Now, when a TVertex gets to this CFunction, it will check its type, if it's
>> a JanusVertex, it goes down the JanusGraph-specific instruction branch. If
>> the type is Neo4jVertex, it goes down the Neo4j-specific instruction branch.
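The rewrite chain above can be sketched in Python with bytecode modelled as nested lists. The function names (decorate, add_branch, finalize, select_branch) are hypothetical, and the assumption that a choose instruction is laid out as alternating predicate/branch pairs is read off the examples in the message rather than from any specification.

```python
TVERTEX = [["type", "TVertex"]]

def decorate(fragment):
    """Decoration strategy: wrap the fragment in a choose keyed on the generic type."""
    return ["choose", TVERTEX, fragment]

def add_branch(choose, type_name, branch):
    """Provider strategy: append a (type predicate, branch) pair."""
    return choose + [[["type", type_name]], branch]

def finalize(choose):
    """Finalization strategy: drop the generic TVertex branch, keep provider branches."""
    result = [choose[0]]
    # Pairs start at index 1: predicate at i, branch at i + 1.
    for i in range(1, len(choose), 2):
        predicate, branch = choose[i], choose[i + 1]
        if predicate != TVERTEX:
            result += [predicate, branch]
    return result

def select_branch(choose, vertex_type):
    """Runtime dispatch: pick the branch whose predicate matches the vertex type."""
    for i in range(1, len(choose), 2):
        if choose[i] == [["type", vertex_type]]:
            return choose[i + 1]
    raise LookupError("no branch registered for " + vertex_type)

fragment = [["outE", "knows"], ["has", "stars", 5], ["inV"]]
bytecode = decorate(fragment)
bytecode = add_branch(bytecode, "JanusVertex",
                      [["jg:vertexCentric", "out", "knows", "stars", 5]])
bytecode = add_branch(bytecode, "Neo4jVertex",
                      [["neo:outE", "knows"], ["neo:has", "stars", 5], ["neo:inV"]])
final = finalize(bytecode)
```

In this sketch, final has the same shape as the finalized bytecode shown in the message, and select_branch(final, "Neo4jVertex") yields the Neo4j-specific instructions.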
>> 
>>        REQUIREMENT #1 SOLVED
>> 
>> The last instruction of the root bytecode can not return a complex object.
>> If so, an exception is thrown. g.V() is illegal. g.V().id() is legal.
>> Complex objects do not exist outside the TP4-VM. Only primitives can leave
>> the VM-client barrier. If you want vertex property data (e.g.), you have to
>> access it and return it within the traversal — e.g., g.V().valueMap().
>>        BENEFIT #1: Language variant implementations are simple. Just
>> primitives.
>>        BENEFIT #2: The serialization specification is simple. Just
>> primitives. (also, note that Bytecode is just a TList of primitives! —
>> though TBytecode will exist.)
>>        BENEFIT #3: The concept of a “DetachedVertex” is universally
>> assumed.
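One way the "last instruction must not return a complex object" rule could be checked is sketched below in Python. The two step sets are assumptions chosen for illustration (value() is left out because, per this message, it may return a complex object for some structures), and the function names are hypothetical.

```python
# Assumed classification of a few steps; a real VM would derive this
# from its instruction set rather than from hard-coded names.
COMPLEX_STEPS = {"V", "E", "out", "in", "both", "outE", "inE", "bothE", "inV", "outV"}
PRIMITIVE_STEPS = {"id", "label", "valueMap", "count"}

def result_kind(bytecode):
    """Track whether the traversal currently yields complex or primitive objects."""
    kind = None
    for instruction in bytecode:
        op = instruction[0]
        if op in COMPLEX_STEPS:
            kind = "complex"
        elif op in PRIMITIVE_STEPS:
            kind = "primitive"
        # Any other step (e.g. a has() filter) passes its input kind through.
    return kind

def check_root(bytecode):
    """Reject root bytecode whose result would be a complex object."""
    if result_kind(bytecode) == "complex":
        raise ValueError("root bytecode must end in a primitive, e.g. id() or valueMap()")
    return bytecode

check_root([["V"], ["id"]])        # g.V().id() is legal
check_root([["V"], ["valueMap"]])  # g.V().valueMap() is legal
```

Under these assumptions, check_root([["V"]]) raises, matching the statement that g.V() is illegal.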
>> 
>>        REQUIREMENT #2 SOLVED
>> 
>> It is completely up to the structure provider to use structure-specific
>> instructions for dealing with their particular TVertex. They will have to
>> provide CFunction implementations for out, in, both, has, outE, inE, bothE,
>> drop, property, value, id, label … (seems like a lot, but out/in/both could
>> be one parameterized CFunction).
>>        BENEFIT #1: No more structure/ API and structure/ test suite.
>>        BENEFIT #2: The structure provider has full control of where the
>> vertex data is stored (cached in memory or fetched from the db or a cut
>> vertex or …). No assumptions are made by the TP4-VM.
>>        BENEFIT #3: The structure provider can safely assume their
>> vertices will not be accessed outside the TP4-VM (outside the processor).
>> 
>>        REQUIREMENT #3 SOLVED
>> 
>> We can support TRow for relational databases. A TRow’s data is accessible
>> via the instructions has, hasKey, value, property, id, ... The location of
>> the data in TRow is completely up to the structure provider and its
>> strategy analysis (if only ’name’ is accessed, then SELECT ’name’ FROM...).
>> We can easily support TDocument for document databases. A TDocument’s data
>> is accessible via the instructions has, hasKey, value, property, id, … A
>> value() could return yet another TDocument (or a TDocumentArray containing
>> TDocuments).
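The strategy analysis mentioned for TRow (only fetch the columns the traversal touches) might look roughly like this in Python. The helper names, the table name, and the rule that the first argument of has, hasKey, and value is a column name are assumptions for illustration; identifiers are not escaped, so this is not safe for untrusted input.

```python
def accessed_keys(bytecode):
    """Collect, in order of first use, the keys read by has/hasKey/value."""
    keys = []
    for instruction in bytecode:
        op, args = instruction[0], instruction[1:]
        if op in ("has", "hasKey", "value") and args and args[0] not in keys:
            keys.append(args[0])
    return keys

def to_select(table, bytecode):
    """Build a projection that fetches only the accessed columns."""
    keys = accessed_keys(bytecode)
    columns = ", ".join(keys) if keys else "*"
    return f"SELECT {columns} FROM {table}"

# "people" is a hypothetical table name.
query = to_select("people", [["value", "name"]])
```

Here query comes out as "SELECT name FROM people", the case described in the message.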
>> 
>> Supporting a new complex type is simply a function of asking:
>> 
>>        “Does the TP4 VM instruction set have the requisite
>> instruction-types (semantically) to manipulate this structure?"
>> 
>> We are no longer playing the language-specific object API game. We are
>> playing the language-agnostic VM instruction game. The TP4-VM instruction
>> set is the sole determiner of what complex objects can be processed. (i.e.
>> what data structures can be processed without impedance mismatch).
>> 
>>        REQUIREMENT #4 SOLVED
>> 
>> ———
>> 
>> The TP4-VM (and, in turn, Gremlin) can naturally support:
>> 
>>        1. Property graphs: as currently supported in TP3.
>>        2. RDF graphs: id() is a URI | Literal. g.V(1).value(‘foaf:name’)
>> returns multi/meta-properties *or* g.V(1).out(‘foaf:name’) returns vertices
>> whose id()s are xsd:string literals.
>>        3. Hypergraphs: inV() can return more than one vertex.
>>        4. Undirected graphs: in() and out() throw exceptions. Only both()
>> works.
>>        5. Meta-properties: value(‘name’) can return a TVertexProperty  (a
>> special complex object that is structure provider specific — and that is
>> okay!).
>>        6. Multi-properties: value(‘name’) can return a TPropertyArray of
>> TVertexProperty objects.
>> 
>> This means that the same instruction can behave differently for different
>> structures. This is okay as there can be property graph, RDF, hypergraph,
>> etc. test suites.
>> 
>> Since complex objects don’t leave the TP4-VM barrier, providers can create
>> any complex objects they want — they just have to have corresponding
>> strategies to create provider-unique bytecode instructions (and thus,
>> CFunctions) for those complex objects.
>> 
>> Finally, there are a few problems to work out:
>>        - There is no way to yield a “v[1]” or “e[3][v[1]-knows->v[2]]”
>> representation. Is that bad? Perhaps not.
>>        - What is the nature of a TPath? It's complex, but we want to
>> return it.
>>        - g.V().id() on an RDF graph can return a URI. Is a URI “simple”?
>> No, the set of simple types should never grow!…. thus, URI => String. Is
>> that wack?
>>        - Do we add g.R() and g.D() to Gremlin to type-support TRow and
>> TDocument objects? g.V() would be weird :( … Hmmmm?
>>                - However, there are only so many data structures……. or
>> are there? TMatrix, TXML, …. whoa.
>> 
>> Thanks for reading,
>> Marko.
>> 
>> http://rredux.com

