phoenix-user mailing list archives

From Steve Terrell <>
Subject Re: create schema on write
Date Fri, 03 Jun 2016 16:59:45 GMT
Now that I see the VIEW solution written down, I think I tried something
similar (but *after* all UPSERTs, not *during*, like you showed here).  I
tried to create a view consisting of the dynamic field names because I
hated the extra typing of specifying the field name and type after the
table name every time.  But it did not work and gave me errors.  I was
using 4.6, but I am eager to try it again when I upgrade to 4.7 or 4.8.

On Fri, Jun 3, 2016 at 11:28 AM, James Taylor <> wrote:

> Sure, Steve. Let's say you have a base table you define for your JSON
> object like this:
> -- Create sequence that doles out IDs when a new JSON type
> -- is processed
> CREATE SEQUENCE IF NOT EXISTS json_id_seq;
> -- Create base table for JSON
> CREATE TABLE IF NOT EXISTS json (
>     json_id BIGINT NOT NULL,
>     created_date DATE NOT NULL,
>     raw_json VARCHAR
>     CONSTRAINT pk PRIMARY KEY (json_id, created_date))
>     APPEND_ONLY_SCHEMA = true,
>     AUTO_PARTITION_SEQ = 'json_id_seq';
> Then as each json object is processed, you'd pull out the object name/type
> from the message, each field (as you're doing now), and dynamically create
> a view with the fields as columns:
> CREATE VIEW IF NOT EXISTS <json_type_name> (
>     <field 1> <field 1 type>, ...)
>     AS SELECT * FROM json
>     APPEND_ONLY_SCHEMA = true;
> This would then assign a json_id when a new json_type_name is encountered
> and otherwise add any new columns to the json_type_name VIEW. By declaring
> it as APPEND_ONLY_SCHEMA we can optimize the RPC traffic by doing no RPCs
> if all field names are already known by the metadata cached on the client,
> and otherwise doing a single RPC to create or alter the view with any new
> field names.
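The per-object DDL generation James describes can be sketched from the client side; this Python version is an illustrative assumption (the type mapping, view naming, and use of `json` as the base table are mine, not from the thread):

```python
# Sketch: derive a CREATE VIEW IF NOT EXISTS statement from one parsed
# JSON object, mapping Python value types to Phoenix SQL types.
# The mapping and names below are illustrative, not Phoenix-mandated.
PHOENIX_TYPES = {str: "VARCHAR", bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE"}

def view_ddl(json_type_name, obj):
    # One column per JSON field; bool is checked before int because
    # type(True) is bool, so the exact-type lookup picks BOOLEAN.
    cols = ", ".join(
        '"%s" %s' % (field, PHOENIX_TYPES[type(value)])
        for field, value in sorted(obj.items())
    )
    return (
        "CREATE VIEW IF NOT EXISTS %s (%s) AS SELECT * FROM json "
        "APPEND_ONLY_SCHEMA = true" % (json_type_name, cols)
    )

print(view_ddl("order_event", {"field1": 1, "field2": "a"}))
```

Executing the resulting statement once per processed object is what lets `IF NOT EXISTS` plus `APPEND_ONLY_SCHEMA` do the schema tracking.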
> The nice thing about this too is that the metadata is introspectable
> through the standard JDBC metadata APIs - you can look up the fields and
> their types based on the <json_type_name>.
> Note that APPEND_ONLY_SCHEMA and AUTO_PARTITION_SEQ are 4.8 features,
> but you can do the same pre-4.8 with a few more RPCs by catching the
> TableAlreadyExistsException and adding each field through ALTER VIEW calls.
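The pre-4.8 fallback amounts to diffing the incoming object's fields against the columns you already know about and issuing one ALTER VIEW per new field. A minimal sketch, assuming illustrative names and a type mapping of my own:

```python
# Sketch of the pre-4.8 path: after CREATE VIEW raises
# TableAlreadyExistsException, compute the ALTER VIEW statements needed
# for any fields not yet present. Names/types here are assumptions.
SQL_TYPES = {str: "VARCHAR", bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE"}

def new_column_ddl(view_name, known_fields, obj):
    stmts = []
    for field in sorted(obj):
        if field not in known_fields:
            stmts.append('ALTER VIEW %s ADD IF NOT EXISTS "%s" %s'
                         % (view_name, field, SQL_TYPES[type(obj[field])]))
    return stmts

# Only field2 is new, so only one ALTER VIEW is emitted.
print(new_column_ddl("order_event", {"field1"}, {"field1": 1, "field2": "a"}))
```

Each emitted statement costs an RPC, which is the overhead the 4.8 APPEND_ONLY_SCHEMA path avoids.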
> This is how we've modeled time-series data in support of Argus[1], not as
> JSON in this case, but as tags and a metric value.
> HTH.  Thanks,
> James
> [1]
> On Fri, Jun 3, 2016 at 8:41 AM, Steve Terrell <>
> wrote:
>> James, I don't understand how to use IF NOT EXISTS to track the dynamic
>> columns.  Could you elaborate?  Thanks!
>> On Fri, Jun 3, 2016 at 10:36 AM, James Taylor <>
>> wrote:
>>> That's pretty slick, Steve. Another variant along the same lines would
>>> be to create/alter a view over a base table where you use the IF NOT
>>> EXISTS. In this way, you let Phoenix track the dynamic columns for you. We
>>> have a new feature in 4.8 for declaring a table as an APPEND_ONLY_SCHEMA
>>> and we'll minimize the RPCs for ensuring the metadata is up to date. It
>>> works well for dynamically tracking schema as data is processed, a pretty
>>> common pattern.
>>> Thanks,
>>> James
>>> On Friday, June 3, 2016, Steve Terrell <> wrote:
>>>> I have a similar situation.  I have records with varying fields that I
>>>> wanted to access individually and also as a group.
>>>> My actual records are JSON objects, so they look like this:
>>>> {"field1": value1, "field2": value2, …}
>>>> To make matters harder, the fields are also varying types:  ints,
>>>> strings, booleans, floats.
>>>> What I ended up doing that has worked really well is to make each field
>>>> in the JSON record a dynamic field in a Phoenix table.  And, to preserve
>>>> the integrity of the original record, I have one static string field into
>>>> which the entire original JSON goes.
>>>> Because of the dynamic fields, I have to execute an UPSERT for each row
>>>> instead of some kind of batch loading.  But, I routinely load millions of
>>>> rows like this, sometimes with 20+ fields, and it has never failed on an
>>>> upsert and runs fast enough.
>>>> A few of the fields are actually always in each record, so I use those
>>>> for my primary keys.
>>>> The end result is great!  I can query by any field, quickly pull entire
>>>> records from the key fields, and any JSON object can be loaded.
>>>> In my code that translates the JSON into UPSERT statements I also keep
>>>> a running tab of field names, their type, and number of occurrences.  Then
>>>> at the end of my millions of UPSERTs I populate a simple table that just
>>>> lists all the fields/types/counts so that there is no mystery about what
>>>> fields are available when I want to query the big table.  I did this
>>>> because I don't always know what field names are in the JSON, and I don't
>>>> know how to get a list of dynamic field names from Phoenix SQL.
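The translate-and-tally loop Steve describes can be sketched as follows; the table name, static `raw_json` column, and type mapping are illustrative assumptions, not details from his actual code:

```python
# Sketch of JSON -> UPSERT with dynamic columns, plus the running tab of
# field names, types, and occurrence counts. All names are illustrative.
import json
from collections import Counter

SQL_TYPES = {str: "VARCHAR", bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE"}
field_counts = Counter()   # field name -> number of occurrences
field_types = {}           # field name -> Phoenix type last seen

def upsert_for(record_json, table="events"):
    obj = json.loads(record_json)
    cols, vals = ["raw_json"], ["?"]   # static column holds the original JSON
    for field, value in sorted(obj.items()):
        sql_type = SQL_TYPES[type(value)]
        field_counts[field] += 1
        field_types[field] = sql_type
        # Dynamic columns are declared inline as "name" TYPE in the UPSERT.
        cols.append('"%s" %s' % (field, sql_type))
        vals.append("?")
    return "UPSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), ", ".join(vals))

print(upsert_for('{"id": 7, "name": "x"}'))
```

After the load, `field_counts` and `field_types` are what you would write into the small fields/types/counts table.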
>>>> Hope this helps,
>>>>     Steve
>>>> On Thu, Jun 2, 2016 at 9:28 PM, Stephen Wilcoxon <>
>>>> wrote:
>>>>> The simple solution is to transform the multi-valued field into a
>>>>> separate linked table with multiple rows per "id".  If this is what you
>>>>> mean by "exploding the data", why do you not want to do so?
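The linked-table transform suggested here is mechanical: each comma-separated value becomes its own row keyed by the parent id. A small sketch (the `(id, name)` shape is an assumption about the target table):

```python
# Sketch: explode a comma-separated multi-valued field into one
# (id, name) row per value, suitable for a separate linked table
# that can then be indexed on name. The schema shape is assumed.
def explode(row_id, names_list):
    return [(row_id, name.strip())
            for name in names_list.split(",") if name.strip()]

print(explode(42, "Jack, Richard, Fred"))
```

Querying the linked table by `name` then hits an index instead of forcing a `LIKE '%Fred%'` full scan on the original column.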
>>>>> On Thu, Jun 2, 2016 at 7:11 PM, Rahul Jain <> wrote:
>>>>>> Folks,
>>>>>> I asked this question earlier on the wrong list; posting on the apache
>>>>>> phoenix user group on the advice of James Taylor:
>>>>>> We have a use case of migrating from a legacy database to
>>>>>> hbase/phoenix. I'd like to hear your thoughts on how to manage this
>>>>>> multi-valued field:
>>>>>> Field name : namesList
>>>>>> Contains: comma separated list of strings.
>>>>>>  e.g. "Jack, Richard, Fred"
>>>>>> The field does not have an inherent limit on the number of strings;
>>>>>> however, for a successful migration we need to support up to 100
>>>>>> strings in the field.
>>>>>> This field is currently indexed in our legacy DB technology, where the
>>>>>> user is able to query by any of the contained strings.
>>>>>> e.g. query:
>>>>>> select * from myTable where namesList LIKE '%Fred%';
>>>>>> Is there any way Phoenix supports generating multiple index values for
>>>>>> a single column in a row?  We are looking for a method where we don't
>>>>>> explicitly specify individual offsets for indexing (nameList[0],
>>>>>> nameList[1], etc.), but rather where a custom functional index
>>>>>> equivalent generates the indexes as a list and each index in the list
>>>>>> is persisted individually.
>>>>>> Effectively, our goal is to avoid full table scans when doing queries
>>>>>> that match values from multi-valued fields such as the example above.
>>>>>> We do want to avoid exploding the data wherever possible.
>>>>>> Any comments / suggestions are welcome.
>>>>>> Thank you !
>>>>>> Rahul
