drill-user mailing list archives

From Alexander Reshetov <alexander.v.reshe...@gmail.com>
Subject Re: Batch load of unstructured data in Drill
Date Thu, 08 Dec 2016 16:57:18 GMT
Hi John,

Thanks, I tried it with a directory containing several Parquet subdirectories.
It works, and the directory appears in Drill as a single Parquet data source.

It's not exactly what I want, but it's a good workaround. Thanks again.

On Wed, Dec 7, 2016 at 4:39 PM, John Omernik <john@omernik.com> wrote:
> Alexander -
>
> When I have something like this, especially when the output will be
> extremely large, I use CTAS into Parquet files. That said, I think you are
> looking more at the ETL process for JSON. So, ignoring the CTAS to Parquet
> for now, if you have a bunch of JSON files that will be loaded
> incrementally into Drill, I use the "hidden" directory feature of Drill.
> For this example, let's say you have a table (directory) named mytable.
> Inside it, you partition the table into subdirectories by day in
> YYYY-MM-DD format. So your directory structure may be
>
> - mytable
> ---- 2016-12-01
> ---- 2016-12-02
> ---- 2016-12-03
>
> For simplicity, let's assume the date is just the load date.  My ETL would
> be this
>
> 1. Batch job starts today, 2016-12-07
> 2. Check for the .2016-12-07 directory; if it does not exist, create it
> 3. Copy all new JSON into .2016-12-07
> 4. Check for the 2016-12-07 directory; if it does not exist, create it
> 5. Move all JSON in .2016-12-07 to 2016-12-07
> 6. Remove the .2016-12-07 directory
>
> The reason for this process is simple: if files are copied directly into
> the main table directory, Drill may read "partial" JSON records during a
> query on the main data, causing a query error. (Say a file is only
> partially copied when Drill tries to query it.) By default, Drill ignores
> directories that start with ".", so by using a load directory with a "."
> prefix you can copy all the data in your batch to the clustered file
> system and then use a filesystem mv command, which should be nearly
> instant (thus avoiding the query errors).
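
Here is a minimal Python sketch of the staging-and-move steps described above. It assumes the table directory sits on a local or POSIX-mounted file system where a rename within the same file system is effectively instant. On HDFS or another clustered file system you would use that system's own copy and move tools instead. The function name `load_batch` and its parameters are hypothetical and are not part of any Drill API.

```python
import shutil
from datetime import date
from pathlib import Path


def load_batch(table_dir, new_files, load_date=None):
    """Stage files in a dot-prefixed directory, then move them into the
    visible partition directory for the given load date."""
    table_dir = Path(table_dir)
    day = (load_date or date.today()).isoformat()  # YYYY-MM-DD
    staging = table_dir / f".{day}"  # dot prefix: ignored by Drill by default
    final = table_dir / day          # visible partition directory

    # Steps 2-3: create the hidden staging directory if needed and copy
    # the new files into it. Partially copied files stay invisible to queries.
    staging.mkdir(parents=True, exist_ok=True)
    for src in new_files:
        shutil.copy2(src, staging / Path(src).name)

    # Step 4: create the visible partition directory if needed.
    final.mkdir(exist_ok=True)

    # Step 5: move each staged file with a rename, which does not rewrite
    # the data when both directories are on the same file system.
    moved = []
    for staged in sorted(staging.iterdir()):
        target = final / staged.name
        staged.replace(target)
        moved.append(target)

    # Step 6: remove the now-empty staging directory.
    staging.rmdir()
    return moved
```

A call such as `load_batch("mytable", ["incoming/a.json"], date(2016, 12, 7))` would leave `mytable/2016-12-07/a.json` in place and no `.2016-12-07` directory behind.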
>
> This is simplistic, but you should get the idea.
>
> John
>
>
>
> On Wed, Dec 7, 2016 at 7:08 AM, Alexander Reshetov <
> alexander.v.reshetov@gmail.com> wrote:
>
>> Hello,
>>
>> I want to load batches of unstructured data in Drill. Mostly JSON data.
>>
>> Is there any batch API or other options to do so?
>>
>>
>> Thanks.
>>
