kafka-users mailing list archives

From Andrew Otto <ao...@wikimedia.org>
Subject Re: integrate Camus and Hive?
Date Wed, 11 Mar 2015 15:36:32 GMT
> e.g. file produced by the Camus job:  /user/[hive.user]/output/
> *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*

Bhavesh, how do you get Camus to write into a directory hierarchy like this?  Is it reading
the partition values from your messages' timestamps?


> On Mar 11, 2015, at 11:29, Bhavesh Mistry <mistry.p.bhavesh@gmail.com> wrote:
> 
> Hi Yang,
> 
> We do this today, Camus to Hive, without Avro: just plain old tab-separated
> log lines.
> 
> We use the hive -f command to add partitions to the Hive table dynamically.
> 
> A Bash shell script adds the time buckets to the Hive table before the Camus
> job runs:
> 
> for partition in "${@//\//,}"; do
>   echo "ALTER TABLE ${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION ($partition);"
> done | hive -f
> 
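
A minimal sketch of the same idea, written as two shell functions that only print the statements so the output can be inspected before it reaches Hive. The function names `partition_spec` and `add_partition_statements` are hypothetical, and the sketch assumes every partition column is a STRING (so each value is wrapped in single quotes) and that values contain no `/` or `'` characters.

```shell
# Hypothetical helper: turn "k1=v1/k2=v2" into "k1='v1', k2='v2'".
# Runs in a subshell so the IFS and globbing changes do not leak out.
partition_spec() {
  (
    IFS=/
    set -f                      # no filename expansion while splitting
    spec=""
    for part in $1; do          # split the path on "/"
      [ -n "$part" ] || continue
      spec="${spec:+$spec, }${part%%=*}='${part#*=}'"
    done
    printf '%s\n' "$spec"
  )
}

# Print one ALTER TABLE statement per partition path.
# $1 is the table name; the remaining arguments are key=value/... paths.
add_partition_statements() {
  table=$1
  shift
  for partition_path in "$@"; do
    printf 'ALTER TABLE %s ADD IF NOT EXISTS PARTITION (%s);\n' \
      "$table" "$(partition_spec "$partition_path")"
  done
}
```

One way to use it would be to redirect the output of `add_partition_statements` into a temporary file and run `hive -f` on that file before the Camus job starts.
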
> 
> e.g. file produced by the Camus job:  /user/[hive.user]/output/
> *partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/*
> 
> The script above adds the Hive partitions before the Camus job runs.  It
> works, and you can use any schema:
> 
> CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
>  SOME Table FIELDS...
>  )
>  PARTITIONED BY (
>    partition_month_utc STRING,
>    partition_day_utc STRING,
>    partition_minute_bucket STRING
>  )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>  STORED AS SEQUENCEFILE
>  LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
> ;
> 
> 
> I hope this helps!  You will have to construct your Hive queries according
> to the partitions you define.
> 
> Thanks,
> 
> Bhavesh
> 
> On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto <aotto@wikimedia.org> wrote:
> 
>>> Hive provides the ability to provide custom patterns for partitions. You
>>> can use this in combination with MSCK REPAIR TABLE to automatically detect
>>> and load the partitions into the metastore.
>> 
>> I tried this yesterday, and as far as I can tell it doesn’t work with a
>> custom partition layout.  At least not with external tables.  MSCK REPAIR
>> TABLE reports that there are directories in the table’s location that are
>> not partitions of the table, but it wouldn’t actually add the partition
>> unless the directory layout matched Hive’s default
>> (key1=value1/key2=value2, etc.)
>> 
>> 
>> 
>>> On Mar 9, 2015, at 17:16, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
>>> 
>>> If I understood your question correctly, you want to be able to read the
>>> output of Camus in Hive and be able to know partition values. If my
>>> understanding is right, you can do so by using the following.
>>> 
>>> Hive provides the ability to provide custom patterns for partitions. You
>>> can use this in combination with MSCK REPAIR TABLE to automatically detect
>>> and load the partitions into the metastore.
>>> 
>>> Take a look at this SO question:
>>> 
>>> http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
>>> 
>>> Does that help?
>>> 
>>> 
>>> On Mon, Mar 9, 2015 at 1:42 PM, Yang <teddyyyy123@gmail.com> wrote:
>>> 
>>>> I believe many users like us would export the output from Camus as a Hive
>>>> external table, but the directory structure of Camus is like
>>>> /YYYY/MM/DD/xxxxxx
>>>> 
>>>> while Hive generally expects /year=YYYY/month=MM/day=DD/xxxxxx if you
>>>> define that table to be partitioned by (year, month, day). Otherwise you'd
>>>> have to add the partitions created by Camus through a separate command.
>>>> But in the latter case, would a Camus job create more than one partition?
>>>> How would we find out the YYYY/MM/DD values from outside? Well, you could
>>>> always run hadoop dfs -ls and grep the output, but that's not very clean.
>>>> 
>>>> 
>>>> thanks
>>>> yang
>>>> 
>> 
>> 
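
For the /YYYY/MM/DD layout described in the original question, one possible approach is to register each directory explicitly with a LOCATION clause, which does not require the key=value naming. The sketch below is only an illustration: the function name `ymd_partition_statements` is hypothetical, it assumes the table is partitioned by STRING columns named year, month and day, and it assumes the directory paths arrive one per line on standard input.

```shell
# Hypothetical helper: read directory paths ending in /YYYY/MM/DD on stdin
# and print one ADD PARTITION ... LOCATION statement per matching path.
# $1 is the table name. Paths that do not end in /YYYY/MM/DD are skipped.
ymd_partition_statements() {
  table=$1
  while IFS= read -r dir; do
    dir=${dir%/}                        # drop a trailing slash, if any
    day=${dir##*/};    rest=${dir%/*}   # last path component
    month=${rest##*/}; rest=${rest%/*}  # second to last
    year=${rest##*/}                    # third to last
    case "$year/$month/$day" in
      [0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9])
        printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (year='%s', month='%s', day='%s') LOCATION '%s';\n" \
          "$table" "$year" "$month" "$day" "$dir"
        ;;
    esac
  done
}
```

The directory list could come from a recursive HDFS listing with the path column extracted, and the printed statements could be saved to a file and run with `hive -f`. Because the partition values are taken from the path itself, a Camus run that writes several days produces one statement per day directory.
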

