kafka-users mailing list archives

From Andrew Otto <ao...@wikimedia.org>
Subject Re: integrate Camus and Hive?
Date Wed, 11 Mar 2015 17:42:39 GMT
Thanks,

Do you have this partitioner implemented?  Perhaps it would be good to get
it into Camus as a built-in option.  HivePartitioner? :)

-Ao


> On Mar 11, 2015, at 13:11, Bhavesh Mistry <mistry.p.bhavesh@gmail.com> wrote:
> 
> Hi Andrew,
> 
> You have to implement a custom partitioner and create whatever path you
> want (based on the message, e.g. a log line timestamp, or however you
> choose to build the directory hierarchy from each message).
> 
> You will need to provide your own Partitioner implementation:
> https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
> and set the configuration "etl.partitioner.class=CLASSNAME"; then you can
> organize the output any way you like.
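For a concrete idea of what such a partitioner might encode, here is a minimal, standalone sketch of the timestamp-to-path logic. To stay self-contained it does not extend Camus's com.linkedin.camus.etl.Partitioner; the class name and partition keys are hypothetical, borrowed from the partition_month_utc/partition_day_utc layout quoted below in this thread.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch of the path encoding a custom Camus partitioner
// could produce. In Camus itself, logic like this would live inside a
// subclass of com.linkedin.camus.etl.Partitioner wired up via
// etl.partitioner.class; this standalone class is illustrative only.
public class HivePartitionPath {

    // Encode a message timestamp (epoch millis, UTC) as a Hive-style
    // key=value partition path.
    public static String encode(long timestampMillis) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat month = new SimpleDateFormat("yyyy-MM");
        SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");
        month.setTimeZone(utc);
        day.setTimeZone(utc);
        Date d = new Date(timestampMillis);
        return "partition_month_utc=" + month.format(d)
             + "/partition_day_utc=" + day.format(d);
    }

    public static void main(String[] args) {
        // 2015-03-11T02:09:00Z
        System.out.println(encode(1426039740000L));
        // → partition_month_utc=2015-03/partition_day_utc=2015-03-11
    }
}
```

Because the output uses Hive's native key=value layout, partitions written this way can be picked up by MSCK REPAIR TABLE without any extra scripting.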
> 
> I hope this helps.
> 
> 
> Thanks,
> 
> Bhavesh
> 
> 
> On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto <aotto@wikimedia.org> wrote:
> 
>>> e.g. a file produced by the Camus job:  /user/[hive.user]/output/
>>> partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
>> 
>> Bhavesh, how do you get Camus to write into a directory hierarchy like
>> this?  Is it reading the partition values from your messages' timestamps?
>> 
>> 
>>> On Mar 11, 2015, at 11:29, Bhavesh Mistry <mistry.p.bhavesh@gmail.com> wrote:
>>> 
>>> HI Yang,
>>> 
>>> We do this today, Camus to Hive (without Avro), with just plain old
>>> tab-separated log lines.
>>> 
>>> We use the hive -f command to add dynamic partitions to the Hive table.
>>> 
>>> A bash shell script adds the time-bucket partitions to the Hive table
>>> before the Camus job runs:
>>> 
>>> for partition in "${@//\//,}"; do
>>>   echo "ALTER TABLE \${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION ($partition);"
>>> done | hive -f /dev/stdin
>>> 
>>> 
>>> e.g. a file produced by the Camus job:  /user/[hive.user]/output/
>>> partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
>>> 
>>> The above adds the Hive partitions before the Camus job runs.  It
>>> works, and you can use any schema:
>>> 
>>> CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
>>> SOME Table FIELDS...
>>> )
>>> PARTITIONED BY (
>>>   partition_month_utc STRING,
>>>   partition_day_utc STRING,
>>>   partition_minute_bucket STRING
>>> )
>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>> STORED AS SEQUENCEFILE
>>> LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
>>> ;
>>> 
>>> 
>>> I hope this helps!  You will have to construct your Hive queries
>>> according to the partitions you define.
>>> 
>>> Thanks,
>>> 
>>> Bhavesh
>>> 
>>> On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto <aotto@wikimedia.org> wrote:
>>> 
>>>>> Hive provides the ability to provide custom patterns for partitions.
>>>>> You can use this in combination with MSCK REPAIR TABLE to
>>>>> automatically detect and load the partitions into the metastore.
>>>> 
>>>> I tried this yesterday, and as far as I can tell it doesn’t work with
>>>> a custom partition layout.  At least not with external tables.  MSCK
>>>> REPAIR TABLE reports that there are directories in the table’s
>>>> location that are not partitions of the table, but it wouldn’t
>>>> actually add the partitions unless the directory layout matched Hive’s
>>>> default (key1=value1/key2=value2, etc.).
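Since MSCK REPAIR TABLE only recognizes Hive's key=value layout, one manual fallback is to generate the ADD PARTITION statements from the Camus directory names yourself. A rough sketch of that translation, assuming a /YYYY/MM/DD layout; the table name, path, and partition keys here are hypothetical, and this is not a Camus or Hive API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turn a Camus output directory like
// /camus/webrequest/2015/03/11 into the ALTER TABLE statement Hive needs,
// since MSCK REPAIR TABLE won't pick up a plain /YYYY/MM/DD layout.
public class CamusDirToPartition {

    private static final Pattern YMD =
            Pattern.compile(".*/(\\d{4})/(\\d{2})/(\\d{2})/?$");

    public static String toAddPartition(String table, String dir) {
        Matcher m = YMD.matcher(dir);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a /YYYY/MM/DD path: " + dir);
        }
        return String.format(
            "ALTER TABLE %s ADD IF NOT EXISTS PARTITION "
          + "(year='%s', month='%s', day='%s') LOCATION '%s';",
            table, m.group(1), m.group(2), m.group(3), dir);
    }

    public static void main(String[] args) {
        System.out.println(toAddPartition("webrequest", "/camus/webrequest/2015/03/11"));
    }
}
```

Statements generated this way could be fed to the hive CLI, much like the ALTER TABLE loop quoted earlier in this thread; using IF NOT EXISTS keeps the operation safe to re-run after every Camus job.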
>>>> 
>>>> 
>>>> 
>>>>> On Mar 9, 2015, at 17:16, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
>>>>> 
>>>>> If I understood your question correctly, you want to be able to read
>>>>> the output of Camus in Hive and be able to know the partition values.
>>>>> If my understanding is right, you can do so as follows.
>>>>> 
>>>>> Hive provides the ability to provide custom patterns for partitions.
>>>>> You can use this in combination with MSCK REPAIR TABLE to
>>>>> automatically detect and load the partitions into the metastore.
>>>>> 
>>>>> Take a look at this SO answer:
>>>>> http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
>>>>> 
>>>>> Does that help?
>>>>> 
>>>>> 
>>>>> On Mon, Mar 9, 2015 at 1:42 PM, Yang <teddyyyy123@gmail.com> wrote:
>>>>> 
>>>>>> I believe many users like us would export the output from Camus as
>>>>>> a Hive external table, but the dir structure of Camus is like
>>>>>> /YYYY/MM/DD/xxxxxx
>>>>>> 
>>>>>> while Hive generally expects /year=YYYY/month=MM/day=DD/xxxxxx if
>>>>>> you define that table to be partitioned by (year, month, day).
>>>>>> Otherwise you'd have to add the partitions created by Camus through
>>>>>> a separate command. But in the latter case, would a Camus job create
>>>>>> more than one partition? How would we find out the YYYY/MM/DD values
>>>>>> from outside? Well, you could always run hadoop dfs -ls and grep the
>>>>>> output, but that's kind of not clean....
>>>>>> 
>>>>>> 
>>>>>> thanks
>>>>>> yang
>>>>>> 
>>>> 
>>>> 
>> 
>> 

