kafka-users mailing list archives

From Bhavesh Mistry <mistry.p.bhav...@gmail.com>
Subject Re: integrate Camus and Hive?
Date Wed, 11 Mar 2015 17:11:32 GMT
Hi Ad

You have to implement a custom partitioner, and it will have to build
whatever output path you want (based on the message, e.g. the log line
timestamp, or however you choose to derive the directory hierarchy from
each message).

You will need to write your own implementation of the Partitioner class:
https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
and set the configuration property "etl.partitioner.class=CLASSNAME"; then
you can organize the output any way you like.
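
As a rough illustration of the path-building part, here is a minimal Java
sketch that turns a message timestamp into the month/day/minute-bucket layout
quoted below. TimeBucketPaths, partitionPath, and the bucketMinutes parameter
are hypothetical names, not part of Camus, and the yyyy-MM-dd-HH-mm bucket
format is an assumption read off the example path. A real partitioner would
extend the Partitioner class linked above and could call a helper like this
from whichever method builds the output path.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for illustration only; it is not part of the Camus API. */
final class TimeBucketPaths {
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyy-MM");
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    // Assumed bucket format, inferred from "2015-03-11-02-09" in the example path.
    private static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm");

    private TimeBucketPaths() {}

    /**
     * Builds a Hive-style key=value directory path (UTC) for one message timestamp.
     * bucketMinutes is the bucket width; it must divide 60 evenly.
     */
    static String partitionPath(long epochMillis, int bucketMinutes) {
        if (bucketMinutes <= 0 || 60 % bucketMinutes != 0) {
            throw new IllegalArgumentException("bucketMinutes must divide 60: " + bucketMinutes);
        }
        ZonedDateTime utc = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Round the minute down to the start of its bucket.
        int flooredMinute = utc.getMinute() - (utc.getMinute() % bucketMinutes);
        ZonedDateTime bucket = utc.withMinute(flooredMinute);
        return "partition_month_utc=" + MONTH.format(utc)
                + "/partition_day_utc=" + DAY.format(utc)
                + "/partition_minute_bucket=" + MINUTE.format(bucket);
    }

    public static void main(String[] args) {
        long timestamp = Instant.parse("2015-03-11T02:09:30Z").toEpochMilli();
        System.out.println(partitionPath(timestamp, 1));
    }
}
```

Because the directory names already follow Hive's key=value convention, the
same strings can be reused when registering the partitions in the metastore.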

I hope this helps.


Thanks,

Bhavesh


On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto <aotto@wikimedia.org> wrote:

> > e.g. file produced by the Camus job:
> > /user/[hive.user]/output/partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
>
> Bhavesh, how do you get Camus to write into a directory hierarchy like
> this?  Is it reading the partition values from your messages' timestamps?
>
>
> > On Mar 11, 2015, at 11:29, Bhavesh Mistry <mistry.p.bhavesh@gmail.com> wrote:
> >
> > Hi Yang,
> >
> > We do this today from Camus to Hive (without Avro), using just plain old
> > tab-separated log lines.
> >
> > We use the hive -f command to add the partitions to the Hive table.
> >
> > A Bash shell script adds the time buckets to the Hive table before the
> > Camus job runs:
> >
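> > # Each argument is a partition path (key1=value1/key2=value2);
> > # replacing "/" with "," turns it into a Hive partition spec.
> > for partition in "${@//\//,}"; do
> >   echo "ALTER TABLE ${TABLE_NAME} ADD IF NOT EXISTS PARTITION ($partition);"
> > done | hive -f /dev/stdin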
> >
> >
> > e.g. file produced by the Camus job:
> > /user/[hive.user]/output/partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
> >
> > The script above adds the Hive partitions before the Camus job runs.  It
> > works, and you can use any schema:
> >
> > CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
> >  SOME Table FIELDS...
> >  )
> >  PARTITIONED BY (
> >    partition_month_utc STRING,
> >    partition_day_utc STRING,
> >    partition_minute_bucket STRING
> >  )
> >  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> >  STORED AS SEQUENCEFILE
> >  LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
> > ;
> >
> >
> > I hope this helps!  You will have to construct your Hive queries
> > according to the partitions you define.
> >
> > Thanks,
> >
> > Bhavesh
> >
> > On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto <aotto@wikimedia.org> wrote:
> >
> >>> Hive provides the ability to provide custom patterns for partitions. You
> >>> can use this in combination with MSCK REPAIR TABLE to automatically
> >>> detect and load the partitions into the metastore.
> >>
> >> I tried this yesterday, and as far as I can tell it doesn’t work with a
> >> custom partition layout.  At least not with external tables.  MSCK REPAIR
> >> TABLE reports that there are directories in the table’s location that are
> >> not partitions of the table, but it wouldn’t actually add the partition
> >> unless the directory layout matched Hive’s default
> >> (key1=value1/key2=value2, etc.)
> >>
> >>
> >>
> >>> On Mar 9, 2015, at 17:16, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
> >>>
> >>> If I understood your question correctly, you want to be able to read the
> >>> output of Camus in Hive and be able to know partition values. If my
> >>> understanding is right, you can do so by using the following.
> >>>
> >>> Hive provides the ability to provide custom patterns for partitions. You
> >>> can use this in combination with MSCK REPAIR TABLE to automatically
> >>> detect and load the partitions into the metastore.
> >>>
> >>> Take a look at this SO
> >>> http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
> >>>
> >>> Does that help?
> >>>
> >>>
> >>> On Mon, Mar 9, 2015 at 1:42 PM, Yang <teddyyyy123@gmail.com> wrote:
> >>>
> >>>> I believe many users like us would export the output from camus as a
> >>>> hive external table. but the dir structure of camus is like
> >>>> /YYYY/MM/DD/xxxxxx
> >>>>
> >>>> while hive generally expects /year=YYYY/month=MM/day=DD/xxxxxx if you
> >>>> define that table to be
> >>>> partitioned by (year, month, day). otherwise you'd have to add those
> >>>> partitions created by camus through a separate command. but in the
> >>>> latter case, would a camus job create >1 partitions ? how would we find
> >>>> out the YYYY/MM/DD values from outside ? ---- well you could always do
> >>>> something by hadoop dfs -ls and then grep the output, but it's kind of
> >>>> not clean....
> >>>>
> >>>>
> >>>> thanks
> >>>> yang
> >>>>
> >>
> >>
>
>
