kafka-users mailing list archives

From Bhavesh Mistry <mistry.p.bhav...@gmail.com>
Subject Re: integrate Camus and Hive?
Date Wed, 11 Mar 2015 21:38:21 GMT
Hi Andrew,

I would say Camus is generic enough as it is (but you can propose this to the
Camus team).

Here are sample code and methods you can use to create any path or directory
structure (and a corresponding Hive table schema for it).

public class UTCLogPartitioner extends Partitioner {

    @Override
    public String encodePartition(JobContext context, IEtlKey key) {
        // Bucket size in milliseconds, from the configured partition interval
        long outfilePartitionMs =
                EtlMultiOutputFormat.getEtlOutputFileTimePartitionMins(context) * 60000L;
        // Bucket the message's timestamp into its partition window
        return "" + DateUtils.getPartition(outfilePartitionMs, key.getTime());
    }

    @Override
    public String generatePartitionedPath(JobContext context, String topic,
            String brokerId, int partitionId, String encodedPartition) {
        StringBuilder sb = new StringBuilder();
        sb.append("Create your HDFS custom path here");
        return sb.toString();
    }

}
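For reference, the time-bucketing that encodePartition relies on can be
sketched as plain arithmetic. This is a minimal sketch, not the actual Camus
DateUtils.getPartition implementation; the class and method names here are
illustrative:

```java
public class PartitionSketch {

    // Round a message timestamp down to the start of its time bucket.
    // Every message in the same N-minute window maps to the same
    // partition value, which is what a time-based partitioner needs.
    static long bucketStartMs(long timestampMs, long bucketSizeMs) {
        return timestampMs - (timestampMs % bucketSizeMs);
    }

    public static void main(String[] args) {
        long tenMinutesMs = 10 * 60000L;  // a 10-minute partition interval
        long ts = 1_426_110_000_123L;     // some message timestamp
        long bucket = bucketStartMs(ts, tenMinutesMs);
        System.out.println(bucket);       // 1426110000000 (window start)
    }
}
```

generatePartitionedPath can then turn that bucket value into whatever
directory hierarchy you want, e.g. the partition_month_utc=.../... layout
shown below.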

Thanks,
Bhavesh

On Wed, Mar 11, 2015 at 10:42 AM, Andrew Otto <aotto@wikimedia.org> wrote:

> Thanks,
>
> Do you have this partitioner implemented?  Perhaps it would be good to try
> to get this into Camus as a built-in option.  HivePartitioner? :)
>
> -Ao
>
>
> > On Mar 11, 2015, at 13:11, Bhavesh Mistry <mistry.p.bhavesh@gmail.com> wrote:
> >
> > Hi Andrew,
> >
> > You have to implement a custom partitioner, and you will also have to
> > create whatever path you want (based on the message, e.g. a log-line
> > timestamp, or however you choose to build the directory hierarchy from
> > each message).
> >
> > You will need to implement your own Partitioner class:
> > https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
> > and use the configuration "etl.partitioner.class=CLASSNAME"; then you can
> > organize the output any way you like.
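A hypothetical camus.properties fragment wiring such a partitioner in. The
class name is illustrative, and the bucket-size property name is inferred from
the EtlMultiOutputFormat getter used above, so verify both against your Camus
version:

```properties
# Use the custom partitioner (hypothetical class name):
etl.partitioner.class=com.example.UTCLogPartitioner
# Size of each output time bucket, in minutes:
etl.output.file.time.partition.mins=10
```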
> >
> > I hope this helps.
> >
> >
> > Thanks,
> >
> > Bhavesh
> >
> >
> > On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto <aotto@wikimedia.org> wrote:
> >
> >>> e.g. File produced by the Camus job:
> >>> /user/[hive.user]/output/partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
> >>
> >> Bhavesh, how do you get Camus to write into a directory hierarchy like
> >> this?  Is it reading the partition values from your messages' timestamps?
> >>
> >>
> >>> On Mar 11, 2015, at 11:29, Bhavesh Mistry <mistry.p.bhavesh@gmail.com> wrote:
> >>>
> >>> Hi Yang,
> >>>
> >>> We do this today, Camus to Hive (without Avro), with just plain old
> >>> tab-separated log lines.
> >>>
> >>> We use the hive -f command to add dynamic partitions to the Hive table:
> >>>
> >>> A bash shell script adds the time buckets to the Hive table before the
> >>> Camus job runs:
> >>>
> >>> for partition in "${@//\//,}"; do
> >>>   echo "ALTER TABLE ${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION ($partition);"
> >>> done | hive -f /dev/stdin
> >>>
> >>>
> >>> e.g. File produced by the Camus job:
> >>> /user/[hive.user]/output/partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
> >>>
> >>> The above adds the Hive dynamic partitions before the Camus job runs.
> >>> It works, and you can have any schema:
> >>>
> >>> CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
> >>> SOME Table FIELDS...
> >>> )
> >>> PARTITIONED BY (
> >>>   partition_month_utc STRING,
> >>>   partition_day_utc STRING,
> >>>   partition_minute_bucket STRING
> >>> )
> >>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> >>> STORED AS SEQUENCEFILE
> >>> LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
> >>> ;
> >>>
> >>>
> >>> I hope this helps!  You will have to construct the Hive query according
> >>> to the partitions you define.
> >>>
> >>> Thanks,
> >>>
> >>> Bhavesh
> >>>
> >>> On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto <aotto@wikimedia.org>
> >> wrote:
> >>>
> >>>>> Hive provides the ability to provide custom patterns for partitions.  You
> >>>>> can use this in combination with MSCK REPAIR TABLE to automatically detect
> >>>>> and load the partitions into the metastore.
> >>>>
> >>>> I tried this yesterday, and as far as I can tell it doesn’t work with a
> >>>> custom partition layout.  At least not with external tables.  MSCK REPAIR
> >>>> TABLE reports that there are directories in the table’s location that are
> >>>> not partitions of the table, but it wouldn’t actually add the partition
> >>>> unless the directory layout matched Hive’s default
> >>>> (key1=value1/key2=value2, etc.)
> >>>>
> >>>>
> >>>>
> >>>>> On Mar 9, 2015, at 17:16, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
> >>>>>
> >>>>> If I understood your question correctly, you want to be able to read the
> >>>>> output of Camus in Hive and be able to know the partition values.  If my
> >>>>> understanding is right, you can do so by using the following.
> >>>>>
> >>>>> Hive provides the ability to provide custom patterns for partitions.  You
> >>>>> can use this in combination with MSCK REPAIR TABLE to automatically detect
> >>>>> and load the partitions into the metastore.
> >>>>>
> >>>>> Take a look at this SO post:
> >>>>> http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
> >>>>>
> >>>>> Does that help?
> >>>>>
> >>>>>
> >>>>> On Mon, Mar 9, 2015 at 1:42 PM, Yang <teddyyyy123@gmail.com> wrote:
> >>>>>
> >>>>>> I believe many users like us would export the output from Camus as a
> >>>>>> Hive external table, but the dir structure of Camus is like
> >>>>>> /YYYY/MM/DD/xxxxxx
> >>>>>>
> >>>>>> while Hive generally expects /year=YYYY/month=MM/day=DD/xxxxxx if you
> >>>>>> define that table to be partitioned by (year, month, day).  Otherwise
> >>>>>> you'd have to add those partitions created by Camus through a separate
> >>>>>> command.  But in the latter case, would a Camus job create >1 partitions?
> >>>>>> How would we find out the YYYY/MM/DD values from outside?  Well, you
> >>>>>> could always do something with hadoop dfs -ls and then grep the output,
> >>>>>> but it's kind of not clean....
> >>>>>>
> >>>>>>
> >>>>>> thanks
> >>>>>> yang
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>
