samza-dev mailing list archives

From Selina Tech <swucaree...@gmail.com>
Subject Re: Re: questions of partition and task of Samza
Date Thu, 29 Oct 2015 19:33:15 GMT
Hi, Yan:

     Thanks a lot for your answer.

Sincerely,
Selina

On Mon, Oct 26, 2015 at 8:03 PM, Yan Fang <yanfangwork@163.com> wrote:

> Hi Selina,
>
>
> Your understanding is correct. Yes, you need to consume the original
> input, send it back to Kafka with the key reset to departmentName, and
> then consume it again to count in Samza if you want to count the number
> of students with the same departmentName. This is a typical aggregation
> use case: after aggregating the students in the same department, you can
> do more than just "count". :)
>
>
> Cheers,
> Yan
>
>
> At 2015-10-25 06:12:50, "Selina Tech" <swucareer99@gmail.com> wrote:
> >Hi, Yan:
> >
> >      Thanks a lot for your reply.
> >
> >      You mentioned "if you give the msgs the same partition key". Does
> >that mean the same partition key value or the same partition key
> >attribute name?
> >
> >       By "primary key" I meant the "key" in public
> >KeyedMessage(java.lang.String topic, K key, V message), but you can ignore
> >that. I explain it another way below.
> >
> >       If I need to aggregate data, but the data are not in the same
> >partition, do we need to consume the data, put it back into Kafka with a
> >new key, and then consume it again and aggregate it in Samza?
> >
> >      For example, messages about student GPA information were sent to
> >Kafka keyed by K key (String schoolName). Each message looks like "name,
> >schoolName, departmentName, grade, GPA". Assuming I have 3 partitions, my
> >understanding is that all student records for one school should go to the
> >same partition.
> >
> >      Right now I need to aggregate data for the same department, no
> >matter which school, which means messages with the same departmentName
> >may be spread across three different partitions. If I just count in one
> >Samza job, will the result be correct? Do I need to consume the original
> >input, send it back to Kafka with the key reset to departmentName, and
> >then consume it again to count in Samza?
> >
> >     If I have misunderstood partitions and tasks in Samza, please
> >correct me.
> >
> >Sincerely,
> >Selina
> >
> >On Sat, Oct 24, 2015 at 2:45 AM, Yan Fang <yanfangwork@163.com> wrote:
> >
> >>
> >>
> >> Hi Selina,
> >>
> >>
> >> What do you mean by "primary key" here? Is it one of the partitions of
> >> "input", or something like "if a msg meets condition x, we consider the
> >> msg to have the primary key"?
> >>
> >>
> >> If you just want to count the msgs, you can count in one Samza job and
> >> send the result to "output" topic. You can send to any partition of the
> >> "output" if you give the msgs the same partition key.
> >>
> >>
> >> Thanks,
> >> Yan
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> At 2015-10-22 08:30:15, "Selina Tech" <swucareer99@gmail.com> wrote:
> >> >Hi, All:
> >> >
> >> >        The Samza documentation says "Each task consumes data from
> >> >one partition for each of the job’s input streams." Does that mean the
> >> >result will be wrong if the data one job processes is not all in one
> >> >partition?
> >> >
> >> >        Assume my Samza input data on the Kafka topic "input" is
> >> >partitioned by the default, round robin, and that I have five
> >> >partitions. My Samza job is to count messages by the primary key of
> >> >each message in the "input" topic, and then output the counts to the
> >> >Kafka topic "output".
> >> >
> >> >       So I think I need the steps below:
> >> >      1. read data from Kafka topic "input"
> >> >      2. reset the partition key to "primary key" in Samza
> >> >      3. produce it back to Kafka topic named as "temp"
> >> >      4. read "temp" topic at Samza
> >> >      5. count it in Samza
> >> >      6. write it to Kafka topic named as "output"
> >> >
> >> >      If I just read data from the Kafka topic "input", count it in
> >> >Samza, and write it to the topic "output", the result will not be
> >> >correct, because there might be multiple messages for the same
> >> >"primary key" in the "output" topic. Do I understand it correctly?
> >> >
> >> >Sincerely,
> >> >Selina
> >>
>

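The re-keying discussed in this thread can be illustrated with a small plain-Java simulation. This is only a sketch under stated assumptions: it uses no Kafka or Samza classes, the class RepartitionSketch, the Student type, and the sample data are hypothetical, and the hash-mod function partitionFor merely stands in for whatever key-based partitioner the producer actually uses. It shows that when records are keyed by schoolName, each task (one per partition) holds only a partial count for a department, while re-keying by departmentName puts every record of a department in front of a single task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java simulation; it does not use any Kafka or Samza classes.
class RepartitionSketch {

    // Hypothetical record type holding only the fields needed for the count.
    static final class Student {
        final String name;
        final String school;
        final String department;

        Student(String name, String school, String department) {
            this.name = name;
            this.school = school;
            this.department = department;
        }
    }

    // Assumption: a simple hash-mod partitioner standing in for the producer's
    // key-based partitioning. Equal keys always map to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Distributes records over partitions using the key chosen by keyOf.
    static List<List<Student>> partitionBy(List<Student> records,
                                           Function<Student, String> keyOf,
                                           int numPartitions) {
        List<List<Student>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Student s : records) {
            partitions.get(partitionFor(keyOf.apply(s), numPartitions)).add(s);
        }
        return partitions;
    }

    // Each simulated task sees exactly one partition and counts the students
    // per department within that partition only.
    static List<Map<String, Integer>> countPerTask(List<List<Student>> partitions) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (List<Student> partition : partitions) {
            Map<String, Integer> perDepartment = new TreeMap<>();
            for (Student s : partition) {
                perDepartment.merge(s.department, 1, Integer::sum);
            }
            counts.add(perDepartment);
        }
        return counts;
    }

    // Made-up sample data: three schools, two departments.
    static List<Student> sampleStudents() {
        List<Student> students = new ArrayList<>();
        students.add(new Student("s1", "A", "CS"));
        students.add(new Student("s2", "B", "CS"));
        students.add(new Student("s3", "C", "CS"));
        students.add(new Student("s4", "A", "Math"));
        students.add(new Student("s5", "B", "Math"));
        return students;
    }

    public static void main(String[] args) {
        List<Student> students = sampleStudents();
        // Keyed by school: the same department can show up in several tasks.
        System.out.println("by school:     "
                + countPerTask(partitionBy(students, s -> s.school, 3)));
        // Re-keyed by department: each department is counted by exactly one task.
        System.out.println("by department: "
                + countPerTask(partitionBy(students, s -> s.department, 3)));
    }
}
```

With this sample data, the school-keyed run reports "CS" in all three tasks with a partial count each, while the department-keyed run reports each department in exactly one task. In a real job the second pass corresponds to the intermediate "temp" topic in the steps quoted above.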