pulsar-users mailing list archives

From "Apache Pulsar Slack" <apache.pulsar.sl...@gmail.com>
Subject Slack digest for #general - 2019-04-20
Date Sat, 20 Apr 2019 09:11:02 GMT
2019-04-19 11:47:28 UTC - Mr BECHAMKI: @Mr BECHAMKI has joined the channel
----
2019-04-19 13:18:44 UTC - stefan: Hi. I am having trouble reinitializing the cluster metadata.
I end up with: Exception in thread "main" org.apache.zookeeper.KeeperException$NodeExistsException:
KeeperErrorCode = NodeExists for /namespace
----
2019-04-19 13:31:06 UTC - Ruud Kamphuis: @Ruud Kamphuis has joined the channel
----
2019-04-19 13:33:55 UTC - stefan: Hi guys. When running locally on my laptop, I end up with
a connection refused: Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException:
Connection refused: localhost/127.0.0.1:6650. Any help appreciated
----
2019-04-19 13:34:47 UTC - Ruud Kamphuis: There seems to be a typo in your address, it reads
`localhost/127.0.0.1:6650` thats not good
----
2019-04-19 13:35:03 UTC - Ruud Kamphuis: it should be `localhost:6650` or `127.0.0.1:6650`
----
2019-04-19 13:36:06 UTC - Ruud Kamphuis: Hello everyone. I read the whole FAQ (<https://github.com/apache/pulsar/blob/master/faq.md>)
but couldn't find the answer to this question:

Is it possible to have multiple consumers listening to 1 topic that have their own subscription
type? For example, I have an ETL consumer that wants to make sure it gets all the messages.
And I have a Stats consumer that keeps track of stats. I want to make sure there is only
1 ETL consumer, and only 1 Stats consumer.
----
2019-04-19 13:36:38 UTC - Ruud Kamphuis: As far as I know, 1 topic can only have 1 subscription?
Or is there a way to somehow group consumers by consumerName ?
----
2019-04-19 13:37:28 UTC - stefan: agreed. i just downloaded it with wget, launched standalone with
`bin/pulsar standalone`, and called: `./bin/pulsar-client produce my-topic --messages "hello-pulsar"`
----
2019-04-19 13:37:35 UTC - stefan: i did not even touch the conf
----
2019-04-19 13:38:15 UTC - Sijie Guo: 1 topic can have as many subscriptions as you want
----
2019-04-19 13:38:25 UTC - Sijie Guo: each subscription can choose its own subscription type.
----
2019-04-19 13:38:38 UTC - Sijie Guo: consumers that use the same subscription name are in the same
consumer group.
----
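A minimal sketch of the subscription model described above, written as a toy in-memory model rather than with the Pulsar client API. The class names (`Topic`, `Subscription`) and the names `etl`, `stats`, `etl-1`, `stats-1`, `stats-2` are hypothetical. It illustrates three points from the answer: every subscription on a topic sees every message, each subscription picks its own type, and consumers that share a subscription name split that subscription's messages. Using `Exclusive` for both subscriptions would give the "only 1 ETL consumer, only 1 Stats consumer" guarantee asked about.

```python
class Subscription:
    """Toy model of one named subscription (not the Pulsar client API)."""

    def __init__(self, name, sub_type):
        self.name = name
        self.sub_type = sub_type   # "Exclusive" or "Shared" in this sketch
        self.consumers = {}        # consumer name -> messages it received
        self._next = 0             # round-robin cursor for Shared delivery

    def add_consumer(self, consumer_name):
        # An Exclusive subscription accepts a single consumer only.
        if self.sub_type == "Exclusive" and self.consumers:
            raise RuntimeError(f"subscription {self.name!r} already has a consumer")
        self.consumers[consumer_name] = []

    def deliver(self, message):
        if not self.consumers:
            return
        names = list(self.consumers)
        target = names[self._next % len(names)]
        self._next += 1
        self.consumers[target].append(message)


class Topic:
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, sub_name, sub_type, consumer_name):
        # The first subscriber fixes the type; later ones join the same group.
        if sub_name not in self.subscriptions:
            self.subscriptions[sub_name] = Subscription(sub_name, sub_type)
        sub = self.subscriptions[sub_name]
        sub.add_consumer(consumer_name)
        return sub

    def publish(self, message):
        # Every subscription gets its own copy of every message.
        for sub in self.subscriptions.values():
            sub.deliver(message)


topic = Topic()
etl = topic.subscribe("etl", "Exclusive", "etl-1")
stats = topic.subscribe("stats", "Shared", "stats-1")
topic.subscribe("stats", "Shared", "stats-2")

for i in range(4):
    topic.publish(i)

print(etl.consumers)    # {'etl-1': [0, 1, 2, 3]}
print(stats.consumers)  # {'stats-1': [0, 2], 'stats-2': [1, 3]}
```
----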
2019-04-19 13:44:34 UTC - Ruud Kamphuis: Is this documented somewhere? Because I read through
the whole docs and faq but couldn't find it.
----
2019-04-19 13:44:40 UTC - Ruud Kamphuis: (thanks for your answer btw!)
----
2019-04-19 13:46:35 UTC - Ruud Kamphuis: Ah, I know what I was doing wrong.

I saw
`<ws://broker-service-url:8080/ws/v2/consumer/persistent/:tenant/:namespace/:topic/:subscription>`

And thought that `:subscription` was the type, so I entered `shared` there..

But that's just the name of the subscription, nice!
----
2019-04-19 13:49:39 UTC - Sijie Guo: :+1:
----
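The mix-up above (the last path segment is the subscription name, not its type) can be sketched with a small URL builder. This assumes the type is passed as a `subscriptionType` query parameter, which is how the WebSocket consumer endpoint is documented as far as I know; check it against the docs for your Pulsar version. The helper name `consumer_url`, the tenant/namespace values `public`/`default`, and the subscription name `etl` are illustrative assumptions.

```python
from urllib.parse import quote, urlencode


def consumer_url(host, tenant, namespace, topic, subscription,
                 subscription_type="Exclusive"):
    """Build a WebSocket consumer URL (hypothetical helper).

    `subscription` is the subscription *name* and goes in the path;
    the type is assumed to travel as the `subscriptionType` query parameter.
    """
    path = "/".join(quote(part, safe="")
                    for part in (tenant, namespace, topic, subscription))
    query = urlencode({"subscriptionType": subscription_type})
    return f"ws://{host}/ws/v2/consumer/persistent/{path}?{query}"


url = consumer_url("broker-service-url:8080", "public", "default",
                   "my-topic", "etl", "Shared")
print(url)
```
----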
2019-04-19 13:57:35 UTC - Kai Levy: I understand ZK's general role, I am just hoping to get
into the specifics. For example, does using pulsar's reader interface cause writes on ZK,
like creating a subscription does?
----
2019-04-19 14:34:50 UTC - Ruud Kamphuis: Another question: when using websockets with
a schema, is it still required to base64 encode the `payload`? Or can you send a message like
this:
```
{ "payload": { "id": 1, "event": "some-event" } }
```

Maybe I misunderstand the schemas thing.
----
2019-04-19 14:35:26 UTC - Ruud Kamphuis: So if I change my schema from `None` to `JSON`
----
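The thread does not answer whether a `JSON` schema lifts the base64 requirement. As a reference point, here is a sketch of the form in which `payload` is a base64-encoded string, assuming that is what the WebSocket producer endpoint expects. The helper name `build_ws_message` is hypothetical; the example value is the one from the question.

```python
import base64
import json


def build_ws_message(value, properties=None):
    """Wrap a JSON-serializable value as a base64 `payload` (hypothetical helper)."""
    raw = json.dumps(value, separators=(",", ":")).encode("utf-8")
    frame = {"payload": base64.b64encode(raw).decode("ascii")}
    if properties:
        frame["properties"] = properties
    return json.dumps(frame)


frame = build_ws_message({"id": 1, "event": "some-event"})

# Round trip: what a receiver would do to recover the original object.
decoded = json.loads(base64.b64decode(json.loads(frame)["payload"]))
print(decoded)  # {'id': 1, 'event': 'some-event'}
```
----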
2019-04-19 15:17:38 UTC - Joe Francis: Readers have no persistent state, so no.
----
2019-04-19 15:29:43 UTC - Kai Levy: So generally speaking, is there a list of operations that
do use zookeeper, and whether they are reads or writes?
----
2019-04-19 15:31:15 UTC - Kai Levy: Or a straightforward way I can analyze the source code
to find operations that use zookeeper?
----
2019-04-19 15:44:06 UTC - Joe Francis: Topics and Subscriptions have state and metadata, and
so they will have ZK entries, and this metadata gets updated if you create/delete or  set
properties on them.  Then there are BookKeeper ledgers associated with the topics and cursors,
which get updated when data files get rolled over. You can look at ManagedLedgerInfo.java to
see what metadata is kept.
----
2019-04-19 16:05:54 UTC - Kai Levy: Does creating consumers on existing subscriptions ever
write to zk? Or just read?
----
2019-04-19 16:14:42 UTC - Sébastien de Melo: Hi guys!
We encounter a very weird error with our Pulsar function.  It has 2 input topics and when
we make a load test on 1 topic, the function eventually stops listening to this topic at some
point and never recovers.  The messages sent to the other topic are still processed though
(confirmed by the stats subcommand).  Then we have to delete it and recreate it so that it
works again.
----
2019-04-19 16:56:08 UTC - Sanjeev Kulkarni: @Sébastien de Melo huh, that's weird. Any errors
in the function log? How long after the function starts do you see this happening?
----
2019-04-19 16:56:37 UTC - Sanjeev Kulkarni: and what's the message rate on each of the topics?
----
2019-04-19 17:01:38 UTC - Ruud Kamphuis: Why is the Pulsar Docker image 1GB? Isn't there a Docker
image available that only contains Pulsar itself?
----
2019-04-19 17:13:55 UTC - Joe Francis: In general no.
----
2019-04-19 18:03:49 UTC - Sam Leung: I have a question about phased rollout of a service that
is a consumer. Our current system’s paradigm allows us to specify a percentage of traffic
to route to a new deployment, e.g. 99% of traffic goes to service A v1, 1% goes to service
A v2. Eventually we tweak those until all requests go to v2
In Pulsar, messages are pushed to the clients according to the subscription, so that means
v1 and v2 will both process messages as fast as they can. Has precise throttling of a certain
group of consumers been considered?
I see some potential solutions as:
- use consumer priority and permits to get rough distribution, but that does not actually
give me control
- have consumers nack a % of received messages, but a lot of busy work and again not very
precise
- create pulsar function to route messages in a distribution into v1's topic and v2's topic,
but that could end up with a lot of duplication
- add something to `AbstractDispatcherMultipleConsumers` to support groups of consumers with
a % of messages routed to them
Any thoughts?
----
2019-04-19 18:13:49 UTC - David Kjerrumgaard: @Ruud Kamphuis The Pulsar docker image currently
includes bookkeeper, zookeeper, and other components that contribute to the size of the image.
We could create a standalone "pulsar"-only docker image, but it would be incumbent upon the
user to also spin up ZK and BK images and configure the networking between them via docker-compose
or similar. So far, nobody has elected to go down that route.
----
2019-04-19 18:16:59 UTC - David Kjerrumgaard: @Sam Leung If you are looking for a short term
"hack" to simulate the behavior you described, you could write a simple pulsar function that
processes the message, generates a random number between 1 and 100, if it is less than 100
then route it to service A v1, otherwise route it to service A v2.
----
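The routing rule in the "hack" above can be sketched as plain Python, independent of the Pulsar Functions SDK. The function name `pick_topic` and the topic names `service-a-v1` and `service-a-v2` are hypothetical. With `v2_percent=1` it matches the description: a roll below 100 goes to v1, a roll of exactly 100 goes to v2. The split is probabilistic, so it approximates the target percentage over many messages rather than enforcing it exactly.

```python
import random
from collections import Counter


def pick_topic(rng, v2_percent, v1_topic="service-a-v1", v2_topic="service-a-v2"):
    """Return the output topic for one message (hypothetical helper)."""
    roll = rng.randint(1, 100)  # inclusive on both ends: 1..100
    return v2_topic if roll > 100 - v2_percent else v1_topic


# Seeded generator so the run is repeatable.
rng = random.Random(42)
counts = Counter(pick_topic(rng, 1) for _ in range(100_000))
print(counts)  # roughly 99% v1, 1% v2
```

In a real function the returned name would be the topic the message is republished to, and `v2_percent` would come from wherever the rollout percentage is stored.
----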
2019-04-19 18:18:14 UTC - Sam Leung: @David Kjerrumgaard I understand that “hack” could
work. I am trying to figure out the long term solution
----
2019-04-19 18:19:21 UTC - David Kjerrumgaard: @Sam Leung Sure, I am curious as to how the
long term solution would be different from a routing perspective, i.e how would you determine
which messages go to which consumers?
----
2019-04-19 18:20:57 UTC - David Kjerrumgaard: and how would you handle slow consumers, i.e.
one consumer takes longer to process messages than others, would you adapt to the back-pressure,
etc?   What if one of the consumers fails? should the remaining one get 100% of the traffic?
----
2019-04-19 18:22:39 UTC - Sam Leung: Ah we have a microservice that could serve those percentage
numbers. If we use pulsar functions to do the routing, I am thinking we would need to cache
the numbers in redis or zookeeper.
We generally have a GA version, which all traffic is routed to by default, but divert 1% (or
whatever) to the new deployments.
----
2019-04-19 18:22:54 UTC - David Kjerrumgaard: Just things to consider if you want to submit
a PIP, etc.
----
2019-04-19 18:23:08 UTC - Sam Leung: Each service also has multiple instances, so it should
be resilient enough that the GA has at least one consumer running.
----
2019-04-19 18:23:11 UTC - Matteo Merli: There are several optimizations that could be done
on the Docker image
----
2019-04-19 18:23:41 UTC - Matteo Merli: Basically that image just needs the pulsar-bin.tar.gz
plus JVM
----
2019-04-19 18:23:53 UTC - Sam Leung: Definitely good things to think about in a more general
scenario though.
----
2019-04-19 18:24:13 UTC - Matteo Merli: There was some discussion here: <https://github.com/apache/pulsar/pull/3602>
----
2019-04-19 18:24:39 UTC - David Kjerrumgaard: Since this use case is geared towards A/B testing
(in my mind anyway), I was thinking of the case where v2 of the service has a bug in it that
causes ALL instances to fail.
----
2019-04-19 18:25:57 UTC - David Kjerrumgaard: users would think that some of the messages
aren't getting processed by the system.  A lot of messages would go un-acked which can cause
issues, etc.
----
2019-04-19 18:26:57 UTC - Sam Leung: I see.. if v2 did not have an ack timeout and doesn’t
disconnect, the messages would be stuck.
----
2019-04-19 18:27:14 UTC - David Kjerrumgaard: yep
----
2019-04-19 18:28:03 UTC - Sam Leung: Okay, alternatively, if we didn’t need the precision
of exact percentages, what do you think would be a good canary test to ensure v2 works?
----
2019-04-19 18:30:44 UTC - David Kjerrumgaard: Assuming that v2 would in turn distribute messages
to downstream services, etc?
----
2019-04-19 18:31:09 UTC - Sam Leung: sure
----
2019-04-19 18:31:24 UTC - Ruud Kamphuis: Thanks. I get that having a standalone image is great
for everybody that just wants to test Pulsar out.

However, I do find the naming of the current docker files super confusing.

pulsar
pulsar-standalone
pulsar-all

They all seem to have ZK, BK and more installed.

I expected `pulsar` to be the single pulsar package. And `pulsar-standalone` to be P,ZK,BK,Dashboard
etc

Why are they the same(ish)?

If you want to go to production, then you need / want to have these services split right?
----
2019-04-19 18:33:31 UTC - Ruud Kamphuis: Thanks I will subscribe to the issue.
----
2019-04-19 18:35:21 UTC - David Kjerrumgaard: That's a good question. I'd have to think about
it a bit. Can your downstream services handle duplicate messages? If so, you can have v2 create
its own subscription on the incoming topic
----
2019-04-19 18:36:49 UTC - David Kjerrumgaard: yes, in a production environment these services
are typically spread out.
----
2019-04-19 18:37:40 UTC - Sam Leung: That would be nice for the cases where the downstream
services can handle that. It would add a bit of duplicated effort, but well worth it. But
there are some that cannot.
----
2019-04-19 18:37:48 UTC - David Kjerrumgaard: We deploy the services separately as pods in
K8s and use the configs to control which services are running in each pod
----
2019-04-19 18:38:09 UTC - Sébastien de Melo: Approximately 120 000 messages in 1 minute.
The function processes between 50k and 85k and stops working. It takes a few minutes. There
are some 500 errors from the API we call in the logs.
We had 9 instances of the function distributed across 3 brokers. Interestingly the problem
does not occur if we create 20 instances instead of 9
----
2019-04-19 18:38:48 UTC - David Kjerrumgaard: Yea, the answer is going to be very specific
to your environment
----
2019-04-19 18:39:19 UTC - Ruud Kamphuis: I created an issue on GitHub, <https://github.com/apache/pulsar/issues/4086>.
I think it's better to have it there as others can also search for it.
----
2019-04-19 18:40:08 UTC - David Kjerrumgaard: FWIW, my "hack" would segregate the messages
into different topics, and if v2 is having issues, you will be able to see that in the topic
backlog, ack count, etc.
----
2019-04-19 18:40:34 UTC - David Kjerrumgaard: and it wouldn't impact the v1 flow
----
2019-04-19 18:41:22 UTC - David Kjerrumgaard: topics are cheap in Pulsar as well :smiley:
----
2019-04-19 18:44:54 UTC - Sam Leung: Makes sense. Yeah they’re cheap, but I’m thinking
about the scale where we’re running at say 50% capacity, if we have 3 services that consume
from the same topic on different subscriptions, and they each run their own A/B test, we suddenly
are duplicating the messages into 6 topics, with 3x the number of messages.
----
2019-04-19 18:45:42 UTC - Sam Leung: But that’s relatively unlikely :slightly_smiling_face:
----
2019-04-19 18:46:05 UTC - David Kjerrumgaard: From a design perspective, I think it is best
NOT to embed this behavior into the core classes, and instead use functions or similar tools
to implement this and other unique behaviors, such as filtering, replicating, etc. Adding
this into the base class makes the topic configuration that much more complicated.
----
2019-04-19 18:46:43 UTC - Sam Leung: I agree
----
2019-04-19 18:46:52 UTC - David Kjerrumgaard: I wouldn't worry about the scalability of Pulsar
too much :smiley:
----
2019-04-19 18:47:29 UTC - David Kjerrumgaard: with proper message retention and expiration
policies in place you will be fine
----
2019-04-19 18:55:04 UTC - Sam Leung: Thanks for all your help!
----
2019-04-19 19:34:06 UTC - Matteo Merli: Having ZK and BK in same image is not the reason for
the big size :slightly_smiling_face:
----