spark-user mailing list archives

From Lars Albertsson <>
Subject Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests
Date Sun, 10 Jul 2016 09:58:00 GMT
Let us assume that you want to build an integration test setup where
you run all participating components in Docker.

You create a docker-compose.yml with four Docker images, something like this:

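# Start docker-compose.yml
version: '2'

services:
  # "myapp" and "test_harness" are placeholder service names
  myapp:
    build: myapp_dir
    links:
      - kafka
      - cassandra

  kafka:
    image: spotify/kafka
    ports:
      - "2181:2181"
      - "9092:9092"

  cassandra:
    image: spotify/cassandra
    environment:
      - <might need some tweaking here>
    ports:
      - "9042:9042"

  test_harness:
    build: test_harness_dir
    links:
      - kafka
      - cassandra
# End docker-compose.yml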

I haven't used the spotify/cassandra image, so you might need to do
some environment variable plumbing to get it working.

Your test harness would then push messages to Kafka, and poll
Cassandra for the expected output. The Docker image for your Spark
Streaming application has Spark installed, and runs Spark with a
local master.

You need to run this on a machine that has Docker and Docker Compose
installed, typically an Ubuntu host. This machine can either be bare
metal or a full VM (VirtualBox, VMware, Xen), which is what you get if
you run in an IaaS cloud like GCE or EC2. Hence, your CI/CD Jenkins
machine should be a dedicated instance.

Developers with Macs would run docker-machine, which uses VirtualBox
IIRC. Developers with Linux machines can run Docker and Docker Compose
natively.

You can in theory run Jenkins in Docker and spin up new Docker
containers from inside Docker using some docker-inside-docker setup.
It will add complexity, however, and I suspect it will be brittle, so
I don't recommend it.

You could also in theory use some cloud container service that runs
your images during tests. They have different ways of welding Docker
images than Docker Compose, however, so it also increases complexity
and makes the CI/CD setup different than the setup on local developer
machines. I went down this path once, but I cannot recommend it.

If you instead want a setup where the test harness and your Spark
Streaming application run outside Docker, you omit them from
docker-compose.yml, have the test harness run docker-compose, and
figure out the ports and addresses to connect to. As mentioned
earlier, this requires more plumbing, but results in an integration
test setup that runs smoothly from Gradle/Maven/SBT and also from
IDEs.
I hope things are clearer. Let me know if you have further questions.


Lars Albertsson
Data engineering consultant
+46 70 7687109

On Thu, Jul 7, 2016 at 3:14 AM, swetha kasireddy
<> wrote:
> Can this docker image be used to spin up kafka cluster in a CI/CD pipeline
> like Jenkins to run the integration tests? Or it can be done only in the
> local machine that has docker installed? I assume that the box where the
> CI/CD pipeline runs should have docker installed correct?
> On Mon, Jul 4, 2016 at 5:20 AM, Lars Albertsson <> wrote:
>> I created such a setup for a client a few months ago. It is pretty
>> straightforward, but it can take some work to get all the wires
>> connected.
>> I suggest that you start with the spotify/kafka
>> Docker image, since it
>> includes a bundled zookeeper. The alternative would be to spin up a
>> separate Zookeeper Docker container and connect them, but for testing
>> purposes, it would make the setup more complex.
>> You'll need to inform Kafka about the external address it exposes by
>> setting ADVERTISED_HOST to the output of "docker-machine ip" (on Mac)
>> or the address printed by "ip addr show docker0" (Linux). I also
>> suggest setting
>> You can choose to run your Spark Streaming application under test
>> (SUT) and your test harness also in Docker containers, or directly on
>> your host.
>> In the former case, it is easiest to set up a Docker Compose file
>> linking the harness and SUT to Kafka. This variant provides better
>> isolation, and might integrate better if you have existing similar
>> test frameworks.
>> If you want to run the harness and SUT outside Docker, I suggest that
>> you build your harness with a standard test framework, e.g. scalatest
>> or JUnit, and run both harness and SUT in the same JVM. In this case,
>> you put code to bring up the Kafka Docker container in test framework
>> setup methods. This test strategy integrates better with IDEs and
>> build tools (mvn/sbt/gradle), since they will run (and debug) your
>> tests without any special integration. I therefore prefer this
>> strategy.
>> What is the output of your application? If it is messages on a
>> different Kafka topic, the test harness can merely subscribe and
>> verify output. If you emit output to a database, you'll need another
>> Docker container, integrated with Docker Compose. If you are emitting
>> database entries, your test oracle will need to frequently poll the
>> database for the expected records, with a timeout in order not to hang
>> on failing tests.
>> I hope this is comprehensible. Let me know if you have followup questions.
>> Regards,
>> Lars Albertsson
>> Data engineering consultant
>> +46 70 7687109
>> On Thu, Jun 30, 2016 at 8:19 PM, SRK <> wrote:
>> > Hi,
>> >
>> > I need to do integration tests using Spark Streaming. My idea is to spin
>> > up
>> > kafka using docker locally and use it to feed the stream to my Streaming
>> > Job. Any suggestions on how to do this would be of great help.
>> >
>> > Thanks,
>> > Swetha
