samoa-dev mailing list archives

From "Vishal Karande (JIRA)" <>
Subject [jira] [Commented] (SAMOA-40) Add Kafka stream reader modules to consume data from Kafka framework
Date Tue, 11 Aug 2015 00:09:45 GMT


Vishal Karande commented on SAMOA-40:

Hi @gdfm 

Here are the steps to test SAMOA with Kafka. Let me know if you run into any issues.

A] Set up KAFKA:
Step 1: Download the code

Download the release and un-tar it.
> tar -xzf kafka_2.10-
> cd kafka_2.10-

Step 2: Start the server
Kafka uses ZooKeeper so you need to first start a ZooKeeper server if you don't already have
one. You can use the convenience script packaged with kafka to get a quick-and-dirty single-node
ZooKeeper instance.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
Now start the Kafka server:
> bin/kafka-server-start.sh config/server.properties
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)

Step 3: Create a topic

Let's create a topic named "test" with a single partition and only one replica:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181

B] Add data into Kafka:
Use the following code to add data into Kafka.

KafkaFileProducer performs the following tasks:

1) Reads data from the dataset/Processed_subject101.dat file
2) For each line in the file, creates a message of type (Key, Value), where the key is the line number and the value is the line itself

The following defaults can be changed to load data of your choice into Kafka.

Default topic name: test

Default file used: dataset/Processed_subject101.dat
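The keying scheme described above can be sketched in plain Java (a minimal sketch of the mapping only; the actual KafkaFileProducer sends these pairs through the Kafka producer API, which is omitted here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the KafkaFileProducer keying scheme: each line becomes a
// (key, value) message where the key is the line number and the value
// is the line itself. The actual Kafka send call is omitted.
public class KeyedLineSketch {

    // Turn file lines into {key, value} pairs, keyed by 1-based line number.
    static List<String[]> toMessages(List<String> lines) {
        List<String[]> messages = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            messages.add(new String[] { String.valueOf(i + 1), lines.get(i) });
        }
        return messages;
    }

    public static void main(String[] args) {
        // Stand-in for lines read from dataset/Processed_subject101.dat
        List<String> lines = Arrays.asList("1.2,3.4,A", "5.6,7.8,B");
        for (String[] m : toMessages(lines)) {
            System.out.println(m[0] + " -> " + m[1]);
        }
    }
}
```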

C] Validate data present in Kafka:

Run KafkaFileProducer and verify that the data is loaded into Kafka using the following command:

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

D] Test KafkaStream for SAMOA:

> bin/samoa local target/SAMOA-Local-0.3.0-incubating-SNAPSHOT.jar "PrequentialEvaluation
-d /tmp/dump.csv -l (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p
4) -s (org.apache.samoa.streams.kafka.KafkaStream -r 20 -t test -k 7) -i 100000 -f 10000 -w 0"

evaluation instances,classified instances,classifications correct (percent),Kappa Statistic
(percent),Kappa Temporal Statistic (percent)

-r 20: read 20 messages in a single read from Kafka
-t test: read from the topic named "test"
-k 7: total number of classes in the data
-w 0: no delay between instances read
The default separator (comma) is used while parsing values for evaluation.
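The -r (batch size), -w (delay), and comma-separator behavior above can be illustrated with a small sketch; the class and method names here are hypothetical, and an in-memory source stands in for the actual Kafka consumer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrates the -r and -w options: pull up to `batchSize` messages per
// fetch, hand them out one at a time with `delayMs` between instances,
// and split each line on the default comma separator. An in-memory
// iterator stands in for the Kafka consumer.
public class BatchedStreamSketch {
    private final Iterator<String> source;   // stand-in for the Kafka consumer
    private final int batchSize;             // -r: messages per read request
    private final long delayMs;              // -w: delay between instances

    BatchedStreamSketch(Iterator<String> source, int batchSize, long delayMs) {
        this.source = source;
        this.batchSize = batchSize;
        this.delayMs = delayMs;
    }

    // One fetch: read up to batchSize messages from the source.
    List<String> fetchBatch() {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }

    // Default comma separator used while parsing values for evaluation.
    static String[] parseInstance(String line) {
        return line.split(",");
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> data = Arrays.asList("1.0,2.0,A", "3.0,4.0,B", "5.0,6.0,A");
        BatchedStreamSketch s = new BatchedStreamSketch(data.iterator(), 2, 0);
        List<String> batch;
        while (!(batch = s.fetchBatch()).isEmpty()) {       // -r 2: two per fetch
            for (String line : batch) {
                String[] values = parseInstance(line);
                System.out.println(values.length + " values, class "
                        + values[values.length - 1]);
                Thread.sleep(s.delayMs);                    // -w 0: no delay
            }
        }
    }
}
```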

> Add Kafka stream reader modules to consume data from Kafka framework
> --------------------------------------------------------------------
>                 Key: SAMOA-40
>                 URL:
>             Project: SAMOA
>          Issue Type: Task
>          Components: Infrastructure, SAMOA-API
>         Environment: OS X Version 10.10.3
>            Reporter: Vishal Karande
>            Priority: Minor
>              Labels: features
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> Apache SAMOA is designed to process streaming data and develop streaming machine learning
> algorithms. Currently, the SAMOA framework supports reading stream data from Arff files only.
> Thus, when using SAMOA as a streaming machine learning component in real-time use cases,
> writing and reading data from files is slow and inefficient.
> A single Kafka broker can handle hundreds of megabytes of reads and writes per second
> from thousands of clients. The ability to read data directly from Apache Kafka into SAMOA
> would not only improve performance but also make SAMOA pluggable into many real-time
> machine learning use cases such as the Internet of Things (IoT).
> Add code that enables SAMOA to read data from Apache Kafka as stream data.
> The Kafka stream reader supports the following options for streaming:
> a) Topic selection - Kafka topic to read data from
> b) Partition selection - Kafka partition to read data from
> c) Batching - number of data instances read from Kafka in one read request
> d) Configuration options - Kafka port number, seed information, and the time delay between
> two read requests
> Components:
> KafkaReader - consists of APIs to read data from Kafka
> KafkaStream - stream source for SAMOA providing data read from Kafka
> Dependencies for Kafka are added to pom.xml in the samoa-api component.
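The configuration options (a)-(d) listed in the issue could be grouped as a simple configuration object; this is only a sketch, and the field names are assumptions rather than the actual SAMOA API:

```java
// Sketch of the Kafka reader options (a)-(d) as a configuration object.
// Field names are assumptions, not the actual SAMOA API.
public class KafkaReaderConfig {
    final String topic;        // (a) topic selection
    final int partition;       // (b) partition selection
    final int batchSize;       // (c) instances per read request
    final String host;         // (d) broker host
    final int port;            // (d) Kafka port number
    final long delayMs;        // (d) delay between two read requests

    KafkaReaderConfig(String topic, int partition, int batchSize,
                      String host, int port, long delayMs) {
        this.topic = topic;
        this.partition = partition;
        this.batchSize = batchSize;
        this.host = host;
        this.port = port;
        this.delayMs = delayMs;
    }

    public static void main(String[] args) {
        // Defaults matching the test run above: topic "test", batch of 20, no delay
        KafkaReaderConfig cfg =
                new KafkaReaderConfig("test", 0, 20, "localhost", 9092, 0);
        System.out.println(cfg.topic + "@" + cfg.host + ":" + cfg.port
                + " batch=" + cfg.batchSize);
    }
}
```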

This message was sent by Atlassian JIRA
