From Ben Ciceron <...@triggit.com>
Subject Re: jobtracker / hadoop consumer
Date Wed, 31 Aug 2011 21:44:48 GMT
> It really looks like your mapper tasks may be failing to connect to your
> kafka server.
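
One way to rule this out is to test plain TCP reachability of the broker from a node that runs map tasks. The sketch below is only an illustration: `can_connect` is a made-up helper name, and it checks only that the port accepts a connection, not that a Kafka broker is actually answering on it.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `can_connect("<ip_of_my_hostA>", 9092)` from a task node (with the real address filled in) should return True if the mappers are able to reach the broker at the network level.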


> Here's a brief overview of what that demo job is doing so you can understand
> where the example may have gone wrong.
> DataGenerator:
>   1. When DataGenerator is run, it needs the properties 'kafka.etl.topic'
>   and 'kafka.server.uri' set in the properties file. When you run
>   ./run-class.sh kafka.etl.impl.DataGenerator test/test.properties, you
>   can tell that they're properly set by the output 'topic=<blah>' and
>   'server uri=<kafka server url>'.

seems ok :
server uri:tcp://<ip_of_my_hostA>:9092
 send 1000 SimpleTestEvent count events to tcp://<ip_of_my_hostA>:9092
11/09/01 05:04:05 INFO producer.SyncProducer: Connected to
<ip_of_my_hostA>:9092 for producing
11/09/01 05:04:05 INFO producer.SyncProducer: Disconnecting from
Dump tcp://<ip_of_my_hostA>:9092	SimpleTestEvent	0	-1 to /tmp/ben6/data/1.dat
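
A quick sanity check of the properties file before running DataGenerator could look like the sketch below. It assumes the three keys mentioned in this thread ('kafka.etl.topic', 'kafka.server.uri', 'input') and handles only simple `key=value` lines, not the full java.util.Properties syntax. The function names are made up for this example.

```python
REQUIRED_KEYS = ("kafka.etl.topic", "kafka.server.uri", "input")


def load_properties(text):
    """Parse simple key=value lines; skip blank lines and '#' or '!' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not props.get(key)]
```

For example, `missing_keys(load_properties(open("test/test.properties").read()))` returns an empty list when all three keys are set.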

>   2. The DataGenerator will create a bunch of dummy messages and pump them to
>   that kafka server. Afterwards, it will write a file to HDFS at path 'input'
>   which you also set in the properties file. The file that is created will be
>   named something like 1.dat.

yes, i see it under the hadoop directory, as specified by 'input' in

>   3. 1.dat is a sequence file, so if it isn't compressed, you should be
>   able to see its contents in plain text. The contents will essentially list
>   the kafka server url, the partition number and the topic as well as the
>   offset.

mine has only 1 line (part of it looks encrypted; shown here as is):
SimpleTestEvent	0	-1
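
For comparison, the "Dump" line in the DataGenerator output above has four tab-separated fields, which look like server uri, topic, partition and offset. The sketch below assumes that layout (it is inferred from the log line, not confirmed against the kafka source) and pulls out the host and port a mapper would presumably connect to. `parse_offset_record` is a made-up name.

```python
from urllib.parse import urlparse


def parse_offset_record(line):
    """Split a 'uri<TAB>topic<TAB>partition<TAB>offset' line into its parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields, got %d" % len(fields))
    uri, topic, partition, offset = fields
    parsed = urlparse(uri)  # e.g. tcp://host:9092
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        "topic": topic,
        "partition": int(partition),
        "offset": int(offset),
    }
```

A line with only three fields, like the one shown above, raises ValueError here. That would be expected if the uri is stored in the part of the record that does not print as plain text.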

>   4. In a real scenario, you'll probably create several of these files for
>   each broker and possibly partition, but for this example, you only need one
>   file. Each file will spawn a mapper during the mapred step.
> CopyJars:
>   1. This should copy the necessary jars for kafka hadoop, and push them
>   into HDFS for the distributed cache. If the jars are copied locally instead
>   of to a remote cluster, most likely HADOOP_CONF_DIR hasn't been set up
>   correctly. The environment should probably be set by the script, so someone
>   can change that.

yes i've got the proper jars under the hadoop directory now.
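
In case it helps anyone else hitting the "jars copied locally" symptom: a rough check of what HADOOP_CONF_DIR points at is sketched below. It assumes a Hadoop 0.20-style layout where core-site.xml in the conf directory holds 'fs.default.name'. `default_fs` is a made-up helper name.

```python
import os
import xml.etree.ElementTree as ET


def default_fs(conf_dir):
    """Return fs.default.name from conf_dir/core-site.xml, or None if not found."""
    path = os.path.join(conf_dir, "core-site.xml")
    if not os.path.isfile(path):
        return None
    root = ET.parse(path).getroot()  # the <configuration> element
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == "fs.default.name":
            return (prop.findtext("value") or "").strip()
    return None
```

If `default_fs(os.environ.get("HADOOP_CONF_DIR", ""))` returns None or a `file:` URI rather than an `hdfs://` one, the copy step is likely writing to the local filesystem.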

> SimpleKafkaETLJob
>   1. This job will then set up the distributed classpath, and the input path
>   should be the directory that 1.dat was written to.
>   2. Internally, the mappers will then load 1.dat and use the connection
>   data contained in it to connect to kafka. If it's trying to connect to
>   anything but your kafka server, then this file was incorrectly written.

can we see a sample of a valid 1.dat file , please ?

>   3. The RecordReader wraps all of this and hides all the connection stuff
>   so that your Mapper should see a stream of Kafka messages rather than the
>   contents of 1.dat.
> So please see if you can figure out what is wrong with your example and feel
> free to beef up the README instructions to take into account your pitfalls.

ok, yes, i plan to do this once i've got all the steps right.

thx for all your support so far.

