nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-368) Message queueing system
Date Tue, 22 Jan 2008 14:50:37 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561362#action_12561362 ]

Andrzej Bialecki  commented on NUTCH-368:
-----------------------------------------

This solution is too heavy on the namenode, so it's suitable only for very low message volumes.
As such, it's not generally applicable and should not be added to Nutch. See also HADOOP-490.

> Message queueing system
> -----------------------
>
>                 Key: NUTCH-368
>                 URL: https://issues.apache.org/jira/browse/NUTCH-368
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: Fetcher-ctrl.patch, msg.tgz
>
>
> This is an implementation of a filesystem-based message queueing system. The motivation
> for this functionality is explained in HADOOP-490 - there is nothing Nutch-specific in this
> implementation, so if it's considered generally useful it could be moved there.
> Below are excerpts from the included javadocs.
> The model of the system is as follows:
>     * applications (including map-reduce jobs) may create their own separate message
>       queueing area. Alternatively, they can specifically ask for a named message queue, belonging
>       to a different application or existing as a system-wide queue. Message queues are created
>       under "/mq", followed by the message queue id (for map-reduce jobs this is a job id, or it
>       can be any other name passed as job id to the constructor).
>       Please see the example for more information.
>     * a single unit of information passing through queues is a Msg, which has a unique
>       identifier (consisting of creation time and publisher name), string subject, and content (Writable).
>     * a single MsgQueue in fact consists of any number of topics. There are four predefined
>       ones: in, out, err, and ctrl.
>     * messages are published to topics, which present a sequential view of messages,
>       sorted by msgId (which corresponds to their order of arrival).
>     * each message queue may periodically poll for changes (MsgQueue.startPolling()),
>       using a separate thread. Polling updates the list of topics and messages. Poll interval is
>       configurable, and defaults to 5 sec.
>     * each detected change in the queue (add/remove topic, add/remove message) may be
>       communicated to registered listeners. Out-of-band messages are not supported in this version,
>       but it's not too complicated to add them. Applications can create listeners watching queues
>       for newly added messages, or deleted messages, added topics or deleted topics, etc.
>     * each instance of MsgQueue using the same physical queue maintains its own view
>       of the queue, keeping track of topics and messages that it considers "processed and discarded".
>       In other words, multiple readers and creators may modify queues, and each knows which messages
>       it already processed and which ones are new. In a similar fashion, instances may willfully
>       "remove" certain topics from their view, even though these topics still physically exist and
>       are available for other instances (and later on they can "add" them to their view again).
>       This somewhat complicated feature was implemented in order to support multiple
>       readers for the same message (e.g. many tasks per one mapred job). Each task needs to register
>       for the same queue, and if they didn't have their own views of the queue, messages would be
>       consumed by the first task that got to them. As it is implemented now, each task may consume
>       messages at its own pace. At the end of the job applications may elect to keep the queue around
>       or to destroy it (and thus remove all topics and messages in it).
>     * messages, topics and queues may be destroyed by any user, at which point they are
>       physically removed from the filesystem. All users will gradually update their views, during
>       the next poll operation.
>     * there is a command-line tool to examine and modify queues, and also to retrieve
>       and send simple text messages. You can run it like this:
>          bin/nutch org.apache.nutch.util.msg.MsgQueueTool ...many options...
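
The per-instance view model described in the list above can be illustrated with a small, self-contained Java sketch. It is not the code from msg.tgz: the names QueueViewSketch, MsgId, SharedTopic and TopicView are hypothetical, and an in-memory sorted map stands in for the filesystem directory that would hold a topic. It only shows the idea that messages are ordered by an id built from creation time and publisher name, and that each reader tracks for itself which messages it has already processed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListMap;

public class QueueViewSketch {

    /** Hypothetical message id: creation time first, publisher name as tie-breaker. */
    static final class MsgId implements Comparable<MsgId> {
        final long created;
        final String publisher;

        MsgId(long created, String publisher) {
            this.created = created;
            this.publisher = publisher;
        }

        @Override
        public int compareTo(MsgId o) {
            int c = Long.compare(created, o.created);
            return c != 0 ? c : publisher.compareTo(o.publisher);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof MsgId && compareTo((MsgId) o) == 0;
        }

        @Override
        public int hashCode() {
            return Objects.hash(created, publisher);
        }
    }

    /** Stand-in for one physical topic (assumption: in memory here, not on a filesystem). */
    static class SharedTopic {
        private final ConcurrentSkipListMap<MsgId, String> messages = new ConcurrentSkipListMap<>();

        void publish(MsgId id, String content) { messages.put(id, content); }

        /** Physically removes a message, for every reader. */
        void destroy(MsgId id) { messages.remove(id); }
    }

    /** One reader's private view: remembers which ids it has already processed. */
    static class TopicView {
        private final SharedTopic topic;
        private final Set<MsgId> processed = new HashSet<>();

        TopicView(SharedTopic topic) { this.topic = topic; }

        /** One poll: returns messages new to this view, in msgId order. */
        List<String> poll() {
            List<String> fresh = new ArrayList<>();
            for (Map.Entry<MsgId, String> e : topic.messages.entrySet()) {
                if (processed.add(e.getKey())) {
                    fresh.add(e.getValue());
                }
            }
            // Forget ids that were destroyed in the shared topic since the last poll.
            processed.retainAll(topic.messages.keySet());
            return fresh;
        }
    }

    public static void main(String[] args) {
        SharedTopic ctrl = new SharedTopic();
        TopicView task1 = new TopicView(ctrl);
        TopicView task2 = new TopicView(ctrl);

        // Published out of order; the topic still presents them sorted by id.
        ctrl.publish(new MsgId(2000L, "task-b"), "stop");
        ctrl.publish(new MsgId(1000L, "task-a"), "start");

        System.out.println(task1.poll()); // [start, stop]
        System.out.println(task2.poll()); // [start, stop] (task1 did not consume them)
        System.out.println(task1.poll()); // []
    }
}
```

Because every view keeps its own processed set, a message read by one task remains visible to the others until someone destroys it in the shared topic, which matches the "many tasks per one mapred job" case described above.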

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

