nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <>
Subject [jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter
Date Thu, 11 Oct 2007 21:30:50 GMT


Dennis Kubes updated NUTCH-565:

    Attachment: arcsegments2.patch

Here is the updated patch.  It works without any LGPL code, so it can be
included in Nutch.  Since arcs are simply tars of gzips, it scans through the arc file for the
gzip header, starts reading input there once found, and unzips each record in turn.  It takes about
40 minutes to process a single file, which outputs ~1 GB of segments.  Multiple files can be run
at once on a Hadoop cluster.
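
For readers unfamiliar with the scan-and-unzip approach described above, here is a minimal sketch in Python. It is not the code from the patch (which is Java and runs under Hadoop); the function name iter_gzip_records is hypothetical, and the sketch assumes the whole arc file fits in memory and that each record is an independent gzip member.

```python
import gzip
import zlib

# First three bytes of a gzip member: ID1, ID2, CM=8 (deflate), per RFC 1952.
GZIP_MAGIC = b"\x1f\x8b\x08"


def iter_gzip_records(data: bytes):
    """Yield the decompressed bytes of each gzip member found in data.

    Hypothetical helper for illustration: scans for the gzip header,
    starts decompressing there, then resumes the scan right after the
    end of the member that was just read.
    """
    pos = 0
    while True:
        start = data.find(GZIP_MAGIC, pos)
        if start == -1:
            return  # no further gzip headers
        # wbits=31 tells zlib to expect a gzip wrapper around the stream.
        decomp = zlib.decompressobj(wbits=31)
        try:
            record = decomp.decompress(data[start:]) + decomp.flush()
        except zlib.error:
            pos = start + 1  # magic bytes were a false positive; keep scanning
            continue
        if not decomp.eof:
            pos = start + 1  # truncated member; skip it
            continue
        yield record
        # unused_data holds everything after the end of this member.
        pos = len(data) - len(decomp.unused_data)


if __name__ == "__main__":
    # Two gzip members preceded by some non-gzip bytes.
    sample = b"leading bytes" + gzip.compress(b"first") + gzip.compress(b"second")
    for rec in iter_gzip_records(sample):
        print(len(rec), "bytes")
```

Slicing data[start:] copies the remainder of the buffer for every record, which is fine for a sketch but would be replaced by streaming reads for arc files of realistic size.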

> Arc File to Nutch Segments Converter
> ------------------------------------
>                 Key: NUTCH-565
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>         Attachments: archive-commons-1.11.0-200612262257.jar, arcsegments2.patch, fastutil-5.0.3-heritrix-subset-1.0.jar,
> Functionality that allows arc files, such as those produced by the Internet Archive project
> or by the Grub distributed crawler, to be parsed into Nutch segments.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
