nutch-dev mailing list archives

From jorgelbg <...@git.apache.org>
Subject [GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper
Date Fri, 11 Sep 2015 16:18:06 GMT
GitHub user jorgelbg reopened a pull request:

    https://github.com/apache/nutch/pull/55

    WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting the Nutch segments to WARC files.
    
    From the usage point of view a couple of new command line options are available: 
    
    * `-warc`: exports the segments to WARC files; if not specified, the default Jackson formatter is used.
    * `-warcSize`: sets the maximum size of each WARC file; if not specified, a default of 1GB per file is used, as recommended by the WARC ISO standard.
    
    The usual `-gzip` flag can be used to compress the WARC output files.
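The size cap set by `-warcSize` can be pictured with a minimal sketch of size-based rollover. This is not the code from the pull request: `RotatingWriter` and its methods are hypothetical names, the "files" are in-memory buffers, and the rule that a record is never split across two files is an assumption made for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not the Nutch implementation): starts a new output
 * "file" whenever the next record would push the current one past the cap.
 * In-memory buffers stand in for real WARC files.
 */
class RotatingWriter {
    private final long maxBytes;
    private final List<ByteArrayOutputStream> files = new ArrayList<>();
    private ByteArrayOutputStream current;

    RotatingWriter(long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /** Writes one whole record; records are never split across files (assumption). */
    void write(byte[] record) {
        // Roll over when no file is open yet, or when a non-empty file would
        // exceed the cap. A record larger than the cap gets a file of its own.
        boolean wouldOverflow = current != null
                && current.size() > 0
                && current.size() + record.length > maxBytes;
        if (current == null || wouldOverflow) {
            current = new ByteArrayOutputStream();
            files.add(current);
        }
        current.write(record, 0, record.length);
    }

    int fileCount() {
        return files.size();
    }

    int fileSize(int index) {
        return files.get(index).size();
    }
}
```

With a cap of 10 bytes, writing records of 6, 6 and 4 bytes yields two files: the second record does not fit next to the first, while the third fills the second file exactly to the cap.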
    
    Some changes were made to the default CommonCrawlDataDumper, essentially to the Factory and to the Formats. These changes avoid creating a new instance of a CommonCrawlFormat for each URL read from the segments.
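The refactoring described above (one format instance reused for every URL, then closed once) can be sketched as follows. All names here (`DumpFormat`, `CountingFormat`, `Dumper`) are hypothetical stand-ins and not the actual Nutch classes or signatures; the instance counter exists only to make the reuse visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a format interface with a close hook. */
interface DumpFormat {
    void write(String url, byte[] content);

    void close();
}

/** Toy format that records what it was asked to write. */
class CountingFormat implements DumpFormat {
    static int instances = 0;           // how many formats were ever created
    final List<String> written = new ArrayList<>();
    boolean closed = false;

    CountingFormat() {
        instances++;
    }

    @Override
    public void write(String url, byte[] content) {
        if (closed) {
            throw new IllegalStateException("format already closed");
        }
        written.add(url);
    }

    @Override
    public void close() {
        closed = true;                  // e.g. flush a trailer or release a file
    }
}

class Dumper {
    /** Creates the format once, reuses it for every URL, and closes it at the end. */
    static CountingFormat dumpAll(List<String> urls) {
        CountingFormat format = new CountingFormat();
        try {
            for (String url : urls) {
                format.write(url, new byte[0]);
            }
        } finally {
            format.close();
        }
        return format;
    }
}
```

Because the format does not depend on the content of any single URL, it can be constructed before the loop; the close call gives formats such as WARC a place to finish their output.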

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DigitalPebble/nutch warc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #55
    
----
commit 0a627e5a5098a2ad4818b594fe567ea7fdd2c131
Author: Jorge Luis Betancourt <betancourt.jorge@gmail.com>
Date:   2015-09-08T13:21:04Z

    Initial version of the CommonCrawlWARCFormat, generates valid metadata, response and request records. The request
    records only provide partial information, roughly the same as the CommonCrawl Data Dumper at the moment.

commit 1889a0b64d48005499f4de01ed18724087feb0f7
Author: Jorge Luis Betancourt <betancourt.jorge@gmail.com>
Date:   2015-09-08T16:37:27Z

    Adding the WARCUtils class and the dependency to the ivy.xml file to avoid the fetching of another hadoop dependency

commit 169e5a4a4172424b31c91e232bb69056b10827c7
Author: Jorge Luis Betancourt <betancourt.jorge@gmail.com>
Date:   2015-09-08T18:21:47Z

    Removing the transitive property of the ivy.xml file to avoid any future troubles

commit ede35d1aa767741ec5206de7990910fc661983e8
Author: Jorge Luis Betancourt <betancourt.jorge@gmail.com>
Date:   2015-09-10T17:57:11Z

    Doing some refactoring on the existing code, essentially trying to avoid creating an instance of each CommonCrawlFormat
    per URL processed. Since the format is content independent at the moment, the factory should allow creating a format
    without this data.
    
    Added a close method to the CommonCrawlFormat interface for those cases when the format needs some closing
    statement.

commit 44beb74172364556f70b6f08d0a8ee511c99eff4
Author: Jorge Luis Betancourt <betancourt.jorge@gmail.com>
Date:   2015-09-11T14:34:42Z

    Adding the changes to the main CCDataDumper class to call the WARC exporter tool.
    Changes to the Jackson format to work with the new structure.
    Changes to the FormatFactory to create the right Jackson/WARC instance.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
