nutch-dev mailing list archives

From "eyal edri" <eyal.e...@gmail.com>
Subject Solved: Downloading file types to file system
Date Tue, 09 Oct 2007 12:02:12 GMT
I've found the solution.
It's quite simple actually, purely Java related.

code:

// 'content' is the fetched Content object; 'application' is the servlet context
byte[] data = content.getContent();
String file = application.getRealPath("/") + "file.dat";
FileOutputStream fileoutputstream = new FileOutputStream(file);
fileoutputstream.write(data);  // write the raw bytes directly, no ObjectOutputStream
fileoutputstream.close();

That solved the issue.
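(For anyone hitting the same size mismatch I describe further down: ObjectOutputStream
is meant for Java serialization, so it prepends a stream header and wraps raw bytes in
block-data records, which is where the extra bytes came from. A minimal, untested sketch
of the overhead - the class name is just for illustration:)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class OverheadDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1024];

        // plain stream: exactly 1024 bytes come out
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        raw.write(data);

        // serialization stream: header + block-data framing get added
        ByteArrayOutputStream wrapped = new ByteArrayOutputStream();
        ObjectOutputStream obj = new ObjectOutputStream(wrapped);
        obj.write(data);
        obj.close();

        System.out.println(raw.size());     // 1024
        System.out.println(wrapped.size()); // slightly more than 1024
    }
}

Writing the byte array straight to a FileOutputStream, as above, keeps the saved
file byte-identical to the original download.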



On 10/9/07, eyal edri <eyal.edri@gmail.com> wrote:
>
> Can anyone help with this?
> Is there another Java I/O class I can use for saving the byte array?
>
> Eyal.
>
> On 9/22/07, eyal edri <eyal.edri@gmail.com> wrote:
> >
> > Am I catching this content byte array too late (in the code)?
> >
> > Is there a previous data field that holds the page content before the
> > content byte array?
> >
> > thanks,
> >
> >
> > On 9/20/07, eyal edri <eyal.edri@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I've made some progress with downloading files (EXE/ZIP).
> > > I'm not using the plugin system yet; for now I've just injected code
> > > into Fetcher.java to test it.
> > > I've written the following code (after this line: Content content
> > > = output.getContent(); ):
> > >
> > >
> > > // save the file to the fs
> > > // define a regex to capture the domain name & file name
> > > Pattern regex = Pattern.compile("http://([^/]*).*/([^/]*)$");
> > > Matcher urlMatcher = regex.matcher(content.getUrl());
> > >
> > > String domain = null;
> > > String fileLast = null;
> > > // get the $1 & $2 backreferences from the regex
> > > while (urlMatcher.find()) {
> > >     domain = urlMatcher.group(1);
> > >     fileLast = urlMatcher.group(2);
> > > }
> > > LOG.info("filename " + fileLast);
> > > LOG.info("domain " + domain);
> > >
> > > File downloadDir = new File("/home/eyale/nutch/DOWNLOADS/" + domain);
> > > // check if the dir exists
> > > if (!downloadDir.exists())
> > >     downloadDir.mkdir();
> > > String filename = downloadDir + "/" + fileLast;
> > >
> > > FileOutputStream out = new FileOutputStream(new File(filename));
> > > ObjectOutputStream obj = new ObjectOutputStream(out);
> > >
> > > // content.getContent() returns a byte array
> > > obj.write(content.getContent());
> > > obj.close();
> > >
> > > After downloading a file this way, I've found that it is slightly
> > > bigger than the original file (compared with the file retrieved by
> > > WGET).
> > > Why is that? Does this byte array contain extra information/data?
> > > How can I get the real file data only?
> > >
> > > thanks,
> > >
> > >
> > > On 9/11/07, Martin Kuen <martin.kuen@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I don't think that nutch can be configured to store each downloaded
> > > > file as a file (one file downloaded - one file on your local disk).
> > > > The byte array called "content" can be stored directly, I think;
> > > > that's worth a try. The fetcher uses (binary) streams to handle the
> > > > downloaded content, so I think it *should* be okay.
> > > >
> > > > Another approach (my two cents):
> > > > 1. Run the fetcher with the -noParse option (most likely not even
> > > > necessary)
> > > > 2. Check that the fetcher is advised to store the content (there is a
> > > > property in nutch-default.xml; see the sketch below)
> > > > 3. Create a dump with the "readseg" command and the "-dump" option
> > > > 4. Process the dump file and cut out what is necessary
> > > >
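> > > > Re point 2: the property is, if I remember correctly, named
> > > > "fetcher.store.content" in nutch-default.xml (please verify, I'm going
> > > > from memory). You can check what your configuration resolves it to with
> > > > a small sketch like this:
> > > >
> > > > import org.apache.hadoop.conf.Configuration;
> > > > import org.apache.nutch.util.NutchConfiguration;
> > > >
> > > > public class CheckStoreContent {
> > > >     public static void main(String[] args) {
> > > >         Configuration conf = NutchConfiguration.create();
> > > >         // property name from memory - verify against nutch-default.xml
> > > >         System.out.println(conf.getBoolean("fetcher.store.content", true));
> > > >     }
> > > > }
> > > >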
> > > > Just interested if that could work . . . however:
> > > > I had a look at the class implementing the readseg command and found
> > > > that the dump file is created with a PrintWriter. I think this will
> > > > cause trouble, since a PrintWriter is meant for text rather than
> > > > binary data. Maybe you can modify the SegmentReader to use an
> > > > OutputStream instead.
> > > >
> > > > Regarding the fetcher - it's using a binary stream
> > > > (FSDataOutputStream) to store the content.
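> > > >
> > > > To illustrate the Writer problem: a Writer pushes everything through a
> > > > character encoding, so any byte that isn't valid text gets replaced. An
> > > > untested sketch (the 0x89 byte stands in for arbitrary binary data):
> > > >
> > > > import java.io.ByteArrayOutputStream;
> > > > import java.io.IOException;
> > > > import java.io.OutputStreamWriter;
> > > > import java.io.PrintWriter;
> > > > import java.util.Arrays;
> > > >
> > > > public class BinaryVsWriter {
> > > >     public static void main(String[] args) throws IOException {
> > > >         byte[] data = { (byte) 0x89, 'P', 'N', 'G' }; // 0x89 is not ASCII
> > > >
> > > >         // text path: bytes -> String -> PrintWriter (lossy for binary)
> > > >         ByteArrayOutputStream text = new ByteArrayOutputStream();
> > > >         PrintWriter pw = new PrintWriter(new OutputStreamWriter(text, "US-ASCII"));
> > > >         pw.print(new String(data, "US-ASCII"));
> > > >         pw.close();
> > > >
> > > >         // binary path: raw OutputStream (lossless)
> > > >         ByteArrayOutputStream bin = new ByteArrayOutputStream();
> > > >         bin.write(data);
> > > >
> > > >         System.out.println(Arrays.equals(data, text.toByteArray())); // false
> > > >         System.out.println(Arrays.equals(data, bin.toByteArray()));  // true
> > > >     }
> > > > }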
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Martin
> > > >
> > > >
> > > > On 9/11/07, eyal edri <eyal.edri@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I've asked this question before on a different mailing list, with
> > > > > no real response.
> > > > > I hope someone sees the need for this and can help.
> > > > >
> > > > > I'm trying to configure nutch to download certain file types
> > > > > (exe/zip) to the file system while crawling.
> > > > > I know nutch doesn't have a parse-exe plugin, so I'll focus on ZIP
> > > > > (once I understand the logic, I will write a parse-exe plugin).
> > > > >
> > > > > I want to know if nutch supports downloading files inherently
> > > > > (using only conf files) or, if not, how I can alter the parse-zip
> > > > > plugin in order to download the file.
> > > > > (I saw the parser gets a byte array called "content"; can I save
> > > > > this to the fs?)
> > > > >
> > > > > thanks,
> > > > >
> > > > >
> > > > > --
> > > > > Eyal Edri
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Eyal Edri
> >
> >
> >
> >
> > --
> > Eyal Edri
>
>
>
>
> --
> Eyal Edri




-- 
Eyal Edri
