nutch-dev mailing list archives

From "Pradumna Panditrao (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2079) Tika Parsing plugin issue
Date Wed, 12 Aug 2015 13:06:46 GMT


Pradumna Panditrao commented on NUTCH-2079:


1. Currently, getParse parses the URL and the whole page. I want to extract only particular
data, e.g. when a page contains a name, age, location, etc. Please guide me on how to do that.
2. Once I know exactly what the parse contains for my requirement, I will make the corresponding
changes in the index plugin.
3. Yes, I have added the same fields to gora-mongodb-mapping.xml.

Please let me know.

Sample code of mine:

    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Phone number extractor: scans the text for phone numbers and records them in metadata
    PhoneExtractingContentHandler handler =
        new PhoneExtractingContentHandler(new BodyContentHandler(), metadata);
    InputStream stream = new FileInputStream(file);
    try {
        parser.parse(stream, handler, metadata, new ParseContext());
    } finally {
        stream.close();
    }
    // Extracted numbers are stored under the "phonenumbers" metadata key
    String[] numbers = metadata.getValues("phonenumbers");
    for (String number : numbers) {
        // use each extracted number here
    }
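
A minimal sketch of the kind of extraction described in point 1, assuming the parsed page text carries labelled lines such as "Name: ...". This is plain Java and not part of the Nutch or Tika API; the class name FieldExtractor, the method extract, and the label format are all hypothetical and would need adapting to the real pages.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExtractor {
    // Assumption: each field sits on its own line as "<label>: <value>".
    // (?i) ignores case in the label, (?m) lets ^ and $ match at line boundaries.
    private static final Pattern FIELD =
        Pattern.compile("(?im)^[ \\t]*(name|age|location)[ \\t]*:[ \\t]*(.+?)[ \\t]*$");

    // Returns the first value found for each known label, keyed by lower-case label.
    public static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "Name: Asha Rao\nAge: 31\nLocation: Pune\nPhone: 555-0100";
        // Unknown labels such as "Phone" are ignored.
        System.out.println(extract(text)); // prints {name=Asha Rao, age=31, location=Pune}
    }
}
```

The resulting map could then be copied into whatever fields the parse result and the index plugin expose; how that is wired into Nutch is not shown here.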

> Tika Parsing plugin issue
> -------------------------
>                 Key: NUTCH-2079
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: deployment
>    Affects Versions: 2.3
>         Environment: Ubuntu 14.04
>            Reporter: Pradumna Panditrao
>             Fix For: 2.3
> Hi,
> I am trying to parse particular data and post it to MongoDB. However, when I try to
> modify the parse-tika plugin, it is too interconnected with other classes and it misses
> the data. I want to pick up particular data from a website using the same plugin and
> put it into MongoDB.
> Please suggest how to do this.

This message was sent by Atlassian JIRA