nutch-dev mailing list archives

From "Pradumna Panditrao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2079) Tika Parsing plugin issue
Date Wed, 12 Aug 2015 13:06:46 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693456#comment-14693456 ]

Pradumna Panditrao commented on NUTCH-2079:
-------------------------------------------

Hi,

1. Currently, getParse parses the URL and the page. I want to parse only particular data,
for example name, age, and location, if the page contains them. Please advise how to do this.
2. Once I know the exact parse content that meets my requirement, I will make the
corresponding changes in the index plugin.
3. Yes, I have added the same to gora-mongodb-mapping.xml.

Please let me know.


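My sample code:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.PhoneExtractingContentHandler;

// The enclosing method signature is assumed; the original snippet showed only the body.
static List<String> extractPhoneNumbers(File file) throws Exception {
    List<String> phoneNumbers = new ArrayList<>();
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Phone number extractor: wraps the body handler and records matches in metadata
    PhoneExtractingContentHandler handler =
        new PhoneExtractingContentHandler(new BodyContentHandler(), metadata);
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, handler, metadata, new ParseContext());
    }
    // The handler stores each match under the "phonenumbers" metadata key
    for (String number : metadata.getValues("phonenumbers")) {
        phoneNumbers.add(number);
    }
    return phoneNumbers;
}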
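
For the name, age, and location fields mentioned in point 1, something along these lines might work once the page text has been extracted. This is only a minimal sketch: FieldExtractor and extractFields are hypothetical names, not part of Nutch or Tika, and it assumes the page text contains labelled lines such as "Name: ...", which may not hold for a real site.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Tika.
final class FieldExtractor {
    // Matches one labelled line, e.g. "Name: Asha Rao".
    // The label set (name, age, location) and the "Label: value" layout are assumptions.
    private static final Pattern FIELD = Pattern.compile(
        "^[ \\t]*(name|age|location)[ \\t]*:[ \\t]*(.+?)[ \\t]*$",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Returns the first value found for each label, keyed by the lower-cased label.
    static Map<String, String> extractFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }
}
```

The text passed in could be, for example, the string returned by BodyContentHandler.toString() after parsing. How the resulting map is then stored or indexed is left open here.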




> Tika Parsing plugin issue
> -------------------------
>
>                 Key: NUTCH-2079
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2079
>             Project: Nutch
>          Issue Type: New Feature
>          Components: deployment
>    Affects Versions: 2.3
>         Environment: Ubuntu 14.04
>            Reporter: Pradumna Panditrao
>             Fix For: 2.3
>
>
> Hi,
> I am trying to parse particular data and post it to MongoDB. However, when
> I try to modify the parse-tika plugin, it is too tightly coupled with
> other classes and it misses the data. I want to pick up particular data from a website
> using the same plugin and put it into MongoDB.
> Please suggest how to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
