manifoldcf-user mailing list archives

From S <st...@remcam.net>
Subject RE: How to extract text content and index in elastic-search
Date Fri, 06 Oct 2017 13:57:49 GMT
Hi Dileepa,
If you're using a later version of ES, you can just add the Ingest Attachment plugin to ES.
Alternatively, add a field name for the Content field in the MCF ES configuration.
I'll check it when I get back.
Steph
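
The ingest route suggested above can be sketched on the Elasticsearch side as follows. This is only an illustration under assumptions: it presumes an ES version with ingest nodes and the ingest-attachment plugin installed, and the pipeline name `attachment`, the field name `data`, the document id `doc-1`, and the URL `http://localhost:9200` are hypothetical placeholders, not values from this thread. Only the index `test` and type `generictype` come from the indexed document quoted in the original message. The code builds the two requests without sending them; whether and how the ManifoldCF output connector can be pointed at such a pipeline is not shown here.

```python
import json

# Hypothetical placeholders, chosen for illustration only.
ES_URL = "http://localhost:9200"
PIPELINE_ID = "attachment"
SOURCE_FIELD = "data"  # field assumed to carry the base64-encoded content


def build_pipeline(field):
    """Ingest pipeline body that runs the attachment processor on `field`."""
    return {
        "description": "Extract text from a base64-encoded field",
        "processors": [{"attachment": {"field": field}}],
    }


def pipeline_request(es_url, pipeline_id, field):
    """(method, url, body) for creating the pipeline."""
    url = f"{es_url}/_ingest/pipeline/{pipeline_id}"
    return "PUT", url, json.dumps(build_pipeline(field))


def index_request(es_url, index, doc_type, doc_id, pipeline_id, field, b64):
    """(method, url, body) for indexing one document through the pipeline."""
    url = f"{es_url}/{index}/{doc_type}/{doc_id}?pipeline={pipeline_id}"
    return "PUT", url, json.dumps({field: b64})


if __name__ == "__main__":
    print(pipeline_request(ES_URL, PIPELINE_ID, SOURCE_FIELD))
    # "test" and "generictype" are the index and type seen in the thread.
    print(index_request(ES_URL, "test", "generictype", "doc-1",
                        PIPELINE_ID, SOURCE_FIELD, "RGV2ZWxvcG1lbnQ="))
```

With such a pipeline in place, the processor would write the extracted text under an `attachment` object in the indexed document by default.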

-----Original Message-----
From: "Dileepa Jayakody" <dileepajayakody@gmail.com>
Sent: 06/10/2017 07:39
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Subject: Re: How to extract text content and index in elastic-search

Guys, I'm using the latest 2.8.1 release.


Thanks



On Fri, Oct 6, 2017 at 6:05 PM, Dileepa Jayakody <dileepajayakody@gmail.com> wrote:

Hi All,


I'm trying out a small demo with a file system repository connector and an Elasticsearch output
connector, to extract spreadsheet documents and index them.

I've also added the Tika transformation connector to the job.



When I run the job, the documents get indexed in Elasticsearch, but the content is indexed as
binary.


The indexed content in ES is shown below. How can I get the spreadsheet content extracted
as text here?

Even for a text file, I see the content is being indexed as binary.

Is there any configuration I need to set to get the text content extracted and indexed in
ES?


{
        "_index": "test",
        "_type": "generictype",
        "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
        "_score": 1,
        "_source": {
          "stream_size": "101613",
          "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
          "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
          "protected": "false",
          "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
          "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
          "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
          "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
          "allow_token_document": "__nosecurity__",
          "deny_token_document": "__nosecurity__",
          "allow_token_share": "__nosecurity__",
          "deny_token_share": "__nosecurity__",
          "allow_token_parent": "__nosecurity__",
          "deny_token_parent": "__nosecurity__",
          "file": {
            "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "_name": "MI - Project2 - Estimation v1.0.xlsx",
            "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
          }
        }
}
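
A quick way to see what the `_content` value above actually holds is to base64-decode it. The sketch below uses only the leading part of the pasted value, cut at a 4-character boundary before the point where the paste becomes truncated (the trailing `.....` is left alone). It assumes the value is standard base64 of UTF-8 text. Raw .xlsx bytes would be a ZIP archive, whose base64 form typically begins with `UEsD`, which is not the case here.

```python
import base64

# Leading 80 characters of the "_content" value from the message above.
content_prefix = (
    "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJ"
    "QXNzdW1wdGlvbnMgYW5kIHNjb3Bl"
)

# Decode base64 to bytes, then interpret the bytes as UTF-8 text (assumption).
decoded = base64.b64decode(content_prefix).decode("utf-8")

# repr() makes the tab and newline separators visible.
print(repr(decoded))
```

The decoded prefix is tab-separated plain text starting with "Development Estimates", which suggests the text had already been extracted before indexing and that the field simply carries it in base64 form.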


Thanks,

Dileepa