Did you check the "modified" header returned with the documents from Liferay? Some systems tend to always use "now", which could explain the behavior (this might even be a configuration option). You can see this in a browser's debug window when you reload the page a couple of times (Ctrl+F5 to force reloading).
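If you would rather script that check than reload in the browser, something like the following might do. It is only a rough sketch: `fetch_last_modified` and `header_is_stable` are names I made up, and it assumes the page answers an unauthenticated HEAD request, which may well not be true for a Liferay intranet.

```python
import urllib.request


def fetch_last_modified(url):
    """Return the Last-Modified header for url, or None if the server omits it."""
    # Assumption: the server accepts HEAD without authentication.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Last-Modified")


def header_is_stable(values):
    """True only if every fetch returned the same, non-missing header value."""
    return len(values) > 0 and None not in values and len(set(values)) == 1
```

Collect the header from a few requests spaced a couple of seconds apart and pass the list to `header_is_stable`; if it returns False, the server is probably stamping each response with the current time or not sending the header at all.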
I've launched the job a few times with a small set of documents, and what I see is that Elastic indexes every document on every run, even though the size of each document is always the same and I don't notice any dynamic HTML content, such as the current time, that could cause the checksum to differ.
Consulting the "Simple history" menu option shows that the Elastic output connector is called:
"08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
So I guess there is a misconfiguration somewhere...
On Thu, Aug 23, 2018 at 1:45, Karl Wright (<email@example.com>) wrote:
I take it from your question that you are using the Web Connector?
All connectors create a version string that is used to determine whether content needs to be reindexed or not. The Web Connector's version string uses a checksum of the page contents; we found the "last modified" header to be unreliable, if I recall correctly.
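To illustrate the idea only (this is not the Web Connector's actual code; the function names and the choice of SHA-1 are assumptions made for the sketch), a checksum-based version string could work roughly like this:

```python
import hashlib


def version_string(page_bytes):
    """Hypothetical version string: a hex digest of the raw page contents."""
    return hashlib.sha1(page_bytes).hexdigest()


def needs_reindex(stored_version, page_bytes):
    """Reindex when nothing is stored yet or the stored version no longer matches."""
    return stored_version != version_string(page_bytes)
```

With a scheme like this, a page of identical size still gets a new version string if even one byte differs, for example a session token or timestamp embedded in the markup, so an unchanged document size does not by itself rule out a changed checksum.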
On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <firstname.lastname@example.org> wrote:
I am currently creating a job that indexes part of a Liferay intranet's content. Every time the job runs, the documents are fully reindexed in Elastic, even when they haven't changed. I thought I had read somewhere that the crawler uses the "last-modified" HTTP header, but also that it saves a hash in the database. I looked for the answer in the user's manual with no luck, so could you please tell me which one it actually uses?
Thanks in advance!