lucene-dev mailing list archives

From Gergő Törcsvári (JIRA) <j...@apache.org>
Subject [jira] [Updated] (LUCENE-5736) Separate the classifiers to online and caching where possible
Date Sun, 08 Jun 2014 09:37:01 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gergő Törcsvári updated LUCENE-5736:
------------------------------------

    Attachment: CachingNaiveBayesClassifier.java

The attached class is a working copy!

This is a caching version of the SimpleNaiveBayes classifier. The cache is a hash map:
when a word is needed, we look it up in the index for every class and store the result in
the map. The next time, we read it from the cache instead of searching the index again.
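
The lookup described above could be sketched roughly as follows. This is not the attached
class: LazyTermCache, countsFor, and the indexLookup function (a stand-in for the real index
search) are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of a lazily filled term cache; not the attached classifier.
class LazyTermCache {
    private final List<String> classes;
    // Assumed stand-in for the index search: (term, class) -> occurrences of term in class.
    private final BiFunction<String, String, Integer> indexLookup;
    // term -> (class -> occurrences of the term in that class)
    private final Map<String, Map<String, Integer>> cache = new HashMap<>();
    // Counts calls to the index, only so the effect of the cache can be observed.
    int indexLookups = 0;

    LazyTermCache(List<String> classes, BiFunction<String, String, Integer> indexLookup) {
        this.classes = classes;
        this.indexLookup = indexLookup;
    }

    Map<String, Integer> countsFor(String term) {
        Map<String, Integer> perClass = cache.get(term);
        if (perClass == null) {
            // First request for this term: search it for every class, then remember it.
            perClass = new HashMap<>();
            for (String c : classes) {
                perClass.put(c, indexLookup.apply(term, c));
                indexLookups++;
            }
            cache.put(term, perClass);
        }
        return perClass;
    }
}
```

A second call to countsFor with the same term returns the stored map and does not touch the
index again.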

Cache (re)initialization recalculates docsWithClassSize, clears the hash maps, and prepares
new ones. Two maps and a list are needed: the first map holds term-classes-termInClassOccurrence
(this is the cache), the list holds the class names, and the second map holds class-avgUniqueTermNumber.
The list and the second map are fully preloaded; the first map is built dynamically during searches.
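
As a rough illustration of those three structures and the reset step, assuming the per-class
statistics have already been read from the index: ClassifierCacheState, reInit, and the
freshStats parameter are hypothetical names, not taken from the attachment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the two maps and the list described above.
class ClassifierCacheState {
    // term -> (class -> termInClassOccurrence); this is the cache, filled during searches.
    final Map<String, Map<String, Integer>> termClassOccurrences = new HashMap<>();
    // Class names; fully preloaded on (re)initialization.
    final List<String> classNames = new ArrayList<>();
    // class -> avgUniqueTermNumber; fully preloaded on (re)initialization.
    final Map<String, Double> classAvgUniqueTerms = new HashMap<>();
    int docsWithClassSize;

    // freshStats (class -> average unique term count) is assumed to come from the index.
    void reInit(int docsWithClass, Map<String, Double> freshStats) {
        docsWithClassSize = docsWithClass;
        // Drop everything learned from the previous index state.
        termClassOccurrences.clear();
        classNames.clear();
        classAvgUniqueTerms.clear();
        // Preload the list and the second map; the term cache starts out empty.
        classNames.addAll(freshStats.keySet());
        classAvgUniqueTerms.putAll(freshStats);
    }
}
```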

With many terms and/or classes the cache needs a lot of memory, so there is a built-in option
for limiting the cache size. Terms that are really rare are expected to appear rarely in
other documents too, so they are left out of the cache. There is also an option to leave
them out of the classification calculation entirely.
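
One way such a cutoff might look, assuming total per-term occurrence counts are available
and the limit is a simple minimum-occurrence threshold. RareTermPolicy, minOccurrences, and
skipRareTerms are hypothetical names; the attachment may decide this differently.

```java
import java.util.Map;

// Hypothetical policy for the two options above: keep rare terms out of the cache,
// and optionally out of the classification calculation as well.
class RareTermPolicy {
    // term -> total occurrences in the index (assumed to be precomputed).
    private final Map<String, Integer> totalOccurrences;
    private final int minOccurrences;
    private final boolean skipRareTerms;

    RareTermPolicy(Map<String, Integer> totalOccurrences, int minOccurrences, boolean skipRareTerms) {
        this.totalOccurrences = totalOccurrences;
        this.minOccurrences = minOccurrences;
        this.skipRareTerms = skipRareTerms;
    }

    // Only terms at or above the threshold are stored in the cache.
    boolean shouldCache(String term) {
        return totalOccurrences.getOrDefault(term, 0) >= minOccurrences;
    }

    // With skipRareTerms set, terms below the threshold are ignored when scoring;
    // otherwise they still count, but would be looked up in the index every time.
    boolean shouldScore(String term) {
        return !skipRareTerms || shouldCache(term);
    }
}
```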

> Separate the classifiers to online and caching where possible
> -------------------------------------------------------------
>
>                 Key: LUCENE-5736
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5736
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: modules/classification
>            Reporter: Gergő Törcsvári
>         Attachments: CachingNaiveBayesClassifier.java
>
>
> The Lucene classifier implementations are now near-online if they get a near-realtime
> reader. This is good for users who have a continuously changing dataset, but slow for
> datasets that do not change.
> The idea is: what if we implement a cache and speed up the results where possible?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

