Welcome to ftp.vim.org, hosted by ftp.nluug.nl
Current directory: /ftp/pub/os/Linux/distr/salix/i486/extra-14.2/source/libraries/libexttextcat/
Contents of README:

Libtextcat is a library of functions that implement the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". It was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare it with the fingerprints of a number of documents whose categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency. Fingerprints are compared with a simple out-of-place metric. See the article for more details.

Considerable effort went into making this implementation fast and efficient. The language guesser processes over 100 documents/second on a simple PC, which makes it practical for many uses. It was developed for use in our webcrawler and search engine software, in which it handles millions of documents a day.
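The fingerprint and out-of-place comparison described above can be illustrated with a minimal Python sketch. This is not libtextcat's C API or its actual implementation. The function names (fingerprint, out_of_place, classify), the underscore word padding, the n-gram length limit of 5, the fingerprint size of 400, and the penalty for n-grams missing from a category are all assumptions chosen for illustration.

```python
from collections import Counter

def fingerprint(text, max_n=5, top=400):
    """Return the most frequent character n-grams, most frequent first.

    max_n and top are illustrative defaults, not values taken from libtextcat.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries (assumed convention)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, how far each rank is from its rank in cat_fp."""
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)  # assumed penalty for an n-gram absent from the category
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )

def classify(text, category_fps):
    """Return the name of the category whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda name: out_of_place(doc_fp, category_fps[name]))
```

A hypothetical use would build one fingerprint per known language from sample text, for example category_fps = {"en": fingerprint(english_sample), "nl": fingerprint(dutch_sample)}, and then call classify(unknown_text, category_fps). A smaller distance means a closer match, and a fingerprint compared with itself has distance 0.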
Name                        Last modified      Size
Parent Directory                               -
README                      21-Aug-2016 22:07  1.1K
libexttextcat-3.4.4.tar.xz  21-Aug-2016 22:07  1.0M
libexttextcat.SlackBuild    12-Dec-2015 01:25  2.8K
libexttextcat.info          21-Aug-2016 22:07  339
slack-desc                  21-Aug-2016 22:07  1.0K