unfinished documents:
pfe sources pfe manpages pfe docbook |
"bow library" http://www-2.cs.cmu.edu/~mccallum/bow/ - LGPL, plain C

From its front page:

The library provides facilities for:
* Recursively descending directories, finding text files.
* Finding `document' boundaries when there are multiple documents per file.
* Tokenizing a text file, according to several different methods.
* Including N-grams among the tokens.
* Mapping strings to integers and back again, very efficiently.
* Building a sparse matrix of document/token counts.
* Pruning vocabulary by word counts or by information gain.
* Building and manipulating word vectors.
* Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
* Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
* Scoring queries for retrieval or classification.
* Writing all data structures to disk in a compact format.
* Reading the document/token matrix from disk in an efficient, sparse fashion.
* Performing test/train splits, and automatic classification tests.
* Operating in server mode, receiving and answering queries over a socket.

The library does not:
* Have English parsing or part-of-speech tagging facilities.
* Do smoothing across N-gram models.
* Claim to be finished.
* Have good documentation.
* Claim to be bug-free.

Isearch http://www.etymon.com/Isearch/ - BSD, C/C++
Amberfish http://www.etymon.com/Amberfish/ - GPL, C/C++ |
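For orientation, here is a rough sketch of what four of the bow features listed above mean in practice: the string/integer mapping, the sparse document/token count matrix, TFIDF weighting, and Laplace smoothing. It is written in Python for brevity and does not use bow or imitate its C API. Every name in it (`Vocabulary`, `count_matrix`, `tfidf`, `laplace_prob`) is made up for illustration, and the whitespace tokenizer and the particular TFIDF variant (raw count times log(N/df)) are assumptions, not necessarily what bow does.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class Vocabulary:
    """Hypothetical two-way map between token strings and integer ids."""

    def __init__(self):
        self.index_of = {}   # token string -> integer id
        self.token_of = []   # integer id -> token string

    def add(self, token):
        """Return the id for token, assigning the next free id if it is new."""
        if token not in self.index_of:
            self.index_of[token] = len(self.token_of)
            self.token_of.append(token)
        return self.index_of[token]


def count_matrix(docs, vocab):
    """Sparse document/token counts: one {token_id: count} mapping per document."""
    return [Counter(vocab.add(t) for t in tokenize(d)) for d in docs]


def tfidf(rows):
    """Weight each count by log(N / document frequency); one common TFIDF variant."""
    n_docs = len(rows)
    doc_freq = Counter(tid for row in rows for tid in row)
    return [
        {tid: count * math.log(n_docs / doc_freq[tid]) for tid, count in row.items()}
        for row in rows
    ]


def laplace_prob(row, token_id, vocab_size):
    """P(token | document) with add-one (Laplace) smoothing."""
    return (row.get(token_id, 0) + 1) / (sum(row.values()) + vocab_size)


docs = ["the cat sat", "the dog sat down"]
vocab = Vocabulary()
rows = count_matrix(docs, vocab)
weights = tfidf(rows)

# "dog" never occurs in the first document, yet smoothing keeps its probability non-zero.
p_dog = laplace_prob(rows[0], vocab.index_of["dog"], len(vocab.token_of))
```

Storing each row as a mapping from token id to count keeps memory proportional to the number of distinct tokens per document, not to the vocabulary size, which is the point of a sparse matrix here.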