Blog from Apr 06, 2010

I've added a new OntologyScorer implementation in the edu.sdsc.daks.nif.server.ontology package. It uses a Lucene Analyzer to score text, taking advantage of the shingle filter (word n-grams) and the snowball filter (stemming), which lets us score the text coming through the tokenizer more accurately. The examples below suggest it is working correctly, though more test cases would be ideal. Since it's now based on Lucene, I think it should integrate nicely into Solr, but I haven't tried that yet.
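
To make the analysis chain concrete, here is a minimal sketch of how such an analyzer can be assembled (this is not the actual NIF code; it targets the Lucene 3.0-era API, and the class name, shingle size, and stemmer language are my own choices for illustration):

    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.snowball.SnowballFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class ShingleSnowballAnalyzer extends Analyzer {

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Tokenize, lowercase, stem with Snowball, then emit word n-grams (shingles).
            TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
            stream = new LowerCaseFilter(stream);
            stream = new SnowballFilter(stream, "English");
            stream = new ShingleFilter(stream, 2); // unigrams plus two-word shingles
            return stream;
        }

        // Quick check: print the tokens the chain produces for a sample sentence.
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new ShingleSnowballAnalyzer();
            TokenStream stream = analyzer.tokenStream("text",
                    new StringReader("Purkinje cells are neurons in the cerebellum"));
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.term());
            }
        }
    }

Run on a short sentence, this prints stemmed unigrams interleaved with two-word shingles (something like "purkinj", "purkinj cell", "cell", ...), which is the kind of token stream the scorer then works over.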

Here are some sample scores with the new Scorer implementation:
neuroscience.txt scores 1.1073122 in 191ms
boxing.txt scores 0.5 in 219ms
purkinje.txt scores 1.1548238 in 171ms
4mb.txt (bible) scores 0.5 in 3601ms
1mb.txt (DB2 guide) scores 0.5 in 968ms
600k.txt (outline of science) scores 0.5 in 510ms
boost.txt scores 0.5 in 15ms

And some timing information (n = 100):
For the content of the Wikipedia Purkinje page, we average 13.48ms with a standard deviation of 17.6ms.
For the 600k file, we average 490.67ms with a standard deviation of 54.6ms.
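
For anyone who wants to reproduce this kind of measurement, a harness along these lines would do it (the scorer.score(...) call is a stand-in; I'm not reproducing the actual OntologyScorer API here):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ScorerBenchmark {
        public static void main(String[] args) throws Exception {
            // Read the input file up front so file I/O is excluded from the timings.
            StringBuilder sb = new StringBuilder();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            char[] buf = new char[8192];
            int read;
            while ((read = in.read(buf)) != -1) {
                sb.append(buf, 0, read);
            }
            in.close();
            String text = sb.toString();

            int n = 100;
            double[] millis = new double[n];
            for (int i = 0; i < n; i++) {
                long start = System.nanoTime();
                // scorer.score(text); // hypothetical: the call under test goes here
                millis[i] = (System.nanoTime() - start) / 1e6;
            }

            // Mean and (population) standard deviation over the n runs.
            double mean = 0.0;
            for (double m : millis) mean += m;
            mean /= n;
            double var = 0.0;
            for (double m : millis) var += (m - mean) * (m - mean);
            System.out.printf("n=%d mean=%.2fms stddev=%.2fms%n", n, mean, Math.sqrt(var / n));
        }
    }

One observation on the numbers above: the standard deviation for the Purkinje page (17.6ms) is larger than its mean (13.48ms), which probably reflects JIT warm-up making the first few iterations much slower than the rest.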