A powerful feature of Symbiostock is the ability to find related images. The first implementation used a straightforward algorithm: assign 1 point for every keyword tag two images share. This works remarkably well, but there are occasional oddities, both false positives and false negatives.
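As a concrete illustration (not Symbiostock's own code, which lives in the WordPress plugin), a minimal Python sketch of the one-point-per-shared-keyword scoring could look like this; the function name and example tags are hypothetical:

```python
from typing import Set

def shared_keyword_score(tags_a: Set[str], tags_b: Set[str]) -> int:
    """First algorithm: award 1 point for every keyword tag the two images share."""
    return len(tags_a & tags_b)

# Hypothetical example: two images that share only 'France' score 1 point.
print(shared_keyword_score({"map", "France"}, {"snow", "scene", "France"}))  # 1
```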
A second algorithm developed from the observation that commonly used keywords may have an influence that outweighs their utility. Common keywords like 'Seattle' or 'blue' are less useful in matching than uncommon ones like 'skier' or 'map', so a better result comes from weighting the matches, giving less relevance to common keywords. To test this, I set up a toy matrix with 11 images and 9 keywords.
Then I applied the two algorithms, producing a series of tables.
The left-hand columns show the weighted search results. Results are normalized, i.e. re-calculated so they total 100. For example, when an image uses the keyword 'France', occurring 6 times in the database, and 'map', used only 4 times, the normalized values would be 40 for 'France' and 60 for 'map'. To get a score, just add the normalized values for each shared keyword.

So, in the table for 'map France' we get a top score of 100 for 'snow scene France', 'map France' and 'map Europe'; the third choice is 'map Africa'. Note that the single-value algorithm finds the first 2 matches, but then has 4 other matches with no way to distinguish among them. Note also that the weighted algorithm can make finer distinctions: e.g. with the image 'skiing in France', the weighted approach favors 'skiing Oregon' slightly over 'snow scene France', and by a larger factor over 'snow scene Oregon'.

Results: even with this simplified toy model, the results are conclusive. In every case, the weighted model produces a better set of similar images, usually with little or no ambiguity. With more keywords and more images, the accuracy should improve.
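To make the weighting concrete, here is a minimal Python sketch of one reading of the scheme described above (the function names and data structures are my own, not Symbiostock's): each keyword is weighted by the inverse of its frequency in the database, the weights for an image are rescaled to total 100, and a candidate image scores the sum of the normalized weights of the keywords it shares with the query image.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

def normalized_weights(keywords: Set[str], keyword_counts: Counter) -> Dict[str, float]:
    """Weight each keyword by the inverse of how often it appears in the
    database, then rescale so the weights for this image total 100."""
    raw = {kw: 1.0 / keyword_counts[kw] for kw in keywords}
    total = sum(raw.values())
    return {kw: 100.0 * w / total for kw, w in raw.items()}

def weighted_matches(query_tags: Set[str],
                     images: Dict[str, Set[str]],
                     keyword_counts: Counter) -> List[Tuple[str, float]]:
    """Score each candidate by adding the query image's normalized weight for
    every keyword the candidate shares with it, highest score first."""
    weights = normalized_weights(query_tags, keyword_counts)
    scores = {name: sum(weights[kw] for kw in query_tags & tags)
              for name, tags in images.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Worked example from the text: 'France' occurs 6 times, 'map' 4 times,
# so for an image tagged 'map France' the normalized weights come out
# to roughly 40 for 'France' and 60 for 'map'.
counts = Counter({"France": 6, "map": 4})
print(normalized_weights({"France", "map"}, counts))
```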