KD-Trees and missing values (vector comparison)
- by labratmatt
I have a system that stores vectors and allows a user to find the n most similar vectors to the user's query vector. That is, a user submits a vector (I call it a query vector) and my system spits out "here are the n most similar vectors." I generate the similar vectors using a KD-Tree and everything works well, but I want to do more. I want to present a list of the n most similar vectors even if the user doesn't submit a complete vector (a vector with missing values). That is, if a user submits a vector with three dimensions, I still want to find the n nearest vectors (stored vectors are of 11 dimensions) I have stored.
I have a couple of obvious solutions, but I'm not sure either one seem very good:
Create multiple KD-Trees each built using the most popular subset of dimensions a user will search for. That is, if a user submits a query vector of thee dimensions, x, y, z, I match that query to my already built KD-Tree which only contains vectors of three dimensions, x, y, z.
Ignore KD-Trees when a user submits a query vector with missing values and compare the query vector to the vectors (stored in a table in a DB) one by one using something like a dot product.
This has to be a common problem, any suggestions? Thanks for the help.