Strange behaviour of FAISS FlatL2 Index with 1-D data #1709
Comments
Faiss is definitely not designed for low-dimensional (dims < 20, say) data, especially not scalar 1D data. The overhead from the way that Faiss performs its computations would be extreme on scalar data (i.e., it could probably compute neighbors for 32-dim vectors in the same time that it would take to compute scalar distances). The distances returned are also squared distances, could that be part of your issue?
But actually, what do you mean by "not designed for low-dimensional (dims < 20, say) data, especially not scalar 1D data" in, say, mathematical terms? In the meantime, I noticed that the distance function is defined differently (https://gist.github.com/mdouze/efc94c57e2302469287b9d1a2501d277). I would like to understand why it behaves this way. The distance metric should not be the source of the problem, as it does not change between the two cases. Edit: the above statement about the distance is actually confirmed by using Voronoi partitioning, where the results are similar.
"Not designed for" means "not optimized for", but the mathematics should be the same as the other exact distance implementations (it tries to compute the same thing), but the order of operations in floating point may differ which could cause differences (fp is not associative, etc), and squared distances are reported for L2 instead of sqrt(sum_i (x_i - y_i)^2). What is your naive implementation? IndexFlat is basically a naive though optimized implementation though with a different order of floating point operations (it uses optimized GEMM kernels to perform the dirty work). I would expect the two to be fairly similar. |
Summary
I would like to verify the performance of Faiss for Predictive Mean Matching / Imputation (see https://stefvanbuuren.name/fimd/sec-pmm.html).
Basically, I am operating on 1-D vectors that are used for indexing and querying (i.e. imputation of missing data). However, I noticed that a slight change in the function that generates missing data leads to very different results if I use Faiss with a FlatL2 index without Voronoi partitioning (so brute force). Two functions that generate the probability of non-missing data from a logistic regression are defined in the code at the end of this issue (and in the notebook on Colab) and copied below (columns `rho1` and `rho2`). The difference is that one function has `- hh.income_scaled` and the other has `+ hh.income_scaled`. Then I sample data from a Bernoulli distribution based on `rho1` and `rho2`, and estimate the target quantity (i.e. mean expenditure) without and with imputation. Then I calculate three performance measures: `bias` is the difference between the estimated and true value, `se` is the standard deviation of my estimates, and `mse` = bias^2 + var of my estimates, based on complete cases and 3 different algorithms. Please see these results below.

Do you know the reason for this behaviour of Faiss FlatL2? For one function the estimator is biased, but for the second it is significantly better. In particular, the K-D tree does exact search, so how can it be better? Is it connected with using 1-D vectors? Other algorithms such as K-D tree, Annoy, FLANN and N2 work well and as expected.
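For reference, the three measures described above can be computed from the per-run estimates like this (a minimal sketch; the estimate values are hypothetical, not from the attached data):

```python
import numpy as np

true_value = 10.0
# hypothetical estimates of the target quantity from repeated simulation runs
estimates = np.array([9.8, 10.1, 10.3, 9.9, 10.0])

bias = estimates.mean() - true_value        # estimated minus true value
se = estimates.std(ddof=1)                  # standard deviation of the estimates
mse = bias ** 2 + estimates.var(ddof=1)     # mse = bias^2 + variance
```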
Do you have any recommendation regarding 1-D data?
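As context for the 1-D question: in one dimension, exact nearest-neighbour search reduces to binary search on a sorted copy of the donor values, so no vector index is needed at all. A minimal numpy sketch (variable names are illustrative, not from the notebook):

```python
import numpy as np

rng = np.random.default_rng(1)
donors = rng.normal(size=1000)    # observed (complete) cases
queries = rng.normal(size=200)    # cases needing imputation

# Sort donors once, then find each query's insertion point by binary search.
sorted_d = np.sort(donors)
pos = np.searchsorted(sorted_d, queries)
pos = np.clip(pos, 1, len(sorted_d) - 1)

# The nearest neighbour is one of the two values bracketing the query.
left, right = sorted_d[pos - 1], sorted_d[pos]
nearest = np.where(queries - left <= right - queries, pos - 1, pos)
imputed = sorted_d[nearest]       # 1-NN "donor" value for each query
```

This is exact, runs in O((n + q) log n), and avoids the per-query overhead a general-purpose index pays on scalar data.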
Platform
Google Colab
Running on:
Interface:
Reproduction instructions
Code to reproduce this issue is in the Jupyter notebook, which should be run with GPU acceleration (e.g. on Colab).
https://colab.research.google.com/drive/1lhANDL68y4O4M50Zwg-5r8dp_Dq5MQ4A?usp=sharing
To run this example you should load the data that is attached to this issue.
households.xlsx
The full code is below