scipy: pdist indexing
scipy.spatial.distance.pdist(X)
gives the pair-wise distances of X
, $\operatorname{dist}(X[i], X[i])$ not included. The distances are stored in a dense matrix D
.
Question: how to get $\operatorname{dist}(X[i], X[j])$ without expanding this dense matrix by scipy.spatial.distance.squareform(D)
?
A good answer from How does condensed distance matrix work? (pdist):
def square_to_condensed(i, j, n):
assert i != j, "no diagonal elements in condensed matrix"
if i < j:
i, j = j, i
return n*j - j*(j+1)/2 + i - 1 - j
Take $n$ == X.shape[0]
== $4$ as an example. In the code, $i$ and $j$ are swapped if $i < j$, so only consider the bottom triangular region, i.e. $i > j$ always holds in the discussion below.
i\j | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | - | D[0] | D[1] | D[2] |
1 | D[0] | - | D[3] | D[4] |
2 | D[1] | D[3] | - | D[5] |
3 | D[2] | D[4] | D[5] | - |
Suppose the indexing function is $\operatorname{f}$: $\operatorname{f}(i, j) = k$ when $\operatorname{dist}(X[i], X[j]) = D[k]$
len(D)
= ${n \choose 2} = (n-1) + (n-2) + \dots + 1$.
- When $j = 0$, $\operatorname{f}(i, j) = i - j - 1$
- When $j = 1$, $\operatorname{f}(i, j) = (n-1) + (i - j - 1)$
- When $j = 2$, $\operatorname{f}(i, j) = (n-1) + (n-2) + (i - j - 1)$
So the pattern here is:
\[\operatorname{f}(i, j) = \text{indices used in previous } (j-1) \text{ columns} + \text{starting index in } j^{\text{th}} \text{ column}\]Further more:
\[\operatorname{f}(i, j) = {n \choose 2} - {n-j \choose 2} + (i - j - 1) = nj - \frac{j(j+1)}{2} + i - j - 1\]
Comments