dHsic
- class hyppo.d_variate.dHsic(compute_kernel='gaussian', bias=True, **kwargs)
\(d\)-variate Hilbert Schmidt Independence Criterion (dHsic) test statistic and p-value.
dHsic is a non-parametric, kernel-based test of joint independence among an arbitrary number of variables. The dHsic statistic is 0 if the variables are jointly independent and positive if the variables are dependent [1]. The default kernel is the Gaussian kernel with the median distance as the bandwidth. Because the Gaussian kernel is characteristic, dHsic is guaranteed to be a consistent test [1] [2] [3].
- Parameters
compute_kernel (str, callable, or None, default: "gaussian") -- A function that computes the kernel similarity among the samples within each data matrix. Valid strings for compute_kernel are, as defined in sklearn.metrics.pairwise.pairwise_kernels, ["additive_chi2", "chi2", "linear", "poly", "polynomial", "rbf", "laplacian", "sigmoid", "cosine"]. Note that "rbf" and "gaussian" are the same kernel. Set to None or "precomputed" if args are already similarity matrices. To call a custom function, either create the similarity matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise kernel similarity matrices are calculated and kwargs are extra arguments to send to your custom function.
bias (bool, default: True) -- Whether to use the biased (True) or unbiased (False) test statistic.
**kwargs -- Arbitrary keyword arguments for multi_compute_kern.
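As an illustration of the custom-function form metric(x, **kwargs) described above, the following is a minimal sketch of a Gaussian kernel with a user-chosen bandwidth. The function name fixed_bandwidth_gaussian and its sigma argument are hypothetical and are not part of hyppo. The sketch only builds the similarity matrix with NumPy; the commented line at the end shows how such a callable would presumably be handed to the constructor, based on the parameter description above.

```python
import numpy as np


def fixed_bandwidth_gaussian(x, sigma=1.0):
    """Hypothetical custom kernel: Gaussian similarity with a fixed bandwidth.

    x is an (n, p) data matrix; the result is an (n, n) similarity matrix.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    sq_norms = np.sum(x**2, axis=1)
    # Squared Euclidean distances between all pairs of rows
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against negative round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
kernel_matrix = fixed_bandwidth_gaussian(x, sigma=2.0)
print(kernel_matrix.shape)

# Assumed usage, following the parameter description above:
#   dHsic(compute_kernel=fixed_bandwidth_gaussian, sigma=2.0)
```

Alternatively, the matrices can be computed beforehand and passed as the inputs with compute_kernel set to None or "precomputed".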
Notes
The statistic can be derived as follows [1]:
dHsic builds on the two-variable Hilbert Schmidt Independence Criterion (Hsic), implemented in hyppo.independence.Hsic, but allows for an arbitrary number of variables. For a given kernel, the joint distribution and the product of the marginals are mapped to the reproducing kernel Hilbert space, and the squared distance between the two embeddings is calculated. The dHsic statistic is given by
\[\mathrm{dHsic} (\mathbb{P}^{(X^1, \ldots, X^d)}) = \Big\Vert \Pi(\mathbb{P}^{X^1} \otimes \cdots \otimes \mathbb{P}^{X^d}) - \Pi(\mathbb{P}^{(X^1, \ldots, X^d)}) \Big\Vert^2_{\mathscr{H}}\]
where \(\Pi\) denotes the embedding of a distribution into the reproducing kernel Hilbert space \(\mathscr{H}\). Similar to Hsic, dHsic uses a Gaussian median kernel by default, and the p-value is calculated with a permutation test using hyppo.tools.multi_perm_test.
References
- [1] Nikolas Pfister, Peter Bühlmann, Bernhard Schölkopf, and Jonas Peters. Kernel-based Tests for Joint Independence. arXiv:1603.00285 [math, stat], November 2016.
- [2] Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems, 2007.
- [3] Arthur Gretton and László Györfi. Consistent Nonparametric Tests of Independence. Journal of Machine Learning Research, 11(46):1391–1423, 2010.
Methods Summary
| dHsic.statistic(*args) | Helper function that calculates the dHsic test statistic. |
| dHsic.test(*args[, reps, workers]) | Calculates the dHsic test statistic and p-value. |
- dHsic.statistic(*args)
Helper function that calculates the dHsic test statistic.
- Parameters
*args (ndarray of float) -- Variable length input data matrices. All inputs must have the same number of samples. That is, the shapes must be (n, p), (n, q), etc., where n is the number of samples and p and q are the number of dimensions.
- Returns
stat (float) -- The computed dHsic statistic.
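To make the quantity returned by this helper more concrete, here is a minimal NumPy sketch of the biased (V-statistic) dHsic estimator described in [1], computed from one kernel matrix per variable. It is not hyppo's implementation: the function names gaussian_median_kernel and dhsic_biased are hypothetical, and the bandwidth convention (the median pairwise distance used as the Gaussian standard deviation) is an assumption that may differ from the library's internals.

```python
import numpy as np


def gaussian_median_kernel(x):
    """Gaussian kernel matrix for an (n, p) array (assumed bandwidth rule:
    sigma = median pairwise Euclidean distance between distinct samples)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against negative round-off
    sigma = np.median(np.sqrt(sq_dists[np.triu_indices(n, k=1)]))
    if sigma == 0:
        sigma = 1.0  # degenerate case: all samples identical
    return np.exp(-sq_dists / (2.0 * sigma**2))


def dhsic_biased(*kernels):
    """Biased dHsic estimate from d kernel matrices, each of shape (n, n)."""
    stacked = np.stack(kernels)  # shape (d, n, n)
    # Embedding of the joint: mean over (a, b) of the product over variables
    term_joint = np.mean(np.prod(stacked, axis=0))
    # Embedding of the product of marginals: product of per-variable means
    term_marginals = np.prod(np.mean(stacked, axis=(1, 2)))
    # Cross term: mean over a of the product of per-variable row means
    term_cross = 2.0 * np.mean(np.prod(np.mean(stacked, axis=2), axis=0))
    return term_joint + term_marginals - term_cross


rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y = x + 0.1 * rng.normal(size=(n, 1))  # depends on x
z = rng.normal(size=(n, 1))            # independent of x and y

stat = dhsic_biased(*(gaussian_median_kernel(v) for v in (x, y, z)))
print(f"dHsic (x, y, z): {stat:.4f}")
```

The three terms mirror the expansion of the squared RKHS distance between the embedded joint distribution and the embedded product of marginals, with expectations replaced by sample averages.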
- dHsic.test(*args, reps=1000, workers=1)
Calculates the dHsic test statistic and p-value.
- Parameters
*args (ndarray of float) -- Variable length input data matrices. All inputs must have the same number of samples. That is, the shapes must be (n, p), (n, q), etc., where n is the number of samples and p and q are the number of dimensions.
reps (int, default: 1000) -- The number of replications used to estimate the null distribution in the permutation test that calculates the p-value.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.
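The role of reps can be illustrated with a generic permutation test for joint independence. The sketch below is written under stated assumptions and is not the code of hyppo.tools.multi_perm_test: it keeps the first input fixed, independently shuffles the samples of every other input to simulate the null, and uses the common (1 + count) / (1 + reps) p-value. The names permutation_pvalue and mean_abs_correlation are hypothetical, and the statistic is a simple stand-in rather than dHsic, chosen only to keep the example self-contained. Parallelism over workers is omitted.

```python
import numpy as np


def mean_abs_correlation(*args):
    """Stand-in statistic (not dHsic): mean absolute Pearson correlation
    over all pairs of variables. Assumes each input has shape (n, 1)."""
    cols = np.column_stack([np.ravel(a) for a in args])  # shape (n, d)
    corr = np.corrcoef(cols, rowvar=False)               # shape (d, d)
    upper = corr[np.triu_indices(len(args), k=1)]
    return float(np.mean(np.abs(upper)))


def permutation_pvalue(stat_fn, *args, reps=1000, seed=None):
    """Permutation p-value for joint independence of the inputs."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(*args)
    null = np.empty(reps)
    for i in range(reps):
        # Break any dependence: keep the first input, shuffle the rest
        permuted = [args[0]] + [a[rng.permutation(a.shape[0])] for a in args[1:]]
        null[i] = stat_fn(*permuted)
    # Fraction of null replicates at least as large as the observed value
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return observed, pvalue


rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 1))
y = x + 0.1 * rng.normal(size=(n, 1))  # depends on x
z = rng.normal(size=(n, 1))            # independent of x and y

stat, pvalue = permutation_pvalue(mean_abs_correlation, x, y, z, reps=199, seed=1)
print(f"statistic = {stat:.3f}, p-value = {pvalue:.3f}")
```

A larger reps gives a finer-grained null distribution (the smallest attainable p-value here is 1 / (1 + reps)) at proportionally higher cost, which is why spreading the replications over several cores can help.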
- Returns