A collection of utility functions for correlation analysis.
Check the examples folder for notebooks.
Compute correlation matrix and its p-values
- pearson -- Pearson/Sample correlation (interval- and ratio-scale data)
- kendall -- Kendall's tau rank correlation (ordinal data)
- spearman -- Spearman rho rank correlation (ordinal data)
- mcc -- Matthews correlation coefficient between binary variables
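To make the difference between the sample and the rank correlations concrete, here is a minimal pure-Python sketch. It is not korr's implementation: the names `pearson_r`, `average_ranks`, `spearman_rho` and `corr_matrix` are hypothetical, and p-values are left out because the standard library has no t-distribution.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def corr_matrix(columns, corr=pearson_r):
    """Pairwise correlation matrix for a list of columns."""
    return [[corr(a, b) for b in columns] for a in columns]


# Toy data: y is a monotone but non-linear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson_r(x, y))     # below 1, the relation is not linear
print(spearman_rho(x, y))  # 1.0, the relation is perfectly monotone
```

The rank-based measure only looks at the ordering, which is why it suits ordinal data.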
EDA, Dig deeper into results
- flatten -- A table (pandas) with one row per correlation pair, holding the variable indices, the correlation, and the p-value. For example, try to find "good" cutoffs with corr_vs_pval and then look up the variable indices with flatten afterwards.
- slice_yx -- Slice the correlation and p-value matrices of a (y,X) dataset into a (y,x_i) vector and (x_j, x_k) matrices
- corr_vs_pval -- Histogram to find p-value cutoffs (alpha) for a) highly correlated pairs, b) unrelated pairs, c) the mixed results.
- bracket_pval -- Histogram with more fine-grained p-value brackets.
- corrgram -- Correlogram, heatmap of correlations with p-values in brackets
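The idea of flattening a correlation matrix into one row per pair can be sketched as follows. This is an illustration with plain lists rather than pandas; `flatten_pairs` is a hypothetical helper, not korr's `flatten`, and the matrices are made-up toy values.

```python
def flatten_pairs(rho, pval):
    """Return one row (i, j, rho, pval) per unordered pair with i < j."""
    n = len(rho)
    return [
        (i, j, rho[i][j], pval[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


# Toy 3x3 correlation and p-value matrices (illustrative values only).
rho = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, -0.3],
    [0.1, -0.3, 1.0],
]
pval = [
    [0.0, 0.001, 0.70],
    [0.001, 0.0, 0.20],
    [0.70, 0.20, 0.0],
]

rows = flatten_pairs(rho, pval)

# Apply cutoffs (chosen here by hand) and look up the variable indices.
strong = [(i, j) for i, j, r, p in rows if abs(r) >= 0.5 and p <= 0.05]
print(strong)
```

Only the upper triangle is visited, so each pair appears once and the diagonal is skipped.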
Utility functions
- confusion -- Confusion matrix. Required for the Matthews correlation (mcc), and a bit faster than sklearn's
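How the four confusion-matrix counts feed into the Matthews correlation can be sketched in a few lines. The function names `confusion_counts` and `matthews` are assumptions for illustration and do not reflect korr's signatures.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, fp, fn, tn) for two binary sequences."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def matthews(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy binary variables.
a = [1, 1, 0, 0, 1, 0]
b = [1, 0, 0, 0, 1, 1]

counts = confusion_counts(a, b)
print(counts)             # (tp, fp, fn, tn)
print(matthews(*counts))
```

Returning 0.0 for a zero denominator is one common convention; other choices (such as NaN) are equally defensible.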
Parameter Stability
- bootcorr -- Estimate multiple correlation matrices based on bootstrapped samples. From there you can assess how stable the correlation estimates are, i.e. how sensitive they are to in-sample variation. For example, stable estimates are good candidates for modeling, and unstable correlation pairs are good candidates for P-hacking and non-reproducibility.
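The bootstrap idea behind such a stability check can be sketched for a single pair of variables. This is a simplified stand-in, not korr's `bootcorr`: the helper `boot_pearson`, its parameters `n_draws` and `seed`, and the toy data are all assumptions.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Sample Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else float("nan")


def boot_pearson(x, y, n_draws=200, seed=42):
    """Re-estimate the correlation on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_draws):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):  # skip degenerate resamples
            estimates.append(r)
    return estimates


# Toy data: y follows x with a small deterministic perturbation.
x = [float(i) for i in range(20)]
y = [2.0 * v + ((i * 7) % 5 - 2) for i, v in enumerate(x)]

est = boot_pearson(x, y)
# A narrow spread suggests a stable estimate, a wide one an unstable pair.
print(min(est), max(est), statistics.pstdev(est))
```

Rows are resampled jointly, so each draw keeps the pairing between `x` and `y` intact.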
Variable Selection, Search Functions
- mincorr -- From all estimated correlation pairs, pick a given number n=3,5,.. of variables with low and insignificant correlations among each other. (See the binsel package for an application.)
- find_best -- Find the N "best", i.e. high and most significant, correlations
- find_worst -- Find the N "worst", i.e. insignificant/random and low, correlations
- find_unrelated -- Return variable indices of unrelated pairs (in terms of insignificant p-value)
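As a rough illustration of searching for unrelated pairs, the sketch below returns the index pairs whose p-value exceeds a significance level. The helper name `unrelated_pairs`, the `alpha` default, and the toy matrix are assumptions; korr's own selection criteria may differ.

```python
def unrelated_pairs(pval, alpha=0.05):
    """Index pairs (i, j), i < j, whose p-value is above alpha."""
    n = len(pval)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if pval[i][j] > alpha
    ]


# Toy symmetric p-value matrix (illustrative values only).
pval = [
    [0.0, 0.001, 0.70],
    [0.001, 0.0, 0.20],
    [0.70, 0.20, 0.0],
]

print(unrelated_pairs(pval))
```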
The korr git repo is available as a PyPI package:
pip install korr
python3.7 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
(If your git repo is stored in a folder whose path contains whitespace, then don't use the subfolder .venv. Use an absolute path without whitespace.)
- Check syntax: flake8 --ignore=F401
- Run Unit Tests: pytest
- Remove .pyc files: find . -type f -name "*.pyc" | xargs rm
- Remove __pycache__ folders: find . -type d -name "__pycache__" | xargs rm -rf
Publish
pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
Please open an issue for support.
Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.