
BioKlustering: a web app for semi-supervised learning of maximally unbalanced genomic data


BioKlustering overview

We introduce BioKlustering, a user-friendly, open-source, publicly available web app for unsupervised and semi-supervised learning, specialized for cases where sequence alignment and/or experimental phenotyping of all classes is not possible.

Among its main advantages, BioKlustering

  1. allows for maximally unbalanced settings of partially observed labels, including cases when only one class is observed, which most semi-supervised methods currently do not support,
  2. takes unaligned sequences as input and thus allows learning on widely diverse sequences that are impossible to align, such as those of viruses and bacteria,
  3. is easy to use for anyone with little or no programming expertise, and
  4. works well with small sample sizes.

Usage

BioKlustering is browser-based (preferably Google Chrome), and thus, no installation is needed. Users simply need to click on the following link: https://bioklustering.wid.wisc.edu/.

More details are available in the documentation: DOCS.md.

Source Code

BioKlustering is an open-source project, and the source code is available in this repository with the following structure:

  • BioKlustering-Website contains all the code for the website and machine-learning models (see readme.md file inside this folder)
  • manuscript contains the reproducible analysis and sample dataset used in the published manuscript (in review)

Steps to run this website locally

Users with strong programming skills might like to modify the existing code and run a version of the website locally.

  1. Clone this repository by typing the following line in the terminal:
git clone https://github.com/solislemuslab/bioklustering
  2. Move into the bioklustering/BioKlustering-Website folder, then create and activate a Python virtual environment:
cd bioklustering/BioKlustering-Website
python3 -m venv virtual-env
source virtual-env/bin/activate

Note that Mac users might need the whole path to python3: /usr/local/bin/python3.

  3. Install the necessary packages by typing the following line in the terminal:
pip3 install -r requirements.txt

Note that these requirements assume you are using Python 3.8.13. Different Python versions can be managed with pyenv. The required packages are listed in the requirements.txt file and reproduced below:

numpy~=1.22
pandas~=2.0.2
bio~=1.5.9
scikit-learn~=1.1.1
plotly~=5.4.0
Django~=3.1.2
django-plotly-dash~=1.4.2
channels~=2.4.0
channels-redis~=3.1.0
django-crispy-forms~=1.9.2
django-redis~=4.12.1
daphne~=2.5.0
redis~=3.5.3
psutil~=5.9.2
kaleido~=0.2.0
  4. You might also need to install plotly-orca, which is used to write and save static plotly images locally. To install it with conda, use the following command (or see this link for other alternatives):
conda install -c plotly plotly-orca==1.2.1 psutil requests

To install conda, you can follow instructions in this link. You might need to add a path to conda if it is not in your PATH.

  5. Run the website with:
python3 manage.py makemigrations
python3 manage.py migrate
python3 manage.py runserver
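
If the server fails to start, a version mismatch in the virtual environment is one possible cause. The following is a minimal sketch, not part of the repository, that checks the Python version and a few of the pins listed above. The names PINS, parse, is_compatible and check_environment are illustrative, and the "~=" comparison is a simplified approximation of the compatible-release rule that pip applies:

```python
import sys
from importlib import metadata

# A subset of the pins from requirements.txt, copied from the list above.
PINS = {
    "numpy": "1.22",
    "pandas": "2.0.2",
    "scikit-learn": "1.1.1",
    "Django": "3.1.2",
}


def parse(version):
    """Turn '3.1.2' into (3, 1, 2); stop at the first piece with no leading digits."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def is_compatible(installed, pin):
    """Approximate '~=pin': same leading components, and at least the pinned version."""
    have, want = parse(installed), parse(pin)
    prefix = want[:-1]  # every pinned component except the last must match exactly
    return have[:len(prefix)] == prefix and have >= want


def check_environment(pins=PINS):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] != (3, 8):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python 3.8 expected, found {found}")
    for name, pin in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if not is_compatible(installed, pin):
            problems.append(f"{name} {installed} does not satisfy ~={pin}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the checked pins"]:
        print(line)
```

Run it inside the activated virtual environment. It only reports what it finds and does not install or change anything.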

Notes:

  • Although the web app supports all browsers, we recommend using Google Chrome, because the interface and functionality can differ between browsers.
  • Sometimes when running python3 manage.py makemigrations, you might get the following warnings and prompt:
The dash_core_components package is deprecated. Please replace
`import dash_core_components as dcc` with `from dash import dcc`

The dash_html_components package is deprecated. Please replace
`import dash_html_components as html` with `from dash import html`

You are trying to add the field 'create_date' with 'auto_now_add=True' to fileinfo without a default; the database needs something to populate existing rows.

 1) Provide a one-off default now (will be set on all existing rows)
 2) Quit, and let me add a default in models.py
Select an option:

If this happens, select option 1 and then press 'Enter' after the message:

Please enter the default value now, as valid Python
You can accept the default 'timezone.now' by pressing 'Enter' or you can provide another value.
The datetime and django.utils.timezone modules are available, so you can do e.g. timezone.now
Type 'exit' to exit this prompt
[default: timezone.now] >>>

Steps to run the unit tests of this website locally

  1. Make sure you are in a virtual environment:
source virtual-env/bin/activate
  2. Install selenium:
pip3 install selenium
  3. Download webdriver and move it into the git root directory 'bioklustering/'
  4. Run the following command:
python3 manage.py test
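
Before running the tests, it can help to confirm that a driver executable is where the steps above put it. The sketch below is an assumption-laden helper, not part of the repository: the README does not say which driver is required, so the names chromedriver and geckodriver are guesses based on common browsers, and find_webdriver is a hypothetical function name.

```python
import shutil
from pathlib import Path

# Assumed driver names; which one you need depends on the browser you test with.
DRIVER_NAMES = ("chromedriver", "geckodriver")


def find_webdriver(repo_root, names=DRIVER_NAMES):
    """Return the first driver found in repo_root or on PATH, or None if none is found."""
    root = Path(repo_root)
    for name in names:
        # Look in the repository root first, with and without a Windows extension.
        for candidate in (root / name, root / f"{name}.exe"):
            if candidate.is_file():
                return str(candidate)
        # Fall back to a PATH lookup.
        on_path = shutil.which(name)
        if on_path:
            return on_path
    return None


if __name__ == "__main__":
    # Assumes the script is run from the git root directory 'bioklustering/'.
    print(find_webdriver(".") or "no webdriver found")
```

A result of None suggests that step 3 has not been completed, or that the driver has a name this sketch does not check.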

Contributions

Users interested in expanding the functionality of BioKlustering are welcome to do so. See details on how to contribute in CONTRIBUTING.md.

License

BioKlustering is licensed under the MIT License. © SolisLemus lab projects (2020)

Citation

If you use the BioKlustering website in your work, we ask that you cite the following paper:

@ARTICLE{Ozminkowski2022-bw,
  title         = "{BioKlustering}: a web app for semi-supervised learning of
                   maximally imbalanced genomic data",
  author        = "Ozminkowski, Samuel and Wu, Yuke and Yang, Liule and Xu,
                   Zhiwen and Selberg, Luke and Huang, Chunrong and
                   Solis-Lemus, Claudia",
  month         =  sep,
  year          =  2022,
  archivePrefix = "arXiv",
  primaryClass  = "q-bio.GN",
  eprint        = "2209.11730"
}

Feedback, issues and questions