Release datasets and data process for benchmark of DNA representation learning.

Panzy-18/dna_benchmark

Overview

This repo contains 6 DNA tasks and scripts for quick experiments. See the manuscript for full details.

Note: We strongly recommend that you first browse the overall structure of our code. If you have any questions, feel free to contact us.

Model Construction

In this framework, we support transformer-based and convolution-based models. You can change model hyper-parameters by modifying the model config. Examples are listed in ./config.

Benchmark

| Name | Objective | Input_length | Output | Main Metric |
| --- | --- | --- | --- | --- |
| promoter | Promoter | 500 | Probability, 1 | AUROC |
| methyl96 | Methyl Probability | 1 + 500 (flank) | Probability, 96 | SpearmanR |
| track7878 | TF/DNase/Histone | 200 + 800 (flank) | Probability, 7878 | AUPR |
| expression218 | Gene Expression | 200 + 40800 (flank) | log RPKM, 218 | SpearmanR |
| snp49 | Causal SNP | 200 + 800 (flank) | Probability | AUPR |
| mpra10 | SNP Effect | 600 | log variant expression | SpearmanR |

A more detailed description of each dataset is in ./data/$dataset/metadata.json. The preprocessing pipeline is in ./preprocess.

Usage

Environment

pip install torch
pip install -r requirements.txt

Pretrained Model and Dataset

Download pre-trained models from the following links.

All the datasets are processed from open resources. Download and preprocessing scripts are listed in ./preprocess. Run the scripts to generate the data in your local environment. You can also download the data from this link.

The promoter dataset is an example. It requires hg38.fa in ./data/genome.

Run

Run an experiment with the default settings:

python run_task.py --dataset-dir [directory_in_data_root] \
	--save-dir [directory_to_save_experiment] 

For example, you can train a promoter model from scratch with:

python run_task.py --dataset-dir promoter --save-dir experiment/promoter_default

Run python run_task.py -h to see all the arguments. More examples are in the ./scripts folder.
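To run the default setting on every benchmark, the command above can be generated once per dataset. The sketch below only builds and prints the command lines. It assumes that each dataset directory is named after the Name column of the Benchmark table (only promoter is confirmed by the example above) and reuses the experiment/<name>_default save path purely as a naming convention.

```python
import shlex

# Names from the Benchmark table; assumed to match the directories in the data root.
DATASETS = ["promoter", "methyl96", "track7878", "expression218", "snp49", "mpra10"]


def build_command(dataset):
    """Return the run_task.py command for one dataset as an argument list."""
    return [
        "python", "run_task.py",
        "--dataset-dir", dataset,
        "--save-dir", f"experiment/{dataset}_default",
    ]


commands = [build_command(name) for name in DATASETS]
for cmd in commands:
    # To actually launch a run, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Printing first makes it easy to check the paths before starting six training runs.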

Customize Dataset

We provide an easy-to-use pipeline for customized data. If you want to run experiments on your own dataset, organize your files as follows:

- data_root
    - customized_data
        - metadata.json
        - train.json
        - ...

In metadata.json, you must specify fields such as the following (see more examples in the ./data folder):

{
    "dataset_name": ...,
    "dataset_args": {
        "dataset_class": "DNATaskDataset",
        "ref_file": "genome/hg19.fa", # use null if no reference genome is needed
        "train_file": "customized_data/train.json", # pass train_file to have the trainer train the model
        "valid_file": ..., # pass valid_file to have the trainer evaluate the model after each training epoch
        ...
    },
    "model_args": {
        "task": "ModelForSequenceTask",
        "final_dim": ...,
        "loss_fn": {
            "name": ...,
        }
    },
    "metrics": ..., # the first metric is the main score used to save the best model
}
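Before starting a run, it can help to check that a metadata dictionary has the fields shown in the template. The helper below is a minimal sketch, not part of the repo: missing_fields is a hypothetical name, the set of checked keys is taken only from the template above, and the values marked as placeholders are not real option names. Note that the # annotations in the template are explanatory only, since standard JSON has no comment syntax.

```python
import json

REQUIRED_TOP_LEVEL = ("dataset_name", "dataset_args", "model_args", "metrics")


def missing_fields(metadata):
    """Return dotted names of template fields that are absent from metadata."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
    if "dataset_class" not in metadata.get("dataset_args", {}):
        missing.append("dataset_args.dataset_class")
    model_args = metadata.get("model_args", {})
    for key in ("task", "final_dim", "loss_fn"):
        if key not in model_args:
            missing.append(f"model_args.{key}")
    if "name" not in model_args.get("loss_fn", {}):
        missing.append("model_args.loss_fn.name")
    return missing


example = {
    "dataset_name": "customized_data",        # hypothetical name
    "dataset_args": {
        "dataset_class": "DNATaskDataset",
        "ref_file": None,                      # written to the file as JSON null
        "train_file": "customized_data/train.json",
    },
    "model_args": {
        "task": "ModelForSequenceTask",
        "final_dim": 2,                        # placeholder value
        "loss_fn": {"name": "<loss-name>"},    # placeholder, not a real loss name
    },
    "metrics": ["<metric-name>"],              # placeholder; a list is an assumption
}

text = json.dumps(example, indent=4)           # what would go into metadata.json
print(missing_fields(json.loads(text)))        # prints []
```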

We support data files in JSON or HDF5 format. In JSON, a sample is structured like:

{"sequence": "ATGGCTC", "label": [1, 0]} 
or 
{"index": ["chr1", 0, 7, "+"], "label": [1, 0]}
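The two JSON sample shapes can be told apart with a small check. This is a sketch under assumptions: sample_kind is a hypothetical helper, and the rule that a sample carries exactly one of sequence or index is our reading of the "or" above, not something the loader is confirmed to enforce.

```python
import json


def sample_kind(sample):
    """Return 'sequence' or 'index' depending on how the sample gives its input."""
    if "label" not in sample:
        raise ValueError("sample has no 'label' field")
    has_sequence = "sequence" in sample
    has_index = "index" in sample
    if has_sequence == has_index:
        raise ValueError("sample needs exactly one of 'sequence' or 'index'")
    if has_index and len(sample["index"]) != 4:
        raise ValueError("'index' should have four entries, as in the example")
    return "sequence" if has_sequence else "index"


by_sequence = json.loads('{"sequence": "ATGGCTC", "label": [1, 0]}')
by_index = json.loads('{"index": ["chr1", 0, 7, "+"], "label": [1, 0]}')
print(sample_kind(by_sequence))  # sequence
print(sample_kind(by_index))     # index
```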

In HDF5 (for storing huge datasets), a sample consists of two fields: index and label.

index: np.array([1, 0, 7, 1]) # (chr_num, start_pos, end_pos(exclusive), is_forward)
label: np.array([1, 0])
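To illustrate how such an index maps to a sequence, the sketch below slices a toy in-memory genome. It rests on two assumptions that the repo may handle differently: chromosome number 1 corresponds to the name chr1, and is_forward = 0 means the reverse complement of the slice is used. fetch_sequence and toy_genome are hypothetical names.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def fetch_sequence(genome, index):
    """Look up (chr_num, start_pos, end_pos, is_forward) in a name -> sequence dict."""
    chr_num, start, end, is_forward = (int(v) for v in index)
    seq = genome[f"chr{chr_num}"][start:end]  # end is exclusive, like a Python slice
    return seq if is_forward else reverse_complement(seq)


toy_genome = {"chr1": "ATGGCTCAAA"}  # stand-in for a reference such as hg19.fa
print(fetch_sequence(toy_genome, (1, 0, 7, 1)))  # ATGGCTC
print(fetch_sequence(toy_genome, (1, 0, 7, 0)))  # GAGCCAT
```

Because every entry is converted with int(), the same function accepts a NumPy array in place of the tuple.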

Visualize

We support two modes of visualization. You can get roll-out attention scores without any modification to the model; however, this method sometimes does not perform well (see ./visual_result/without_tscam).
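For readers unfamiliar with the roll-out score, here is a minimal sketch of the general attention rollout technique in plain Python. It is not the repo's implementation: it assumes each layer's attention has already been averaged over heads and mixes it equally with the identity matrix to account for residual connections, which is one common convention.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layer_attentions):
    """Combine per-layer attention maps (head-averaged) into a single map."""
    n = len(layer_attentions[0])
    # Start from the identity: before any layer, each token attends to itself.
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attentions:
        # Mix attention with the identity to model the residual connection.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


# Two toy layers over three tokens; each row sums to 1 like a softmax output.
layers = [
    [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
    [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
]
for row in attention_rollout(layers):
    print([round(v, 3) for v in row])
```

Each row of the result still sums to 1, so it can be read as how much each input position contributes to a given output position.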

We use TS-CAM to enhance visualization in transformer-based models. To take advantage of it, pass --tscam when training a model on a dataset. The model will then provide informative, class-specific visualization results. See the example code in visualize.py:

python visualize.py --load-dir experiment/promoter --save-dir visual_result/promoter

Prediction

Use the model.predict method to run inference for short-sequence tasks (<1024 bp) on your own data. For example:

# example for promoter detection.
# run by 'python $file --load-dir experiment/promoter'
from tools import get_config
from models import get_model

config = get_config()
model = get_model(config)
sequences = ['ATTCATCCAACTCTCCGTGAGCTCCCCTGGGTAGGAGTACAGTGGCAGCCAGTGTCCCCAGAAAACTGGCGCCTCCCCCCTCGCCGTGCGGGGCTAATTAACTCTTAGCCGGCGGGACCCTCCTCCTCCTCGGAGGTTGGCCAGGAGCAGCGCGGCATCCCAGGCGTTCCTGTCTGATGTCATAGGCTGCCGGCGATTGCGGAGAATCGCCACCACGCCTTTATGAAGGTCCCAACTTTGCCATCTGATACCCTTTACTACTGACAGGCGCTCAGCCAATCAGGAGCGGCGAGCGGGGTCTGGGGACCCGGAGCCGCCGAAGCCGTCTCGGGAACCGGCTCTTAACTCTTTGCGGCGGGCCCCGCAGCCGCCGAGGCACAGAGGGCGGGAGCAGGGCCAGGGGTCGGGAATCTGGGAGAGGGGCGCGAGCTAAAGAGCGGATGCCCGGAGGAAAGAAGGAAGGGCTGCGACGCCGCGGGGCTTGCAGGTGGTTCGCGGGG',
            'ATGAAATACACATAAAAAACACACACATTAAATATTAATATATGCTTATTATTGTATTATGAATGAGGAAATAAAATATAACTTGGAATTTTTTTAAAACTTAAAAAAATACAATGGACTGAGCACTGAAATCAGAATATGCAGCTTATTTAGAACAAAATTCTACTTTTTCCCCTAAACTGTCCCTTAACATTGTCATCTCTCCTGCTAATCCTGCATTACCCTGGATCCTTCCTTTTTGTCTCTGCCTCCACTCACTGCTGCCTCTGCCATAAGCCTTCATACTCCAGCTGCTACACACTGCTGCTTCTATCCCTGAGGATTCCACGAGCATCCTTATTCTTCTGTCACTGATATGGTTCCTATTGGCATATCAAAAGTTATAGCCATATGAAGAAAAATCTAGGGATGCAGCAGCAGCAGCAGCAGTAGCAGTAGCAGCAACAGTCTATCAAGATGTTTTAATCTGGAATAAATTTCAGAATAGATCAATTCAGCAT'
            ]  # two positive samples
model_output = model.predict(sequences)
print(model_output.logits_or_values)
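Since model.predict is meant for sequences shorter than 1024 bp, it can be useful to check inputs first. The helper below is a sketch: check_sequences is a hypothetical name, and the accepted alphabet (A, C, G, T, N in either case) is an assumption about what the tokenizer handles.

```python
VALID_BASES = set("ACGTN")


def check_sequences(sequences, max_len=1024):
    """Raise ValueError if a sequence is too long or has unexpected characters."""
    for i, seq in enumerate(sequences):
        if len(seq) >= max_len:
            raise ValueError(f"sequence {i} has {len(seq)} bp, expected < {max_len}")
        unexpected = set(seq.upper()) - VALID_BASES
        if unexpected:
            raise ValueError(f"sequence {i} contains {sorted(unexpected)}")
    return sequences


checked = check_sequences(["ATGGCTC", "acgtn"])
print(len(checked))  # 2
```

The checked list can then be passed to model.predict as in the example above.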

For the expression task, see the example in predict_expression.py.
