Demultiplexing Illumina sequencing reads

Getting Started


This script was tested with Python 3.6 only and requires the following two modules. Using conda to create a dedicated environment for demx is recommended:

  • xopen (=0.9.0)
  • python-levenshtein (=0.12.0)
$ conda create -n demx python=3.6 xopen=0.9.0 python-levenshtein=0.12.0


No need to install the script. Just run the script directly.

  1. Clone this repo to your local machine
$ git clone
  2. It works if you see the following help message:
$ cd hiseq_demx
$ python -h

usage: [-h] -1 FQ1 [-2 FQ2] -o OUTDIR -s INDEX_CSV [--demo]
               [-m MISMATCH] [-x {1,2}] [-l BARCODE_N_LEFT]
               [-r BARCODE_N_RIGHT] [-p THREADS] [-j PARALLEL_JOBS] [-w]


optional arguments:
  -h, --help            show this help message and exit
  -1 FQ1, --fq1 FQ1     read1 in fastq format, gzipped
  -2 FQ2, --fq2 FQ2     read2 in fastq format, gzipped, (optional)
  -o OUTDIR, --outdir OUTDIR
                        directory to save the reulsts
  -s INDEX_CSV, --index-csv INDEX_CSV
                        index list in csv format,
  --demo                run demo (1M reads) for demostration, default: off
  -m MISMATCH, --mismatch MISMATCH
                        mismatches allowed to search index, default: [0]
  -x {1,2}, --barcode-in-read {1,2}
                        barcode in the 5' end of, 1:read1 or 2:read2, default:
  -l BARCODE_N_LEFT, --barcode-n-left BARCODE_N_LEFT
                        bases locate on the left of barcode
  -r BARCODE_N_RIGHT, --barcode-n-right BARCODE_N_RIGHT
                        bases locate on the right of barcode
  -p THREADS, --threads THREADS
                        number of threads, default: [1]
  -j PARALLEL_JOBS, --parallel-jobs PARALLEL_JOBS
                        number of josb run in parallel, default: [1]
  -w, --overwrite       Overwrite exists files, default: off

Running the tests

It supports fastq file in the following format:

  • P7, (P5, optional) index saved in name line (1st line)

@ST-E00318:957:H7VYVCCX2:6:1101:27143:1538 1:N:0:CAGATCAT CGATCTCG

  • inline-barcode located at the beginning of read1 (or read2), (5' end)
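For the first format, the index can be read from the name line. The following is a minimal sketch, not the script's own code: the function name `parse_index` is hypothetical, and it assumes the index is the last colon-separated field of the comment, with P7 and P5 separated by whitespace (as in the example above) or by `+`.

```python
import re


def parse_index(name_line):
    """Return (p7, p5) from a fastq name line; p5 is None if absent.

    Assumption: the comment looks like '1:N:0:<P7>[ <P5>]' or '1:N:0:<P7>+<P5>'.
    """
    fields = name_line.rstrip().split(None, 1)
    if len(fields) < 2:
        return None, None  # no comment part, so no index
    parts = fields[1].split(":", 3)
    if len(parts) < 4 or not parts[3].strip():
        return None, None  # comment does not carry an index field
    indexes = re.split(r"[+\s]+", parts[3].strip())
    p7 = indexes[0]
    p5 = indexes[1] if len(indexes) > 1 else None
    return p7, p5


header = "@ST-E00318:957:H7VYVCCX2:6:1101:27143:1538 1:N:0:CAGATCAT CGATCTCG"
p7, p5 = parse_index(header)
```

Here `p7` is the first index in the field and `p5` the second, if present.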


P7 index

In paired-end mode, only the P7 index from read1 is checked.

$ cd hiseq_demx/test
$ python ../ -1 idx_1.fq.gz -2 idx_2.fq.gz -o results/p7/pe -s info_idx.csv


RT ----------------------------Demx Report: BEGIN----------------------------
RT num filename                                                count  percent
RT   1 sample1                                                   536   53.60%
RT   2 sample2                                                   149   14.90%
RT   3 undemx                                                    315   31.50%
RT     sum                                                      1000  100.00%
RT -----------------------------Demx Report: END-----------------------------

-1 : path to read1 file
-2 : path to read2 file
-o : path to directory, saving the results
-s : path to sample_info file (CSV)
-x 1 : barcode located in read1
-l 2 : 2 bp on the left of barcode
-r 3 : 3 bp on the right of barcode
-m 0 : Number of mismatches allowed, for searching barcode
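To illustrate what the mismatch setting means, here is a minimal sketch that assigns an observed index to a sample when it differs from a known index by at most `max_mismatch` positions. It is an assumption-laden illustration, not the script's implementation: it uses a plain Hamming distance (the script depends on python-levenshtein and may measure distance differently), the names `hamming` and `assign_sample` are hypothetical, and sending ambiguous hits to `undemx` is an assumed policy.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(observed, index_to_sample, max_mismatch=0):
    """Return the sample name for an observed index, or 'undemx'.

    Assumption: a read is assigned only if exactly one known index
    is within max_mismatch of the observed index.
    """
    hits = [
        name
        for index, name in index_to_sample.items()
        if len(index) == len(observed) and hamming(index, observed) <= max_mismatch
    ]
    return hits[0] if len(hits) == 1 else "undemx"


# Hypothetical index table, for illustration only
known = {"CAGATCAT": "sample1", "TTAGGCAT": "sample2"}
exact = assign_sample("CAGATCAT", known, max_mismatch=0)
one_off = assign_sample("CAGATCAA", known, max_mismatch=1)
rejected = assign_sample("CAGATCAA", known, max_mismatch=0)
```

With `max_mismatch=0` only exact matches are assigned; raising it trades fewer unassigned reads for a higher risk of assigning a read to the wrong sample.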

Inline Barcode

In this example (iCLIP reads), the barcode is located at the 5' end of read1, in the following format: 5'-NNN{4bp}NN---, so the following arguments are required:

$ python ../ -1 iclip_1.fq.gz -2 iclip_2.fq.gz -o results/bc/pe -s info_iclip.csv -x 1 -l 3 -r 2 -m 0 


RT ----------------------------Demx Report: BEGIN----------------------------
RT num filename                                                count  percent
RT   1 sample1                                                   200   40.00%
RT   2 sample2                                                   100   20.00%
RT   3 undemx                                                    200   40.00%
RT     sum                                                       500  100.00%
RT -----------------------------Demx Report: END-----------------------------

-1 : path to read1 file
-2 : path to read2 file
-o : path to directory, saving the results
-s : path to sample_info file (CSV)
-x 1 : barcode located in read1
-l 3 : 3 bp on the left of barcode
-r 2 : 2 bp on the right of barcode
-m 0 : Number of mismatches allowed, for searching barcode
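The read layout used in this example (3 bases on the left, a 4 bp barcode, 2 bases on the right) can be sketched as simple string slicing. This is a minimal illustration under stated assumptions: `split_inline_barcode` is a hypothetical helper, the barcode length is taken as known (for instance from the barcode column of the sample sheet), and whether the script itself removes these bases from the read is not stated here.

```python
def split_inline_barcode(seq, n_left, barcode_len, n_right):
    """Split a read that starts with N{n_left} + barcode + N{n_right}.

    Returns (barcode, flanking_bases, remainder).
    """
    end = n_left + barcode_len + n_right
    if len(seq) < end:
        raise ValueError("read is shorter than the barcode region")
    left = seq[:n_left]
    barcode = seq[n_left:n_left + barcode_len]
    right = seq[n_left + barcode_len:end]
    return barcode, left + right, seq[end:]


# Made-up read: 'ACG' + 'TTTT' + 'GC' + insert
read = "ACG" + "TTTT" + "GC" + "AAAAAAAA"
barcode, flanks, insert = split_inline_barcode(read, n_left=3, barcode_len=4, n_right=2)
```

The values `n_left=3` and `n_right=2` correspond to `-l 3 -r 2` in the command above.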

Barcode splitting with fastx_toolkit (single-end reads only)

See the fastx_toolkit documentation.

In this example the barcode has the format NNN{4nt}NN, and the splitter only supports reading the barcode from the beginning of the read (regardless of partial matching).

So we can trim the first 3 bp from each read and then run the splitter:

$ zcat iclip_1.fq.gz | fastx_trimmer -f 4 | --bcfile bc.txt --bol --mismatches 0 --prefix aaaaaa. --suffix .fq

Barcode Count   Location
sample1 100     aaaaaa.sample1.fq
sample2 100     aaaaaa.sample2.fq
unmatched       100     aaaaaa.unmatched.fq
total   300

Both P7 index and Inline Barcode

In this example, the inline barcode (eCLIP-like reads) is located at the 5' end of read2, in the following format: 5'{6bp}---, so the following arguments are required:

$ python ../ -1 idx_eclip_1.fq.gz -2 idx_eclip_2.fq.gz -o results/p7_bc/pe -s info_idx_eclip.csv -x 2 -l 0 -r 1 -m 0 


RT ----------------------------Demx Report: BEGIN----------------------------
RT num filename                                                count  percent
RT   1 sample1                                                   100   20.00%
RT   2 sample2                                                   100   20.00%
RT   3 sample3                                                   100   20.00%
RT   4 sample4                                                   100   20.00%
RT   5 undemx                                                    100   20.00%
RT     sum                                                       500  100.00%
RT -----------------------------Demx Report: END-----------------------------

-1 : path to read1 file
-2 : path to read2 file
-o : path to directory, saving the results
-s : path to sample_info file (CSV)
-x 2 : barcode located in read2
-l 0 : 0 bp on the left of barcode
-r 1 : 1 bp on the right of barcode
-m 0 : Number of mismatches allowed, for searching barcode
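When both a P7 index and an inline barcode identify a sample, assignment can be thought of as a lookup on the pair. The sketch below is only an illustration of that idea with exact matching: the names `build_lookup` and `assign_by_pair` and all sequences are hypothetical, and the script's actual order of operations is not described here.

```python
def build_lookup(rows):
    """Map (p7_index, barcode) pairs to sample names."""
    return {(p7, barcode): name for name, p7, barcode in rows}


def assign_by_pair(p7, barcode, lookup):
    """Return the sample for this (P7, barcode) pair, or 'undemx'."""
    return lookup.get((p7, barcode), "undemx")


# Hypothetical sample sheet: two P7 indexes, each combined with two barcodes
rows = [
    ("sample1", "CAGATCAT", "AAGCTA"),
    ("sample2", "CAGATCAT", "GGTCAT"),
    ("sample3", "TTAGGCAT", "AAGCTA"),
    ("sample4", "TTAGGCAT", "GGTCAT"),
]
lookup = build_lookup(rows)
hit = assign_by_pair("TTAGGCAT", "AAGCTA", lookup)
miss = assign_by_pair("TTAGGCAT", "CCCCCC", lookup)
```

The same barcode can be reused under different P7 indexes because the pair, not the barcode alone, is the key.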

For your data

  • Prepare a sample_info.csv file with the following columns:

sample_name, P7_index, P5_index, barcode
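Before running the script, it can help to check that the file parses into four columns. The following is a minimal sketch using Python's `csv` module. It assumes one sample per line with no header row, and it does not know how the script expects unused columns to be filled; the function name `read_sample_info` and the example rows are hypothetical.

```python
import csv
import io


def read_sample_info(handle):
    """Read rows of sample_name, P7_index, P5_index, barcode.

    Assumptions: no header row; blank lines are skipped.
    """
    samples = []
    for line_no, row in enumerate(csv.reader(handle, skipinitialspace=True), 1):
        if not row:
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(row)}")
        name, p7, p5, barcode = (field.strip() for field in row)
        samples.append({"sample_name": name, "P7_index": p7,
                        "P5_index": p5, "barcode": barcode})
    return samples


# Made-up content, for illustration only
text = "sample1, CAGATCAT, CGATCTCG, AAGC\nsample2, TTAGGCAT, CGATCTCG, GGTC\n"
samples = read_sample_info(io.StringIO(text))
```

For a real file, pass an open file handle, for example `open("sample_info.csv", newline="")`, instead of the `io.StringIO` object.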



  • Ming Wang : wangmcas{AT}


This project is licensed under the MIT License - see the file for details


  • readfq - A python reader for fastq file, by Heng Li

