-
Notifications
You must be signed in to change notification settings - Fork 0
EndHic is a fast and easy-to-use Hi-C scaffolding tool, using the Hi-C links from contig end regions instead of whole contig regions to assemble large contigs into chromosomal-level scaffolds.
kingforest93/EndHiC_2
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
EndHic is a fast and easy-to-use Hi-C scaffolding tool, using the Hi-C links from contig end regions instead of whole contig regions to assemble large contigs into chromosomal-level scaffolds. EndHiC takes the HiC-pro"s bin matrix results as input data. After running HiC-pro, the recommended EndHiC usage for most users is to run endhic.pl or endhic_iterate.pl. When your contig assembly is quite good, then endhic.pl [one round of EndHiC] is able to finish the job; When your contig assembly is relatively fragmental, then endhic_iterate.pl [multiple rounds of EndHiC] should be used. How many rounds is needed depend on the contig assembly level, and more fragmental contigs need higher rounds of EndHiC. Two example data are included in EndHiC package: git clone [email protected]:fanagislab/EndHiC.git cd EndHiC/z.testing_data/Arabidopsis_thalina sh work.sh cd EndHiC/z.testing_data/Cichorium_intybus sh work.sh Note: The Arabidopsis_thalina testing data shows the usage of endhic.pl on long-continuous contig assembly, while the Cichorium_intybus testing data shows the usage of endhic_iterate.pl with relatively shorter contig assembly. An example data for detecting the assembly errors in the contigs is also included in EndHiC package: cd EndHiC/z.testing_data/Detect_errors_in_contigs sh work.sh Usage instructions: 1. Input files: (example, human hifiasm + hic-pro) (1) hifiasm.fa.len Includes two column: contig_id contig_length (2) humanHiC_100000_abs.bed Generated by Hic-pro, 100-kb bins, bed format file (3) humanHiC_100000.matrix Generated by Hic-pro, 100-kb bins, raw matrix file (4) humanHiC_100000_iced.matrix Generated by Hic-pro, 100-kb bins, normalized matrix file 2. EndHiC programs: (the ranks shows invoking relationship) endhic_iterate.pl endhic.pl endhic_ctgEnd_pipeline.pl ctgContact_from_ctgEndContacts.pl turningpoint_by_lineartransform.pl scaffold_by_trueCtgContact.pl cluster_and_classify_GFA.pl order_and_orient_GFA.pl (1) Basic usage: run endhic with specified contig end size and specified contact cutoff #calculate the HiC contact values among contigs, using Hi-C links data from fixed-size contig ends ctgContact_from_ctgEndContacts.pl --binsize 100000 --binnum 10 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix > humanHiC_100000.matrix.100000.10.CtgContact ##adjust the contig contacts, and perform linear transformation, to find the turning point turningpoint_by_lineartransform.pl humanHiC_100000.matrix.100000.10.CtgContact > humanHiC_100000.matrix.100000.10.CtgContact.adjustTransform 2> humanHiC_100000.matrix.100000.10.CtgContact.turningPoint #build contig graph by assigning links to contigs whose contact is larger than a given cutoff, and also satisfy reciprocal best requirement scaffold_by_trueCtgContact.pl --contacts 147.07 --reciprocalmax hifiasm.fa.len humanHiC_100000_iced.matrix.100000.10.CtgContact > humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa #Identify linear and circular topology in the contig graph cluster_and_classify_GFA.pl humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa > humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa.topology #Output cluster results with order and orientation information order_and_orient_GFA.pl --size 2000000 humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa.topology > humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa.cluster (2) Basic pipeline: #run endhic with specified contig end size, in various automatically determined contact cutoff, using Hic-pro raw matrix data endhic_ctgEnd_pipeline.pl --binsize 100000 --binnum 10 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix #run endhic with specified contig end size, in various automatically determined contact cutoff, using Hic-pro normalized matrix data endhic_ctgEnd_pipeline.pl --binsize 100000 --binnum 10 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000_iced.matrix (3) Standard pipeline: [run only one round of endhic.pl, when contig assembly is quite good] ##run endhic with various contig end size, in various automatically determined contact cutoff, using Hic-pro raw and normalized matrix data. At most cases, this can generate chromosome-level scaffolds endhic.pl hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix humanHiC_100000_iced.matrix (4) Iterative pipeline: [run multiple rounds of endhic.pl, when contig assembly is not so good] ##If a single run of endhic.pl can"t finish the scaffolding task, i.e. the number of resulting clusters is more than that of chromosomes, iterative running of endhic.pl is recommended. In each loop, the contig end size is increasing. In this way, the problems caused by the repeat sequences on the contig ends will be overcomed. Using default parameters of endhic_iterate.pl endhic_iterate.pl --rounds 3 --binnumstep 5 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix humanHiC_100000_iced.matrix For more shorter contigs, try to run more rounds with smaller increasing of contig end sizes endhic_iterate.pl --rounds 15 --binnumstep 1 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix humanHiC_100000_iced.matrix 3. EndHiC output sub-directory and files In 01.contig_end_contact_results/ (1) humanHiC_100000.matrix.*.CtgContact Contig contact file, with 7 columns (#CtgId1 CtgId2 EndContact Ctg1Pos Ctg2Pos UsedBinNum1 UsedBinNum2) (2) humanHiC_100000.matrix.*.CtgContact.adjustTransform Contig contact, adjusted, and linear transformed, to find the turning point (3) humanHiC_100000.matrix.*.CtgContact.turningPoint Automatically inferred turning point, which will be used as the basic value for the contig contact cutoff In 02.GFA_contig_graph_results/ (1) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.gfa Contig graph in GFA format, contact value over cutoff, can be viewed in Bandage software (2) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.reciprocalMax.gfa Contig graph in GFA format, contact value over cutoff, and satisfy reciprocal best, can be viewed in Bandage software In 03.cluster_order_orient_results/ (1) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.reciprocalMax.gfa.topology Topology of the contig graph, identify linear or circular groups (2) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.reciprocalMax.gfa.cluster Scaffold results, including cluster, order, and orientation information In 04.summary_and_merging_results/ (1) z.EndHiC.A.results.summary Summary and analysis results for the first loop, merging all the raw and iced results (2) z.EndHiC.A.results.summary.cluster [Final EndHiC Result] Final scaffold results with high robustness, merging all the raw and iced results This is recommeded to be the final endhic result. (3) z.EndHiC.A.results.summary.cluster.gfa GFA format of the final scaffold results, which can be graphically viewed in Bandage software Instruction of *.summary file: Part 1: Number of clusters under each condition Part 2: Statistics of all Cluster units Part 3: Statistics of merged Cluster units Part 4: Statistics of stable (high frequency) cluster units Part 5: Statistics of stable cluster units (redundant short contigs removed) Part 6: Included contigs, total number, total length Format of *.cluster file: In total 5 columns column 1: Cluster id, sorted by cluster length column 2: Number of contigs included in this cluster column 3: Cluster length, total length of contigs in this cluster column 4: robustness, i.e. appearance times in the results from various contig end sizes and contact cutoffs column 5: Included contigs with order and orientation, separated by ";", and "+-" means strands e.g. ptg000046l-;ptg000079l+;ptg000058l-;ptg000047l+ (equivalent to ptg000047l-;ptg000058l+;ptg000079l-;ptg000046l+) 4. Pre-EndHiC programs (1) draw HiC heatmap for contigs, helpful to find assembly errors in contigs, each point stands for window size 10*100000 = 1 M matrix2heatmap.py humanHiC_100000_abs.bed humanHiC_100000.matrix 10 (2) mapping the unitigs (p_utg) to contigs (p_ctg), only tested for Hifiasm contigs: perl ./map_utg_to_ctg.pl human.p_ctg.noseq.gfa human.p_utg.noseq.gfa > human.utg_to_ctg.map 2> human.utg_to_ctg.map.gfa (3) Identify the assembly errors in contigs by Hi-C heatmaps and unitig breaks: ##use both the HiC data and the utg_to_ctg.map data perl asm_error_check.pl human_100000_abs.bed human_100000.matrix human.utg_to_ctg.map > human.assmebly_errors.position ##only use the HiC data, when the utg_to_ctg.map is not available perl asm_error_check.pl human_100000_abs.bed human_100000.matrix > human.assmebly_errors.position The result file contains 10 columns: Ctg_id contig id Error_loc assembly error position inferred from HiC data Inter_contact the count of HiC links between the two bins (500kb-apart) crossing the assembly error position Inter_cutoff the cutoff of Inter_contact, which is 1/10 of the median values from all the inter-bins (500kb-apart) Intra_contact1 the count of HiC links within the left bin crossing the assembly error position Intra_contact2 the count of HiC links within the right bin crossing the assembly error position Intra_cutoff the cutoff of Intra_contact1 and Intra_contact2, which is 1/10 of the median values of all the intra-bins Break_pos assembly error position inferred from unitig break point Utg1_id the left unitig crossing the assembly error position Utg2_id the right unitig crossing the assembly error position 5. Post-EndHiC programs (1) Convert to AGP and Fasta format files #convert cluster format file to AGP format file cluster2agp.pl z.EndHiC.A.results.summary.cluster hifiasm.fa.len > scaffolds.agp #convert AGP format file to Fasta format file agp2fasta.pl scaffolds.agp hifiasm.fa > scaffolds.fa (2) Draw Hi-C heatmaps for the EndHiC scaffolds #convert the contig bed file to cluster bed file cluster2bed.pl humanHiC_100000_abs.bed z.EndHiC.B.results.summary.cluster > clusterB_100000_abs.bed 2> clusterB.id.len #draw HiC heatmap for endhic scaffolds, each point represents window size 20*100000 = 2 M matrix2heatmap.py clusterB_100000_abs.bed humanHiC_100000.matrix 20 (3) Mapping unclustered short contigs to each cluster #calculate the HiC contact values among contigs, using Hi-C links data from half contig(i.e. max contig end size) perl ../../../ctgContact_from_ctgEndContacts.pl --binsize 100000 --binnum -1 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix > humanHiC_100000.matrix.halfContig.ctgContact ##normalize the contig contact by the used bin numbers, only keep the max contact from head vs head, head vs tail, tail vs head, tail vs tail comparisons ctgContact_normalize_distance.pl --normalize humanHiC_100000.matrix.halfContig.ctgContact > humanHiC_100000.matrix.halfContig.ctgContact.normalize #mapping the unclustered short contigs into Endhic clusters, with specified cutoff shortCtgs_to_cluster.pl --contact 1 --times 2 z.EndHiC.A.results.summary.cluster hifiasm.fa.len humanHiC_100000.matrix.halfContig.ctgContact.normalize > shortCtgs.mapped.to.clusters.list (4) Convert EndHiC result to juicebox compatible file formats, which can be viewed in Juicebox ##convert EndHiC .cluster file into juicebox .assembly file cluster_to_juciebox_assembly.pl contigs.fa.len z.EndHiC.A.results.summary.cluster > draft.assembly ##index the contig sequence file bwa index draft.fa ##generate the enzyme cutting sites file draft_MboI.txt juicer/misc/generate_site_positions.py MboI draft draft.fa ##generate the Hi-C reads alignment file aligned/merged_nodups.txt ##prepare data: put the HiC reads under ./fastq/; put the contig sequence file and index files under ./reference/; juicer/CPU/juicer.sh -S early -g draft -s MboI -z ./references/draft.fa -y ./draft_MboI.txt -p ./references/draft.fa.size -t 50 -D juicer/CPU ##generate the hic input file draft.hic for viewing in juicebox 3d-dna/visualize/run-assembly-visualizer.sh draft.assembly merged_nodups.txt For more instructions, please refer to the help pages of juicer and juicebox. 6. Accuracy verifying programs (1) Run endhic using the max contact values from bin pairs of two compared contigs endhic_maxBin_pipeline.pl --binsize 1000000 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix (2) Apply Hierarchical clustering algorithm with contig distance converted from Hi-C contact values derived from contig end regions ##normalize the contig contact by the used bin numbers, and converted to distance values ranging from 0 to 1 ctgContact_normalize_distance.pl humanHiC_100000.matrix.100000.10.CtgContact > humanHiC_100000.matrix.100000.10.CtgContact.distance ##generate all the middle procedure results of the Hierarchical clustering algorithm hcluster_contigs.pl --verbose -type min humanHiC_100000.matrix.100000.10.CtgContact.distance hifiasm.fa.len > humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster.one 2> humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster ##Find the suitable stop loop, which represents correct chromosomes, by giving expected crhomosome number and minimum chromosome length cutoff hcluster_suitable_stop.pl --chr_num 23 --chr_len 20000000 humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster > humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster.need (3) Compare two clusters ##only compare the clustering information, not consider order and orientation information cluster_compare.pl human.contigs.minimap2.cluster z.EndHiC.A.results.summary.cluster > z.EndHiC.A.results.summary.cluster.vs.ref 7. Reference Sen Wang, Hengchao Wang, Fan Jiang, Anqi Wang, Hangwei Liu, Hanbo Zhao, Boyuan Yang, Dong Xu, Yan Zhang, Wei Fan. EndHiC: assemble large contigs into chromosomal-level scaffolds using the Hi-C links from contig ends. (2021) https://arxiv.org/abs/2111.15411v1 8. Contact any of these authors for help: Wei Fan, 0000-0001-5036-8733 [email protected] or [email protected] Sen Wang, 0000-0001-9793-4472 [email protected] Hengchao Wang, 0000-0002-8754-4195 [email protected] Fan Jiang, 0000-0003-1359-0970 [email protected] Yan Zhang, 0000-0003-2281-7807 [email protected]
About
EndHic is a fast and easy-to-use Hi-C scaffolding tool, using the Hi-C links from contig end regions instead of whole contig regions to assemble large contigs into chromosomal-level scaffolds.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published