Assembling haplotypes from NGS data for di- and polyploids

Overview

This protocol follows GATK best practices genotyping workflow and was employed for studying phylogeny of di- and polyploid Fragaria species using multi-locus sequence data obtained using Hyb-Seq approach (Kamneva et al. In revision) but can be applied to any group of organisms. It was developed on Mac OS but will probably work just as well in any Unix based environment.
In short, the protocol includes iterative procedire for obtaining individual specific reference and genotypes from sequencing reads, genotypes and mapped reads from the second iteration of the pipeline are processed through HapCompass to obtain phased varinats.

This tutorial is still work in progress. Please feel free to contact me if you have questions, suggestions, need help or something does not work.
Custom scripts supplied with the protocol are not guarantied to work. Here is the license statement [LICENSE].

Objectives

  • Assemble individual haplotype sequences from a mixture of reads generated using several templates at once as happens when sequenced DNA is extracted from di- or polyploid sample.
  • Use bwa, GATK and HapCompass on F. vesca (diploid) and F. corimbosa (tetraploid) NGS data.
  • Sample data

  • Single- and paired-end read from about 257 loci from F. vesca, diploid, (SRR5275181, SRR5275182) and F. corimbosa, tetraploid, (SRR5275235, SRR5275236).
  • Genomic regions of F. vesca reference genome [straw-257.FvHaw4_reference.genes.fa] which will be used as initial references in assembly.
  • Programs

    Make sure that you have all of those programs installed on your computer, working and their location is added to the path. However, some programs, for instance, Trimmomatic, will be run by setting environmental variables which point to the tool, rather than having them in the path:
  • Sratoolkit
  • Trimmomatic
  • Bwa
  • SAMtools
  • GATK, Some slides
  • HapCompas
  • Also please download a tar.gz haplotype_assembly_scripts.tar.gz file and unpack it. The archive contains folder with following custom scripts, these will be accessed through their location in this tutorial, but they can be copied into one of the path directoies, eliminating need for specifying their location:
  • Utility script 1 [util_1.pl].
  • Utility script 2 [util_genotype_filter_1.pl].
  • Utility script 3 that allows to summarize site-by-site information about read coverage [util_assess_coverage_by_position.pl].
  • Script for one iteration of the pipeline [get_variants.sh], if you would like to skip this tutorial, you can just run this script 2 times, first with straw-257.FvHaw4_reference.genes.fa as reference file, then with individual specific reference generated in first iteration and then jump directly to haplotype phasing using assembly section.
  • Main steps of the protocol

  • Download reads
  • Cleaning up raw reads
  • Reference sequence preparation
  • Generating individual-specific reference
  • Obtain information on sequencing coverage
  • Final alignment and genotypes
  • Haplotype phasing using assembly
  • Download reads

  • Download reads for two individuals from SRA using fastq-dump utility from sratoolkit:
    fastq-dump SRR5275181
    fastq-dump --split-files SRR5275182
    fastq-dump --split-files SRR5275235
    fastq-dump --split-files SRR5275236
  • Cleaning up raw reads

  • Clean up the reads with Trimmomatic
    Set environmental variable specifying path to Trimmomatic. Use location relevant to your computer:
    export PATH_TRIMMOMATIC=~/Downloads/Trimmomatic-0.32/
    F. vesca, single-end reads:
    java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 -trimlog SRR5275181.trimlog SRR5275181.fastq SRR5275181_trimmed.fastq LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36 ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-SE.fa:2:30:10

    F. vesca, paired-end reads:

    java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 -trimlog SRR5275182.trimlog SRR5275182_1.fastq SRR5275182_2.fastq SRR5275182.1P SRR5275182.1U SRR5275182.2P SRR5275182.2U LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36 ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10

    F. corimbosa, paired-end reads, both libraries:

    java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 -trimlog SRR5275235.trimlog SRR5275235_1.fastq SRR5275235_2.fastq SRR5275235.1P SRR5275235.1U SRR5275235.2P SRR5275235.2U LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36 ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10

    java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 -trimlog SRR5275236.trimlog SRR5275236_1.fastq SRR5275236_2.fastq SRR5275236.1P SRR5275236.1U SRR5275236.2P SRR5275236.2U LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36 ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10

  • Reference sequence preparation

  • Download following reference file straw-257.FvHaw4_reference.genes.fa
    It contains sequences for 257 F. vesca genes which were target-captured using hybridization as described in the paper.
    Index the reference with BWA:
    bwa index straw-257.FvHaw4_reference.genes.fa
  • Index it with SAMtools:
    samtools faidx straw-257.FvHaw4_reference.genes.fa
  • Create dictionary with PICARDtools
    Set environmental variable specifying path to PICARDtools. Use location relevant to your computer.
    Then run CreateSequenceDictionary.jar:

    export PATH_PICARD=~/Downloads/picard-tools-1.96/picard-tools-1.96/

    java -jar $PATH_PICARD/CreateSequenceDictionary.jar REFERENCE=straw-257.FvHaw4_reference.genes.fa OUTPUT=straw-257.FvHaw4_reference.genes.dict

  • Generating individual-specific reference

  • Align reads to reference with BWA

    F. vesca, single-end reads:

    bwa aln -f SRR5275181.sai straw-257.FvHaw4_reference.genes.fa SRR5275181_trimmed.fastq

    F. vesca, paired-end reads:

    bwa aln -f SRR5275182.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.1P
    bwa aln -f SRR5275182.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.2P
    bwa aln -f SRR5275182.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.1U
    bwa aln -f SRR5275182.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.2U

    F. corimbosa, paired-end reads, both libraries:

    bwa aln -f SRR5275235.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.1P
    bwa aln -f SRR5275235.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.2P
    bwa aln -f SRR5275235.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.1U
    bwa aln -f SRR5275235.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.2U

    bwa aln -f SRR5275236.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.1P
    bwa aln -f SRR5275236.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.2P
    bwa aln -f SRR5275236.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.1U
    bwa aln -f SRR5275236.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.2U

  • Convert to sam files using SAMtools, repetitive hits will be randomly chosen

    F. vesca, single-end reads:

    bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275181.sai SRR5275181_trimmed.fastq > SRR5275181.sam

    F. vesca, paired-end reads:

    bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275182.1P.sai SRR5275182.2P.sai SRR5275182.1P SRR5275182.2P > SRR5275182.sam

    bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275182.1U.sai SRR5275182.1U > SRR5275182.1U.sam

    bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275182.2U.sai SRR5275182.2U > SRR5275182.2U.sam

    F. corimbosa, paired-end reads, both libraries:

    bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275235.1P.sai SRR5275235.2P.sai SRR5275235.1P SRR5275235.2P > SRR5275235.sam

    bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275235.1U.sai SRR5275235.1U > SRR5275235.1U.sam

    bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275235.2U.sai SRR5275235.2U > SRR5275235.2U.sam

    bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275236.1P.sai SRR5275236.2P.sai SRR5275236.1P SRR5275236.2P > SRR5275236.sam

    bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275236.1U.sai SRR5275236.1U > SRR5275236.1U.sam

    bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275236.2U.sai SRR5275236.2U > SRR5275236.2U.sam

  • Use SAMtools to convert sam files to bam files, sort and index resulting bam files

    F. vesca, single-end reads:

    samtools view -b -S -o SRR5275181.bam SRR5275181.sam
    samtools sort SRR5275181.bam SRR5275181.sorted
    samtools index SRR5275181.sorted.bam

    F. vesca, paired-end reads:

    samtools view -b -S -o SRR5275182.bam SRR5275182.sam
    samtools sort SRR5275182.bam SRR5275182.sorted
    samtools index SRR5275182.sorted.bam

    samtools view -b -S -o SRR5275182.1U.bam SRR5275182.1U.sam
    samtools sort SRR5275182.1U.bam SRR5275182.1U.sorted
    samtools index SRR5275182.1U.sorted.bam

    samtools view -b -S -o SRR5275182.2U.bam SRR5275182.2U.sam
    samtools sort SRR5275182.2U.bam SRR5275182.2U.sorted
    samtools index SRR5275182.2U.sorted.bam

    F. corimbosa, paired-end reads, both libraries:

    samtools view -b -S -o SRR5275235.bam SRR5275235.sam
    samtools sort SRR5275235.bam SRR5275235.sorted
    samtools index SRR5275235.sorted.bam

    samtools view -b -S -o SRR5275235.1U.bam SRR5275235.1U.sam
    samtools sort SRR5275235.1U.bam SRR5275235.1U.sorted
    samtools index SRR5275235.1U.sorted.bam

    samtools view -b -S -o SRR5275235.2U.bam SRR5275235.2U.sam
    samtools sort SRR5275235.2U.bam SRR5275235.2U.sorted
    samtools index SRR5275235.2U.sorted.bam

    samtools view -b -S -o SRR5275236.bam SRR5275236.sam
    samtools sort SRR5275236.bam SRR5275236.sorted
    samtools index SRR5275236.sorted.bam

    samtools view -b -S -o SRR5275236.1U.bam SRR5275236.1U.sam
    samtools sort SRR5275236.1U.bam SRR5275236.1U.sorted
    samtools index SRR5275236.1U.sorted.bam

    samtools view -b -S -o SRR5275236.2U.bam SRR5275236.2U.sam
    samtools sort SRR5275236.2U.bam SRR5275236.2U.sorted
    samtools index SRR5275236.2U.sorted.bam

  • Merge alignments from different libraries with SAMtools, sort and index merged files

    F. vesca:

    samtools merge -u vesca.merge.bam SRR5275181.sorted.bam SRR5275182.sorted.bam SRR5275182.1U.sorted.bam SRR5275182.2U.sorted.bam

    samtools sort vesca.merge.bam vesca.merge.sorted
    samtools index vesca.merge.sorted.bam

    F. corimbosa:

    samtools merge -u corimbosa.merge.bam SRR5275235.sorted.bam SRR5275235.1U.sorted.bam SRR5275235.2U.sorted.bam SRR5275236.sorted.bam SRR5275236.1U.sorted.bam SRR5275236.2U.sorted.bam

    samtools sort corimbosa.merge.bam corimbosa.merge.sorted
    samtools index corimbosa.merge.sorted.bam

  • Remove PCR duplicates with SAMtools and index new files
    If coordinates of two reads in the mapping are identical, duplicated reads are considered to be a PCR artifact and are removed from alignment.

    F. vesca and F. corimbosa:

    samtools rmdup -sS vesca.merge.sorted.bam vesca.dedup.bam
    samtools index vesca.dedup.bam

    samtools rmdup -sS corimbosa.merge.sorted.bam corimbosa.dedup.bam
    samtools index corimbosa.dedup.bam

  • Find indels with GATK, RealignerTargetCreator, and realign reads around indels with GATK, IndelRealigner
    Set path to GATK:
    export PATH_GATK=~/Downloads/GenomeAnalysisTK-2.6-5-gba531bd/
    Run GATK for both individuals:

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R straw-257.FvHaw4_reference.genes.fa -I vesca.dedup.bam -o vesca.realigner.intervals

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T IndelRealigner -R straw-257.FvHaw4_reference.genes.fa -I vesca.dedup.bam -targetIntervals vesca.realigner.intervals -o vesca.realigned.bam

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.dedup.bam -o corimbosa.realigner.intervals

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T IndelRealigner -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.dedup.bam -targetIntervals corimbosa.realigner.intervals -o corimbosa.realigned.bam

  • Call genotypes with GATK, UnifiedGenotyper
    GATK can call genotypes for samples of varying ploidy so it is suitable for analysis of data from polyploid samples:

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -ploidy 2 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o vesca.realigned.vcf

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -ploidy 4 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o corimbosa.realigned.vcf

  • Recalibrate allignment with GATK and call genotypes again

    F. vesca :

    perl util_1.pl vesca.realigned.vcf vesca.realigned.edited.vcf
    perl util_genotype_filter_1.pl vesca.realigned.edited.vcf vesca.realigned.filterred.vcf
    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -knownSites vesca.realigned.filterred.vcf -o vesca.recal_data.table
    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T PrintReads -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -BQSR vesca.recal_data.table -o vesca.calibrated.bam
    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I vesca.calibrated.bam -ploidy 2 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o vesca.calibrated.vcf
    perl util_1.pl vesca.calibrated.vcf vesca.calibrated.edited.vcf
    perl util_genotype_filter_1.pl vesca.calibrated.edited.vcf vesca.calibrated.filterred.vcf

    F. corimbosa:

    perl util_1.pl corimbosa.realigned.vcf corimbosa.realigned.edited.vcf
    perl util_genotype_filter_1.pl corimbosa.realigned.edited.vcf corimbosa.realigned.filterred.vcf
    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -knownSites corimbosa.realigned.filterred.vcf -o corimbosa.recal_data.table
    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T PrintReads -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -BQSR corimbosa.recal_data.table -o corimbosa.calibrated.bam
    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.calibrated.bam -ploidy 4 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o corimbosa.calibrated.vcf
    perl util_1.pl corimbosa.calibrated.vcf corimbosa.calibrated.edited.vcf
    perl util_genotype_filter_1.pl corimbosa.calibrated.edited.vcf corimbosa.calibrated.filterred.vcf

  • Call consensus sequence with GATK, alternate reference maker
    This step will generate new organism-specific reference:.

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R straw-257.FvHaw4_reference.genes.fa -V vesca.calibrated.filterred.vcf -o vesca.new.ref.fa

    java -jar $PATH_GATK/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R straw-257.FvHaw4_reference.genes.fa -V corimbosa.calibrated.filterred.vcf -o corimbosa.new.ref.fa

  • Obtain information on sequencing coverage

  • Obtain site specific coverage informarion with SAMtools:

    samtools mpileup -uf vesca.new.ref.fa vesca.calibrated.bam > vesca.all_sites.calibrated.bcf
    bcftools view -cg vesca.all_sites.calibrated.bcf > vesca.all_sites.calibrated.vcf

    samtools mpileup -uf corimbosa.new.ref.fa corimbosa.calibrated.bam > corimbosa.all_sites.calibrated.bcf
    bcftools view -cg corimbosa.all_sites.calibrated.bcf > corimbosa.all_sites.calibrated.vcf

  • Final alignment and genotypes

  • Follow steps in reference sequence preparation section to prepare organism-specific references.
  • Follow steps in generating individual-specific reference section but use individual-specific reference sequences rather than original vesca assembli and collect vcf and bam files. These are final alignment and genotypes.
  • You should have already obtained information on sequencing coverage for the reference in use when preparing it. See end of Obtain information on sequencing coverage section.
  • Or run script get_variants.sh as described in the Example 1 and 2 in the header (make sure that environmental variables are set properly):

    ./get_variants.sh ./vesca.new.ref.fa SRR5275181 single ./ SRR5275182 paired ./ vesca 2 vesca_round2 > get_variants_vesca_out 2> get_variants_vesca_err

    ./get_variants.sh ./corimbosa.new.ref.fa SRR5275235 paired ./ SRR5275236 paired ./ corimbosa 4 corimbosa_round2 > get_variants_corimbosa_out 2> get_variants_corimbosa_err

    This script encapsilates entire iteration of the reference, genotype and coverage information generating process and therefore can just be run 2 times, once with straw-257.FvHaw4_reference.genes.fa as reference file, then with individual specific reference generated in first iteration and then go directly to next section haplotype phasing using assembly.
  • It is handy to summarize read coverage information across all of the individuals. To do this first prepare commans file which will run util_assess_coverage_by_position.pl script for each of the all_sites.calibrated.vcf files, supplaying threshould value of the coverage. In this case it is at least 4 reads:

    echo "perl util_assess_coverage_by_position.pl vesca_round2/vesca.all_sites.calibrated.vcf 4" > commands_coverage_by_position.sh

    echo "perl util_assess_coverage_by_position.pl corimbosa_round2/corimbosa.all_sites.calibrated.vcf 4" >> commands_coverage_by_position.sh

    Then run it to get coverage_by_position.txt which esentially
    chmod 777 commands_coverage_by_position.sh
    ./commands_coverage_by_position.sh > coverage_by_position.txt
    This will generate 3 column file: original vcf file name, gene name, string of As and Ns which simbolize acceptable and not acceptable read coverage for each position. Similar file will be used in the next walk-through.
  • Haplotype phasing using assembly

  • Run HapCompass
    Set path to HapCompass:
    export PATH_HAPCOMPASS=~/Downloads/hapcompass_v0.8.2/
    Run HapCompass for both individuals, one gene at a time, here giving example for genes 3 and 30:

    java -jar $PATH_HAPCOMPASS/hapcompass.jar --bam ./vesca_round2/vesca.calibrated.bam --vcf ./vesca_round2/vesca.calibrated.filterred.vcf -o ./vesca_round2/vesca_3.compas_out -p 2 -r "3" --debug

    java -jar $PATH_HAPCOMPASS/hapcompass.jar --bam ./corimbosa_round2/corimbosa.calibrated.bam --vcf ./corimbosa_round2/corimbosa.calibrated.filterred.vcf -o ./corimbosa_round2/corimbosa_30.compas_out -p 4 -r "30" --debug

    These will generate a number of files with prefix vesca_3.compas_out_ and corimbosa_30.compas_out_, the vesca_3.compas_out_MWER_solution.txt and corimbosa_30.compas_out_MWER_solution.txt contain haplotype assembly blocks. Please see program manual for file format and for infomation on other output files.
  • The end

  • This is the end of this turorial, please procees to phylogeny reconstruction using haplotype and consensus sequences from several Fragaria species walk-through to learn about the next step of the project.