Assembling haplotypes from NGS data for di- and polyploids
Overview
This protocol follows GATK best practices genotyping workflow and was employed for studying phylogeny of di- and polyploid Fragaria species using multi-locus sequence data obtained using Hyb-Seq approach (Kamneva et al. In revision) but can be applied to any group of organisms. It was developed on Mac OS but will probably work just as well in any Unix based environment.In short, the protocol includes iterative procedire for obtaining individual specific reference and genotypes from sequencing reads, genotypes and mapped reads from the second iteration of the pipeline are processed through HapCompass to obtain phased varinats.
This tutorial is still work in progress. Please feel free to contact me if you have questions, suggestions, need help or something does not work.
Custom scripts supplied with the protocol are not guarantied to work. Here is the license statement [LICENSE].
Objectives
Sample data
Programs
Make sure that you have all of those programs installed on your computer, working and their location is added to the path. However, some programs, for instance, Trimmomatic, will be run by setting environmental variables which point to the tool, rather than having them in the path:Main steps of the protocol
Download reads
fastq-dump --split-files SRR5275182
fastq-dump --split-files SRR5275235
fastq-dump --split-files SRR5275236
Cleaning up raw reads
Set environmental variable specifying path to Trimmomatic. Use location relevant to your computer:
F. vesca, paired-end reads:
F. corimbosa, paired-end reads, both libraries:
java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 -trimlog SRR5275235.trimlog SRR5275235_1.fastq SRR5275235_2.fastq SRR5275235.1P SRR5275235.1U SRR5275235.2P SRR5275235.2U LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36 ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10
java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 -trimlog SRR5275236.trimlog SRR5275236_1.fastq SRR5275236_2.fastq SRR5275236.1P SRR5275236.1U SRR5275236.2P SRR5275236.2U LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36 ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10
Reference sequence preparation
It contains sequences for 257 F. vesca genes which were target-captured using hybridization as described in the paper.
Index the reference with BWA:
Set environmental variable specifying path to PICARDtools. Use location relevant to your computer.
Then run CreateSequenceDictionary.jar:
export PATH_PICARD=~/Downloads/picard-tools-1.96/picard-tools-1.96/
java -jar $PATH_PICARD/CreateSequenceDictionary.jar REFERENCE=straw-257.FvHaw4_reference.genes.fa OUTPUT=straw-257.FvHaw4_reference.genes.dict
Generating individual-specific reference
F. vesca, single-end reads:
F. vesca, paired-end reads:
bwa aln -f SRR5275182.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.1P
bwa aln -f SRR5275182.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.2P
bwa aln -f SRR5275182.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.1U
bwa aln -f SRR5275182.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.2U
F. corimbosa, paired-end reads, both libraries:
bwa aln -f SRR5275235.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.1P
bwa aln -f SRR5275235.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.2P
bwa aln -f SRR5275235.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.1U
bwa aln -f SRR5275235.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.2U
bwa aln -f SRR5275236.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.1P
bwa aln -f SRR5275236.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.2P
bwa aln -f SRR5275236.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.1U
bwa aln -f SRR5275236.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.2U
F. vesca, single-end reads:
F. vesca, paired-end reads:
bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275182.1P.sai SRR5275182.2P.sai SRR5275182.1P SRR5275182.2P > SRR5275182.sam
bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275182.1U.sai SRR5275182.1U > SRR5275182.1U.sam
bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275182.2U.sai SRR5275182.2U > SRR5275182.2U.sam
F. corimbosa, paired-end reads, both libraries:
bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275235.1P.sai SRR5275235.2P.sai SRR5275235.1P SRR5275235.2P > SRR5275235.sam
bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275235.1U.sai SRR5275235.1U > SRR5275235.1U.sam
bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275235.2U.sai SRR5275235.2U > SRR5275235.2U.sam
bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275236.1P.sai SRR5275236.2P.sai SRR5275236.1P SRR5275236.2P > SRR5275236.sam
bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275236.1U.sai SRR5275236.1U > SRR5275236.1U.sam
bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa SRR5275236.2U.sai SRR5275236.2U > SRR5275236.2U.sam
F. vesca, single-end reads:
samtools sort SRR5275181.bam SRR5275181.sorted
samtools index SRR5275181.sorted.bam
F. vesca, paired-end reads:
samtools view -b -S -o SRR5275182.bam SRR5275182.sam
samtools sort SRR5275182.bam SRR5275182.sorted
samtools index SRR5275182.sorted.bam
samtools view -b -S -o SRR5275182.1U.bam SRR5275182.1U.sam
samtools sort SRR5275182.1U.bam SRR5275182.1U.sorted
samtools index SRR5275182.1U.sorted.bam
samtools view -b -S -o SRR5275182.2U.bam SRR5275182.2U.sam
samtools sort SRR5275182.2U.bam SRR5275182.2U.sorted
samtools index SRR5275182.2U.sorted.bam
F. corimbosa, paired-end reads, both libraries:
samtools view -b -S -o SRR5275235.bam SRR5275235.sam
samtools sort SRR5275235.bam SRR5275235.sorted
samtools index SRR5275235.sorted.bam
samtools view -b -S -o SRR5275235.1U.bam SRR5275235.1U.sam
samtools sort SRR5275235.1U.bam SRR5275235.1U.sorted
samtools index SRR5275235.1U.sorted.bam
samtools view -b -S -o SRR5275235.2U.bam SRR5275235.2U.sam
samtools sort SRR5275235.2U.bam SRR5275235.2U.sorted
samtools index SRR5275235.2U.sorted.bam
samtools view -b -S -o SRR5275236.bam SRR5275236.sam
samtools sort SRR5275236.bam SRR5275236.sorted
samtools index SRR5275236.sorted.bam
samtools view -b -S -o SRR5275236.1U.bam SRR5275236.1U.sam
samtools sort SRR5275236.1U.bam SRR5275236.1U.sorted
samtools index SRR5275236.1U.sorted.bam
samtools view -b -S -o SRR5275236.2U.bam SRR5275236.2U.sam
samtools sort SRR5275236.2U.bam SRR5275236.2U.sorted
samtools index SRR5275236.2U.sorted.bam
F. vesca:
samtools merge -u vesca.merge.bam SRR5275181.sorted.bam SRR5275182.sorted.bam SRR5275182.1U.sorted.bam SRR5275182.2U.sorted.bam
samtools sort vesca.merge.bam vesca.merge.sorted
samtools index vesca.merge.sorted.bam
F. corimbosa:
samtools merge -u corimbosa.merge.bam SRR5275235.sorted.bam SRR5275235.1U.sorted.bam SRR5275235.2U.sorted.bam SRR5275236.sorted.bam SRR5275236.1U.sorted.bam SRR5275236.2U.sorted.bam
samtools sort corimbosa.merge.bam corimbosa.merge.sorted
samtools index corimbosa.merge.sorted.bam
If coordinates of two reads in the mapping are identical, duplicated reads are considered to be a PCR artifact and are removed from alignment.
F. vesca and F. corimbosa:
samtools rmdup -sS vesca.merge.sorted.bam vesca.dedup.bam
samtools index vesca.dedup.bam
samtools rmdup -sS corimbosa.merge.sorted.bam corimbosa.dedup.bam
samtools index corimbosa.dedup.bam
Set path to GATK:
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R straw-257.FvHaw4_reference.genes.fa -I vesca.dedup.bam -o vesca.realigner.intervals
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T IndelRealigner -R straw-257.FvHaw4_reference.genes.fa -I vesca.dedup.bam -targetIntervals vesca.realigner.intervals -o vesca.realigned.bam
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.dedup.bam -o corimbosa.realigner.intervals
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T IndelRealigner -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.dedup.bam -targetIntervals corimbosa.realigner.intervals -o corimbosa.realigned.bam
GATK can call genotypes for samples of varying ploidy so it is suitable for analysis of data from polyploid samples:
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -ploidy 2 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o vesca.realigned.vcf
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -ploidy 4 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o corimbosa.realigned.vcf
F. vesca :
perl util_1.pl vesca.realigned.vcf vesca.realigned.edited.vcf
perl util_genotype_filter_1.pl vesca.realigned.edited.vcf vesca.realigned.filterred.vcf
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -knownSites vesca.realigned.filterred.vcf -o vesca.recal_data.table
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T PrintReads -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -BQSR vesca.recal_data.table -o vesca.calibrated.bam
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I vesca.calibrated.bam -ploidy 2 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o vesca.calibrated.vcf
perl util_1.pl vesca.calibrated.vcf vesca.calibrated.edited.vcf
perl util_genotype_filter_1.pl vesca.calibrated.edited.vcf vesca.calibrated.filterred.vcf
F. corimbosa:
perl util_1.pl corimbosa.realigned.vcf corimbosa.realigned.edited.vcf
perl util_genotype_filter_1.pl corimbosa.realigned.edited.vcf corimbosa.realigned.filterred.vcf
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -knownSites corimbosa.realigned.filterred.vcf -o corimbosa.recal_data.table
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T PrintReads -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -BQSR corimbosa.recal_data.table -o corimbosa.calibrated.bam
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.calibrated.bam -ploidy 4 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o corimbosa.calibrated.vcf
perl util_1.pl corimbosa.calibrated.vcf corimbosa.calibrated.edited.vcf
perl util_genotype_filter_1.pl corimbosa.calibrated.edited.vcf corimbosa.calibrated.filterred.vcf
This step will generate new organism-specific reference:.
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R straw-257.FvHaw4_reference.genes.fa -V vesca.calibrated.filterred.vcf -o vesca.new.ref.fa
java -jar $PATH_GATK/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R straw-257.FvHaw4_reference.genes.fa -V corimbosa.calibrated.filterred.vcf -o corimbosa.new.ref.fa
Obtain information on sequencing coverage
samtools mpileup -uf vesca.new.ref.fa vesca.calibrated.bam > vesca.all_sites.calibrated.bcf
bcftools view -cg vesca.all_sites.calibrated.bcf > vesca.all_sites.calibrated.vcf
samtools mpileup -uf corimbosa.new.ref.fa corimbosa.calibrated.bam > corimbosa.all_sites.calibrated.bcf
bcftools view -cg corimbosa.all_sites.calibrated.bcf > corimbosa.all_sites.calibrated.vcf
Final alignment and genotypes
./get_variants.sh ./vesca.new.ref.fa SRR5275181 single ./ SRR5275182 paired ./ vesca 2 vesca_round2 > get_variants_vesca_out 2> get_variants_vesca_err
./get_variants.sh ./corimbosa.new.ref.fa SRR5275235 paired ./ SRR5275236 paired ./ corimbosa 4 corimbosa_round2 > get_variants_corimbosa_out 2> get_variants_corimbosa_err
echo "perl util_assess_coverage_by_position.pl vesca_round2/vesca.all_sites.calibrated.vcf 4" > commands_coverage_by_position.sh
echo "perl util_assess_coverage_by_position.pl corimbosa_round2/corimbosa.all_sites.calibrated.vcf 4" >> commands_coverage_by_position.sh
./commands_coverage_by_position.sh > coverage_by_position.txt
Haplotype phasing using assembly
Set path to HapCompass:
java -jar $PATH_HAPCOMPASS/hapcompass.jar --bam ./vesca_round2/vesca.calibrated.bam --vcf ./vesca_round2/vesca.calibrated.filterred.vcf -o ./vesca_round2/vesca_3.compas_out -p 2 -r "3" --debug
java -jar $PATH_HAPCOMPASS/hapcompass.jar --bam ./corimbosa_round2/corimbosa.calibrated.bam --vcf ./corimbosa_round2/corimbosa.calibrated.filterred.vcf -o ./corimbosa_round2/corimbosa_30.compas_out -p 4 -r "30" --debug