Olga K Kamneva

Assembling haplotypes from NGS data for di- and polyploids

Overview

This protocol follows GATK best practices genotyping workflow and was employed for studying phylogeny of di- and polyploid Fragaria species using multi-locus sequence data obtained using Hyb-Seq approach (Kamneva et al. In revision) but can be applied to any group of organisms. It was developed on Mac OS but will probably work just as well in any Unix based environment.
In short, the protocol includes iterative procedire for obtaining individual specific reference and genotypes from sequencing reads, genotypes and mapped reads from the second iteration of the pipeline are processed through HapCompass to obtain phased varinats.

This tutorial is still work in progress. Please feel free to contact me if you have questions, suggestions, need help or something does not work.
Custom scripts supplied with the protocol are not guarantied to work. Here is the license statement [LICENSE].

Objectives

Assemble individual haplotype sequences from a mixture of reads generated using several templates at once as happens when sequenced DNA is extracted from di- or polyploid sample.

Use bwa, GATK and HapCompass on F. vesca (diploid) and F. corimbosa (tetraploid) NGS data.

Sample data

Single- and paired-end read from about 257 loci from F. vesca, diploid, (SRR5275181, SRR5275182) and F. corimbosa, tetraploid, (SRR5275235, SRR5275236).

Genomic regions of F. vesca reference genome [straw-257.FvHaw4_reference.genes.fa] which will be used as initial references in assembly.

Programs

Make sure that you have all of those programs installed on your computer, working and their location is added to the path. However, some programs, for instance, Trimmomatic, will be run by setting environmental variables which point to the tool, rather than having them in the path:

Also please download a tar.gz haplotype_assembly_scripts.tar.gz file and unpack it. The archive contains folder with following custom scripts, these will be accessed through their location in this tutorial, but they can be copied into one of the path directoies, eliminating need for specifying their location:

Utility script 1 [util_1.pl].

Utility script 2 [util_genotype_filter_1.pl].

Utility script 3 that allows to summarize site-by-site information about read coverage [util_assess_coverage_by_position.pl].

Script for one iteration of the pipeline [get_variants.sh], if you would like to skip this tutorial, you can just run this script 2 times, first with straw-257.FvHaw4_reference.genes.fa as reference file, then with individual specific reference generated in first iteration and then jump directly to haplotype phasing using assembly section.

Main steps of the protocol

Download reads

Cleaning up raw reads

Reference sequence preparation

Generating individual-specific reference

Obtain information on sequencing coverage

Final alignment and genotypes

Haplotype phasing using assembly

Download reads

Download reads for two individuals from SRA using fastq-dump utility from sratoolkit:

									fastq-dump SRR5275181

									fastq-dump --split-files SRR5275182

									fastq-dump --split-files SRR5275235

									fastq-dump --split-files SRR5275236

Cleaning up raw reads

Clean up the reads with Trimmomatic
Set environmental variable specifying path to Trimmomatic. Use location relevant to your computer:

									export PATH_TRIMMOMATIC=~/Downloads/Trimmomatic-0.32/
								

F. vesca, single-end reads:

									java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 
									-trimlog SRR5275181.trimlog 
									SRR5275181.fastq SRR5275181_trimmed.fastq  
									LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36  
									ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-SE.fa:2:30:10

								

F. vesca, paired-end reads:

									java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33  
									-trimlog SRR5275182.trimlog  
									SRR5275182_1.fastq SRR5275182_2.fastq  
									SRR5275182.1P SRR5275182.1U SRR5275182.2P SRR5275182.2U  
									LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36  
									ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10

								

F. corimbosa, paired-end reads, both libraries:

									java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33  
									-trimlog SRR5275235.trimlog  
									SRR5275235_1.fastq SRR5275235_2.fastq  
									SRR5275235.1P SRR5275235.1U SRR5275235.2P SRR5275235.2U  
									LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36  
									ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10
									
									

									
									java -classpath $PATH_TRIMMOMATIC/trimmomatic-0.32.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33  
									-trimlog SRR5275236.trimlog  
									SRR5275236_1.fastq SRR5275236_2.fastq  
									SRR5275236.1P SRR5275236.1U SRR5275236.2P SRR5275236.2U  
									LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36  
									ILLUMINACLIP:$PATH_TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10
									

Reference sequence preparation

Download following reference file straw-257.FvHaw4_reference.genes.fa
It contains sequences for 257 F. vesca genes which were target-captured using hybridization as described in the paper.
Index the reference with BWA:

									bwa index straw-257.FvHaw4_reference.genes.fa
								

Index it with SAMtools:

									samtools faidx straw-257.FvHaw4_reference.genes.fa
								

Create dictionary with PICARDtools
Set environmental variable specifying path to PICARDtools. Use location relevant to your computer.
Then run CreateSequenceDictionary.jar:

									export PATH_PICARD=~/Downloads/picard-tools-1.96/picard-tools-1.96/
									
									java -jar $PATH_PICARD/CreateSequenceDictionary.jar 
									REFERENCE=straw-257.FvHaw4_reference.genes.fa 
									OUTPUT=straw-257.FvHaw4_reference.genes.dict

Generating individual-specific reference

Align reads to reference with BWA

F. vesca, single-end reads:

									bwa aln -f SRR5275181.sai straw-257.FvHaw4_reference.genes.fa SRR5275181_trimmed.fastq

								

F. vesca, paired-end reads:

									bwa aln -f SRR5275182.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.1P

									bwa aln -f SRR5275182.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.2P

									bwa aln -f SRR5275182.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.1U

									bwa aln -f SRR5275182.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275182.2U

F. corimbosa, paired-end reads, both libraries:

									bwa aln -f SRR5275235.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.1P

									bwa aln -f SRR5275235.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.2P

									bwa aln -f SRR5275235.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.1U

									bwa aln -f SRR5275235.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275235.2U

									bwa aln -f SRR5275236.1P.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.1P

									bwa aln -f SRR5275236.2P.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.2P

									bwa aln -f SRR5275236.1U.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.1U

									bwa aln -f SRR5275236.2U.sai straw-257.FvHaw4_reference.genes.fa SRR5275236.2U

Convert to sam files using SAMtools, repetitive hits will be randomly chosen

F. vesca, single-end reads:

									bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275181.sai SRR5275181_trimmed.fastq > SRR5275181.sam

F. vesca, paired-end reads:

									bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275182.1P.sai SRR5275182.2P.sai  
									SRR5275182.1P SRR5275182.2P > SRR5275182.sam 
									
									bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275182.1U.sai SRR5275182.1U > SRR5275182.1U.sam
									
									bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275182.2U.sai SRR5275182.2U > SRR5275182.2U.sam

F. corimbosa, paired-end reads, both libraries:

									bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275235.1P.sai SRR5275235.2P.sai  
									SRR5275235.1P SRR5275235.2P > SRR5275235.sam 
									
									bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275235.1U.sai SRR5275235.1U > SRR5275235.1U.sam
									
									bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275235.2U.sai SRR5275235.2U > SRR5275235.2U.sam
									
									bwa sampe -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275236.1P.sai SRR5275236.2P.sai 
									SRR5275236.1P SRR5275236.2P > SRR5275236.sam 
									
									bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275236.1U.sai SRR5275236.1U > SRR5275236.1U.sam
									
									bwa samse -r '@RG\tID:foo\tSM:bar\tPL:ILLUMINA' straw-257.FvHaw4_reference.genes.fa 
									SRR5275236.2U.sai SRR5275236.2U > SRR5275236.2U.sam

Use SAMtools to convert sam files to bam files, sort and index resulting bam files

F. vesca, single-end reads:

									samtools view -b -S -o SRR5275181.bam SRR5275181.sam

									samtools sort SRR5275181.bam SRR5275181.sorted
	
									samtools index SRR5275181.sorted.bam

F. vesca, paired-end reads:

									samtools view -b -S -o SRR5275182.bam SRR5275182.sam	

									samtools sort SRR5275182.bam SRR5275182.sorted
	
									samtools index SRR5275182.sorted.bam
									
									samtools view -b -S -o SRR5275182.1U.bam SRR5275182.1U.sam	

									samtools sort SRR5275182.1U.bam SRR5275182.1U.sorted
			
									samtools index SRR5275182.1U.sorted.bam
									
									samtools view -b -S -o SRR5275182.2U.bam SRR5275182.2U.sam	

									samtools sort SRR5275182.2U.bam SRR5275182.2U.sorted
			
									samtools index SRR5275182.2U.sorted.bam

F. corimbosa, paired-end reads, both libraries:

									samtools view -b -S -o SRR5275235.bam SRR5275235.sam	

									samtools sort SRR5275235.bam SRR5275235.sorted
	
									samtools index SRR5275235.sorted.bam
									
									samtools view -b -S -o SRR5275235.1U.bam SRR5275235.1U.sam	

									samtools sort SRR5275235.1U.bam SRR5275235.1U.sorted
			
									samtools index SRR5275235.1U.sorted.bam
									
									samtools view -b -S -o SRR5275235.2U.bam SRR5275235.2U.sam	

									samtools sort SRR5275235.2U.bam SRR5275235.2U.sorted
			
									samtools index SRR5275235.2U.sorted.bam
									
									samtools view -b -S -o SRR5275236.bam SRR5275236.sam	

									samtools sort SRR5275236.bam SRR5275236.sorted
	
									samtools index SRR5275236.sorted.bam
									
									samtools view -b -S -o SRR5275236.1U.bam SRR5275236.1U.sam	

									samtools sort SRR5275236.1U.bam SRR5275236.1U.sorted
			
									samtools index SRR5275236.1U.sorted.bam
									
									samtools view -b -S -o SRR5275236.2U.bam SRR5275236.2U.sam	

									samtools sort SRR5275236.2U.bam SRR5275236.2U.sorted
			
									samtools index SRR5275236.2U.sorted.bam

Merge alignments from different libraries with SAMtools, sort and index merged files

F. vesca:

									samtools merge -u vesca.merge.bam SRR5275181.sorted.bam SRR5275182.sorted.bam SRR5275182.1U.sorted.bam SRR5275182.2U.sorted.bam

									samtools sort vesca.merge.bam vesca.merge.sorted
	
									samtools index vesca.merge.sorted.bam

F. corimbosa:

									samtools merge -u corimbosa.merge.bam SRR5275235.sorted.bam SRR5275235.1U.sorted.bam SRR5275235.2U.sorted.bam SRR5275236.sorted.bam SRR5275236.1U.sorted.bam SRR5275236.2U.sorted.bam

									samtools sort corimbosa.merge.bam corimbosa.merge.sorted
	
									samtools index corimbosa.merge.sorted.bam

Remove PCR duplicates with SAMtools and index new files
If coordinates of two reads in the mapping are identical, duplicated reads are considered to be a PCR artifact and are removed from alignment.

F. vesca and F. corimbosa:

									samtools rmdup -sS vesca.merge.sorted.bam vesca.dedup.bam

									samtools index vesca.dedup.bam
									
									samtools rmdup -sS corimbosa.merge.sorted.bam corimbosa.dedup.bam

									samtools index corimbosa.dedup.bam

Find indels with GATK, RealignerTargetCreator, and realign reads around indels with GATK, IndelRealigner
Set path to GATK:

									export PATH_GATK=~/Downloads/GenomeAnalysisTK-2.6-5-gba531bd/
									

Run GATK for both individuals:

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R straw-257.FvHaw4_reference.genes.fa -I vesca.dedup.bam -o vesca.realigner.intervals

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T IndelRealigner -R straw-257.FvHaw4_reference.genes.fa -I vesca.dedup.bam -targetIntervals vesca.realigner.intervals -o vesca.realigned.bam

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.dedup.bam -o corimbosa.realigner.intervals

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T IndelRealigner -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.dedup.bam -targetIntervals corimbosa.realigner.intervals -o corimbosa.realigned.bam

Call genotypes with GATK, UnifiedGenotyper
GATK can call genotypes for samples of varying ploidy so it is suitable for analysis of data from polyploid samples:

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -ploidy 2 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o vesca.realigned.vcf

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -ploidy 4 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o corimbosa.realigned.vcf

Recalibrate allignment with GATK and call genotypes again

F. vesca :

									perl util_1.pl vesca.realigned.vcf vesca.realigned.edited.vcf

									perl util_genotype_filter_1.pl vesca.realigned.edited.vcf vesca.realigned.filterred.vcf

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -knownSites vesca.realigned.filterred.vcf -o vesca.recal_data.table

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T PrintReads -R straw-257.FvHaw4_reference.genes.fa -I vesca.realigned.bam -BQSR vesca.recal_data.table -o vesca.calibrated.bam

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I vesca.calibrated.bam -ploidy 2 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o vesca.calibrated.vcf

									perl util_1.pl vesca.calibrated.vcf vesca.calibrated.edited.vcf

									perl util_genotype_filter_1.pl vesca.calibrated.edited.vcf vesca.calibrated.filterred.vcf

F. corimbosa:

									perl util_1.pl corimbosa.realigned.vcf corimbosa.realigned.edited.vcf

									perl util_genotype_filter_1.pl corimbosa.realigned.edited.vcf corimbosa.realigned.filterred.vcf

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -knownSites corimbosa.realigned.filterred.vcf -o corimbosa.recal_data.table

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T PrintReads -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.realigned.bam -BQSR corimbosa.recal_data.table -o corimbosa.calibrated.bam

									java -jar $PATH_GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R straw-257.FvHaw4_reference.genes.fa -I corimbosa.calibrated.bam -ploidy 4 -minIndelFrac 0.08 -minIndelCnt 2 -stand_call_conf 17.0 --genotype_likelihoods_model BOTH -maxAltAlleles 2 -deletions 0.1 -o corimbosa.calibrated.vcf

									perl util_1.pl corimbosa.calibrated.vcf corimbosa.calibrated.edited.vcf

									perl util_genotype_filter_1.pl corimbosa.calibrated.edited.vcf corimbosa.calibrated.filterred.vcf

Call consensus sequence with GATK, alternate reference maker
This step will generate new organism-specific reference:.

								java -jar $PATH_GATK/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R straw-257.FvHaw4_reference.genes.fa -V vesca.calibrated.filterred.vcf -o vesca.new.ref.fa
								
								java -jar $PATH_GATK/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R straw-257.FvHaw4_reference.genes.fa -V corimbosa.calibrated.filterred.vcf -o corimbosa.new.ref.fa

Obtain information on sequencing coverage

Obtain site specific coverage informarion with SAMtools:

								samtools  mpileup -uf vesca.new.ref.fa vesca.calibrated.bam > vesca.all_sites.calibrated.bcf

								bcftools view -cg vesca.all_sites.calibrated.bcf > vesca.all_sites.calibrated.vcf
								
								samtools  mpileup -uf corimbosa.new.ref.fa corimbosa.calibrated.bam > corimbosa.all_sites.calibrated.bcf

								bcftools view -cg corimbosa.all_sites.calibrated.bcf > corimbosa.all_sites.calibrated.vcf

Final alignment and genotypes

Follow steps in reference sequence preparation section to prepare organism-specific references.

Follow steps in generating individual-specific reference section but use individual-specific reference sequences rather than original vesca assembli and collect vcf and bam files. These are final alignment and genotypes.

You should have already obtained information on sequencing coverage for the reference in use when preparing it. See end of Obtain information on sequencing coverage section.

Or run script get_variants.sh as described in the Example 1 and 2 in the header (make sure that environmental variables are set properly):

								
									./get_variants.sh ./vesca.new.ref.fa SRR5275181 single ./ SRR5275182 paired ./ vesca 2 vesca_round2 > get_variants_vesca_out 2> get_variants_vesca_err
									
									./get_variants.sh ./corimbosa.new.ref.fa SRR5275235 paired ./ SRR5275236 paired ./ corimbosa 4 corimbosa_round2 > get_variants_corimbosa_out 2> get_variants_corimbosa_err

This script encapsilates entire iteration of the reference, genotype and coverage information generating process and therefore can just be run 2 times, once with straw-257.FvHaw4_reference.genes.fa as reference file, then with individual specific reference generated in first iteration and then go directly to next section haplotype phasing using assembly.

It is handy to summarize read coverage information across all of the individuals. To do this first prepare commans file which will run util_assess_coverage_by_position.pl script for each of the all_sites.calibrated.vcf files, supplaying threshould value of the coverage. In this case it is at least 4 reads:

									echo "perl util_assess_coverage_by_position.pl vesca_round2/vesca.all_sites.calibrated.vcf 4" > commands_coverage_by_position.sh
									
									echo "perl util_assess_coverage_by_position.pl corimbosa_round2/corimbosa.all_sites.calibrated.vcf 4" >> commands_coverage_by_position.sh

Then run it to get coverage_by_position.txt which esentially

									chmod 777 commands_coverage_by_position.sh

									./commands_coverage_by_position.sh > coverage_by_position.txt

This will generate 3 column file: original vcf file name, gene name, string of As and Ns which simbolize acceptable and not acceptable read coverage for each position. Similar file will be used in the next walk-through.

Haplotype phasing using assembly

Run HapCompass
Set path to HapCompass:

								export PATH_HAPCOMPASS=~/Downloads/hapcompass_v0.8.2/
								

Run HapCompass for both individuals, one gene at a time, here giving example for genes 3 and 30:

								java -jar $PATH_HAPCOMPASS/hapcompass.jar --bam ./vesca_round2/vesca.calibrated.bam --vcf ./vesca_round2/vesca.calibrated.filterred.vcf -o ./vesca_round2/vesca_3.compas_out -p 2 -r "3" --debug

								java -jar $PATH_HAPCOMPASS/hapcompass.jar --bam ./corimbosa_round2/corimbosa.calibrated.bam --vcf ./corimbosa_round2/corimbosa.calibrated.filterred.vcf -o ./corimbosa_round2/corimbosa_30.compas_out -p 4 -r "30" --debug

These will generate a number of files with prefix vesca_3.compas_out_ and corimbosa_30.compas_out_, the vesca_3.compas_out_MWER_solution.txt and corimbosa_30.compas_out_MWER_solution.txt contain haplotype assembly blocks. Please see program manual for file format and for infomation on other output files.

The end

This is the end of this turorial, please procees to phylogeny reconstruction using haplotype and consensus sequences from several Fragaria species walk-through to learn about the next step of the project.