Phylogeny reconstruction using haplotype and consensus sequences from several Fragaria species

Overview

This protocol follows somewhat standart phylogeny inference framework and is only complicated by the fact that some individualsl in the dataset are represented by haplotypes and some are represented by consensus sequences. This protocol was employed for studying phylogeny of di- and polyploid Fragaria species using multi-locus sequence data obtained using Hyb-Seq approach (Kamneva et al. In revision) but can be applied to any group of organisms. It was developed on Mac OS but will probably work just as well in any Unix based environment.

This tutorial is still work in progress. Please feel free to contact me if you have questions, suggestions, need help or something does not work.
Custom scripts supplied with the protocol are not guarantied to work. Here is the license statement [LICENSE].

Objectives

  • Build gene trees using mix of haplotype and consensus sequences for a set of strawberry species.
  • Use set of custom scripts, MAFFT, ReadSeq, jModelTest, PhyML and CONSEL.
  • Sample data

    Please download a tar.gz phylogeny_fragaria_input.tar.gz file and unpack it. The archive contains number of files, these will be accessed through their location in this tutorial so place the resulting forder in your working directory:
  • coverage_by_position.txt - 3 column file, columns are as follows: (1) input vcf file, (2) gene, (3) string of As and Ns indicating if coverage in a given position of the gene is acceptable or not.
  • 40 files with suffix .calibrated.filterred.vcf, one for each individual.
  • 40x257 files with suffix .compas_out_MWER_solution.txt, one per individual, per gene.
  • organisms_tab12_edited - file describing included individuals.
  • straw-257.FvHaw4_reference.genes.fa - F. vesca reference file initially used in assembly.
  • gene_N.fa files where N is from 1 to 257, those are individial specific consensus sequencies aligned to each other.
  • gene_N_al.fa files where N is from 1 to 257, those are unaligned individial specific consensus sequencies.
  • Programs

    Make sure that you have all of those programs installed on your computer, working and their location is added to the path.
  • MAFFT
  • ReadSeq
  • jModelTest
  • PhyML
  • CONSEL
  • ape R package
  • Also please download a tar.gz phylogeny_fragaria_scripts.tar.gz file and unpack it. The archive contains following custom scripts, these will be accessed through their location in this tutorial, but they can be copied into one of the path directoies, eliminating need for specifying their location:
  • R_alignments.r
  • R_functions.r
  • R_get_fragments.r
  • R_insert_Ns.r
  • R_jmodeltest.r
  • R_phyml.r
  • R_site_lnl.r
  • R_trim_reformat.r
  • rename.pl
  • util_get_blocks.pl
  • util_get_haplotypes_by_fragments.pl
  • util_split_vcf.pl
  • get_trees.sh - this is a script that implements entire pipeline (starting from Get fragments of continous haplotype assembly across individual that will be represented by haplotypes in phylogenies step) outlined in this tutorial so it is convenient to run it once you understand the whole process. Edit it to set up environmental variables correctly.
  • Each of the scripts contains option list and example usage in its header.

    Main steps of the protocol

  • Get blocks of continous haplotype assembly per individual
  • Get fragments of continous haplotype assembly across individual that will be represented by haplotypes in phylogenies
  • Prepare and align sequences
  • Determine evolutionary model of best fit
  • Build gene trees
  • Test maximum likelihood gene trees against random trees
  • Collect and plot gene trees
  • Get blocks of continous haplotype assembly per individual

  • Split vcf files into gene specific vcf files (individual vcf files will be written into ./phylogeny_fragaria_gene_vcf folder):

    mkdir ./phylogeny_fragaria_genes

    awk 'NR>1 {print $4"_"$5"_index"$2".calibrated.filterred.vcf"}' phylogeny_fragaria_input/organisms_tab12_edited | \
    while read i;\
    do perl ./phylogeny_fragaria_scripts/util_split_vcf.pl ./phylogeny_fragaria_input/${i} ./phylogeny_fragaria_genes; \
    done;

    Generate haplotype block markup for each gene for each individual (_blocks file will be written into ./phylogeny_fragaria_gene_vcf folder for each gene for each individual):

    awk 'NR>1 {print $4"_"$5"_index"$2}' phylogeny_fragaria_input/organisms_tab12_edited | \
    while read i;\
    do for g in {1..257}; \
    do perl ./phylogeny_fragaria_scripts/util_get_blocks.pl \
    ./phylogeny_fragaria_input/${i}_${g}.compas_out_MWER_solution.txt \
    ./phylogeny_fragaria_genes/gene${g}_$i.calibrated.filterred.vcf > \
    ./phylogeny_fragaria_genes/${i}_${g}_blocks; \
    done; done;

  • Get fragments of continous haplotype assembly across individual that will be represented by haplotypes in phylogenies

  • I this tutorial we will execute an example where individuals from diploid species are included as consensus sequences, two virginiana individuals are represented by haplotypes and other polyploids are excluded.
    In this step we will find regions of contigous haplotype assembly between two F. virginiana individuals (see header of the script for details on how to execute in for a different individuals and to understand it better):
    Rscript phylogeny_fragaria_scripts/R_get_fragments.r \
    39,40 \
    virginiana \
    virginiana \
    500 \
    ./phylogeny_fragaria_input/ \
    ./phylogeny_fragaria_input/coverage_by_position.txt \
    ./phylogeny_fragaria_input/organisms_tab12_edited \
    ./phylogeny_fragaria_input/straw-257.FvHaw4_reference.genes.fa \
    ./phylogeny_fragaria_genes \
    ./phylogeny_fragaria_scripts/R_functions.r \
    ./phylogeny_fragaria_scripts \
    ./phylogeny_fragaria_input/ \
    ./phylogeny_fragaria_input/ \
    ./phylogeny_fragaria_input/

    The script will write different output files, including blocks of contigous haplotype assembly and visualization of those in plots_haplotypes_fragments_virginiana.pdf file.
  • Now, we can execude generated command files to get haplotypes by fragments for both individuals:

    chmod 777 virginiana/commands_get_haplotypes_by_fragments_virginiana
    chmod 777 virginiana/commands_allign_groups_make_files_virginiana

    nohup virginiana/commands_get_haplotypes_by_fragments_virginiana > virginiana/commands_get_haplotypes_by_fragments_virginiana_out 2>virginiana/commands_get_haplotypes_by_fragments_virginiana_err
    nohup virginiana/commands_allign_groups_make_files_virginiana > virginiana/commands_allign_groups_make_files_virginiana_out 2>virginiana/commands_get_haplotypes_by_fragments_virginiana_err

    This will generate files with haplotype sequences for each gene and fragment. They are written in fasta format in geneN_fragmentM_haplotypes_genomes_virginiana.fa files in the ./virginiana/ directory.
  • Prepare and align sequences

  • First lets prepare sequences by inserting Ns into consensus positions with low coverage (please see header of the R script for meaning of parameters):
    Rscript phylogeny_fragaria_scripts/R_insert_Ns.r \
    ./phylogeny_fragaria_input/coverage_by_position.txt \
    ./phylogeny_fragaria_input/ \
    ./phylogeny_fragaria_genes
  • Now let's prepare files for alignments, herelong string of integers represents individuals to be included as consensus sequence (see header of the script for details):
    Rscript phylogeny_fragaria_scripts/R_alignments.r \
    39,40 \
    virginiana \
    virginiana \
    500 \
    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 \
    ./phylogeny_fragaria_genes \
    ./phylogeny_fragaria_input/organisms_tab12_edited
    This will generate geneN_fragmentM_all_genomes_virginiana.fa in ./virginiana/ directory, the file contains consensus sequences for diploids (individuals 1 to 29) and haplotype sequences for virginiana individuals (39 and 40).
    Execute created command file:
    chmod 777 virginiana/commands_allign_virginiana
    nohup virginiana/commands_allign_virginiana > commands_allign_virginiana_out 2>commands_allign_virginiana_err
    This will produce alignments written to files geneN_fragmentM_all_genomes_virginiana_mafft.fa in ./virginiana/ directory.
  • After previous step is finished, we have to trim flanks of alignments to remove parts of consensus sequences not covered by haplotypes and remove gapped columns. We will also rename seqences (some names are rather long) and convert alignments into phylip format. The last argument ensures that only first 10 genes are processed (see the script for other arguments):
    Rscript phylogeny_fragaria_scripts/R_trim_reformat.r \
    39,40 \
    virginiana \
    virginiana \
    500 \
    ./phylogeny_fragaria_input/organisms_tab12_edited \
    ./phylogeny_fragaria_scripts \
    10
    Execute generated command file

    export PATH_READSEQ=~/Downloads/

    chmod 777 ./virginiana/commands_rename_convert_virginiana
    ./virginiana/commands_rename_convert_virginiana

    This step produces final alignments.
  • Determine evolutionary model of best fit

  • Generate commands to run jModeltest:
    Rscript phylogeny_fragaria_scripts/R_jmodeltest.r \
    virginiana \
    virginiana
    Execute obtained set of commands:
  • export PATH_JMODELTEST=~/Downloads/jmodeltest-2.1.10

    chmod 777 ./virginiana/commands_jmodeltest
    nohup ./virginiana/commands_jmodeltest > ./virginiana/commands_jmodeltest_out 2>./virginiana/commands_jmodeltest_err

    This will produce virginiana/geneN_fragmentM_all_genomes_virginiana_mafft_filterred_renamed_jmodeltest files with jModelTest results.

    Build gene trees

  • Generate commands to run PhyML with 100 bootstraps and best model of evolution estimated with jModelTest:
    Rscript phylogeny_fragaria_scripts/R_phyml.r \
    virginiana \
    virginiana
    Execute obtained set of commands, make sure that PhyML command line executable is in the one of teh path directories and accessible through phyml command from anywhere:
  • chmod 777 ./virginiana/commands_phyml
    nohup ./virginiana/commands_phyml > commands_phyml_out 2> commands_phyml_err

    Test maximum likelihood gene trees against random trees

  • Dedicated directory for this analysis will be created in output directory (virginiana in this case), this directory is named consel. The script will generate 100 random trees (virginiana/consel/XXX_randomN.ph_phyml_tree.txt) and command file (virginiana/commands_site_lnl) to run site-specific likelihood calculation for each random tree using the alignment for each fragment:
    Rscript phylogeny_fragaria_scripts/R_site_lnl.r \
    virginiana \
    virginiana \
    12345
    Execute obtained set of commands:

    chmod 777 ./virginiana/commands_site_lnl
    nohup ./virginiana/commands_site_lnl > ./virginiana/commands_site_lnl_out 2> ./virginiana/commands_site_lnl_err

  • Compare maxmum likelihood tree to random trees
    Rscript phylogeny_fragaria_scripts/R_consel.r \
    virginiana \
    virginiana
    This will create file virginiana/commands_consel with commands to greage all site-specific likelihood files (across ML gene tree and random trees) and then run consel (makermt, consel and catpv).
    Execute obtained set of commands:

    export PATH_CONSEL=~/Downloads/consel/src

    chmod 777 ./virginiana/commands_consel
    nohup ./virginiana/commands_consel > ./virginiana/commands_consel_out 2> ./virginiana/commands_consel_err
    This will generate XXX_test_random.ph_phyml_lk.txt, number of intermediate consel files with prefix XXX_test_random.ph_phyml_lk and final XXX_consel_pv file which contains ranks for the input trees (ML tree is the first one) and p-values for a number of test comparing the trees to each other for each fragment. In the study we used SH-test to say if ML tree really is better than random trees and therefore alignment does contain some phylogenetic signal.
  • Collect and plot gene trees

  • Run an R script and get results!
    Rscript phylogeny_fragaria_scripts/R_plot.r \
    39,40 \
    virginiana \
    virginiana \
    Drymocallis_glandulous_index40 \
    ./phylogeny_fragaria_input/organisms_tab12_edited
  • This will create virginiana/trees_virginiana.pdf (visualizations of the obtained gene trees) and virginiana/all_trees_virginiana.txt files (tab delimited table containing newick format trees and information about them in fields listed in header)

    The end

    This is the end of this turorial, please procees to finding and testing hybridization events in Fragaria species history walk-through to learn about the next step of the project.