Let's prove evolution!

Difflugia
Guru
Posts: 2298
Joined: Wed Jun 12, 2019 10:25 am
Location: Michigan
Has thanked: 1848 times
Been thanked: 1369 times

Let's prove evolution!

Post #1

Post by Difflugia »

I have contended several times in the past and again very recently that the pattern in genetic relationships between extant organisms is sufficient to demonstrate evolution in the Darwinian sense ("descent with modification") to the exclusion of any other reasonable explanation for biodiversity. By "reasonable explanation," I mean any that can even in principle be distinguished from evolution, so the only unreasonable ones would be something like claiming that the gods created life in exactly the pattern that we would expect from evolution with no justification beyond the gods being able to do whatever they want ("Last Thursdayism").

That's not to say that other lines of evidence (fossils, for example) are useless, but only that with modern DNA sequencing and computers, we could establish the fact of evolution even without them.

Since this is more of a tutorial than debate as such, I've put it into Random Ramblings. If anyone wants to debate anything I say, I'm game, but please start a debate thread. I'm happy to answer questions or clarify things here, but if it's "just asking questions", please start a debate thread. I'd also be happy to engage in any non-debate discussion of implications, either in this thread, another Random Ramblings thread, or General Chat.

The main premise is that with a relatively simple set of rules, the difference between two genetic sequences (or an analog, like sequences of amino acids in a protein) can be given a score that measures how different they are. With a list of gene sequences and the set of difference scores (or "distances") between them, we can arrange the sequences into a tree. If all of our hypotheses are correct, then the tree will match the history of descent. Each "leaf" (terminal node) in the tree will represent one of the sequences in the list and each "branch" (nonterminal node) will represent a common ancestor.
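To make "difference score" concrete, here's a minimal sketch of about the simplest scoring rule there is: the fraction of positions at which two already-aligned sequences differ. The function names are my own for illustration, and real tools use more sophisticated rules than this. The three snippets are the first seven residues of sequences from the input file used later in this thread.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_table(seqs: dict) -> dict:
    """Score every pair of named sequences."""
    return {
        (m, n): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


# First seven residues of three sequences from the tutorial's input file.
seqs = {
    "Baboon": "MLINRWL",
    "Chimpanzee": "MFTDRWL",
    "Human": "MFADRWL",
}

for (m, n), d in distance_table(seqs).items():
    print(f"{m:>10} vs {n:<10} {d:.3f}")
```

Even on a snippet this short, the chimpanzee and human sequences score closer to each other than either does to the baboon, which is the kind of pattern the tree-building step works from.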

The goal here will be to show that a moderately computer- and science-literate person can show that each hypothesis is true before relying on it as a premise for testing another hypothesis. Most creationist objections about the validity of any such experiment revolve around claims that the various premises are "evolutionary assumptions," but they ignore all of the other experiments showing the validity of the premises. I expect this tutorial to allow any sufficiently motivated person to prove any of the underlying hypotheses to any arbitrary standard. While that still may not convince another creationist, the goal is to offer anyone the ability to conduct enough experiments to at least convince themselves.

Since this will require a number of otherwise independent discussions before we even get to individual hypotheses (installing the software, obtaining data, interpreting results), I'll be doing this as a series of posts.

For anyone just interested in seeing the tree without going to the trouble of collecting and formatting data, a well-done, interactive version is called the OneZoom Tree of Life. One can start at the beginning and manually navigate with the mouse. The wheel zooms in and out and dragging changes your location within the tree. It's easy to get lost and I often find myself zooming back out to start over. There are also links to specific nodes of common interest, like human beings, birds, and flowering plants. As you zoom in or out yourself, the browser link will update so you could paste it into a conversation here, if you liked. As an example, a rather insignificant, one-celled eukaryote that I'm nonetheless rather fond of can be found here.
My preferred pronouns are he, him, and his.


Re: Let's prove evolution!

Post #2

Post by Difflugia »

First, let's get the software installed and make sure it works. There are two main software packages that I use. Both are easy to get and don't actually require installation beyond unzipping into a folder, but both are command line tools and neither is particularly user-friendly. In this post, we'll download and test the first of the two, MUSCLE.

MUSCLE stands for MUltiple Sequence Comparison by Log-Expectation. This program takes a list of sequences and aligns them to be used as input for tree generation. We'll discuss what this means later, but the important part for now is that this is the first step in a two-step process.

A slight hiccup is that since the last time I checked, a new version (v5) of MUSCLE has been released, and it no longer generates output that's directly compatible with PHYLIP (the program discussed in the next post). The updates to the algorithm are mainly for larger datasets and are unlikely to have much effect on the datasets we'll be using, so rather than insert another tool into the process, we'll use the previous version, 3.8.31, for this exercise. It can be downloaded from this page. Each download is just an executable for a particular architecture. I use Linux, but will be assuming that most people will use Windows.

Though there are countless ways to set things up, probably the easiest setup will be to create a desktop folder for your experiments. I'll be assuming that it's called Phylogeny, because that's what I've named mine. Download muscle3.8.31_i86win32.exe into that folder. Since it's easier to type for both of us, I suggest renaming it to muscle.exe and will be assuming that you have done so. As long as you're willing to do your experiments in the same folder as the executable, then the installation is complete.

Now let's try running it. Start a command prompt. Choosing "Run" from the Start Menu and entering cmd.exe is one way. Change to the Phylogeny directory that we just created. I'm going to leave this vague because the process is slightly different for each version of Windows. If anyone has trouble with a specific Windows version, either comment or PM me and I'll try to help.

Now you're at a command prompt in the Phylogeny folder and should see something like this:

Code: Select all

Microsoft Windows 2000 [Version 5.02.3790]
(C) Copyright 1985-2000 Microsoft Corp.

C:\Users\Difflugia\Desktop\Phylogeny>
Type muscle at the prompt and you should see version and usage information followed by the prompt:

Code: Select all

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.


Basic usage

    muscle -in <inputfile> -out <outputfile>

Common options (for a complete list please see the User Guide):

    -in <inputfile>    Input file in FASTA format (default stdin)
    -out <outputfile>  Output alignment in FASTA format (default stdout)
    -diags             Find diagonals (faster for similar sequences)
    -maxiters <n>      Maximum number of iterations (integer, default 16)
    -maxhours <h>      Maximum time to iterate in hours (default no limit)
    -html              Write output in HTML format (default FASTA)
    -msf               Write output in GCG MSF format (default FASTA)
    -clw               Write output in CLUSTALW format (default FASTA)
    -clwstrict         As -clw, with 'CLUSTAL W (1.81)' header
    -log[a] <logfile>  Log to file (append if -loga, overwrite if -log)
    -quiet             Do not write progress messages to stderr
    -version           Display version information and exit

Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags

C:\Users\Difflugia\Desktop\Phylogeny>
Now we'll run it on data. Without worrying about details for the moment, open Notepad, then paste the following into a new document:

Code: Select all

>Baboon
MLINRWLFSTNHKDIGTLYLLFGAWAGVTGMALSLLIRAELGQPGSLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRLNNMSFWLLPPSFLLLMASIAVEAGAGTGWTVYPPLSGNFSHPG
ASVDLVIFSLHLAGISSILGAINFITTIINMKPPAMSQYQTPLFVWSILITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPVGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTHYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGGNIKWSPAMLWALGFIFLFTM
GGLTGIILANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTCAKAHFIITFVG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNTLSSMGSFISLTATILMIYMIWEAFASKRKVLLTEHPST
SLEWLNGCPPPHHTFEEPAYIKLNEKGGSRTP
MAHPVQLGLQDATSPVMEELITFHDQALMAMFLISFLILYALSSTLTTKLTNTNITDAQEMETIWTILPA
VILILIALPSLRILYMTDEINNPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLNPGDLRLLEVDN
RVVLPIEAPVRMMITSQDVLHSWTIPTLGLKTDAVPGRLNQTVFTATRPGVYYGQCSEICGANHSFMPIV
AELIPLKIFEMGPVFTL
MTHQLHAYHMVKPSPWPLTGALSAFLLTSGLIMWFHFYSTALLTLGLLTNALTMYQWWRDIIRESTYQGH
HTTPVQKSLRYGMTLFIISEVFFFAGFFWAFYHSSLAPTPRLGCHWPPTGITPLNPLEVPLLNTSVLLAS
GVTITWAHHSLMNGNRKQTIQALLITILLGTYFTLLQISEYFEAPFTISDGIYGSTFFVATGFHGLHVII
GSTFLLICLIRQLFYHFTPSHHFGFEAAAWYWHFVDVIWLFLYISIYWWGS

>Chimpanzee
MFTDRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGISSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIQFAIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNVLSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSA
NLEWLYGCPPPYHTFEEPVYMKS
MAHAAQVGLQDATSPIMEELIIFHDHALMIIFLICFLVLYALFLTLTTKLTNTSISDAQEMETVWTILPA
IILVLIALPSLRILYMTDEVNDPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPVEAPVRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFTL
MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFYSTTLLTLGLLTNTLTMYQWWRDVMREGTYQGH
HTPPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTGITPLNPLEVPLLNTSVLLAS
GVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASEYFESPFTISDGIYGSTFFVATGFHGLHVII
GSTFLTICLIRQLMFHFTSKHHFGFQAAAWYWHFVDVVWLFLYVSIYWWGS

>Bonobo
MFTDRWLFSTNHKDIGTLYLLFGTWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIQFAIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNVLSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSA
NLEWLYGCPPPYHTFEEPVYMKS
MAHAAQVGLQDATSPIMEELIIFHDHALMIIFLICFLVLYALFLTLTTKLTNTSISDAQEMETVWTILPA
IILVLIALPSLRILYMTDEVNDPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPVEAPVRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFTL
MAHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFYSTTLLTLGLLTNTLTMYQWWRDVMRESTYQGH
HTPPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTGITPLNPLEVPLLNTSVLLAS
GVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASEYFESPFTISDGIYGSTFFVATGFHGLHVII
GSTFLTICLIRQLMFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGS

>Gorilla
MFTDRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASAMVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGISSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNTKWSAAMLWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFAIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMIEEPST
NLEWLYGCPPPYHTFEEPVYMK
MAHAAQVGLQDATSPIMEELIIFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETIWTILPA
IILVLIALPSLRILYMTDEINDPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPVEAPVRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFAL
MIHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSTTLLMLGLLTNMLTMYQWWRDVMRESTYQGH
HTLPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGAHWPPTGITPLNPLEVPLLNTSVLLAS
GVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASEYFEAPFTISDGIYGSTFFVATGFHGLHVII
GSTFLTICLIRQLMFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGS

>Neandertal
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSM
NLEWLYGCPPPYHTFEEPVYMKS
MAHAAQVGLQDATSPIMEELIIFHDHALMIIFLICFLVLYALFLTLTTKLTNTSISDAQEMETVWTILPA
IILVLIALPSLRILYMTDEVNDPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPVEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFTL
MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSTTLLMLGLLTNTLTMYQWWRDVTRESTYQGH
HTPPVQKGLRYGMVLFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTGITPLNPLEVPLLNTSVLLAS
GVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASEYFESPFTISDGIYGSTFFVATGFHGLHVII
GSTFLTICFIRQLMFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGS

>Human
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSM
NLEWLYGCPPPYHTFEEPVYMKS
MAHAAQVGLQDATSPIMEELITFHDHALMIIFLISFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPA
IILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFTL
MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSMTLLMLGLLTNTLTMYQWWRDVTRESTYQGH
HTPPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTGITPLNPLEVPLLNTSVLLAS
GVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASEYFESPFTISDGIYGSTFFVATGFHGLHVII
GSTFLTICFIRQLMFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGS

>Heidelberg Man
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAXISSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFAIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSM
NLEWLYGCPPPYHTFEEPVYMKS
MAHAAQVGLQDATSPIMEELIIFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPA
IILILIALPSLRILYMTDEVNDPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPVEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFTL
MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSTXXXXXXXXTNTLTMYQWWRDVTRESTYQGH
HTPPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPXXXXXWPPTGITPLNPLEVPLLNTSVLLAS
GVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASEYFESPFTISDGIYGSTFFVATGFHGLHVII
GSTFLTICFIRQLMFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGS

>Sumatran Orangutan
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASATVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGISSILGAINFITTIINMKPPAMSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTHYSGKEEPFGYMGMVWAMVSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNTKWSAAILWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFITMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSAGSFISLTAVMLMIFMIWEAFASKRKVPMVEQPST
SLEWLYGCPPPYHTFEEPVYMKPEQK
MAHAAQVGLQDATSPIMEELVIFHDHALMIIFLICFLVLYALFLTLTTKLTNTSISDAQEMETIWTILPA
IILILIALPSLRILYLTDEINDPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPVEAPVRMMITSQDVLHSWTVPSLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFTL
MAHQSHAYHMVKPSPWPLTGALSALLTTSGLTMWFHFHSTTLLLTGLLTNALTMYQWWRDVVRESTYQGH
HTLPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTGIIPLNPLEVPLLNTSVLLAS
GVSITWAHHSLMENNRTQMIQALLITILLGIYFTLLQASEYIEAPFTISDGIYGSTFFMATGFHGLHVII
GSTFLTVCLARQLLFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGS

>Bornean Orangutan
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPMMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLLPSFLLLLASATVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGISSILGAINFITTIINMKPPAMSQYQTPLFVWSILITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTHYSGKKEPFGYMGMVWAMVSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNTKWSAAILWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLNQTYAKIHFITMFVG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSAGSFISLTAVMLMIFMIWEAFASKRKVPMIEQPST
SLEWLYGCPPPYHTFEEPVYMKP
MAHAAQVGLQDATSPIMEELVIFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETIWTILPA
IILILIALPSLRILYLTDEINDPSFTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDN
RVVLPVEAPVRMMITSQDVLHSWTVPSLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIV
LELIPLKIFEMGPVFAL
MVHQSHAYHMLKPSPWPLTGALSALLMTSGLAMWFHFHSTTLLLTGMLTNALTMYQWWRDVVRESTYQGH
HTLPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTGITPLNPLEVPLLNTAVLLAS
GVSITWAHHSLMENNRTQMIQALLITILLGIYFTLLQASEYIEAPFTISDGIYGSTFFMTTGFHGLHVII
GSTFLTVCLSCQLLFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGS
The organisms are hominids, plus a baboon for comparison, and the protein sequences are from the mitochondrial enzyme Cytochrome C Oxidase, abbreviated COX. Select "Save as" and call it hominid_cox.txt. If you laugh, you get detention.
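If you'd like to double-check that the file saved correctly, here's a small sketch of a FASTA reader. The function name is mine and this is only an illustration of the format (a ">" line names a sequence, and every line up to the next ">" belongs to it), not part of MUSCLE or PHYLIP. The sample text is a made-up, shortened stand-in for the real file.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Collect each '>' name with its sequence lines joined together."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Shortened stand-in for hominid_cox.txt, for illustration only.
sample = """>Chimpanzee
MFTDRWL
FSTNHKD

>Human
MFADRWL
FSTNHKD
"""

for name, seq in read_fasta(sample).items():
    print(name, len(seq))
```

To try it on the real data, replace `sample` with the contents of the file, for example `open("hominid_cox.txt").read()`. The number of sequences and their lengths should agree with what MUSCLE reports when it reads the same file.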

Now at your command prompt, run the following command:

Code: Select all

muscle -in hominid_cox.txt -phyiout infile
We're confusingly naming the output "infile" because that's what the program in the next step, PHYLIP, is expecting. The program will output status updates on which part of its algorithm it's executing, but it should finish pretty quickly because we're using a relatively tiny input file. You should see the following screen output:

Code: Select all

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

hominid_cox 9 seqs, max length 1010, avg  length 1002
00:00:00     22 MB(3%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00     22 MB(3%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00     31 MB(4%)  Iter   1  100.00%  Align node       
00:00:00     31 MB(4%)  Iter   1  100.00%  Root alignment
00:00:01     32 MB(4%)  Iter   2  100.00%  Refine tree   
00:00:01     32 MB(4%)  Iter   2  100.00%  Root alignment
00:00:01     32 MB(4%)  Iter   2  100.00%  Root alignment
00:00:01     32 MB(4%)  Iter   3  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter   4  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter   5  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter   6  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter   7  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter   8  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter   9  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  10  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  11  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  12  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  13  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  14  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  15  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  16  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  17  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  18  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  19  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  20  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  21  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  22  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  23  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  24  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  25  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  26  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  27  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  28  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  29  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  30  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  31  100.00%  Refine biparts
00:00:01     32 MB(4%)  Iter  32  100.00%  Refine biparts

C:\Users\Difflugia\Desktop\Phylogeny>
If you see that, then the program worked and generated the input for our next step. Thus endeth the lesson.
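For a quick sanity check of infile before moving on, here's a hedged sketch that reads the header and the names from PHYLIP-style text. I'm assuming the usual layout: a first line with the number of sequences and the alignment length, followed by one line per sequence whose first ten characters are the name. The exact spacing MUSCLE writes may differ from my made-up sample, and the function name is my own.

```python
def phylip_summary(text: str, name_width: int = 10):
    """Return (sequence count, alignment length, names) from PHYLIP-style text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_sites = (int(tok) for tok in lines[0].split())
    # Assumption: names occupy the first `name_width` characters of each line.
    names = [ln[:name_width].strip() for ln in lines[1:1 + n_seqs]]
    return n_seqs, n_sites, names


# Made-up miniature example, not actual MUSCLE output.
sample = (
    "3 14\n"
    "Chimpanzee MFTDRWLFST NHKD\n"
    "Human      MFADRWLFST NHKD\n"
    "Neandertal MFADRWLFST NHKD\n"
)

print(phylip_summary(sample))
```

Running it on the real file (`open("infile").read()`) should report nine sequences if everything went as described above.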

Re: Let's prove evolution!

Post #3

Post by Difflugia »

Let's now download and test PHYLIP, which is a contraction of PHYLogeny Inference Package. The downloads page contains executables for various systems. Scroll down to the Windows entry and download the correct ZIP file for your architecture, either 32- or 64-bit. If your computer is less than ten years old, it's probably 64-bit. Here are direct links to the 64-bit version and the 32-bit version.

PHYLIP is actually a suite of related programs. There is an included program that generates graphically nicer trees and we'll look at that in a future post, but for now, we only need one of the programs, protpars, which is an abbreviation of protein parsimony. Since the executables are standalone, we can copy the protpars.exe executable out of the ZIP archive and into our Phylogeny directory and use it exactly as we did muscle.exe. Either from your browser or from your Downloads folder (varies by Windows version; if you can't find it, reply or PM me), open the ZIP file. You don't need to extract the whole thing if you don't want to, but navigate to the exe folder and find protpars.exe. Copy that file, then paste it into the Phylogeny folder on your desktop.

Open your command prompt and navigate to your Phylogeny folder. On my system, the command to do that is:

Code: Select all

cd \Users\Difflugia\Desktop\Phylogeny
We should still have our files from before, plus protpars.exe. Typing "dir" into my command prompt shows me this:

Code: Select all

 Directory of C:\users\Difflugia\Desktop\Phylogeny

01/10/2022  08:24p      <DIR>          .
01/10/2022  03:23p      <DIR>          ..
01/10/2022  03:49p               9,444 hominid_cox.txt
01/10/2022  04:03p              10,158 infile
01/10/2022  03:23p             353,792 muscle.exe
01/10/2022  04:02p             241,141 protpars.exe
               4 File(s)        614,535 bytes
               2 Dir(s)  40,425,250,816 bytes free

C:\Users\Difflugia\Desktop\Phylogeny>
If we run protpars.exe without any arguments, it will assume that you want to read infile and write the output to the files outfile and outtree. If infile is missing or if either outfile or outtree exists, it will offer other options. If you want to keep these files for later, it's a good idea to make copies with more meaningful names. Remember, though, that as long as you keep the file with the source data (hominid_cox.txt), you can regenerate the tree later.

So, with infile in your folder, we should be able to run protpars.exe and quickly generate a tree. Let's do that. Enter the command "protpars" at the prompt. You should see the following dialog:

Code: Select all

C:\Users\Difflugia\Desktop\Phylogeny>protpars


Protein parsimony algorithm, version 3.698

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  C               Use which genetic code?  Universal
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  IBM PC
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
I'll explain several of these options in later posts, but these should be fine for now, so type "y" and hit ENTER. You should see this:

Code: Select all

Adding species:
   1. Baboon    
   2. Sumatran O
   3. Bornean Or
   4. Gorilla   
   5. Heidelberg
   6. Chimpanzee
   7. Bonobo    
   8. Neandertal
   9. Human     

Doing global rearrangements
  !-----------------!
   .................
   .................

Output written to file "outfile"

Trees also written onto file "outtree"

Done.

Press enter to quit.
The file outfile contains an ASCII representation of the tree. We can either open it in a text editor like Notepad or use the type command to output it to the terminal:

Code: Select all

C:\Users\Difflugia\Desktop\Phylogeny>type outfile

Protein parsimony algorithm, version 3.698



One most parsimonious tree found:




                       +--Bonobo    
              +--------6  
              !        +--Chimpanzee
           +--5  
           !  !        +--Human     
           !  !     +--8  
     +-----4  +-----7  +--Neandertal
     !     !        !  
     !     !        +-----Heidelberg
  +--3     !  
  !  !     +--------------Gorilla   
  !  !  
  1  !                 +--Bornean Or
  !  +-----------------2  
  !                    +--Sumatran O
  !  
  +-----------------------Baboon    

  remember: this is an unrooted tree!


requires a total of    232.000


C:\Users\Difflugia\Desktop\Phylogeny>
Along with a data source (which we'll get to in another post) and text editor, these two programs, MUSCLE and protpars from PHYLIP, are the only tools necessary to conduct sophisticated experiments and analyze the data yourself. The only real impediment is the readability of the trees themselves. This tree with only nine species isn't too bad, but drawing the chart in ASCII will quickly become unwieldy without a better way to visualize the tree. We'll look at that next.
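To give a feel for what "most parsimonious" and the "requires a total of" line are counting, here's a sketch of Fitch's algorithm, which counts the minimum number of changes a fixed tree needs to explain one column of data. This is a simplified stand-in that I'm naming plainly as such: protpars uses its own, more involved scoring for proteins, so don't expect these numbers to add up to its total. The tree below copies the shape of the tree in outfile, and the two columns are the second and third residues of each sequence as listed in hominid_cox.txt.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum changes) for one column.

    `tree` is a name (leaf) or a pair of subtrees; `states` maps each
    leaf name to its residue at the column being scored.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:
        return common, lcost + rcost        # children can agree: no change
    return lset | rset, lcost + rcost + 1   # children disagree: one change


# Same shape as the tree protpars printed, written as nested pairs.
tree = (
    (
        (
            (("Bonobo", "Chimpanzee"), (("Human", "Neandertal"), "Heidelberg")),
            "Gorilla",
        ),
        ("Bornean", "Sumatran"),
    ),
    "Baboon",
)

names = ["Baboon", "Chimpanzee", "Bonobo", "Gorilla", "Neandertal",
         "Human", "Heidelberg", "Sumatran", "Bornean"]
second_residue = dict(zip(names, "LFFFFFFFF"))
third_residue = dict(zip(names, "ITTTAAAAA"))

print(fitch(tree, second_residue)[1])
print(fitch(tree, third_residue)[1])
```

The second column needs a single change on this tree (only the baboon differs), while the third needs three. A parsimony program is, roughly speaking, searching for the tree shape that keeps the sum of such counts as low as possible.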

Re: Let's prove evolution!

Post #4

Post by Difflugia »

Now let's see if we can make trees that are easier to read. PHYLIP includes two programs, drawgram and drawtree, that will allow us to do that. Let's copy them out of the ZIP archive that we downloaded in the previous step. Open the ZIP archive, navigate to the exe folder, and copy the following files to your Phylogeny folder. They're not all together, so you may need to copy them in a few groups.
  • drawgram.dll
  • drawgram.exe
  • drawtree.dll
  • drawtree.exe
  • font1
  • font2
  • font3
  • font4
  • font5
  • font6
Now go to the command prompt in the Phylogeny directory and run the "dir" command. If you haven't deleted any files, it should look very much like this:

Code: Select all

 Directory of C:\Users\Difflugia\Desktop\Phylogeny

01/10/2022  11:37p      <DIR>          .
01/10/2022  03:23p      <DIR>          ..
01/10/2022  10:33p             259,335 drawgram.dll
01/10/2022  10:33p             235,832 drawgram.exe
01/10/2022  10:33p             269,628 drawtree.dll
01/10/2022  10:33p             248,248 drawtree.exe
01/10/2022  10:34p               5,934 font1
01/10/2022  10:34p              11,288 font2
01/10/2022  10:34p              16,985 font3
01/10/2022  10:34p              11,297 font4
01/10/2022  10:34p              16,886 font5
01/10/2022  10:34p              14,314 font6
01/10/2022  03:49p               9,444 hominid_cox.txt
01/10/2022  04:03p              10,158 infile
01/10/2022  03:23p             353,792 muscle.exe
01/10/2022  08:50p                 677 outfile
01/10/2022  10:40p                 126 outtree
01/10/2022  04:02p             241,141 protpars.exe
              16 File(s)      1,705,085 bytes
               2 Dir(s)  39,948,984,320 bytes free

C:\Users\Difflugia\Desktop\Phylogeny>
While outfile is an ASCII representation of the tree, outtree contains the data necessary to draw the tree as a nested set of binary pairs. We'll use it to generate a graphical tree using drawgram.
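The nested-pairs format is commonly called Newick. As an illustration of how little there is to it, here's a minimal sketch of a parser that turns such a string into nested Python lists. It only handles names, commas, and parentheses (no branch lengths or quoting), the function names are my own, and the example string is a cut-down tree I wrote for illustration rather than the actual contents of outtree.

```python
def parse_newick(text: str):
    """Parse a bare-bones Newick string into nested lists of names."""
    text = text.replace("\n", "").strip()
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # step past "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # step past ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # step past ")"
            return children
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos].strip()    # a leaf name

    return parse_node()


def leaves(tree):
    """List the leaf names of a parsed tree from top to bottom."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


example = "((Bonobo,Chimpanzee),Gorilla);"
tree = parse_newick(example)
print(tree)
print(leaves(tree))
```

Each pair of parentheses is one branch point, and whatever sits inside it descends from that point, which is all drawgram needs in order to draw the tree.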

One annoying limitation of PHYLIP is that organism names are limited to ten characters. The drawgram software doesn't have this limitation, however, so I modified outtree to contain longer names. Here's my updated outtree file.

Code: Select all

(((((Bonobo,Chimpanzee),((Human,Neandertal Man),Heidelberg Man)),Gorilla),
(Bornean Orangutan,Sumatran Orangutan)),Baboon);
Now enter "drawgram" into the command window.

Code: Select all

C:\Users\Difflugia\Desktop\Phylogeny>drawgram
Drawgram: can't find input tree file "intree"
Please enter a new file name>
This program is looking for a default file like protpars did, but it doesn't exist. Instead, type "outtree" at the prompt.

Code: Select all

Please enter a new file name> outtree
DRAWGRAM from PHYLIP version 3.698
Reading tree ... 
Tree has been read.
Loading the font .... 
Drawgram: can't find font file "fontfile"
Please enter a new file name>
Now pick one of the font files to use. The first five font files are various combinations of serif and italic fonts and the sixth is Cyrillic. You can play around with them and see what you'll like, but I'll just use font1. Type "font1" at the prompt. We'll get another menu.

Code: Select all

Please enter a new file name> font1
Font loaded.

Rooted tree plotting program version 3.698

Here are the settings: 
 0  Screen type (IBM PC, ANSI):  IBM PC
 P       Final plotting device:  Postscript printer
 (Preview no longer available)
 H                  Tree grows:  Horizontally
 S                  Tree style:  Phenogram
 B          Use branch lengths:  (no branch lengths available)
 L             Angle of labels:  90.0
 R      Scale of branch length:  Automatically rescaled
 D       Depth/Breadth of tree:  0.53
 T      Stem-length/tree-depth:  0.05
 C    Character ht / tip space:  0.3333
 A             Ancestral nodes:  Centered
 F                        Font:  Times-Roman
 M          Horizontal margins:  1.65 cm
 M            Vertical margins:  2.16 cm
 #              Pages per tree:  one page per tree

 Y to accept these or type the letter for one to change
The default setting will create a PostScript file. We want to change it into an image we can use. Enter "p" at the prompt. I won't bother showing all of the options, but enter "w" at the next prompt for a Windows bitmap. It will now ask you for a resolution. I find that I get the best results by using 2000 for both X and Y and sizing the image down later with image software, but that depends on your image software. MS Paint is better than nothing, but barely. Once again, you'll be back at the list of options. Enter "Y". The image is named "plotfile" with no extension. Most programs are OK with this, but some fail to load it. If that's a problem, rename the file "plotfile.bmp".

You can load it into an image program to look at it or copy-paste it into a document, but if you want to upload it to a website, it will need to be something like a .gif, .jpg, or .png. The easiest way to do this is to load it into an image program (MS Paint is fine for this) and then "Save as" one of the other formats. I use the open source, command line program ImageMagick instead, though explaining it is beyond the scope of this post. There are downloads for Windows and Linux.

Here's the tree, generated at 2000x2000, converted to grayscale .jpg, then resized to 300x300:

Image

The ImageMagick command to do this is:

Code: Select all

convert plotfile -colorspace gray -resize 300x300 hominid_cox.jpg
After setting the image type and resolution, you can change diagram styles by playing with the options. Here is the "curvogram" style with labels set to 110°.

Image

The drawtree program works exactly the same way, except the diagram is rooted in the center instead of the left edge.

As I said, this step is optional and can probably be skipped until you're ready to "present your findings," but it's certainly more polished than ASCII art. Next, we'll look at how to both understand the input data and examine what the resulting tree does and doesn't tell us.

Re: Let's prove evolution!

Post #5

Post by Difflugia »

So, what does this mean?

Each letter represents a different amino acid. Though slightly oversimplified (I'll get more specific later), a mutation in this sequence means that one amino acid is exchanged for another, one or more amino acids are added into the sequence (an "insertion"), or amino acids are removed (a "deletion").

As an example, let's take the beginning of the chimpanzee sequence, MFTDRWL. That represents the following sequence of amino acids:
  • Methionine
  • Phenylalanine
  • Threonine
  • Aspartic Acid
  • Arginine
  • Tryptophan
  • Leucine
The first kind of mutation is a simple substitution, known as a "point" mutation. The aspartic acid in the fourth position, for example, might be changed into asparagine, which can happen if a single guanine in the DNA sequence changes to an adenine. The sequence would then be represented as MFTNRWL.

An insertion occurs when some extra DNA ends up in the strand. One way this can happen is if a base gets repeated a few times. Three adenines in a row is a codon for lysine (K), for example, so if an extra six adenines ended up in the DNA sequence between the codons for phenylalanine and threonine, then our mutated sequence would be MFKKTDRWL. A deletion is similar. If the DNA strand is replicated without the tryptophan codon, the resulting sequence would look like MFTDRL.
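To make the three kinds of mutation concrete, here is a minimal Python sketch that applies each one to a stretch of DNA and translates the result. The DNA string is a hypothetical one chosen only because it encodes MFTDRWL (it is not the real chimpanzee sequence), and the codon table is deliberately partial, listing only the codons this example needs.

```python
# Partial codon table (an assumption for this example): only the codons used below.
CODONS = {
    "ATG": "M", "TTT": "F", "ACT": "T", "GAT": "D", "AAT": "N",
    "CGT": "R", "TGG": "W", "CTT": "L", "AAA": "K",
}


def translate(dna):
    """Translate a DNA string (length a multiple of 3) into one-letter amino acids."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Hypothetical DNA that encodes MFTDRWL; not the real chimpanzee sequence.
original = "ATGTTTACTGATCGTTGGCTT"

# Point mutation: G -> A at the first base of the fourth codon (GAT -> AAT).
point = original[:9] + "A" + original[10:]

# Insertion: six adenines (two AAA codons) between the F and T codons.
insertion = original[:6] + "AAAAAA" + original[6:]

# Deletion: the tryptophan codon (TGG) is dropped.
deletion = original[:15] + original[18:]

for label, dna in [("original", original), ("point", point),
                   ("insertion", insertion), ("deletion", deletion)]:
    print(label, translate(dna))
```

The four translations are MFTDRWL, MFTNRWL, MFKKTDRWL, and MFTDRL, matching the examples in the text.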

The alignment software (MUSCLE) helps the tree generation software by aligning sequences of different lengths into the most likely pattern of insertions and deletions. Aligning the examples would result in the following:

Code: Select all

MF--TDRWL
MF--TNRWL
MFKKTDRWL
MF--TDR-L
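One thing that becomes possible once the rows are aligned is counting, column by column, how two sequences differ. The Python sketch below does that for the four example rows. The function name is made up for illustration, and treating every gap column as one difference is a simplifying assumption: the two-column KK insertion counts as two differences even though it was described above as a single event.

```python
from itertools import combinations

# The four aligned example rows from the text, labeled by how they were made.
aligned = {
    "original": "MF--TDRWL",
    "point": "MF--TNRWL",
    "insertion": "MFKKTDRWL",
    "deletion": "MF--TDR-L",
}


def column_differences(a, b):
    """Count the alignment columns in which two aligned rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Print the difference count for every pair of rows.
for first, second in combinations(aligned, 2):
    print(first, second, column_differences(aligned[first], aligned[second]))
```

Without the alignment step, the rows would have different lengths and a column-by-column comparison like this would not be meaningful.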
The tree-generation software then either counts mutations and finds the tree that requires the minimum number (maximum parsimony) or weights the probability of each mutation event and finds the tree whose paths are together most likely (maximum likelihood). With robust data (a high signal-to-noise ratio), both methods tend to find the same trees. Maximum likelihood tends to find better trees with less robust data, so it's often better at placing small clades that are distant from their neighboring clades (whales, for example), but it's computationally more expensive than maximum parsimony.
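The "count mutations and pick the smallest total" idea can be sketched in a few lines of Python. This is plain Fitch parsimony with every substitution weighted equally, which is an assumption made for simplicity; protpars uses a more elaborate protein-specific scheme, so treat this as an illustration of the principle and not a reimplementation. The taxa t1 to t4 and their sequences are invented toy data.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one column.

    A tree is either a taxon name (leaf) or a pair of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:  # the subtrees can agree, so no extra change is needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over every alignment column."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Invented toy alignment (no gaps), purely for illustration.
alignment = {
    "t1": "MFTDRWL",
    "t2": "MFTNRWL",
    "t3": "MFSNRWI",
    "t4": "MFSDRWI",
}

# The three possible ways to pair up four taxa.
topologies = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t3"), ("t2", "t4")),
    (("t1", "t4"), ("t2", "t3")),
]

scores = {tree: parsimony_score(tree, alignment) for tree in topologies}
best = min(scores, key=scores.get)
print(best, scores[best])
```

With this toy data, grouping t1 with t2 needs 4 changes, while the other two groupings need 6 and 5. The fourth column (D, N, N, D) disagrees with the winning tree, which is a small example of two sites pointing in different directions and the tree with the fewest total changes being preferred.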

The "signal-to-noise" concept will be important to us in several practical ways. In this context, "signal" is the number of mutations we can identify between two sequences. If each mutation can be accounted for, then we can easily infer a pattern of descent. Any two sequences that share a mutation also share a common ancestor that lived after that mutation occurred. Any sequence that doesn't share that mutation is more distantly related.

The competing factor is "noise," which is essentially anything that obscures the exact number of mutations between those sequences. If a point mutation occurs changing one nucleotide into another, but then a second point mutation occurs at the same locus, that's "noisy" because it looks like only one mutation occurred, but it was actually two. Similarly, if one branch experiences a mutation and a neighboring branch happens to experience the same mutation, then the data are noisy because once again, what looks like one mutation was actually two.
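The way repeated hits at the same locus hide mutations can be seen in a small simulation. The sketch below is a toy model under loud assumptions: every site is equally likely to mutate, every replacement base is equally likely, and the sequence length, seed, and event counts are arbitrary. It compares the number of mutation events that actually happened with the number of differences that remain visible.

```python
import random


def mutate(seq, n_events, alphabet, rng):
    """Apply n_events point mutations at random positions (sites can be hit twice)."""
    seq = list(seq)
    for _ in range(n_events):
        pos = rng.randrange(len(seq))
        choices = [c for c in alphabet if c != seq[pos]]
        seq[pos] = rng.choice(choices)
    return "".join(seq)


def observed_differences(a, b):
    """Count the sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(1)  # fixed seed so repeated runs behave the same
ancestor = "".join(rng.choice("ACGT") for _ in range(100))

# Compare actual events with visible differences as mutations pile up.
for events in (5, 50, 500):
    descendant = mutate(ancestor, events, "ACGT", rng)
    print(events, "events,", observed_differences(ancestor, descendant), "visible")
```

The visible count can never exceed the number of events or the sequence length, so once the number of events is large relative to the length, most of the history is no longer recoverable from a simple comparison.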

There are a few ways that this affects our data selection. First, a DNA sequence with no effect whatsoever ("junk" DNA) has a high signal value in the short term. Mutations that aren't subject to natural selection can accumulate relatively rapidly. This allows closely related organisms to be compared, including even cells within a single person, but that signal is quickly eroded as new mutations at the same loci obscure evidence of prior mutations. At the other extreme are sequences in which even small changes are deleterious, leading to the elimination of those mutations by natural selection. These sequences often have very low noise because a mutation anywhere is unlikely, let alone two in the same place. Unfortunately, there's not much signal there, either.

The goal is to select data that have enough signal to overcome the noise. If one is working with close lineages, then unconserved sequences are preferable. If the organisms are related closely enough, then even junk DNA can be considered homologous by virtue of being at the same chromosomal location, for example. If enough shared sequence still exists to infer a reasonably accurate mutation rate, then resolution in such cases can be very high. On the other hand, more highly conserved sequences allow comparison between more distantly related organisms. The problem there is less often the presence of noise than the absence of signal. Longer sequences are often necessary to make statistical inferences.

The data that I selected for the phylogeny of great apes, for example, come from three genes. The gene products are later combined into a single enzyme, cytochrome c oxidase, and the three subunits used here, abbreviated COX1, COX2, and COX3, each have their own gene. I often use just one of them, COX2, when comparing wider groups of taxa, but the great apes are too closely related to each other for that. Chimpanzees and bonobos, for example, share exactly the same COX2 sequence of 227 amino acids, and humans differ from that sequence by only five amino acids. Adding COX1 and COX3 extends that to 1001 amino acids. The practical effect is that with just COX2, protpars finds several equally parsimonious trees of potential descent. Roughly quadrupling the input data removes the apparent ambiguity, narrowing the results to a single tree.
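Combining several genes into one longer input amounts to joining each organism's sequences end to end. Here is a minimal Python sketch of that step, assuming each gene has already been aligned separately. The gene names, taxon names, and four-letter sequences are placeholders invented for the example, not real COX data.

```python
def concatenate(genes, taxa):
    """Join each taxon's aligned rows, gene by gene, into one long row."""
    for name, alignment in genes.items():
        lengths = {len(alignment[t]) for t in taxa}
        if len(lengths) != 1:
            raise ValueError(f"{name}: rows are not aligned to the same length")
    return {t: "".join(alignment[t] for alignment in genes.values()) for t in taxa}


# Placeholder data: taxon_a and taxon_b are identical in gene_x but not in gene_y.
genes = {
    "gene_x": {"taxon_a": "MFTD", "taxon_b": "MFTD", "taxon_c": "MFTN"},
    "gene_y": {"taxon_a": "KLVA", "taxon_b": "KLIA", "taxon_c": "KLIA"},
}

combined = concatenate(genes, ["taxon_a", "taxon_b", "taxon_c"])
for taxon, row in combined.items():
    print(taxon, row)
```

In the placeholder data, two taxa that cannot be told apart using gene_x alone become distinguishable once gene_y is appended, which is the same effect described above for adding COX1 and COX3 to COX2.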

Re: Let's prove evolution!

Post #6

Post by Difflugia »

While trying to motivate myself to work up some interesting experiments that can be conducted with smallish datasets, I ran across an online textbook that might be interesting to someone wanting to learn a bit more about evolution, phylogeny, and their relationship to the fossil record.

The Digital Atlas of Ancient Life is a website laid out linearly into chapters with some interactive questions along the way. The chapter on systematics is most applicable to what I'm trying to accomplish with this thread.