PDA - Help




 

 Index

1. Introduction to PDA

2. User defined parameters in the PDA web form

Input
Main parameters
ClustalW parameters

3. Estimation Optimization Method

4. PDA Output

HTML output
MySQL / Microsoft Access database
Description of all the parameters
Alignments quality

5. Histogram Maker tool

6. Download PDA

See a video demonstration of the program:

QuickDemo.swf  (8,334 KB)

QuickDemo.gif  (2,011 KB)

 

1. Introduction to PDA:

PDA, "Pipeline Diversity Analysis", is a collection of programs and modules, mainly written in Perl, that can automatically:

  1. search for polymorphic sequences in a large database, and

  2. estimate their genetic diversity.

PDA has a user-friendly, web-based interface where the user can select the sequences to be analyzed and the parameters to be used. Sequences can be retrieved from either Genbank (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide; see the NCBI's Disclaimer and Copyright) or the DPDB database (http://dpdb.uab.es), specified as a list of accession numbers or as a set of organisms and/or genes. Low-quality sequences from large-scale sequencing projects (i.e. working drafts), which contain most of the missing data, are excluded from the analysis. Alternatively, sequences can be entered manually in Fasta or Genbank format. All sequences are grouped by organism and gene, and each group is aligned using the ClustalW algorithm. The different analyses are then performed on the alignments.

 

2. User defined parameters in the PDA web form:

Input:

First, you have to indicate whether the sequences:

  1. are to be retrieved from a database, or

  2. will be provided by you in Fasta or Genbank format.

 

If you want the sequences to be retrieved from a database, you must first choose the source database. At present, two options are available: Genbank or DPDB. Note that DPDB is a database of nucleotide sequences from the genus Drosophila, so use it if you are interested in sequences from this taxonomic group. Then choose whether to enter a list of organisms and/or genes, or a list of accession numbers from the source database. Each item must be on its own line.

 

Otherwise, if you want to enter the sequences yourself, paste them into the form or use the file browser to select the appropriate file on your computer. The program can read two formats: Fasta and Genbank. Follow the instructions given below:


Fasta format:

Each new sequence begins with a line >HEADER. The sequence follows on the next lines until the next >HEADER line is found. Note that you cannot specify any sequence annotation using this format. However, you can specify the organism and gene names using the following syntax in the header:

>Organism|gene
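
As a rough illustration of the header syntax above, the following Python sketch reads Fasta text and splits each header into organism and gene. It is not PDA's own parser (PDA is written mainly in Perl); the function name parse_fasta and the handling of headers without a "|" are assumptions made for this example.

```python
def parse_fasta(text):
    """Return a list of (organism, gene, sequence) tuples from Fasta text.

    Hypothetical helper: headers are expected as '>Organism|gene'.
    A header without '|' yields gene = None (an assumption of this sketch).
    """
    records = []
    organism = gene = None
    chunks = []
    in_record = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if in_record:
                records.append((organism, gene, "".join(chunks)))
            header = line[1:].strip()
            # Split on the first "|" only.
            organism, _, gene = header.partition("|")
            organism = organism.strip()
            gene = gene.strip() or None
            chunks = []
            in_record = True
        elif in_record:
            chunks.append(line)  # sequence lines are concatenated
    if in_record:
        records.append((organism, gene, "".join(chunks)))
    return records


example = ">Drosophila auraria|COII\nACGT\nTTGA\n>Drosophila simulans|Adh\nGGCC\n"
for organism, gene, sequence in parse_fasta(example):
    print(organism, gene, len(sequence))
```

The organism and gene names in the usage lines are placeholders chosen only to show the syntax.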


 

Genbank format:

You can retrieve sequences in Genbank format from the Genbank database. Note that each record must end with a line containing only two slashes (//), as in the example:

 
LOCUS       AY147419                 486 bp    DNA     linear   INV 12-JAN-2004
DEFINITION  Drosophila auraria isolate DPAJ1325 histone H2A gene, partial cds;
            H2A/H2B intergenic spacer region, complete sequence; and histone
            H2B gene, partial cds.
ACCESSION   AY147419
VERSION     AY147419.1  GI:27368158
KEYWORDS    .
SOURCE      Drosophila auraria
  ORGANISM  Drosophila auraria
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 486)
  AUTHORS   Yang,Y., Zhang,Y.P., Qian,Y.H. and Zeng,Q.T.
  TITLE     Phylogenetic relationships of Drosophila melanogaster species group
            deduced from spacer regions of histone gene H2A-H2B
  JOURNAL   Mol. Phylogenet. Evol. 30 (2), 336-343 (2004)
   PUBMED   14715225
REFERENCE   2  (bases 1 to 486)
  AUTHORS   Yang,Y. and Zhang,Y.P.
  TITLE     Direct Submission
  JOURNAL   Submitted (03-SEP-2002) Life Science, Hubei University, Wuhan,
            Hubei 430062, China
FEATURES             Location/Qualifiers
     source          1..486
                     /organism="Drosophila auraria"
                     /mol_type="genomic DNA"
                     /isolate="DPAJ1325"
                     /db_xref="taxon:47315"
     mRNA            complement(<1..>132)
                     /product="histone H2A"
     CDS             complement(<1..132)
                     /codon_start=1
                     /product="histone H2A"
                     /protein_id="AAN87198.1"
                     /db_xref="GI:27368159"
                     /translation="MSGRGKGGKVKGKAKSRSNRAGLQFPVGRIHRLLRKGNYAERVG
                     "
     misc_feature    133..348
                     /note="histone H2A/H2B intergenic spacer"
     mRNA            <349..>486
                     /product="histone H2B"
     CDS             349..>486
                     /codon_start=1
                     /product="histone H2B"
                     /protein_id="AAN87199.1"
                     /db_xref="GI:27368160"
                     /translation="MPPKTSGKAAKKAGKAQKTSPRTTRRRSGRGRRALLIYIYKVLK
                     QV"
ORIGIN     
        1 accaacgcgc tcggcatagt tgcccttgcg gagcagacga tgaatacggc cgactgggaa
       61 ctggagtccg gcacggtttg agcgggactt tgcctttccc tttactttgc caccttttcc
      121 acgaccagac attttctttt atttcacttt attcacttca cacagacgaa gaacgaatgt
      181 tggtgcaacc caagttgtca cgaatttata cttttaggtc tgcttgcgcg ttcagtttgg
      241 ggtgggtcga cttagacctg aaaacattgc tggaaaaaaa gtataagagc gaacaccaaa
      301 actcgtctac catattaagt gaatcgtcaa gtgaagtgaa gtgaaataat gccgccgaaa
      361 actagtggaa aggcagccaa gaaggctggc aaggctcaga agacatcacc aagaacgaca
      421 agaagaagaa gcggaagagg aaggagagct ttgcttatct acatttacaa ggtcctgaag
      481 caggtc
//

LOCUS       AY147418                 486 bp    DNA     linear   INV 12-JAN-2004
DEFINITION  Drosophila auraria isolate YY001325 histone H2A gene, partial cds;
            H2A/H2B intergenic spacer region, complete sequence; and histone
            H2B gene, partial cds.
ACCESSION   AY147418
VERSION     AY147418.1  GI:27368155
KEYWORDS    .
SOURCE      Drosophila auraria
  ORGANISM  Drosophila auraria
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 486)
  AUTHORS   Yang,Y., Zhang,Y.P., Qian,Y.H. and Zeng,Q.T.
  TITLE     Phylogenetic relationships of Drosophila melanogaster species group
            deduced from spacer regions of histone gene H2A-H2B
  JOURNAL   Mol. Phylogenet. Evol. 30 (2), 336-343 (2004)
   PUBMED   14715225
REFERENCE   2  (bases 1 to 486)
  AUTHORS   Yang,Y. and Zhang,Y.P.
  TITLE     Direct Submission
  JOURNAL   Submitted (03-SEP-2002) Life Science, Hubei University, Wuhan,
            Hubei 430062, China
FEATURES             Location/Qualifiers
     source          1..486
                     /organism="Drosophila auraria"
                     /mol_type="genomic DNA"
                     /isolate="YY001325"
                     /db_xref="taxon:47315"
     mRNA            complement(<1..>132)
                     /product="histone H2A"
     CDS             complement(<1..132)
                     /codon_start=1
                     /product="histone H2A"
                     /protein_id="AAN87196.1"
                     /db_xref="GI:27368156"
                     /translation="MSGRGKGGKVKGKAKSRSNRAGLQFPVGRIHRLLRKGNYAERVG
                     "
     misc_feature    133..348
                     /note="histone H2A/H2B intergenic spacer"
     mRNA            <349..>486
                     /product="histone H2B"
     CDS             349..>486
                     /codon_start=1
                     /product="histone H2B"
                     /protein_id="AAN87197.1"
                     /db_xref="GI:27368157"
                     /translation="MPPKTSGKAAKKAGKAQKTSPRTTRRRSGRGRRALLIYIYKVLK
                     QV"
ORIGIN     
        1 accaacgcgc tcggcatagt tgcccttgcg gagcagacga tgaatacggc cgactgggaa
       61 ctggagtccg gcacggtttg agcgggactt tgcctttccc tttactttgc caccttttcc
      121 acgaccagac attttctttt atttcacttt attcacttca cacagacgaa gaacgaatgt
      181 tggtgcaacc caagttgtca cgaatttata cttttaggtc tgcttgcgcg ttcagtttgg
      241 ggtgggtcga cttagacctg aaaacattgc tggaaaaaaa gtataagagc gaacaccaaa
      301 actcgtctac catattaagt gaatcgtcaa gtgaagtgaa gtgaaataat gccgccgaaa
      361 actagtggaa aggcagccaa gaaggctggc aaggctcaga agacatcacc aagaacgaca
      421 agaagaagaa gcggaagagg aaggagagct ttgcttatct acatttacaa ggtcctgaag
      481 caggtc
//

LOCUS       AF461290                 384 bp    DNA     linear   INV 30-SEP-2003
DEFINITION  Drosophila auraria cytochrome oxidase II (COII) gene, partial cds;
            mitochondrial gene for mitochondrial product.
ACCESSION   AF461290
VERSION     AF461290.1  GI:20805483
KEYWORDS    .
SOURCE      mitochondrion Drosophila auraria
  ORGANISM  Drosophila auraria
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 384)

...
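
Before pasting a Genbank file, it can help to check that every record is closed by its "//" line. The sketch below, in Python, splits pasted text on those terminator lines and reports any trailing text that lacks one. The function name and the choice to return the unterminated remainder separately are assumptions of this example, not part of PDA.

```python
def split_genbank_records(text):
    """Split Genbank flat-file text into records terminated by a '//' line.

    Returns (records, leftover): 'leftover' holds any text after the last
    '//' line, i.e. a record that is missing its terminator.
    """
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        elif line.strip() or current:
            # Skip blank lines between records, keep them inside a record.
            current.append(line)
    leftover = "\n".join(current).strip()
    return records, leftover


sample = "LOCUS A\nORIGIN\n 1 acgt\n//\n\nLOCUS B\nORIGIN\n 1 ttga\n//\nLOCUS C\n"
records, leftover = split_genbank_records(sample)
print(len(records), "complete records; unterminated text:", repr(leftover))
```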


 

Main parameters:

By default, PDA analyzes the alignments in terms of Polymorphism, Synonymous and non-synonymous substitutions (1), Linkage disequilibrium (2) and Codon bias (3). You can disable the last three analyses. Linkage disequilibrium is estimated in non-overlapping sliding windows. You can choose the window length, which is set to 50 segregating sites by default; the maximum value accepted on the web is 100 consecutive segregating sites (be aware that windows may differ in length when measured in nucleotides).
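
To see why windows defined by a number of segregating sites can differ in nucleotide length, consider the following Python sketch. It cuts a list of segregating-site positions into consecutive, non-overlapping windows. The function names, the capping of the window at 100 sites, and the decision to keep a shorter final window are assumptions of this example; the source does not say how PDA handles a final partial window.

```python
def segregating_site_windows(positions, window=50, max_window=100):
    """Cut segregating-site positions into non-overlapping windows.

    positions  : alignment coordinates of the segregating sites
    window     : number of consecutive segregating sites per window
    max_window : upper limit applied to 'window' (100 on the web form)
    The last window may hold fewer sites (assumption of this sketch).
    """
    size = min(window, max_window)
    if size < 1:
        raise ValueError("window must be at least 1")
    ordered = sorted(positions)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def window_spans(windows):
    """Length in nucleotides covered by each window."""
    return [w[-1] - w[0] + 1 for w in windows]


sites = [3, 10, 11, 40, 95, 96, 200]
windows = segregating_site_windows(sites, window=3)
print(windows)
print(window_spans(windows))
```

In this toy input every full window holds three segregating sites, yet the windows cover different numbers of nucleotides.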

You can also choose which Gene regions you want PDA to analyze separately (4). Regions are those defined in the Genbank annotations, which are also used in the DPDB database. CDS and exon are chosen by default.

The Minimum number of sequences per category is the minimum number of sequences required for an alignment (5). The default is 2.

The Minimum ClustalW Score for pairwise comparisons is the minimum percentage of homology required in every pairwise sequence comparison within a final alignment (6). The aim of this parameter is to separate fragmented or incorrectly annotated sequences into different subgroups, but it is also very useful for placing sequences of the same organism that come from well-separated populations into different alignments. A sequence is never used in more than one subgroup.

The Minimum sequences length in the analyses makes the program exclude sequences shorter than the number of nucleotides entered (7).

A Warning message will appear in the Alignments section if the proportion of excluded sites within an alignment (excluding end gaps) is greater than the percentage you enter (8).

The results are provided both as HTML pages and as a relational database, which can be in either MySQL or Microsoft Access format (9).

Finally, you can enter your e-mail address if you wish to be notified as soon as the results are available on our server (10). This is optional, because a unique ID is assigned to your submitted job and you can retrieve the results using this ID: go to "Request results by ID" (at the top of the PDA Home Page) and enter the unique ID in the box.

 

ClustalW parameters:

You can also modify the default ClustalW parameters, although these have been optimized for polymorphism analyses.

Fast pairwise alignment:

  • K-tup: Can be 1 or 2 for proteins; 1 to 4 for DNA. Increase this to increase speed; decrease to improve sensitivity.
     

  • Window length: The number of diagonals around each "top" diagonal that are considered. Decrease for speed; increase for greater sensitivity.
     

  • Score type: The similarity scores may be expressed as raw scores (number of identical residues minus a "gap penalty" for each gap) or as percentage scores. If the sequences are of very different lengths, percentage scores make more sense.
     

  • Topdiag: The number of best diagonals in the imaginary dot-matrix plot that are considered. Decrease (must be greater than zero) to increase speed; increase to improve sensitivity.
     

  • Pairgap: The number of matching residues that must be found in order to introduce a gap. This should be larger than K-Tuple Size. This has little effect on speed or sensitivity.
     

Multiple alignment:

  • Gap open: Reduce this to encourage gaps of all sizes; increase it to discourage them. Terminal gaps are penalized the same as all others unless END GAPS is not selected. BEWARE of making this too small (approx. 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.
     

  • End gaps: Here you can select if you want the terminal gaps to be penalized or not.
     

  • Gap extension: Reduce this to encourage longer gaps; increase it to shorten them. Terminal gaps are penalized the same as all others. BEWARE of making this too small (approx. 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.
     

  • Gap distances: Gaps that are closer together than this distance are penalized more than other gaps.

 

3. Estimation Optimization Method:

After the sequences have been grouped and aligned, an optional further step is taken before estimating the polymorphism parameters. It is referred to here as the Estimation Optimization Method:

  1. First, PDA groups the sequences by length, so that sequences in the same group differ by no more than 20% in length.

  2. It calculates the number of informative sites in each cumulative group of sequences (e.g. group 1 (longest sequences), groups 1 + 2, groups 1 + 2 + 3, etc.).

  3. PDA uses the set of sequences with the largest number of informative sites (in some cases discarding the shortest sequences).

Example:

>LDseq000001
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGGTTTT   50
>LDseq000002
AGCATCGATCATCGTGTACGTACGTACGATCAGCCGATGCGCGGGGTTTT   50
>LDseq000003
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGG----   46
>LDseq000004
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGC--------   42
>LDseq000005
AGCATCG-------------------------------------------    7
>LDseq000006
AGCATCG-------------------------------------------    7
>LDseq000007
AGCATCG-------------------------------------------    7
>LDseq000008
AGCATCG-------------------------------------------    7

In this example, the first four sequences would be assigned to group 1, and the last four sequences to group 2. The number of informative sites (without gaps) using the first four sequences (group 1) is:

Informative sites group 1 = 42 non-gapped positions * 4 sequences = 168

Using the cumulative set of sequences of group 1 + group 2, we have more sequences, but fewer non-gapped positions:

Informative sites group 1+2 = 7 non-gapped positions * 8 sequences = 56

Therefore, we obtain more informative sites by using only the four long sequences and discarding the short ones. PDA would show the alignment with all the sequences, but would use only the four long sequences to calculate the polymorphism estimates (n = 4 in the results).
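
The three steps and the worked example above can be reproduced with a short Python sketch. It is only an illustration of the described procedure, not PDA's code: the function names are hypothetical, and two details are assumptions of this sketch, namely that the 20% rule is measured against the longest sequence of each group and that ties are resolved in favor of the smaller set.

```python
def ungapped_length(seq):
    """Number of non-gap characters in an aligned sequence."""
    return sum(1 for base in seq if base != "-")


def group_by_length(aligned, tolerance=0.20):
    """Group sequence indices by length, longest first.

    A sequence joins the current group if it is at least (1 - tolerance)
    times as long as that group's longest member (assumed reading of the
    20% rule).
    """
    order = sorted(range(len(aligned)), key=lambda i: -ungapped_length(aligned[i]))
    groups = []
    for i in order:
        length = ungapped_length(aligned[i])
        if groups and length >= (1 - tolerance) * ungapped_length(aligned[groups[-1][0]]):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def informative_sites(aligned, members):
    """Non-gapped columns among 'members' times the number of members."""
    columns = sum(1 for col in zip(*(aligned[i] for i in members)) if "-" not in col)
    return columns * len(members)


def optimize(aligned, tolerance=0.20):
    """Return (indices of sequences to use, informative sites)."""
    best, best_score, cumulative = [], -1, []
    for group in group_by_length(aligned, tolerance):
        cumulative = cumulative + group
        score = informative_sites(aligned, cumulative)
        if score > best_score:  # ties keep the earlier, smaller set
            best, best_score = list(cumulative), score
    return sorted(best), best_score


# The alignment from the example above.
base = "AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGGTTTT"
variant = "AGCATCGATCATCGTGTACGTACGTACGATCAGCCGATGCGCGGGGTTTT"
alignment = [base, variant, base[:46] + "-" * 4, base[:42] + "-" * 8]
alignment += ["AGCATCG" + "-" * 43] * 4

print(optimize(alignment))
```

Run on the example alignment, the sketch selects the first four sequences with 168 informative sites, matching the hand calculation.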

To distinguish the sequences that were used in the analyses from those that were discarded, PDA uses a color code:

    one color for sequences that were included in the estimates, and
    another color for sequences that were NOT included in the estimates.

You can find this information in the ALIGNMENTS page.

If you always want to use all the sequences of the alignments to calculate the estimates, clear the corresponding checkbox in the initial form:

    Use the ESTIMATE OPTIMIZATION METHOD

 

4. PDA output:

The output of the program is stored on our server for a week. It includes a set of HTML pages with the results of the alignments and analyses, as well as a database with the same information plus all the sequences used and their annotations.
 

HTML output:

The main HTML results page is divided into several sections:

  1. The input parameters are described in the first section (organism/gene, input database, analyses performed, etc.)
     

  2. The second section allows you to download the database and see the LOG files (where the program records any errors that occur during the analyses)
     

  3. Then, you can see all the alignments performed. If you click on "View complete information of alignments", all the information for each individual alignment will be displayed in four sections (each corresponding to some columns):
     

    • Alignment information

      • Alignment ID

      • Organism

      • Gene

      • Type of region

      • Region
         

    • Alignment quality  (Take special care with alignments that have high values for these parameters)

      • % excluded sites in the alignment (due to gaps or ambiguous positions). If the value is greater than 30% (or the user defined value), a warning message will appear, and the value will be marked in red.

      • Minimum/Maximum sequence lengths (% difference)
         

    • Alignment

      • Clustal align (text file)

      • Fasta align (text file)

      • Jalview align (graphic viewer / editor)

      • Clustal output

      • DND Tree file (phylogeny)
         

    • Sequences used

      • Genbank / EMBL accession number (and links to both databases)

      • Assigned sequence ID (identification number used in alignments)

      • Location (corresponding to the original sequence from Genbank/EMBL)
         

  4. Averages of the main statistics for Polymorphism, Synonymous and non-synonymous substitutions, Linkage disequilibrium and Codon bias, by gene regions (including all the studied organisms and genes). You can view the complete results for each alignment by clicking on the appropriate "View complete "Analysis" results" link.
     

  5. Histogram Maker Tool
     


MySQL / Microsoft Access database:

The database can be downloaded as a compressed file from the main HTML results page or, if the chosen format is MySQL, accessed directly on our server (see a quick help). Its table structure is shown in the following figure:


Description of all the parameters:
 

Polymorphism
Parameter               Description [Reference]
G+C                     G+C percentage
n                       Number of sequences
m                       Number of nucleotides in the alignment (alignment length)
start                   First nucleotide in the analysis
end                     Last nucleotide in the analysis
excluded                Number of excluded positions in the analysis (gap or ambiguous positions)
analyzed                Number of analyzed positions
S                       Number of segregating sites (S) [Nei 1987]
S / m                   Number of segregating sites per nucleotide
Theta (from S)          Theta θ per site (estimated from S)
Eta                     Minimum number of mutations (η) [Tajima 1996]
Eta / m                 Minimum number of mutations η per site
Theta (from Eta)        Theta θ per site (estimated from η)
Theta                   Theta θ per DNA sequence (estimated from S) [Tajima 1993]
V Theta (norec)         Variance of θ per DNA sequence (estimated from S) - without recombination
dV Theta (norec)        Standard deviation of θ per DNA sequence (estimated from S) - without recombination
V Theta (rec)           Variance of θ per DNA sequence (estimated from S) - free recombination
dV Theta (rec)          Standard deviation of θ per DNA sequence (estimated from S) - free recombination
Theta per site          Theta θ per site (estimated from S) [Nei 1987]
V Theta_site (norec)    Variance of θ per site (estimated from S) - without recombination
dV Theta_site (norec)   Standard deviation of θ per site (estimated from S) - without recombination
V Theta_site (rec)      Variance of θ per site (estimated from S) - free recombination
dV Theta_site (rec)     Standard deviation of θ per site (estimated from S) - free recombination
FSM Theta (from pi)     Theta θ per site (estimated from π) - Under a Finite Sites Model [Tajima 1996]
FSM Theta (from S)      Theta θ per site (estimated from S) - Under a Finite Sites Model
FSM Theta (from eta)    Theta θ per site (estimated from η) - Under a Finite Sites Model
k                       Average number of nucleotide differences [Tajima 1983]
Vst k (norec)           Stochastic variance of k - without recombination
Vs k (norec)            Sampling variance of k - without recombination
V k (norec)             Total variance of k - without recombination
Vst k (rec)             Stochastic variance of k - free recombination
Vs k (rec)              Sampling variance of k - free recombination
V k (rec)               Total variance of k - free recombination
Pi                      Average number of nucleotide differences per site (π) [Nei 1987]
Pi_JC                   Average number of nucleotide differences per site (π) - with Jukes&Cantor correction [Nei 1987; Jukes and Cantor 1969]
V Pi_JC                 Variance of π (Jukes&Cantor)
Tajima                  Tajima's D [Tajima 1989]
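
For orientation, the following Python sketch computes a few of the quantities listed above (analyzed, S, k, Pi, Theta and Theta per site) from a small alignment, using the standard textbook definitions: θ per sequence is S divided by a1 = 1 + 1/2 + ... + 1/(n-1). It is not PDA's implementation. The function name is hypothetical, and the choice to express the per-site values relative to the number of analyzed positions is an assumption of this sketch.

```python
from itertools import combinations


def polymorphism_summary(aligned):
    """Basic polymorphism statistics for a list of aligned DNA sequences."""
    n = len(aligned)
    if n < 2:
        raise ValueError("at least two sequences are required")
    valid = set("ACGT")
    # Columns with a gap or an ambiguous base are excluded from the analysis.
    columns = [col for col in zip(*(s.upper() for s in aligned)) if set(col) <= valid]
    analyzed = len(columns)
    # S: columns showing more than one nucleotide.
    seg_sites = sum(1 for col in columns if len(set(col)) > 1)
    # k: average number of differences over all pairs of sequences.
    pairs = list(combinations(range(n), 2))
    total_diffs = sum(1 for i, j in pairs for col in columns if col[i] != col[j])
    k = total_diffs / len(pairs)
    # Theta per sequence estimated from S.
    a1 = sum(1.0 / i for i in range(1, n))
    theta = seg_sites / a1
    return {
        "n": n,
        "analyzed": analyzed,
        "S": seg_sites,
        "k": k,
        "Pi": k / analyzed if analyzed else 0.0,
        "Theta": theta,
        "Theta per site": theta / analyzed if analyzed else 0.0,
    }


toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTCC", "ACG-ACGTAC"]
print(polymorphism_summary(toy))
```

In the toy alignment the fourth column contains a gap and is excluded, so the difference it carries does not count toward S or k.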

Synonymous and non-synonymous changes
Parameter               Description [Reference]
# Stop codons           Number of STOP codons
# Codons                Total number of codons
# Sites                 Total number of sites (nucleotides)
# SS                    Number of Synonymous sites
Pi (SS)                 π in synonymous sites
Pi_JC (SS)              π (with Jukes&Cantor correction) in synonymous sites
V Pi_JC (SS)            Variance of π (with Jukes&Cantor correction) in synonymous sites
# NS                    Number of Non-synonymous sites
Pi (NS)                 π in non-synonymous sites
Pi_JC (NS)              π (with Jukes&Cantor correction) in non-synonymous sites
V Pi_JC (NS)            Variance of π (with Jukes&Cantor correction) in non-synonymous sites

Pairwise comparisons (between pairs of sequences)
Parameter               Description [Reference]
SynDif                  Number of Synonymous differences [Nei and Gojobori 1986]
NSynDif                 Number of Non-synonymous differences
SynPos                  Number of Synonymous positions
NSynPos                 Number of Non-synonymous positions
Ks                      Number of synonymous polymorphisms per synonymous site
Ka                      Number of non-synonymous polymorphisms per non-synonymous site
Ks_JC                   Number of synonymous polymorphisms per synonymous site - with Jukes&Cantor correction
Ka_JC                   Number of non-synonymous polymorphisms per non-synonymous site - with Jukes&Cantor correction
V Ks_JC                 Variance of Ks (Jukes&Cantor)
V Ka_JC                 Variance of Ka (Jukes&Cantor)

Linkage disequilibrium
Parameter               Description [Reference]
# Polym sites           Number of polymorphic sites
# Pairwise comparisons  Number of pairwise comparisons (between pairs of polymorphic sites)
ZnS                     R2 average over all pairwise comparisons [Kelly 1997]

Pairwise comparisons (between pairs of polymorphic sites)
Parameter               Description [Reference]
Dist                    Distance between both compared sites (in nucleotides, considering alignment gaps)
D                       D [Lewontin and Kojima 1960]
D'                      D' [Lewontin 1964]
R                       R [Hill and Robertson 1968]
R2                      R2
SChi2                   χ2 test
Sig (SChi2)             Significance of the χ2 test
Fisher                  Fisher test
Sig (Fisher)            Significance of the Fisher test
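
The pairwise measures D, D', R and R2, and their average ZnS, can be illustrated with the usual textbook formulas for two biallelic sites: D = p(AB) - p(A)p(B), D' = D / Dmax, and R = D / sqrt(p(A)(1 - p(A))p(B)(1 - p(B))). The Python sketch below is not PDA's code. The function names are hypothetical, the reference allele at each site is arbitrarily taken from the first sequence (which only affects the sign of D, D' and R), and D' is reported with the sign of D, which is one of several conventions in use.

```python
from itertools import combinations
from math import sqrt


def pairwise_ld(site1, site2):
    """Return (D, D', R, R2) for two biallelic, polymorphic sites.

    site1, site2: strings giving the base of every sequence at each site.
    """
    n = len(site1)
    a, b = site1[0], site2[0]  # arbitrary reference alleles
    p_a = site1.count(a) / n
    p_b = site2.count(b) / n
    p_ab = sum(1 for x, y in zip(site1, site2) if x == a and y == b) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r = d / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r, r * r


def zns(sites):
    """Average R2 over all pairs of polymorphic sites."""
    values = [pairwise_ld(s, t)[3] for s, t in combinations(sites, 2)]
    return sum(values) / len(values)


# Three polymorphic sites scored in four sequences.
sites = ["AAGG", "CCTT", "CTCT"]
print(pairwise_ld(sites[0], sites[1]))
print(zns(sites))
```

The first two sites are in complete association, while the third is independent of both, so only one of the three pairwise comparisons contributes to ZnS.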

Codon Bias
Parameter               Description [Reference]
# Codons                Total number of codons
# Sites                 Total number of sites (nucleotides)
# Stop Codons           Number of STOP codons
# Codons (no !)         Number of codons excluding STOP codons
# Codons (no !, W, M)   Number of codons excluding STOP codons and those coding for amino acids encoded by a single codon (W=Trp and M=Met) (nuclear universal genetic code)
ENC                     Effective Number of Codons [Wright 1990]
CAI                     Codon Adaptation Index (the submitted sequences themselves are taken as the reference set) [Sharp and Li 1987]
SChi2                   Scaled χ2 [Shields 1988]
G+Cc                    G+C content in all coding positions [Wright 1990]
G+C2                    G+C content in second positions of codons
G+C3                    G+C content in third positions of codons
RSCU                    Relative Synonymous Codon Usage [Sharp 1986]
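
As an illustration of the three G+C measures in the Codon Bias table, the Python sketch below computes G+Cc, G+C2 and G+C3 for a single in-frame coding sequence. The function name is hypothetical, and two details are assumptions of this sketch because the source does not state them: values are expressed as percentages, and STOP codons are counted like any other codon.

```python
def gc_by_codon_position(cds):
    """G+C percentage over all coding positions and at codon positions 2 and 3.

    cds: coding sequence starting in frame; trailing bases that do not
    complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence shorter than one codon")

    def gc_percent(bases):
        return 100.0 * sum(1 for b in bases if b in "GC") / len(bases)

    return {
        "G+Cc": gc_percent(cds[:usable]),
        "G+C2": gc_percent(cds[1:usable:3]),
        "G+C3": gc_percent(cds[2:usable:3]),
    }


# Codons: ATG GCC AAG TGA
print(gc_by_codon_position("ATGGCCAAGTGA"))
```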

 

Alignments quality:

You should pay special attention to the quality of the alignments and review them after each analysis. PDA offers two special parameters that assign each alignment a value from 0 (the best) to 100 (the worst) according to:

  1. The proportion of excluded sites within an alignment due to gaps or ambiguous bases (end gaps are not taken into account). If this number is greater than 30% (or the user-defined value), it is marked in red and a warning message appears in a small window. If it reaches 100%, no analyses are performed on the alignment.
     

  2. The lengths of the shortest and the longest sequences, and the percentage difference between them.

You can find these parameters in the Alignments section of the HTML output, or in the Index_Analysis table of the database.
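
A rough idea of how the two quality values behave can be obtained with the following Python sketch. It is not PDA's calculation, and the function names are hypothetical. Two definitions are assumed here because the source does not spell them out: end gaps are ignored by restricting the count to the columns where every sequence has already started and none has yet ended, and the length difference is expressed relative to the longest sequence.

```python
def core_region(aligned):
    """Half-open column range [start, end) free of leading and trailing gaps."""
    starts = [len(s) - len(s.lstrip("-")) for s in aligned]
    ends = [len(s.rstrip("-")) for s in aligned]
    return max(starts), min(ends)


def percent_excluded(aligned):
    """Percentage of core columns containing a gap or an ambiguous base."""
    start, end = core_region(aligned)
    if end <= start:
        return 100.0  # no shared region at all
    valid = set("ACGT")
    columns = list(zip(*(s[start:end].upper() for s in aligned)))
    excluded = sum(1 for col in columns if not set(col) <= valid)
    return 100.0 * excluded / len(columns)


def percent_length_difference(aligned):
    """Difference between longest and shortest sequence, as % of the longest."""
    lengths = [len(s.replace("-", "")) for s in aligned]
    if max(lengths) == 0:
        return 0.0
    return 100.0 * (max(lengths) - min(lengths)) / max(lengths)


toy = ["ACGTACGTAC", "ACG-ACNTAC", "--GTACGTA-"]
value = percent_excluded(toy)
print(round(value, 1), "WARNING" if value > 30 else "ok")
print(percent_length_difference(toy))
```

The comparison against 30 mirrors the default warning threshold mentioned above; a user-defined percentage would replace it.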

 

5. Histogram maker tool:

This tool allows you to create personalized histograms for every parameter estimated. You can:

  1. Restrict the information to specific gene regions

  2. Restrict the information to an organism and/or gene

  3. Choose the type of distribution (which parameter you want to represent in the histogram)

  4. Choose the order of the categories: histogram or frequency

  5. Set a number of categories (number of bars in the histogram or frequency representation)

 

6. Download PDA:

The source code of PDA can be downloaded from our FTP site under the GNU General Public License (GPL) and used locally. This is highly recommended for large analyses. Please go to the section "Download Source Code" and register to obtain a username and password for the FTP server. For specific instructions on how to use the program locally, see the Installation file that comes with the distribution, which can also be downloaded separately (1,238 KB).

 

 




