PROTEOME ANALYSIS

The term "protein sequence analysis" in biology implies subjecting a peptide sequence to sequence alignment, searches against sequence databases, repeated-sequence searches, or other bioinformatics methods on a computer. Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments. It covers several biologically relevant topics, among them the comparison of sequences in order to find similar sequences (sequence alignment), the prediction of protein structures, and the comparison of homologous sequences to construct a molecular phylogeny.

The extraction, sorting and analysis of sequence information about proteins is a major part of bioinformatics. Sequence analysis is the process of trying to find out something about an amino acid sequence by employing in-silico biology techniques. Due to rapid progress in algorithm development in computational biology, a large number of software tools are freely available that will compare an unknown sequence to all of the sequences available in the public domain. Currently, many of the international scientific journals require any newly discovered sequence to be submitted to a publicly available database before the discovery can be published. These newly submitted sequences are then checked, annotated, cross-referenced and published. Each record is then curated and maintained in one of the many different databases available over the internet. The initial analyses carried out on protein sequences are: composition analysis, molecular weight search, isoelectric point calculation, peptide mapping, hydrophobicity and hydrophilicity of the sequence, secondary structure prediction, fold prediction, transmembrane region prediction, coil structure prediction, signal peptide prediction, motif prediction, tertiary structure prediction, etc. Once a new sequence is obtained from the database resources, the following stepwise sequence analysis can be carried out.

There are five major types of sequence analysis, namely:

Type 1: Sequence retrieval and Preparation

Type 2: Similarity Searches and Phylogenetic analysis

Type 3: Structure Prediction

Type 4: Profile and Pattern construction and search

Type 5: Protein Function Prediction

TYPE 1: SEQUENCE RETRIEVAL AND PREPARATION:

The sequence of the protein of interest is retrieved from protein sequence databases such as GenPept, UniProt and PIR.

URL: http://www.ncbi.nlm.nih.gov/

For sequence analysis, different servers require different input sequence formats, which can be obtained using sequence format converters. One such converter, available at BCM, is ReadSeq.

URL: http://searchlauncher.bcm.tmc.edu/seq-util/Options/readseq.html

TYPE 2: SIMILARITY SEARCHES AND PHYLOGENETIC ANALYSIS:

Similarity searches are done through sequence alignment analysis. Sequence alignment is the process of lining up two or more sequences to achieve maximal levels of identity and conservation for the purpose of assessing the degree of similarity and the possibility of homology. Sequence similarity analysis is the single most powerful method available for structural and functional inference from databases. It allows homology between proteins to be inferred, and homology can help one to infer whether similarity in sequence is likely to mean similarity in function. Fundamentally, sequence-based alignment searches are string-matching procedures. A sequence of interest (the query sequence) is compared with sequences (targets) in a databank, either pair-wise or with multiple target sequences, by searching for a series of individual characters. Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and non-identical characters can be placed opposite a gap in the other sequence. A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. Because an insertion in one sequence is equivalent to a deletion in the other, the two are usually called indels for short. In an optimal alignment, non-identical characters and gaps are placed so as to bring as many identical or similar characters as possible into vertical register.

The objective of sequence alignment analysis is to analyze sequence data in order to make reliable predictions of protein structure, function and evolution vis-a-vis the three-dimensional structure. When a character is shared between two species or populations, that character is said to be identical. The degree to which two species or populations share identities is indicated by similarity.

Sequence alignment is mainly studied under two headings, namely pair-wise sequence alignment and multiple sequence alignment.

Pair-wise sequence alignment:

Pair-wise alignment is a fundamental process in sequence comparison analysis. Pair-wise alignment of two sequences is a relatively straightforward computational problem. In a pair-wise comparison, if gaps or local alignments are not considered, the optimal alignment can be found directly and the number of computations required for two sequences is roughly proportional to the square of the average length, as is the case in a dot plot comparison. The problem becomes more complicated when gaps and local alignments are considered, because exhaustively trying every possible placement of gaps is not feasible. A maximum match between two sequences is defined to be the largest number of amino acids from one protein that can be matched with those of another protein, while allowing for all possible deletions. A penalty is introduced to provide a barrier to arbitrary gap insertion. Pair-wise alignment is achieved in two ways, namely local and global alignment.

Local Alignment:

Local alignment is an alignment of some portion of two nucleic acid or protein sequences. The Smith-Waterman algorithm is the best alignment method for sequences for which no evolutionary relatedness is known. The program finds the region or regions of highest similarity between two sequences, thus generating one or more islands of matches or sub-alignments in the aligned sequences. Local alignments are more suitable and meaningful for aligning sequences that are similar along some of their length but dissimilar in other parts, sequences that differ in length, or sequences that share conserved regions or domains.
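
As an illustration of the idea, the following is a minimal Python sketch of Smith-Waterman local alignment. The scoring values (match = 2, mismatch = -1, gap = -2) and the function name are assumptions chosen for the example, not values taken from any particular server; real tools use substitution matrices such as PAM or BLOSUM and more elaborate gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("GGGACDEFTTT", "PPACDEFKK")
print(score, top, bottom)
```

Only the shared island ACDEF is reported here; the dissimilar flanking residues are left out of the alignment, which is the behaviour the paragraph above describes.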

1. BLAST (Basic Local Alignment Search Tool) is one of the online tools available at NCBI. It is a popular, user-friendly search tool for searching all the major sequence databases. It is a heuristic method to find the highest scoring locally optimal alignment between a query sequence and a database sequence. BLAST programs are designed for fast database searching with minimal sacrifice of sensitivity to distantly related sequences. It shows better results for protein sequences than for nucleotide sequences. The default database is the non-redundant database, but the user can select another database instead. The use of filters reduces problems of contamination with the numerous artifacts in the databases.

URL: http://ncbi.nlm.nih.gov/blast/

Global Alignment:

Global alignment is an alignment of two nucleic acid or protein sequences over their entire length. The Needleman-Wunsch algorithm (GAP program) is one of the methods used to carry out pair-wise alignment of sequences by comparing a pair of residues at a time. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array, and pathways through the array represent all possible comparisons. The comparison is scored with a simple scoring system: a match scores 1, a mismatch scores 0, and a penalty is applied for a gap. Each cell in the matrix is examined, and the maximum score along any path leading to the cell is added to its present contents. In this way the maximum match pathway is constructed. The maximum match is the largest number that would result from summing the cell score values of every pathway, and it defines the optimal alignment. Leaps to non-adjacent diagonal cells in the matrix indicate the need for gap insertion to bring the sequences into register. Complete diagonals of the array contain no gaps. The Needleman-Wunsch algorithm tries to take all the characters of one sequence and align them with all the characters of a second sequence. This algorithm works well for sequences that show similarity across most of their lengths.
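
The procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the GAP program itself: it uses the match = 1, mismatch = 0 scoring mentioned above, and the gap penalty of -1 and the function name are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align the two residues
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to the top-left corner,
    # so every residue of both sequences ends up in the alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACDEF", "ACEF"))
```

For the two short sequences in the usage line, the single missing residue is bridged by one gap, giving four matches and one gap penalty.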

1. EMBOSS: The EMBOSS pair-wise alignment tool is used for global alignment.

URL: http://www.ebi.ac.uk/emboss/align/

Multiple sequence alignment:

Multiple sequence alignment is an alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Multiple alignment is at the heart of sequence analysis methods because of the quantity and diversity of information in databases. Pair-wise alignments are fundamental and useful, but there are some problems with them. For instance, when using one of the popular sequence searching programs such as BLAST, which performs pair-wise alignments to find similar sequences in a database, one very often obtains many sequences that are significantly similar to the query sequence. Comparing each and every sequence to every other may be possible when one has just a few sequences, but it quickly becomes impractical as the number of sequences increases. In a multiple sequence alignment, however, all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up in which each row is the sequence of one protein and each column is the same position in each sequence. Each column corresponds to a specific residue in the prototypical protein. As with pair-wise alignment, there will be gaps in some sequences, most often shown by a dash '-' or dot '.' character. Note that to construct a multiple alignment, one may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pair-wise alignment. This means that multiple alignments typically contain more gaps than any given pair of aligned sequences.

1. ClustalW:

ClustalW is one of the standard programs in wide use for multiple alignments; it implements one variant of the progressive method. The W denotes a specific version that has been developed from the original Clustal program. The basic steps of the algorithm implemented in ClustalW are:

Compute the pair-wise alignments of all sequences against all others and store the similarities in a matrix.
Convert the sequence similarity matrix values to distance measures, reflecting evolutionary distance between each pair of sequences.
Construct a guide tree for the order in which pairs of sequences are to be aligned and combined with previous alignments.
Progressively align the sequences/alignments together into each branch point of the guide tree, starting with the least distant pairs of sequences. At each branch point, one must do either a sequence-sequence, sequence-profile, or profile-profile alignment.
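
The first two steps can be sketched as follows. This is a simplified illustration, not ClustalW's own code: it assumes the sequences in each pair have already been aligned to equal length, and it uses the plain fraction of differing positions as the distance, whereas ClustalW applies its own scoring and corrections. The function names are hypothetical.

```python
def p_distance(x, y):
    """Fraction of differing positions between two aligned, equal-length sequences.

    Columns in which both sequences carry a gap are ignored.
    """
    pairs = [(p, q) for p, q in zip(x, y) if not (p == "-" and q == "-")]
    diffs = sum(1 for p, q in pairs if p != q)
    return diffs / len(pairs)


def distance_matrix(seqs):
    """Symmetric all-against-all distance matrix with zeros on the diagonal."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = p_distance(seqs[i], seqs[j])
    return D


for row in distance_matrix(["ACDE", "ACDF", "GCKF"]):
    print(row)
```

The smallest off-diagonal entry identifies the least distant pair, which is the pair the progressive method would align first.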

ClustalW is an example of an algorithm that has given up on trying to be perfect and instead uses an approximation strategy, combined with more or less intelligent tricks that guide the computation towards a successful result. This is called a heuristic algorithm.

One important point to keep in mind is that since ClustalW is a heuristic algorithm, it cannot produce a solution that is guaranteed to be optimal. In practice, however, the results it produces are good enough.

URL: http://www.ebi.ac.uk/clustalw/

Phylogenetic Analysis:

Phylogenetic analysis of a family of related sequences is a determination of how the family might have been derived during evolution. Placing the sequences as outer branches on a tree depicts the evolutionary relationships among the sequences. The branching relationships on the inner part of the tree then reflect the degree to which different sequences are related. There are five different ways in which phylogenetic analysis is carried out, namely the Dayhoff mutation data matrix method, the Block model, the Clustering algorithm, the Distance method and the Cladistic method. Phylogenetic trees are graphical representations of the genetic relationships and the evolutionary history of taxa or sequences. The separated sequences are referred to as taxa, defined as phylogenetically distinct units on the tree.

The tree is composed of outer branches representing the taxa, and of nodes and branches representing relationships among the taxa. Distance, maximum parsimony and maximum likelihood methods are generally used to find the evolutionary trees. Multiple sequence alignment plays a crucial role in phylogenetic tree construction. The method of converting an MSA to a phylogenetic tree is used to reduce the problem of multiple alignment to an iterative process of pair-wise alignments. The process works as follows: compute all pair-wise distances between the given sequences, compute a tree by clustering with a method such as UPGMA or neighbor joining, and align the sequences in the order given by the tree.
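
The clustering step can be illustrated with a minimal UPGMA sketch in Python. It takes a precomputed distance matrix and returns only the branching order as nested tuples; branch lengths are omitted, and the function name and the example distances are assumptions made for the illustration.

```python
def upgma(names, D):
    """Build a tree (nested tuples) from distance matrix D by UPGMA clustering."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}          # id -> (subtree, size)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                       # closest pair of clusters
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # Distance to the merged cluster is the size-weighted average.
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del dist[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


D = [[0.0, 0.25, 0.75],
     [0.25, 0.0, 0.5],
     [0.75, 0.5, 0.0]]
print(upgma(["A", "B", "C"], D))
```

In this example A and B are joined first because they are the least distant pair, and C joins the resulting cluster afterwards.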

There are various programs available for performing phylogenetic operations. The programs and program options differ for DNA and protein sequences.

1. Phylip (Phylogeny Inference Package) is a package of programs for inferring phylogenies. Methods supported in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites, distance matrices and 0/1 discrete characters.

URL: http://evolution.genetics.washington.edu/phylip.html

TYPE 3: STRUCTURE PREDICTION:

Protein structure prediction from a sequence is one of the high-focus problems for researchers. It is a very useful application of bioinformatics because experimental techniques like X-ray crystallography are time consuming. The fundamental issue is how to predict the 3-D shape of a protein from its amino acid sequence. This issue is addressed in bioinformatics with different algorithms and methods. Protein structure prediction is carried out at three different levels, namely primary structure analysis, secondary structure prediction and tertiary structure prediction.

Primary structure analysis:

There are various tools for predicting the physical properties using the sequence information.
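
As a small illustration of what can be derived from sequence alone, the sketch below computes the amino acid composition and an average molecular weight. The residue masses are rounded standard average values and the function names are assumptions made for the example; dedicated tools should be preferred for real analyses.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water), rounded.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def composition(seq):
    """Fraction of each amino acid type present in the sequence."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def molecular_weight(seq):
    """Average molecular weight of an unmodified peptide chain, in daltons."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq) + WATER


peptide = "ACDEFGHIK"
print(composition(peptide))
print(round(molecular_weight(peptide), 2))
```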

1. SAPS (Statistical Analysis of Protein Sequences):

SAPS is a program that provides extensive statistical information for any given sequence. The output is organized in the following sections: file name, sequence printout, compositional analysis, charge distributional analysis, distribution of other amino acid types, repetitive structures, multiplets, periodicity analysis, and spacing analysis. The output is several pages long.

URL: http://www.isrec.isb-sib.ch/software/SAPS_form.html

2. ProtScale:

ProtScale allows you to compute and represent the profile produced by any amino acid scale on a selected protein. An amino acid scale is defined by a numerical value assigned to each type of amino acid.

The most frequently used scales are the hydrophobicity or hydrophilicity scales and the secondary structure conformational parameters scales, but many other scales exist which are based on different chemical and physical properties of the amino acids. This program provides 55 predefined scales entered from the literature.
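
The idea of a scale profile can be sketched as a sliding-window average. The example below uses the Kyte-Doolittle hydropathy scale; the window size of 9, the unweighted average and the function name are assumptions made for the illustration and are not meant to reproduce ProtScale's options exactly.

```python
# Kyte-Doolittle hydropathy values: one number per amino acid type.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(seq, scale=KYTE_DOOLITTLE, window=9):
    """Return [(position, mean scale value)] for each full window.

    The position is the 1-based index of the residue at the window centre.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    profile = []
    for center in range(half, len(seq) - half):
        segment = seq[center - half:center + half + 1]
        mean = sum(scale[aa] for aa in segment) / window
        profile.append((center + 1, mean))
    return profile


for position, value in scale_profile("IIIRRR", window=3):
    print(position, round(value, 2))
```

Plotting the value against the position gives the kind of profile the server draws; peaks mark hydrophobic stretches and troughs mark hydrophilic ones on this particular scale.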

URL: http://ca.expasy.org/protscale.pl

Secondary structure prediction:

There are several protein secondary structure prediction methods available and the most important methods are Chou-Fasman method, GOR methods, Nearest neighbour methods, Hidden Markov models, Neural networks and Multiple alignments based self-optimization method.

1. GOR (Garnier, Osguthorpe and Robson) method:

GOR is a method that assumes that amino acids up to 8 residues on each side influence the secondary structure of the central residue. This program is now in its fourth version. The accuracy of GOR when checked against a set of 267 proteins of known structure is 64%. This implies that 64% of the amino acids were correctly predicted as being helix, sheet or coil. The algorithm uses a sliding window of 17 amino acids. All possible pairs of amino acids in this window are checked for their information content as to predicting the structure of the central amino acid by comparing them to a set of 266 other proteins of known structure. The method works better for helix than for sheet, because sheet is dependent on longer-range interactions between non-adjacent sequence fragments.

URL: http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html

2. COILS:

Coils is a program that compares a sequence to a database of known parallel two-stranded coiled coils and derives a similarity score. After comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.

URL: http://www.ch.embnet.org/software/COILS_form.html

Tertiary Structure Prediction:

There are three fundamental approaches to using sequence data for protein structure prediction. Two approaches, namely homology modeling and threading, use pattern recognition methods. The pattern recognition approach is used to detect similarity between sequences, which gives indications from which related structures and functions can be inferred. The other approach is to use the sequence data without any template. This approach is called ab initio prediction. The ab initio approach is truly a prediction approach and is used to deduce structure and infer function directly from sequence.

Homology Modeling:

The prediction of protein three-dimensional (3D) structure from amino acid sequence is most successful when the structures of one or more homologues are known. A structure similar to the native structure can be predicted if sequence similarity is about or above 35%. Structural information can then be extrapolated to the new sequence and a 3D model may be derived, well before X-ray crystallography or NMR determines the structure of the new protein. This approach is most appropriately known as comparative modeling, but it is also referred to as homology modeling or knowledge-based modeling.

1. SWISS-MODEL:

It is a fully automated protein structure homology-modeling server, accessible via the ExPASy web server or from the program DeepView (Swiss-PdbViewer). It goes through the following 5 steps: search for suitable templates, check sequence identity with the target, create ProMod jobs, generate models with ProModII, and energy minimization with Gromos96.

URL: http://swissmodel.expasy.org//SWISS-MODEL.html

The results of SWISS-MODEL are provided only by e-mail.

Threading:

When the sequence of a query protein has no detectable similarity to proteins of known structure (<20%), other methods of 3-D protein structure prediction may be employed. One such method is sequence threading. It is otherwise known as remote homology modeling or fold recognition modeling. Threading involves placing or "threading" an amino acid sequence onto databases of different secondary and tertiary structures. In effect, the procedure is aimed at predicting how well a fold will fit a sequence rather than predicting how a sequence will fold. A target sequence is "threaded" through a library of 3-D folds to try to find a match. There are two approaches: 2-D threading and 3-D threading. 2-D threading is a "prediction-based" method that uses secondary structure as the primary evaluation criterion. 3-D threading uses distance-based or profile-based energy functions. The amino acid sequence of a query protein is examined for compatibility with the structural core of a known protein structure. The sequence is threaded into a database of protein cores to look for matches.

URL: http://www.sbg.bio.ic.ac.uk/~3dpssm/

Ab initio Approach:

In contrast to the above methods, the goal of ab initio prediction is to build a model for a given sequence without using a template. Ab initio prediction relies on the thermodynamic hypothesis of protein folding: the methods are based on the premise that the native structure of a protein sequence corresponds to its global free energy minimum state. Accordingly, the methods are generally formulated as optimizations. Molecular mechanics and molecular dynamics are used extensively in this type of structure prediction.

HMMSTR /I-sites/Rosetta Prediction Server:

This server predicts the tertiary structure of proteins from the sequence. I-sites predicts local structure, expressed as backbone torsion angles, using a library of sequence-structure motifs. ROSETTA is a Monte Carlo fragment insertion protein folding program. HMMSTR is an HMM-based tool for local and secondary structure prediction based on the I-sites library. This server provides a structure only if no homology is present in the databases.

URL: http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php

Quaternary structure Prediction:

Quaternary structure deals with the specific arrangement of subunits (polypeptides) with respect to one another in the protein complex. Oligomeric proteins usually possess quaternary structure.

Quaternary Structure Predictor:

The term "mericity" is used here to refer to "the number of subunits in a multisubunit protein". Mericity is a quaternary property of proteins. This experimental server accepts a query protein sequence (in single-letter amino acid code form) and returns a prediction of the class (homodimer or non-homodimer, i.e. mericity=2 or not 2), a rule number, and a rule confidence level. A particular sequence may be covered by more than one rule. The true error rate of this classifier is approximately 30 per cent, as determined by a 10-fold cross-validation experiment. These rules are highly sensitive to certain patterns in sequences; therefore, fragments and artificial sequences may give misleading results. It predicts the ability of the polypeptide to form homo- or heterodimers.

URL: http://www.mericity.com/

Structure Validation:

This step is an important one in connection with structure prediction, because a predicted structure should be validated before it is accepted. Structure validation is provided by servers like WHAT IF and VADAR.

WHAT IF:

The program WHAT IF provides nearly 2000 options in fields as diverse as homology modelling, drug docking, electrostatics calculations, structure validation and visualisation. This set of servers gives everybody access to some of these options. It provides its validation in the form of a text file which contains OK, warning and error information about the protein structure. At the end of the validation message, a score is given to the predicted protein.

VADAR:

VADAR (Volume, Area, Dihedral Angle Reporter) is a compilation of more than 15 different algorithms and programs for analyzing and assessing peptide and protein structures from their PDB coordinate data. The results have been validated through extensive comparison to published data and careful visual inspection. The VADAR web server supports the submission of either PDB-formatted files or PDB accession numbers. VADAR produces extensive tables and high-quality graphs for quantitatively and qualitatively assessing protein structures determined by X-ray crystallography, NMR spectroscopy, 3D threading or homology modeling.

URL: http://redpoll.pharmacy.ualberta.ca/vadar/

TYPE 4: PROFILE AND PATTERN CONSTRUCTION AND SEARCH:

Profiles:

Profiles are mathematical representations of conserved regions that are built from a multiple sequence alignment. Profiles encompass full domain alignments by defining which residues are allowed at given positions in the sequence, which positions are highly conserved, and which positions or regions tolerate insertions.

A profile program has four components, namely the assembly of a family of related sequences into a multiple sequence alignment, the construction of a profile from the alignment, the comparison of the profile to a database, and the display of the best similarities found by the search. Profiles help to find the similarities between these sequences and help in the identification and analysis of distantly related proteins.

A position-specific scoring matrix (PSSM) is constructed along the lines of PAM or BLOSUM. Many of the profile-based search programs are based on a statistical method called hidden Markov models (HMMs). A hidden Markov model is built on a Markov chain and offers a more systematic approach to estimating parameters for domain alignments, by employing position-dependent scores to characterize and build a model for an entire family of sequences.
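
A position-specific scoring matrix can be sketched as a table of log-odds scores, one row per alignment column. The example below is a simplified illustration: the uniform background frequency, the pseudocount of 1 and the function names are assumptions, and sequences are expected to contain only the 20 standard amino acid letters (gaps in the alignment are ignored when counting).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """One dict of log2-odds scores per column of an ungapped-width alignment."""
    background = 1.0 / len(AMINO_ACIDS)           # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum of the position-dependent scores for a segment as wide as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_hit(pssm, seq):
    """Return (score, 0-based start) of the best-scoring window in seq."""
    width = len(pssm)
    return max((score(pssm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


pssm = build_pssm(["ACD", "ACE", "ACD"])
print(best_hit(pssm, "WWACDWW"))
```

Residues seen often in a column receive positive scores and residues never seen receive negative ones, so the window matching the conserved region scores highest.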

ProfileScan:

ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences. These motifs are represented as profiles in a library. ProfileScan aligns each profile motif to the sequence and displays all alignments between the profile and the sequence that have a normalized score above a set threshold.

URL: http://hits.isb-sib.ch/cgi-bin/PFSCAN

Patterns:

Patterns also represent the common characteristics of a protein family, but they do not contain any weighting information. Pattern recognition programs follow the reverse process of sequence analysis. Rather than predicting how a sequence will fold, they predict how well a fold will match a sequence. That is, they match a sequence to a given topology rather than searching for a topology for a given sequence. Pattern recognition methods attempt to detect similarities between 3-D structures that are not accompanied by any significant sequence similarity. The general approach involves calculating a table of propensities that gives the probability of each type of amino acid being found in a given environment. For a given structure, each position can be assigned to one of the environments. Dynamic programming is then used to find the best match of the sequence to the pattern of environments found in a given fold.

PRATT:

Pratt is a tool that allows the user to search for patterns conserved in a set of protein sequences. The user can specify what kind of patterns should be searched for, and how many sequences should match a pattern for it to be reported. The patterns that can be found are a subset of the set of patterns that can be described using PROSITE notation.
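
To show what a pattern in PROSITE notation expresses, the sketch below translates a pattern into a Python regular expression and scans a sequence with it. It is a simplified illustration covering the common elements of the notation (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, and the anchors < and > at the ends of the pattern); anchors inside square brackets are not handled. The function names and the test sequence are assumptions made for the example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):               # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):                 # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an optional repeat count, e.g. 'x(2,4)' -> ('x', '2,4').
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


def find_pattern(pattern, seq):
    """Return [(1-based start, matched text)], including overlapping matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(seq)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_pattern("N-{P}-[ST]-{P}", "MKNGSAANPSA"))
```

The pattern used in the usage lines is the N-glycosylation site motif; the second asparagine in the test sequence is not reported because it is followed by proline.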

URL: http://expasy.org/tools/pratt/

Profile and Pattern search:

URL: http://motif.genome.jp/MOTIF2.html

On this site, motifs, profiles and patterns can be searched for and profiles can be generated. The server not only finds sequence motifs in the query sequence, but also provides functional and genomic information about the motifs found, using DBGET and LinkDB for the hyperlinked annotations. The results are also presented graphically and, where available, the 3D structures of the motifs found can be examined with the RasMol program when the hits are found in the PROSITE database. Given a profile that was generated from a multiple sequence alignment, or retrieved from a motif library such as PROSITE or Pfam, one can align a protein sequence with the profile. The procedure is similar to searching against the motif library database, except that one provides the name of the file containing the profile matrix instead of the database names.

This server also supports TRANSFAC database which collects eukaryotic cis-acting regulatory DNA elements and trans-acting factors. Given a profile, protein sequence databases on GenomeNet service are retrieved to find out the protein families that have the same motif.

The profile, either in PROSITE or Pfam format, can be calculated from a multiple sequence alignment, using PFMake or HMMBuild respectively, or retrieved from a motif library such as PROSITE or Pfam. The Pfsearch program is used to search with a PROSITE-format profile and Hmmsearch is used for a Pfam-format one. The target sequence libraries are Swiss-Prot, PDBSTR, PIR, PRF and Genes. The server also allows one to search these protein sequence libraries with given patterns; the sequence pattern must be specified in the PROSITE pattern format.

TYPE 5: PROTEIN FUNCTION PREDICTION:

Protein sequence determines protein structure, which in turn determines protein function. Therefore, for function prediction, the structure is predicted first and the function is then predicted from it. Predicting protein function from sequence adds two problems to the still unsolved task of structure prediction:

Function is not entirely determined by sequence; the environment is crucially important.
‘Protein Function’ is a rather intuitive but ill-defined term. Function is a complex phenomenon associated with many mutually overlapping levels: chemical, biochemical, cellular, physiological, organism mediated and developmental.

These levels are related in complex ways e.g. protein kinases can be related to different cellular functions and to a chemical function plus a complex control mechanism by interaction with other proteins. Protein function prediction efforts generally involve attempts to predict biochemical function, cellular role predictions and subcellular location predictions.

ProtFun Server:

It produces ab initio predictions of protein function from sequence. The method queries a large number of other feature prediction servers to obtain information on various post-translational and localizational aspects of the protein, which are integrated into final predictions of the cellular role, enzyme class and selected gene ontology categories of the submitted sequence. It is possible to inspect the individual feature predictions used and integrated by ProtFun.

URL: http://www.cbs.dtu.dk/services/ProtFun/

EXPASY TOOLS

Protein identification and characterization

Identification and characterization with peptide mass fingerprinting data

Aldente - Identify proteins with peptide mass fingerprinting data. A new, fast and powerful tool that takes advantage of Hough transformation for spectra recalibration and outlier exclusion
FindMod - Predict potential protein post-translational modifications and potential single amino acid substitutions in peptides. Experimentally measured peptide masses are compared with the theoretical peptides calculated from a specified Swiss-Prot entry or from a user-entered sequence, and mass differences are used to better characterize the protein of interest.
FindPept - Identify peptides that result from unspecific cleavage of proteins from their experimental masses, taking into account artefactual chemical modifications, post-translational modifications (PTM) and protease autolytic cleavage
GlycoMod - Predict possible oligosaccharide structures that occur on proteins from their experimentally determined masses (can be used for free or derivatized oligosaccharides and for glycopeptides)
Mascot - Peptide mass fingerprint from Matrix Science Ltd., London - http://www.matrixscience.com/search_form_select.html
PepMAPPER - Peptide mass fingerprinting tool from UMIST, UK - http://wolf.bms.umist.ac.uk/mapper/
PFMUTS - Shows the possible single and double mutations of a peptide fragment from MALDI peptide mass fingerprinting - http://www.mcs.vuw.ac.nz/~aleksand/pfmuts/pfmuts.html
ProFound - Search known protein sequences with peptide mass information from Rockefeller and NY Universities [or from Genomic Solutions] - http://prowl.rockefeller.edu/
ProteinProspector - UCSF tools for peptide masses data (MS-Fit, MS-Pattern, MS-Digest, etc.) - http://prospector.ucsf.edu/
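The idea behind the peptide mass fingerprinting tools listed above can be sketched in a few lines: digest a candidate sequence in silico, compute the theoretical peptide masses, and compare them with the measured masses. The sketch below is illustrative only and is not the algorithm of any listed tool. It assumes trypsin specificity (cleavage after K or R, not before P), neutral monoisotopic masses, no missed cleavages and no modifications; the function names and the default tolerance are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(seq):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def match_masses(observed, seq, tol=0.5):
    """Return (observed mass, peptide) pairs that agree within tol Da."""
    theoretical = [(peptide_mass(p), p) for p in tryptic_digest(seq)]
    return [(m, p) for m in observed for t, p in theoretical
            if abs(m - t) <= tol]


if __name__ == "__main__":
    for pep in tryptic_digest("AKRPGK"):
        print(pep, round(peptide_mass(pep), 4))
```

A real search engine additionally handles missed cleavages, charge states, modifications and a statistical score over a whole database; none of that is attempted here.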

Identification and characterization with MS/MS data

Popitam - Identification and characterization tool for peptides with unexpected modifications (e.g. post-translational modifications or mutations) by tandem mass spectrometry
Phenyx - Protein and peptide identification/characterization from MS/MS data from GeneBio, Switzerland
Mascot - Sequence query and MS/MS ion search from Matrix Science Ltd., London - http://www.matrixscience.com/search_form_select.html
OMSSA - MS/MS peptide spectra identification by searching libraries of known protein sequences - http://pubchem.ncbi.nlm.nih.gov/omssa/
PepFrag - Search known protein sequences with peptide fragment mass information from Rockefeller and NY Universities [or from Genomic Solutions] - http://prowl.rockefeller.edu/
ProteinProspector - UCSF tools for fragment-ion masses data (MS-Tag, MS-Seq, MS-Product, etc.) - http://prospector.ucsf.edu/
SearchXLinks - Analysis of mass spectra of modified, cross-linked, and digested proteins whose amino acid sequence is known, from Caesar, Germany - http://www.searchxlinks.de/

Identification with isoelectric point, molecular weight and/or amino acid composition

AACompIdent - Identify a protein by its amino acid composition
AACompSim - Compare the amino acid composition of a UniProtKB/Swiss-Prot entry with all other entries
TagIdent - Identify proteins with isoelectric point (pI), molecular weight (Mw) and sequence tag, or generate a list of proteins close to a given pI and Mw
MultiIdent - Identify proteins with isoelectric point (pI), molecular weight (Mw), amino acid composition, sequence tag and peptide mass fingerprinting data
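Identification by amino acid composition rests on a simple computation: express each sequence as a percentage composition over the 20 amino acids and rank database entries by how close their composition is to the query. The sketch below shows that computation with a sum of squared differences as the closeness measure. This measure and the function names are assumptions chosen for illustration and are not claimed to be the scoring used by AACompIdent or AACompSim.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Percentage of each of the 20 standard amino acids in seq."""
    counts = Counter(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def composition_distance(seq_a, seq_b):
    """Sum of squared differences between two percentage compositions."""
    ca, cb = composition(seq_a), composition(seq_b)
    return sum((ca[aa] - cb[aa]) ** 2 for aa in AMINO_ACIDS)


def rank_by_composition(query, candidates):
    """Sort a {name: sequence} dict by distance to the query composition."""
    return sorted(candidates,
                  key=lambda name: composition_distance(query, candidates[name]))


if __name__ == "__main__":
    db = {"entry1": "AAGGKK", "entry2": "WWWYYY"}  # toy sequences
    print(rank_by_composition("AGKAGK", db))
```

Because composition ignores residue order, two unrelated sequences can score as identical, which is why the tools above combine composition with pI, Mw or sequence tags.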

Other prediction or characterization tools

ProtParam - Physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, isoelectric point, extinction coefficient, etc.)
Compute pI/Mw - Compute the theoretical isoelectric point (pI) and molecular weight (Mw) from a UniProt Knowledgebase entry or for a user sequence
GlycanMass - Calculate the mass of an oligosaccharide structure
PeptideCutter - Predicts potential protease cleavage sites and sites cleaved by chemicals in a given protein sequence
PeptideMass - Calculate masses of peptides and their post-translational modifications for a UniProtKB/Swiss-Prot or UniProtKB/TrEMBL entry or for a user sequence
IsotopIdent - Predicts the theoretical isotopic distribution of a peptide, protein, polynucleotide or chemical compound - http://education.expasy.org/student_projects/isotopident/
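A theoretical Mw and pI of the kind reported by the tools above can be approximated as follows. Mw is the sum of average residue masses plus one water. The pI is the pH at which the net charge, computed from Henderson-Hasselbalch terms for the ionizable groups, is zero; it is found here by bisection. The pK values in this sketch are assumed, illustrative values, so the results will not necessarily reproduce those of Compute pI/Mw or ProtParam.

```python
# Average residue masses in Da (standard values).
AVG_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_AVG = 18.01528

# Assumed pK values, for illustration only.
PK_POS = {"K": 10.0, "R": 12.0, "H": 5.98}
PK_NEG = {"D": 4.05, "E": 4.45, "C": 9.0, "Y": 10.0}
PK_NTERM, PK_CTERM = 7.5, 3.55


def molecular_weight(seq):
    """Average molecular weight of an unmodified polypeptide."""
    return sum(AVG_MASS[aa] for aa in seq) + WATER_AVG


def net_charge(seq, ph):
    """Net charge at a given pH from Henderson-Hasselbalch terms."""
    pos = 1.0 / (1.0 + 10 ** (ph - PK_NTERM))
    neg = 1.0 / (1.0 + 10 ** (PK_CTERM - ph))
    for aa in seq:
        if aa in PK_POS:
            pos += 1.0 / (1.0 + 10 ** (ph - PK_POS[aa]))
        elif aa in PK_NEG:
            neg += 1.0 / (1.0 + 10 ** (PK_NEG[aa] - ph))
    return pos - neg


def isoelectric_point(seq, tol=1e-4):
    """Bisection on pH 0-14; the net charge decreases as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    seq = "MKWVTFISLLLLFSSAYS"  # arbitrary example sequence
    print(round(molecular_weight(seq), 2), round(isoelectric_point(seq), 2))
```

Bisection is sufficient here because the net charge is a monotonically decreasing function of pH, so there is exactly one zero crossing in the interval.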

Other tools for 2-DE or MS data (visualization, analysis, etc.)

ImageMaster / Melanie - Software for 2-D PAGE analysis
MSight - Mass Spectrometry Imager

Make2D-DB II - A package to build a web-based proteomics database

DNA -> Protein

Translate - Translates a nucleotide sequence to a protein sequence
Transeq - Nucleotide to protein translation from the EMBOSS package - http://www.ebi.ac.uk/emboss/transeq/
Graphical Codon Usage Analyser - Displays the codon bias in a graphical manner - http://gcua.schoedl.de/
BCM search launcher - Six frame translation of nucleotide sequence(s) - http://searchlauncher.bcm.tmc.edu/seq-util/Options/sixframe.html
Backtranslation - Translates a protein sequence back to a nucleotide sequence - http://www.entelechon.com/eng/backtranslation.html
Reverse Translate - Translates a protein sequence back to a nucleotide sequence - http://www.bioinformatics.org/sms2/rev_trans.html
Genewise - Compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors - http://www.sanger.ac.uk/Software/Wise2/genewiseform.shtml
FSED - Frameshift error detection - http://ir2lcb.cnrs-mrs.fr/d_fsed/fsed.html
LabOnWeb - Elongation, expression profiles and sequence analysis of ESTs using Compugen LEADS clusters - http://www.labonweb.com/

List of gene identification software sites - http://www.cbi.pku.edu.cn/mirror/GenomeWeb/nuc-geneid.html
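What the translation tools in this group do can be shown with a minimal sketch: read the nucleotide sequence codon by codon using the standard genetic code, and for a six-frame translation repeat this for the three offsets on each strand. The code below assumes the standard genetic code only; unknown or ambiguous codons are written as X and stop codons as *. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; '*' marks stops, 'X' unknown codons."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frame_translation(dna):
    """Translations of the three forward and three reverse frames."""
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translation("ATGGCCTAAGGC").items()):
        print(name, protein)
```

Organisms and organelles that use a different genetic code need a different table; the servers listed above usually offer that choice as an option.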

Similarity searches

BLAST Network Service on ExPASy
BLAST at EMBnet-CH/SIB (Switzerland)
BLAST at NCBI - http://www.ncbi.nlm.nih.gov/BLAST/
WU-BLAST at Bork's group in EMBL (Heidelberg) - http://dove.embl-heidelberg.de/Blast2/
WU-BLAST and BLAST at the EBI (Hinxton) - http://www.ebi.ac.uk/blast2/
BLAST at PBIL (Lyon) - http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_blast.html

Fasta3 - FASTA version 3 at the EBI - http://www.ebi.ac.uk/fasta33/
FDF - Smith/Waterman type searches on Paracel's Fast Data Finder (FDF) at EMBnet-CH - http://www.ch.embnet.org/software/FDF_form.html
MPsrch - Smith/Waterman sequence comparison at EBI - http://www.ebi.ac.uk/MPsrch/
PropSearch - Structural homolog search using a 'properties' approach at Montpellier - http://www.infobiosud.univ-montp1.fr/SERVEUR/PROPSEARCH/propsearch.html
SAMBA - Systolic Accelerator for Molecular Biological Applications - http://www.irisa.fr/SAMBA/
SAWTED - Structure Assignment With Text Description - http://www.bmm.icnet.uk/~sawted/
Scanps - Similarity searches using Barton's algorithm - http://www.ebi.ac.uk/scanps/
SEQUEROME - BLAST similarity search and sequence profiling at Georgetown University - http://sequerome.georgetown.edu/
SHOPS - Analysis of the genomic operon context for any group of proteins - http://www.bioinformatics.med.uu.nl/shops/

Pattern and profile searches

InterPro Scan - Integrated search in PROSITE, Pfam, PRINTS and other family and domain databases - http://www.ebi.ac.uk/InterProScan/
ScanProsite - Scans a sequence against PROSITE or a pattern against the UniProt Knowledgebase (Swiss-Prot and TrEMBL)
MotifScan - Scans a sequence against protein profile databases (including PROSITE) - http://myhits.isb-sib.ch/cgi-bin/motif_scan
Pfam HMM search - Scans a sequence against the Pfam protein families database [At Washington University or at Sanger Centre]
FingerPRINTScan - Scans a protein sequence against the PRINTS Protein Fingerprint Database - http://www.bioinf.man.ac.uk/fingerPRINTScan/
3of5 - Complex Pattern Search - http://www.dkfz.de/mga2/3of5/3of5.html
ELM - Eukaryotic Linear Motif resource for functional sites in proteins - http://elm.eu.org/
PRATT - Interactively generates conserved patterns from a series of unaligned proteins; [at EBI / ExPASy]
PPSEARCH - Scans a sequence against PROSITE (allows a graphical output); at EBI - http://www.ebi.ac.uk/ppsearch/
PROSITE scan - Scans a sequence against PROSITE (allows mismatches); at PBIL - http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_prosite.html
PATTINPROT - Scans a protein sequence or a protein database for one or several pattern(s); at PBIL - http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_pattinprot.html
SMART - Simple Modular Architecture Research Tool; at EMBL - http://smart.embl-heidelberg.de/
TEIRESIAS - Generate patterns from a collection of unaligned protein or DNA sequences; at IBM - http://cbcsrv.watson.ibm.com/Tspd.html
Hits - Relationships between protein sequences and motifs
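Scanning a sequence against a PROSITE pattern amounts to a regular-expression search once the pattern is rewritten in regex syntax. The sketch below converts the common PROSITE pattern elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends) and reports 1-based match positions. It is a simplified illustration, not the ScanProsite implementation: it does not handle anchors placed inside brackets, overlapping matches, or profiles. The function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern string to a Python regular expression."""
    regex = pattern.strip().rstrip(".")          # patterns end with a period
    regex = regex.replace("-", "")               # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("x", ".")              # any residue
    regex = regex.replace("<", "^").replace(">", "$")    # N-/C-terminal anchors
    return regex


def scan(sequence, pattern):
    """Return (1-based start, matched text) for non-overlapping matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group())
            for m in compiled.finditer(sequence.upper())]


if __name__ == "__main__":
    # PROSITE N-glycosylation site pattern: N-{P}-[ST]-{P}.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan("AANGSAA", "N-{P}-[ST]-{P}."))
```

Short patterns such as this one match by chance in many sequences, so a hit indicates a possible site rather than a confirmed one.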

Post-translational modification prediction

ChloroP - Prediction of chloroplast transit peptides
LipoP - Prediction of lipoproteins and signal peptides in Gram-negative bacteria
MITOPROT - Prediction of mitochondrial targeting sequences
PATS - Prediction of apicoplast targeted sequences
PlasMit - Prediction of mitochondrial transit peptides in Plasmodium falciparum
Predotar - Prediction of mitochondrial and plastid targeting sequences
PTS1 - Prediction of peroxisomal targeting signal 1 containing proteins
SignalP - Prediction of signal peptide cleavage sites
NetAcet - Prediction of N-acetyltransferase A (NatA) substrates (in yeast and mammalian proteins)
NetOGlyc - Prediction of O-GalNAc (mucin type) glycosylation sites in mammalian proteins
NetNGlyc - Prediction of N-glycosylation sites in human proteins
OGPET - Prediction of O-GalNAc (mucin-type) glycosylation sites in eukaryotic (non-protozoan) proteins
DictyOGlyc - Prediction of GlcNAc O-glycosylation sites in Dictyostelium
YinOYang - O-beta-GlcNAc attachment sites in eukaryotic protein sequences
big-PI Predictor - GPI Modification Site Prediction
DGPI - Prediction of GPI-anchor and cleavage sites (Mirror site)
GPI-SOM - Identification of GPI-anchor signals by a Kohonen Self Organizing Map
Myristoylator - Prediction of N-terminal myristoylation by neural networks
NetPhos - Prediction of Ser, Thr and Tyr phosphorylation sites in eukaryotic proteins
NetPicoRNA - Prediction of protease cleavage sites in picornaviral proteins
NMT - Prediction of N-terminal N-myristoylation
PrePS - Prenylation Prediction Suite
Sulfinator - Prediction of tyrosine sulfation sites
SUMOplot - Prediction of SUMO protein attachment sites
TermiNator - Prediction of N-terminal modification

Topology prediction

PSORT - Prediction of protein subcellular localization
TargetP - Prediction of subcellular location
DAS - Prediction of transmembrane regions in prokaryotes using the Dense Alignment Surface method (Stockholm University)
HMMTOP - Prediction of transmembrane helices and topology of proteins (Hungarian Academy of Sciences)
PredictProtein - Prediction of transmembrane helix location and topology (Columbia University)
SOSUI - Prediction of transmembrane regions (Nagoya University, Japan)
TMAP - Transmembrane detection based on multiple sequence alignment (Karolinska Institute; Sweden)
TMHMM - Prediction of transmembrane helices in proteins (CBS; Denmark)
TMpred - Prediction of transmembrane regions and protein orientation (EMBnet-CH)
TopPred - Topology prediction of membrane proteins (France)

Primary structure analysis

ProtParam - Physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, isoelectric point, extinction coefficient, etc.)
Compute pI/Mw - Compute the theoretical isoelectric point (pI) and molecular weight (Mw) from a UniProt Knowledgebase entry or for a user sequence
ScanSite pI/Mw - Compute the theoretical pI and Mw, and multiple phosphorylation states
MW, pI, Titration curve - Computes pI and composition, and displays a titration curve
Radar - De novo repeat detection in protein sequences
REP - Searches a protein sequence for repeats
REPRO - De novo repeat detection in protein sequences
TRUST - De novo repeat detection in protein sequences
SAPS - Statistical analysis of protein sequences at EMBnet-CH [Also available at EBI]
Coils - Prediction of coiled coil regions in proteins (Lupas's method) at EMBnet-CH [Also available at PBIL]
Paircoil - Prediction of coiled coil regions in proteins (Berger's method)
Paircoil2 - Prediction of the parallel coiled coil fold from sequence using pairwise residue probabilities with the Paircoil algorithm
Multicoil - Prediction of two- and three-stranded coiled coils
2ZIP - Prediction of Leucine Zippers
PESTfind - Identification of PEST regions at EMBnet Austria
HLA_Bind - Prediction of MHC type I (HLA) peptide binding
PEPVAC - Prediction of supertypic MHC binders
RANKPEP - Prediction of peptide MHC binding
SYFPEITHI - Prediction of MHC type I and II peptide binding
ProtScale - Amino acid scale representation (Hydrophobicity, other conformational parameters, etc.)
Drawhca - Draw an HCA (Hydrophobic Cluster Analysis) plot of a protein sequence
Protein Colourer - Tool for coloring your amino acid sequence
Three To One - Tool to convert a three-letter coded amino acid sequence to single letter code
Colorseq - Tool to highlight (in red) a selected set of residues in a protein sequence
HelixWheel / HelixDraw - Representations of a protein fragment as a helical wheel
RandSeq - Random protein sequence generator
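An amino acid scale plot of the kind ProtScale produces is a sliding-window average of a per-residue value along the sequence. The sketch below uses the Kyte-Doolittle hydropathy scale with a plain unweighted window; the window size and function name are assumptions chosen for illustration, and the weighting and normalization options of the actual server are not reproduced.

```python
# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(seq, window=9, scale=KYTE_DOOLITTLE):
    """Unweighted sliding-window mean; one value per full window."""
    seq = seq.upper()
    if window < 1 or window > len(seq):
        return []
    values = [scale[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    seq = "MKTLLLLAVLLAIVVGSDEEKR"  # arbitrary example sequence
    for i, score in enumerate(scale_profile(seq, window=7)):
        center = i + 7 // 2 + 1   # 1-based position of the window center
        print(center, round(score, 2))
```

Each score is conventionally assigned to the residue at the center of its window. Longer windows smooth the profile and are typically used when looking for membrane-spanning stretches, shorter ones for surface-exposed regions.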

Secondary structure prediction

AGADIR - An algorithm to predict the helical content of peptides
APSSP - Advanced Protein Secondary Structure Prediction Server
GOR - Garnier et al, 1996
HNN - Hierarchical Neural Network method (Guermeur, 1997)
Jpred - A consensus method for protein secondary structure prediction at University of Dundee
JUFO - Protein secondary structure prediction from sequence (neural network)
nnPredict - University of California at San Francisco (UCSF)
Porter - University College Dublin
PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
PSA - BioMolecular Engineering Research Center (BMERC) / Boston
PSIpred - Various protein structure prediction methods at Brunel University
SOPMA - Geourjon and Deléage, 1995
SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
DLP - Domain linker prediction at RIKEN

Tertiary structure

Tertiary structure analysis

iMolTalk - An Interactive Protein Structure Analysis Server
MolTalk - A computational environment for structural bioinformatics
Seq2Struct - A web resource for the identification of sequence-structure links
STRAP - A structural alignment program for proteins
TLSMD - TLS (Translation/Libration/Screw) Motion Determination

Tertiary structure prediction

Comparative modeling

SWISS-MODEL - An automated knowledge-based protein modelling server
3Djigsaw - Three-dimensional models for proteins based on homologues of known structure
CPHmodels - Automated neural-network based protein modelling server
ESyPred3D - Automated homology modeling program using neural networks
Geno3d - Automatic modelling of protein three-dimensional structure
SDSC1 - Protein Structure Homology Modeling Server

Threading

3D-PSSM - Protein fold recognition using 1D and 3D sequence profiles coupled with secondary structure information (Foldfit)
Fugue - Sequence-structure homology recognition
HHpred - Protein homology detection and structure prediction by HMM-HMM comparison
Libellula - Neural network approach to evaluate fold recognition results
LOOPP - Sequence to sequence, sequence to structure, and structure to structure alignment
SAM-T02 - HMM-based Protein Structure Prediction
Threader - Protein fold recognition
ProSup - Protein structure superimposition
SWEET - Constructing 3D models of saccharides from their sequences

Ab initio

HMMSTR/Rosetta - Prediction of protein structure from sequence

Assessing tertiary structure prediction

Anolea - Atomic Non-Local Environment Assessment
Biotech Validation Suite for Protein Structures
EVA - EValuation of Automatic protein structure prediction
LiveBench - Continuous Benchmarking of Structure Prediction Servers
PROCHECK - Verification of the stereochemical quality of a protein structure
What If - Protein structure analysis program for mutant prediction, structure verification, molecular graphics

Molecular modeling and visualization tools

Swiss-PdbViewer - A program to display, analyse and superimpose protein 3D structures
Astex Viewer
Jmol
MolMol
PyMol
Rasmol
VMD
YASARA - Molecular graphics, modeling, simulations and eLearning

Prediction of disordered regions

DisEMBL - Protein disorder prediction
GlobPlot - Protein disorder/order/globularity/domain predictor

Sequence alignment

Binary

SIM + LALNVIEW - Alignment of two protein sequences with SIM, results can be viewed with LALNVIEW
LALIGN - Finds multiple matching subsegments in two sequences
Dotlet - A Java applet for sequence comparisons using the dot matrix method
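The dot matrix method used by Dotlet can be illustrated with a short sketch: one sequence runs along each axis, and a dot is placed wherever a window of residues from one sequence matches a window from the other well enough. The sketch below counts identities only, whereas Dotlet itself scores windows with a substitution matrix; the window and threshold parameters and the function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=1, min_matches=1):
    """matrix[i][j] is True when the windows starting at seq_a[i] and
    seq_b[j] share at least min_matches identical positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [[sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
             >= min_matches
             for j in range(cols)]
            for i in range(rows)]


def render(matrix):
    """Text rendering: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


if __name__ == "__main__":
    a = "GATTACAGATTACA"
    b = "GATTACA"
    print(render(dot_matrix(a, b, window=3, min_matches=3)))
```

Similar regions show up as diagonal runs of dots, and repeats within a sequence as parallel diagonals. Raising the window size and threshold removes isolated chance matches at the cost of sensitivity.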

Multiple

Decrease redundancy - Reduce a set of sequences into a non-redundant set
Nomad (Neighborhood Optimization for Multiple Alignment Discovery) - Ungapped local multiple alignment, optimized for protein sequences, even when distantly related
CLUSTALW [At EBI, PBIL, My Hits or at EMBnet-CH]
KALIGN - an accurate and fast multiple sequence alignment algorithm [At Karolinska Institute or at EBI]
MAFFT [At Kyushu University, EBI or at MyHits]
Muscle [At Berkeley or at BioAssist]
T-Coffee [At MyHits, BioAssist or at EBI]
MSA - at Genestream (IGH)
DIALIGN - Multiple sequence alignment based on segment-to-segment comparison, at University of Bielefeld, Germany
Match-Box - at University of Namur, Belgium - at Washington University
Multalin [At INRA or at PBIL]
MUSCA - Multiple sequence alignment using pattern discovery, at IBM

Alignment analysis

AMAS - Analyse Multiply Aligned Sequences
Bork's alignment tools - Various tools to enhance the results of multiple alignments (including consensus building).
CINEMA - Color Interactive Editor for multiple alignments
ESPript - Tool to print a multiple alignment
PhyloGibbs - Gibbs motif sampler incorporating phylogeny and tracking statistics
SVA - Sequence Variability Analyser for multiple alignments
WebLogo - Sequence logos at Berkeley/USA
plogo - Sequence logos at CBS/Denmark
GENIO/logo - Sequence logos at Stuttgart/Germany
SeqLogo - Sequence logos at MIF/USA
WebLogo - Sequence logos at Cambridge/UK
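The height of each stack in a sequence logo is the information content of the corresponding alignment column: the maximum possible entropy (log2 of 20 for proteins) minus the observed Shannon entropy of the residues in that column. The sketch below computes this per column for a list of aligned sequences of equal length. It ignores gap characters and omits the small-sample correction that logo programs commonly apply, so the values are an approximation; the function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content in bits of one alignment column."""
    residues = [c for c in column if c not in "-."]   # skip gap characters
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((k / n) * math.log2(k / n)
                   for k in Counter(residues).values())
    return math.log2(alphabet_size) - entropy


def logo_information(alignment):
    """Per-column information content of equal-length aligned sequences."""
    return [column_information(col) for col in zip(*alignment)]


if __name__ == "__main__":
    alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLS"]  # toy alignment
    for pos, bits in enumerate(logo_information(alignment), start=1):
        print(pos, round(bits, 2))
```

A fully conserved column reaches the maximum of about 4.32 bits, and a column with all 20 residues at equal frequency scores zero. For nucleotide alignments alphabet_size would be 4.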

Phylogenetic analysis

Phylogenetic programs - List of phylogenetic packages and free servers (PHYLIP pages)
PHYLIP - Server for phylogenetic analysis using the PHYLIP package
BIONJ - Server for NJ phylogenetic analysis
PHYML - Server for ML phylogenetic analysis
PHYLIP - Package of programs for inferring phylogenies (Joe Felsenstein)
MOLPHY - Package of programs for phylogenetic analysis
MrBayes - Package for the Bayesian estimation of phylogeny
PAML - Package for phylogenetic analysis by Maximum Likelihood
TREE PUZZLE - Package for phylogenetic analysis by Maximum Likelihood
ConSurf - Projection of evolutionary conservation scores of residues on protein structures
Evolutionary Trace Server (TraceSuite II) - Maps evolutionary traces to structures

Biological text analysis

AcroMed - A computer generated database of biomedical acronyms and the associated long forms extracted from the recent Medline abstracts
BioMinT - Mining the biomedical literature
GPSDB - Gene and Protein Synonym DataBase
MedMiner - Extract and organize relevant sentences in the literature based on a gene, gene-gene or gene-drug query
XplorMed - Explore a set of abstracts derived from a bibliographic search in MEDLINE