The V-SAGE
The approach described here, termed 'Virtual SAGE' (V-SAGE), combines
the efficiency, speed, and reliability of data mining in classical SAGE
with the expediency of gene identification that characterizes EST analysis.
The concept is based on establishing a correlation between several tags
extracted from EST sequence collections at different distances from the
poly(A) region. Extracting a tag at the extreme 3'-end together with internal
tags from each EST sequence reduces complexity and allows similar transcripts
to be clustered into larger groups (populations of tags). BLAST analysis
then marks contigs, and 3'-terminal variants within these groups, mostly
in the 3'-UTR, are rapidly identified.
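The grouping idea can be illustrated with a minimal Python sketch. It
clusters tag records by their internal tag and reports the groups whose
members differ at the 3' end, which would be candidates for 3'-terminal
variants. The record layout (clone, polya_tag, internal_tag, distance),
the function names, and the choice of the internal tag as the grouping key
are assumptions made for this example; they are not taken from the V-SAGE
Perl script.

```python
from collections import defaultdict
from typing import NamedTuple


class TagRecord(NamedTuple):
    """Hypothetical record layout; not taken from the V-SAGE script."""
    clone: str         # clone name identifier of the EST
    polya_tag: str     # 10-base tag immediately upstream of the poly(A) region
    internal_tag: str  # 10-base tag next to the 3'-most restriction site
    distance: int      # nucleotides between that site and the poly(A) tag


def group_by_internal_tag(records):
    """Cluster records that share an internal tag into one population."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.internal_tag].append(rec)
    return dict(groups)


def candidate_3prime_variants(groups):
    """Return groups whose members differ in 3'-end tag or distance.

    Assumption: a shared internal tag with more than one distinct
    (poly(A) tag, distance) combination hints at a 3'-terminal variant.
    """
    variants = {}
    for internal_tag, members in groups.items():
        ends = {(m.polya_tag, m.distance) for m in members}
        if len(ends) > 1:
            variants[internal_tag] = sorted(ends)
    return variants


if __name__ == "__main__":
    # Invented demonstration records, not real EST data.
    demo = [
        TagRecord("c1", "AAACCCGGGT", "TTTTCCCCGG", 15),
        TagRecord("c2", "GGGTTTAAAC", "TTTTCCCCGG", 42),
        TagRecord("c3", "CCCCCCCCCC", "ACGTACGTAC", 7),
    ]
    for tag, ends in candidate_3prime_variants(group_by_internal_tag(demo)).items():
        print(tag, ends)
```

Grouping on the internal tag rather than on the 3'-end tag is a design
choice of this sketch: it keeps transcripts that share upstream sequence
together even when their poly(A)-adjacent tags differ.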
The V-SAGE software structure.
The V-SAGE software that has been generated is an in silico emulation
of the established SAGE protocol. The input data consist of a FASTA file
that contains one or more EST sequences from the library of interest that
have been determined from the 3' end. Data processing includes the following
steps:
1. Identifying the poly(A) region (minimum of eight A residues).
2. Extracting the 10-base tag immediately upstream of the poly(A) region.
3. Collecting the 10-base tag that is immediately 3'-adjacent to the most
3' cleavage site of a selected restriction endonuclease, e.g., NlaIII
(CATG), and recording the distance in nucleotides from this site to the
poly(A)-adjacent tag.
4. Assigning a clone name identifier to each pair of tags.
The result of the data processing consists of the set of tags located
upstream of the poly(A) region of each EST. This set is a unique identifier
for any transcript, which can then be used for further analyses, such as
a digital representation of cellular gene expression, studies of 3'-UTR
variability, and also as a map for regular SAGE transcript profiling.
The principle of V-SAGE is applicable to any number of sequence collections
and short oligonucleotide strings, with certain precautions. The choice
of restriction endonucleases that recognize 4-bp sites, for example, is
determined by the average length of the sequences targeted. In our example,
an average length of approximately 500 nucleotides provided a suitable
frequency of sites (the theoretical optimum is 256 bp, i.e. 4^4, the expected
spacing of a 4-bp site in random sequence; the experimental data provide
a range from 100 to 600 nucleotides). This length does not, however, allow
tags to be extracted at the sites of enzymes that recognize 6-bp sequences,
which would require sequences of at least 1 kb, and preferably longer.
When longer sequences are available, splice variants within the coding
region of transcripts, for example, can also be identified by V-SAGE.
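The four processing steps described above can be sketched in Python as
follows. This is only an illustration under stated assumptions, not the
code of the V-SAGE Perl script: the names read_fasta, extract_tags, run_vsage,
and TagPair are hypothetical; sequences are assumed to be in sense orientation;
the 3'-most run of at least eight A residues is taken as the poly(A) region;
the distance is counted from the end of the CATG site to the start of the
poly(A)-adjacent tag; and ESTs lacking a poly(A) run or a usable site are
skipped.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple

TAG_LEN = 10     # tag length in bases
MIN_POLYA = 8    # minimum number of A residues in the poly(A) region
SITE = "CATG"    # NlaIII recognition sequence


class TagPair(NamedTuple):
    clone: str         # step 4: clone name identifier
    polya_tag: str     # step 2: tag immediately upstream of the poly(A) region
    internal_tag: str  # step 3: tag immediately 3' of the 3'-most site
    distance: int      # step 3: nucleotides from site end to poly(A) tag start


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def extract_tags(clone: str, seq: str) -> Optional[TagPair]:
    """Apply steps 1 to 4 to one EST; return None if a step cannot be done."""
    seq = seq.upper()
    # Step 1: locate the poly(A) region (assumed: the 3'-most qualifying run).
    runs = list(re.finditer("A{%d,}" % MIN_POLYA, seq))
    if not runs or runs[-1].start() < TAG_LEN:
        return None
    polya_start = runs[-1].start()
    # Step 2: the 10 bases immediately upstream of the poly(A) region.
    tag_start = polya_start - TAG_LEN
    polya_tag = seq[tag_start:polya_start]
    # Step 3: 3'-most site whose adjacent tag still ends before the poly(A) run.
    site_pos = seq.rfind(SITE, 0, polya_start - TAG_LEN)
    if site_pos < 0:
        return None
    site_end = site_pos + len(SITE)
    internal_tag = seq[site_end:site_end + TAG_LEN]
    # Step 4: attach the clone name to the pair of tags.
    return TagPair(clone, polya_tag, internal_tag, tag_start - site_end)


def run_vsage(lines: Iterable[str]) -> List[TagPair]:
    """Process every EST in a FASTA input and keep those that yield a pair."""
    pairs = (extract_tags(name, seq) for name, seq in read_fasta(lines))
    return [p for p in pairs if p is not None]


if __name__ == "__main__":
    # Invented demonstration sequence, not real EST data.
    demo = [">clone1", "GGCATGTTTTCCCCGGACGTCGATCGATCGCAAAAAAAAAA"]
    for pair in run_vsage(demo):
        print(pair)
```

For a real library, the lines of a FASTA file can be passed directly, for
example run_vsage(open("library.fasta")), where the file name is a placeholder.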
Web executable V-SAGE script: browse to a local FASTA file, then hit the
submit button.
CGI programming by Michael Brukman (http://misha.brukman.net).
Perl script by Lihua Jiang.
To run V-SAGE on a local computer, download the file perlcode_v2.pl.