Frequently Asked Questions: Blat

Blat vs. Blast
Question:
"What are the differences between Blat and Blast?"
Response:
Blat is an alignment tool like BLAST, but it is structured differently. On
DNA, Blat works by keeping an index of an entire genome in memory.
Thus, the target database of Blat is not a set of GenBank sequences, but instead an
index derived from the assembly of the entire genome. The
index, which uses less than a gigabyte of RAM, consists of all non-overlapping 11-mers except those heavily involved in repeats. This small size means that Blat is far more easily
mirrored. Blat of DNA
is designed to quickly find sequences of 95% and greater similarity of length
40 bases or more. It may miss more divergent or short sequence alignments.
On proteins, Blat uses 4-mers rather than 11-mers, finding protein sequences
of 80% and greater similarity to the query of length 20+ amino acids. The protein index requires slightly more than 2 gigabytes of RAM.
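The idea of indexing non-overlapping k-mers can be sketched in a few lines of C. This is an illustration only, not the Blat source: the function names (baseCode, packKmer, indexTiles) and the A/C/G/T = 0/1/2/3 encoding are assumptions made for the example, and a real index would store genome positions rather than counts. For k = 11 the table would need 4^11 = 4,194,304 slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a base to a 2-bit code; returns -1 for anything else (e.g. N).
 * The ordering A=0, C=1, G=2, T=3 is an arbitrary choice for this sketch. */
int baseCode(char b)
{
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default: return -1;
    }
}

/* Pack k bases (1 <= k <= 16) into one integer key.
 * Returns 1 on success, 0 if a non-ACGT base is encountered. */
int packKmer(const char *seq, int k, uint32_t *key)
{
    uint32_t packed = 0;
    int i;
    assert(k > 0 && k <= 16);
    for (i = 0; i < k; i++) {
        int code = baseCode(seq[i]);
        if (code < 0)
            return 0;
        packed = (packed << 2) | (uint32_t)code;
    }
    *key = packed;
    return 1;
}

/* Walk the target in steps of k, so that tiles never overlap, and count
 * how often each k-mer occurs. counts must hold 4^k zeroed entries.
 * Returns the number of tiles that were indexed. */
size_t indexTiles(const char *target, size_t len, int k, uint32_t *counts)
{
    size_t pos, tiles = 0;
    uint32_t key;
    assert(k > 0 && k <= 16);
    for (pos = 0; pos + (size_t)k <= len; pos += (size_t)k) {
        if (packKmer(target + pos, k, &key)) {
            counts[key]++;
            tiles++;
        }
    }
    return tiles;
}
```

Because the target is tiled with a step equal to k, the index holds roughly one entry per k bases of genome, which is what keeps it small enough to stay in memory.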
In practice -- due to sequence divergence rates over evolutionary time -- DNA
Blat works well within humans and primates, while protein Blat
continues to find good matches within terrestrial vertebrates and even earlier
organisms for conserved proteins. Within humans, protein Blat gives a much better
picture of gene families (paralogs) than DNA Blat. However, BLAST and
psi-BLAST at NCBI can find much more remote matches.
From a practical standpoint, Blat has several advantages over BLAST:
- speed (no queues, response in seconds) at the price of lesser homology depth
- the ability to submit a long list of simultaneous queries in fasta format
- five convenient output sort options
- a direct link into the UCSC browser
- alignment block details in natural genomic order
- an option to launch the alignment later as part of a custom track
Blat is commonly used to look up the location of a
sequence in the genome or determine the exon structure of an mRNA, but expert
users can run large batch jobs and make internal parameter sensitivity
changes by installing command line Blat on their own Linux server.

Blat use restrictions
Question:
"I received a high-volume traffic warning from your Blat
server informing me that I had exceeded the server use
limitations. Can you give me information on the UCSC
Blat server use parameters?"
Response:
Due to the high demand on our Blat servers, we restrict
service for users who programmatically query Blat or do
large batch queries. Program-driven use of Blat is
limited to a maximum of one hit every 15
seconds and no more than 5,000 hits per day. Please limit
batch queries to 25 sequences or fewer.
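A client script can enforce these limits on its own side before each request. The following is a minimal sketch, assuming the caller tracks the time of its last hit and its running daily count; the names blatWaitSeconds and blatBatchCount are hypothetical and are not part of any UCSC tool.

```c
#include <assert.h>

enum {
    BLAT_MIN_INTERVAL_SEC = 15,   /* at most one hit every 15 seconds */
    BLAT_MAX_HITS_PER_DAY = 5000, /* at most 5,000 hits per day */
    BLAT_MAX_BATCH = 25           /* at most 25 sequences per batch query */
};

/* Seconds the caller should still wait before its next hit, or -1 if the
 * daily quota is already used up. now and lastHit are seconds on the same
 * clock; pass a negative lastHit when no hit has been made yet. */
long blatWaitSeconds(long now, long lastHit, int hitsToday)
{
    long elapsed;
    assert(hitsToday >= 0);
    if (hitsToday >= BLAT_MAX_HITS_PER_DAY)
        return -1;
    if (lastHit < 0)
        return 0;
    elapsed = now - lastHit;
    if (elapsed < 0)
        elapsed = 0;              /* clock moved backwards: wait the full interval */
    if (elapsed >= BLAT_MIN_INTERVAL_SEC)
        return 0;
    return BLAT_MIN_INTERVAL_SEC - elapsed;
}

/* Number of submissions needed to send seqCount sequences in batches of
 * at most BLAT_MAX_BATCH sequences each. */
int blatBatchCount(int seqCount)
{
    assert(seqCount >= 0);
    return (seqCount + BLAT_MAX_BATCH - 1) / BLAT_MAX_BATCH;
}
```

For example, a script holding 60 sequences would need three batch submissions, and at the 15-second spacing those would take at least 30 seconds from first to last.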
For users with high-volume Blat demands, we recommend
downloading Blat for local use. For more information,
see Downloading Blat source and
documentation.

Downloading Blat source and documentation
Question:
"Is the Blat source available for download? Is there
documentation available?"
Response:
Blat source and executables are freely available for
academic, nonprofit and personal use. Commercial licensing
information is available on the
Genome Blat website.
Blat source may be downloaded from
http://www.soe.ucsc.edu/~kent
(look for the blatSrc* zip file with the most recent
date). For
Blat executables, go to
http://www.soe.ucsc.edu/~kent/exe/; binaries are sorted by platform.
Documentation on Blat program specifications is available
here.

Replicating web-based Blat parameters in command-line version
Question:
"I'm setting up my own Blat server and would like to use
the same parameter values that the UCSC web-based Blat
server uses."
Response:
Use the following settings to replicate
the search results of the UCSC Blat server. Note that
you may still observe some slight differences between
command line results and web-based results, depending
on the search being performed.
gfServer:
- untranslated server:
gfServer start blatX portX -log=untrans.log *.2bit
- translated server:
gfServer start blatX portY -trans -mask -log=trans.log *.2bit
- untranslated server plus PCR server:
gfServer start blatX portX -stepSize=5 -log=untrans.log *.2bit
- For enabling DNA/DNA and DNA/RNA
matches, only the host, port and .2bit files are needed.
The same port may be used for both untranslated blat
and isPcr. You'll need a separate server on a separate
port to enable translated (protein-based) matches.
gfClient:
- Set -minScore=0 and
-minIdentity=0. This will result in some
low-scoring, generally spurious hits, but for
interactive use it's sufficiently easy to ignore them
(because results are sorted by score) and sometimes
the low-scoring hits come in handy. The
-ooc parameter is not set for
gfClient.
For more information on the parameters available for
blat, gfServer, and gfClient, see the
blat
specifications.

Using the -ooc flag
Question:
"What does the -ooc flag do?"
Response:
Using any -ooc option in blat, such
as -ooc=11.ooc, serves to speed up
searches, much as repeat-masking a sequence does. The
11.ooc file contains sequences
determined to be over-represented in the genome
sequence. To speed up searches, these sequences are not
used when seeding an alignment against the genome. For
reasonably-sized sequences, this will not create a
problem and will significantly reduce processing time.
By not using the 11.ooc file, you will increase
alignment time, but will also slightly increase
sensitivity. This may be important if you are aligning
shorter sequences or sequences of poor quality. For example,
if a particular sequence consists primarily of
sequences in the 11.ooc file, it will
never be seeded correctly for an alignment if the
-ooc flag is used.
In summary,
if you are not finding certain sequences and can afford
the extra processing time, you may want to run blat
without the 11.ooc file.
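The effect of masking over-represented k-mers on seeding can be sketched as follows. This is not how Blat reads or applies an .ooc file; it is a simplified illustration in which k-mers are already packed into integer keys, and the names markOverused and countUsableSeeds as well as the maxCount threshold are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag every k-mer slot whose genome-wide count is at least maxCount.
 * overused must have room for slots entries. Returns how many were flagged. */
size_t markOverused(const uint32_t *counts, size_t slots,
                    uint32_t maxCount, uint8_t *overused)
{
    size_t i, marked = 0;
    assert(counts != NULL && overused != NULL);
    for (i = 0; i < slots; i++) {
        overused[i] = (uint8_t)(counts[i] >= maxCount);
        marked += overused[i];
    }
    return marked;
}

/* Count the query k-mers that may still seed an alignment, i.e. those
 * that are not flagged as over-represented. */
size_t countUsableSeeds(const uint32_t *queryKeys, size_t n,
                        const uint8_t *overused)
{
    size_t i, usable = 0;
    assert(queryKeys != NULL && overused != NULL);
    for (i = 0; i < n; i++)
        if (!overused[queryKeys[i]])
            usable++;
    return usable;
}
```

A query whose k-mers are all flagged ends up with zero usable seeds, which is the case described above where a sequence is never seeded while the -ooc flag is in use.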

Percent identity score calculation
Question:
"How is the percent identity score calculated?"
Response:
The percent identity score is calculated like this:
100.0 - pslCalcMilliBad(psl, TRUE) * 0.1
Here is the source for pslCalcMilliBad:
    int pslCalcMilliBad(struct psl *psl, boolean isMrna)
    /* Calculate badness in parts per thousand. */
    {
    int sizeMul = pslIsProtein(psl) ? 3 : 1;
    int qAliSize, tAliSize, aliSize;
    int milliBad;
    int sizeDif;
    int insertFactor;

    qAliSize = sizeMul * (psl->qEnd - psl->qStart);
    tAliSize = psl->tEnd - psl->tStart;
    aliSize = min(qAliSize, tAliSize);
    if (aliSize <= 0)
        return 0;
    sizeDif = qAliSize - tAliSize;
    if (sizeDif < 0)
        {
        if (isMrna)
            sizeDif = 0;
        else
            sizeDif = -sizeDif;
        }
    insertFactor = psl->qNumInsert;
    if (!isMrna)
        insertFactor += psl->tNumInsert;
    milliBad = (1000 * (psl->misMatch*sizeMul + insertFactor +
        round(3*log(1+sizeDif)))) / (sizeMul * (psl->match + psl->repMatch +
        psl->misMatch));
    return milliBad;
    }
The complexity in milliBad arises primarily from how it
handles inserts. Ignoring the inserts, the calculation
is simply mismatches expressed as parts per thousand.
However, the algorithm also factors in insertion penalties,
which are relatively weak compared to, say, BLAST's
but still present. When huge inserts are allowed (which is
necessary to accommodate introns), it is typically
necessary to resort to logarithms, as this calculation
does.
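To make the arithmetic concrete, the calculation can be written out as a formula and applied to a made-up alignment. The symbols below are shorthand introduced here for the psl fields and local variables in pslCalcMilliBad (s = sizeMul, M = match, R = repMatch, m = misMatch, I = insertFactor, d = sizeDif), and all of the numbers in the example are hypothetical. The C log function is the natural logarithm, and assigning the result to an int truncates it.

```latex
\[
\text{milliBad} = \left\lfloor
  \frac{1000\,\bigl(m \cdot s + I + \operatorname{round}(3\ln(1+d))\bigr)}
       {s\,(M + R + m)}
\right\rfloor,
\qquad
\text{identity} = 100 - 0.1 \cdot \text{milliBad}
\]

% Hypothetical nucleotide alignment (s = 1):
%   M = 990, R = 0, m = 10, qNumInsert = 1, tNumInsert = 1,
%   query span = 1000 bases, target span = 6000 bases (one intron).

% isMrna = TRUE: the negative size difference is clipped to d = 0, and I = 1.
\[
\text{milliBad} = \left\lfloor \frac{1000\,(10 + 1 + 0)}{1000} \right\rfloor = 11,
\qquad
\text{identity} = 100 - 1.1 = 98.9\%
\]

% isMrna = FALSE: d = |1000 - 6000| = 5000, and I = 1 + 1 = 2.
\[
\operatorname{round}(3\ln 5001) = \operatorname{round}(25.55) = 26,
\qquad
\text{milliBad} = \left\lfloor \frac{1000\,(10 + 2 + 26)}{1000} \right\rfloor = 38,
\qquad
\text{identity} = 96.2\%
\]
```

The two cases show why the mRNA setting matters: a 5,000-base gap in the target costs nothing when it is treated as an intron, and only 26 parts per thousand when it is penalized, because the penalty grows with the logarithm of the gap size.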