UCSC Genome Bioinformatics
  Frequently Asked Questions: Data File Formats
 




  BED format
 

BED format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track.

The first three required BED fields are:

  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or contig (e.g. ctgY1).
  2. chromStart - The starting position of the feature in the chromosome or contig. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or contig. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

The 9 additional optional BED fields are:

  1. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
  2. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).
  3. strand - Defines the strand - either '+' or '-'.
  4. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  5. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  6. reserved - This should always be set to zero.
  7. blockCount - The number of blocks (exons) in the BED line.
  8. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  9. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

Example:
Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr1 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr1 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
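
As a rough illustration of how the block fields relate to chromStart, the following Python sketch parses the first data line above and converts the relative blockStarts into absolute chromosome intervals. The helper name parse_bed12 is hypothetical and not part of any UCSC tool, and the sketch assumes a whitespace-separated line carrying all 12 fields.

```python
def parse_bed12(line):
    """Split a 12-field BED line and return (chrom, name, blocks).

    Hypothetical helper for illustration only. Blocks are returned as
    absolute (start, end) intervals where start is 0-based and end is
    not included, matching chromStart and chromEnd.
    """
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected 12 BED fields, got %d" % len(fields))
    chrom = fields[0]
    chrom_start = int(fields[1])
    name = fields[3]
    block_count = int(fields[9])
    # The lists may carry a trailing comma, so empty items are dropped.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    if len(block_sizes) != block_count or len(block_starts) != block_count:
        raise ValueError("blockCount does not match the list lengths")
    # blockStarts are relative to chromStart.
    blocks = [(chrom_start + start, chrom_start + start + size)
              for start, size in zip(block_starts, block_sizes)]
    return chrom, name, blocks


chrom, name, blocks = parse_bed12(
    "chr1 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
print(chrom, name, blocks)  # chr1 cloneA [(1000, 1567), (4512, 5000)]
```

Note that the last block ends exactly at chromEnd (5000), which is a useful sanity check on a complete BED line.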


  PSL format
 

PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. See the BLAT documentation for more details. All of the following fields are required on each data line within a PSL file:

  1. matches - Number of bases that match that aren't repeats
  2. misMatches - Number of bases that don't match
  3. repMatches - Number of bases that match but are part of repeats
  4. nCount - Number of 'N' bases
  5. qNumInsert - Number of inserts in query
  6. qBaseInsert - Number of bases inserted in query
  7. tNumInsert - Number of inserts in target
  8. tBaseInsert - Number of bases inserted in target
  9. strand - '+' or '-' for query strand. For translated alignments, the second '+' or '-' is for the genomic strand
  10. qName - Query sequence name
  11. qSize - Query sequence size
  12. qStart - Alignment start position in query
  13. qEnd - Alignment end position in query
  14. tName - Target sequence name
  15. tSize - Target sequence size
  16. tStart - Alignment start position in target
  17. tEnd - Alignment end position in target
  18. blockCount - Number of blocks in the alignment (a block contains no gaps)
  19. blockSizes - Comma-separated list of sizes of each block
  20. qStarts - Comma-separated list of starting positions of each block in query
  21. tStarts - Comma-separated list of starting positions of each block in target

Example:
Here is an example of an annotation track in PSL format. Note that line breaks have been inserted into the PSL lines in this example for documentation display purposes. Click here for a copy of this example that can be pasted into the browser without editing.

track name=fishBlats description="Fish BLAT" useScore=1
59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr1
    47748585 13073589 13073753 2 48,20,  171,1042,  34674832,34674976,
59 7 0 0 1 55 1 55 +- FS_CONTIG_26780_1 2825 2456 2577 chr1
    47748585 13073626 13073747 2 21,45,  2456,2532,  34674838,34674914,
59 7 0 0 1 55 1 55 -+ FS_CONTIG_26780_1 2825 2455 2676 chr1
    47748585 13073727 13073848 2 45,21,  249,349,  13073727,13073827,
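
The 21 fields listed above map naturally onto a record keyed by field name. The Python sketch below is one possible way to read a single PSL data line; the names PSL_FIELDS and parse_psl are hypothetical, and the sketch assumes the display line break has already been removed so that all 21 fields sit on one line.

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl(line):
    """Return a dict of the 21 PSL fields (hypothetical helper)."""
    words = line.split()
    if len(words) != len(PSL_FIELDS):
        raise ValueError("expected 21 PSL fields, got %d" % len(words))
    record = {}
    for name, word in zip(PSL_FIELDS, words):
        if name in TEXT_FIELDS:
            record[name] = word
        elif name in LIST_FIELDS:
            # Comma-separated lists end with a trailing comma.
            record[name] = [int(x) for x in word.split(",") if x]
        else:
            record[name] = int(word)
    if len(record["blockSizes"]) != record["blockCount"]:
        raise ValueError("blockCount does not match blockSizes")
    return record


# First data line of the example, joined back onto a single line.
psl = parse_psl(
    "59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr1 "
    "47748585 13073589 13073753 2 48,20, 171,1042, 34674832,34674976,")
print(psl["qName"], psl["strand"], psl["blockSizes"], psl["tStarts"])
```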

Be aware that the coordinates for a negative strand in a PSL line are handled in a special way. In the qStart and qEnd fields, the coordinates indicate the position where the query matches from the point of view of the forward strand, even when the match is on the reverse strand. However, in the qStarts list, the coordinates are reversed.

Example:
Here is a 31-mer (positions 0 through 30) containing 2 blocks that align on the minus strand and 2 blocks that align on the plus strand (this can sometimes happen in response to assembly errors):

0         1         2         3 tens position in query   
0123456789012345678901234567890 ones position in query   
            ++++          +++++ plus strand alignment on query   
    --------    ----------      minus strand alignment on query   
Plus strand:   
     qStart=12 
     qEnd=31 
     blockSizes=4,5 
     qStarts=12,26   
                      
Minus strand:   
     qStart=4 
     qEnd=26 
     blockSizes=10,8 
     qStarts=5,19    

Essentially, the minus strand blockSizes and qStarts are what you would get if you reverse-complemented the query. However, the qStart and qEnd are not reversed. To convert one to the other:

     qStart = qSize - revQEnd
     qEnd = qSize - revQStart
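
The conversion can be applied block by block. The Python sketch below maps each qStarts entry back to forward-strand query coordinates; the function name forward_query_blocks is hypothetical. Because the query drawn above occupies positions 0 through 30, the sketch assumes qSize = 31.

```python
def forward_query_blocks(q_size, strand, block_sizes, q_starts):
    """Return query blocks as forward-strand (start, end) pairs.

    Hypothetical helper. For a '-' query strand the qStarts values are
    offsets on the reverse-complemented query, so each block is mapped
    back with start = qSize - revEnd and end = qSize - revStart.
    Ends are not included in the block.
    """
    blocks = []
    for size, start in zip(block_sizes, q_starts):
        end = start + size
        if strand.startswith("-"):
            start, end = q_size - end, q_size - start
        blocks.append((start, end))
    return sorted(blocks)


# Minus strand alignment from the example above.
minus = forward_query_blocks(31, "-", [10, 8], [5, 19])
print(minus)                      # [(4, 12), (16, 26)]
print(minus[0][0], minus[-1][1])  # 4 26, the qStart and qEnd shown above
```

The first and last coordinates of the converted blocks reproduce the qStart and qEnd values given for the minus strand, which are already expressed on the forward strand.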


  GFF format
 

GFF (General Feature Format) lines are based on the GFF standard file format. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. For more information on GFF format, refer to http://www.sanger.ac.uk/Software/formats/GFF.

Here is a brief description of the GFF fields:

  1. seqname - The name of the sequence. Must be a chromosome or a contig.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
  9. group - All lines with the same group are linked together into a single item.

Example:
Here's an example of a GFF-based track. Click here for a copy of this example that can be pasted into the browser without editing. NOTE: Paste operations on some operating systems will replace tabs with spaces, which will result in an error when the GFF track is uploaded. You can circumvent this problem by pasting the URL of the above example (http://genome.ucsc.edu/goldenPath/help/regulatory.txt) instead of the text itself into the custom annotation track text box.

track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr1  TeleGene enhancer  1000000  1001000  500 +  .  touch1
chr1  TeleGene promoter  1010000  1010100  900 +  .  touch1
chr1  TeleGene promoter  1020000  1020000  800 -  .  touch2
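
GFF and BED number positions differently: GFF start is 1-based and end is inclusive, while BED chromStart is 0-based and chromEnd is not included. The Python sketch below shows the resulting off-by-one adjustment for one GFF line. The function name gff_to_bed_interval is hypothetical, and using the group field as the BED name is an assumption made for illustration. The input string uses explicit tab characters, as GFF requires.

```python
def gff_to_bed_interval(line):
    """Return (chrom, chromStart, chromEnd, name) for one GFF line.

    Hypothetical helper. Only the start shifts by one: a 1-based
    inclusive end and a 0-based excluded end are the same number.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF fields")
    seqname, start, end, group = fields[0], fields[3], fields[4], fields[8]
    return seqname, int(start) - 1, int(end), group


bed = gff_to_bed_interval(
    "chr1\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1")
print(bed)  # ('chr1', 999999, 1001000, 'touch1')
```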


  GTF format
 

GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into an attribute field that includes a list of semicolon-separated attribute/value pairs. For more information on this format, see http://genes.cs.wustl.edu/GTF2.html.

Some examples of entries for the attribute field include:

  • gene_id value - A globally unique identifier for the genomic source of the sequence.
  • transcript_id value - A globally unique identifier for the predicted transcript.

Example:
Here is an example of the ninth field in a GTF data line:

gene_id Em:U62317.C22.6.mRNA; transcript_id Em:U62317.C22.6.mRNA; exon_number 1

The Genome Browser groups together GTF lines that have the same transcript_id value. It only looks at features of type exon and CDS.
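
The attribute field can be split into its attribute/value pairs with a few lines of Python. The sketch below is illustrative only: the function name parse_gtf_attributes is hypothetical, and it assumes that no value contains a semicolon and that any double quotes around a value should be stripped.

```python
def parse_gtf_attributes(attribute_field):
    """Parse a GTF attribute field into a dict of name -> value strings.

    Hypothetical helper. Pairs are separated by ';' and each pair is a
    name, white space, then a value.
    """
    attributes = {}
    for pair in attribute_field.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        name, _, value = pair.partition(" ")
        attributes[name] = value.strip().strip('"')
    return attributes


attrs = parse_gtf_attributes(
    "gene_id Em:U62317.C22.6.mRNA; transcript_id Em:U62317.C22.6.mRNA; "
    "exon_number 1")
print(attrs["transcript_id"], attrs["exon_number"])
```

Lines that share the same transcript_id value in the resulting dictionary would then belong to the same displayed item.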



  MAF format
 

The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read. This format stores multiple alignments at the DNA level between entire genomes. The previously existing formats are suitable for multiple alignments of single proteins or regions of DNA without rearrangements, and would require considerable extension to cope with genomic issues such as forward and reverse strand directions, multiple pieces to the alignment, and so forth.

General Structure

The .maf format is line-oriented. Each multiple alignment ends with a blank line. Each sequence in an alignment is on a single line, which can get quite long, but there is no length limit. Words in a line are delimited by any white space. Lines starting with # are considered to be comments. Lines starting with ## can be ignored by most programs, but contain meta-data of one form or another.

The file is divided into paragraphs that terminate in a blank line. Within a paragraph, the first word of a line indicates its type. Each multiple alignment is in a separate paragraph that begins with an "a" line and contains an "s" line for each sequence in the multiple alignment. For now, parsers should ignore other types of paragraphs and other types of lines within an alignment paragraph.

The Header Line

The first line of a .maf file begins with ##maf. This word is followed by white-space-separated variable=value pairs. There should be no white space surrounding the "=".

 ##maf version=1 scoring=tba.v8 
The currently defined variables are:
  • version - Required. Currently set to one.
  • scoring - Optional. A name for the scoring scheme used for the alignments. The current scoring schemes are:
    • bit - roughly corresponds to blast bit values (roughly 2 points per aligning base minus penalties for mismatches and inserts).
    • blastz - blastz scoring scheme -- roughly 100 points per aligning base.
    • probability - some score normalized between 0 and 1.
  • program - Optional. Name of the program generating the alignment.
Undefined variables are ignored by the parser.
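
A header line of this form can be read as a small dictionary. The Python sketch below is a minimal illustration; the function name parse_maf_header is hypothetical, and treating a missing version as an error is this sketch's reading of "Required" rather than documented parser behavior.

```python
def parse_maf_header(line):
    """Return the variable=value pairs of a ##maf header line as a dict.

    Hypothetical helper. Values are kept as strings, and variables that
    are not defined by the format are kept but can simply be ignored.
    """
    words = line.split()
    if not words or words[0] != "##maf":
        raise ValueError("not a MAF header line")
    variables = {}
    for word in words[1:]:
        name, sep, value = word.partition("=")
        if not sep:
            raise ValueError("expected variable=value, got %r" % word)
        variables[name] = value
    if "version" not in variables:
        raise ValueError("the version variable is required")
    return variables


header = parse_maf_header(" ##maf version=1 scoring=tba.v8 ")
print(header)  # {'version': '1', 'scoring': 'tba.v8'}
```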

The Alignments Parameter Line

The second line displays the parameters that were used to run the alignment program.

 # tba.v8 (((human chimp) baboon) (mouse rat))

Alignment Block Lines (lines starting with 'a' -- parameters for a new alignment block)

 a score=23262.0
Each alignment begins with an 'a' line that sets variables for the entire alignment block. The 'a' is followed by name=value pairs. There are no required name=value pairs. The currently defined variables are:
  • score -- Optional. Floating point score. If this is present, it is good practice to also define scoring in the first line.
  • pass -- Optional. Positive integer value. For programs that do multiple pass alignments such as blastz, this shows which pass this alignment came from. Typically, pass 1 will find the strongest alignments genome-wide, and pass 2 will find weaker alignments between two first-pass alignments.

Lines starting with 's' -- a sequence within an alignment block

 
 s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
 s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
 s baboon         249182 13 +   4622798 gcagctgaaaaca
 s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA
The 's' lines together with the 'a' lines define a multiple alignment. The 's' lines have the following fields which are defined by position rather than name=value pairs.
  • src -- The name of one of the source sequences for the alignment. For sequences that are resident in a browser assembly, the form 'database.chromosome' allows automatic creation of links to other assemblies. Non-browser sequences are typically referenced by the species name alone.
  • start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is '-' then this is the start relative to the reverse-complemented source sequence.
  • size -- The size of the aligning region in the source sequence. This number is equal to the number of non-dash characters in the alignment text field below.
  • strand -- Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.
  • srcSize -- The size of the entire source sequence, not just the parts involved in the alignment.
  • text -- The nucleotides (or amino acids) in the alignment and any insertions (dashes) as well.
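
Because the 's' line fields are positional, they can be read with a simple split on white space. The Python sketch below parses one 's' line, checks that size equals the number of non-dash characters in the text, and converts a '-' strand start into forward-strand coordinates. The names MafSeq, parse_s_line and forward_range are hypothetical.

```python
from collections import namedtuple

MafSeq = namedtuple("MafSeq", "src start size strand src_size text")


def parse_s_line(line):
    """Parse one MAF 's' line into a MafSeq record (hypothetical helper)."""
    words = line.split()
    if len(words) != 7 or words[0] != "s":
        raise ValueError("not a MAF 's' line")
    seq = MafSeq(words[1], int(words[2]), int(words[3]), words[4],
                 int(words[5]), words[6])
    # size counts only the non-dash characters of the alignment text.
    if seq.size != len(seq.text.replace("-", "")):
        raise ValueError("size does not match the alignment text")
    return seq


def forward_range(seq):
    """Return the zero-based (start, end) range on the forward strand.

    For a '-' strand, start is relative to the reverse-complemented
    source, so the range is mirrored using srcSize. The end position
    is not included.
    """
    if seq.strand == "-":
        return seq.src_size - seq.start - seq.size, seq.src_size - seq.start
    return seq.start, seq.start + seq.size


seq = parse_s_line("s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca")
print(seq.src, forward_range(seq))  # hg16.chr7 (27707221, 27707234)
```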

A Simple Example

Here is a simple example of three alignment blocks derived from five starting sequences. Repeats are shown as lowercase, and each block may have a subset of the input sequences. All sequence columns and rows must contain at least one nucleotide (no full columns or rows of insertions).

 
 ##maf version=1 scoring=tba.v8 
 # tba.v8 (((human chimp) baboon) (mouse rat)) 
 # multiz.v7
 # maf_project.v5 _tba_right.maf3 mouse _tba_C
 # single_cov2.v4 single_cov2 /dev/stdin
                    
 a score=23262.0     
 s hg16.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
 s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
 s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
                    
 a score=5062.0                    
 s hg16.chr7    27699739 6 + 158545518 TAAAGA
 s panTro1.chr6 28862317 6 + 161576975 TAAAGA
 s baboon         241163 6 +   4622798 TAAAGA 
 s mm4.chr6     53303881 6 + 151104725 TAAAGA
 s rn3.chr4     81444246 6 + 187371129 taagga

 a score=6636.0
 s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
 s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
 s baboon         249182 13 +   4622798 gcagctgaaaaca
 s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA


  WIG format
 

Wiggle format (WIG) allows the display of continuous-valued data in a track format. Click here for more information.