FlyBase Reference Manual D. Bulk FlyBase Data Retrieval
Last Updated: 15 June 2005

D.1. Bulk FlyBase data by Web query

D.1.1. Bulk reports from a search page
D.1.2. Batch download by ID
D.1.3. Bulk retrieval for any query

D.2. Bulk FlyBase data by FTP

D.2.1. Sequence and Annotation Data by FTP or rsync
D.2.2. Bulk data in ACODE and XML formats
D.2.3. Retrieving files

D.1. Bulk FlyBase Data by Web query

From the public FlyBase web sites, you can search and retrieve bulk data through the web interface, either interactively or through program calls, and receive parsable, non-HTML output in table/spreadsheet or database formats. Calls to FlyBase can be automated from perl or other programs to pull data. With the proper settings, you can bypass most of the changeable, hard-to-parse HTML in favor of spreadsheet table data for simple subsets, or obtain the full complexity of the data in ACODE or XML formats.

D.1.1. Bulk Reports from a Search Web Page

From a FlyBase server, submit a Genes or other data section query. Use the default settings to receive all records for D. melanogaster, or specify additional search criteria as desired.
Query: [libs={FBgn PFgn}-org:Dmel] but not [libs-cla:uncertain], No. matches= 37278 (this number will change with each Genes update)
Then choose Batch Download All items in, for instance, "Spreadsheet, tabbed" format; this returns the following summary fields:

FlyBase_ID 
Symbol 
Full_name 
Class_of_gene 
Date 
Gene_product 
Molecular_function  (Gene Ontology) 
Biological_process  (Gene Ontology) 
Cellular_component  (Gene Ontology) 
Rep._DNA_sequence 
Rep._protein_sequence 
Protein_domains 
Genomic_sequence_analysis 
Cytogenetic_map 
Recombination_map 
Synonyms 
Similar_genes 
Expression_pattern  
Allele_phenotypes
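
As a minimal sketch of consuming the "Spreadsheet, tabbed" output, the Python below reads tab-separated text into one dictionary per record. It assumes that the columns arrive in the order listed above and that the download carries no header row; both assumptions should be checked against a real download. The sample record is invented for illustration.

```python
import csv
import io

# Column names as listed above; their order in a real download is an assumption.
FIELDS = [
    "FlyBase_ID", "Symbol", "Full_name", "Class_of_gene", "Date",
    "Gene_product", "Molecular_function", "Biological_process",
    "Cellular_component", "Rep._DNA_sequence", "Rep._protein_sequence",
    "Protein_domains", "Genomic_sequence_analysis", "Cytogenetic_map",
    "Recombination_map", "Synonyms", "Similar_genes", "Expression_pattern",
    "Allele_phenotypes",
]


def read_tabbed_report(text):
    """Parse tab-separated report text into a list of dicts keyed by FIELDS."""
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    )
    return [dict(row) for row in reader]


# Invented sample record, padded with empty columns to the full field count.
sample = "\t".join(["FBgn0000000", "example", "example gene"] + [""] * 16) + "\n"
records = read_tabbed_report(sample)
print(records[0]["FlyBase_ID"], records[0]["Symbol"])
```

The csv module returns every value as a string, so gene symbols and band values are not reinterpreted as dates or numbers on import.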

Alternatively, you can select specific fields of data in the Batch Download section. Click the "Select fields" hyperlink, and select any subset of the available data fields for your table. The field codes used by Select Fields are described at http://flybase.net/docs/lk/fbhelp/refman-fld-list.html. You can also type these codes into the Select Fields box. NOTE: If the "Select Fields" facility is used here, then the "Report content" option is ignored and defaults to Abridged report.

Limitations on Formats

There are limitations on the contents returned by different format selections:

The Document formats (text, hypertext) return the same contents as individual web reports and respect the "Report content" option (synopsis, abridged, full, etc.) unless "Select Fields" is used. The Document formats also respect any Select Fields options; when these are used, the "Report content" option is ignored and the data retrieved are summary data, equivalent to the Synopsis and Abridged Report Contents shown in FlyBase documents. Output in the other formats (Spreadsheet, Database) does not change in response to the "Report content" option.
The Spreadsheet formats draw on top-level, summarized data fields, equivalent to the Synopsis and Abridged Report Contents shown in FlyBase documents. See Reference Manual C.4 for further description of data selection. Spreadsheet formats are not fully compatible with the complex hierarchical structure of FlyBase data, as seen in Full Report content. Spreadsheet format is designed for people who want summary data for easy use in spreadsheets. If you want comprehensive data, or data enclosed in the substructure (such as you find in the gene Full Report), you will need to use Database or Document formats without using the "Select Fields" facility.
The Database formats always return all contents of data records, regardless of field settings or report content options. They are best suited to data miners and others with programming tools to parse the database formats.

Warning about Microsoft Excel Spreadsheets: By default, Excel automatically reformats, and can thereby corrupt, data imported from many bioinformatics sources. Gene names can be converted to erroneous dates or other values, chromosome band values can be converted to erroneous floating point numbers, et cetera. This problem with the "General" cell format can be avoided by importing fields in "Text" format. FlyBase will endeavor to add Excel-specific output in the future to help avoid this. See "Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics" by Zeeberg et al., BMC Bioinformatics 2004, 5:80, doi:10.1186/1471-2105-5-80.

D.1.2. Batch Download by ID

Batch Download by ID supports retrieval of FlyBase data in computable forms using a list of FlyBase IDs or valid symbols. Details of fields available per data class, and some example data mining perl scripts, are at http://flybase.net/docs/lk/fbhelp/.

NOTE: New Genes data that have not yet been integrated with the full FlyBase data set are not available via this tool.

Select the set of fields you want in reports via the Select Fields box. The default field set for genes is:
ID,GSYM,NAM,CLA,DT,GPD,ENZ,FNC,CEL,RSQ,RPA,PDOM,ASQ,CLOC,GLOC,SYN,HG
and corresponds to the list in Refman D.1.1. The default field set changes for each data class, as the available fields differ in data classes. Field tags are case sensitive.

When you have multiple IDs, use the Batch Download by ID form. For a single ID, the best URL to use is
http://flybase.net/.bin/fbidq.html?FBID (for HTML format) or
http://flybase.net/.bin/fbidq?FBID&mimetype=xxx

Where mimetype=xxx may be one of text/html, text/plain, text/tsv, text/csv, text/xml, text/acode. These formats correspond to the above choices of Document Hypertext, Document Text, Spreadsheet tabbed, Spreadsheet commas, Database XML and Database Acode. The Limitations on Formats discussed in Refman D.1.1. apply to returns from these formats.

Append '&fields=ID,GSYM,NAM,...' to the URL to select specific data fields.
The 'asksrs' program searches across various data fields, which is unnecessary and more costly when you already have an ID.
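
The single-ID URL scheme described above can be assembled programmatically. The following Python is a minimal sketch that only builds the URL string and makes no network request; the function name build_fbidq_url and the placeholder ID FBgn0000000 are illustrative assumptions, while the base URL, the mimetype values and the fields parameter are taken from the text above.

```python
# Base URL and parameter names are taken from the text above.
FBIDQ_URL = "http://flybase.net/.bin/fbidq"

# MIME types named above and the download formats they correspond to.
FORMATS = {
    "text/html": "Document Hypertext",
    "text/plain": "Document Text",
    "text/tsv": "Spreadsheet tabbed",
    "text/csv": "Spreadsheet commas",
    "text/xml": "Database XML",
    "text/acode": "Database Acode",
}


def build_fbidq_url(fbid, mimetype="text/tsv", fields=None):
    """Return the single-ID report URL for fbid in the requested format."""
    if mimetype not in FORMATS:
        raise ValueError("unsupported mimetype: %s" % mimetype)
    url = "%s?%s&mimetype=%s" % (FBIDQ_URL, fbid, mimetype)
    if fields:
        # Field tags are case sensitive, so they are passed through unchanged.
        url += "&fields=" + ",".join(fields)
    return url


# FBgn0000000 is a placeholder ID used only for illustration.
print(build_fbidq_url("FBgn0000000", "text/tsv", ["ID", "GSYM", "NAM"]))
```

Rejecting unknown mimetype values up front is a design choice of this sketch; how the server itself responds to an unlisted value is not described here.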

If you have a Drosophila Gene Collection (DGC) EST clone ID but not a FlyBase ID, use the correspondence table for FlyBase ID and DGC ID at http://flybase.net/annot/, in the Data section, as FlyBase Gene - DGC association table.txt.

D.1.3. Bulk Retrieval for Any Query

To automate any query, run it once by hand and copy the Bookmark URL shown at the top of the query results:
Bookmark FBquery: [FBgn-org:Dmel] & [FBgn-cla:pseudogene]
URL= http://flybase.org/.bin/asksrs.html?%5BFBgn-org%3ADmel%5D%26%5BFBgn-cla%3Apseudogene%5D
Note: the '%xx' codes are encodings of symbols in the URL; the unencoded version will not work properly:
http://flybase.org/.bin/asksrs.html?[FBgn-org:Dmel]&[FBgn-cla:pseudogene]

Then edit the URL to perform a batch download of data:

for Tabbed format, add '-gvtext/tsv&' after '?'
for Comma format, add '-gvtext/csv&' after '?'
for Acode format, add '-gvtext/acode&' after '?'
for XML format, add '-gvtext/xml&' after '?'
for plain text format, add '-gvtext/plain&' after '?'
for HTML format, add '-gvtext/html&' after '?'
for all results in one batch, add '-m99999&' after '?' (default is 20 matches/page)

URL= http://flybase.org/.bin/asksrs.html?-m99999&-gvtext/tsv&%5BFBgn-org%3ADmel%5D%26%5BFBgn-cla%3Apseudogene%5D
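
The URL editing steps above can be sketched in Python as follows. The function name build_batch_url is an assumption made for illustration; the base URL, the '-m' and '-gv' options and the example query come from the text above. The sketch only constructs the URL and does not contact the server.

```python
from urllib.parse import quote

ASKSRS_URL = "http://flybase.org/.bin/asksrs.html"


def build_batch_url(query, mimetype="text/tsv", max_matches=99999):
    """Return an asksrs URL that downloads all matches for query in one batch."""
    # safe="" forces ':' and '&' (and '/') to be percent-encoded as well.
    encoded = quote(query, safe="")
    return "%s?-m%d&-gv%s&%s" % (ASKSRS_URL, max_matches, mimetype, encoded)


url = build_batch_url("[FBgn-org:Dmel]&[FBgn-cla:pseudogene]")
print(url)
```

For the pseudogene query, this produces the same URL as the hand-edited example above.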

D.2. Bulk FlyBase Data by FTP

The whole of FlyBase is available by File Transfer Protocol (FTP). For simplified access to Sequence and Annotation data, see below.

Connect to flybase.org (129.79.225.25). Log in with the user name "anonymous" and use your e-mail address as the password. FlyBase is in the directory flybase/.

The complete file structure of the flybase/ directory can be obtained by the following command within FTP:

ls -lR file_name
where file_name is a file on your computer to receive the listing.

A detailed description of many of the data files within each directory is provided in FlyBase Reference Manual B - Detailed Descriptions of FlyBase Structure and Data. The installation of FlyBase on IUBio makes considerable use of symbolic links between file names. This means that a file may reside in one directory but can be accessed by a different filename in another directory. For example, the file flybase/genes/synonyms.doc is linked to the flybase/docs directory, so if you examine that directory you will see this file as:

synonyms.doc -> ../genes/synonyms.doc

These links are not always made explicit in the FlyBase Reference Manual. Rather, we describe the file structure in the way we think will most interest a biologist.
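
When post-processing a saved directory listing, symbolic links can be told apart from regular files by the " -> " marker shown above. The following Python is a minimal sketch, assuming Unix-style listing lines in which the link name and its target are separated by " -> " and the link name contains no spaces. The function name parse_symlink and the sample line, including its permission and date columns, are invented for illustration.

```python
def parse_symlink(listing_line):
    """Return (link_name, target) for a symlink listing line, else None."""
    if " -> " not in listing_line:
        return None
    left, target = listing_line.rsplit(" -> ", 1)
    parts = left.split()
    if not parts:
        return None
    # The link name is the last whitespace-separated token before the arrow.
    return parts[-1], target.strip()


# Invented long-listing line; only the "name -> target" part comes from the text.
line = "lrwxrwxrwx 1 ftp ftp 21 Jun 15 2005 synonyms.doc -> ../genes/synonyms.doc"
print(parse_symlink(line))
```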

D.2.1. Sequence and Annotation Data by FTP or rsync

Sequence and annotation data can be retrieved in bulk from the Data section of the Genome Annotation and Sequences page.

FlyBase D. melanogaster genome sequence download page
D. melanogaster sequence and annotation files (fasta, gff, pgsql, xml-game)
D. pseudoobscura sequence and annotation files (fasta, gff, pgsql)
Other sequence data

D.2.2. Bulk Data in ACODE and XML Formats

FlyBase provides a set of text files that cover all the data served to the public. They are organized by data class: genes, aberrations, references, peptides, etc., including body part vocabulary and gene ontology. Each data class file has a set of records, one per FlyBase ID, for these data, and some number of fields and subrecords in each record.

The native format we use is "acode", a field key=value structure that has some hierarchy that is simpler and more efficient than XML. We also make this available in XML format. These data are found at
ftp://flybase.org/flybase-data/
In the acode/data/ section is the native FlyBase data format, with accessory information (srs parsers, etc.) in other acode/.. folders. The XML format of this is in the flybase-data/xml/ folder.

Find descriptions of the data field codes in these files in the acode/data-docs/ folder, or at http://flybase.org/.data/docs/fbhelp/

D.2.3. Retrieving Files

Once logged in to an FTP server, the following commands can be used to copy one or more FlyBase files to your own computer:

ftp> cd directory-name (e.g., cd flybase if using IUBio)
ftp> get genes/genes.rpt (and so on for each file wanted), or
ftp> mget *.rpt (to retrieve all report files)
ftp> quit

The names of copied files must be compatible with your local system. Files with names or extensions that are illegal on your system can be copied by specifying a new name for the copied file after the name of the file to be copied.
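
The same retrieval can be scripted. The following is a minimal sketch using Python's standard ftplib module, assuming the anonymous login and the flybase/ directory described in D.2; the function names and the e-mail address and local file name in the usage note are illustrative assumptions. Each remote path is mapped to a local file name, which covers the renaming case described above.

```python
import os
from ftplib import FTP


def fetch_files(ftp, names, dest_dir="."):
    """Download files in binary mode over an open FTP connection.

    names maps each remote path to the local file name to save it under,
    which allows renaming files whose names are illegal locally.
    """
    saved = []
    for remote, local in names.items():
        local_path = os.path.join(dest_dir, local)
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + remote, out.write)
        saved.append(local_path)
    return saved


def fetch_from_flybase(names, email, dest_dir="."):
    """Log in anonymously, change to flybase/, and fetch the named files."""
    with FTP("flybase.org") as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd("flybase")
        return fetch_files(ftp, names, dest_dir)
```

For example, fetch_from_flybase({"genes/genes.rpt": "genes_rpt.txt"}, "you@example.org") would correspond to the get command above while saving the file under a different local name.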

Some useful FTP commands:

pwd               show the full path of the current directory
dir               list the files in the current directory
cd dir-name       move down one level to the named directory
cd path/dir-name  move down to any directory level
cd ..             move up one directory level
cd ../..          move up two directory levels
bin               change transfer mode to binary
ascii             change transfer mode to ASCII
prompt            toggle the file copy prompt between 'on' and 'off'
lcd dir-name      change the directory on your local computer
quit              exit

Some of the FlyBase files are very large. Large files and directories should be archived with the Unix 'tar' (tape-archive) facility and compressed before being sent. On IUBio this is done by appending the text .tar.Z to the end of a file or directory name and by setting the FTP server to send in binary mode:

ftp> binary
ftp> get flybase.tar.Z
ftp> quit

This command would retrieve the entire FlyBase directory of files. On your computer you would then have to use the uncompress and tar programs to unpack it. These are standard Unix utilities and are available for other systems (see the utils/ folder on the IUBio server).