FlyBase Reference Manual D. Bulk FlyBase Data Retrieval
Last Updated: 15 June 2005

D.1. Bulk FlyBase Data by Web query

From the public FlyBase web sites, you can search and retrieve bulk data through the web interface, either interactively or by program calls, and receive parsable, non-HTML output in table/spreadsheet or database formats. Calls to FlyBase can be automated from Perl or other programs. With the proper settings, you can bypass most of the changeable, hard-to-parse HTML in favor of spreadsheet table data for simple subsets, or ACODE or XML formats for the full complexity of the data.

D.1.1. Bulk Reports from a Search Web Page

From a FlyBase server, submit a Genes or other data section query. Use the default settings to receive all records for D. melanogaster, or specify additional search criteria as desired.
Query: [libs={FBgn PFgn}-org:Dmel] but not [libs-cla:uncertain], No. matches= 37278 (this number will change with each Genes update)
Then Batch Download All items in, for instance, "Spreadsheet, tabbed" format; this will return these summary fields:

FlyBase_ID 
Symbol
Full_name
Class_of_gene
Date
Gene_product
Molecular_function (Gene Ontology)
Biological_process (Gene Ontology)
Cellular_component (Gene Ontology)
Rep._DNA_sequence
Rep._protein_sequence
Protein_domains
Genomic_sequence_analysis
Cytogenetic_map
Recombination_map
Synonyms
Similar_genes
Expression_pattern
Allele_phenotypes

Alternatively, you can select specific fields of data in the Batch Download section. Click the "Select fields" hyperlink and select any subset of the available data fields for your table. The field codes used by Select Fields are described at http://flybase.net/docs/lk/fbhelp/refman-fld-list.html. You can also type these codes directly into the Select Fields box. NOTE: If the "Select Fields" facility is used here, the "Report content" option is ignored and defaults to Abridged report.

Limitations on Formats

There are limitations on the contents returned by different format selections:

Warning about Microsoft Excel Spreadsheets: Excel by default will automatically reformat and destroy data imported from many bioinformatics sources. Gene names can be converted to erroneous dates or other values, chromosome band values can be converted to erroneous floating point numbers, and so on. This problem, caused by Excel's default "General" format, can be avoided by importing fields in "Text" format. FlyBase will endeavor to add Excel-specific output in the future to help avoid this. See Zeeberg et al. (2004), "Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics", BMC Bioinformatics 5:80, doi:10.1186/1471-2105-5-80.
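One way to sidestep the spreadsheet problem entirely is to read the tabbed output with a program that never guesses at data types. The following is a minimal Python sketch, not a FlyBase tool. It assumes the first line of the "Spreadsheet, tabbed" output is a header row naming the fields; the sample rows and the FBgn_EXAMPLE identifiers are invented for illustration and are not real FlyBase records.

```python
import csv
import io


def read_tsv_as_text(handle):
    """Return one dict per row, with every value kept as an unmodified string.

    Assumes the first line of the input is a header row of field names.
    """
    # QUOTE_NONE treats quote characters as ordinary text, so symbols
    # containing quotes are not altered by the parser.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


# Hypothetical sample data (not real FlyBase output). "Sep2" resembles a date
# and "1E3" resembles a floating point number, which is exactly the kind of
# value a spreadsheet program may silently convert.
SAMPLE = (
    "FlyBase_ID\tSymbol\tCytogenetic_map\n"
    "FBgn_EXAMPLE_1\tSep2\t1E3\n"
    "FBgn_EXAMPLE_2\tdec-1\t7C1\n"
)

rows = read_tsv_as_text(io.StringIO(SAMPLE))
for row in rows:
    print(row["FlyBase_ID"], row["Symbol"], row["Cytogenetic_map"])
```

For a downloaded file, pass an open file handle instead of the in-memory sample, for example open("genes.tsv", newline=""), where genes.tsv is a hypothetical local file name.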

D.1.2. Batch Download by ID

Batch Download by ID supports retrieval of FlyBase data in computable forms using a list of FlyBase IDs or valid symbols. Details of the fields available per data class, and some example data mining Perl scripts, are at http://flybase.net/docs/lk/fbhelp/.

NOTE: New Genes data that have not yet been integrated with the full FlyBase data set are not available via this tool.

Select the set of fields you want in reports via the Select Fields box. The default field set for genes is:
ID,GSYM,NAM,CLA,DT,GPD,ENZ,FNC,CEL,RSQ,RPA,PDOM,ASQ,CLOC,GLOC,SYN,HG
and corresponds to the list in Refman D.1.1. The default field set changes for each data class, as the available fields differ in data classes. Field tags are case sensitive.

Generally, when you have multiple IDs, use this Batch Download by ID form; for a single ID, the best URL to use is
http://flybase.net/.bin/fbidq.html?FBID (for html format) or
http://flybase.net/.bin/fbidq?FBID&mimetype=xxx

Here mimetype=xxx may be one of text/html, text/plain, text/tsv, text/csv, text/xml, text/acode. These formats correspond to the above choices of Document Hypertext, Document Text, Spreadsheet tabbed, Spreadsheet commas, Database XML and Database Acode. The Limitations on Formats discussed in Refman D.1.1. apply to returns from these formats.

Append '&fields=ID,GSYM,NAM,...' to the URL to select certain data fields.
The 'asksrs' program searches various data fields, which is unnecessary and more costly when you already have an ID.
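The single-ID URL pattern above can be assembled by a program. The following Python sketch only builds the URL string and does not contact any server. The function name build_fbidq_url and the placeholder ID FBgn0000000 are illustrative assumptions; the base URL, the mimetype values, and the fields parameter are taken from the text above. It also assumes the "/" in the mimetype can be sent unencoded, as in the examples in this section.

```python
from urllib.parse import quote

# Base URL as given in this section of the manual.
FBIDQ_BASE = "http://flybase.net/.bin/fbidq"

# The mimetype values listed in this section.
MIMETYPES = {
    "text/html", "text/plain", "text/tsv",
    "text/csv", "text/xml", "text/acode",
}


def build_fbidq_url(fbid, mimetype="text/tsv", fields=None):
    """Build a single-ID retrieval URL (hypothetical helper, not a FlyBase API)."""
    if mimetype not in MIMETYPES:
        raise ValueError(f"unsupported mimetype: {mimetype}")
    url = f"{FBIDQ_BASE}?{quote(fbid, safe='')}&mimetype={mimetype}"
    if fields:
        # Field tags are case sensitive, so they are passed through unchanged.
        url += "&fields=" + ",".join(fields)
    return url


# FBgn0000000 is a placeholder, not a real record.
print(build_fbidq_url("FBgn0000000", fields=["ID", "GSYM", "NAM"]))
```

The resulting string can then be handed to any HTTP client, such as urllib.request.urlopen, to fetch the report.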

If you have a Drosophila Gene Collection (DGC) EST clone ID but not a FlyBase ID, use the correspondence table for FlyBase ID and DGC ID at http://flybase.net/annot/, in the Data section, as FlyBase Gene - DGC association table.txt.

D.1.3. Bulk Retrieval for Any Query

To automate any query, first run it once by hand, then copy the Bookmark URL shown at the top of the query results:
Bookmark FBquery: [FBgn-org:Dmel] & [FBgn-cla:pseudogene]
URL= http://flybase.org/.bin/asksrs.html?%5BFBgn-org%3ADmel%5D%26%5BFBgn-cla%3Apseudogene%5D
Note: the '%xx' codes are encodings of symbols in the URL; the unencoded version won't work properly:
http://flybase.org/.bin/asksrs.html?[FBgn-org:Dmel]&[FBgn-cla:pseudogene]

Then edit the URL to perform a batch download of data:

URL= http://flybase.org/.bin/asksrs.html?-m99999&-gvtext/tsv&%5BFBgn-org%3ADmel%5D%26%5BFBgn-cla%3Apseudogene%5D
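The encoding and editing steps can be done programmatically. The following Python sketch percent-encodes a query and prepends the two options shown in the batch URL; it only builds strings and does not contact any server. The function names are illustrative assumptions, and the options -m99999 and -gvtext/tsv are copied verbatim from the example rather than interpreted.

```python
from urllib.parse import quote

# Base URL as given in this section of the manual.
ASKSRS_BASE = "http://flybase.org/.bin/asksrs.html"


def encode_query(query):
    """Percent-encode every reserved character, including [ ] : and &."""
    return quote(query, safe="")


def bookmark_url(query):
    """URL for an interactive query, as copied from the results page."""
    return ASKSRS_BASE + "?" + encode_query(query)


def batch_url(query, options=("-m99999", "-gvtext/tsv")):
    """URL for a batch download: options first, then the encoded query.

    The default options are taken unchanged from the example in the text.
    """
    parts = list(options) + [encode_query(query)]
    return ASKSRS_BASE + "?" + "&".join(parts)


query = "[FBgn-org:Dmel]&[FBgn-cla:pseudogene]"
print(bookmark_url(query))
print(batch_url(query))
```

Note that the options are joined with a literal '&', while the '&' inside the query itself is encoded as %26, matching the URLs shown above.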

D.2. Bulk FlyBase Data by FTP

The whole of FlyBase is available by File Transfer Protocol (FTP). For simplified access to Sequence and Annotation data, see below.

The complete file structure of the flybase/ directory can be obtained by the following command within FTP:

A detailed description of many of the data files within each directory is provided in FlyBase Reference Manual B - Detailed Descriptions of FlyBase Structure and Data. The installation of FlyBase on IUBio makes considerable use of symbolic links between file names. This means that a file may reside in one directory but can be accessed by a different filename in another directory. For example, the file flybase/genes/synonyms.doc is linked to the flybase/docs directory, so if you examine that directory you will see this file as:

These links are not always made explicit in the FlyBase Reference Manual. Rather we describe the file structure as we think that it will most interest a biologist.

D.2.1. Sequence and Annotation Data by FTP or rsync

Sequence and annotation data can be retrieved in bulk from the Data section of the Genome Annotation and Sequences page.

D.2.2. Bulk Data in ACODE and XML Formats

FlyBase provides a set of text files that cover all the data served to the public. They are organized by data class: genes, aberrations, references, peptides, etc., including body part vocabulary and gene ontology. Each data class file has a set of records, one per FlyBase ID, for these data, and some number of fields and subrecords in each record.

The native format we use is "acode", a hierarchical field key=value structure that is simpler and more efficient than XML. We also make the same data available in XML format. These data are found at
ftp://flybase.org/flybase-data/
The acode/data/ folder holds the data in the native FlyBase format, with accessory information (SRS parsers, etc.) in other acode/.. folders. The XML version is in the flybase-data/xml/ folder.

Find descriptions of the data field codes in these files in the acode/data-docs/ folder, or at http://flybase.org/.data/docs/fbhelp/

D.2.3. Retrieving Files

Once logged in to an FTP server the following commands can be used to obtain one or more FlyBase files onto your own computer:

ftp> cd directory-name   (e.g., cd flybase if using IUBio)
ftp> get genes/genes.rpt   (and likewise for other single files), or
ftp> mget *.rpt   (to retrieve all report files)
ftp> quit

The names of copied files must be compatible with your local system. Files with names or extensions that are illegal on your system can be copied by specifying a new name for the copied file after the name of the file to be copied.
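The same retrieval can be scripted. The following is a minimal Python sketch using the standard ftplib module, assuming anonymous login is accepted by the server. The function names and the local file name are illustrative assumptions, and no host name is filled in because the address depends on the server you use. Saving under a caller-chosen local name corresponds to giving a new name after the file to be copied.

```python
from ftplib import FTP
from pathlib import Path


def fetch_file(ftp, remote_path, local_path):
    """Download remote_path in binary mode and save it under local_path."""
    local_path = Path(local_path)
    with local_path.open("wb") as out:
        # RETR is the FTP command behind the interactive 'get'.
        ftp.retrbinary(f"RETR {remote_path}", out.write)
    return local_path


def fetch_from_server(host, directory, remote_path, local_path):
    """Log in anonymously, change directory, and fetch one file."""
    with FTP(host) as ftp:
        ftp.login()          # anonymous login (an assumption about the server)
        ftp.cwd(directory)   # equivalent of 'cd directory-name'
        return fetch_file(ftp, remote_path, local_path)


# Example call (host is a placeholder to be replaced with a real server name):
# fetch_from_server("ftp.example.org", "flybase",
#                   "genes/genes.rpt", "genes_rpt.txt")
```

Here genes/genes.rpt is saved locally as genes_rpt.txt, a name chosen to be legal on systems that restrict file extensions.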

Some useful FTP commands:

  • pwd shows full path of current directory
  • dir list files in the current directory
  • cd dir-name move down one level to the named directory
  • cd path/dir-name move down to any directory level
  • cd .. move up one directory level
  • cd ../.. move up two directory levels
  • bin change transfer mode to binary
  • ascii change transfer mode to ASCII
  • prompt toggle file copy prompt to 'off' from 'on' or 'on' from 'off'
  • lcd dir-name change the directory on your local computer
  • quit exit

Some of the FlyBase files are very large. Large files should be archived with the Unix 'tar' (tape archive) facility and compressed before being sent. On IUBio this is done by appending the text .tar.Z to the end of a file or directory name and by setting the transfer mode to binary:

ftp> binary
ftp> get flybase.tar.Z
ftp> quit

These commands would get you the entire FlyBase directory of files. On your computer you would then have to uncompress and unpack the archive with the uncompress and tar programs. These are standard Unix utilities and are available for other systems (see the utils/ folder on the IUBio server).