Chapter 5 The NCBI Database and Services

(The NCBI databases and services)

5.1 Overview

5.1.1 Abstract:

The NCBI hosts some of the world's most important bioinformatics databases and services. This learning unit explores them in the context of our search for information on yeast Mbp1 and its homologue in MYSPE.

5.1.2 Objectives:

This unit will:

  • introduce the Entrez system of NCBI databases and its associated services;
  • demonstrate how to navigate from a generic search to a specific record in the RefSeq Protein database and what information is linked from there;
  • teach Entrez field codes and qualifiers for searches.

5.1.3 Outcomes:

After working through this unit you:

  • can find the RefSeq Protein record for the Mbp1 homologue in MYSPE;
  • are familiar with the NCBI databases, and how Entrez cross-references them;
  • can confidently apply the correct field codes to search for specific entries.

5.1.4 Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

5.1.5 Prerequisites:

This unit builds on material covered in the following prerequisite units:

BIN-Databases (Bioinformatics Databases)

The NCBI (National Center for Biotechnology Information) is one of the two largest, international providers of data for genomics and molecular biology (the EBI is the other). With its annual budget of several hundred million dollars, it organizes a challenging program of data management at the largest scale, it makes its data freely and openly available over the Internet, worldwide, and it runs significant in-house research projects.

In this unit we explore some of the offerings of the NCBI that can contribute to our objective of studying a particular gene in an organism of interest.

5.2 Task 5 - NCBI Intro

  • Read the introductory article on NCBI database resources:

NCBI Resource Coordinators (2020) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Research, 2020 Jan 8;48(D1):D9-D16. (pmid: 31602479)
PubMed

5.3 Entrez

5.6 Protein Sequence

5.7 Task 8

With this knowledge we can restrict the search to proteins called "Mbp1" that occur in Baker's Yeast. Return to the Global Search page and in the search field, type:

Mbp1[protein name] AND "Saccharomyces cerevisiae"[organism]
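The same field-qualified query can also be composed in code. What follows is only a minimal sketch and not part of the course material: it assumes the public NCBI E-utilities `esearch` endpoint with its `db`, `term` and `retmode` parameters, and the helper names `fieldTerm` and `esearchUrl` are made up for illustration.

```javascript
// Assumption: the public E-utilities "esearch" endpoint.
const ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

// Attach an Entrez field qualifier to a search term.
// Terms that contain whitespace are quoted so they are searched as a phrase.
function fieldTerm(term, field) {
  const quoted = /\s/.test(term) ? `"${term}"` : term;
  return `${quoted}[${field}]`;
}

// Build an esearch URL for one database; the terms are joined with AND.
function esearchUrl(db, terms) {
  const params = new URLSearchParams({
    db: db,
    term: terms.join(" AND "),
    retmode: "json",
  });
  return `${ESEARCH}?${params.toString()}`;
}

const query = [
  fieldTerm("Mbp1", "protein name"),
  fieldTerm("Saccharomyces cerevisiae", "organism"),
];

console.log(query.join(" AND ")); // the query string typed into the search field
console.log(esearchUrl("protein", query)); // the same query as a URL
```

The first printed line reproduces the query shown above; on the web page itself you simply type that query into the search field.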

This finds three entries in the Protein database. Follow the link to the result CAA98618.1, a data record in GenBank Flat File (GFF) format. The database identifier CAA98618.1 tells you that this is a record in the GenPept database. There are actually several identical versions of this sequence in the NCBI's holdings. A link to the "Identical Protein Groups" database near the top of the record shows you what these are:

Some of the sequences represent duplicate entries of the same gene (Mbp1) in the same strain (S288c) of the same species (S. cerevisiae). In particular:

  • there are several records for which the source is the INSDC; these are archival entries, submitted by independent yeast genome research projects;
  • there are two entries in the RefSeq database linking to the same protein: [NP_010227.1](https://www.ncbi.nlm.nih.gov/protein/NP_010227.1). One is derived from genome sequence, the other from mRNA. This RefSeq entry is the preferred version of a sequence for our purposes. RefSeq is a curated, non-redundant database which solves a number of problems of archival databases. You can recognize RefSeq identifiers – they always look like NP_12345.1, NM_12345.1, XP_12345.1, NC_12345.1 etc. The prefix reflects whether the sequence is protein, mRNA or genomic, and whether it was inferred or obtained through experimental evidence.
  • there is a SwissProt sequence P39678.1 (see note 2 at the end of this unit). This link is kind of a big deal. It's a cross-reference into UniProt, the huge protein sequence database maintained by the EBI (European Bioinformatics Institute), which is the NCBI's counterpart in Europe. SwissProt entries have the highest annotation standard overall and are expertly curated. Many Web services work with UniProt IDs (e.g. P39678.1), rather than NCBI IDs such as a RefSeq ID. Until recently, however, the two databases did not link to each other, mostly for reasons of funding politics. It's great to see that this divide has now been overcome.
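The RefSeq prefix convention can be captured in a small lookup table. This is a minimal sketch under stated assumptions: the table covers only a handful of common prefixes, the labels "known" and "predicted model" are a simplification of the RefSeq status categories, and the names `REFSEQ_PREFIXES` and `describeRefSeq` are invented for illustration.

```javascript
// A small, simplified subset of RefSeq accession prefixes.
const REFSEQ_PREFIXES = {
  NP_: { molecule: "protein", status: "known" },
  XP_: { molecule: "protein", status: "predicted model" },
  NM_: { molecule: "mRNA", status: "known" },
  XM_: { molecule: "mRNA", status: "predicted model" },
  NC_: { molecule: "genomic", status: "known" },
};

// Return the table entry for a RefSeq-style identifier,
// or null if the identifier does not follow the pattern.
function describeRefSeq(id) {
  const m = /^([A-Z]{2}_)\d+(\.\d+)?$/.exec(id);
  if (!m || !(m[1] in REFSEQ_PREFIXES)) return null;
  return REFSEQ_PREFIXES[m[1]];
}

console.log(describeRefSeq("NP_010227.1")); // RefSeq protein
console.log(describeRefSeq("CAA98618.1")); // null: a GenPept ID, not RefSeq
console.log(describeRefSeq("P39678")); // null: a UniProt accession
```

A check like this is handy when a list of identifiers from mixed sources has to be sorted into RefSeq and non-RefSeq records before further processing.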

Note that while all of these entries come from Saccharomyces cerevisiae, they have been sequenced in different yeast strains. Thus they don't have to be identical (apart from the fact that this particular table lists only identical sequences): such related sequences might differ slightly because the strains are, after all, not genetically identical. And sometimes we find identical sequences in quite divergent species. Therefore I would not actually consider EIW11153.1, AJU86440.1, AJU58508.1, and AJU61971.1 to be identical proteins, although they have the same sequence.
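The distinction between "same sequence" and "same protein" can be made concrete with a toy example. The sketch below groups records by their exact sequence string, which is roughly the idea behind a table of identical sequences, and then reports how many different strains each group contains. All record IDs, strain names and sequences are invented for illustration only; they are not real database entries.

```javascript
// Toy records: IDs, strains and sequences are made up.
const toyRecords = [
  { id: "toy-1", strain: "strain A", sequence: "MKTAYIAK" },
  { id: "toy-2", strain: "strain B", sequence: "MKTAYIAK" },
  { id: "toy-3", strain: "strain C", sequence: "MKTAYLAK" },
];

// Group records by their exact sequence string.
function groupBySequence(records) {
  const groups = new Map();
  for (const rec of records) {
    if (!groups.has(rec.sequence)) groups.set(rec.sequence, []);
    groups.get(rec.sequence).push(rec);
  }
  return groups;
}

// One sequence group may span several strains: identical sequence,
// but not necessarily the "same" protein in a biological sense.
for (const [sequence, members] of groupBySequence(toyRecords)) {
  const strains = new Set(members.map((m) => m.strain));
  const ids = members.map((m) => m.id).join(", ");
  console.log(`${sequence}: ${ids} (${strains.size} strain(s))`);
}
```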

Note all the .1 suffixes of the sequence identifiers. These are version numbers. Two observations:

  1. It's great that version numbers are now used throughout the NCBI database. This is good database engineering practice because it's really important for reproducible research that updates to database records are possible, but recognizable. When working with data you must always provide for the possibility of updates, and manage the changes transparently and explicitly. Proper versioning should be a part of all data models. In fact, the NCBI has recently phased out its internal unique identifiers – the GI number – in favour of accession-number.version IDs everywhere.
  2. When searching, or for general use, you can (and should) omit the version number, i.e. use NP_010227 or P39678, not NP_010227.1 or P39678.1. This way the database system will resolve the identifier to the most current, highest version number (unless you want the older one, of course).
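Separating an identifier into accession and version is a one-line pattern match. The following is a minimal sketch; the function name `splitAccession` is hypothetical, and it assumes that the version is always the numeric suffix after the last dot.

```javascript
// Split "accession.version" into its two parts.
// An identifier without a version suffix yields version: null.
function splitAccession(id) {
  const m = /^(.+?)(?:\.(\d+))?$/.exec(id.trim());
  if (!m) throw new Error("empty identifier");
  return { accession: m[1], version: m[2] ? Number(m[2]) : null };
}

// Use the bare accession for searching, keep the version for your records.
for (const id of ["NP_010227.1", "P39678.1", "P39678"]) {
  const { accession, version } = splitAccession(id);
  console.log(`${id} -> search with ${accession}, version: ${version}`);
}
```

Recording the full accession.version in your journal while searching with the bare accession gives you both reproducibility and the most current record.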

5.8 Task 9 - NCBI Details

As we see, this is a good start page to explore all kinds of databases at the NCBI via cross-references.

5.9 PubMed

Arguably one of the most important databases in the life sciences is PubMed and this is a good time to look at PubMed in a bit more detail.

5.10 Task 10 - Pubmed

  • Return to the [MBP1 RefSeq record](https://www.ncbi.nlm.nih.gov/protein/NP_010227.1).
  • Find the PubMed link under Related information in the right-hand margin and explore it. These are links that are directly related to the NP_010227 sequence in the database.
  • Next follow the link to "PubMed (Weighted)", which applies a weighting algorithm to find broadly relevant information - an example of literature data mining. PubMed (Weighted) appears to give a pretty good overview of systems-biology type, cross-sectional and functional information.

But it does not find all Mbp1 related literature.

  1. On any of the PubMed pages open the Advanced query page and study the keywords that apply to PubMed searches. These are actually quite important and useful to remember. Familiarize yourself with the section on Search field descriptions and tags in the PubMed help document (in particular [DP], [AU], [TI], and [TA]), with how you use the History to combine searches, and with the use of AND, OR, NOT and brackets. Understand how you can restrict a search to reviews only, and what the link to Related citations... is useful for.
  2. Now find publications from anywhere in PubMed with Mbp1 in the title. In the result list, follow the links for the two Biochemistry papers, by Taylor et al. (2000) and by Deleeuw et al. (2008). Download the PDFs; these manuscripts will be needed in a later unit.
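How field tags and Boolean operators combine is easiest to see when the query is assembled piece by piece. The sketch below only builds query strings that you could paste into the PubMed search field; the helper names `tagged` and `combine` are invented, and the publication-type tag [PT] used for the review filter is an addition that is not in the list of tags above.

```javascript
// Tag a term with a PubMed search field, e.g. tagged("Mbp1", "TI") gives "Mbp1[TI]".
function tagged(term, tag) {
  return `${term}[${tag}]`;
}

// Combine sub-queries with a Boolean operator (AND, OR, NOT).
// The surrounding brackets make the order of evaluation explicit.
function combine(op, ...parts) {
  return `(${parts.join(` ${op} `)})`;
}

const inTitle = tagged("Mbp1", "TI");
const inBiochemistry = combine("AND", inTitle, tagged("Biochemistry", "TA"));
const eitherYear = combine("OR", tagged("2000", "DP"), tagged("2008", "DP"));
const twoPapers = combine("AND", inBiochemistry, eitherYear);
const reviewsOnly = combine("AND", inTitle, tagged("review", "PT"));

console.log(inTitle);
console.log(twoPapers);
console.log(reviewsOnly);
```

Building the query from named parts mirrors what the PubMed History does interactively: each sub-search can be inspected on its own before it is combined with the others.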

5.11 Digression: A "bookmarklet" to access literature

PubMed usually includes links to full-text articles, but these are often behind a paywall, even though we have access through our library system (one of the top three in the world, incidentally). Here is a bookmarklet (a portmanteau of "bookmark" and "applet") to seamlessly redirect from a paywall page to full access through our library's "my access" system. The key is to apply a bit of code that "rewrites" the original URL.

In your browser, create a bookmark to anything, call it "MyAccess", and put it into your bookmarks bar for convenience. Then edit it: replace the URL of the bookmark with the following snippet of JavaScript:

javascript:(function(){var url=window.location.href;var re=/\/\/([\w.-]+)\/(.*)$/;var match=url.match(re);if(!match){return;}var newURL="http://"+match[1]+".myaccess.library.utoronto.ca/"+match[2];window.location.href=newURL;})();void 0

No line breaks!

Then try it. Go to the following article from outside the university network ...

http://science.sciencemag.org/content/303/5659/788.long

... you should see the abstract but you can't view the full text without being an AAAS member. Then click on your bookmarklet. It should automatically rewrite the URL, take you to the UofT login screen, and take you to a page with full access to the article.
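To see what the bookmarklet does, the rewriting step can be looked at in isolation as an ordinary function. This is a sketch rather than the bookmarklet itself: it swaps the regular expression for the standard `URL` class, the function name `toProxyUrl` is made up, and it keeps the `http://` scheme and the host suffix exactly as they appear in the bookmarklet above.

```javascript
// Suffix appended to the publisher's host name (taken from the bookmarklet above).
const PROXY_SUFFIX = ".myaccess.library.utoronto.ca";

// Rewrite a publisher URL so that the request goes through the library proxy.
function toProxyUrl(pageUrl, suffix = PROXY_SUFFIX) {
  const u = new URL(pageUrl); // throws a TypeError on malformed input
  if (u.hostname.endsWith(suffix)) return pageUrl; // already rewritten
  // Keep path, query string and fragment; only the host name changes.
  return "http://" + u.hostname + suffix + u.pathname + u.search + u.hash;
}

console.log(toProxyUrl("http://science.sciencemag.org/content/303/5659/788.long"));
```

Only the host name is changed; everything after it is passed through untouched, which is why the same trick works for any publisher whose content the proxy serves.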

I hope you find this as useful as I do. The strategy lends itself to other nice ideas.

5.12 Original Information and Annotation Transfer

5.13 Task 11

In the BIN-Storing_data unit you have found the protein of MYSPE that is most similar to yeast Mbp1. Navigate to the NCBI Protein page for the RefSeq entry of this protein. Explore the links that go out from the page. Assess which resources are independently useful, and which resources merely recapitulate information that relates to yeast Mbp1, the protein that you originally searched with. The goal is to develop a sense for where a page like this one collects original information, and where it merely acts as a record of annotation transfer.

5.14 Self-evaluation


  1. If there is only a single match, you will be taken directly to the page.

  2. Actually the "real" SwissProt identifier would be patterned like MBP1_YEAST. P39678 is the corresponding UniProt identifier.

  3. Your operating system can help you keep the files organized. The "file system" is a database.