
Difference between curated publications and additional publications in nextProt database?


Regarding the reference papers listed in the nextProt database, what is the difference between curated publications and additional publications? If anyone uses nextProt, kindly explain the terms. For example, when I search for SLC22A23 in the nextProt simple search engine, I get the following in the left-hand toolbar.

References

Curated publications (21)

Additional publications (2)

Patents (0)

Submissions (0)

Web resources (0)

Link: https://search.nextprot.org/entry/NX_A1A5C7/

Thanks


SynergyAge, a curated database for synergistic and antagonistic interactions of longevity-associated genes

Interventional studies on genetic modulators of longevity have significantly changed gerontology. While available lifespan data are continually accumulating, further understanding of the aging process is still limited by the poor understanding of epistasis and of the non-linear interactions between multiple longevity-associated genes. Unfortunately, based on observations so far, there is no simple method to predict the cumulative impact of genes on lifespan. As a step towards applying predictive methods, but also to provide information for a guided design of epistasis lifespan experiments, we developed SynergyAge - a database containing genetic and lifespan data for animal models obtained through multiple longevity-modulating interventions. The studies included in SynergyAge focus on the lifespan of animal strains which are modified by at least two genetic interventions, with single gene mutants included as reference. SynergyAge, which is publicly available at www.synergyage.info, provides an easy to use web-platform for browsing, searching and filtering through the data, as well as a network-based interactive module for visualization and analysis.

Measurement(s) longevity • epistasis • synergistic interactions of longevity-associated genes • antagonistic interactions of longevity-associated genes
Technology Type(s) digital curation
Factor Type(s) type of mutant • animal model
Sample Characteristic - Organism Caenorhabditis elegans • Drosophila melanogaster • Mus musculus

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13049696


Background

With the rapid increases in genomic and epigenomic data in recent years, our ability to annotate regulatory elements across the human genome and predict their activities in specific cell and tissue types has substantially improved. Widely used approaches integrate multiple epigenetic signals such as chromatin accessibility, histone marks, and transcribed RNAs [1,2,3,4,5,6,7] to define collections of regulatory elements that can be used to study regulatory programs in diverse cell types and dissect the genetic variations associated with human diseases [5, 8,9,10,11].

To maximize the utility of regulatory elements, one must know which genes they regulate. We recently developed the Registry of candidate cis-Regulatory elements (cCREs), a collection of candidate regulatory genomic regions in humans and mice, by integrating chromatin accessibility (DNase-seq) data and histone mark ChIP-seq data from hundreds of biosamples generated by the ENCODE Consortium (http://screen.encodeproject.org). Over 75% of these cCREs have enhancer-like signatures (high chromatin accessibility as measured by a high DNase-seq signal and a high level of the enhancer-specific histone mark H3K27ac) and are located distal (> 2 kb) to an annotated transcription start site (TSS). For cCREs proximal to a TSS, it may be safe to assume that the TSS corresponds to the target gene, but to annotate the biological function of the TSS-distal cCREs and interpret the genetic variants that they harbor, we need to determine which genes they regulate.

Assigning enhancers to target genes on a genome-wide scale remains a difficult task. While one could assign an enhancer to the closest gene using linear distance, there are many examples of enhancers skipping over nearby genes in favor of more distal targets [12]. Experimental assays such as Hi-C and ChIA-PET survey physical interactions between genomic regions [13,14,15,16,17], and by overlapping the anchors of these interactions with annotated enhancers and promoters, we can infer regulatory connections. Approaches based on quantitative trait loci (QTL) associate genetic variants in intergenic regions with genes via the variation in their expression levels across multiple individuals in a human population [18, 19]. Recently, a single-cell perturbation approach extended this idea [20]. However, these assays are expensive to perform and have only been conducted at a high resolution in a small number of cell types. Therefore, we need to rely on computational methods to broadly predict enhancer-gene interactions.
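The closest-gene assignment mentioned above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the function name `nearest_gene`, the tuple layout, and the coordinates are all hypothetical, and it assumes the TSS list covers a single chromosome, is non-empty, and is already sorted by position.

```python
from bisect import bisect_left


def nearest_gene(enhancer_mid, tss_list):
    """Return the (position, gene) entry whose TSS is closest to an enhancer midpoint.

    tss_list is assumed to be a non-empty list of (position, gene_name) tuples
    for one chromosome, sorted by position (a hypothetical layout).
    """
    positions = [pos for pos, _ in tss_list]
    # Index of the first TSS at or to the right of the enhancer midpoint.
    i = bisect_left(positions, enhancer_mid)
    # Only the flanking TSSs (one on each side, where present) can be the closest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda entry: abs(entry[0] - enhancer_mid))


# Hypothetical coordinates for illustration only.
tss = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
closest = nearest_gene(5600, tss)
```

Because such a rule always picks the flanking gene, it cannot recover the cases described above where an enhancer skips over nearby genes in favor of a more distal target.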

One popular computational method for identifying enhancer-gene interactions is to correlate genomic and epigenomic signals at enhancers and gene promoters across multiple biosamples. This method is based on the assumption that enhancers and genes tend to be active or inactive in the same cell types. The first study to utilize this method linked enhancers with genes by correlating active histone mark signals at enhancers with gene expression across nine cell types [1]. Several groups subsequently used similar approaches to link enhancers and genes by correlating various combinations of DNase, histone mark, transcription factor, and gene expression data [8, 21,22,23]. While these methods successfully identified a subset of biologically relevant interactions, their performance has yet to be systematically evaluated.
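The signal-correlation idea can be illustrated with a small sketch. It assumes one activity value per biosample for the enhancer (for example a DNase or H3K27ac signal) and one expression value per biosample for each candidate gene; the function names and the toy numbers are hypothetical, and published methods differ in which signals and correlation measures they use.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant signal carries no information; score it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_candidate_genes(enhancer_signal, gene_expression):
    """Rank candidate genes by correlation with the enhancer signal, best first."""
    scores = {gene: pearson(enhancer_signal, expr)
              for gene, expr in gene_expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy values across four biosamples (hypothetical).
enhancer = [1.0, 2.0, 3.0, 4.0]
genes = {"gene1": [2.0, 4.0, 6.0, 8.0],
         "gene2": [8.0, 6.0, 4.0, 2.0],
         "gene3": [5.0, 5.0, 5.0, 5.0]}
ranking = rank_candidate_genes(enhancer, genes)
```

In this toy case the gene whose expression rises and falls with the enhancer signal ranks first, which is exactly the assumption the method relies on.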

Other groups have developed supervised machine-learning methods that train statistical models on sets of known enhancer-gene pairs. Most of these models use epigenomic signals (e.g., histone marks, TFs, DNase) at enhancers, promoters, or intervening windows as input features [24,25,26,27]. PEP-motif, on the other hand, uses sequence-based features [28]. The performance of these methods has not been systematically evaluated for several reasons. First, different methods use different definitions for enhancers ranging from EP300 peaks [26] to chromatin segments [27]. Second, these methods use different datasets to define their gold standards, such as ChIA-PET interactions [24, 26] or Hi-C loops [26, 27], along with different methods for generating negative pairs. Finally, many of these methods use a traditional randomized cross-validation scheme, which results in severe overfitting of some supervised models due to overlapping features [29, 30].

To facilitate the development of target gene prediction methods, we developed a collection of benchmark datasets by integrating the Registry of cCREs with experimentally derived genomic interactions. We then tested several published methods for linking enhancers with genes, including signal correlation and the supervised learning methods TargetFinder and PEP [27, 28]. Overall, we found that while TargetFinder was the best-performing method, it was only modestly better than a baseline distance method for most benchmark datasets when trained and tested on the same cell type, and TargetFinder often did not outperform the distance method when applied across cell types. Our results suggest that current computational methods need to be improved and that our benchmark presents a useful framework for method development and testing.


Affiliations

Crop Research Institute, Sichuan Academy of Agricultural Sciences, Chengdu, 610066, China

Songtao Yang, Shuai Qiao & Wenfang Tan

Key Laboratory of Molecular Biology and Gene Engineering in Jiangxi Province, College of Life Science, Nanchang University, Nanchang, 330031, China

Xiaojing Liu, Xiang Kang, Youlin Zhu, Lan Yang & Dong Wang

Institute of Biotechnology and Nuclear Technology, Sichuan Academy of Agricultural Sciences, Chengdu, 610061, China



This FAQ document is complementary to the practical use-orientated Help documentation and the About page. This focuses more on the "how, why and wherefore" of our database content.

General Questions

Our open access paper in the 2014 NAR Annual database issue summarises the database as of Sept 2013. More recent technical background to some of the curatorial issues below is provided in our blog posts, the Newsletter, and recent presentations by team members on SlideShare.

Our About page has a graphical snapshot of the entity counts for the current release, broken down by target classes and ligand types.

Content expansion and feature enhancements are announced on the front page. Releases are approximately quarterly, designated as increments (e.g. 2014.1 in April, 2014.2 in June and 2014.3 in November).

This page provides guidelines for how to cite the Guide to PHARMACOLOGY.

Yes, "GtoPdb". Note in PubChem our source name is "IUPHAR/BPS Guide to PHARMACOLOGY" (for Substances, Compounds and BioAssays). When you inspect our substance records the ligand ID number is prefixed with "GTPL". The Concise Guide to PHARMACOLOGY snapshot publication in the British Journal of Pharmacology can be abbreviated as "CGTP".

It grew out of the IUPHAR Database (IUPHAR-DB) from 2011. This was described in a series of publications and a Wikipedia entry but now has a re-direct from the original website. The pre-existing website was integrated into GtoPdb and expanded with information from an established series of journal articles called the ‘Guide to Receptors and Channels’ (GRAC). GRAC is now superseded by the "Concise Guide to PHARMACOLOGY" (CGTP) series.

While the earlier publications pre-date the database they were conceived as adjuncts to facilitate reciprocal navigation. All ligand and protein entries from articles designated as NC-IUPHAR reviews now have records in the database. There is obviously more contextual detail in the stand-alone review articles than we could feasibly index but summary information (e.g. for protein families) is not only captured in the database but we also link-out to the relevant articles.

Yes, the British Journal of Pharmacology and its publisher Wiley have been piloting this for some time. You can see examples both in the CGTP 2013/14 series and recent NC-IUPHAR reviews (e.g. this 2014 one on epigenetic pathways)

As an open resource, we encourage re-distribution and value-added integration of our content. We also maintain an expanding collaboration network for reciprocal cross-linking from other high-utility sources (these are listed on our website and we know most of the teams personally). However, they may not always refresh our updated links on their side (e.g. you may still come across the superseded IUPHAR-DB links). This link-refreshing is a major issue in the global database ecosystem (see this blog post) but we are addressing it by alerting those we know to new GtoPdb releases. Of more concern is where our content has been downloaded or even web-scraped and then re-surfaced in additional databases without contacting us. The problem is certainly not confined to our resource but we obviously have no control over this. Links can consequently include deprecated records (if you are on the team of such a resource we would be pleased to discuss the technicalities of refreshing our links).

The database is hosted by the University of Edinburgh which is registered as a Charity in Scotland (SC005336). Our major funders are the UK Wellcome Trust (via grant number 099156/Z/12/Z), the British Pharmacological Society (BPS) and International Union of Basic and Clinical Pharmacology (IUPHAR). We also have some unrestricted educational grants from companies. Note we are always open to new sponsors (please email us)

You will note that the downloadable contents of the database have generous licencing terms (share, copy, redistribute, adapt, remix, transform, build upon, including commercially). Notwithstanding, we would request that any parties incorporating the content of GtoPdb into their own work, including their own integrations should contact us. This is not only for the courtesy of us knowing who-is-doing-what with our funded work but we can also help with technical aspects. We are currently engaging with the OpenPhacts consortium about integration as well as some commercial information providers and major pharmaceutical companies.

Data Questions

Yes. As described, IUPHAR-DB originated from receptor and channel pharmacology. While GtoPdb continues to maintain this focus, over 2013/14, as a grant-directed objective, it has extended into molecular mechanisms of action (mmoa), with corresponding human protein annotation for new target classes. In addition, there have been selective expansions driven by user interests and specific collaborations. For example, we now have a broad capture of development candidates and research compounds for the treatment of Alzheimer’s Disease. One difference users may notice between old and new records is that receptor or channel ligands often have a range of activity values from different publications judged as pharmacologically relevant (this is not to be confused with +/- ranges reported in individual papers for the same determination). However, for our recent expansion into enzyme drug targets we typically select only one value. Also, most of these newer targets have only one ligand as opposed to many ligands for well-studied receptors and channels. We are remediating older relationship mappings, particularly those without referenced activity values. In addition, we will convert some of our historical peptide entries into more defined molecular representations.

Our first source is peer-reviewed primary literature. We exploit the different feature sets of both UK PubMed Central and NCBI Entrez (note this occasionally gives matches in PubMed Central that are not in PubMed/MeSH) and last but not least, the full range of public databases. We are fortunate to have extensive Journal access via the University of Edinburgh but there are cases where they do not have subscriptions to some we would like to check. The only commercial database we currently use is CAS SciFinder, also courtesy of a University of Edinburgh licence. It should be noted that CAS content is extracted from primary public sources but we occasionally use it to locate entities that are difficult to resolve elsewhere.

We apply selective criteria if there is a choice of either a) multiple references for established ligands and/or well characterised targets or, b) if we need to select more recent ligands for an emerging research target. In both cases, we first go for primary publications in top-ranking journals in the field with detailed SAR, unequivocal resolution of the ligand and target entities (i.e. molecular structures and species for the proteins) as well as defined assay conditions. Note there is a time-shift between primary medicinal chemistry papers from which we can capture in vitro kinetic parameters and later papers on in vivo pharmacology as well as eventual clinical trial results. These can span anywhere between 5-20 years but we select key references across the range. In terms of activity types, we prioritise the more standardised Ki over an IC50 but will include both if they are available. We are aware of the somewhat grey line for the target assignments of binding data for receptors expressed with coupled read-outs in cell-lines (as opposed to purified proteins typically used for soluble enzyme assays). Given the intrinsic value of the cell-binding data, we judge if the mapping to a protein identifier is sufficiently resolved (and the reference details the cell-based assay). For research targets, authors may indicate optimised lead compounds in the paper or, if not, the highest potency is selected. If a ligand is indexed in PDB as the correct target-ligand pair we will obviously prioritise this but also try to find the activity values from earlier papers. In terms of activity types, we generally defer to authors (i.e. if they call it a Kd or an EC50 then so do we) but our target subcommittees may decide to change the annotation in respect of the assay type (as detailed in the IUPHAR guidelines). 
Note we have also been accommodating new proposed ligand nomenclature expansions as exemplified by the IUPHAR recommendations for the nomenclature of receptor allosterism and allosteric ligands.

20K references have PubMed IDs but we do include a small number of reference links for data we think is important to capture but is either in journals not indexed in PubMed, patents, book chapters, slide sets, or meeting abstracts. In the same way as we assess the quality of peer reviewed papers, we also judge these sources on a) credible provenance b) entity resolution and c) a stable URL or DOI. We also set up a literature alert for company code numbers where our initial extraction was only from abstracts or slides. More recently we have extracted data from (and thus linked to) pharmaceutical company open information sheets for repurposing candidates (in lieu of the expected papers). We note the increase in other types of non-PubMed data surfacing (e.g. Open ELNs and Figshare) so groups intending to surface pharmacological data sets via these new routes are welcome to contact us.

We generally select patent references in two cases: a) ligands for which the patent contains the only published SAR, or b) ligands for which the patent data is extensively complementary to that from papers from the same team (because many more analogues are included, along with quantitative data and synthesis descriptions). The big bang of patent chemistry extraction has resulted in the submission of over 15 million patent-extracted structures into PubChem and EBI has taken over the new SureChEMBL resource. Consequently, it is becoming easier for us to resolve structure and data links between papers, database entries and patents. We generally select only patents from pharmaceutical companies and academic institutions with an established medicinal chemistry reputation. There are some ligands where we add more speculative mappings (e.g. as a curatorial comment pointer to the patent wherein the lead series can be identified with high probability). This is for cases where a company code name is blinded (i.e. no publicly declared name-to-structure) but pharmacologically important information (including quantitative target binding data) has been disclosed on company websites, in open repurposing lists, or in clinical trials entries. In a few cases we have also been able to exploit patents as a source of target binding data for monoclonal antibodies published studies.

Regardless of author primacy, we do not report significant figures that clearly exceed the variance of experimental assays (i.e. anywhere up to +/- 50%) and therefore we quote a maximum of three (i.e. 1%). Note that our log-transformation to pAct also produces three figures. There is a caveat in regard to the surfacing of different figures (for rounded vs as-written) in other literature extraction sources for the same report (e.g. in PubChem BioAssay) but the consequences cannot be detailed here.
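As a rough illustration of the rounding described above, the sketch below rounds a value to three significant figures and applies the conventional p-transformation (the negative base-10 logarithm of a molar value, as for pKi or pIC50). The function names are hypothetical, and whether the database applies the rounding before or after the log-transformation is an assumption made here for illustration.

```python
import math


def round_sig(value, sig=3):
    """Round a number to `sig` significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit, e.g. 1 for 12.3 and -3 for 0.00123.
    exponent = math.floor(math.log10(abs(value)))
    return round(value, sig - 1 - exponent)


def p_activity(molar_value, sig=3):
    """Negative log10 of an activity in mol/L, rounded to `sig` figures.

    Rounding after the transformation is an assumption, not a documented rule.
    """
    if molar_value <= 0:
        raise ValueError("activity must be a positive molar concentration")
    return round_sig(-math.log10(molar_value), sig)


ki_reported = 2.4873e-9              # hypothetical Ki of about 2.49 nM
ki_quoted = round_sig(ki_reported)   # three significant figures
p_ki = p_activity(ki_reported)       # log-transformed value
```

A value quoted to three figures implies a precision of roughly 1%, which is already far finer than the assay variance mentioned above.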

Yes. This occurs principally via a) our target committees of 650 global experts, b) the NC-IUPHAR and co-grant holders steering group and c) the University of Edinburgh database team (who are collectively authors on 129 PubMed papers). This manifests itself as an intense and continuously reviewed selection (also factoring-in user feedback) of what to include or leave out and sometimes remove. Populating the database, by definition, seeks to impose structure via entity relationship abstraction from a large unstructured document corpus. However, the complexities and nuances of pharmacology as well as (it must be said) the variable scientific quality of publications, all mean that brainless rule-parsing (and maximal coverage) are incompatible with our "utility biased" vision. Thus, our rules are implemented more as curatorial guidelines (i.e. we can bend or even break them where it is useful and the database consequences are not too problematic). The challenge is balancing speed and flexibility of capture against the necessary constraints of a formalised data model. We thus make extensive use of curators’ notes (a.k.a. nano-publications) to capture tacit facts as free-text and relationships via cross-pointers that are not indexed fields in the current schema. In the longer term we may accommodate new relationships as necessary. These aspects differentiate us from other valuable resources that have different capture scales and objectives. Some of our choices are pragmatic. For example, we will add ligands from the earliest reports of chemical modulators for a novel target (possibly patent-only) even if these are of such low potency and/or specificity as to be unpublishable for an established target (e.g. surrogate ligands for orphan receptors). Importantly, our annotations are reversible as we remediate and make committee- or user-communicated corrections (note we will add improved ligands as they are published but do not typically remove older ligands with solid citations).

Only for kinases. Matrix data constitutes standardised result sets from large-scale n-ligands x n-proteins parallel assays (also called panel screening or profiling). The ones we surface from DiscoveRx, Reaction Biology and Millipore are valuable for users to access, via a separate table. However, if these results were target-mapped into the main database they would interpose a confounding set of relationships, compared to the stringent mappings we curate from the literature.

Yes, and note you are welcome to forward your own new papers (and we may get back to you for entity clarifications). Our committee members regularly alert us to new content during our target-family update cycles and in between for "hot topics" as announced on the web page (note you will have to permit us to judge the relevance before it goes in the curation queue). You can also be invited to join our CiteULike Guide to PHARMACOLOGY Triage group where you post papers with your own notes added for us to assess, or just do this in your own CUL library (for adding notes we hope you will find our entity resolution guide useful).

This refers to the practice of manually or automatically copying annotation and links between databases without evaluation. We always check the linkages we add and, crucially, also read the papers to check the activity data to enter in the database are correct. In many cases we have drawn chemical structures de novo from papers where no CID matches were available at curation time. This is why over 250 of our CIDs are novel in PubChem. For some quality control tasks we do use computational cross-checking (e.g. to ensure that our source links are concordant within PubChem) but only where we have established the operational consistency of the automated approach.

Target Questions

As an umbrella term, this has a range of meanings. One of them, as detailed in other sections, is associated with drugs to treat human diseases that typically have a data-supported protein target for their mechanism of action. Secondly, ligands are usually selected on the basis of their protein interactions. These proteins can thus be termed "targets" in the wider sense even if they are not encompassed within drug discovery efforts for human diseases. Notably, our targets include those newer receptor-ligand pairings judged as credible by the committees (i.e. de-orphanisation, see PMID 23957221). Thirdly, we perform selected orthologue curation in that we either include non-human binding data, or annotate the rodent reference orthologues. The fourth category (discussed in PMID 21569515) arises from the drug development context of undesirable ligand interactions (sometimes termed "anti-targets"). A database example is that between the withdrawn drug terfenadine and the HERG channel (KCNH2) as a liability target for cardiac toxicity. As a fifth category, our ligand mapping extends to functional orphans. By this we mean specific chemical modulation reports for proteins that do not yet have sufficient validation data to be considered bona fide therapeutic drug targets but are being investigated both to establish their normal function and to assess possible causative disease involvement (e.g. Cathepsin A). We have consequently curated ligands directed against kinases, proteases, chromatin modifying enzymes and GPCRs that can be classified as functional orphans. This is clearly a transient categorisation in the context of functional genomics and equally intense efforts to validate new therapeutic targets in both the academic and commercial sectors.

As well as an overview and background reading, target family pages include concise at-a-glance summaries for each target. These describe nomenclature, genes and key ligands including expert-recommended selective tool compounds, endogenous ligands and approved drugs. For the most important proteins (including targets of approved drugs) we are working to include more detailed subcommittee-directed curation of detailed pharmacology, physiology, molecular function, assays, human disease relevance, and clinically significant variants. This includes extended family introductions adapted from review articles.

One of the founding (and continuing) objectives for NC-IUPHAR is to oversee the nomenclature of human receptors and channels so these human protein classes are complete in the database (with the exception of the olfactory and opsin-type GPCRs). More recently, as part of our grant objectives, we have expanded into other families. This includes: transporters, the full complement of kinases, a subset of characterised proteases, other hydrolases as well as enzymes involved in histone modifications. You can find the current families breakdown on the About page which as of November 2014 includes 2708 UniProt identifiers.

Where possible, we resolve the literature reference to a UniProtKB/Swiss-Prot ID as our primary identifier. Note that for non-human species, such as rodents, we restrict these links to the sequences in Swiss-Prot, as these are curated and reviewed. There are many reasons for choosing UniProt primacy but they include a) the utility of the Swiss-Prot canonical philosophy of protein annotation, b) species specificity, c) global reciprocal cross-referencing, d) persistence as an EBI core resource, e) as of 2014, we have collaborative control over our own cross-references and f) we can correct entries via our own feedback to the UniProt team. It is important to note that this choice is protein-centric rather than gene-centric. While the dichotomy can cause problems (e.g. Swiss-Prot protein names, HGNC gene names and NC-IUPHAR nomenclature are not completely harmonised) the mappings between the core tetrad (UniProt, HGNC, Ensembl, and Entrez Gene) are concordant (i.e. have a 1:1:1:1 sequence cross-mapping) for only 18,787 human entries (as of Sept 2014). The other protein and gene resources we link to are listed in Help. Note, for us, these are secondary sources. What this means practically is that we ensure the fidelity of our curated "out" links, but we can control neither their correct reciprocity in pointing back "in" nor the links between themselves. While ambiguous cases in our database are few, it does affect nearly 1500 human proteins with discordances between the major protein and gene annotation pipelines. Users should also be aware that Swiss-Prot entries can have one-to-many mappings to RefSeq (since the latter are non-canonical). Where we cannot resolve members of a protein family to UniProt IDs from the authors' description (but the paper was judged pharmacologically important) we comment on this ambiguity.
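The notion of a 1:1:1:1 cross-mapping can be made concrete with a short sketch. It assumes the cross-references are available as rows of four identifiers (UniProt, HGNC, Ensembl, Entrez Gene) and treats a row as concordant only when none of its identifiers appears in any other row. The function name and the placeholder identifiers are hypothetical, and this is not a description of how the annotation pipelines actually compute concordance.

```python
from collections import defaultdict


def concordant_entries(rows):
    """Return rows whose four identifiers each occur in exactly one distinct row.

    rows: iterable of (uniprot, hgnc, ensembl, entrez) tuples (assumed layout).
    """
    rows = set(rows)  # ignore exact duplicate rows
    # For each of the four columns, record which rows every identifier appears in.
    rows_by_identifier = [defaultdict(set) for _ in range(4)]
    for row in rows:
        for column, identifier in enumerate(row):
            rows_by_identifier[column][identifier].add(row)
    return sorted(
        row for row in rows
        if all(len(rows_by_identifier[column][identifier]) == 1
               for column, identifier in enumerate(row))
    )


# Placeholder identifiers, not real database accessions.
cross_refs = [
    ("P_one", "HGNC:1", "ENSG_1", "100"),
    ("P_two", "HGNC:2", "ENSG_2", "200"),
    ("P_two", "HGNC:2", "ENSG_3", "200"),  # one protein, two Ensembl genes
]
clean = concordant_entries(cross_refs)
```

Here the second protein is excluded because it maps to two Ensembl identifiers, the kind of one-to-many discordance the paragraph above describes.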

The UniProt entries which link out to GtoPdb entries were selected for "has ligand" relationships of any type. As of 2014 this represents over 2,000 proteins of which around two thirds are from human Swiss-Prot. Note that UniProt out-links to us are now in a new category of "Chemistry" cross-references which includes ourselves, BindingDB, DrugBank and ChEMBL (but note the synchronisation times between these sources are different). We are currently exploring extending selection options (e.g. for the primary targets of approved drugs).

This is a challenge for curation since many targets are heteromeric complexes in vivo (i.e. they consist of multiple UniProt IDs). We include their NC-IUPHAR designations as complexes and provide page links to pharmacology references (as specified in PMID 17329545). However, for ligands we annotate UniProt mappings as 1:1 wherever possible. The mechanistic justification is that, for most complexes at least, the data indicates only one or two proteins participate directly in ligand binding. This more stringent annotation enhances the precision of the database in three ways: a) it takes a minimal rather than a maximal target mapping approach (see PMID 24533037), b) it restricts targets to those with tractable binding sites and c) putative ligand binding by homology extrapolation becomes more reliable. For example, we have mapped the current gamma secretase inhibitors to PSEN1 rather than increasing the complexity of our target mapping matrix by adding an additional five UniProt IDs for which there is no evidence of significant inhibitor binding.

Our target pages include biologically significant alternative splicing variants and these have links out to the corresponding RefSeq nucleotide and protein entries. The increasing importance of splice variants in pharmacology is recognised in this recent IUPHAR review (PMID 24670145). Future updates may thus include splice-specific binding data. If the splice variant is clearly defined in the paper we should be able to match this to a Swiss-Prot feature line and a RefSeqNP ID.

We do capture selected pharmacologically important ligand interactions in these domains (e.g. high-affinity transporter binding for some drugs, metabolites with substrate binding data for selected drug targets and certain toxins used as pharmacological tools). However, we leave the broader matrix of molecular interaction capture for these domains to other specialist databases.

Yes. This arises from individual members of the protein families that do not yet have recorded ligand interactions in the database. Note that this absence of ligands always has the "so far" caveat and the numbers of proteins with curated interactions expands with every release.

No, for three reasons: a) while we generally curate human data, where this is unavailable we may include rodent data (and in fewer cases data from other orthologues such as dog); b) our collation of approved drugs has captured structures for a number of anti-infectives, which may be consolidated by adding molecular mechanisms of action (mmoa) mappings in due course, although our current curatorial focus is human; and c) there are cases for older drugs (i.e. before target expression-cloning was routine) where human in vitro data was never published, so we can only find data for a test animal (e.g. ACE inhibitors with IC50s against the rabbit enzyme).

We could expand to cover well over 2,000 human proteins that have chemical modulation reports in papers or patents that would pass our curation criteria. However, this is a future funding issue.

As explained above we use citable activity data to define a pharmacologically significant molecular interaction. The concept of a primary target is where a ligand has been optimised in a drug discovery context with a distinct molecular mechanism of action (mmoa) (usually measurable in vitro with plausible potency and specificity), and is assumed to be causatively responsible for observed pharmacology in vivo (e.g. the effect of ACE inhibitors in lowering blood pressure is due to substrate-competitive binding). This should not be confused with "target validation" where translation of the primary mmoa into therapeutic efficacy has been clinically demonstrated (but, as we know, many drug candidates with a data-supported mmoa still fail to improve disease outcomes in clinical trials).

Ligand Questions

Ligand is used as an umbrella term for pharmacologically important small-molecule to large-molecule interactions, but there is no strict size cut-off (i.e. it extends to certain protein-protein interactions such as cytokine to receptor). Ligands are captured in the database because publications (or other data we judge as having adequate provenance) have experimentally characterised their interaction with a protein or macromolecular complex. These interactions are selected as a) being mediated by direct binding (i.e. thermodynamically driven), b) specific (i.e. limited cross-reactivity), c) experimentally measurable, d) resulting in activity modulation with biochemical consequences, e) having mechanistic consequences that are pharmacologically relevant and f) allowing us to resolve the ligand identity to a molecular structure (see below). Our high-level classification is divided into endogenous (e.g. metabolites, hormones and cytokines) and exogenous ligands (e.g. drugs, toxins and tool compounds). The deeper categories are in the ligand list tabs.

The basic concept is that a ligand needs to have defined molecular structure (with some exceptions, such as heparin as a fractionated polymer extract). The majority are organic molecules described as chemical entities in a number of formal ways (detailed in Help). We have consequently resolved (and performed automated cross-checking on) over 70% of our ligands to the primary mapping of a PubChem Compound Identifier (CID). This means the structure is defined by the PubChem chemistry rules (documented in their own FAQ). There are many reasons for this choice (which we can detail) but the first is the detailed and transparent relationship mapping between over 60 million CIDs and 350 data sources. A second is our active collaboration with the PubChem team on aspects of our own ligand molecular resolution, BioAssay data mapping, and iterative quality control checking of entries (see our current SID and CID sets). Aside from a small number of inorganic entries, we use peptides and proteins as the two other levels of ligand structural description. These can be mixed, for example where a moderate-size peptide has a CID that defines the backbone with a chemical modification (e.g. C-terminal amidation). Note we also curate the primary sequence string into the record, as well as including the human UniProt ID within which that native sequence has an identical match (many of these correspond to Swiss-Prot cross-references for cleavage-excised peptides). Protein-only ligands are designated by UniProt IDs (e.g. cytokines). Note that mAbs are a special class of ligand since the sequences are usually defined but do not have UniProt IDs (for various reasons). From our collaboration with IMGT mAb-DB we provide pointers to their INN-derived sequence assignments.

The images of small molecules depicted on our site are generated by an online identifier resolver which takes the ligand SMILES as input. This free service from the NCI/CADD group of the National Cancer Institute is built upon the CACTVS software. Note this has an advantage over some other rendering styles in explicitly marking the stereo centres.

We provide a "Similar ligands" tab on ligand pages. These are pre-computed for each release by clustering via a modified sphere-exclusion approach. This is based on similarity of both the properties and structural fingerprints of the molecules. Users can also explore intra-database similarities via the provided substructure and SMARTS pattern-matching searching (see Help). For those ligands that do not display any neighbours in our collection (or even if they do, but you want to extend this into a larger chemical structure space) we recommend using the PubChem "Similar Compounds" link. This will show all CIDs with a pre-computed Tanimoto similarity above 90% and these can be displayed as 2D or 3D clustering (n.b. if these are very large because of many close analogues in PubChem, a higher stringency of related search can be executed, for example 95%).
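As a rough illustration of the similarity threshold mentioned above, the sketch below computes a Tanimoto coefficient between two fingerprints represented as sets of "on" bit positions and filters a small library at a 90% cut-off. The fingerprints, ligand names and the `neighbours` helper are hypothetical; PubChem uses its own fingerprint definition, which is not reproduced here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def neighbours(query, library, threshold=0.9):
    """Names of library entries whose similarity to the query meets the threshold."""
    return sorted(name for name, fp in library.items()
                  if tanimoto(query, fp) >= threshold)


# Toy fingerprints (hypothetical bit positions, not real PubChem fingerprints).
query = set(range(20))
library = {
    "analogue_1": set(range(19)),          # 19 shared bits, 20 in the union
    "analogue_2": set(range(18)) | {100},  # 18 shared bits, 21 in the union
    "unrelated": {200, 201},
}

print(neighbours(query, library, threshold=0.9))
print(neighbours(query, library, threshold=0.85))
```

Raising the threshold (for example to 0.95) shrinks the neighbour list, which is the effect of the higher-stringency search described above.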

Given their importance for pharmacology and medicine the problematic divergence in database molecular structures for approved drugs has been pointed out (PMID 20298516). For this reason, we have chosen to use consensus sets compiled from within PubChem as curatorial starting points (described in this poster). This is because an exact chemical structure match between multiple sources is more likely to be correct. However, at only 900 CIDs this consensus is about 65% of the expected total. Most of these are now curated and include their drug-target relationship mappings. In addition we have front-filled to include new approvals from 2010 up to 2Q 2014. Some back-filling will be explored via the consensus approach. However, since the concept of drug "correctness" is complex and somewhat abstract we have developed stringency guidelines to maximise database utility. These reduce the internal consequences of external different structural representations of the same drugs and associated splitting of activity mappings. By controlling relationship expansion these simplifications maintain the precision of queries. Our guidelines encompass a range of complexities but two can be illustrated. Since drugs can have many salt forms in PubChem we choose (i.e. normalise to) the parent CID for target and activity mapping since this usually corresponds to the INN name-to-structure mapping. However, records in PubChem BioAssay may map to salt forms whereas inspection of the assay details in a paper indicates the assumption of assigning, for example, an IC50 to the parent molecule is reasonable (e.g. if dilution and pH buffering are used). The other major naming ambiguity and data-splitting problem is stereochemistry. An example is where an approved drug INN is assigned to an enantiomeric mixture (that does not interconvert in vivo) but assay data is mapped to three different molecular representations (i.e. both the R and S isomers and the "flat" form). In this case, we assign the drug tag to the mixture and map data to this. We then add cross-pointers to the CIDs for the R and S only if data has been reported and/or mapped to them. A well-known example is omeprazole as the mixture and esomeprazole as the S isomer, as separately approved drugs.
It is important to note that we include both discontinued and withdrawn drugs (generally superseded by newer drugs) to maximise our capture and cheminformatic analysis of drug sets but these can be filtered out of queries if necessary.

No, because the database is focused on quantitative molecular pharmacology, captured as a ligand-target relationship matrix to facilitate data navigation and mining. It is thus neither a substitute for a British Pharmacopoeia as a national example, nor a Drugs.com type of patient-centric resource. Many substances approved for medicinal purposes would negatively impact the precision of our database if we mapped-in their molecular interactions as "drugs". These include nutraceuticals that are principally metabolites (e.g. the DrugBank "approved drug" entry for NADH lists 144 targets), endogenous hormone replacements and inorganic salts (with the important exception of Lithium). We still face the challenge of finding unique nationally approved drugs that are not FDA- or EMA-listed but we do have some Japan-only examples.

Cases where clinical efficacy is thought to be mediated by multiple mmoas (molecular mechanisms of action) are termed polypharmacology. The archetype for this is a dual inhibitor, such as fasidotrilat that acts on both ACE and NEP. For curation, we will map the most potent cross-reactivity but generally not large SAR result sets. If the author convincingly proposes polypharmacology on the basis of the data, we will assign the ligands as having multiple primary targets. The challenge here can be the limited evidence that multiple reported mmoas in vitro are actually translated to synergistic efficacy in clinical trials. Kinases are a particularly difficult example. Since we include the three sets of matrix panel results, as well as selected activity data from individual papers if available, at least the cross-reactivity data is surfaced for users to make their own judgments.

We certainly capture approved drugs and some advanced clinical candidates with clear evidence of therapeutic effects but where the complete mechanism of action is unknown or remains equivocal (e.g. Lithium). We also have some research compounds that have a phenotypic read-out and/or are pathway-mapped as a partial mechanism of action. These have curator comments indicating this (e.g. CCG-1423).

The majority of our ligand entries are small organic molecules, proteins, unmodified peptides, and smaller unmodified nucleotides or polysaccharides. However, we are well aware that increasing numbers of new therapeutic molecular entities in clinical development are covalently linked permutations of these basic forms. Consequently, we are currently looking at the options. This includes assessing HELM, Sugar & Splice and InChI for large molecules and other formal ways for representing hybrid moieties. In addition, we are discerning how companies are adapting their registration systems to handle this. We are also observing the new INN, FDA and USAN guidelines being developed as well as PubChem engagement in this area. In general, we have not added large recombinant protein drugs to the database where these are effectively replacements for endogenous proteins.

There are many examples that do not fit into standard rules for ligand-target relationship mapping. One of these is drug-to-prodrug where we specifically introduced a new relationship. Complications arise where we cannot activity-map the drug to the target where the pro-drug is inactive. As you can see from the ACE inhibitor examples, the challenge is compounded since both forms are assigned an INN. We make another "rule-bend" where we map both prodrug and drug to the primary target (otherwise, it would become complicated since some pro-drugs are active against the target at lower potency). The consequence is thus a slight ligand over-count. However (specifically for ACE inhibitors) this is balanced by some "missing" human target activity mappings (e.g. only rabbit data was published). We use curators' notes to cross-reference the prodrug > drug ligand relationships. Note we also do this for drug > metabolite relationships where these metabolites were reported as significantly bioactive in their own right. Another important exceptional relationship is our recording of ligand-to-ligand binding interactions in the form of therapeutic monoclonal antibodies (mAbs) and their target interactions with cytokines or receptors.

Experimentalists find it valuable for us to point them to isotopically labelled ligand derivatives reported in the literature as probes. However, if the radiolabel positions are not explicit we can neither represent the molecular structure nor match it to a PubChem CID. We therefore introduced a pragmatic solution for unspecified label positions by duplicating the record of the unlabelled structure in order to link to the reference for the results from using the (unspecified) labelled version. Some of these are being remediated as more radiochemical vendors are submitting PubChem entries.

We do not link to any specific supplier because the 2014 increase in PubChem vendor submissions (currently over 55 million CIDs) means we can no longer maintain curated links. The good news is that 80% of our CIDs have a vendor match. These are accessible via the "Chemical Vendors" link on the right-hand side of a CID entry.


Conclusions

sHSPdb harbors a comprehensive dataset available for sHSP, together with tools designed for their online analysis. To our knowledge, there is no equivalent database for sHSP. sHSP are classified into classes on the basis of various parameters, especially on the basis of amino acid motifs that discriminate the classes. sHSPdb thus constitutes an efficient tool: (i) for the compilation and the organization of growing data concerning sHSP, (ii) for the classification of the various sub-families of sHSP, (iii) for the design of experiments to elucidate the function of these important proteins, and (iv) to help the analysis of the sHSP structure-function relationships.

Future developments and perspectives: (i) sHSP physico-chemical properties and sHSP amino acid usage are statistically analyzed for all sHSP classes. We will thus be able to compare the three domains (i.e., the N-terminal, the ACD and the C-terminal), thus bringing additional information to that already determined by structural methodologies. (ii) We are currently developing software for the analysis of sequences submitted by users in order to predict whether they belong to any of the sHSP classes. (iii) Since deciphering the molecular functions of sHSP is a major issue, we will provide lexical tools (dictionaries by alphabetic order or occurrence or synonyms…) for a better semantic analysis of the words that describe the known elements of the function of sHSP. (iv) As previously noted, retained proteins that are not fully classified are under study with the help of some predicting values and of a constraint programming software under development.


Results and discussion

In order to infer the TRNs underlying root development and physiological processes in Arabidopsis, we used two carefully curated datasets obtained from 656 root-specific CEL files from 56 ATH1 microarray experiments (Additional file 1). The first dataset, that we call the TFs-only dataset, is a 656 columns by 2088 rows table that corresponds to our list of 2088 TF probesets. The second dataset, that we call the complete dataset, is a 656 by 22810 table that contains all 22810 probesets present in the ATH1 chip. We used both datasets as input for the ARACNe software [21]. The ARACNe output is a list of interacting probeset pairs ranked through a Mutual Information value and its associated p-value. Details for the theoretical background and practical use of ARACNe can be found in [16] and [21] but, briefly, an interaction between gene A and gene B means that the expression profile of gene A along all 656 experiments explains the expression profile of gene B along those same 656 experiments, and vice versa, as the interactions are not directed. In a biological context, an interaction between gene A and gene B will imply that gene A and gene B participate in the same physiological process and, even further, if gene A is a TF and gene B is a non-TF, the interaction (gene A explains gene B) will suggest that gene A is a transcriptional regulator of gene B.
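To make the notion of an undirected, mutual-information-based interaction more concrete, the following sketch estimates mutual information between two expression profiles with a simple equal-width histogram estimator. This is an illustrative stand-in, not the estimator implemented in ARACNe; the simulated profiles, the bin count and the function names are assumptions made for the example.

```python
import math
import random
from collections import Counter


def discretise(values, bins):
    """Assign each value to one of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant profile
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=8):
    """Plug-in (histogram) estimate of mutual information, in nats."""
    bx, by = discretise(x, bins), discretise(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    count_x, count_y = Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (count_x[i] * count_y[j]))
               for (i, j), c in joint.items())


# Simulated profiles across 656 arrays (synthetic data, not the study's).
random.seed(0)
gene_a = [random.gauss(0, 1) for _ in range(656)]
gene_b = [a + 0.3 * random.gauss(0, 1) for a in gene_a]  # depends on gene_a
gene_c = [random.gauss(0, 1) for _ in range(656)]        # independent

mi_ab = mutual_information(gene_a, gene_b)
mi_ba = mutual_information(gene_b, gene_a)
mi_ac = mutual_information(gene_a, gene_c)
print(mi_ab > mi_ac)  # the dependent pair carries more information
```

Because the estimate is symmetric in its two arguments, `mi_ab` and `mi_ba` agree up to rounding, which is why an inferred edge carries no direction on its own.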

Network inference was centered on the 2088 TF probesets present in the ATH1 chip and was obtained at three data processing inequality (DPI) values, 0.0, 0.1 and 0.2. DPI is a known information-theoretical property and is explained in the supplementary manual in [21]. Briefly, at DPI 0.0, when a three-node clique (triangle) is present, the interaction with the lowest mutual information will be removed, as this interaction is considered to represent an indirect interaction. At DPI values other than 0.0, three-gene loops are allowed and, at DPI 1.0, no interactions are removed. A DPI value of 0.2 (which will preserve triangles if the difference between the mutual information values of its interactions is 20% or less) increases the recovery of true positive interactions while still minimizing the recovery of false positives [16]. After translation of the ARACNe output adjacency files into Cytoscape-compatible tables, we obtained the corresponding TFs-only (TFsNet; Additional file 2) and complete (FullNet; Additional file 3) databases. As shown in Table 1, the number of edges increases dramatically from DPI 0.0 to DPI 0.1 to DPI 0.2. For clarity, all graphical representations of the networks in this paper are those obtained at DPI 0.0.
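The pruning rule described above can be sketched as follows. This is a minimal reading of the text, not ARACNe's own code: for every triangle, the weakest edge is dropped unless its mutual information lies within the tolerance of the next-weakest edge. The node names and MI values are invented for illustration.

```python
from itertools import combinations


def apply_dpi(mi, tolerance=0.0):
    """Prune the weakest edge of each triangle.

    mi maps frozenset({a, b}) to a mutual information value. An edge is
    removed when it is the weakest in a triangle and falls below
    (1 - tolerance) times the next-weakest edge of that triangle.
    """
    nodes = sorted({n for edge in mi for n in edge})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset(p) for p in ((a, b), (a, c), (b, c))]
        if not all(edge in mi for edge in triangle):
            continue  # not a closed triangle
        weakest, *others = sorted(triangle, key=lambda e: mi[e])
        if mi[weakest] < (1.0 - tolerance) * min(mi[e] for e in others):
            to_remove.add(weakest)
    # Removals are decided on the original network, then applied together.
    return {e: v for e, v in mi.items() if e not in to_remove}


# Hypothetical MI values for three genes A, B and C.
mi = {
    frozenset({"A", "B"}): 0.50,
    frozenset({"B", "C"}): 0.45,
    frozenset({"A", "C"}): 0.40,
}

for tol in (0.0, 0.1, 0.2, 1.0):
    print(tol, len(apply_dpi(mi, tol)))
```

In this toy case the A-C edge is removed at tolerances 0.0 and 0.1 but survives at 0.2, which mirrors the growth in edge count as the DPI value is relaxed.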

TFs participating in inferred interactions are expressed in roots

An important question regarding our networks is whether the TFs participating in the inferred interactions are actually expressed in root tissues. The mas5calls function from the affy R package, used to flag microarray expression values as Present, Absent or Marginal, is an unreliable tool to determine if a gene is being expressed or not [22], especially when it involves Arabidopsis TFs [23]. Therefore, in order to determine if the TFs present in our networks are expressed in root tissues, we extracted from both the TFsNet and FullNet obtained at DPI 0.0 all TFs that participate in an interaction and we compared both lists to lists of experimentally determined root-expressed genes (see Methods). Results are presented in Table 2 and Additional file 4. Over 92% of the recovered TFs in the two types of networks have been experimentally determined to be expressed in roots. We are therefore confident that the TFs present in our datasets are indeed root TFs and the interactions that we have recovered represent true in planta transcriptional interactions.

TFs that participate in the same processes are grouped together in the TFsNet

The TFsNet was obtained from a TFs-only dataset that excludes all non-TF genes and constitutes an overview of Arabidopsis roots TFs inferred interactions (Figure 1). TFs participating in the same processes are expected to be grouped together in distinct clusters or modules. Some of these functional modules have been identified and experimentally characterized and serve as probes of the reliability of the inferred networks.

The TFsNet. (a) Overview of the TFsNet obtained at DPI 0.0. Genes are represented as nodes and inferred interactions as edges. Nodes are colored grey, except genes mentioned in the text that are colored green. Edge width and color intensity is proportional to the Mutual Information (MI) value of the interaction, with higher MI values corresponding to thicker and darker edges. Gene names were omitted for clarity. Zooms on particular TF groups are presented in the subsequent panels. (b-i) Sub-networks of TFs present in the SHR-SCR group (b), the PLT group (c), the vascular development group (d), the AtGRF group (e), the AtHAM group (f), the jasmonate response group (g) the iron-deficiency group (h) and the nitrate-response group (i). Edges are labeled with the p-value of the interaction. Edges from the groups to the rest of the network were omitted for clarity.

Two transcriptional pathways controlling stem-cell niche patterning have been identified [24–28]. The first pathway is composed of the GRAS-family SHORT ROOT (SHR AT4G37650) and SCARECROW (SCR AT3G54220) and the C2H2-family, INDETERMINATE DOMAIN (IDD) MAGPIE (MGP AT1G03840) and JACKDAW (JKD AT5G03150). As shown in Figure 1b, these four TFs are grouped together with IDD NUTCRACKER (NUC AT5G44160), the SSXT-domain transcriptional co-activator ANGUSTIFOLIA 3 (AN3 AT5G28640) and the GRAS-family SCARECROW-LIKE 3 (SCL-3 AT1G50420). NUC and SCL-3 are proposed direct transcriptional targets of SHR [29–31]. Note that, as networks obtained at DPI 0.0 cannot contain triangles, the absence of an edge, for example between SHR and NUC, does not imply a lack of interaction between these two genes but merely that both genes have other interactions with better MI scores. Also, interactions between the genes in this module have relatively low MI values, corresponding to p-values of 1e-30 and 1e-40 (relative to the lowest p-value in the dataset, 1e-140). This is probably not surprising since this pathway has a complex mode of molecular interaction [32] that will hinder the ability of the ARACNe algorithm to recover their interaction from microarray data with a higher p-value [21]. Additional IDD genes, AtIDD4 (AT2G02080), AtIDD5 (AT2G02070), AtIDD14 (AT1G68130), AtIDD15 (AT2G01940) and AtIDD16 (AT1G25250), are present in this module. Protein-protein interactions have been reported for SCL-3-NUC, MGP-SCR, MGP-SHR, MGP-JKD, SCR-JKD, and SHR-JKD [33]. On the other hand, the IDD proteins JKD and MGP regulate SHR and SCR expression and movement across root tissues via both transcriptional and protein-protein interactions [27, 34]. Finally, movement of the SHR protein is abolished by the substitution of a single threonine residue in its VHIID motif, which is proposed to mediate protein-protein interactions of SHR [35] and its nuclear localization [34]. 
It is therefore interesting to speculate that AtIDD4, AtIDD5, AtIDD14, AtIDD15 and AtIDD16 could also be involved in root development and patterning via transcriptional regulation of, or protein-protein interactions with SHR and SCR.

The second pathway involves auxin signaling through the activation of as yet unidentified Auxin Response Factors (ARFs) and the PLETHORA (PLT) TFs, of the AP2-EREBP family. The PLT genes, PLT1 (AT3G20840), PLT2 (AT1G51190), PLT3/AIL6 (AT5G10510) and BABY BOOM (BBM AT5G17430), have overlapping expression profiles and act in a redundant manner [26]. In the TFsNet, the four PLT genes are part of the same group, that also includes the bHLH SPATULA (AT4G36930), ARF5/MONOPTEROS (AT1G19850) and the ERF-family Cytokinin Response Factors CRF2/TMO3 (AT4G23750) and CRF3 (AT5G53290; Figure 1c). Remarkably, the four PLETHORA proteins [26] and ARF5 [36] are all expressed in the seedling root stele initials. Root vascular patterning has been shown to be dependent on an auxin-cytokinin cross-talk [37] and the participation in this cross-talk of a few genes, such as SHY2 [38], BRX [39] or AHP6 [40, 41] has been demonstrated. However, a transcriptional network linking the PLETHORA pathway and cytokinin responsive TFs is still missing. The presence of two CRF TFs in this module provides new clues in this direction.

BODENLOS (BDL AT1G04550), a member of the Aux/IAA family, is a transcriptional inhibitor of ARF5 and its expression is controlled by ARF5 in embryos [42]. Curiously, BDL, as well as two other TARGET OF MONOPTEROS (TMO) genes, ATAIG1/TMO5 (AT3G25710) and TMO6 (AT5G60200), do not group with ARF5 in the TFsNet. Instead, they are part of a group of TFs involved in vascular development that includes genes such as IAA13 (AT2G33310), IAA3/SHY2 (AT1G04240) [39], ATHB-14/PHABULOSA (AT2G34710), ATHB-15/CORONA (AT1G52150) [43], IFL/REVOLUTA (AT5G60690) [44], ATHB-8 (AT4G32880) [45], ATHB9/PHAVOLUTA (AT1G30490) and AtTCP14 (AT3G47620) [46] (Figure 1d). pBDL::GFP expression has been observed in the root stele of 4–5 days-old seedlings (see Figure S6 in [42]), thus pointing to possible novel roles for these auxin-related genes in vascular development.

Other TFs involved in organ development are also grouped together in the TFsNet. For example, the closely related ATHAM1 (AT2G45160), ATHAM2 (AT3G60630) and ATHAM3 (AT4G00150) genes, belonging to the GRAS family, are involved in the maintenance of meristem indeterminacy, and are functionally redundant [47, 48]. These three TFs also group in the same module in the TFsNet that we inferred (Figure 1e). Another example concerns the AtGRF genes, of the GRF family, which are expressed in developing tissues, such as shoot tips, flower buds and roots. Single mutants of the AtGRF1 (AT2G22840), AtGRF2 (AT4G37740) or AtGRF3 (AT2G36400) genes have no phenotype and double mutants have minor phenotypes [49], suggesting that these three genes have redundant roles. AtGRF1, AtGRF2 and AtGRF3 group together in the TFsNet put forward here (Figure 1f). Interestingly, our network inference also recovers the interactions AtGRF3-AN3 (p-value 1e-70) and AN3-SCR (p-value 1e-40), suggesting a link between the AtGRF module and the SHR-SCR module during root development.

The TFsNet also recovers transcriptional interactions between genes known to participate in root physiological processes other than development. A first example concerns genes involved in jasmonate response (Figure 1g). This group includes the TIFY domain genes JAZ1 (AT1G19180), JAZ2 (AT1G74950), JAZ5 (AT1G17380), JAZ6 (AT1G72450), JAZ7 (AT2G34600), JAZ8 (AT1G30135), JAZ9 (AT1G70700), JAS1/JAZ10 (AT5G13220), two WRKY genes involved in pathogen response, WRKY18 (AT4G31800) and WRKY40 (AT1G80840) [50], the bHLH-family AIB (AT2G46510) [51] and MYC2 (AT1G32640) [52] and the AP2/ERF RRTF1 (AT4G34410). Interestingly, chromatin immunoprecipitation experiments have shown that WRKY40 binds JAZ8 and RRTF1 regulatory regions [53], while MYC2 was recently shown to be involved in jasmonate-dependent root development inhibition [54].

A second example includes the bHLH TFs BHLH038 (AT3G56970), BHLH039 (AT3G56980), BHLH100 (AT2G41240), BHLH101 (AT5G04150), POPEYE (PYE AT3G47640) and the DNA-binding protein-coding BRUTUS (BTS AT3G18290), which are involved in iron deficiency stress regulation [55, 56]. BHLH039, BHLH101, PYE and BTS are grouped together in the TFsNet (Figure 1h; BHLH038 and BHLH100 are not represented in the ATH1 chip).

A third example involves nitrate response TFs [57]. The earliest TFs to be expressed in response to nitrate stimulus are HRS1 (AT1G13300), LBD37 (AT5G67420), LBD38 (AT3G49940), LBD39 (AT4G37540) and AT3G25790 (cluster 1 in [57]). Four of these five TFs, HRS1, LBD38, LBD39 and AT3G25790, are grouped together in the TFsNet (Figure 1i). Note that the microarray data for Long et al., E-GEOD-21443, and Krouk et al., E-GEOD-20044 in the EBI database, were released a few days after our microarray experiments download and are not part of the data used for our analysis.

Using the FullNet to integrate and analyze high-throughput functional genomics data

The FullNet was obtained from data which included all 22810 probesets present in the ATH1 chip, and was centered on the 2088 TF probesets list (Additional file 3). In this network, TFs will be central nodes, with their interactors, either TFs or non-TFs, as neighboring nodes. Genes participating in the same processes should again be grouped together. For example, the TF groups identified in the TFsNet are still present in the same groups in the FullNet. One must bear in mind that, in this network, non-TF nodes are present. When a non-TF interacts with two TFs, and these interactions have better MI scores than the TF-TF interaction, then the latter interaction will, at DPI 0.0, be considered an indirect interaction, and thus will not appear in the network. However, this does not mean that the TF-TF interaction does not exist, only that it is “masked” by an intermediary non-TF node. When the TF-TF MI value is not the lowest in a triangle it is visible in the DPI 0.0 FullNet. This is the case for the interactions between PLT1, PLT2 and PLT3/AIL6, at p-values of 1e-50, the SCR-SHR interaction at a p-value of 1e-30, the interaction of the early nitrate-responsive TF HRS1 with LBD38, LBD39 and AT3G25790 at p-values of 1e-40 and lower, as well as the interaction of BHLH039 with BHLH101 at a p-value of 1e-60 and with PYE at a p-value of 1e-20. The interaction between AGL71 and AGL72, which was present at a p-value of 1e-20 in the TFsNet, is now recovered with a p-value of 1e-50. These two MADS-box genes have recently been shown to act redundantly in apical and axillary meristems [58].

In the FullNet, interactors of a TF node are potential target genes for that TF. If this is the case, one would expect a significant number of experimentally identified target genes for that TF to be present in the corresponding lists of ARACNe interactors. One example of a TF for which ARACNe-inferred interactions are confirmed experimentally corresponds to VND7/ANAC030 (AT1G71930). VND7 is a NAC-family TF involved in secondary cell wall synthesis and several lists of its putative target genes are available [59–62]. We compared these lists of experimentally identified VND7 target genes with our list of VND7 interactors from the complete dataset at DPI 0.0, 0.1 and 0.2 (Table 3 and Additional file 5). 14 out of 16 genes at DPI 0.0, 24 out of 44 at DPI 0.1 and 24 out of 107 at DPI 0.2 from our VND7 neighbor list are differentially expressed in at least one of the experimental settings. Almost all differentially expressed genes are found at high MI values, corresponding to p-values of 1e-50 and lower. Finally, three of the four differentially expressed TFs identified by Yamaguchi et al. [62], JLO (AT4G00220), MYB46 (AT5G12870) and MYB103 (AT1G63910), are part of the VND7 cluster in the TFsNet, at p-values of 1e-50 and lower. Curiously, a top-ranked VND7 interactor in our dataset, the pinoresinol reductase ATPRR1 (AT1G32100), is not present in any of the experimental VND7 target genes lists. ATPRR1 has, at DPI 0.0, TF interactors with higher MI values than VND7, suggesting that it could instead be regulated by one, or more, of these higher-score TFs. Alternatively, the VND7-ATPRR1 transcriptional interaction could be age-specific and not detectable in any of the above-mentioned experimental settings.
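The comparison above amounts to asking what fraction of the inferred interactors also appear in an experimental target list. The sketch below shows that calculation on a toy example and then restates the counts reported in the text as fractions; the gene identifiers `g1` to `g9` and the helper name are hypothetical.

```python
def overlap_fraction(interactors, targets):
    """Fraction of inferred interactors that also appear in the target list."""
    interactors, targets = set(interactors), set(targets)
    if not interactors:
        return 0.0
    return len(interactors & targets) / len(interactors)


# Toy example with hypothetical identifiers.
print(overlap_fraction({"g1", "g2", "g3", "g4"}, {"g2", "g4", "g9"}))

# Counts reported in the text: (confirmed interactors, all interactors) per DPI.
reported = {0.0: (14, 16), 0.1: (24, 44), 0.2: (24, 107)}
fractions = {dpi: hits / total for dpi, (hits, total) in reported.items()}
for dpi, fraction in fractions.items():
    print(f"DPI {dpi}: {fraction:.1%}")
```

Read this way, the confirmed fraction falls from 87.5% at DPI 0.0 to roughly 22% at DPI 0.2: relaxing the DPI value adds many interactors but almost no additional confirmed targets.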

There are also examples of TFs for which there is little overlap between ARACNe-inferred interactor lists and experimental target gene lists. Two examples are the SHR and SCR TFs. SHR and SCR are important genes for root development and several lists of their proposed transcriptional target genes are available [29–31, 63]. Sozzani et al. [30] obtained, through microarray data analysis, a comprehensive list of differentially expressed genes during a time-course of SCR or SHR induction, while Cui et al. [31] identified SHR target genes through chromatin immunoprecipitation (ChIP). A direct comparison of the target gene lists from Sozzani et al., to which we will refer as the Sozzani-SCR and Sozzani-SHR lists, to our ARACNe list of inferred SCR or SHR interactors obtained at DPI 0.0, 0.1 and 0.2, resulted in a low overall overlap: there are 732 ARACNe-SCR and 719 ARACNe-SHR interactors at DPI 0.2, of which 68 (9.2%) and 159 (22%) were found in the corresponding SCR- or SHR-Sozzani lists. In particular, we would expect to find in both the ARACNe and Sozzani lists genes known to participate in the SHR-SCR transcriptional regulation pathway, namely JKD, MGP, NUC and CYCD6;1 (AT4G03270). The first three genes are TFs and they can be found in the same module as SHR and SCR in the TFsNet. CYCD6;1, a non-TF, is present in both the SCR-Sozzani and SHR-Sozzani lists, but is not an ARACNe-inferred interactor of SHR, SCR, JKD, MGP or NUC. At DPI 0.0 its only interacting TF is AGL92 (AT1G31640), which is not close to the SHR-SCR module in either the TFsNet or FullNet. While disappointing, this result is perhaps not surprising: CYCD6;1 is expressed in very particular wild type root cell types, the cortex/endodermis initial stem cells and lateral root primordium endodermal cells [30, 64]. Furthermore, CYCD6;1 participates in a complex regulatory mechanism involving protein-protein interactions, protein phosphorylation and protein degradation [64]. It is likely that these mechanisms are poorly translated into transcript levels of the corresponding genes in whole root samples, which is the input data for ARACNe.

The ability of ARACNe to recover experimentally identified TF target genes will most likely mirror the number and complexity of the regulatory interactions in which that TF participates. VND7 is a TF involved exclusively in secondary cell-wall synthesis (SCWS) [65, 66]. As such, we expect VND7 to participate in a very specific transcriptional module, and ARACNe to accurately recover its experimentally identified target genes. On the other hand, SHR and SCR are most likely involved in numerous transcriptional pathways, as mutants for these genes are strongly affected in root development [67, 68], and over 200 TFs can be found in the lists of differentially expressed genes for SHR or SCR inductions, which analyzed a specific root cell-type, i.e. ground tissue [30]. Such a large number of differentially expressed TFs (approximately 10% of all Arabidopsis TFs) further suggests that a significant number of these experimentally identified target genes are indirect targets. Additionally, regulation of root development by SCR and SHR involves expression in defined cell types, transport across cell-types, nucleus-cytoplasm translocation, protein-protein interactions and protein phosphorylation [27, 30, 34, 35, 64]. In this case, we expect that better results could be obtained by visualizing experimentally identified target genes in the context of the networks where they participate. We therefore decided to retrieve from the FullNet dataset, obtained at DPI 0.0 and with a cutoff p-value of 1e-30, all interactions for which both nodes are present in the list of 2481 differentially expressed genes in the SHR induction kinetic from Sozzani and collaborators' study [30], to which we added SHR (AT4G37650). The resulting dataset now contains 1668 genes (67% of the original list) and the corresponding network was drawn with Cytoscape [69]. 1647 nodes (66%), including SHR, are grouped together in a single subnetwork (Figures 2a-d). 
We observe that this subnetwork is clearly divided into two sections, corresponding to genes that, as time progresses in the induction kinetic, switch from an under-expressed to an over-expressed state and vice versa. An analysis of this subnetwork can now help identify relevant nodes, which should play important roles in the SHR transcriptional pathway. For example, three of the main nodes that switch from under- to over-expression are PRMT3 (AT3G12270), KYP (AT5G13960) and HD2A (AT3G44750), which code for chromatin modification proteins (histone methyl-transferase and histone deacetylase). An analysis of all genes that switch from under- to over-expression, using DAVID [70, 71] and Enrichment Map [72], reveals that this module is enriched in, among others, cell-cycle, microtubule, RNA-processing and putative chromatin modification protein-coding genes (Figure 2e).
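The filtering step described above (keep only interactions whose two nodes are both in the differentially expressed gene list, then look for the largest connected group) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge tuples `(gene_a, gene_b, p_value)`, the function names and the toy gene identifiers are hypothetical and are not the format of the FullNet files or the procedure used with Cytoscape.

```python
from collections import defaultdict


def induced_subnetwork(edges, genes, p_cutoff=1e-30):
    """Keep edges that pass the p-value cutoff and whose two nodes are both in `genes`."""
    genes = set(genes)
    return [(a, b, p) for a, b, p in edges
            if p <= p_cutoff and a in genes and b in genes]


def connected_components(edges):
    """Return the connected components of an undirected edge list, largest first."""
    adjacency = defaultdict(set)
    for a, b, _ in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


def coverage_percent(n_genes, n_total):
    """Share of the original gene list that is retained, as a rounded percentage."""
    return round(100 * n_genes / n_total)


# Toy data (hypothetical identifiers, not real ARACNe output).
toy_edges = [
    ("G1", "G2", 1e-40),
    ("G2", "G3", 1e-35),
    ("G3", "G9", 1e-50),  # G9 is not in the gene list, so this edge is dropped
    ("G4", "G5", 1e-31),
    ("G1", "G4", 1e-10),  # fails the p-value cutoff
]
toy_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]

kept = induced_subnetwork(toy_edges, toy_genes)
components = connected_components(kept)
largest = components[0]
```

With the counts reported in the text, `coverage_percent(1668, 2481)` and `coverage_percent(1647, 2481)` reproduce the stated 67% and 66%.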

Figure 2. Subnetwork of differentially expressed genes in a SHR induction time-course [30]. Node colors correspond to down-regulated (green) or up-regulated (magenta) genes at 1 hour (a), 3 hours (b), 6 hours (c) and 12 hours (d) after SHR induction. (e) Enrichment Map [72] network of category enrichments calculated with DAVID [70, 71] for genes in the subnetwork that switch from a down-regulated to an up-regulated status during the induction kinetic.

ARACNe-inferred networks allow for the prediction of novel genetic interactions for root-expressed TFs: a possible role for SPATULA in the PLETHORA pathway

The TFsNet was obtained from data that included exclusively our list of 2088 TF probesets (see Methods). In this network, TFs that participate in the same biological process should be grouped together. Therefore, we expect higher-order mutant plants for genes in the same module to exhibit root phenotypes not observed in single mutant plants. We set out to test this hypothesis with genes that are present in the same module but that 1) belong to different TF families, 2) are not immediate neighbors in the TFsNet, and 3) have mutants with distinct root phenotypes. The genes BABY BOOM (BBM) and SPATULA (SPT) matched these criteria. Both genes are present in the same module (Figure 1c). Mutants of the BBM gene, an AP2-domain TF, have slightly shorter roots [26], while mutants of the SPT gene, a bHLH TF, have slightly longer roots than wild type plants [73]. When grown on vertical plates, the bbm-2/spt-2 double mutant exhibited longer roots than either spt-2 or bbm-2 single mutant seedlings (Figure 3). A previous report showed that PIN4 and DR5::GUS expression is altered in the root meristem of spt-11 mutant seedlings [73]. Taken together, these results point to a possible transcriptional interaction between the PLETHORA pathway and SPATULA in the regulation of auxin transport and/or response in Arabidopsis root meristems.
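The three selection criteria can be expressed as a simple pair filter. The sketch below is a hypothetical illustration, not the procedure used in the study: the dictionaries, the function name `candidate_pairs` and the placeholder gene `TF_X` (with its family, phenotype and network neighbors) are assumptions made for the example. Only the BBM and SPT families and root phenotypes are taken from the text.

```python
from itertools import combinations


def candidate_pairs(module_genes, family, neighbors, phenotype):
    """Return gene pairs from one module that meet the three criteria in the text."""
    pairs = []
    for a, b in combinations(sorted(module_genes), 2):
        different_family = family[a] != family[b]
        not_adjacent = (b not in neighbors.get(a, set())
                        and a not in neighbors.get(b, set()))
        distinct_phenotype = phenotype[a] != phenotype[b]
        if different_family and not_adjacent and distinct_phenotype:
            pairs.append((a, b))
    return pairs


# Toy module: TF_X and all adjacency information are hypothetical placeholders.
module = {"BBM", "SPT", "TF_X"}
family = {"BBM": "AP2", "SPT": "bHLH", "TF_X": "bHLH"}
neighbors = {"BBM": {"TF_X"}, "TF_X": {"BBM", "SPT"}, "SPT": {"TF_X"}}
phenotype = {"BBM": "shorter roots", "SPT": "longer roots", "TF_X": "longer roots"}

pairs = candidate_pairs(module, family, neighbors, phenotype)
```

In this toy module only the BBM/SPT pair passes all three tests: BBM and TF_X are immediate neighbors, and SPT and TF_X share a TF family.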

Figure 3. Photographs of six-day-old bbm-2, spt-2 and bbm-2/spt-2 mutant seedlings grown on vertical plates. Bar represents 0.5 cm.

ARACNe-inferred networks allow for the prediction of novel functions for root-expressed TFs: the case of XAL1/AGL12, a MADS-box TF involved in secondary cell-wall synthesis

Since our ARACNe-inferred networks are able to recover known gene associations, we expect them to also be able to predict novel TF functions. As an example of the predictive power of our database, we looked for new TFs that could participate in secondary cell wall synthesis (SCWS). Our strategy consisted of selecting several genes, both TF and non-TF, known to be involved in SCWS, recovering their interactions from the FullNet, and drawing the resulting network in order to identify new SCWS TFs. Several TFs are known to be involved in SCWS, among which we chose VND6/ANAC101 (AT5G62380), VND7/ANAC030 (AT1G71930) [74], SND2/ANAC073 (AT4G28500) [75], MYB46 [76] and IXR11 (AT1G62990) [77]. As SCWS non-TF genes we chose the cellulose synthases CESA4 (AT5G44030), CESA7 (AT5G17420) and CESA8 (AT4G18780) [78], the laccases LAC4 (AT2G38080) and LAC17 (AT5G60020) [79], the cysteine peptidases XCP1 (AT4G35350) and XCP2 (AT1G20850) [80], the chitinase-like ATCTL2 (AT3G16920) [81], the DUF6 domain WAT1 (AT1G75500) [82], TED6 (AT1G43790) [83], the DUF231 domain TBL3 (AT5G01360) [84] and the family 8 glycosyl-transferase GAUT12/IRX8 (AT5G54690) [85]. We then retrieved from the FullNet all interactions involving these genes at DPI 0.0 and a p-value cutoff of 1e-30 and used Cytoscape [69] to visualize the corresponding network (Additional file 6). It immediately appears that these genes are indeed part of a network of SCWS genes that includes our input genes plus several other known or putative SCWS genes, including MYB83 (AT3G08500) [86], ANAC007/VND4 (AT1G12260) [65] and ATPRR1 [87], but also vascular development TFs like ATHB-15 [43], ATHB-16 (AT4G40060) [88] and JLO (AT4G00220), a target of VND7 [62].
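The seed-based retrieval described above can be sketched as follows. This is a minimal illustration that assumes "interactions involving these genes" means edges with at least one seed gene as an endpoint. The edge format `(gene_a, gene_b, mi, p_value)`, the function names and all identifiers in the toy data are hypothetical placeholders rather than real FullNet content.

```python
from collections import Counter


def seed_neighborhood(edges, seeds, p_cutoff=1e-30):
    """Keep edges that pass the cutoff and touch at least one seed gene."""
    seeds = set(seeds)
    return [(a, b, mi, p) for a, b, mi, p in edges
            if p <= p_cutoff and (a in seeds or b in seeds)]


def candidate_tfs(neighborhood, seeds, tf_ids):
    """Count, for each non-seed TF, how many retained edges it takes part in."""
    seeds, tf_ids = set(seeds), set(tf_ids)
    degree = Counter()
    for a, b, _, _ in neighborhood:
        for node in (a, b):
            if node in tf_ids and node not in seeds:
                degree[node] += 1
    return degree


# Toy data (hypothetical names; MI and p-values are made up for illustration).
toy_edges = [
    ("SEED_1", "TF_A", 0.9, 1e-45),
    ("SEED_2", "TF_A", 0.8, 1e-40),
    ("SEED_1", "TF_B", 0.7, 1e-33),
    ("SEED_2", "GENE_C", 0.6, 1e-35),  # retained, but GENE_C is not a TF
    ("TF_A", "TF_D", 0.5, 1e-60),      # no seed endpoint, so it is dropped
    ("SEED_1", "TF_E", 0.2, 1e-12),    # fails the p-value cutoff
]
seeds = {"SEED_1", "SEED_2"}
tfs = {"TF_A", "TF_B", "TF_D", "TF_E"}

neighborhood = seed_neighborhood(toy_edges, seeds)
candidates = candidate_tfs(neighborhood, seeds, tfs)
```

TFs that connect to several seed genes, such as `TF_A` here, are the kind of node that would stand out in the highly connected part of the drawn network.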

In a highly connected part of this SCWS network, 22 TFs that were not part of our input gene list are now present (Figure 4). We retrieved from the FullNet all interactions involving these TFs at DPI 0.0, 0.1 and 0.2 and a p-value cutoff of 1e-30. An enrichment analysis, using DAVID, of the lists of interactors for three of the newly identified TFs, XAL1/AGL12 (a MADS-box TF), BEE2 and AT1G68810 (two bHLH TFs), revealed that they are particularly enriched in SCWS genes (data not shown); the lists of high MI value interactors for each TF are shown in Additional file 7. As these three TFs are present in the highly connected part of the SCWS network, it is not surprising to find that they share several of their interactors. AGL12/XAL1 is a MADS-box transcription factor that is expressed in phloem tissues and is involved in the regulation of both root development and flowering time [89]. BEE2 was first identified as a brassinosteroid-responsive TF [90]. Brassinosteroids promote root growth [91], are essential for the development of the vascular system in Arabidopsis stems [92] and enhance xylem vessel transdifferentiation of Arabidopsis suspension cultures [74]. AT1G68810 is 1) a TF that we found as part of the vascular development cluster in the TFsNet, 2) closely related to ATAIG1/TMO5, which is also part of the TFs-only vascular development cluster, and 3) a protein-protein interactor of LONESOME HIGHWAY, a transcriptional activator involved in vascular development [93]. These results predict that XAL1, BEE2 and AT1G68810 are important TFs for SCWS.

Figure 4. Highly connected part of the SCWS network obtained at DPI 0.0 and a p-value cutoff of 1e-30. Genes are represented as nodes and inferred interactions as edges. TFs not present in the input list are colored yellow. The TFs AT1G68810, BEE2 and AGL12 (XAL1) are further mentioned in the text and colored orange. Edge width is proportional to the Mutual Information (MI) value of the interaction, with higher MI values corresponding to thicker edges.

As MADS-box TFs are not usually associated with SCWS, we decided to look for SCW deposition in xal1-2 loss-of-function mutant roots [89]. Since xal1-2 presents a delay in flowering time, roots from plants of the same chronological age might reveal developmental stage-related SCWS differences rather than a direct SCWS phenotype. Therefore, both Col-0 and xal1-2 roots were collected when the main stem was 29–32 cm in length, which arguably corresponds to plants at the same developmental stage. As predicted by our inferred network, xal1-2 adult roots indeed have altered secondary cell-wall patterns, with gaps in the secondary xylem and fiber ring (n = 10/10), a phenotype rarely observed in wild type plants of the same size (n = 1/10; Figure 5). In an intriguing paper, Sibout et al. have shown that xylem expansion in hypocotyls and roots is linked to flowering time [94]. Consistent with this link, xal1-2 plants have both delayed flowering [89] and altered root SCWS, strongly suggesting that XAL1 could be part of a TRN that connects both processes.
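The text reports only the counts (gaps in 10 of 10 xal1-2 roots versus 1 of 10 wild type roots) and no statistical test. As a reader's check, and not as part of the original analysis, a two-sided Fisher's exact test on these counts can be sketched with the standard library alone. The function name and the table layout are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Rows: genotype (xal1-2, wild type); columns: roots with gaps, roots without gaps.
p_value = fisher_exact_two_sided(10, 0, 1, 9)
```

For these counts the test gives a p-value of 20/167960, about 1.2e-4, which agrees with the description of the phenotype as rare in wild type plants. The sample is small, so this is an order-of-magnitude check only.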

Figure 5. Photographs of wild type (a) and xal1-2 (b) adult root transverse sections observed under UV light. Autofluorescence of lignified tissues appears as a light-blue coloration. White arrows in xal1-2 indicate gaps in the xylem-fiber ring. Bars represent 100 microns.

The confirmation of SCWS alterations in xal1-2 root tissues shows that our bioinformatics methodology to infer TRNs is a successful approach for the accurate prediction of novel functions for root-expressed TFs. This result further supports the expectation that our networks will provide novel hypotheses concerning functional modules involved in root development. As an additional example, the DUF6 protein WAT1 [82] has, at DPI 0.0 and a p-value cutoff of 1e-50, the TF interactors ATHB-15/CNA, AT1G68810, AT4G29100, STH2, ATHB-16 and AT2G28510, all of which are part of the vascular development cluster of the TFsNet (Figure 1d). This suggests, first, that one or more of these TFs is the transcriptional regulator of WAT1 in root tissues and, second, that one or more of these TFs controls vascular development, at least partly, through the direct transcriptional regulation of WAT1. Finally, the DUF6-domain protein-coding genes AT1G43650, AT1G01070, AT3G45870, AT3G18200 and AT4G30420 are interactors of TFs known to be involved in SCWS, suggesting that they might have similar roles to WAT1 in root SCWS.



Materials and Methods

We use publication data from the Web of Science database (www.webofknowledge.com), purchased for research purposes by some of the authors of this publication in 2013. The database includes several types of scientific output, such as articles, letters, reviews, editorials, and abstracts, from 1898 to 2012 across more than 22,000 scientific journals from broad domains, resulting in a set of more than 50 million papers. For each paper, the dataset includes the date of publication (month, day, year), the journal name and journal issue, the author names in the order in which they appear in the article, the authors' affiliations, and the references to past articles indexed in the database. For Nature, we downloaded the full publication history using the Nature opensearch Application Programming Interface.

For our analysis, we focused on publications from 1960 to 2012 published in interdisciplinary journals (Nature, Science, and PNAS), as well as in journals associated with five distinct scientific fields: medicine, biology, mathematics, chemistry, and physics. To identify the journals belonging to each category, we first parsed dedicated Wikipedia pages containing lists of journal names associated with specific scientific fields and then matched these with the journals in the database (31). In total we identified 97 biology, 337 medicine, 243 physics, 248 mathematics, 138 chemistry, and 3 interdisciplinary journals.

Next, we extracted the publications associated with each of these categorized journals. To ensure that we deal with original research, we collected only publications labeled as articles, letters, or reviews whose titles did not contain the terms comment, reply, errata, or retracted article. Moreover, to have sufficient statistics, we took into account only the categorized journals fulfilling all of the following criteria: The collected publications associated with the journal span a period of at least 10 y, at least 1,000 collected publications were published in the journal overall, and at least 100 collected publications were published each year in the journal.
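The selection rules above can be sketched in Python. This is a minimal illustration and not the authors' code: the record fields (`journal`, `year`, `type`, `title`), the function names, the case-insensitive substring match on titles, and the reading of "each year" as every calendar year between a journal's first and last collected publication are all assumptions made for the example.

```python
from collections import defaultdict

# Document types and title terms taken from the text; the matching rule
# (case-insensitive substring) is an assumption of this sketch.
KEPT_TYPES = {"article", "letter", "review"}
EXCLUDED_TITLE_TERMS = ("comment", "reply", "errata", "retracted article")


def is_original_research(pub):
    """Keep articles, letters, and reviews whose title has no excluded term."""
    title = pub["title"].lower()
    return pub["type"] in KEPT_TYPES and not any(
        term in title for term in EXCLUDED_TITLE_TERMS
    )


def eligible_journals(pubs, min_span_years=10, min_total=1000, min_per_year=100):
    """Return the journals that satisfy all three criteria from the text."""
    counts = defaultdict(lambda: defaultdict(int))  # journal -> year -> count
    for pub in pubs:
        if is_original_research(pub):
            counts[pub["journal"]][pub["year"]] += 1

    kept = set()
    for journal, per_year in counts.items():
        first, last = min(per_year), max(per_year)
        span = last - first + 1  # assumption: inclusive count of calendar years
        total = sum(per_year.values())
        # Years with no collected publication count as zero (assumption).
        every_year_ok = all(
            per_year.get(year, 0) >= min_per_year
            for year in range(first, last + 1)
        )
        if span >= min_span_years and total >= min_total and every_year_ok:
            kept.add(journal)
    return kept
```

The thresholds are keyword arguments so that the same function can be tried on a small sample with lower limits.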

After this preprocessing, our data amount to (i) 795,558 publications from 40 journals in biology, (ii) 1,350,936 publications from 128 journals in medicine, (iii) 1,753,641 publications from 117 journals in physics, (iv) 208,223 publications from 26 journals in mathematics, (v) 1,341,150 publications from 72 journals in chemistry, and (vi) 251,294 publications from Nature, Science, and PNAS. Data about the proportion of new, established, and chaperoned PIs over time and the values of c, C, and C_alphabet are provided for each journal on GitHub (https://github.com/SocialComplexityLab/chaperone-open). Raw data from Web of Science cannot be shared publicly on the web, but we offer the possibility to reproduce our results starting from raw records by making a research visit to Northeastern University or Central European University where the data are accessible. Data about the journal Nature can be downloaded for free from Nature opensearch (https://www.nature.com/opensearch/).

Author Name Disambiguation.

We converted all author names present in the collected publications to lowercase and reduced given names to their initials. An author named “John Smith” or “Mary Suzy Johnson” would thus be converted to the format “smith,j” or “johnson,ms,” respectively. We considered all publications within the same journal that are authored under an identical formatted name to belong to the same individual. We expect errors induced by homonyms, i.e., distinct individuals who share the same formatted name, to be rare because we compare names only within the same journal. An error can thus occur only if two distinct individuals share the same formatted name and publish in the same scientific field, i.e., the same journal, which is already an accurate disambiguating feature (32).
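A minimal Python sketch of this formatting and grouping step follows. It assumes that names arrive as "Given [Middle ...] Surname" strings with the surname as the last whitespace-separated token; the function names and the record fields `journal` and `authors` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def format_author_name(full_name):
    """Lowercase a name and reduce the given names to their initials.

    Assumes the surname is the last whitespace-separated token.
    """
    parts = full_name.lower().split()
    surname, given_names = parts[-1], parts[:-1]
    initials = "".join(name[0] for name in given_names)
    return f"{surname},{initials}"


def group_by_author(pubs):
    """Group publications by (journal, formatted name).

    Each key is treated as one individual, as described in the text.
    """
    careers = defaultdict(list)
    for pub in pubs:
        for name in pub["authors"]:
            key = (pub["journal"], format_author_name(name))
            careers[key].append(pub)
    return careers


print(format_author_name("John Smith"))         # smith,j
print(format_author_name("Mary Suzy Johnson"))  # johnson,ms
```

Keying on the journal as well as the formatted name means that the same formatted name appearing in two different journals is treated as two separate individuals, which is the disambiguation choice described above.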

Robustness of Results to Alphabetic Ordering.

In certain scientific fields it is common to order authors alphabetically (17). To understand how this affects the results, we perform two versions of our analysis: one that takes all publications into account and one that disregards publications whose authors are alphabetically ordered. The latter removes 17.7% of all publications within biology, 14.4% within medicine, 30.9% within physics, 75.1% within mathematics, 23.3% within chemistry, and 20.8% within interdisciplinary journals. Note that these numbers include publications where the authors are ordered alphabetically by choice, but also publications where this occurred by chance. Nonetheless, our conclusions are robust for both datasets, consistent with the result shown in Fig. 2 that there is a significant difference between the observed C and that of the alphabetical null model, C_alphabet (SI Appendix, Fig. S5).
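The alphabetical filter can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' procedure: names are compared by surname only, using the formatted "surname,initials" strings; a publication with fewer than two authors is treated as not alphabetically ordered; and the chance estimate assumes distinct surnames and a uniformly random author order. The function names are hypothetical.

```python
from math import factorial


def is_alphabetically_ordered(formatted_names):
    """True if two or more authors appear in ascending surname order.

    Single-author publications are treated as not alphabetical, which is
    an assumption of this sketch.
    """
    surnames = [name.split(",")[0] for name in formatted_names]
    return len(surnames) > 1 and surnames == sorted(surnames)


def chance_of_alphabetical_order(n_authors):
    """Probability that a uniformly random ordering of n distinct
    surnames happens to be alphabetical: 1 / n!."""
    return 1 / factorial(n_authors)


print(is_alphabetically_ordered(["adams,j", "smith,m"]))  # True
print(is_alphabetically_ordered(["smith,j", "adams,m"]))  # False
print(chance_of_alphabetical_order(2))                    # 0.5
```

Under these assumptions a two-author publication is alphabetical by chance half of the time, which illustrates why the removed fractions include orderings that arose by chance as well as by choice.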