
How to obtain a list of proteins sorted by the ~1400 unique protein folds?


The databases CATH and SCOP both have around 1400 unique protein folds recorded from analysis of the PDB. However, I do not see any method to access this particular data.

  1. A list of each of the 1400 folds (just an id number, and/or a descriptor)?

  2. For each individual fold (of the 1400), a list of PDB IDs for proteins which are known to adopt each individual fold?


If there is a simple way provided to do this it is very well hidden. The tedious and stupid way to do 1 (get a list of folds) would seem to involve rolling your own:

  1. Go to http://scop.berkeley.edu/ver=2.07 (or whatever is the latest version).

  2. Click on each of the 12 classes in turn. e.g. (a) all alpha proteins will take you to http://scop.berkeley.edu/sunid=46456 .

  3. Save the source of each page as text.

  4. Write and run your own parser to pull out the sunid (the number following sunid= in each http://scop.berkeley.edu/sunid= link) and the description line if you wish. (This assumes you program.) I think this sunid is the fold id.
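Step 4 of the list above can be sketched in a few lines of Python. This is a minimal sketch, not a tested scraper: it assumes the saved page source contains links with "sunid=" followed by digits, as in the URL quoted above, and the function name extract_sunids is made up for illustration.

```python
import re

# Assumption: fold links in the saved page source contain "sunid=<digits>",
# matching the URL pattern quoted in the steps above. Real markup may differ.
SUNID_RE = re.compile(r"sunid=(\d+)")


def extract_sunids(html):
    """Return the distinct sunid values found in html, in first-seen order."""
    seen = {}
    for match in SUNID_RE.finditer(html):
        # dict keys keep insertion order, so duplicates are dropped stably
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Note that a class page will also link to items that are not folds (the class itself, superfamilies, and so on), so the result would still need filtering by hand or by page structure.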

If you can then find some database or table that has PDB and sunid values in it, you can write another program to find the answer to 2.

Alternatively… (appended January 2021)

  1. Download dir.cla.scope.2.07-stable.txt (or the latest version)
  2. Save as a text file.
  3. Open in Microsoft Excel. (Just dragging onto the app icon formatted it properly on my Mac. Your mileage may vary.)
  4. You can just select the column with the ids, paste into another sheet, and then remove duplicates to get all the different fold ids. (Alternatively, you have about 276,000 entries to do with whatever you wish.)
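If you would rather script this than use a spreadsheet, the same file can be grouped by fold in Python. This is a sketch under an assumption about the file layout: tab-separated columns with the PDB ID in the second column and a hierarchy string such as "cl=46456,cf=46457,sf=..." in the sixth, where cf= is the fold sunid. Check this against the header comments of the file you download. The function name folds_to_pdb_ids is made up for illustration.

```python
from collections import defaultdict


def folds_to_pdb_ids(lines):
    """Map each fold sunid (the cf= value) to the set of PDB IDs in that fold.

    Assumed layout (tab-separated): domain id, PDB ID, region, sccs,
    sunid, and a hierarchy string like "cl=...,cf=...,sf=...,px=...".
    """
    folds = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # not a classification record
        pdb_id = fields[1]
        # Turn "cl=46456,cf=46457,..." into {"cl": "46456", "cf": "46457", ...}
        levels = dict(item.split("=", 1) for item in fields[5].split(","))
        if "cf" in levels:
            folds[levels["cf"]].add(pdb_id)
    return dict(folds)
```

Usage would be along the lines of opening dir.cla.scope.2.07-stable.txt and passing the file object to folds_to_pdb_ids; the keys of the result answer question 1 and the values answer question 2.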

It looks like you can download the full database in SQL format or parse-able text files from here: SCOP Download - Berkeley

That page also links to the schema.


Automated protein structure calculation from NMR data

Current software is almost at the stage to permit completely automatic structure determination of small proteins of <15 kDa, from NMR spectra to structure validation with minimal user interaction. This goal is welcome, as it makes structure calculation more objective and therefore more easily validated, without any loss in the quality of the structures generated. Moreover, it releases expert spectroscopists to carry out research that cannot be automated. It should not take much further effort to extend automation to ca 20 kDa. However, there are technological barriers to further automation, of which the biggest are identified as: routines for peak picking; adoption and sharing of a common framework for structure calculation, including the assembly of an automated and trusted package for structure validation; and sample preparation, particularly for larger proteins. These barriers should be the main target for development of methodology for protein structure determination, particularly by structural genomics consortia.



Abstract

Analyses of genomes show that more than 70% of eukaryotic proteins are composed of multiple domains. However, most studies of protein folding focus on individual domains and do not consider how interactions between domains might affect folding. Here, we address this by analysing the three-dimensional structures of multidomain proteins that have been characterized experimentally and observe that where the interface is small and loosely packed, or unstructured, the folding of the domains is independent. Furthermore, recent studies indicate that multidomain proteins have evolved mechanisms to minimize the problems of interdomain misfolding.


Results

Large-scale profiling of RNA binding protein binding sites with eCLIP

The eCLIP methodology enabled highly efficient identification of RBP binding sites [18], leading to the generation of the first large-scale database of RNA binding protein targets profiled in the same cell types using a standardized workflow [20]. This dataset contains 223 eCLIP profiles of RNA binding sites for 150 RNA binding proteins (120 in K562 and 103 in HepG2 cells), covering a wide range of RBP functions, subcellular localizations, and predicted RNA binding domains (Fig. 1a, Additional files 1 and 2) [20]. Each experiment contains biological duplicate immunoprecipitation libraries along with a paired size-matched input from one of the two experimental biosamples (Fig. 1b). For each experiment, raw sequencing data, processed data (including read mapping and identified binding sites), and experimental meta-data (including antibody and immunoprecipitation validation documentation, biosample information, and additional related ENCODE datasets) were deposited at the ENCODE data coordination center (https://www.encodeproject.org) [20].

Two hundred twenty-three eCLIP datasets profile targets for 150 RNA binding proteins. a Colors indicate RBPs profiled by eCLIP, with manually annotated RBP functions, subcellular localization patterns from immunofluorescence imaging, and predicted RNA binding domains indicated (Additional file 1). b Schematic overview of eCLIP as performed in the datasets described here. Two biological replicates (defined as biosamples from separate cell thaws and crosslinked more than a week apart) were performed for each RBP, along with one size-matched input taken from one of the two biosamples prior to immunoprecipitation

Many CLIP methods included radioactive labeling of the 5′ end of RNA fragments with 32P to visualize protein-RNA complexes after SDS-PAGE electrophoresis and membrane transfer in order to query whether RNA bound to co-purified RBPs of different size is present [4]. However, the eCLIP protocol we utilized above did not include this direct visualization of protein-associated RNA due to the complexity of incorporating radioactive labeling at this scale, preferring validation of eCLIP signal with orthogonal approaches (such as comparison with in vitro-derived motifs or overlap with knockdown/RNA-seq changes). To address this question for future large-scale eCLIP profiling, we pursued alternative labeling approaches. We found that ligation of biotinylated cytidine (instead of the normal RNA adapter) enabled visualization similar to that observed with 32P while using commercially available chemiluminescent detection reagents for biotin-labeled nucleic acids (Additional file 3: Fig. S1a-c) [21]. We note that unlike 32P labeling (which is done as a 5′ phosphorylation reaction with T4 Polynucleotide Kinase), this labeling uses the standard eCLIP RNA adapter ligation reaction and thus may more accurately reflect true protein-coupled RNA positioning.

Surprisingly, when expanding this approach across RBPs, we observed detectable transfer of RNA from non-crosslinked cells to nitrocellulose membranes in a supplier-dependent manner (Additional file 3: Fig. S1d-f). We had previously noted that certain sourced nitrocellulose membranes contained greater amounts of RNA, which would then be recovered during library preparation (particularly in input libraries, which lack adapter addition prior to membrane transfer) [22]. However, we now observed that the recommended (lower contaminant, membrane I) membrane from that effort showed greater transfer of RNA than our previous supplier (membrane G) (Additional file 3: Fig. S1d-f). Although the signal observed in crosslinked samples was typically significantly higher (median 12.5-fold across 17 RBPs tested), with 88% (15 out of 17) RBPs greater than 5-fold (Additional file 3: Fig. S1d), for 2 out of 17, we observed within 5-fold RNA transfer in non-crosslinked samples (Additional file 3: Fig. S1d,f).

To directly query whether this led to artifactual eCLIP peak identification, we chose seven eCLIP experiments performed with membrane I and performed replicate experiments with membrane G. Using MATR3 as an example, we observed that peak fold-enrichment compared across membranes was similar to that observed for within-membrane replicates (Additional file 3: Fig. S1g). Extending this to all seven RBPs, only one (FXR2) out of seven showed notably lower replication of peak significance using membrane G (Additional file 3: Fig. S1h), and even in that case, we observed high overall correlation in peak fold-enrichment (Additional file 3: Fig. S1i). Conservation of signal was not limited to peak calls, as we observed similar enrichments for retrotransposable and other RNA elements as well (Additional file 3: Fig. S1j). Thus, although our data indicates that transfer of non-crosslinked RNA to nitrocellulose membranes is supplier- and product-dependent, it does not generally appear to add significant background to the eCLIP profiles studied here.

Recovering RNA binding protein association to retrotransposons and other multicopy RNAs

Standard peak analysis revealed a wide variety of binding modes to mRNAs, with RBPs enriched for coding sequences, 3′ and 5′ untranslated regions, proximal and distal intronic regions, and non-coding RNAs (Additional file 3: Fig. S2a) [20]. Notably, we observed that RNA binding protein mRNAs were 1.4-fold enriched (p = 2.1 × 10⁻²² by one-sample t test) among all peak-containing genes (median 13.5% per dataset, relative to 9.4% of all genes with at least one peak). In particular, well-studied splicing regulators (e.g., SRSF7 and TRA2A) were more than 3-fold enriched for binding to RBPs (Additional file 3: Fig. S2b-c). In contrast, transcription factors were unchanged (1.0-fold depleted), suggesting that RNA processing regulators are particularly likely to themselves be the target of RNA processing regulation. In total, RBPs profiled in this study bound a median of 107 RBPs and 34 transcription factors, confirming the presence of a highly complex regulatory network of RNA and DNA processing (Additional file 3: Fig. S2c).

In addition to single-copy RNA transcripts, the human genome contains many high-copy regions that are expressed as functional RNAs but present a substantial challenge to standard short read mapping strategies. These include RNAs such as the large and small ribosomal RNA (rRNA), 7SK snRNA, and others that have one or few expressed primary transcripts but dozens to hundreds of pseudogenes throughout the genome, as well as retrotransposable elements including LINE and Alu elements with thousands of moderately divergent sense and antisense copies throughout transcribed genes [23]. We found that simply including non-uniquely mapped reads in standard analysis created thousands of peaks in introns, in intergenic regions, and at pseudogenes that typically lacked standard peak shapes (likely reflecting sequencing errors relative to the main expressed transcript), indicating the need for improved methods to properly quantify RBP binding to such loci.

In order to include these RNA types in eCLIP analysis, we developed a “family-aware mapping” approach in which adapter-trimmed reads are first mapped against a database of sequences for primary transcripts and pseudogenes for 82 families (Fig. 2a) (Additional file 4). Reads mapping to reference transcripts contained within a family (e.g., LINE, YRNA, or 18S rRNA) are used for quantitation, but reads that map to multiple families are masked (discarding an average of 1.1% of reads). These results are then integrated with standard unique genomic mapping in order to incorporate reads that uniquely map to regions annotated as repetitive elements by RepeatMasker [24] into the final family quantitation (Fig. 2a). Confirming the success of this approach, we observed that in eCLIP replicates of YRNA-associating factor TROVE2/RO60 in K562, only 3.7 and 6.8% (replicate 1 and 2, respectively) of usable reads uniquely mapped to YRNA transcripts with standard processing (2.9 and 5.1% to RNY1/2/4/5, with another 0.7% and 1.8% to YRNA pseudogenes) (Fig. 2b). In contrast, for these same datasets, 14.2% and 21.7% of reads mapped uniquely to the YRNA family using the family-aware mapping approach, making use of hundreds of thousands of additional reads that did not uniquely map to individual transcripts (Fig. 2b). Performing this analysis for all RBPs, we observed a wide range of read recovery and enrichment for particular elements (Fig. 2c, Additional file 5). For some RBPs such as RPS11 (K562), an average of 95.2% of reads were only recovered using family mapping (68.1% mapping to RNA18S with an additional 24.1% to RNA28S). In contrast, only 10.4% of reads in KHSRP (K562) eCLIP mapped to multicopy family elements, with 58.9% uniquely mapping to the genome (including 41.1% uniquely mapping to introns outside of RepeatMasker elements) (Fig. 2c).
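The family assignment rule described above (keep reads that map within one family, mask reads that map equally well to more than one) can be illustrated with a short Python sketch. This is not the authors' pipeline: the input format, the use of a plain mismatch count as the score, and the names assign_family and count_families are assumptions made for illustration.

```python
from collections import Counter


def assign_family(alignments):
    """Pick a family for one read, or None if the assignment is ambiguous.

    alignments: list of (family, mismatches) tuples, one per alignment of
    the read to a reference transcript or pseudogene (hypothetical format).
    """
    if not alignments:
        return None
    best = min(mismatches for _, mismatches in alignments)
    # Several best hits inside the same family collapse to one family here,
    # so multi-copy transcripts of one family are still counted.
    best_families = {fam for fam, mismatches in alignments if mismatches == best}
    return best_families.pop() if len(best_families) == 1 else None


def count_families(reads):
    """reads: dict of read id -> alignments. Returns (per-family counts, masked)."""
    counts = Counter()
    masked = 0
    for alignments in reads.values():
        family = assign_family(alignments)
        if family is None:
            masked += 1  # best score shared across families, or no alignment
        else:
            counts[family] += 1
    return counts, masked
```

The point of the design is that a read hitting several Y RNA copies equally well is still informative at the family level, whereas a read hitting both a Y RNA and an rRNA equally well is not.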

Quantification of repetitive elements and other non-uniquely mapped reads. a Graphical representation of repetitive element mapping. Reads are mapped to human genome (requiring unique mapping) and a database of repetitive element families. Reads are then associated with RNA element families based on mismatch score, with (red) reads discarded if mapping equally well to more than one family. b Stacked bars indicate the number of reads from TROVE2 eCLIP in K562 that map either uniquely to one of four primary Y RNA transcripts, map uniquely to Y RNA pseudogenes (identified by RepeatMasker), or (for family-aware mapping) map to multiple Y RNA transcripts but not uniquely to the genome or to other repetitive element families. c Stacked bars indicate the fraction of reads (averaged between replicates) of all 223 eCLIP experiments, separated by whether they map (red) uniquely to the genome, (purple) uniquely to the genome but within a repetitive element identified by RepeatMasker, or (gray) to repetitive element families. Datasets are sorted by the fraction of unique genomic reads. d Heatmap indicates the relative information for 26 elements and 168 eCLIP datasets, requiring elements and datasets to have at least one entry meeting a 0.2 relative information cutoff (based on Additional file 3: Fig. S2d). See Table 1 for RBP:element enrichments meeting these criteria and Additional file 5 for all enrichments

At the element level, our family-aware mapping strategy recovers many known processing or interacting factors, including RBPs enriched for the mature 18S (RPS3, RPS11) and 28S rRNA (DDX21, NOL12) as well as the 45S rRNA precursor (UTP18, WDR43), tRNAs (NSUN2), RN7SK (LARP7), YRNA (TROVE2), and others (Fig. 2d). To validate this approach, we considered 17 RNA elements with well-studied direct links to either RBP function (such as snoRNA binding with rRNA processing and snRNA binding with snRNA processing and the spliceosome) or specific RBP regulators (e.g., snRNA RN7SK with LARP7 [25] and YRNAs with TROVE2/Ro60 [26]) (Additional file 3: Fig. S2d). We observed that 140 eCLIP datasets had one of these 17 elements as the most highly enriched (by relative information, which we observed to better enable comparison across elements versus fold-enrichment), and in 84 (60%) of these cases, the RBP was previously characterized as having the element-paired RBP function, indicating that this approach is highly successful at recovering targets that reflect annotated functions of profiled RBPs. To set a cutoff for analysis, we found that an information cutoff of 0.2 maximized predictive accuracy, at which 70% (74 out of 105 RBPs with the most enriched RNA element meeting this cutoff) had annotated functions matching the known role for this element (Additional file 3: Fig. S2e). Using this cutoff, 235 RBP-element pairings were identified with large numbers of RBPs associated with mRNA regions (42 with CDS, 24 with 3′UTR, 40 with distal intronic, and 23 with proximal intronic regions) and rRNA (24 with RNA28S and 15 with RNA18S, as well as 12 with precursor 45S rRNA), and smaller numbers associated with other specific RNA classes (Fig. 2d, Table 1).
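The passage above does not define relative information. As an illustration only, the sketch below assumes it is the per-element relative-entropy term p × log2(p / q), where p and q are the fractions of immunoprecipitation and input reads assigned to the element; that definition, the function names, and the handling of zero counts are assumptions, not something stated in this section.

```python
import math

CUTOFF = 0.2  # the information cutoff quoted in the text


def relative_information(ip_reads, ip_total, input_reads, input_total):
    """Assumed definition: p * log2(p / q) for one element.

    p = fraction of IP reads on the element, q = fraction of input reads.
    """
    if input_reads <= 0:
        # Zero input would need a pseudocount; that choice is left open here.
        raise ValueError("input_reads must be positive")
    p = ip_reads / ip_total
    q = input_reads / input_total
    if p == 0:
        return 0.0  # limit of p * log2(p / q) as p goes to 0
    return p * math.log2(p / q)


def passes_cutoff(value, cutoff=CUTOFF):
    """True if an element's relative information meets the cutoff."""
    return value >= cutoff
```

Under this assumed definition, an element holding half of the IP reads at 4-fold enrichment scores 1.0, while a rare element at the same fold-enrichment scores far lower, which would be consistent with the remark that relative information compares better across elements than fold-enrichment does.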

Characterization of ribosomal RNA interactors and processing factors

Ribosomal RNA (rRNA) is the most abundant RNA found in eukaryotic cells and plays essential roles in defining the structure and activity of the ribosome. In humans, the 5S rRNA is separately transcribed, whereas the 18S, 28S, and 5.8S rRNAs are transcribed as one 45S precursor transcript that then undergoes a complex series of cleavage and RNA modification steps to process the mature rRNAs, which then form complex structures that scaffold the assembly of ~80 proteins to create the functional ribosome [27]. Unbiased approaches have characterized over 250 additional factors as playing critical roles in processing pre-rRNA, indicating that rRNA processing and function represent a major function of RBPs in humans [28].

Considering the 150 RBPs profiled, we observed that different subsets of RBPs showed enrichment to specific rRNAs (Fig. 3a), suggesting that the incorporation of normalization against paired input was successful in removing general background at abundant transcripts. Although we are unable to distinguish between mapping to mature 18S, 28S, and 5.8S transcripts versus those regions in the precursor, the ~10-fold lower read density we observe for 45S (median 281 reads per million (RPM)) versus 18S (2715 RPM) or 28S (1983 RPM) in eCLIP input samples (Additional file 3: Fig. S3a-c) suggests that the majority of 18S and 28S reads reflect mature rRNA transcripts. Considering 30 RBPs previously shown to affect pre-rRNA processing [28], we found that 16 had enrichment for one of the three (18S, 28S, or 45S) rRNAs (42.1% of RBPs meeting a 0.101 position-wise information cutoff) relative to 12.5% of others (3.4-fold enriched, p = 0.00025 by Fisher’s exact test) (Additional file 3: Fig. S3d). Despite high and relatively even read density overall on the abundant rRNA transcripts (Additional file 3: Fig. S3a-c), we observed that these rRNA-enriched RBPs showed a number of specific enrichment patterns: two on the 45S precursor (one situated around the 01 and A0 early processing sites, and a second located ~2000 nt further downstream that is discussed below), a cluster at position ~4200 of the 28S, and a cluster at ~1150 of the 18S, along with other profiles unique to individual RBPs (Fig. 3a). Distinct ribosomal components RPS3 and RPS11 had different positional enrichments, as expected given their different positioning within the 18S ribosome (Additional file 3: Fig. S3e).

eCLIP enrichment for rRNA links RBPs with ribosomal RNA processing. a Heatmap indicates relative information at each position along (top) the ribosomal RNA precursor 45S polycistronic transcript and (bottom) within the mature 18S and 28S transcripts. Reads mapping equally to the 45S and mature 18S or 28S are assigned to the mature for quantitation. Purple asterisk indicates RBPs for which knockdown showed rRNA processing defects in Tafforeau et al. [28]. b Lines indicate fold-enrichment in DDX51 eCLIP in K562 cells at the 3′ end of the 28S and 45S transcript. For this and further plots, black line indicates mean and gray region indicates 10th to 90th percentile across all 223 eCLIP datasets. c, d Lines indicate relative information for c UTP18 in K562 and d WDR3 in K562 across the 45S precursor. e Lines indicate fold-enrichment for indicated RBPs within a region flanking putative ribosomal-encoded microRNA rmiR-663. f Red indicates mismatch positions relative to ribosomal rmiR-663 (and 100 nt flanking regions) for genomic-encoded miR-663a, miR-663b, and two additional homologous regions containing putative microRNAs. g Pie chart indicates the fraction of reads in ILF3 HepG2 eCLIP mapping (green) with fewer mismatches to rmiR-663, or (gray) mapping equally well to rmiR-663 and other miR-663 family members as indicated. See Additional file 3: Fig. S3j-k for LIN28B (HepG2) and SSB (HepG2). h, i Points indicate fold-enrichment in each eCLIP dataset for h C/D-box snoRNAs versus 45S precursor RNA, and i H/ACA-box snoRNAs versus C/D-box snoRNAs. Pearson’s correlation and significance were calculated in MATLAB

Our data on rRNA precursor position-specific enrichment confirms and provides further resolution to proteins previously characterized to play roles in ribosomal RNA processing. Some factors had specific positioning, including DDX51 which had specific enrichment at the 3′ end of 28S as well as the 3′-ETS precursor region, consistent with previous characterization of the role of DDX51 in 3′ end maturation of 28S [29], and UTP18 which had specific enrichment at the 5′ end, matching its roles in early cleavages at the 01, A0, and 1 sites suggested from large-scale screening data [28] (Fig. 3b, c, Additional file 3: Fig. S3f-g). Others, such as WDR3, had broader enrichment patterns that suggest participation in multiple maturation steps (Fig. 3d, Additional file 3: Fig. S3h).

Surprisingly, we observe a cluster of RBP association in the 45S precursor around position 2100, a region located between the A0 and 1 processing sites which lacks a well-defined processing role (Fig. 3a) [27]. Two of these factors have previous links to nucleolar activity, as ILF3 (also known as NF90) was previously shown to associate with pre-60S ribosomal particles in the nucleolus and knockdown of ILF3 gives defects in rRNA biogenesis [28, 30], and LIN28B has been shown to repress let-7 processing by sequestering pri-let-7 in the nucleolus [31]. In this region, multiple sites of ILF3 and SSB enrichment flank a more specific region enriched in LIN28B eCLIP (Fig. 3e, Additional file 3: Fig. S3i) which has previously been described to contain a potential rRNA-encoded microRNA, rmiR-663a [32]. As rmiR-663a shares similar sequence to genomic-encoded miR-663a on chromosome 20 (and would have the same mature miRNA sequence), it has been challenging to isolate expression of the ribosomal-encoded transcript [33], and indeed, the majority of LIN28B eCLIP reads mapping to pri-miRNA map equally to both variants (Additional file 3: Fig. S3j). However, when we used sequence variants in the pri-miR sequence as well as the more variable flanking sequence to estimate their separate expression (Fig. 3f), we observed that reads unique to the rmiR outnumbered those unique to genomic homologs by more than 400-fold (Fig. 3g and Additional file 3: Fig. S3j-k), indicating that the observed signal is likely derived from 45S rather than other genomic homologs.

Finally, we considered binding to snoRNAs, a class of highly structured small RNAs that play essential roles in guiding modification of ribosomal RNAs. We found that enrichment for C/D-box snoRNAs, which canonically guide methylation of RNA, was highly correlated to enrichment for the 45S precursor (R² = 0.67, p = 1.6 × 10⁻⁵⁴) (Fig. 3h), providing further confirmation that these 45S-enriched RBPs are likely playing key roles in rRNA processing. Surprisingly, however, we observed that enrichment for H/ACA-box snoRNAs showed far lower correlation with enrichment for either C/D-box snoRNAs (R² = 0.42) or the 45S precursor (R² = 0.17) (Fig. 3i, Additional file 3: Fig. S3l). Thus, this data confirms the ability of eCLIP with input normalization to specifically isolate enrichment between abundant snoRNA classes, and suggests that (at least for the RBPs profiled to date here) we see stronger overlap between rRNA precursor and C/D-box versus H/ACA-box snoRNAs.
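For readers who want to reproduce this kind of comparison, the squared Pearson correlation between two per-dataset enrichment vectors can be computed as in the sketch below. The figure legend states the original values were calculated in MATLAB, so this Python version is a stand-in rather than the authors' code, and whether the enrichments were log-transformed first is not stated here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """Squared Pearson correlation, the quantity reported as R² above."""
    return pearson_r(xs, ys) ** 2
```

Here xs and ys would each hold one value per eCLIP dataset, for example fold-enrichment for C/D-box snoRNAs and for the 45S precursor.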

Repetitive elements define a significant fraction of the RBP target landscape

Repetitive elements constitute a large fraction of the non-coding genome [34], and elements annotated by RepBase constitute an average of 12.2% of reads observed in eCLIP input experiments (Additional file 3: Fig. S4a). In particular, as retrotransposable L1/LINE and Alu elements constitute 10.8% and 0.4% of intronic sequences, respectively (Additional file 3: Fig. S4b), they represent a significant fraction of the pool of nuclear transcribed pre-mRNAs available for RBP interactions. Although some RBPs have been shown to play roles in regulation of active retrotransposition [35], the majority of intronic elements have accumulated mutations or deletions and are no longer capable of active retrotransposition, leaving the question of their function relatively poorly understood. However, recent analyses of RBP targets identified by CLIP (including early releases of the eCLIP data considered here) have shown that both antisense Alu and antisense LINE elements contain cryptic splice sites that can lead to improper splicing and polyadenylation, suggesting that a major yet unappreciated role for many RBPs may be to suppress the emergence of inappropriate cryptic RNA processing sites introduced upon retrotransposition [36, 37].

Querying for RBPs with enriched eCLIP signal at retrotransposable and other repetitive elements, we surprisingly observed that only a small subset of elements (notably including L1 and Alu elements both in sense and antisense orientation) showed high RBP specificity, whereas most elements showed extremely highly correlated enrichments across RBPs (Fig. 4a, Additional file 3: Fig. S4c). This group of elements showed enrichment in a small subset of eCLIP experiments, notably including multiple members of the highly abundant HNRNP family (HNRNPA1, HNRNPU, HNRNPC, and HNRNPL), indicating that they may be coordinately regulated to prevent inappropriate RNA processing.

RBP association at retrotransposable and other repetitive elements. a (left) Heatmap indicates fold-enrichment in eCLIP versus paired input, averaged across two biological replicates. Shown are 30 RepBase elements which had average RPM > 100 in input experiments and at least one RBP with greater than 5-fold enrichment and 65 eCLIP experiments with greater than 5-fold enrichment for at least one element. (right) Color indicates correlation in fold-enrichment between elements across the 65 experiments. b, c Points indicate fold-enrichment for b Alu elements and c L1 LINE elements in individual biological replicates. Shown are all RBPs with average enrichment of at least 2 (for Alu elements) or 5 (for L1 elements). d Bars indicate L1 retrotransposition casTLE effect score (positive score indicates increased retrotransposition upon RBP knockout), with error bars indicating 95% minimum and maximum credible interval estimates (data from Liu et al. [38]). e (left) Each point indicates significance (from two-sided Kolmogorov-Smirnov test) between fold changes observed in RNA-seq of RBP knockdown for the set of genes with one or more RBP-bound L1 (or antisense L1) elements versus the set of genes containing one or more L1 (or antisense L1) elements but lacking RBP binding (defined as overlap with an IDR peak). RBPs were separated based on requiring 5-fold enrichment for L1 elements as in c. (right) Cumulative distribution plots for (top) MATR3 in HepG2 and (bottom) SUGP2 in HepG2. Significance shown is versus the set of genes containing one or more L1 (or antisense L1) elements but lacking RBP binding (red line). f Points indicate the fraction of antisense L1-assigned reads that map to canonical (RepBase) elements for six expression-altering antisense L1-enriched eCLIP datasets (from e), five other antisense-L1 enriched eCLIP datasets, and 11 paired input samples. Significance is from the two-sided non-parametric Kolmogorov-Smirnov test. See Additional file 3: Fig. S4g for the full distribution of read assignments
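The selection criteria in panel a of this legend (average input RPM above 100, and more than 5-fold enrichment in at least one RBP) can be written out as a small Python sketch. It assumes fold-enrichment is the ratio of reads per million in the immunoprecipitation to reads per million in the paired input; the function names are made up, and handling of zero input counts (for example with a pseudocount) is left out.

```python
def rpm(count, total_reads):
    """Reads per million: element read count scaled to one million total reads."""
    return count * 1_000_000 / total_reads


def fold_enrichment(ip_count, ip_total, input_count, input_total):
    """Assumed definition: IP RPM divided by paired-input RPM for one element."""
    return rpm(ip_count, ip_total) / rpm(input_count, input_total)


def keep_element(mean_input_rpm, enrichments, min_rpm=100, min_fold=5):
    """Apply the legend's filter to one element.

    enrichments: replicate-averaged fold-enrichment of the element in each
    eCLIP experiment.
    """
    return mean_input_rpm > min_rpm and any(e > min_fold for e in enrichments)
```

For example, an element with 600 reads in a 1,000,000-read IP library and 100 reads in a 1,000,000-read input library has a fold-enrichment of 6 under this definition.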

Analysis of Alu elements recapitulated a previously described interaction of HNRNPC with antisense Alu elements [36], but additionally revealed two RBPs with more than 5-fold enrichment: ILF3 (enriched for both sense and antisense Alu elements) and RNA Polymerase II component POLR2G (antisense) (Fig. 4b, Additional file 3: Fig. S4d). Both of these factors have previous links to RNA processing through Alu elements, as ILF3 association was suggested to repress RNA editing in Alu elements [39] and Alu elements have been shown to affect RNA Polymerase II elongation rates [40]. In total, 19 datasets showed more than 2-fold enrichment for either Alu or antisense Alu elements (Fig. 4b).

Considering L1/LINE elements, we observed enrichment with far more RBPs, with 26 datasets showing 5-fold enrichment (Fig. 4c). Interestingly, we observed generally distinct sets for sense versus antisense L1 enrichment, with only HNRNPC (in K562, but not HepG2) and ZC3H8 showing enrichment for both (Fig. 4c, Additional file 3: Fig. S4e). The RBPs identified here align well with those identified in an independent analysis of L1-associated RBPs which used a subset of these datasets along with independent iCLIP and other datasets, confirming robustness of this analysis across different approaches to quantify enrichment to L1 elements [37]. To query the role of L1 association, we first considered whether binding could specifically act to repress L1 retrotransposition itself. Of the 15 RBPs with more than 5-fold enrichment at sense L1 elements, SAFB (p = 0.002), PPIL4 (p = 0.06), and TRA2A (p = 0.05) were all identified as candidate suppressors of L1 retrotransposition in a recent genome-wide CRISPR screening assay [38], suggesting that this eCLIP enrichment approach identifies functional regulators of retrotransposition (Fig. 4d).

However, we observed that while enriched signal was centered at L1 sense and antisense elements, the signal often extended for multiple kilobases on either side (Additional file 3: Fig. S4f), indicating that despite the overlap with functional regulators of active LINEs, the majority of eCLIP signal is likely coming from inactive L1 elements contained within pre-mRNAs rather than independently transcribed active L1 elements in the cell lines studied here. Thus, we next assayed whether these RBPs showed evidence for silencing cryptic RNA processing sites created upon retrotransposition, as previously described [36, 37]. To do this, we hypothesized that knockdown of such RBPs would lead to inclusion of premature stop codons that signal nonsense-mediated decay, ultimately decreasing abundance of target mRNA transcripts. For MATR3, we indeed observed that genes containing one or more antisense L1 elements overlapped by peaks showed significantly decreased expression upon RBP knockdown (Fig. 4e), consistent with recent findings that MATR3 binding blocks both cryptic poly(A)-sites and splice sites within LINEs [37]. Interestingly, we observed a similar pattern for 3 other RBPs with antisense L1 enrichment, HNRNPM (which has been identified in complexes with MATR3 [41]), SUGP2, and EXOSC5 (Fig. 4e). These four RBPs also showed particular enrichment for reference L1 sequences as opposed to unique genomic mapping to more degenerate elements, suggesting that this specifically segregates expression-altering antisense L1-enriched RBPs (Fig. 4f, Additional file 3: Fig. S4g).

Meta-gene binding profiles reveal RBP functions

Next, we turned to the question of whether eCLIP peak distributions could reveal RBP roles in mRNA processing. To better separate RBP association patterns, we considered the distribution of peaks across a meta-gene generated by size-normalizing binding across all protein-coding transcripts relative to transcription start and stop sites and start and stop codons, and then averaging across all expressed genes (Fig. 5a). Considering binding relative to the coding region (CDS) and 5′ and 3′ untranslated regions of spliced mRNA, we observed an overall average of approximately one peak per gene across the entire mRNA (Additional file 3: Fig. S5a), with a variety of patterns of individual RBP association (Fig. 5b).

mRNA meta-gene profiles from eCLIP correspond to RBP regulatory roles. a (left) Each line indicates the presence (orange) of a reproducible DDX3X K562 eCLIP peak for 9162 mRNAs that are expressed (TPM > 1) in K562. Each gene was normalized to 13 5′UTR, 100 CDS, and 49 3′UTR bins (based on average lengths among expressed transcripts in K562 cells). (right) A meta-mRNA plot is generated by averaging across all expressed genes, with shaded region indicating 5th to 95th percentile observed in 100 bootstrap samplings. b Heatmap indicates peak coverage for 104 datasets (requiring at least 100 reproducible peaks and at least one meta-mRNA position with 5th percentile greater than 0.002). Color indicates the average occupancy, normalized by setting (blue) minimum value to zero and (yellow) maximum to one. Meta-mRNA profiles were hierarchically clustered and manually labeled. c Heatmap indicates pairwise correlation (Pearson’s R) between each pair of positions along the meta-mRNA in b. d Lines indicate average normalized peaks per bin for all RBPs in the indicated class. Shaded region indicates one standard deviation. e Heatmap indicates odds ratio of overlap between eCLIP datasets in (x-axis) indicated meta-mRNA cluster versus (y-axis) annotated RBP functions. See Additional file 3: Fig. S5d for significance
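The size normalization described in this legend (13 5′UTR, 100 CDS, and 49 3′UTR bins per gene, then averaging across genes) can be sketched in Python as follows. The bin counts come from the legend; everything else, including the region labels, 0-based offsets, and the function names, is an illustrative assumption rather than the authors' implementation.

```python
# Bin counts per region, as given in the figure legend.
BINS = {"5UTR": 13, "CDS": 100, "3UTR": 49}
ORDER = ["5UTR", "CDS", "3UTR"]
N_BINS = sum(BINS.values())  # 162 bins across the meta-mRNA


def meta_bin(region, offset, region_length):
    """Map a 0-based offset within one region of a transcript to a meta-mRNA bin.

    Bins are numbered 0..161 from the 5' end: 5'UTR first, then CDS, then 3'UTR.
    """
    if not 0 <= offset < region_length:
        raise ValueError("offset must lie inside the region")
    local = offset * BINS[region] // region_length  # bin within the region
    start = sum(BINS[r] for r in ORDER[:ORDER.index(region)])
    return start + local


def meta_profile(genes):
    """Average peak presence per bin across genes.

    genes: non-empty list with one set of peak-covered bin indices per gene.
    """
    profile = [0] * N_BINS
    for covered in genes:
        for b in covered:
            profile[b] += 1
    return [count / len(genes) for count in profile]
```

Because each region is scaled independently, a start codon always falls at the boundary between bins 12 and 13 regardless of 5′UTR length, which is what makes the delineation at start and stop codons visible in the averaged profile.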

At a global level, the most striking observation was clear delineation points at the start and stop codon positions (Fig. 5b, c), likely reflecting the fact that translation initiation is unique to the 5′UTR whereas the 3′UTR is the only region where bound RBPs will not be removed by translating ribosomes. However, more subtle clustering revealed distinct subgroups within the broader 5′UTR-, CDS-, and 3′UTR-enriched classes (Fig. 5b, d). For example, we observed two distinct classes of 5′UTR binding that appear to correlate with distinct RBP functions. The first (5UTR.TSS) showed greater enrichment closer to the transcription start site and included nuclear 5′ end processing factors such as cap-binding protein NCBP2 (Fig. 5b, d). In addition to 5′ end enrichment, this class also contained RBPs with substantial 3′UTR signal, such as 3′ end processing factor CSTF2T (which also showed significant signal extending past annotated transcription termination sites (Additional file 3: Fig. S5b), consistent with previous CLIP studies [42]). A second set (5UTR.SC) showed biased peak presence closer to the start codon and included both canonical translational initiation factors (such as EIF3G, EIF3D, and EIF3H) as well as RBPs previously shown to play translational regulatory roles (including DDX3X, SRSF1, and FMR1) (Fig. 5b).

Similarly, we also observed distinctions within CDS binding, with either uniform (CDS.UN) density or biased towards the 5′ (CDS.5P) or 3′ (CDS.3P) end. We observed that 13 out of 15 spliceosomal RBPs showed CDS enrichment (10 of which fell into the CDS.UN category), likely reflecting the general lack of introns in 5′UTRs (due to their small size) and 3′UTRs (as they would create targets for nonsense-mediated decay) (Fig. 5b, d).

Finally, we observed multiple modalities of 3′UTR peak distribution. The 3UTR.Un class showed relatively uniform density and contained many well-characterized 3′UTR binding proteins, including NMD factor UPF1 and stress granule factor TIA1. In contrast, RBPs in the 3UTR.5P class had peak density enriched closer to (and continuing 5′ of) the stop codon, including the well-studied IGF2BP family of RBPs (Additional file 3: Fig. S5c). Finally, we observed a number of RBPs with increased enrichment towards the transcription termination site (3UTR.TTS).

Next, we considered whether these patterns corresponded to different RNA processing functions. Although the number of RBPs is limited for some functions, we observed that many clusters had significant overlaps with distinct RBP functional annotations (Fig. 5e, Additional file 3: Fig. S5d). In particular, RBPs associated with nuclear RNA processing steps showed little change (median 1.2-fold decrease in peak density around the stop codon), whereas RBPs with cytoplasmic roles showed a significant 1.6-fold increase (Additional file 3: Fig. S5e), consistent with a stronger role for the stop codon as a delineation point for cytoplasmic RBP association. In all, our results suggest that the pattern of relative enrichment in different gene regions is predictive of the regulatory role that the RBPs play.

Splicing regulatory roles revealed by intronic meta-gene profiles

Next, we performed regional analysis to query binding to exons (specifically 50 nt bordering the splice sites) and 500 nt of proximal introns flanking both the 3′ and 5′ splice sites. As an example, we observed that out of 89,265 introns present in highly expressed transcripts (TPM > 1), 2699 had a significant IDR peak from eCLIP of U2AF2 in K562 cells (Additional file 3: Fig. S6a). These peaks had a stereotypical positioning at the 3′ splice site (extending into the downstream exon due to the use of full reads rather than just read 5′ ends for analysis), matching the well-characterized role of U2AF2 in 3′ splice site recognition (Fig. 6a). These matrices were then summed across all introns to calculate a meta-intron plot representing the average peak coverage at each position, with confidence intervals estimated by bootstrapping (Fig. 6b).

Meta-exon plots reveal intronic regulatory roles. a Each line indicates the presence (in blue) of a reproducible U2AF2 K562 eCLIP peak for 2699 introns that contain at least one peak within the displayed region (500 nt of proximal intron and 50 nt of exon flanking the 5′ and 3′ splice sites). See Additional file 3: Fig. S6a for all 89,265 introns. b Meta-exon plot for data shown in a, with line indicating average and shaded region indicating 5th to 95th percent confidence interval (derived by 100 bootstrap samplings). c (left) Heatmap indicates average peak coverage across all introns for 130 RBPs with at least 100 peaks and 5th percentile confidence interval at least 0.0005 (for heatmap visualization, the maximum value for each dataset was set to one to calculate normalized coverage). (right) Lines show individual RBP examples for five clusters identified based on similar meta-exon profiles. Y-axis indicates fraction of introns with peak
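The bootstrap interval mentioned in this legend can be illustrated with a small sketch: introns are resampled with replacement 100 times, the mean coverage at each position is recomputed for every resample, and the 5th and 95th percentiles of those means form the shaded band. The function name, the fixed random seed, and the nearest-rank percentile rule are assumptions for illustration only.

```python
import random


def bootstrap_interval(rows, n_boot=100, lower=5, upper=95, seed=0):
    """rows: equal-length 0/1 lists, one per intron (1 = peak present).

    Returns (mean, low, high) lists with one value per position.
    """
    rng = random.Random(seed)
    n_rows = len(rows)
    n_pos = len(rows[0])
    boot_means = [[] for _ in range(n_pos)]
    for _ in range(n_boot):
        # Resample introns with replacement.
        sample = [rows[rng.randrange(n_rows)] for _ in range(n_rows)]
        for p in range(n_pos):
            boot_means[p].append(sum(r[p] for r in sample) / n_rows)

    def percentile(values, q):
        # Nearest-rank style percentile (an assumed convention).
        ordered = sorted(values)
        idx = round(q / 100 * (len(ordered) - 1))
        return ordered[idx]

    mean = [sum(r[p] for r in rows) / n_rows for p in range(n_pos)]
    low = [percentile(m, lower) for m in boot_means]
    high = [percentile(m, upper) for m in boot_means]
    return mean, low, high
```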

Performing this analysis for 130 RBPs with sufficient peaks (see the “Methods” section), we observed that the profiles recapitulated many known binding patterns, including U2AF1 and U2AF2 at the 3′ splice site, SF3B4 and SF3A3 at the branch point, PRPF8 at the 5′ splice site, and RBFOX2 and PTBP1 at proximal introns (Fig. 6c). Clustering analysis indicated a number of distinct RBP association patterns. In addition to a large group of exclusively exonic datasets, we observed clusters for the canonical splicing features (5′ splice site, 3′ splice site, and branch point), and two additional clusters: one where RBPs showed enrichment for peaks at proximal introns flanking both the 5′ and 3′ splice sites, and one with dominant enrichment in the 5′ splice site proximal intron only (Fig. 6c, right). We also observed a wide range of peak frequencies: canonical splicing machinery components such as U2AF2, SF3B4, and PRPF8 had significantly enriched peaks at many introns (with a position maximum of 3.6%, 7.8%, and 5.3% of queried abundant introns, respectively, in K562), whereas factors such as PTBP1 and RBFOX2 were less commonly enriched at specific positions (0.1% and 0.5%, respectively) (Fig. 6c).

Insights into spliceosomal association and core splicing regulation

The breadth of RBPs profiled provided a unique opportunity to explore their interactions with the spliceosome and their impacts on splicing regulation. In addition to contacting the intron, many spliceosomal and splicing regulatory proteins also interact with the spliceosomal small nuclear RNAs (snRNAs). The overall snRNA family includes five specific RNA families (U1, U2, U4, U5, and U6, which also have variant isoforms that differ slightly in sequence) that play essential roles in canonical GT-AG RNA splicing, as well as four (U11, U12, U4atac, U6atac) specific to the minor AT-AC spliceosome, each of which plays specific mechanistic roles during splicing [43]. Thus, RBP association with a particular snRNA can help to map its function to a particular step in splicing. Quantitating snRNA enrichment using the family-aware mapping described above, we recapitulated many known associations between RBPs and the spliceosome, including interactions of SF3B4 with U2 snRNA (47- and 32-fold enriched in HepG2 and K562, respectively) [44] and GEMIN5 with U1 (11.2-fold enriched in K562) [45] (Fig. 7a). In some cases, these dominated overall RNA recovery; for example, an average of 41% of reads from SF3A3 eCLIP and 17% and 20% of SF3B4 eCLIP reads in HepG2 and K562, respectively, mapped to the U2 snRNA, whereas U2 reads averaged only 0.7% in input samples.

Insights from eCLIP of spliceosome-associated RBPs. a Heatmap indicates fold-enrichment for individual snRNAs within eCLIP datasets. Shown are all RBPs with greater than 5-fold enrichment for at least one snRNA. b Browser shows read density for eCLIP of AQR (K562), SF3B4 (K562), and SF3A3 (HepG2) for the NARF exon 11 3′ splice site region. Dotted line indicates position of enriched reverse transcription termination at crosslink sites. c (left) Pie chart shows all (n = 2475) introns with > 20 reads in the − 50 to − 15 (branch point) region in AQR K562 eCLIP. Blue indicates putative branch points (the subset with more than 50% of read 5′ ends at one position). (right) Motif information content for 11-mers centered on the putative branch points. Image generated with seqLogo package in R. d Lines indicate mean normalized eCLIP enrichment in IP versus input for SF3B4 and SF3A3 at (red/purple/green) alternative 3′ splice site extensions in RBP knockdown or (black) alternative 3′ splice site events in control HepG2 or K562 cells. The region shown extends 50 nt into exons and 300 nt into introns

Interestingly, while many factors showed similar association between analogous snRNAs in the major and minor spliceosomes (such as PRPF8 and SMNDC1 with U6 and U6atac, and SF3B1 and SF3B4 with U2 and U12), some RBPs were specifically associated with either the major (SF3A3, which was 29.5-fold enriched for U2 but 1.2-fold depleted for U12 in HepG2, and QKI, 118.6-fold enriched for U6 but 2.4-fold depleted for U6atac) or minor spliceosome (HNRNPM, which was 8.1-fold enriched in K562 and 7.6-fold in HepG2 for U11 but 5.3- and 4.2-fold depleted for U1) (Fig. 7a, Additional file 3: Fig. S7a-d). Although preliminary analysis did not show altered splicing upon HNRNPM knockdown specifically at U11/U12 introns, previous studies have suggested that HNRNPM may contribute to minor intron splicing through interactions with FUS [46].

In the first catalytic step of intron splicing, a transesterification step joins the 5′ splice site with the branch point to create an intron lariat structure (Additional file 3: Fig. S7e). This is an essential step in splicing and helps to define 3′ splice site choice, but identification of branch points has remained challenging due to variable positioning (ranging from 20 to 40 nucleotides upstream of the 3′ splice site) and a degenerate sequence motif [47]. Recent efforts to use either specialized library preparation protocols or focused analysis of deep sequencing to identify branch points via lariat junction-spanning reads have enabled the identification of tens of thousands of branch points, but the regulation of branch point recognition and its role in splicing regulation remain poorly understood. Considering the RBPs profiled here, we observe multiple RBPs showing specific enrichment at branch points, including both known regulators (such as SF3 complex components SF3B4 and SF3A3) and novel factors (including RBM5). Indeed, analysis of these datasets coupled with focused iCLIP profiling of purified spliceosomes recently indicated distinct patterns of RBP association at branch points and 5′ and 3′ splice sites, which yielded unique insights into how branch point strength defines RBP association and spliceosomal assembly dynamics [48].

However, we were particularly intrigued by the observation of a striking pattern of both 5′ splice site and branch point enrichment for the RBP AQR (Fig. 7b). Knockdown of AQR yielded over 30,000 altered alternative splicing events, by far the most of any knockdown performed by the ENCODE consortium to date (including canonical splicing components such as U2AF1/2 and SF3B4) [20], consistent with previous studies that indicate a role for AQR in pre-mRNA splicing [49]. However, closer inspection revealed that unlike the canonical peak shape in the branch point region observed for SF3B4 and SF3A3, the 5′ end of AQR eCLIP reads often piled up at specific positions (Fig. 7b). Using simple criteria to identify candidate branch points as positions with more than 50% of read 5′ ends within the overall − 15 to − 50 region, out of 2475 introns with at least 20 reads mapping to the entire branch point region, we identified 1018 candidate branch points in K562 (Fig. 7c). Motif analysis of these positions yielded the canonical branch point motif signal (with 92% containing an A at the base prior to read starts) (Fig. 7c). Thus, these results suggest that AQR eCLIP signal is derived from introns after lariat formation, where reverse transcription is incapable of reading through the branch point adenosine (Additional file 3: Fig. S7e), and that deeper sequencing of AQR eCLIP (potentially with improved methodology to enrich reads at the 3′ rather than 5′ splice site) will provide direct identification of branch points in human.
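The "simple criteria" for candidate branch points can be illustrated with a minimal sketch: keep introns with at least 20 read 5′ ends in the − 50 to − 15 window, and call a candidate when a single position holds more than 50% of them. The function name, the coordinate convention (positions relative to the 3′ splice site, negative meaning upstream), and the input format are assumptions for illustration, not the authors' code.

```python
from collections import Counter


def candidate_branch_point(read_starts, min_reads=20, min_fraction=0.5,
                           window=(-50, -15)):
    """Return the pile-up position of read 5' ends, or None.

    read_starts: read 5' end positions relative to the 3' splice site
    (assumed convention: negative values lie upstream, in the intron).
    Note that the text places the branch point adenosine at the base
    prior to the read start, not at the returned position itself.
    """
    in_window = [p for p in read_starts if window[0] <= p <= window[1]]
    if len(in_window) < min_reads:
        return None  # too few reads in the branch point region
    position, count = Counter(in_window).most_common(1)[0]
    if count / len(in_window) > min_fraction:
        return position
    return None
```

An intron with 15 of 20 windowed read starts at position − 25 would return − 25, whereas an even split between two positions returns no candidate because neither exceeds 50%.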

Next, we considered eCLIP signal at alternatively spliced cassette exons. Considering “native” cassette exons in wild-type K562 and HepG2 cells, we observed that branch point factors SF3B4 and SF3A3 showed decreased signal at alternative exons relative to constitutive exons, consistent with U2AF2 and other spliceosomal components and potentially reflecting overall lower spliceosomal occupancy (Additional file 3: Fig. S7f). However, at alternative 3′ splice sites with the proximal site increased upon knockdown of branch point components SF3B4 and SF3A3, we observed that average eCLIP enrichment for SF3B4 and SF3A3 was decreased at the typical branch point location but increased towards the 3′ splice site (compared to eCLIP signal at native A3SS events which utilize both distal (upstream) and proximal 3′ splice sites in control shRNA datasets) (Fig. 7d, Additional file 3: Fig. S7g). Consistent with previous mini-gene studies showing that 3′ splice site scanning and recognition originates from the branch point and can be blocked if the branch point is moved too close to the 3′ splice site AG [50], these results provide further evidence that use of branch point complex association to restrict recognition by the 3′ splice site machinery may be a common regulatory mechanism [51] (Additional file 3: Fig. S7h).

Clustering of RBP binding identifies known and novel co-associating factors

Large-scale RBP target profiling using a consistent methodology enables cross-comparison between datasets. Considering simple overlap between peak sets for all profiled RBPs, we observed significant overlap for many pairs of RBPs, which often formed co-associating groups (Fig. 8a, left). These groups of RBPs with highly overlapping peaks generally segregated into four major categories. First, we observe high similarity between the same RBP profiled in HepG2 and K562 (including QKI, PTBP1, and LIN28B) (Fig. 8a, green). Indeed, we observe an average peak overlap of 30.0% between the same RBP in K562 and HepG2 versus 4.9% for random RBP pairings (6.1-fold increased), confirming the broad reproducibility of binding across cell types (Fig. 8b). Second, we observe many cases of high overlap between eCLIP for homologous RBPs within the same family, including TIA1 and TIAL1, IGF2BP1/2/3, and fragile X-related FMRP, FXR1, and FXR2 (Fig. 8a, yellow). Third, we observe clusters containing known co-regulating RBPs, including recognition and processing machinery for the 3′ splice site (U2AF1 and U2AF2), branch point (SF3B4 and SF3A3), and 5′ splice site (EFTUD2, RBM22, PRPF8, and others), as well as a group of RBPs that play general roles in binding the 5′UTR of nearly all genes to regulate translation (DDX3X, EIF3G, and NCBP2) (Fig. 8a, red).

RBP co-association predicts known and novel RNP complexes. a Heatmap indicates the pairwise fraction of eCLIP peaks overlapping between datasets. Callout examples are shown for known complexes, RBP families, same RBP profiled across cell types, and putative novel complexes. b GSEA analysis comparing the fraction overlap observed profiling the same RBP in both K562 and HepG2, compared against random pairings of RBPs (with one profiled in K562 and the other in HepG2). c As in b, but using the set of RBPs with interactions reported in the BioPlex IP-mass spectrometry database [52]
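A pairwise overlap fraction of the kind shown in the heatmap could be computed as in the sketch below, which reports the fraction of peaks in one dataset that overlap any peak in another on the same chromosome and strand. The tuple layout `(chrom, strand, start, end)` with half-open coordinates and the one-directional definition of overlap are assumptions; the exact definition used for the figure is not given in this section.

```python
def overlaps(a, b):
    """True if two (chrom, strand, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] == b[1] and a[2] < b[3] and b[2] < a[3]


def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a that overlap at least one peak in peaks_b."""
    if not peaks_a:
        return 0.0
    hits = sum(1 for a in peaks_a if any(overlaps(a, b) for b in peaks_b))
    return hits / len(peaks_a)


# The fold increase quoted in the text follows directly from the two averages:
same_rbp, random_pairs = 30.0, 4.9
fold = round(same_rbp / random_pairs, 1)  # 6.1
```

The all-against-all comparison here is quadratic in the number of peaks; sorting the intervals or using an interval tree would be the usual choice for full-size datasets.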

Interestingly, we observe unexpected clusters that suggested potential novel complexes or co-interacting partners (Fig. 8a, blue). Some clusters likely reflect overlapping targeting to specific types of RNAs: for example, one cluster contains three RBPs we described above to show specific enrichment at antisense L1/LINE elements (HNRNPM, BCCIP, and EXOSC5). The patterns of other clusters are often less clear, with some containing both well-studied RBPs as well as those with no known RNA processing roles (for example, high overlap between HNRNPL and AGGF1 across both cell types). To consider whether these likely reflected true instances of RBP co-interaction, we asked whether RBPs that had higher peak overlap were more likely to have interactions from large-scale IP-mass spectrometry experiments. Using the BioPlex 2.0 database of 56,000 interactions [52], we observed that RBPs with IP-MS interactions showed an average 2.3-fold increase in eCLIP peak overlap (11.4% versus 4.9% for RBPs without interactions), suggesting that there is a general correlation between peak overlap and RBP interactions (Fig. 8c).

Finally, we performed co-immunoprecipitation (co-IP) studies focusing on one predicted novel interaction group involving HNRNPL and AGGF1. We observed that AGGF1 co-immunoprecipitated HNRNPL, unlike unrelated factors RBFOX2 or FMR1 (Additional file 3: Fig. S8a). We note that this co-IP was observed using less stringent co-IP wash buffers, but was not observed using the high-salt wash buffers present in eCLIP (Additional file 3: Fig. S8b), indicating that the overlap in eCLIP binding likely reflects independent crosslinking events to the distinct RBPs. Thus, these results indicate that the eCLIP data resource reveals many novel RBP interactions that are likely to reflect previously unidentified regulatory complexes.


Conclusions

Gene redundancy in Arabidopsis has previously been shown to limit the number of mutants detectable by phenotype [33]. The completed genome sequence shows that a high degree of redundancy might indeed obscure the quest for many phenotypes. Accordingly, we suggest that there probably exists a high degree of functional redundancy among Arabidopsis RING-domain proteins. This would also correlate with the fact that surprisingly few genes in the complete set are characterized as mutants. To our knowledge, this is the case for only two of them, COP1 and PRT1 [34,35]. Notably, for both these proteins, a functional requirement for the RING domain has been demonstrated, and both are unique with respect to their RING domains.

In this study, we present an ordered set of manually curated RING domains of Arabidopsis. In summary, our set includes all bona fide RING domains, as well as common RING-variant domains. Notably, additional Arabidopsis proteins might have potential to form variant RING-finger domains, as has been suggested, for instance, for the HOS1 protein [36]. However, their primary sequences do not support this notion unambiguously and we chose not to include any RING-domain variants in our analysis for which no structural experimental evidence is yet available. Clearly, our findings show that predictions of cysteine-rich domains have to be met with skepticism. On a proteomics level, they can be misleading in drawing general conclusions, as is amply demonstrated by the overestimation of the abundance of the PHD domain owing to their overlapping classification with RING domains. Additional structural data are needed and have to be taken into account in computational analyses to resolve these issues. Our curated set of RING domains in Arabidopsis will serve as a vital starting point for further genome analysis in this field.


Discussion

In this study, we have constructed a mapping of the protein structure space for the first time by considering the overall surface shape of both single-chain and complex proteins. The shape space visualized in this work would give an impression that the protein shape space is continuous, but this is not specific to the protein surface shape representation. Indeed, earlier works that mapped protein structures considering main-chain conformations also show continuous structure distributions [17–20] and moreover, there exists active discussion on the continuity [29] or the many-to-many similarity relationship [30] of the protein structure space. Analogous to well-established protein main-chain structure classifications, such as SCOP [5] and CATH [4], this work will lead to a new classification for protein shapes at a medium to low resolution, which are being accumulated at an increasing pace by cryo-electron tomography and cryo-EM. By establishing the classification from the distribution of the protein shapes, for example, we will be able to take a census of protein shapes, that is, to count the number of specific protein shapes in organisms and compare across different organisms [31].

The observed variety of protein shapes in this work will also be useful for designing protein representations used in a cell-scale physical simulation of biomolecules [32]. Rather than using an overly simplified molecular representation, as is usual for such a simulation, one could diversify protein shapes in the simulation box by sampling structures from different locations in the shape space (Fig 1 and Fig 6).

Last but not least, this work has strong implications for protein design. Our study indicates that a protein shape can be realized with utterly different backbone conformations that even belong to different fold classes as shown in Table 1 and S1 Fig. Also, the shape mappings of single chains and complexes revealed regions in the shape space that are not occupied by either of them, or are occupied only by complex shapes (Fig 7). Shapes that correspond to the former may be difficult to construct with proteins, and other materials such as DNAs or polysaccharides may be required, while those in the latter region may be better designed using complexes rather than a single-chain protein.

In the coming age of medium- to low-resolution biomolecular structures, protein design needs a novel way of viewing biomolecular shapes. We expect that this work makes a unique and significant contribution by providing a foundation of understanding the protein shape universe.



All databases and documents in the UniProt FTP directory and web sites are distributed under the Creative Commons Attribution-NoDerivs License.

Citation

If you want to cite UniProtKB in a publication, please use one of the references listed here.

Table of contents

1. What is the UniProt Knowledgebase?
   1.1 The Swiss-Prot Protein Knowledgebase
   1.2 The computer-annotated supplement TrEMBL
2. Conventions used in the database
   2.1 General structure of the database
   2.2 Status
   2.3 Structure of a sequence entry
   2.4 Evidence attributions
3. The different line types
   3.1 The ID line
   3.2 The AC line
   3.3 The DT line
   3.4 The DE line
   3.5 The GN line
   3.6 The OS line
   3.7 The OG line
   3.8 The OC line
   3.9 The OX line
   3.10 The OH line
   3.11 The reference (RN, RP, RC, RX, RG, RA, RT, RL) lines
   3.12 The CC line
   3.13 The DR line
   3.14 Cross-references to the nucleotide sequence database
   3.15 The PE line
   3.16 The KW line
   3.17 The FT line
   3.18 The SQ line
   3.19 The sequence data line
   3.20 The // line
Appendix A: Amino-acid codes
Appendix B: Format differences between the Swiss-Prot and EMBL databases
   B.1 Generalities
   B.2 Differences in line types present in both databases
   B.3 Line types defined by Swiss-Prot but currently not used by EMBL
   B.4 Line types defined by EMBL but currently not used by Swiss-Prot
Appendix C: Documentation files
Appendix D: UniProt Knowledgebase
Appendix E: Relationships between Swiss-Prot and some biomolecular databases

Until 2002, the EBI/SIB Swiss-Prot + TrEMBL databases and the PIR Protein Sequence Database (PIR-PSD) coexisted as protein databases with differing protein sequence coverage and annotation priorities. In 2002, EBI, SIB, and PIR (at the Georgetown University Medical Center and National Biomedical Research Foundation) joined forces as the UniProt consortium. The primary mission of the consortium is to support biological research by maintaining a high quality database that serves as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community.

The UniProt Knowledgebase (UniProtKB) provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation.

The UniProt Knowledgebase consists of two sections: Swiss-Prot - a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis, and TrEMBL - a section with computationally analyzed records that await full manual annotation.

Swiss-Prot is an annotated protein sequence database. It was established in 1986 and maintained collaboratively, since 1987, by the group of Amos Bairoch first at the Department of Medical Biochemistry of the University of Geneva and now at the SIB Swiss Institute of Bioinformatics and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The Swiss-Prot Protein Knowledgebase consists of sequence entries. Sequence entries are composed of different line types, each with their own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL Nucleotide Sequence Database.

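Swiss-Prot distinguishes itself from other protein sequence databases by four distinct criteria: annotation, minimal redundancy, integration with other databases, and documentation. Each is described in turn below.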

In Swiss-Prot, as in many sequence databases, two classes of data can be distinguished: the core data and the annotation.

For each sequence entry the core data consists of:

  • The sequence data
  • The citation information (bibliographical references)
  • The taxonomic data (description of the biological source of the protein).

The annotation consists of the description of the following items:

  • Function(s) of the protein
  • Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor
  • Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes, SH2 and SH3 domains and kringle
  • Secondary structure, e.g. alpha helix, beta sheet
  • Quaternary structure, e.g. homodimer, heterotrimer, etc.
  • Similarities to other proteins
  • Disease(s) associated with any number of deficiencies in the protein
  • Sequence conflicts, variants, etc.

We try to include as much annotation information as possible in Swiss-Prot. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins.

We believe that having systematic recourse both to publications other than those reporting the core data and to subject referees represents a unique and beneficial feature of Swiss-Prot.

In Swiss-Prot, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by 'topics'; this approach permits the easy retrieval of specific categories of data from the database.

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. Swiss-Prot is currently cross-referenced to more than 100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot. This extensive network of cross-references allows Swiss-Prot to play a major role as a focal point of biomolecular database interconnectivity.

Swiss-Prot is distributed with a large number of index files and specialized documentation files. Some of these files have been available for a long time (this user manual, the release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The 'Documentation files' section contains an up-to-date descriptive list of all distributed document files.

TrEMBL is the computer-annotated section of the UniProt Knowledgebase. It contains translations of all coding regions in the DDBJ/EMBL/GenBank nucleotide databases, and protein sequences extracted from the literature or submitted to UniProtKB, which are not yet integrated into Swiss-Prot. TrEMBL allows these sequences to be made publicly available quickly without diluting the high quality annotation found in Swiss-Prot.

The information in a TrEMBL entry is initially derived directly from the underlying DDBJ/EMBL/GenBank nucleotide entry and the quality of data is directly dependent on the information provided by the submitter of the nucleotide entry. This information may be enhanced later by automatic annotation procedures (see below) but if not, it remains as provided by the submitter until the entry is manually annotated and added to Swiss-Prot.

After creation of a TrEMBL entry, a number of steps are taken to improve the data quality for users:

Records waiting in TrEMBL for full manual annotation are enhanced by automatic annotation. Information is transferred from well-characterised entries in Swiss-Prot to unannotated entries in TrEMBL which belong to groups defined by InterPro, a database of protein families, domains and functional sites. This process brings the standard of annotation in TrEMBL closer to that found in Swiss-Prot through the addition of accurate, high-quality information to TrEMBL entries, thus improving the quality of data available to the user.

Sequences from the same organism which are full-length and which have 100% identity are merged into a single entry to reduce redundancy.

The following sections describe the general conventions used in the knowledgebase to achieve uniformity of presentation. Experienced users of the EMBL Database can skip these sections and directly refer to this document, which lists the minor differences in format between the two data collections.

The UniProt Knowledgebase is composed of sequence entries. Each entry corresponds to a single contiguous sequence as contributed to the bank or reported in the literature. In some cases, entries have been assembled from several papers that report overlapping sequence regions. Conversely, a single paper can provide data for several entries, e.g. when related sequences from different organisms are reported.

References to positions within a sequence are made using sequential numbering, beginning with 1 at the N-terminal end of the sequence.

The sequence data correspond to the precursor form of a protein before posttranslational modifications and processing.

To distinguish the fully annotated entries in the Swiss-Prot section of the UniProt Knowledgebase from the computer-annotated entries in the TrEMBL section, the 'status' of each entry is indicated in the first (ID) line of each entry. The two defined classes are:

Reviewed: Entries that have been manually reviewed and annotated by UniProtKB curators (Swiss-Prot section of the UniProt Knowledgebase).
Unreviewed: Computer-annotated entries that have not been reviewed by UniProtKB curators (TrEMBL section of the UniProt Knowledgebase).

The entries in the UniProt Knowledgebase are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used.

Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry. A sample sequence entry is shown below.

Entries from the TrEMBL section follow the same format. For format differences see the description of the distinct line types.

Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown in the table below.

Line code | Content | Occurrence in an entry
ID | Identification | Once; starts the entry
AC | Accession number(s) | Once or more
DT | Date | Three times
DE | Description | Once or more
GN | Gene name(s) | Optional
OS | Organism species | Once or more
OG | Organelle | Optional
OC | Organism classification | Once or more
OX | Taxonomy cross-reference | Once
OH | Organism host | Optional
RN | Reference number | Once or more
RP | Reference position | Once or more
RC | Reference comment(s) | Optional
RX | Reference cross-reference(s) | Optional
RG | Reference group | Once or more (Optional if RA line)
RA | Reference authors | Once or more (Optional if RG line)
RT | Reference title | Optional
RL | Reference location | Once or more
CC | Comments or notes | Optional
DR | Database cross-references | Optional
PE | Protein existence | Once
KW | Keywords | Optional
FT | Feature table data | Once or more in Swiss-Prot, optional in TrEMBL
SQ | Sequence header | Once
(blanks) | Sequence data | Once or more
// | Termination line | Once; ends the entry

As shown in the above table, some line types are found in all entries, others are optional. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
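Because every entry begins with an ID line and ends with a '//' terminator line, a flat file can be split into entries with a simple loop. The following is a minimal sketch in Python, assuming the file content is available as an iterable of text lines. The function name `iter_entries` is illustrative and is not part of any UniProt tool.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each entry as a list of lines, without the '//' terminator."""
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # Terminator line: the current entry is complete.
            if not entry or not entry[0].startswith("ID"):
                raise ValueError("entry does not begin with an ID line")
            yield entry
            entry = []
        elif line:
            entry.append(line)
    if entry:
        raise ValueError("last entry is not terminated by '//'")
```

Raising an error on a missing ID line or terminator, rather than skipping the entry silently, makes truncated downloads easier to notice.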

A detailed description of each line type is given in the next section of this document. It must be noted that, with the exception of GN, all line types exist in the EMBL Database. A description of the format differences between the UniProt Knowledgebase and EMBL databases is given in this document.

The two-character line-type code that begins each line is always followed by three blanks, so that the actual information begins with the sixth character. In general, information does not extend beyond character position 75; there are, however, a few exceptions where lines may be longer (e.g. OH lines and CC lines that contain the 'WEB RESOURCE' topic; see section 3.12).
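The fixed layout (two-character code, three blanks, data from the sixth character onwards) makes it straightforward to separate the line code from its content. A minimal Python sketch follows; the helper name `split_line` is an assumption made for illustration.

```python
from typing import Tuple


def split_line(line: str) -> Tuple[str, str]:
    """Return (line code, data) for one flat-file line."""
    code = line[:2]
    if code == "//":
        # The terminator line carries no data.
        return code, ""
    if line[2:5] != "   ":
        raise ValueError(f"expected three blanks after line code {code!r}")
    # The actual information begins with the sixth character (index 5).
    return code, line[5:]
```

For sequence data lines the line code itself is blank, so the function returns a code of two spaces together with the sequence text.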

The evidence for annotations is available in UniProtKB entries. An individual evidence description consists of a mandatory evidence type, represented by a code from the Evidence Codes Ontology (ECO), and, where applicable, the source of the data. The source is usually another database record, represented by the database name and record identifier; for publications that are not in PubMed, the corresponding UniProtKB reference number is indicated instead.

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence, but it is not a stable identifier as is the accession number (see 3.2).

The Swiss-Prot entry name consists of up to 11 uppercase alphanumeric characters. Swiss-Prot uses a general purpose naming convention that can be symbolized as X_Y, where:

  • X is a mnemonic code of at most 5 alphanumeric characters representing the protein name. Examples: B2MG for Beta-2-microglobulin, HBA for Hemoglobin alpha chain, INS for Insulin and CAD17 for Cadherin-17
  • The '_' sign serves as a separator
  • Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein. This code is generally made of the first three letters of the genus and the first two letters of the species. Examples: PSEPU for Pseudomonas putida and NAJNI for Naja nivea.

However, for species most commonly encountered in the database, self-explanatory codes are used. There are 16 of those codes: BOVIN for Bovine, CHICK for Chicken, ECOLI for Escherichia coli, HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE for Mouse, PEA for Garden pea (Pisum sativum), PIG for Pig, RABIT for Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine max), TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat (Triticum aestivum), and YEAST for Baker's yeast (Saccharomyces cerevisiae).

As it was not possible to apply the above rules to viruses, they were given arbitrary, but generally easy-to-remember identification codes.

Examples of complete protein sequence entry names are: RL1_ECOLI for ribosomal protein L1 from Escherichia coli, AFTIN_HUMAN for Aftiphilin from human and SODC_DROME for Superoxide dismutase [Cu-Zn] from Drosophila melanogaster.

The names of all the presently-defined species identification codes are listed in the document file speclist.txt.

The TrEMBL entry name consists of up to 16 uppercase alphanumeric characters. TrEMBL uses a general purpose naming convention similar to that of Swiss-Prot, where:

  • X is identical to the accession number of the entry
  • The '_' sign serves as a separator
  • Y is a mnemonic species identification code.

As it is not possible in a reasonable timeframe to manually assign organism codes to all species represented in TrEMBL, "virtual" codes have been defined that regroup organisms at a certain taxonomic level. Such codes are prefixed by the number "9" and generally correspond to a "pool" of organisms, which can be as wide as a kingdom. Here are some examples of such codes:

These types of "virtual" codes are also listed in the document file speclist.txt.

Examples of complete TrEMBL entry names are O95417_HUMAN, Q9VVG0_DROME, P71025_BACSU or Q9SR52_ARATH.
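The two naming conventions can be told apart mechanically, because a Swiss-Prot protein mnemonic has at most 5 characters whereas a TrEMBL X part is an accession number of 6 or 10 characters. The sketch below illustrates this in Python. The accession pattern is written from the accession number format described in the AC line section, and the name `classify_entry_name` is an assumption made for illustration. The entry name only reflects the naming convention; the authoritative status is the one given on the ID line.

```python
import re

# Accession number pattern (6 or 10 characters), written from the documented formats.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Protein mnemonic or species code: at most 5 uppercase alphanumeric characters.
MNEMONIC = re.compile(r"[A-Z0-9]{1,5}")


def classify_entry_name(name: str) -> str:
    """Return 'Swiss-Prot' or 'TrEMBL' according to the X_Y naming convention."""
    x, sep, y = name.partition("_")
    if not sep or not MNEMONIC.fullmatch(y):
        raise ValueError(f"not a valid entry name: {name!r}")
    if ACCESSION.fullmatch(x):
        return "TrEMBL"
    if MNEMONIC.fullmatch(x):
        return "Swiss-Prot"
    raise ValueError(f"not a valid entry name: {name!r}")
```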

The second item on the ID line indicates the status of the entry (see section 2.2).

The third and last item of the ID line is the length of the molecule, which is the total number of amino acids in the sequence. This number includes the positions reported to be present but which have not been determined (coded as 'X'). The length is followed by the letter code 'AA' (Amino Acids).
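The three items of the ID line can be extracted with a single regular expression. The sketch below is written under an assumption about the exact layout, namely 'ID', the entry name, the status followed by a semicolon, and the length followed by 'AA' and a period. The function name `parse_id_line` and the sequence length used in the usage example are made up for illustration.

```python
import re
from typing import Dict, Union

# Assumed layout: "ID   EntryName   Status;   Length AA."
ID_LINE = re.compile(r"ID {3}(\S+)\s+(Reviewed|Unreviewed);\s+(\d+) AA\.")


def parse_id_line(line: str) -> Dict[str, Union[str, int]]:
    """Return entry name, status and sequence length from an ID line."""
    match = ID_LINE.fullmatch(line.rstrip())
    if match is None:
        raise ValueError(f"not a recognised ID line: {line!r}")
    name, status, length = match.groups()
    return {"entry_name": name, "status": status, "length": int(length)}


# Usage with an illustrative line (the length 123 is invented):
print(parse_id_line("ID   RL1_ECOLI               Reviewed;         123 AA."))
```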

Two examples of Swiss-Prot ID lines are shown below:

Example of a TrEMBL ID line:

The AC (ACcession number) line lists the accession number(s) associated with an entry. The format of the AC line is:

An example of an accession number line is shown below:

Semicolons separate the accession numbers and a semicolon terminates the list. If necessary, more than one AC line can be used. Example:

The purpose of accession numbers is to provide a stable way of identifying entries from release to release. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of entries.

Researchers who wish to cite entries in their publications should always cite the first accession number. This is commonly referred to as the 'primary accession number'. 'Secondary accession numbers' are sorted alphanumerically.
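Since accession numbers are separated by semicolons and may be spread over several AC lines, the primary accession number can be obtained by joining the lines and taking the first item. A minimal Python sketch, in which the function name `parse_ac_lines` is an assumption made for illustration:

```python
from typing import List, Tuple


def parse_ac_lines(lines: List[str]) -> Tuple[str, List[str]]:
    """Return (primary accession, secondary accessions) from the AC line(s)."""
    for line in lines:
        if not line.startswith("AC   "):
            raise ValueError(f"not an AC line: {line!r}")
    # Drop the line code and blanks, then split the joined data on semicolons.
    data = " ".join(line[5:] for line in lines)
    accessions = [item.strip() for item in data.split(";") if item.strip()]
    if not accessions:
        raise ValueError("no accession number found")
    return accessions[0], accessions[1:]
```

Programs that map entries to other resources would typically key on the first element of the returned pair.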

We strongly advise those users who have programs performing mappings of Swiss-Prot to another data resource to use Swiss-Prot accession numbers to identify an entry.

Entries will have more than one accession number if they have been merged or split. For example, when two entries are merged into one, the accession numbers from both entries are stored in the AC line(s).

If an existing entry is split into two or more entries (a rare occurrence), the original accession numbers are retained in all the derived entries and a new primary accession number is added to all the entries.

An accession number is dropped only when the data to which it was assigned have been completely removed from the database. Accession numbers deleted from Swiss-Prot are listed in the document file delac_sp.txt and those deleted from TrEMBL are listed in delac_tr.txt.

UniProtKB accession numbers consist of 6 or 10 alphanumerical characters in the format:

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
[A-N,R-Z] | [0-9] | [A-Z] | [A-Z,0-9] | [A-Z,0-9] | [0-9]
[O,P,Q] | [0-9] | [A-Z,0-9] | [A-Z,0-9] | [A-Z,0-9] | [0-9]
[A-N,R-Z] | [0-9] | [A-Z] | [A-Z,0-9] | [A-Z,0-9] | [0-9] | [A-Z] | [A-Z,0-9] | [A-Z,0-9] | [0-9]

The three patterns can be combined into the following regular expression:

Here are some examples of valid accession numbers: P12345, Q1AAA9, O456A1, P4A123 and A0A022YWF9.
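As an illustration, one regular expression that is consistent with the three positional patterns listed above can be used to validate accession numbers in Python. The pattern below was written from those patterns for this sketch, and the names `UNIPROT_ACCESSION` and `is_valid_accession` are assumptions.

```python
import re

UNIPROT_ACCESSION = re.compile(
    # 6 characters starting with O, P or Q
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    # 6 or 10 characters starting with A-N or R-Z
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(accession: str) -> bool:
    """Return True if the whole string matches one of the accession formats."""
    return UNIPROT_ACCESSION.fullmatch(accession) is not None


for acc in ["P12345", "Q1AAA9", "O456A1", "P4A123", "A0A022YWF9"]:
    print(acc, is_valid_accession(acc))
```

Using `fullmatch` rather than `match` ensures that strings with trailing characters are rejected.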

The DT (DaTe) lines show the date of creation and last modification of the database entry.

The format of the DT line in Swiss-Prot is:

Where 'DD' is the day, 'MMM' the month and 'YYYY' the year, respectively. The dates shown in DT lines correspond to the date of the release at which an entry was integrated or updated. There are always three DT lines in each entry, each associated with a specific comment:

  • The first DT line indicates when the entry first appeared in the database. The associated comment, 'integrated into UniProtKB/database_name', indicates in which section of UniProtKB, Swiss-Prot or TrEMBL, the entry can be found
  • The second DT line indicates when the sequence data was last modified. The associated comment, 'sequence version', indicates the sequence version number. The sequence version number of an entry is incremented by one when the amino acid sequence shown in the sequence record is modified
  • The third DT line indicates when data other than the sequence was last modified. The associated comment, 'entry version', indicates the entry version number. The entry version number is incremented by one whenever any data in the flat file representation of the entry is modified.

Example of a block of Swiss-Prot DT lines:

Example of a block of TrEMBL DT lines:

Whenever the sequence of an entry is updated there is always also an annotation update. The date in the third DT line is thus always at least as recent as the one in the second DT line.
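The relationship between the second and third DT lines can be checked programmatically. The sketch below is written under an assumption about the layout of a DT line, namely a date of the form DD-MMM-YYYY with an uppercase three-letter English month, followed by a comma and the comment. The function names and the dates in the usage example are invented for illustration.

```python
from datetime import date
from typing import List

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_dt_date(line: str) -> date:
    """Extract the date from a line assumed to look like 'DT   DD-MMM-YYYY, comment.'"""
    stamp = line[5:].split(",", 1)[0].strip()
    day, month, year = stamp.split("-")
    return date(int(year), MONTHS[month.upper()], int(day))


def entry_date_is_consistent(dt_lines: List[str]) -> bool:
    """Check that the entry version date is at least as recent as the sequence date."""
    if len(dt_lines) != 3:
        raise ValueError("an entry must contain exactly three DT lines")
    _integrated, sequence, entry = (parse_dt_date(line) for line in dt_lines)
    return entry >= sequence


# Usage with invented dates:
block = [
    "DT   01-JAN-1990, integrated into UniProtKB/Swiss-Prot.",
    "DT   01-JAN-1990, sequence version 1.",
    "DT   05-FEB-2020, entry version 100.",
]
print(entry_date_is_consistent(block))
```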

Note that sequence and entry versions are not reset when an entry moves from TrEMBL to Swiss-Prot. The date of integration into Swiss-Prot can therefore be more recent than the last sequence update.

A comprehensive archive of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions is available in the UniProtKB Sequence/Annotation Version Database (UniSave). Unlike the UniProt Knowledgebase, which contains only the latest Swiss-Prot and TrEMBL entry and sequence versions, UniSave provides access to all versions of these entries. This makes it possible to track sequence changes and to find out when a given annotation appeared in an entry and how it evolved.

The DE (DEscription) lines contain general descriptive information about the sequence stored. This information is generally sufficient to identify the protein precisely.

The description always starts with the recommended name (RecName) of the protein. Alternative names (AltName) are indicated thereafter.

The DE line contains 3 categories, as well as several subcategories, of protein names:

Category field | Subcategory field | Cardinality | Description
RecName: | | 1 in UniProtKB/Swiss-Prot, 0-1 in UniProtKB/TrEMBL | The name recommended by the UniProt consortium.
 | Full= | 1 | The full name.
 | Short= | 0-n | An abbreviation of the full name or an acronym.
 | EC= | 0-n | An Enzyme Commission number.
AltName: | | 0-n | A synonym of the recommended name.
 | Full= | 0-1 | The full name.
 | Short= | 0-n | An abbreviation of the full name or an acronym.
 | EC= | 0-n | An Enzyme Commission number.
AltName: | Allergen= | 0-1 | See allergen.txt.
AltName: | Biotech= | 0-1 | A name used in a biotechnological context.
AltName: | CD_antigen= | 0-n | See cdlist.txt.
AltName: | INN= | 0-n | The international nonproprietary name: a generic name for a pharmaceutical substance or active pharmaceutical ingredient that is globally recognized and is public property.
SubName: | | 0 in UniProtKB/Swiss-Prot, 0-n in UniProtKB/TrEMBL | A name provided by the submitter of the underlying nucleotide sequence.
 | Full= | 1 | The full name.
 | EC= | 0-n | An Enzyme Commission number.

Each name is shown on a separate line; lines may therefore exceed 80 characters.

Protein naming guidelines are described in the International protein nomenclature guidelines.

A block of DE lines may further contain multiple Includes: and/or Contains: sections and a separate field Flags: to indicate whether the protein sequence is a precursor or a fragment:

Field | Cardinality | Value
Includes: | 0-n | A block of protein names as described in the table above.
Contains: | 0-n | A block of protein names as described in the table above.
Flags: | 0-1 | Precursor and/or Fragment or Fragments

If a protein is known to be cleaved into multiple functional components, the description starts with the name of the precursor protein, followed by 'Contains:' section(s). Each individual component is described in a separate 'Contains:' section. Alternative names (AltName) are allowed for each individual component. Example:

If a protein is known to include multiple functional domains, each of which is described by a different name, the description starts with the name of the overall protein, followed by 'Includes:' section(s). Each domain is listed in a separate 'Includes:' section. Alternative names (AltName) are allowed for each individual domain. Example:

In rare cases, the functional domains of an enzyme are cleaved, but the catalytic activity can only be observed when the individual chains reorganize in a complex. Such proteins are described in the DE line by a combination of both 'Includes:' and 'Contains:', in the order given in the following example:

When the mature form of a protein is derived by processing of a precursor, we indicate this fact using the Flag 'Precursor'; in such cases the sequence displayed does not correspond to the mature form of the protein.

If the complete sequence is not determined, we indicate it in the 'Flags' section with 'Fragment' or 'Fragments'. Example:

The format of the DE line in TrEMBL follows closely the format used in Swiss-Prot. However, as TrEMBL is not manually annotated, the description is derived directly from the underlying nucleotide entry and its accuracy relies on the information provided by the submitter of the nucleotide entry. This is why TrEMBL entries usually have a submitted name (SubName) instead of a recommended name (RecName). The description may later be improved by automatic annotation procedures (see section Automatic annotation); if not, it remains as provided by the submitter until the entry is manually annotated and added to Swiss-Prot.

The GN (Gene Name) line indicates the name(s) of the gene(s) that code for the stored protein sequence. The GN line contains three types of information:

  1. Gene names (a.k.a. gene symbols). The name(s) used to represent a gene. As there can be more than one name assigned to a gene, we make a distinction between the one which we believe should be used as the official gene name and the other names, which are listed as "Synonyms".
  2. Ordered locus names (a.k.a. OLN, ORF numbers, CDS numbers or Gene numbers). A name used to represent an ORF in a completely sequenced genome or chromosome. It is generally based on a prefix representing the organism and a number which usually represents the sequential ordering of genes on the chromosome. Depending on the genome sequencing center, numbers are only attributed to protein-coding genes, or also to pseudogenes, or also to tRNAs and other features. If two predicted genes have been merged to form a new gene, both gene identifiers are indicated, separated by a slash (see last example). Examples: HI0934, Rv3245c, At5g34500, YER456W, YAR042W/YAR044W.
  3. ORF names (a.k.a. sequencing names or contig names or temporary ORFNames). A name temporarily attributed by a sequencing project to an open reading frame. This name is generally based on a cosmid numbering system. Examples: MtCY277.28c, SYGP-ORF50, SpBC2F12.04, C06E1.1, CG10954.

The format of the GN line is:

None of the above four tokens is mandatory, but a "Synonyms" token can only be present if there is a "Name" token.

If there is more than one gene, GN line blocks for the different genes are separated by the following line:

Wrapping is done preferentially at a semicolon, otherwise at a comma.

It often occurs that more than one name has been assigned to an individual locus, in which case all the synonyms will be listed alphabetically and case-insensitively. Example:

The OS (Organism Species) line specifies the organism which was the source of the stored sequence. In the rare case where all the species information will not fit on a single line, more than one OS line is used. The last OS line is terminated by a period.

The species designation consists, in most cases, of the Latin genus and species designation followed by the English name (in parentheses). For viruses, only the common English name is given.

Examples of OS lines are shown here:

The names (official name, common name, synonym) concerning one species are split across lines when they do not fit into a single line:

The OG (OrGanelle) line indicates if the gene coding for a protein originates from mitochondria, a plastid, a nucleomorph or a plasmid.

The format of the OG line is:

Where 'name' is the name of the plasmid.

Hydrogenosomes are membrane-enclosed redox organelles found in some anaerobic unicellular eukaryotes which contain hydrogenase and produce hydrogen and ATP by glycolysis. They are thought to have evolved from mitochondria; most hydrogenosomes lack a genome, but some (e.g. that of the anaerobic ciliate Nyctotherus ovalis) have retained a rudimentary genome.

Mitochondria are redox-active membrane-bound organelles found in the cytoplasm of most eukaryotic cells. They are the site of the reactions of oxidative phosphorylation, which results in the formation of ATP.

Nucleomorphs are reduced vestigial nuclei found in the plastids of cryptomonad and chlorarachniophyte algae. The plastids originate from engulfed eukaryotic phototrophs.

Plastids are classified based on either their taxonomic lineage or in some cases on their photosynthetic capacity.

Apicoplasts are the plastids found in Apicomplexa parasites such as Eimeria, Plasmodium and Toxoplasma; they are not photosynthetic.

Chloroplasts are the plastids found in all land plants and algae with the exception of the glaucocystophyte algae (see below). Chloroplasts in green tissue are photosynthetic; in other tissues they may not be photosynthetic and then may also have secondary information relating to subcellular location (e.g. amyloplasts, chromoplasts).

Organellar chromatophores are the photosynthetic plastids found in Paulinella chromatophora, a photosynthetic thecate amoeba of the Cercozoa lineage. At 1 Mb this plastid is 3 times larger than the largest known plastid; it is not clear if it is derived from the same endosymbiotic event that is thought to have led to all other plastids.

Cyanelles are the plastids found in the glaucocystophyte algae. They are also photosynthetic but their plastid has a vestigial cell wall between the 2 envelope membranes.

Non-photosynthetic plastid is used when the plastid in question derives from a photosynthetic lineage but is missing essential genes. Some examples are Aneura mirabilis, Epifagus virginiana and Helicosporidium (a liverwort, a higher plant and a green alga, respectively).

The term Plastid is used when the capacities of the organism are unclear; for example in the parasitic plants of the Cuscuta lineage, where sometimes young tissue is photosynthetic.

If an entry reports the sequence of a protein identical in a number of plasmids, the names of these plasmids will all be listed in the OG lines of that entry. The plasmid names are separated by commas; the last plasmid name is preceded by the word 'and'. Plasmid names are never written across two lines. Example:

The document plasmid.txt lists all the plasmid names that are used in the database in the context of the OG line.

The OC (Organism Classification) lines contain the taxonomic classification of the source organism. The taxonomic classification used is that maintained at the NCBI (see https://www.ncbi.nlm.nih.gov/Taxonomy/) and used by the nucleotide sequence databases (EMBL/GenBank/DDBJ). It is a sequence-based taxonomy as much as possible and is based on published authorities wherever possible. Because of the inherent ambiguity of evolutionary classification and the specific needs of database users (e.g. trying to track down the phylogenetic history of a group of organisms or to elucidate the evolution of a molecule), this taxonomy strives to accurately reflect current phylogenetic knowledge. The NCBI's taxonomy is intended to be informative and helpful; no claim is made that it is the best or the most exact.

The classification is listed top-down as nodes in a taxonomic tree in which the most general grouping is given first. The classification may be distributed over several OC lines, but nodes are not split or hyphenated between lines. Semicolons separate the individual items and the list is terminated by a period.
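Since nodes are separated by semicolons, never split between lines, and the list ends with a period, the OC lines of an entry can be turned into an ordered list of taxonomic nodes as in the following Python sketch. The function name `parse_oc_lines` is an assumption, and the classification in the usage example is abbreviated for illustration.

```python
from typing import List


def parse_oc_lines(lines: List[str]) -> List[str]:
    """Return the taxonomic nodes of the OC line(s), most general grouping first."""
    # Drop the line code and the three blanks, then join the lines.
    data = " ".join(line[5:].strip() for line in lines)
    if not data.endswith("."):
        raise ValueError("the classification must be terminated by a period")
    return [node.strip() for node in data[:-1].split(";") if node.strip()]


# Usage with an abbreviated, illustrative classification:
print(parse_oc_lines([
    "OC   Eukaryota; Metazoa; Chordata;",
    "OC   Mammalia; Primates; Hominidae; Homo.",
]))
```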

The format of the OC line is:

For example, the classification lines for a human sequence would be:

The OX (Organism taxonomy cross-reference) line is used to indicate the identifier of a specific organism in a taxonomic database. The format of the OX line is:

Currently the cross-references are made to the taxonomy database of NCBI, which is associated with the qualifier 'TaxID' and a taxonomic code.

The OH (Organism Host) line is optional and appears only in viral entries. It indicates the host organism(s) that are susceptible to infection by the virus.

A virus being an inert particle outside its hosts, the virion has neither metabolism, nor any replication capability, nor autonomous evolution. Identifying the host organism(s) is therefore essential, because features like virus-cell interactions and posttranslational modifications depend mostly on the host.

The format of the OH line is:

The HostName consists of the official name and, optionally, a common name and/or synonym. The length of an OH line may exceed 80 characters.

Example for Simian hepatitis A virus:

These lines comprise the literature citations. The citations indicate the sources from which the data has been abstracted. The reference lines for a given citation occur in a block, and are always in the order RN, RP, RC, RX, RG, RA, RT and RL. Within each such reference block, the RN line occurs once, the RC, RX and RT lines occur zero or more times, and the RP, RG/RA and RL lines occur one or more times. If several references are given, there will be a reference block for each.
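The ordering and occurrence rules for a reference block can be expressed as a small validation routine. The following Python sketch works on the sequence of line codes of one block; the function name `is_valid_reference_block` is an assumption made for illustration.

```python
from collections import Counter
from typing import List

# Required order of line codes within a reference block.
REFERENCE_ORDER = ["RN", "RP", "RC", "RX", "RG", "RA", "RT", "RL"]


def is_valid_reference_block(codes: List[str]) -> bool:
    """Check line order and occurrence counts for one reference block."""
    if any(code not in REFERENCE_ORDER for code in codes):
        return False
    positions = [REFERENCE_ORDER.index(code) for code in codes]
    if positions != sorted(positions):
        # Lines are not in the order RN, RP, RC, RX, RG, RA, RT, RL.
        return False
    counts = Counter(codes)
    return (
        counts["RN"] == 1                     # exactly one RN line
        and counts["RP"] >= 1                 # one or more RP lines
        and counts["RG"] + counts["RA"] >= 1  # at least one RG or RA line
        and counts["RL"] >= 1                 # one or more RL lines
    )
```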

An example of a complete reference is:

The formats of the individual lines are explained below.

The RN (Reference Number) line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in comments and feature table notes. The format of the RN line is:

where 'n' denotes the nth reference for this entry. The reference number is always between square brackets.

The RP (Reference Position) lines describe the extent of the work relevant to the entry carried out by the authors. The format of the RP line is:

It should contain a description of the information that has been propagated in the Swiss-Prot entry.

A typical comment is "NUCLEOTIDE SEQUENCE". This item might be tagged with a qualifier indicating the origin of the sequence data. Valid names for these qualifiers are:

  • GENOMIC DNA: the individual gene has been sequenced
  • GENOMIC RNA: the individual gene has been sequenced
  • MRNA: the individual cDNA has been sequenced
  • LARGE SCALE GENOMIC DNA: the gene has been sequenced as part of a genome project
  • LARGE SCALE MRNA: the cDNA has been sequenced as part of a large-scale cDNA project

If 2 qualifiers apply, both are indicated, separated by a '/'.

'LARGE SCALE ANALYSIS' is another typical tag, added to references that report the results of large-scale screens to indicate that the results have not been extensively studied.

Typical examples of RP lines are shown below:

The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited. The format of the RC line is:

The currently defined tokens and their order in the RC line are:

STRAIN, PLASMID, TRANSPOSON, TISSUE

Reference comment line topics may span lines. Examples of RC lines:

The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is:

Where the valid bibliographic database names and their associated identifiers are:

Name | Identifier
MEDLINE | Eight-digit MEDLINE Unique Identifier (UI)
PubMed | PubMed Unique Identifier (PMID)
DOI | Digital Object Identifier (DOI)
AGRICOLA | AGRICOLA Unique Identifier

The Reference Group (RG) line lists the consortium name associated with a given citation. The RG line is mainly used in submission reference blocks, but can also be used in paper references if the working group is cited as an author in the paper. An RG line and an RA (Reference Author) line can be present in the same reference block; at least one RG or RA line is mandatory per reference block. An example of the use of RG lines is shown below:

The RA (Reference Author) lines list the authors of the paper (or other work) cited. The RA line is present in most references, but might be missing in references that cite a reference group (see RG line). At least one RG or RA line is mandatory per reference block.

All of the authors are included, and are listed in the order given in the paper. The names are listed surname first followed by a blank, followed by initial(s) with periods. The authors' names are separated by commas and terminated by a semicolon. Author names are not split between lines. An example of the use of RA lines is shown below:

As many RA lines as necessary are included in each reference.
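Because author names are separated by commas, are never split between lines, and the list ends with a semicolon, the RA lines of a reference can be converted into a list of authors as sketched below in Python. The function name `parse_ra_lines` and the author names in the usage example are invented for illustration.

```python
from typing import List


def parse_ra_lines(lines: List[str]) -> List[str]:
    """Return the authors of a reference in the order given in the RA line(s)."""
    # Drop the line code and the three blanks, then join the lines.
    data = " ".join(line[5:].strip() for line in lines)
    if not data.endswith(";"):
        raise ValueError("the author list must be terminated by a semicolon")
    return [author.strip() for author in data[:-1].split(",") if author.strip()]


# Usage with invented author names:
print(parse_ra_lines([
    "RA   Smith J.A., Jones B. Jr.,",
    "RA   Miller C.D.;",
]))
```

The sketch keeps each name as a single string, since surnames may themselves contain blanks and cannot be separated reliably from the initials by splitting on the first blank.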

An author's initials can be followed by an abbreviation such as 'Jr' (for Junior), 'Sr' (Senior), 'II', 'III' or 'IV' (2nd, 3rd and 4th). Example:

The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as possible given the limitations of the computer character set. The format of the RT line is:

Example of a set of RT lines:

It should be noted that the format of the title is not always identical to that displayed at the top of the published work:

  • Major title words are not capitalized
  • The text of a title ends with either a period '.', a question mark '?' or an exclamation mark '!'
  • Double quotation marks ' " ' in the text of the title are replaced by single quotation marks
  • Titles of articles published in a language other than English have been translated into English
  • Greek letters are written in full (alpha, beta, etc.).

The RL (Reference Location) lines contain the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question.

The RL line for a journal citation includes the journal abbreviation, the volume number, the page range and the year. The format for such an RL line is:

Journal names are abbreviated according to the conventions used by the National Library of Medicine (NLM) and are based on the existing ISO and ANSI standards. A list of the abbreviations currently in use is given in the document file jourlist.txt.

An example of an RL line is:

When a reference is made to a paper which is 'in press' at the time the database is released, the page range, and possibly the volume number, are indicated as '0' (zero). An example of such an RL line is shown here:

The RL line for an electronic publication includes an '(er)' prefix. The format is indicated below:

A variation of the RL line format is used for papers found in books or other types of publication, which are then cited using the following format:

For unpublished observations the format of the RL line is:

Where 'MMM' is the month and 'YYYY' is the year.

We use the 'unpublished observations' RL line to cite communications by scientists to Swiss-Prot of unpublished information concerning various aspects of a sequence entry.

For Ph.D. theses the format of the RL line is:

An example of such a line is given here:

For patent applications the format of the RL line is:

Where 'Pat_num' is the international publication number of the patent, 'DD' is the day, 'MMM' is the month and 'YYYY' is the year. Example:

The final form that an RL line can take is that used for submissions. The format of such an RL line is:

Where 'MMM' is the month, 'YYYY' is the year and 'Database_name' is one of the following:

  • the EMBL/GenBank/DDBJ databases
  • UniProtKB
  • the PDB data bank
  • the PIR data bank

Two examples of submission RL lines are given here:

The CC lines are free text comments on the entry, and are used to convey any useful information. The comments always appear below the last reference line and are grouped together in comment blocks; a block is made up of 1 or more comment lines. The first line of a block starts with the characters '-!-'.

The format of a comment block is:

The comment blocks are arranged according to what we designate as 'topics'. The current topics and their definitions are listed in the table below.

Topic Description
ALLERGEN Information relevant to allergenic proteins
ALTERNATIVE PRODUCTS Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene, alternative promoter usage, ribosomal frameshifting or by the use of alternative initiation codons see 3.12.15
BIOPHYSICOCHEMICAL PROPERTIES Description of the information relevant to biophysical and physicochemical data and information on pH dependence, temperature dependence, kinetic parameters, redox potentials, and maximal absorption see 3.12.8
BIOTECHNOLOGY Description of the use of a specific protein in a biotechnological process
CATALYTIC ACTIVITY Description of the reaction(s) catalyzed by an enzyme [1]
CAUTION Warning about possible errors and/or grounds for confusion
COFACTOR Description of any non-protein substance required by an enzyme for its catalytic activity
DEVELOPMENTAL STAGE Description of the developmentally-specific expression of mRNA or protein
DISEASE Description of the disease(s) associated with a deficiency of a protein
DISRUPTION PHENOTYPE Description of the effects caused by the disruption of the gene coding for the protein see 3.12.27
DOMAIN Description of the domain structure of a protein
ACTIVITY REGULATION Description of the regulatory mechanism of an enzyme, transporter, microbial transcription factor
FUNCTION General description of the function(s) of a protein
INDUCTION Description of the compound(s) or condition(s) that regulate gene expression
INTERACTION Conveys information relevant to binary protein-protein interaction 3.12.12
MASS SPECTROMETRY Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods see 3.12.23
MISCELLANEOUS Any comment which does not belong to any of the other defined topics
PATHWAY Description of the metabolic pathway(s) with which a protein is associated
PHARMACEUTICAL Description of the use of a protein as a pharmaceutical drug
POLYMORPHISM Description of polymorphism(s)
PTM Description of any chemical alternation of a polypeptide (proteolytic cleavage, amino acid modifications including crosslinks). This topic complements information given in the feature table or indicates polypeptide modifications for which position-specific data is not available.
RNA EDITING Description of any type of RNA editing that leads to one or more amino acid changes
SEQUENCE CAUTION Description of protein sequence reports that differ from the sequence that is shown in UniProtKB due to conflicts that are not described in FT CONFLICT lines, such as frameshifts, erroneous gene model predictions, etc. See 3.12.36
SIMILARITY Description of the similarity or similarities (sequence or structural) of a protein with other proteins
SUBCELLULAR LOCATION Description of the subcellular location of the chain/peptide/isoform. See 3.12.13
SUBUNIT Description of the quaternary structure of a protein and any kind of interactions with other proteins or protein complexes except for receptor-ligand interactions, which are described in the topic FUNCTION.
TISSUE SPECIFICITY Description of the tissue-specific expression of mRNA or protein
TOXIC DOSE Description of the lethal dose (LD), paralytic dose (PD) or effective dose of a protein
WEB RESOURCE Description of a cross-reference to a network database/resource for a specific protein see 3.12.38

[1] For the 'CATALYTIC ACTIVITY' topic: To allow the curation of reactions at the level of specific enzymes instead of enzyme classes, and to use standardized names for reactants, we use chemical reaction descriptions from the Rhea database whenever possible. For catalytic activities that can only be described in the form of free text, we follow the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) as published in Enzyme Nomenclature, NC-IUBMB, Academic Press, New-York, (1992).

We show here, for each of the defined topics, two examples of their usage:

A BIOPHYSICOCHEMICAL PROPERTIES block must contain at least one of the properties Absorption, Kinetic parameters, pH dependence, Redox potential, Temperature dependence and may have any combination of these properties (ordered as indicated above). The meaning of these subtopics is as follows:

Property Description
Absorption indicates the wavelength at which photoreactive proteins such as opsins and DNA photolyases show maximal absorption
Kinetic parameters mentions the Michaelis-Menten constant (KM) and maximal velocity (Vmax) of enzymes
pH dependence describes the optimum pH for enzyme activity and/or the variation of enzyme activity with pH variation
Redox potential reports the value of the standard (midpoint) oxido-reduction potential(s) for electron transport proteins
Temperature dependence indicates the optimum temperature for enzyme activity and/or the variation of enzyme activity with temperature variation; the thermostability/thermolability of the enzyme is also mentioned when it is known
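The two constraints stated for a BIOPHYSICOCHEMICAL PROPERTIES block (at least one property, listed in the documented order) can be checked mechanically. The following is a minimal sketch; the function name and the representation of a block as a plain list of subtopic names are assumptions made for illustration, not part of the manual.

```python
# Subtopic order as given in the text above.
PROPERTY_ORDER = [
    "Absorption",
    "Kinetic parameters",
    "pH dependence",
    "Redox potential",
    "Temperature dependence",
]


def check_biophysicochemical_block(properties):
    """Return a list of problems found in a list of subtopic names.

    An empty list means the block satisfies both documented constraints.
    """
    problems = []
    if not properties:
        problems.append("block must contain at least one property")
    for name in properties:
        if name not in PROPERTY_ORDER:
            problems.append(f"unknown property: {name}")
    # Compare the positions of the known subtopics with their sorted order.
    ranks = [PROPERTY_ORDER.index(p) for p in properties if p in PROPERTY_ORDER]
    if ranks != sorted(ranks):
        problems.append("properties are not in the documented order")
    return problems
```

For example, `check_biophysicochemical_block(["pH dependence", "Absorption"])` reports an ordering problem, while `["Kinetic parameters", "pH dependence"]` passes.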

The CC line topic INTERACTION conveys information relevant to binary protein-protein interaction. It is automatically derived from the IntAct database and is updated on a monthly basis. The occurrence is one INTERACTION topic per entry, with each binary interaction being presented in a separate line. Each data line can be longer than 80 characters.

Interactions can be derived by any appropriate experimental method, but must be confirmed by a second experiment if they result from a single yeast two-hybrid experiment. For large-scale experiments, interactions are considered if a high confidence is assigned by the authors.

The format of the CC line topic INTERACTION is:

  • <Interactant> represents a UniProtKB protein.
  • the first <Interactant> is represented by: (<Accession>|<IsoId>|<ProductId>)
  • the second <Interactant> is represented by: (<Accession>|<IsoId>|<ProductId> [<Accession>])(: <Gene>)?
  • <IsoId> is a UniProtKB isoform ID.
  • <ProductId> is a UniProtKB product ID, i.e. a feature identifier.
  • <Gene> is either the gene name, ordered locus name or ORF name of the gene that encodes the UniProtKB protein.
  • <Experiments> is the number of experiments in IntAct that support an interaction.
  • <IntActId> is an IntAct protein ID.
  • 'Xeno' is an optional flag that indicates that the interacting proteins are derived from different species. This may be due to the experimental set-up or may reflect a pathogen-host interaction.
  • IntAct=IntActId, IntActId identifies the interaction in IntAct by using the two IntAct protein identifiers.

Binary interactions with different isoforms that are described in P11309.

Binary interaction with a product of proteolytic cleavage.

Examples of interaction lines are given below. The CC INTERACTION topics are not complete; only the interaction lines that are explained are shown.

In the typical example the current protein, Q9NQ11, is interacting with Q2M2I8 which is further characterized by its gene name "AAK1". The interaction is supported by two experiments stored in IntAct. Experimental details for this interaction can be found by querying IntAct with "EBI-6308763, EBI-1383433".
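As an illustration of how the elements described above fit together, here is a minimal parsing sketch. The exact line layout is an assumption: it takes each data line to have the shape `CC       <Interactant>; <Interactant>; (Xeno;)? NbExp=<Experiments>; IntAct=<IntActId>, <IntActId>;`, where the `NbExp=` keyword and the `CC` prefix are not shown in the text above. The sample line is assembled from the identifiers quoted in the typical example.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Interaction(NamedTuple):
    first: str                  # accession, IsoId or ProductId
    second: str                 # accession, IsoId or "ProductId [Accession]"
    gene: Optional[str]         # gene name of the second interactant, if given
    xeno: bool                  # True when the optional Xeno flag is present
    experiments: int            # number of supporting experiments in IntAct
    intact_ids: Tuple[str, str]


# Assumed layout; see the lead-in for which parts are assumptions.
_INTERACTION_RE = re.compile(
    r"^(?:CC\s+)?"
    r"(?P<first>[^;]+);\s*"
    r"(?P<second>[^;:]+?)"
    r"(?::\s*(?P<gene>[^;]+))?;\s*"
    r"(?P<xeno>Xeno;\s*)?"
    r"NbExp=(?P<n>\d+);\s*"
    r"IntAct=(?P<id1>[^,;\s]+),\s*(?P<id2>[^,;\s]+);?\s*$"
)


def parse_interaction(line: str) -> Interaction:
    """Parse one INTERACTION data line into its components."""
    match = _INTERACTION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an INTERACTION data line: {line!r}")
    gene = match.group("gene")
    return Interaction(
        first=match.group("first").strip(),
        second=match.group("second").strip(),
        gene=gene.strip() if gene else None,
        xeno=match.group("xeno") is not None,
        experiments=int(match.group("n")),
        intact_ids=(match.group("id1"), match.group("id2")),
    )


example = "CC       Q9NQ11; Q2M2I8: AAK1; NbExp=2; IntAct=EBI-6308763, EBI-1383433;"
parsed = parse_interaction(example)
```

Because data lines may exceed 80 characters, the sketch treats each line as a complete record and does not attempt to rejoin wrapped lines.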

The current protein, P27449, interacts with an isoform of Q9UHP7 defined by the IsoID Q9UHP7-3.

No gene name information for the interacting protein is available.

The protein self-associates.

The source organisms of the interacting proteins are different.

Different isoforms of the current protein are shown to interact with the same protein (O14613). This is reflected by different IntActIDs for the current protein.

Example entry with many interaction lines: Q02821.

The document subcell.txt lists the controlled vocabularies used in the comment line (CC) topic SUBCELLULAR LOCATION, their definitions and further information such as synonyms or relevant GO terms in the following format:

The format of SUBCELLULAR LOCATION is:

  • Molecule: Isoform, chain or peptide name
  • Location = Subcellular_location( Flag)?( Topology( Flag)?)?( Orientation( Flag)?)?
    • Subcellular_location: SL-line of subcell.txt ID-record
    • Topology: SL-line of subcell.txt IT-record
    • Orientation: SL-line of subcell.txt IO-record

    Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 1 or more times (+).

    When no chain/peptide/isoform is specified, the subcellular location corresponds to that of the mature protein.

    The format of the CC line topic ALTERNATIVE PRODUCTS is:

    Note: Variable values are represented in italics. Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?), may occur 0 or more times (*), or 1 or more times (+). Alternative values are separated by a pipe symbol (|).

    Topic Description
    Event Biological process that results in the production of the alternative forms. It lists one or a combination of the following values: Alternative promoter usage, Alternative splicing, Alternative initiation, Ribosomal frameshifting.
    Format: Event=controlled vocabulary
    Example: Event=Alternative splicing
    Named isoforms Number of isoforms listed in the 'Name' topics; currently only for 'Event=Alternative splicing'.
    Format: Named isoforms=number
    Example: Named isoforms=6
    Comment Any comments concerning one or more isoforms; optional.
    Format: Comment=free text
    Example: Comment=Experimental confirmation may be lacking for some isoforms
    Name A common name for an isoform used in the literature or assigned by Swiss-Prot; currently only available for spliced isoforms.
    Format: Name=common name
    Example: Name=Alpha
    Synonyms Synonyms for an isoform as used in the literature; optional; currently only available for spliced isoforms.
    Format: Synonyms=Synonym_1[, Synonym_n]
    Example: Synonyms=B, KL5
    IsoId Unique identifier for an isoform, consisting of the Swiss-Prot accession number, followed by a dash and a number.
    Format: IsoId=acc#-isoform_number[, acc#-isoform_number]
    Example: IsoId=P05067-1
    Sequence Information on the isoform sequence. The term 'Displayed' indicates that the sequence is shown in the entry. A list of feature identifiers (VSP_#) indicates that the isoform is annotated in the feature table; the FTIds enable programs to create the sequence of a splice variant. If the accession number of the IsoId does not correspond to the accession number of the current entry, this topic contains the term 'External'. 'Not described' points out that the sequence of the isoform is unknown.
    Format: Sequence=VSP_#[, VSP_#]|Displayed|External|Not described
    Example: Sequence=Displayed
    Example: Sequence=VSP_000013, VSP_000014
    Example: Sequence=External
    Example: Sequence=Not described
    Note Lists isoform-specific information; optional. It may specify the event(s), if there are several.
    Format: Note=Free text
    Example: Note=No experimental confirmation available
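The `Key=value` formats listed above lend themselves to a simple dictionary parser. The sketch below assumes that consecutive fields are separated by semicolons, which the format lines above do not state explicitly; the sample text is assembled from the individual examples given for each topic, not taken from a real entry.

```python
def parse_fields(text):
    """Split 'Key=value; Key=value;' text into a dict, keeping field order.

    Assumes ';' separates fields, so a free-text Comment or Note that itself
    contains ';' would need more careful handling than this sketch provides.
    """
    fields = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, sep, value = chunk.partition("=")
        if not sep:
            raise ValueError(f"not a Key=value field: {chunk!r}")
        fields[key.strip()] = value.strip()
    return fields


def parse_sequence(value):
    """Interpret a Sequence value: a keyword, or a list of VSP identifiers."""
    keywords = {"Displayed", "External", "Not described"}
    if value in keywords:
        return value
    return [item.strip() for item in value.split(",")]


isoform = parse_fields(
    "Name=Alpha; Synonyms=B, KL5; IsoId=P05067-1; Sequence=Displayed;"
)
```

Returning the keyword unchanged and the identifiers as a list lets a caller distinguish "no feature-table annotation needed" from "apply these VSP features" with a single type check.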

    Example of the CC lines and the corresponding FT lines for an entry with alternative splicing:

    • 'ProductName' is the name of an isoform or product of proteolytic cleavage
    • 'Mass=XXX' is the determined molecular weight (MW)
    • 'Mass_error=XX' (optional) is the accuracy or error range of the MW measurement
    • 'Method=XX' is the ionization method
    • 'Note='. Comment in free text format
    • 'Evidence=PubMed:/Ref.n' indicates the relevant reference.

    Note that we only describe effects caused by the complete absence of a gene, and thus of a protein, in vivo (null mutants caused by random or targeted deletions, insertions of a transposable element, etc.). To avoid describing phenotypes due to partial or dominant negative mutants, missense mutations are not described in this comment, but in FT MUTAGEN instead. Defects caused by transient inactivation by methods such as RNA interference (RNAi) or blockage by antibodies are not described in this comment because the results are difficult to interpret, except for C. elegans RNAi studies, which are widely used and done in vivo.

    The format of the SEQUENCE CAUTION topic is:

    • Sequence is the sequence which differs from the UniProtKB sequence. It is described by one of:
      • an EMBL protein identifier (with version number)
      • an EMBL accession number.
      • a literature reference (e.g. Ref.3).
      • Frameshift
      • Erroneous initiation
      • Erroneous termination
      • Erroneous gene model prediction
      • Erroneous translation
      • Miscellaneous discrepancy

      These lines will not be wrapped and their length may therefore exceed 80 characters.

      • 'ProductName' is the name of an isoform or product of proteolytic cleavage
      • 'Name' is the name of the database
      • 'Note' (optional) is a free text note
      • 'URL' is the WWW address (URL) of the database

      The length of these lines may exceed 80 characters because long URL addresses are not wrapped into multiple lines.

      The DR (Database cross-Reference) lines are used as pointers to information in external data resources that is related to UniProtKB entries. The full list of all databases to which UniProtKB is cross-referenced can be found in the document dbxref.txt. It also contains references describing these resources and provides links to their web sites.

      For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in the Protein Data Bank (PDB), there will be one DR line pointing to each of the corresponding entries in PDB. For a sequence translated from a nucleotide sequence, there will be one or more DR lines pointing to the relevant entry or entries in the EMBL/GenBank/DDBJ database which correspond to the DNA or RNA sequence(s) from which it was translated.

      The format of the DR line is:

      The cross-references to the EMBL/GenBank/DDBJ nucleotide sequence database and PROSITE are described in sections 3.15 and 3.15.114.

      The first field of the DR line, the 'RESOURCE_ABBREVIATION', is the abbreviated name of the referenced resource. The currently defined abbreviations are listed below.

      Abbreviation Description
      EMBL Nucleotide sequence database of EMBL/EBI (see 3.15)
      ABCD ABCD database of sequenced antibodies with their known targets
      Allergome Allergome a platform for allergen knowledge
      Antibodypedia Antibodypedia antibody database
      ArachnoServer ArachnoServer: Spider toxin database
      Araport Araport: Arabidopsis Information Portal
      Bgee Bgee dataBase for Gene Expression Evolution
      BindingDB The Binding Database
      BioCyc Collection of Pathway/Genome Databases
      BioGRID BioGRID, The Biological General Repository for Interaction Datasets
      BioGRID-ORCS BioGRID-ORCS database of CRISPR phenotype screens
      BioMuta BioMuta curated single-nucleotide variation and disease association database
      BMRB Biological Magnetic Resonance Data Bank
      BRENDA BRENDA Comprehensive Enzyme Information System
      CarbonylDB CarbonylDB database of protein carbonylation sites
      CAZy Carbohydrate-Active enZymes
      CCDS The Consensus CDS (CCDS) project
      CDD Conserved Domains Database
      ChEMBL A database of bioactive drug-like small molecules.
      ChiTaRS A database of human, mouse and fruit fly chimeric transcripts and RNA-sequencing data
      CGD Candida genome database
      CLAE Database of Characterized Lignocellulose-Active Enzymes
      ComplexPortal A manually curated, encyclopaedic resource of macromolecular complexes from a number of key model organisms
      COMPLUYEAST-2DPAGE 2-D database at Universidad Complutense de Madrid
      CollecTF CollecTF database of bacterial transcription factor binding sites
      ConoServer ConoServer: Cone snail toxin database
      CORUM CORUM comprehensive resource of mammalian protein complexes
      CPTAC The CPTAC Assay Portal serves as a centralized public repository of "fit-for-purpose," multiplexed quantitative mass spectrometry-based proteomic targeted assays.
      CPTC CPTAC Antibody Portal
      CTD Comparative Toxicogenomics Database
      DEPOD Human DEPhOsphorylation Database
      dictyBase Dictyostelium discoideum online informatics resource
      DIP Database of interacting proteins
      DisGeNET DisGeNET discovery platform integrating information on gene-disease associations
      DisProt Database of protein disorders
      DMDM Domain mapping of disease mutations
      DNASU The DNASU plasmid repository
      DOSAC-COBS-2DPAGE 2D-PAGE database from the dipartimento oncologico di III Livello
      DrugBank The DrugBank database (DrugBank)
      DrugCentral DrugCentral Online Drug Compendium
      EchoBASE The integrated post-genomic database for E. coli (EchoBASE)
      eggNOG evolutionary genealogy of genes: Non-supervised Orthologous Groups
      ELM The Eukaryotic Linear Motif resource for functional sites in proteins
      Ensembl Database of automatically annotated sequences of large genomes (Ensembl database)
      EnsemblBacteria This database is part of Ensembl Genomes, which has been created to complement the existing Ensembl site, the focus of which is vertebrate genomes.
      EnsemblFungi This database is part of Ensembl Genomes, which has been created to complement the existing Ensembl site, the focus of which is vertebrate genomes.
      EnsemblMetazoa This database is part of Ensembl Genomes, which has been created to complement the existing Ensembl site, the focus of which is vertebrate genomes.
      EnsemblPlants This database is part of Ensembl Genomes, which has been created to complement the existing Ensembl site, the focus of which is vertebrate genomes.
      EnsemblProtists This database is part of Ensembl Genomes, which has been created to complement the existing Ensembl site, the focus of which is vertebrate genomes.
      EPD The Encyclopedia of Proteome Dynamics is a resource that contains data from multiple, large-scale proteomics experiments aimed at characterising proteome dynamics in both human cells and model organisms.
      ESTHER The server ESTHER (ESTerases and alpha/beta-Hydrolase Enzymes and Relatives) is dedicated to the analysis of proteins or protein domains belonging to the superfamily of alpha/beta-hydrolases, exemplified by the cholinesterases.
      euHCVdb The European Hepatitis C Virus database
      VEuPathDB Eukaryotic Pathogen, Vector and Host Database Resources
      EvolutionaryTrace The Evolutionary Trace ranks amino acid residues in a protein sequence by their relative evolutionary importance.
      ExpressionAtlas Information on gene expression patterns under different biological conditions.
      FlyBase Drosophila genome database (FlyBase)
      Gene3D Database of structural assignments for genes (Gene3D)
      GeneCards GeneCards: human genes, protein and diseases
      GeneDB GeneDB pathogen genome database from Sanger Institute
      GeneID Database of genes from NCBI RefSeq genomes
      GeneReviews GeneReviews, a resource of expert-authored, peer-reviewed disease descriptions.
      GeneTree The phylogenetic gene trees that are available at https://www.ensembl.org/ and https://ensemblgenomes.org/.
      Genevisible Genevisible is a free resource to explore public expression data for a gene of interest and find the top five tissues, cell lines, cancers or perturbations in which it has the highest expression or response.
      GeneWiki GeneWiki, an initiative that aims to create seed articles for every notable human gene.
      GenomeRNAi Database of phenotypes from RNA interference screens in Drosophila and Homo sapiens.
      GlyConnect GlyConnect integrated glycodata platform
      GlyGen Computational and Informatics Resources for Glycoscience
      GO Gene Ontology (GO) database
      Gramene Comparative mapping resource for grains (Gramene)
      GuidetoPHARMACOLOGY An expert-driven guide to pharmacological targets and the substances that act on them
      HGNC Human gene nomenclature database (HGNC)
      HAMAP Database of microbial protein families (HAMAP)
      HOGENOM Homologous genes from fully sequenced organisms database
      HPA Human Protein Atlas
      IDEAL IDEAL database of Intrinsically Disordered proteins
      IMGT/GENE-DB IMGT genome database for vertebrate immunoglobulin and T-cell receptor genes
      InParanoid InParanoid: Eukaryotic Ortholog Groups
      IntAct Protein interaction database and analysis system (IntAct)
      InterPro Integrated resource of protein families, domains and functional sites (InterPro)
      IPI International Protein Index
      iPTMnet iPTMnet integrated resource for PTMs in systems biology context
      jPOST Japan Proteome Standard Repository/Database
      KEGG Kyoto encyclopedia of genes and genomes
      LegioList Legionella pneumophila (strains Paris and Lens) genome database
      Leproma Mycobacterium leprae genome database (Leproma)
      MaizeGDB Maize Genetics/Genomics Database (MaizeGDB)
      MalaCards MalaCards human disease database
      MassIVE Mass Spectrometry Interactive Virtual Environment
      MaxQB MaxQB - The MaxQuant DataBase
      MEROPS Peptidase database (MEROPS)
      MetOSite MetOSite database of experimentally confirmed sulfoxidized methionines
      MGI Mouse Genome Informatics Database (MGI)
      MIM Mendelian Inheritance in Man Database (MIM)
      MINT Molecular INTeraction database
      MoonDB A database of extreme multifunctional and moonlighting proteins
      MoonProt A database for moonlighting proteins
      neXtProt neXtProt, the human protein knowledge platform
      NIAGADS NIAGADS Genomics Database
      OGP USC-OGP 2-DE database
      OMA Identification of Orthologs from Complete Genome Data
      OpenTargets Target Validation Platform
      Orphanet Orphanet a database dedicated to information on rare diseases and orphan drugs
      OrthoDB Database of Orthologous Groups
      PANTHER Protein ANalysis THrough Evolutionary Relationships (PANTHER) Classification System
      PathwayCommons Pathway Commons web resource for biological pathway data
      PATRIC Pathosystems Resource Integration Center
      PaxDb A comprehensive absolute protein abundance database
      PCDDB The Protein Circular Dichroism Data Bank
      PDB 3D-macromolecular structure Protein Data Bank (PDB)
      PDBsum PDB sum
      PeptideAtlas PeptideAtlas
      PeroxiBase Peroxidase superfamilies database
      Pfam Pfam protein domain database
      PharmGKB The Pharmacogenetics and Pharmacogenomics Knowledge Base
      Pharos Pharos NIH Druggable Genome Knowledgebase
      PHI-base Pathogen-Host Interaction database
      PhosphoSitePlus Phosphorylation site database
      PhylomeDB Database for complete collections of gene phylogenies
      PIR Protein sequence database of the Protein Information Resource (PIR)
      PIRSF Protein classification system of PIR (PIRSF)
      PlantReactome Curated source of core pathways and reactions in plant biology
      PomBase Schizosaccharomyces pombe database
      PRIDE PRoteomics IDEntifications database
      PRINTS Protein Fingerprint database (PRINTS)
      PRO PRO provides an ontological representation of protein-related entities.
      ProMEX Protein Mass spectra EXtraction database
      PROSITE PROSITE protein domain and family database (see 3.15.114)
      Proteomes UniProt Proteomes database
      ProteomicsDB ProteomicsDB human proteome resource
      PseudoCAP Pseudomonas aeruginosa Community Annotation Project
      Reactome Curated resource of core pathways and reactions in human biology
      REBASE Restriction enzymes and methylases database (REBASE)
      RefSeq NCBI reference sequences
      REPRODUCTION-2DPAGE 2D-PAGE database from the Lab of Reproductive Medicine at the Nanjing Medical University
      RGD Rat Genome Database (RGD)
      RNAct RNAct Protein–RNA interaction prediction database
      SABIO-RK Biochemical Reaction Kinetics Database
      SASBDB Small Angle Scattering Biological Data Bank
      SFLD Structure-Function Linkage Database
      SGD Saccharomyces Genome Database (SGD)
      SignaLink A signaling pathway resource with multi-layered regulatory networks
      SIGNOR A signaling network open resource (SIGNOR)
      SMART Simple Modular Architecture Research Tool (SMART)
      SMR The SWISS-MODEL Repository (SMR)
      STRING STRING: functional protein association networks
      SUPFAM Superfamily database of structural and functional annotation
      SWISS-2DPAGE 2D-PAGE database from the Geneva University Hospital (SWISS-2DPAGE)
      SwissLipids SwissLipids knowledge resource for lipid biology
      SwissPalm SwissPalm database of S-palmitoylation events
      TAIR The Arabidopsis Information Resource (TAIR)
      TCDB Transport Classification Database
      TIGRFAMs TIGR protein family database (TIGRFAMs)
      TopDownProteomics TopDownProteomics is a resource from the Consortium for Top Down Proteomics that hosts top down proteomics data presenting validated proteoforms to the scientific community.
      TreeFam TreeFam database of animal gene trees
      TubercuList Mycobacterium tuberculosis H37Rv genome database (TubercuList)
      UCD-2DPAGE UCD-2DPAGE: University College Dublin 2-DE Proteome Database.
      UCSC UCSC genome browser
      UniLectin UniLectin database of carbohydrate-binding proteins
      UniPathway UniPathway: a resource for the exploration and annotation of metabolic pathways
      VGNC Vertebrate Gene Nomenclature Committee database
      World-2DPAGE The World-2DPAGE database
      WormBase A multi-species resource for nematode biology and genomics (WormBase)
      WBParaSite WormBase ParaSite (WBParaSite) resource for parasitic worms (helminths)
      Xenbase Xenopus laevis and tropicalis biology and genome database
      ZFIN Zebrafish Information Network genome database (ZFIN)

      The second field of the DR line, the 'RESOURCE_IDENTIFIER', is an unambiguous pointer to a record in the referenced resource.

      • For Allergome, Antibodypedia, ArachnoServer, Bgee, BioCyc, BioGRID, BioGRID-ORCS, CCDS, CDD, ChEMBL, CGD, CollecTF, ComplexPortal, ConoServer, CPTAC, CTD, DIP, DisGeNET, DisProt, DMDM, DNASU, DrugBank, EchoBASE, eggNOG, EvolutionaryTrace, FlyBase, Gene3D, GeneDB, GeneID, GeneReviews, GeneTree, GeneWiki, GenomeRNAi, GlyConnect, GO, GuidetoPHARMACOLOGY, HOGENOM, HPA, IDEAL, InterPro, KEGG, MEROPS, MGI, MIM, MINT, CLAE, neXtProt, NIAGADS, OpenTargets, Orphanet, OrthoDB, PANTHER, PATRIC, Pfam, PharmGKB, PHI-base, PIR, PlantReactome, PRINTS, PRO, Proteomes, ProteomicsDB, Reactome, REBASE, RefSeq, REPRODUCTION-2DPAGE, RGD, SGD, SMART, SUPFAM, SwissLipids, TAIR, TCDB, TIGRFAMs, TreeFam, UCSC, UniPathway, VGNC, World-2DPAGE, Xenbase or ZFIN the resource identifier is the accession number (also called the Unique Identifier in some databases) of the referenced entry.
      • For ABCD, BindingDB, BMRB, CarbonylDB, COMPLUYEAST-2DPAGE, CORUM, CPTC, DEPOD, DOSAC-COBS-2DPAGE, DrugCentral, ELM, EPD, ExpressionAtlas, GlyGen, InParanoid, IntAct, iPTMnet, jPOST, MassIVE, MaxQB, MetOSite, MoonDB, MoonProt, OGP, PathwayCommons, PaxDb, PCDDB, PeptideAtlas, Pharos, PhosphoSitePlus, PhylomeDB, PRIDE, ProMEX, RNAct, SABIO-RK, SASBDB, SignaLink, SIGNOR, SMR, STRING, SWISS-2DPAGE, SwissPalm, TopDownProteomics, UCD-2DPAGE or UniLectin the resource identifier is the UniProtKB accession number.
      • For Araport, BioMuta, dictyBase, ESTHER, GeneCards, Genevisible, IMGT/GENE-DB, MalaCards, PeroxiBase, PomBase or PseudoCAP, the resource identifier is the official gene name.
      • For BRENDA, the resource identifier is an EC number.
      • For ChiTaRS, the resource identifier is a gene name.
      • For CAZy, the resource identifier is the CAZy family number.
      • For Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, Gramene, WormBase and WBParaSite, the resource identifier is a transcript identifier.
      • For VEuPathDB, the primary identifier is a combination of the child database name and the accession number in this database. Both are concatenated by a ':'.
      • For HGNC, the resource identifier is the unique identifier assigned by the HUGO Gene Nomenclature Committee.
      • For HAMAP, the resource identifier is the unique identifier of a HAMAP signature.
      • For PDB, the resource identifier is the entry name.
      • For PDBsum, the resource identifier is the PDB entry name.
      • For PIRSF, the resource identifier is the protein family number.
      • For OMA, the primary identifier consists of an OMA group fingerprint.
      • For MaizeGDB, the resource identifier is the 'Gene-product' accession ID.
      • For LegioList, Leproma or TubercuList, the resource identifier is the genome Open Reading Frame (ORF) code.
      • For euHCVdb, the resource identifier is an EMBL accession number.

      The third field of the DR line, the 'OPTIONAL_INFORMATION_1', is used to provide optional information.

      • For RefSeq this field is the nucleotide sequence identifier.
      • For CDD, InterPro, PANTHER, Pfam, PIR, PRINTS, REBASE, SFLD, SMART, SUPFAM or TIGRFAMs, this field is the entry name.
      • For PDB, this field is the structure determination method, which is controlled vocabulary that currently includes: X-ray (for X-ray crystallography), NMR (for NMR spectroscopy), EM (for electron microscopy and cryo-electron diffraction), Fiber (for fiber diffraction), IR (for infrared spectroscopy), Model (for predicted models) and Neutron (for neutron diffraction).
      • For dictyBase, FlyBase, CGD, HGNC, MGI, PomBase, RGD, SGD, Xenbase, VGNC or ZFIN, this field is the gene designation. If the gene designation is not available, a dash ('-') is used.
      • For GO, this field is a 1-letter abbreviation for one of the 3 ontology aspects, separated from the GO term by a colon. If the term is longer than 46 characters, the first 43 characters are indicated followed by 3 dots ('...'). The abbreviations for the 3 distinct aspects of the ontology are P (biological Process), F (molecular Function), and C (cellular Component).
      • For HAMAP, this field contains the HAMAP entry name for a protein family.
      • For Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, Gramene, WormBase and WBParaSite, this field is a protein identifier.
      • For ESTHER and PIRSF, this field is the protein family name.
      • For BioGRID-ORCS, this field indicates the number of hits in CRISPR screens.
      • For IntAct and BioGRID, this field indicates the number of interactors.
      • For MIM, this field distinguishes between MIM "gene" and "phenotype" entries. Note that some MIM entries describe both a gene and a phenotype. In such a case, this field indicates "gene+phenotype".
      • For Allergome, this field is the name of the allergen.
      • For ABCD, Antibodypedia and CPTC, this field is the number of antibodies.
      • For ArachnoServer and ConoServer, this field is the name of the toxin.
      • For CAZy, this field is the CAZy family name.
      • For Orphanet, this field is the name of the disease caused by defects in the protein.
      • For ChiTaRS and UCSC, this field is the organism name.
      • For BRENDA and Genevisible, this field is an organism code.
      • For eggNOG, this field is the taxonomic scope.
      • For GlyConnect and GlyGen, this field contains details about the glycans and glycosylation sites.
      • For PeroxiBase, this field is a name given by the database curators, based on organism and peroxidase classification.
      • For Reactome and PlantReactome, this field is the name of the pathway.
      • For UniPathway, this field is the identifier of the reaction.
      • For DrugBank, this field is a drug generic name for which the protein is a target.
      • For TCDB, this field is the transport classification family name.
      • For Bgee and ExpressionAtlas, this field is the expression pattern.
      • For TAIR, this field is the TAIR locus name (AGI number).
      • For MoonDB, this field indicates the entry type ("Curated" or "Predicted").
      • For RNAct, this field indicates the molecule type (usually "protein").
      • For Pharos, this field indicates the development/druggability level.
      • For BindingDB, BioCyc, BMRB, CarbonylDB, CCDS, ChEMBL, COMPLUYEAST-2DPAGE, CollecTF, CORUM, CPTAC, CTD, DEPOD, DIP, DisGeNET, DisProt, DMDM, DNASU, DOSAC-COBS-2DPAGE, DrugCentral, EchoBASE, ELM, EPD, VEuPathDB, EvolutionaryTrace, Gene3D, GeneCards, GeneDB, GeneID, GeneReviews, GeneTree, GeneWiki, GenomeRNAi, GuidetoPHARMACOLOGY, HOGENOM, HPA, InParanoid, iPTMnet, jPOST, KEGG, LegioList, Leproma, MaizeGDB, MalaCards, MassIVE, MaxQB, MEROPS, MetOSite, MINT, MoonProt, CLAE, neXtProt, NIAGADS, OMA, OGP, OpenTargets, OrthoDB, PathwayCommons, PATRIC, PaxDb, PCDDB, PDBsum, PeptideAtlas, PharmGKB, PHI-base, PhosphoSitePlus, PhylomeDB, PRIDE, PRO, ProMEX, ProteomicsDB, PseudoCAP, REPRODUCTION-2DPAGE, SABIO-RK, SASBDB, SignaLink, SIGNOR, SMR, STRING, SWISS-2DPAGE, SwissLipids, SwissPalm, TAIR, TopDownProteomics, TreeFam, TubercuList, UCD-2DPAGE, UniLectin and World-2DPAGE, this field is not used and a dash ('-') is displayed in that field.
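The truncation rule given for the GO third field in the list above can be written down in a few lines. This is a minimal sketch with a hypothetical function name; it assumes the 46-character limit applies to the GO term alone, not to the aspect letter and colon that precede it.

```python
def format_go_term_field(aspect, term):
    """Build the third DR field for GO: '<aspect>:<term>', truncating long terms."""
    if aspect not in ("P", "F", "C"):
        raise ValueError(f"unknown GO aspect: {aspect!r}")
    # Assumption: the length limit is measured on the term only.
    if len(term) > 46:
        term = term[:43] + "..."
    return f"{aspect}:{term}"
```

Under this reading a truncated term is always exactly 46 characters long (43 kept plus 3 dots), so the field never exceeds 48 characters including the aspect prefix.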

      A number of DR lines possess a fourth field, the 'OPTIONAL_INFORMATION_2', which is used to provide further optional information.

      • For the protein domain/family or structural databases CDD, Gene3D, HAMAP, PANTHER, Pfam, PIRSF, SFLD, SMART, SUPFAM and TIGRFAMs, this field is the number of hits found in the sequence.
      • For Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, Gramene, WormBase and WBParaSite, this field is a gene identifier.
      • For GO, this field is a 3-character GO evidence code. The GO evidence code is followed by the source database from which the cross-reference was obtained, separated by a colon. The definitions of the evidence codes are: IDA=inferred from direct assay, IMP=inferred from mutant phenotype, IGI=inferred from genetic interaction, IPI=inferred from physical interaction, IEP=inferred from expression pattern, TAS=traceable author statement, NAS=non-traceable author statement, IC=inferred by curator, ISS=inferred from sequence or structural similarity.
      • For PDB, this field indicates the resolution of structures that were determined by X-ray crystallography or electron microscopy.
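The GO fourth field described in the list above (evidence code, colon, source database) can be decoded with a small lookup table. The sketch below uses only the evidence codes defined in the text; the function name and the sample value `IDA:UniProtKB` are illustrative assumptions.

```python
# Evidence codes and definitions as listed in the text above.
GO_EVIDENCE = {
    "IDA": "inferred from direct assay",
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "IEP": "inferred from expression pattern",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "IC": "inferred by curator",
    "ISS": "inferred from sequence or structural similarity",
}


def parse_go_evidence_field(field):
    """Split '<code>:<source>' and return (code, definition, source)."""
    code, _, source = field.strip().partition(":")
    if code not in GO_EVIDENCE:
        raise ValueError(f"evidence code not in the table above: {code!r}")
    return code, GO_EVIDENCE[code], source or None
```

Codes that are not in the list above are rejected rather than guessed, so the table would need extending before the function is applied to real entries that use other codes.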

      A number of DR lines possess a fifth field, the 'OPTIONAL_INFORMATION_3', which is used to provide further optional information.

      • For PDB, this field indicates the chain(s) and the corresponding range for which the structure has been determined. If the range is unknown, a dash is given rather than the range positions (e.g. 'A/B=-.'); if both the chains and the range are unknown, a single dash is used.
      • For WormBase, this field indicates the gene designation.
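
      The field layout described above can be illustrated with a small parser. This is a minimal sketch, assuming that fields are separated by semicolons and that the line ends with a period; the `DRLine` type and `parse_dr_line` function are hypothetical names, and the sample lines are illustrative only. It does not handle isoform identifiers appended in square brackets.

```python
from typing import NamedTuple, Tuple


class DRLine(NamedTuple):
    resource: str              # resource abbreviation, e.g. "PDB"
    identifier: str            # primary identifier in that resource
    optional: Tuple[str, ...]  # OPTIONAL_INFORMATION_1..3, in order


def parse_dr_line(line: str) -> DRLine:
    """Split one DR line into its semicolon-separated fields (hypothetical helper)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    # The terminating period is not part of the last field.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a resource and an identifier")
    return DRLine(fields[0], fields[1], tuple(fields[2:]))


# Illustrative PDB cross-reference: method, resolution, chains and range.
example = parse_dr_line("DR   PDB; 1AAP; X-ray; 1.50 A; A/B=287-344.")
print(example.resource, example.identifier, example.optional)
```

      For a PDB entry, `optional[1]` then holds the resolution (the fourth field) and `optional[2]` the chains and range (the fifth field).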

      Some of the resources to which we link contain information that is specific to an isoform sequence and where this is known, we indicate the corresponding UniProtKB isoform sequence identifier in our DR lines:


      Methods and materials

      TripletRes is a deep-learning based contact-map prediction method consisting of three consecutive steps (Fig 1). It first creates a deep MSA and extracts three coevolutionary matrix features. Next, the feature sets are fed into three sets of deep ResNets and trained in an end-to-end fashion. Finally, a symmetric matrix distance histogram probability is created and binarized into the contact-map prediction.

      MSA generation

      To help offset overfitting, TripletRes creates MSAs using different strategies for training and test protein sequences. For training proteins, MSAs are created by HHblits, searching the Uniclust30 (2017_10) database [43] for 3 iterations with an E-value threshold of 0.001 and a minimum sequence coverage of 40%.

      For test proteins, the DeepMSA pipeline [30] was utilized to generate MSAs. The initial MSA is also created by HHblits but followed up with multiple iterations. If the Neff value of the initial MSA is lower than a given threshold (= 128, decided by trial and error), a second step is performed using jackhmmer [44] through UniRef90 (release-2017_12) [45]. Here, Neff measures the number of effective sequences in the MSA and is defined as:

      $$N_{\mathrm{eff}} = \sum_{n=1}^{N} \frac{1}{1 + \sum_{m=1,\, m \neq n}^{N} I[S_{m,n} > 0.8]} \tag{1}$$

      where $N$ is the total number of sequences in the MSA, and $I[S_{m,n} > 0.8] = 1$ if the sequence identity $S_{m,n}$ between sequences $m$ and $n$ is over 0.8, or $= 0$ otherwise. To assist the MSA concatenation, the jackhmmer hits are converted into an HHblits format sequence database, against which a second HHblits search is performed. In case Neff is still below 128, a third iteration is performed by hmmsearch [44] through MetaClust (2017_05) [46], where the final MSA is pooled from all iterations (see S3 Fig for the whole MSA construction pipeline).
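
      The Neff definition above can be sketched in a few lines. This is a minimal illustration, not the DeepMSA implementation: it assumes that sequence identity is the fraction of alignment columns with identical, non-gap residues (the text does not specify how identity is computed), and the function names are hypothetical.

```python
from typing import Sequence


def sequence_identity(a: str, b: str) -> float:
    """Fraction of aligned columns with identical non-gap residues (assumed definition)."""
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must have the same non-zero length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def neff(msa: Sequence[str], threshold: float = 0.8) -> float:
    """Number of effective sequences: each sequence is down-weighted by its neighbours."""
    total = 0.0
    for n, seq_n in enumerate(msa):
        # Count the other sequences whose identity to sequence n is over the threshold.
        similar = sum(
            1
            for m, seq_m in enumerate(msa)
            if m != n and sequence_identity(seq_m, seq_n) > threshold
        )
        total += 1.0 / (1 + similar)
    return total


# Toy alignment: two identical sequences share one unit of weight, the third counts fully.
toy_msa = ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY"]
print(neff(toy_msa))
```

      The pairwise loop is quadratic in the number of sequences, which is fine for a toy alignment but would need clustering or vectorisation for deep MSAs.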

      Coevolutionary feature extraction

      Three sets of coevolutionary features are extracted from the deep MSAs. First, the covariance (COV) feature measures the marginal dependency between different sequential positions and is calculated by

      $$S_{i,j}^{a,b} = f_{i,j}(a,b) - f_i(a)\, f_j(b) \tag{2}$$

      where $f_i(a)$ is the frequency of a residue $a$ at position $i$ of the MSA, and $f_{i,j}(a,b)$ is the co-occurrence frequency of two residue types $a$ and $b$ at positions $i$ and $j$.
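
      As a worked illustration of the covariance feature, the sketch below computes a single entry from raw column counts. It assumes unweighted frequencies with no pseudocounts or sequence reweighting, which the actual pipeline may well apply; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def column_frequencies(msa: Sequence[str], i: int) -> Dict[str, float]:
    """Frequency of each residue type in column i."""
    counts = Counter(seq[i] for seq in msa)
    return {residue: count / len(msa) for residue, count in counts.items()}


def pair_frequencies(msa: Sequence[str], i: int, j: int) -> Dict[Tuple[str, str], float]:
    """Co-occurrence frequency of residue pairs in columns i and j."""
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {pair: count / len(msa) for pair, count in counts.items()}


def covariance(msa: Sequence[str], i: int, j: int, a: str, b: str) -> float:
    """Co-occurrence frequency of (a, b) minus the product of the single-site frequencies."""
    f_i = column_frequencies(msa, i)
    f_j = column_frequencies(msa, j)
    f_ij = pair_frequencies(msa, i, j)
    return f_ij.get((a, b), 0.0) - f_i.get(a, 0.0) * f_j.get(b, 0.0)


# Two perfectly co-varying columns: A pairs with A, C pairs with C.
toy_msa = ["AA", "CC"]
print(covariance(toy_msa, 0, 1, "A", "A"))  # positive: the pair co-occurs more than chance
print(covariance(toy_msa, 0, 1, "A", "C"))  # negative: the pair never co-occurs
```

      Repeating this for all 21 × 21 residue-type pairs at every pair of positions gives the full feature matrix.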

      The COV feature captures marginal correlations among variables, which contain transitional correlations. The negative of the inverse of the covariance matrix, i.e., the precision matrix, can be interpreted as the mean-field approximation of the Potts model [12] and thus can capture direct couplings. In this work, a ridge regularized precision matrix (PRE), Θ, is estimated by minimizing the regularized negative log-likelihood function [27,47]

      $$\mathcal{L}(\Theta) = \mathrm{tr}(S\Theta) - \log|\Theta| + \rho\,\|\Theta\|_2^2 \tag{3}$$

      where the first two terms are the negative log-likelihood of Θ assuming that the data follow a multivariate Gaussian distribution; $\mathrm{tr}(S\Theta)$ is the trace of matrix $S\Theta$; $\log|\Theta|$ is the log determinant of Θ; and $\rho\,\|\Theta\|_2^2$ is the regularization function of Θ to avoid over-fitting, with $\rho = e^{-6}$ being a positive regularization hyper-parameter.

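      The last feature, which was first introduced by plmConv [48], is the raw coupling parameter matrix of the inverse Potts model approximated by PLM. Instead of assuming the data follow a multivariate Gaussian distribution, PLM approximates the probability of a sequence for the Potts model with

      $$P(\sigma^m) \approx \prod_{l=1}^{L} P\left(\sigma_l = \sigma_l^m \mid \sigma_{\setminus l} = \sigma_{\setminus l}^m\right) \tag{4}$$

      Here, $P(\sigma^m)$ is the probability model for the m-th sequence in the MSA and $P(\sigma_l = \sigma_l^m \mid \sigma_{\setminus l} = \sigma_{\setminus l}^m)$ is the marginal probability of the l-th position in the sequence, given by

      $$P\left(\sigma_l = \sigma_l^m \mid \sigma_{\setminus l} = \sigma_{\setminus l}^m\right) = \frac{\exp\left(h_l(\sigma_l^m) + \sum_{k \neq l} J_{lk}(\sigma_l^m, \sigma_k^m)\right)}{\sum_{q=1}^{21} \exp\left(h_l(q) + \sum_{k \neq l} J_{lk}(q, \sigma_k^m)\right)} \tag{5}$$

      where h and J are single site and coupling parameters, respectively. In TripletRes, the raw coupling parameter matrix J is used as the PLM feature.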

      Thus, each feature is represented by a 21*L by 21*L matrix for a protein sequence with L amino acids. The entries of the 21 by 21 sub-matrix of a corresponding amino acid pair are the descriptors, which are fed into a convolutional transformer as conducted by a fully convolutional neural network with residual architecture (Fig 1).
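
      To make the PLM feature more concrete, the sketch below evaluates the per-position probability of a Potts model in its standard pseudo-likelihood (softmax) form. It is an illustration only: the data layout (residues as integer states, `h` as a list of per-position lists, `J` as a dictionary keyed by position pairs) and the function names are assumptions, and no parameter fitting is shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

Couplings = Dict[Tuple[int, int], List[List[float]]]


def coupling(J: Couplings, l: int, k: int, a: int, b: int) -> float:
    """Look up J_lk(a, b); only pairs with l < k are stored (assumed layout)."""
    if l < k:
        return J[(l, k)][a][b]
    return J[(k, l)][b][a]


def conditional_probability(
    seq: Sequence[int], l: int, h: List[List[float]], J: Couplings, q: int
) -> float:
    """Probability of the observed state at position l given all other positions."""

    def energy(state: int) -> float:
        total = h[l][state]
        for k, other in enumerate(seq):
            if k != l:
                total += coupling(J, l, k, state, other)
        return total

    weights = [math.exp(energy(state)) for state in range(q)]
    return weights[seq[l]] / sum(weights)


# Toy model: 2 positions, 3 states, no couplings, a field favouring state 0 at position 0.
q = 3
h = [[math.log(2.0), 0.0, 0.0], [0.0, 0.0, 0.0]]
J = {(0, 1): [[0.0] * q for _ in range(q)]}
print(conditional_probability([0, 1], 0, h, J, q))
```

      Multiplying these per-position probabilities over all positions gives the pseudo-likelihood of one sequence; the fitted J matrix is what the text describes as the PLM feature.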

      Deep neural-network modeling

      TripletRes implements residual neural networks (ResNets) [29] as the deep learning model. Compared to traditional convolutional networks, ResNets add feedforward neural networks to an identity map of the input, which helps enable the efficient training of extremely deep neural networks such as the one used in TripletRes. As illustrated in Fig 1, the neural network structure of TripletRes has four sets of residual blocks, where three of them are connected to the input layer for feature extraction. Each of the three ResNets has 24 basic blocks and can learn layered features based on the specific input. After transforming each input feature into a feature map of 64 channels, we concatenate the transformed features along the feature channel and employ another deep ResNet containing 24 residual blocks to learn the fused information from the three features.

      The activation function of the last layer is a softmax function which outputs the probability of each residue pair belonging to specific distance bins. Here, the residue-residue distance is split into 10 intervals spanning 5–15 Å with an additional two bins representing distances less than 5 Å and more than 15 Å, respectively. The whole set of deep ResNets is trained under the supervision of the maximum likelihood of the prediction, where the loss function is defined as the sum of the negative log-likelihood over all the residue pairs of the training proteins:

      $$\mathrm{Loss} = -\sum_{t=1}^{T} \sum_{k=1}^{12} y_t^k \log p_t^k \tag{6}$$

      Here, $T$ is the total number of residue pairs in the training set; $y_t^k = 1$ if the distance of the t-th residue pair in the native structure falls into the k-th distance interval, and $y_t^k = 0$ otherwise; $p_t^k$ is the predicted probability that the distance of the t-th residue pair falls into the k-th distance interval. The probability of the t-th residue pair forming a contact, $P_t$, is the sum of the first 4 distance bins:

      $$P_t = \sum_{k=1}^{4} p_t^k \tag{7}$$
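
      The binning, loss and contact probability described above can be sketched as follows. This is a minimal illustration under stated assumptions: the ten intervals between 5 and 15 Å are taken to be equal 1 Å bins, a distance of exactly 15 Å is placed in the last bin, and the function names are hypothetical. With these assumptions the first four bins cover distances below 8 Å.

```python
import math
from typing import Sequence

N_BINS = 12  # one bin below 5 A, ten 1 A bins from 5 to 15 A, one bin above 15 A


def distance_to_bin(distance: float) -> int:
    """Map a residue-residue distance in angstroms to a 0-based bin index."""
    if distance < 5.0:
        return 0
    if distance >= 15.0:
        return N_BINS - 1
    return 1 + int(distance - 5.0)


def contact_probability(bin_probs: Sequence[float]) -> float:
    """Sum of the first four bins, i.e. the predicted probability of a contact."""
    return sum(bin_probs[:4])


def negative_log_likelihood(
    predictions: Sequence[Sequence[float]], true_distances: Sequence[float]
) -> float:
    """Sum over residue pairs of -log(probability assigned to the true bin)."""
    loss = 0.0
    for probs, distance in zip(predictions, true_distances):
        loss -= math.log(probs[distance_to_bin(distance)])
    return loss


# A uniform prediction assigns 1/12 to every bin.
uniform = [1.0 / N_BINS] * N_BINS
print(contact_probability(uniform))
print(negative_log_likelihood([uniform, uniform], [6.5, 20.0]))
```

      Because the target is one-hot, the double sum over pairs and bins reduces to one log term per residue pair, which is what `negative_log_likelihood` computes.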

      The training process uses dropout to avoid over-fitting, where the dropout rate is set to 0.2. We use Adam [49], an adaptive stochastic gradient descent algorithm, to optimize the loss function. TripletRes implements deep ResNets using PyTorch [50] and was trained using the Extreme Science and Engineering Discovery Environment (XSEDE) [39].


      Drug–protein adducts: past, present, and future

      Research over the past half-century has demonstrated that the metabolism of drugs and other foreign compounds to chemically reactive intermediates that bind covalently to endogenous proteins can, in certain cases, lead to organ damage or to immune-mediated adverse reactions. While the chemistry of metabolic activation is now relatively well understood, the molecular events that link exposure to reactive metabolites to toxic sequelae remain ill-defined. In particular, the role of covalent protein binding in drug-induced toxicities is unclear, and has been a controversial issue in drug discovery efforts. In this article, the covalent binding of drugs and other xenobiotics to proteins is reviewed from a historical perspective, and the evolution of the field is traced from an early dependence on radiolabeled tracers that provided limited information on adduct structure to contemporary label-free approaches based on powerful chemical proteomics and mass spectrometry methodology that provide a global perspective on proteome reactivity. Currently evolving databases of the proteins targeted by safe and toxic xenobiotics likely will lead in the future to predictive algorithms that can be exploited by medicinal chemists to assist in the design of safe therapeutic agents.



      Abstract

      Natural recombination combines pieces of preexisting proteins to create new tertiary structures and functions. We describe a computational protocol, called SEWING, which is inspired by this process and builds new proteins from connected or disconnected pieces of existing structures. Helical proteins designed with SEWING contain structural features absent from other de novo designed proteins and, in some cases, remain folded at more than 100°C. High-resolution structures of the designed proteins CA01 and DA05R1 were solved by x-ray crystallography (2.2 angstrom resolution) and nuclear magnetic resonance, respectively, and there was excellent agreement with the design models. This method provides a new strategy to rapidly create large numbers of diverse and designable protein scaffolds.

      Most efforts in de novo protein design have been focused on creating idealized proteins composed of canonical structural elements. Examples include the design of coiled coils, repeat proteins, TIM barrels, and Rossmann folds (1–6). These studies elucidate the minimal determinants of protein structure, but they do not aggressively explore new regions of structure space. Additionally, idealized structures may not always be the most effective starting points for engineering novel protein functions. Functional sites in proteins are often created from nonideal structural elements, such as kinks, pockets, and bulges.

      The lack of nonideal structural elements from de novo designed proteins highlights a key difference between natural protein evolution and current design methods. Specifically, protein design methods universally begin with a target structure in mind. Therefore, the space of designable structures that can accommodate these nonideal protein elements is limited by the imagination of the designer. In contrast, natural evolution is based not on design but on cellular fitness provided by the evolved protein function. This lack of a predetermined target fold is a powerful feature of protein evolution that holds significant potential for the design of novel structures and functions. In an effort to tap this potential, we sought to develop a method of computational protein design inspired by mechanisms of natural protein evolution.

      Gene duplication and homologous recombination mix and match elements of protein structure to give rise to new structures and functions (7–9). This phenomenon is most evident at the level of independently folding protein domains (10–12), but recent studies have shown that these same principles function at a smaller scale during the evolution of distinct, globular protein folds (13). Insertions, deletions, and replacement of secondary and supersecondary structural elements sample alternative tertiary structures (14–16). Our design strategy, called SEWING (structure extension with native-substructure graphs), is motivated by this process and builds new protein structures from pieces of naturally occurring protein domains. The process is not dictated by the need to adopt a specific target fold but rather is aimed at creating large sets of alternative structures that satisfy predefined design requirements. One of the strengths of this approach is that it ensures that all of the structural elements of the protein are inherently designable, at the same time allowing for the incorporation of structural oddities unlikely to be found in idealized proteins. Here, we apply SEWING to the design of helical proteins. We show that designed structures are diverse and contain structural features absent from alternative design strategies.

      SEWING begins with the extraction of small structural motifs, or substructures, from existing protein structures. These serve as the basic building blocks for all generated models. We aimed to identify substructures that were large enough to carry information regarding structural preference yet small enough to allow combinations that can generate novel globular structures. Ultimately, we chose to extract two distinct types of substructures. The first is composed of continuous stretches of protein structure that encompass two secondary structural elements separated by a loop (Fig. 1). These substructures capture the relative orientation between adjacent secondary structure elements and maintain local packing interactions. In addition, there is evidence that substructures of this size adopt a relatively limited number of conformations that have already been sampled exhaustively in known protein structures (14). The second type is composed of groups of three to five secondary structural elements, where each element makes van der Waals contacts with every other element, but the elements are not necessarily continuous in primary sequence (Fig. 1, supplementary methods). Nonadjacent, or discontinuous, substructures maintain longer-range tertiary interactions that provide valuable stability and are often conserved during protein evolution (17).

      The goal of SEWING is to combine and modify these extracted components in order to develop new tertiary structures. Naturally occurring homologous recombination, in which sequence similarity between DNA molecules leads to the combination of the genetic material, guides the formation of new protein chimeras. This process enriches for proteins that are more likely to be well folded and functional, as sequence similarity filters for segments that are structurally compatible. In the case of SEWING, we know the three-dimensional structures of the building blocks; therefore, we can directly use structural information to probe which substructures are well suited for combination. During SEWING, continuous substructures are eligible for combination if the C-terminal region of one substructure shares high structural similarity with the N-terminal region of another substructure and if superposition of the two regions does not create any steric clashes between other regions in the two substructures. This type of superposition ensures that the three-dimensional spacing between all pairs of secondary structural elements adjacent in primary sequence is similar to that observed in the Protein Data Bank (PDB). During discontinuous SEWING, combinations are created by superimposing two elements (helices in this study) from one substructure with two elements from another substructure. For both continuous and discontinuous SEWING, structure similarity is identified by using a fast geometric hashing approach that ensures that the regions of interest can align with low root mean square deviation (RMSD) (18).
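
      The two eligibility criteria described above (low RMSD over the overlapping region, no steric clashes elsewhere) can be illustrated with a short sketch. It is not the geometric hashing approach used by SEWING: it assumes the coordinates have already been superimposed, and the function names and the `min_distance` cutoff parameter are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root mean square deviation between two equally sized, pre-superimposed coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must have the same non-zero length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def has_steric_clash(
    atoms_a: Sequence[Point], atoms_b: Sequence[Point], min_distance: float
) -> bool:
    """True if any atom of one region lies closer than min_distance to an atom of the other."""
    return any(math.dist(a, b) < min_distance for a in atoms_a for b in atoms_b)


# Two overlapping regions offset by 1 A along z give an RMSD of 1 A.
region_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
region_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(region_a, region_b))
```

      In such a scheme, a pair of substructures would be accepted only if `rmsd` over the shared region is below a chosen cutoff and `has_steric_clash` is false for the non-overlapping regions.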

      Once pairwise structural similarity is calculated between all substructures, these data are used to generate a large graph (Fig. 1). The nodes in this graph represent the substructures, and the edges indicate a level of structural similarity that allows recombination. Novel structures are generated from this graph by traversing a path wherein each followed edge adds new structural elements to the design model. The number of edges included in the sampled paths can control the approximate size of the generated structures. Unlike previously described methods of de novo backbone generation, no target structure is required, and output structures span a diverse set of globular folds.
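
      The graph traversal described above can be sketched with a toy adjacency list. This is a minimal illustration, not the SEWING implementation: the node labels and the `enumerate_paths` function are hypothetical, edges are treated as directed (one substructure's C-terminal region matching the next one's N-terminal region), and revisiting a substructure within a path is disallowed as a simplifying assumption.

```python
from typing import Dict, Iterator, List


def enumerate_paths(graph: Dict[str, List[str]], n_edges: int) -> Iterator[List[str]]:
    """Yield every path that follows exactly n_edges edges without repeating a node."""

    def extend(path: List[str]) -> Iterator[List[str]]:
        if len(path) == n_edges + 1:
            yield list(path)
            return
        for neighbour in graph.get(path[-1], []):
            if neighbour not in path:
                path.append(neighbour)
                yield from extend(path)
                path.pop()

    for start in graph:
        yield from extend([start])


# Toy graph: each key is a substructure, each value lists substructures it can be joined to.
toy_graph = {"s1": ["s2", "s3"], "s2": ["s3"], "s3": ["s4"], "s4": []}
for path in enumerate_paths(toy_graph, 3):
    print(" -> ".join(path))
```

      A path with n edges combines n + 1 substructures, so the number of edges followed controls the approximate size of the generated model.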

      Previous studies have demonstrated that protein fragments can adopt alternative structures when placed in new environments (19–21), and thus, similar to natural evolution, the next step in the design process was to further stabilize the protein through mutagenesis. This optimization step was achieved using iterative steps of side-chain packing and backbone minimization available in the Rosetta molecular-modeling suite (22). Preference for the amino acid sequence present in the parental substructure was used to better preserve the structural interactions inherent to the parent substructures.

      To test SEWING, we designed a diverse set of helical proteins using graphs composed of continuous substructures or discontinuous substructures. Continuous and discontinuous substructures were extracted from nonredundant subsets of the PDB (23, 24). In total, 33,928 continuous substructures and 4584 discontinuous substructures were extracted. Design models from the continuous graph were generated by using three-edge paths and were therefore composed of substructures extracted from four existing structures from the PDB (Fig. 1). The continuous graph contained 345 million edges, which allowed an estimated 7 × 10^16 backbones that can be filtered and optimized in later design steps (supplementary methods). Initially, 50,000 alternative tertiary structures were created and used as templates for rotamer-based sequence optimization and energy minimization. These models were filtered and sorted by using metrics that evaluate predicted energy (normalized by sequence length), side-chain packing, buried polar groups, and sequence-structure agreement (supplementary methods) (25). When examining the models, we noticed that the naïve SEWING procedure was biased toward creating low–contact order models, i.e., structures with few contacts between residues distant in primary sequence. To overcome this bias, we filtered for models with contact orders more representative of naturally occurring helical proteins (fig. S1). We have subsequently demonstrated that Monte Carlo sampling of the SEWING graph with a score function that favors long-range contacts can be used to build high-contact models with high frequency (fig. S1). This illustrates one way that directed sampling of the SEWING graph can be used to enforce design requirements.
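
      The contact-order filter mentioned above can be illustrated with the commonly used relative contact order: the mean sequence separation of contacting residue pairs divided by the chain length. The text does not state which definition was used, so this is an assumption, and the function name is hypothetical. Contacts are supplied as pairs of residue indices.

```python
from typing import Sequence, Tuple


def relative_contact_order(contacts: Sequence[Tuple[int, int]], length: int) -> float:
    """Mean sequence separation of contacting residue pairs, normalized by chain length."""
    if not contacts or length <= 0:
        raise ValueError("need at least one contact and a positive chain length")
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (len(contacts) * length)


# Two contacts with separations 4 and 8 in a 20-residue chain.
print(relative_contact_order([(1, 5), (2, 10)], 20))
```

      Models dominated by contacts between residues close in sequence score low on this measure, which is the bias the filtering step is described as correcting.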

      In total, 11 designs based on continuous SEWING were selected for experimental characterization (table S1). Each region of the final designs shared between 45 and 65% sequence identity with the substructure that they were built from (figs. S2 and S3), but when performing a BLAST search with the full-length sequences, no matches were identified that aligned over the full length of the proteins. Eight designs expressed well in Escherichia coli and were readily purified from the soluble fraction of 1-liter cultures. Four of the eight proteins were monomeric as seen by size-exclusion chromatography–multiangle light scattering, had a circular dichroism (CD) spectrum characteristic of a helical protein, and unfolded cooperatively upon thermal denaturation (Fig. 2 and figs. S4 and S5). Two of the designed proteins are hyperthermophiles and require high concentrations of chemical denaturant in order for one to observe thermal unfolding (Fig. 2B). For one design, CA01, several thermodynamic parameters were determined by fitting a modified Gibbs-Helmholtz equation to the thermal and chemical denaturation surface (Fig. 3B and table S2) (26). The extrapolated melting temperature of 126°C places it among the top 0.01% of values in the ProTherm database of protein stabilities (27). The crystal structure of CA01 was solved to 2.2 Å and shows excellent agreement with the design model, with the RMSD of the atomic coordinates equaling 0.8 Å. Similarly, the side-chain packing of the protein core is nearly identical for the design model and the experimental structure (Fig. 3, fig. S6, and table S3).

      The structural variety in the design models for the well-folded proteins is of particular note (Fig. 2). The SEWING-generated models include kinked and curved helices (figs. S7 and S8), cavities and clefts (figs. S9 and S10), and a large range of helix-crossing angles (Fig. 2). The topologies of the SEWING models are varied when compared with previously designed α-helical proteins, which are restricted to coiled coils, repeat proteins, and up-down four-helical bundles (Fig. 2C). To compare SEWING models with naturally occurring protein structures, we searched for structurally similar domains using the Dali server (28). In general, large portions of the models aligned to regions of existing protein structures. However, the sequence identities across the alignments were typically below 20%, and the positions of the unaligned residues frequently diverged (fig. S11). For instance, the fifth helix of CA01 is shifted by 9 Å relative to the fifth helix in the top Dali match. These sequence and structural differences provide unique surfaces that may serve as templates for future design goals.

      To test discontinuous SEWING, models were generated from two-edge paths and, thus, were composed of structural elements from three parent structures. The variable number of helical elements in the discontinuous substructures therefore allowed design models to be composed of 5 to 11 helices. Unlike models from the continuous-substructure graph, discontinuous models require the addition of loops between consecutive helices. Loops were designed by using a database of fragments from the PDB (29). Each loop fragment was superimposed onto the design model and optimized using gradient-based minimization in Cartesian space. Any path that created structures for which no loop fragment could be found was eliminated from the set of designs. Design models were filtered and optimized in the same way as models from the continuous graph. In total, 10 were selected for experimental characterization (table S1).

      Of these 10 designs, 2 expressed at levels sufficient for purification. Both purified proteins were helical and folded, as evidenced by CD (Fig. 2 and fig. S4). Similar to the results from the continuous designs, one discontinuous design, DA03, demonstrated high thermostability, which required high levels of denaturant to completely unfold. For this design, a 181-residue six-helix bundle, unfolding appears to follow a three-state model (fig. S12).

      The structure of the other well-folded discontinuous design, DA05, was solved using nuclear magnetic resonance (NMR) spectroscopy, as the protein did not readily crystallize (figs. S13 and S14 and table S4). The first four helices of the design model match the lowest-energy member of the NMR ensemble very closely, with a Cα RMSD of 0.8 Å (Fig. 4 and fig. S6). However, the NMR data indicate that the final helix of the protein is disordered in solution. In an effort to identify the errors in the design model that led to the unstructured region, structural preference for the designed sequence was evaluated with fragment analysis as described previously (1). The fragments extracted for the unstructured region showed especially poor preference for the designed helical structure (fig. S15). We attempted to design a new final helix for the DA05 design using the continuous SEWING method. The final helix of the initial design model was removed, and the remainder of the model was added as a node to the continuous graph. New helices were evaluated by following a single edge from this new node (Fig. 4A). Three models designed in this way were selected for experimental testing. Two of the tested designs, DA05R1 and DA05R2, show a significant increase in melting temperature relative to the initial DA05 design (fig. S16). The NMR structure of DA05R1 shows that the newly designed helix adopts the designed conformation, which highlights the utility of combining the continuous and discontinuous graphs (Fig. 4B, figs. S17 and S18, and table S4).

      The additional step of loop building is a critical difference between discontinuous and continuous SEWING. The accurate design of loops is a long-standing challenge for protein design, and this additional step may have contributed to the relatively lower success rate observed for discontinuous SEWING. In contrast, continuous SEWING maintains the relative orientation between adjacent helices, which allows many of the designed loop sequences to be taken directly from the native substructure. The power of this strategy is seen in the high structural accuracy achieved for the loops in the CA01 design (Fig. 3 and fig. S2).

      Our results show that computational adaptations of basic evolutionary principles, such as recombination and mutation, can be used to design, accurately and rapidly, a diverse set of helical protein structures. The diversity of SEWING designs will further increase when alternative types of substructures are included, such as β-α motifs and β hairpins. Furthermore, discontinuous and continuous SEWING can be merged, as in the DA05R1 design, to create additional diversity. We anticipate that this structural diversity will be advantageous for functional design, as every backbone generated with SEWING has new surface and pocket features that provide potential binding sites for ligands or macromolecules. Additionally, SEWING offers an approach for stitching together functional motifs from naturally occurring proteins, an evolutionary mechanism to generate multifunctional proteins and allosteric systems.

      Fig. 1. (A) Continuous SEWING workflow for CA01. (B) Discontinuous SEWING workflow for DA05. From left to right: parental PDBs are shown with extracted substructures; graph schematic (colored nodes indicate substructures contained in the final design model, and superimposed structures show structural similarity indicated by adjacent edges); design model before sequence optimization and loop design; and final design models.

      Fig. 2. (A) Design models obtained with continuous (CA) and discontinuous (DA) SEWING. (B) Temperature denaturation curves for well-folded SEWING designs, colored to match design models. (C) A comparison of previously designed helical structures (black dots) to SEWING models (colored squares) demonstrates the structural complexity of SEWING designs. We calculated crossing angles between all pairs of helices in each structure. Crossing angles were considered unique if they differed by >20 degrees from all other calculated angles in the same structure. Variance in helix size describes the calculated variance in the number of residues per helix for all helices in a single structure. A complete list of helix and crossing-angle definitions for de novo designs can be found in tables S5 and S6.

      Fig. 3. (A) Backbone superimposition of the design model (green) and crystal structure (purple). (B) In the chemical and temperature denaturation experiment, a sharp unfolding transition is observed at 5 M GdHCl and 75°C. (C) Comparison of side-chain packing between the design model (green) and crystal structure (purple) at three different layers of the structure.

      Fig. 4. (A) From left to right: Backbone superimposition of the DA05 design model (orange) with a member of the NMR ensemble (blue). An example of a continuous substructure graph for the design of a new final helix onto DA05. Superimposition of three design models containing new helices. (B) From left to right: Backbone superimposition of the DA05R1 design model (light blue) with a member of the DA05R1 NMR ensemble (blue). A comparison of side-chain packing between the DA05R1 design model and the NMR structure for DA05R1.

