Integrated System for Regulon Inference by Comparative Genomics Approach

RegPredict is a web-based computational platform for fast and accurate semi-automatic inference of regulons in well-populated groups of closely-related prokaryotic genomes.

RegPredict implements the whole lifecycle of regulon inference, with all components of the system well integrated. It provides two main strategies for the analysis of regulation:

  • Regulon inference based on a comprehensive collection of manually curated position weight matrices (PWMs).
  • De novo regulon inference, starting with a set of functionally linked genes that are potential members of a regulon.

RegPredict provides a highly interactive, rich graphical user interface with resizable panels, pull-down context menus, and intuitive navigation, which allows for detailed analyses of potential members of regulons and decision-making.

Over the past decade our group has focused on in silico reconstruction and manual curation of various metabolic regulons across large sets of bacterial genomes using comparative genomics techniques. RegPredict is designed to bring our accumulated experience in the computational analysis of transcriptional regulation to the scientific community, and to facilitate regulon inference in the constantly growing number of prokaryotic genomes by providing a highly integrated set of modules for computation and visual examination of results.

Two approaches to regulon inference

Regulon inference based on a known position weight matrix (PWM). The input of this module is a set of closely related genomes selected from a provided taxonomic tree, a PWM (RegPredict provides a comprehensive collection of manually curated PWMs from several resources, such as RegTransBase, RegulonDB, and RegPrecise), and parameters for scanning the genomes. Once all genomes in the set have been scanned with the selected PWM, clusters of co-regulated orthologous operons are computed, ranked, and provided as output for manual curation. The top-ranked operon clusters are the best candidates for true members of the regulon.

De novo regulon inference. The input of this module is a set of functionally linked genes that are potential members of a regulon. The module searches for motif profiles in the upstream regions of the selected genes, considering several different types of motif simultaneously (palindromes of different lengths, direct repeats, etc.). The list of candidate motif profiles, ranked by information content, is presented as output for subsequent testing. This module is tightly integrated with the module based on a known PWM, so each motif profile can be tested immediately. Once the best motif is found, a detailed characterization of the regulon can be performed with the first module.
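
Ranking candidate profiles by information content can be illustrated with a short sketch. It assumes a common definition of information content for DNA motifs with a uniform background (2 bits minus the Shannon entropy of each column, summed over columns). The exact formula used by RegPredict is not stated here, and the function name and toy profiles are hypothetical.

```python
import math
from collections import Counter


def information_content(sites):
    """Total information content (bits) of aligned, equal-length sites.

    Assumes a uniform background over A, C, G, T. This definition is an
    assumption of the sketch, not a documented RegPredict formula.
    """
    n = len(sites)
    total = 0.0
    for k in range(len(sites[0])):
        counts = Counter(site[k] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


# Hypothetical candidate profiles, each given by its aligned sites.
candidates = {
    "direct_repeat": ["ACGTACGT", "ACTTACGA", "TCGTAAGT"],
    "palindrome": ["TGTGACGTCACA", "TGTGACGTCACA", "TGTGATGTCACA"],
}

# Highest information content first.
ranked = sorted(
    candidates,
    key=lambda name: information_content(candidates[name]),
    reverse=True,
)
print(ranked)
```

Under this definition longer profiles tend to score higher, which is one reason the per-position value is often reported alongside the total.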

To facilitate the analysis of putative regulon members, RegPredict uses the concept of a cluster of co-regulated orthologous operons. All operons with potential transcription factor binding sites are organized into a set of independent, self-sufficient clusters. Each cluster can be investigated separately, so that the analysis focuses on a small subset of genes when deciding whether the genes in the cluster are true members of the regulon.

The procedure of building clusters of co-regulated orthologous operons:

For a given transcription factor motif, search for candidate transcription factor binding sites in the upstream regions of genes in all genomes under analysis (three genomes are considered in the figure). Consider each operon as a vertex in a graph.

Join two vertices by an edge if the corresponding operons i) are orthologous and ii) both have candidate transcription factor binding sites. Collect all linked components. Two operons from two different genomes are called orthologous if they share at least one orthologous gene.

Extend each linked component by adding operons without candidate transcription factor binding sites that are orthologous to one of the operons in the linked component.
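
The three steps above amount to a connected-components computation followed by an extension step. The following is a minimal sketch of that idea, not RegPredict's implementation: the `Operon` record, its field names, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Operon:
    # Hypothetical record; names and fields are illustrative only.
    name: str
    genome: str
    ortholog_ids: frozenset  # ortholog group IDs of the operon's genes
    has_site: bool           # True if a candidate TFBS was found upstream


def orthologous(a, b):
    """Operons from different genomes sharing at least one orthologous gene."""
    return a.genome != b.genome and bool(a.ortholog_ids & b.ortholog_ids)


def build_clusters(operons):
    # Step 1: operons are vertices; only those with a site can have edges.
    with_site = [op for op in operons if op.has_site]
    without_site = [op for op in operons if not op.has_site]

    # Step 2: join orthologous operons that both have candidate sites.
    adjacency = defaultdict(set)
    for i, a in enumerate(with_site):
        for b in with_site[i + 1:]:
            if orthologous(a, b):
                adjacency[a].add(b)
                adjacency[b].add(a)

    # Collect linked (connected) components with a depth-first search.
    clusters, seen = [], set()
    for start in with_site:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            op = stack.pop()
            if op in component:
                continue
            component.add(op)
            stack.extend(adjacency[op] - component)
        seen |= component
        clusters.append(component)

    # Step 3: add site-less operons orthologous to a member of the component.
    for component in clusters:
        core = list(component)
        component.update(
            op for op in without_site if any(orthologous(op, c) for c in core)
        )
    return clusters


# Toy data for three genomes (G1, G2, G3).
operons = [
    Operon("a1", "G1", frozenset({1, 2}), True),
    Operon("a2", "G2", frozenset({2}), True),
    Operon("a3", "G3", frozenset({1}), False),
    Operon("b1", "G1", frozenset({9}), True),
    Operon("c3", "G3", frozenset({7}), False),
]
clusters = build_clusters(operons)
```

In this toy run, `a1` and `a2` are linked because both have sites and share ortholog group 2, `a3` joins their cluster in the extension step, `b1` forms a cluster on its own, and `c3` belongs to no cluster.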

Select Genomes dialog

To start the regulon inference procedure, select a set of closely related genomes. Genomes available for analysis are manually organized into groups based on the MicrobesOnline species phylogenetic tree.

select genomes dialog img

Once genomes are selected, they become the default set for all subsequent analyses in this session.

Workspace for analysis of regulons

The main workspace for the analysis of regulons is organized as a set of panels, each visualizing a specific type of data. Together they provide the functionality needed to navigate across the automatically computed Clusters of co-Regulated Orthologous operoNs (CRONs) and to analyze each of them in depth.

regulon workspace image

Clusters of co-regulated orthologous operons panel

The left-most panel provides a list of all automatically calculated clusters of co-regulated orthologous operons. Each row in the table corresponds to a particular cluster and provides general statistics about the cluster.

operon clusters panel image

Table columns:

  • ID - identifier of the cluster
  • Genomes - the number of genomes in a cluster in which at least one potential TFBS above the selected threshold was found. For example, if there are 5 genomes under analysis, and in a given cluster TFBSs were found in the upstream regions of genes from only two genomes, then Genomes = 2. A higher number means stronger support from comparative genomics.
  • Operons - the total number of operons in a cluster
  • Genes - the total number of genes in a cluster
  • Sites - the total number of TFBSs in a cluster
  • Max score - maximum score among all TFBSs in a cluster
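
As an illustration of how these columns relate to the underlying data, the sketch below summarizes one cluster. The input layout (a list of operon dictionaries with `genome`, `genes`, and `site_scores` keys) is a hypothetical structure chosen for the example, and counting only sites above the threshold in "Sites" and "Max score" is an assumption.

```python
def cluster_row(cluster_id, operons, threshold):
    """Summarize one cluster as a table row.

    Each operon is a dict with hypothetical keys:
    'genome' (str), 'genes' (list of str), 'site_scores' (list of float).
    """
    # Sites scoring above the selected threshold, tagged with their genome.
    hits = [
        (op["genome"], score)
        for op in operons
        for score in op["site_scores"]
        if score > threshold
    ]
    return {
        "ID": cluster_id,
        "Genomes": len({genome for genome, _ in hits}),
        "Operons": len(operons),
        "Genes": sum(len(op["genes"]) for op in operons),
        "Sites": len(hits),
        "Max score": max((score for _, score in hits), default=None),
    }


# Toy cluster: sites above the threshold are found in two of three genomes.
example = [
    {"genome": "G1", "genes": ["a", "b"], "site_scores": [5.2]},
    {"genome": "G2", "genes": ["a"], "site_scores": [4.8, 3.0]},
    {"genome": "G3", "genes": ["a", "b", "c"], "site_scores": []},
]
row = cluster_row(1, example, threshold=4.0)
```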

Clusters are shown page by page. To move to the next page, use the navigator buttons in the bottom toolbar of the panel.

The list of clusters can be sorted by any column. Click on the header of the desired column and select either "Sort Ascending" or "Sort Descending". By default all clusters are sorted first in descending order by the "Genomes" column and then in descending order by the "Max score" column, so that the clusters with the most probable members of the regulon are shown at the top of the list.
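
The default ordering corresponds to a two-key descending sort. A minimal sketch, assuming each cluster is represented as a dictionary keyed by the column names (a hypothetical layout):

```python
clusters = [
    {"ID": 1, "Genomes": 2, "Max score": 6.1},
    {"ID": 2, "Genomes": 5, "Max score": 4.9},
    {"ID": 3, "Genomes": 5, "Max score": 5.7},
]

# Sort by "Genomes" first and "Max score" second, both descending.
ranked = sorted(
    clusters,
    key=lambda c: (c["Genomes"], c["Max score"]),
    reverse=True,
)
ranked_ids = [c["ID"] for c in ranked]
```

Here cluster 1 has the highest single score but ranks last, because support across genomes takes precedence over the score.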

Click on a particular cluster in the list to start the in-depth analysis of that cluster.

Genomics context panel

The central panel shows a graphical representation of the selected cluster of co-regulated orthologous operons in its genomic context. Each line in the panel corresponds to a particular genome.

genomics context panel image

Genes

Genes are shown as colored rectangles. The length of a rectangle is proportional to the length of the gene. Orthologous genes are shown in the same color. If a particular gene does not have orthologs in the cluster, it is uncolored (depicted by the border of a rectangle only).

Operons

Genes are organized into automatically predicted putative operons. Each operon is shown as a larger underlying rectangle with a bluish background color. A particular cluster can have more than one operon in a genome.

Sites

Circles show predicted transcription factor binding sites. RegPredict utilizes two types of sites:

  • Major sites (shown by orange circles) - TFBSs whose score is greater than the selected threshold
  • Minor sites (shown by dark color) - TFBSs whose score is less than the selected threshold, but greater than 90% of the selected threshold.

By default only major sites are shown. To show or hide minor sites, use the menu Filter->Show Minor Sites. Analyzing minor sites makes it possible to check for weak sites that may still be true positive TFBSs, and may lead to the decision to select a more tolerant score threshold.
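
The distinction between major and minor sites can be written as a small classification rule. This is a sketch under two assumptions that the text leaves open: the threshold is positive, and a score exactly equal to the threshold is treated as minor. The function name is hypothetical.

```python
def classify_site(score, threshold, minor_fraction=0.9):
    """Return 'major', 'minor', or None for a candidate TFBS score.

    Assumes a positive threshold; a score equal to the threshold is
    treated as minor here, which is a choice made for this sketch.
    """
    if score > threshold:
        return "major"
    if score > minor_fraction * threshold:
        return "minor"
    return None


labels = [classify_site(s, threshold=4.0) for s in (5.0, 3.8, 3.0)]
```

With a threshold of 4.0, minor sites fall in the score range above 3.6 and up to 4.0.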

Additional navigation

The bottom toolbar of the central panel has additional buttons "Previous Cluster" and "Next Cluster", which allow navigation across operon clusters without using the "Clusters of orthologous operons" panel.

Export of the results

The Export button on the bottom toolbar of the central panel exports the operon cluster in a text format.

The export file contains a detailed description of genes (gene ID in MicrobesOnline, locus tag, gene name, ortholog ID in MicrobesOnline, function note) and putative TFBSs (site sequence, score, and position relative to the start of the following gene) for each analyzed genome (listed with the genome name and its NCBI taxonomy ID). Genes are grouped into putative operons.

Clicks/Hover over/Hot Keys

Detailed information about the genes in a particular operon can be obtained by clicking on the operon; a summary of the operon is then shown in the bottom "Genes properties" panel.

Details about a particular gene or TFBS can also be obtained by hovering over it.

Before exporting an operon cluster, the user may want to exclude a set of genes from the export. This can be the case when an expert sees that the predicted operon structure is wrong and a few genes should be excluded. To exclude such genes, click on a gene while holding down the Alt key. The selected gene and all genes to its right will be marked in white and excluded from the subsequent export. In this way a whole operon can be excluded from the export with a single click on its first gene.
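
The effect of an Alt-click on the exported gene list can be pictured as splitting the operon at the clicked gene. The sketch below is only an illustration of that rule; the function name and the representation of an operon as a list of gene names in display order are assumptions.

```python
def split_at_clicked_gene(operon_genes, clicked_gene):
    """Return (kept, excluded) gene lists after an Alt-click.

    operon_genes is assumed to be in left-to-right display order.
    """
    i = operon_genes.index(clicked_gene)
    return operon_genes[:i], operon_genes[i:]


kept, excluded = split_at_clicked_gene(["geneA", "geneB", "geneC", "geneD"], "geneC")
# Clicking the first gene excludes the whole operon.
none_kept, all_excluded = split_at_clicked_gene(["geneA", "geneB"], "geneA")
```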

Filtering

The "Filter" menu provides several options designed to help in the analysis of an operon cluster, especially when the cluster is very large.

  • Show Operons With Sites Only - when selected, only operons that have predicted sites are shown.
  • Show Ortologs of Selected Operon Only - when selected, only genes that are orthologous to one of the genes from the currently selected operon are colored.
  • Show Minor Sites - when selected, minor sites are shown in all site-related panels: "Genomic context", "Genes properties", and "Sites" panels.

Genes properties and Sites panels

The bottom part of the workspace has two tabbed panels: Gene properties and Site summary.

The Gene properties panel shows a table with detailed information about all genes of the selected operon. The order of genes in the table is the same as the order of genes in the operon shown in the "Genomics context" panel.

genes properties panel image

Table columns:

  • Colored rectangle - the color of gene used in "Genomics context" panel
  • VIMSS Id - identifier of gene in VIMSS system
  • Dist to prev - distance to the previous gene on the chromosome (in nucleotides)
  • Coordinates/Length - the strand (=> positive, <= negative ), the coordinates, and the length of a gene (in nucleotides)
  • Notes - Name, locus tag and other types of gene annotations. The information about all potential TFBSs found in upstream region of a gene is shown in this column as well.
  • Links - the links to MicrobesOnline database: MO:Gene - link to central gene web page, MO:Domain - link to web page with all known domains of a gene, MO:Tree - link to phylogenetic tree of gene homologs.

The Site summary panel provides a table listing all sites in the operon cluster. The content varies depending on whether Filter->Show Minor Sites is selected.

site properties panel image

Table columns:

  • Genome - the name of genome
  • Gene - locus tag of the gene in whose upstream region the TFBS was found
  • Position - position of TFBS relative to the translation start of gene
  • Score - score of TFBS
  • Sequence - the sequence of found TFBS with flanking sequence

When a particular operon is selected in the "Genomics context" panel, the corresponding TFBSs are highlighted in the "Sites" panel.

Run Profile dialog

The Run Profile dialog allows one to scan the genomes with a selected PWM profile using optional parameters and to automatically generate a set of candidate CRONs.

genes properties panel image

The collection of PWMs available for the analysis is classified by the source database and taxonomy, where each motif profile is described by its name, length, training set size, information content per nucleotide position, and consensus. Optional parameters for the genome scanning with a PWM are the upstream-region intervals to be searched and the threshold for the TFBS score, which by default is set to the minimal score among all sites in the training set for a given PWM. In addition to the collection of PWM profiles available in RegPredict, the user may submit any set of pre-aligned TFBSs in the FASTA format as a training set for a new PWM (use Sequences tab).
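
The relationship between a training set of pre-aligned sites, the resulting PWM, and the default score threshold can be sketched as follows. The weighting scheme used here (log2 odds against a uniform background with a pseudocount) is a generic textbook choice and an assumption of this example; the formula RegPredict actually uses is not specified in this text. All function names are hypothetical, and sites are assumed to contain only A, C, G, and T.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Build a log-odds PWM from pre-aligned sites of equal length."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be pre-aligned to the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for k in range(length):
        column = [site[k].upper() for site in sites]
        pwm.append({
            # Smoothed frequency divided by the uniform background (0.25).
            base: math.log2((column.count(base) + pseudocount) / total / 0.25)
            for base in BASES
        })
    return pwm


def score(pwm, site):
    """Sum of position-specific weights for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, site.upper()))


def default_threshold(pwm, training_sites):
    """Minimal score among all sites in the training set."""
    return min(score(pwm, site) for site in training_sites)


training = ["TTGACA", "TTGACA", "TTTACA", "TTGATA"]
pwm = build_pwm(training)
threshold = default_threshold(pwm, training)
```

Choosing the minimal training-set score as the default means that every site used to build the PWM scores at or above the threshold, while unrelated sequences such as `CCCCCC` score well below it.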

To build the consensus, the difference between the two highest weights is calculated for each position. An uppercase letter is shown if the difference is greater than 0.2, a lowercase letter if the difference is greater than 0.1, and otherwise the position is depicted by the 'n' symbol.
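
The consensus rule can be expressed in a few lines. In this sketch each position is a dictionary mapping a base to its weight (a hypothetical representation), and the letter shown is assumed to be the base with the highest weight at that position.

```python
def consensus(pwm, upper_gap=0.2, lower_gap=0.1):
    """Build a consensus string from per-position base weights."""
    letters = []
    for column in pwm:
        ranked = sorted(column.items(), key=lambda item: item[1], reverse=True)
        (best_base, best), (_, second) = ranked[0], ranked[1]
        gap = best - second  # difference between the two highest weights
        if gap > upper_gap:
            letters.append(best_base.upper())
        elif gap > lower_gap:
            letters.append(best_base.lower())
        else:
            letters.append("n")
    return "".join(letters)


# Toy weights for three positions.
toy_pwm = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},  # gap 0.85
    {"A": 0.40, "C": 0.25, "G": 0.20, "T": 0.15},  # gap 0.15
    {"A": 0.30, "C": 0.25, "G": 0.25, "T": 0.20},  # gap 0.05
]
result = consensus(toy_pwm)
```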

Discover Profile dialog

The Discover Profile dialog allows the user to infer candidate motifs in a set of DNA fragments provided in FASTA format, and to generate the input set of upstream gene regions automatically from various gene sets.

genes properties panel image

To facilitate the procedure of profile discovery, the user may select several types of profiles and a range of profile lengths to be searched simultaneously.

For each motif type the user specifies a set of parameters for the motif discovery procedure, including the motif length, the minimal number of palindromic site positions, the size of the training set to be included in the profile, and the minimum number of GC pairs among the palindromic site positions.
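
To make the two palindrome-related parameters concrete, the sketch below counts them for a single site. It reflects one possible reading of the terms, not a documented definition: a "palindromic position" is taken to be a pair of symmetric positions whose bases are complementary, and a "GC pair" is such a pair formed by G and C. The function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def palindromic_pairs(site):
    """Return (complementary symmetric pairs, G/C pairs among them)."""
    site = site.upper()
    n = len(site)
    # Pair position i with its mirror position n - 1 - i.
    pairs = [(site[i], site[n - 1 - i]) for i in range(n // 2)]
    matched = [(a, b) for a, b in pairs if COMPLEMENT.get(a) == b]
    gc_pairs = [pair for pair in matched if pair[0] in "GC"]
    return len(matched), len(gc_pairs)


perfect = palindromic_pairs("TTGACGTCAA")
imperfect = palindromic_pairs("TTGACATCAA")
```

A candidate site could then be kept only if both counts reach the minimums chosen in the dialog.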

genes properties panel image

Gene cart

Gene cart is used to work with a collection of genes selected by the user. Genes can be added to the Gene cart either from the Gene properties panel or from additional Gene search dialogs. Once a collection of genes is compiled, it can be used as a training set in the de novo regulon inference Discover Profile workflow.

genes properties panel image

In the Gene cart dialog one can export sets of genes to a text file. Each gene is described by genome name, identifier of gene in MicrobesOnline database (VIMSS Id), name, locus tag, identifier of ortholog, and functional annotation.

Site cart

Site cart provides a tool for the comparative analysis of any subset of TFBSs selected during the regulon analysis session.

genes properties panel image

Site cart provides export of all TFBS sequences to a file, and can be used in the Run Profile procedure.

Summary info panel

The right-most panel contains a set of tabbed panes with different types of summary information.

Selected genomes

The list of currently selected genomes, the phylum, and the name of the genome group are shown in this tabbed pane.

selected genomes panel image

Profile run parameters

The parameters used to scan the genomes with the PWM are shown in this tabbed pane.

profile run parameters image

Functional annotations

This tabbed pane provides a general overview of the functional annotations of genes in the selected operon cluster. The genes are organized into orthologous groups depicted by the same color. This view also makes it possible to check the consistency of the functional annotation of orthologous genes.

functional annotations panel image