Paired-End Analysis of Transcription Start Sites in Arabidopsis Reveals Plant-Specific Promoter Signatures

Publication Online:

About this study:
Understanding plant gene promoter architecture has long been a challenge, owing to the lack of large-scale data and analysis methods. In this study we present a publicly available, large-scale transcription start site (TSS) dataset in plants, produced with a high-resolution method for analyzing the 5’ ends of mRNA transcripts. Our dataset is generated using the Paired-End Analysis of Transcription Start Sites (PEAT) protocol and provides millions of TSS locations from wild-type Col-0 Arabidopsis whole-root samples. Using this quality-filtered dataset, we first group TSS locations into “tag clusters” and categorize the different shapes taken by their distributions. We then design a high-resolution machine-learning model that predicts the presence of a TSS tag cluster with an auROC near 0.98 for each cluster shape. We use this model to analyze the transcription factor binding site content of different promoter shapes. We find that while the canonical distinction between sharp, narrow-peak TATA-containing promoters and broader “TATA-less” promoters has some merit, the model shows that a large compendium of known DNA sequence binding elements is necessary and sufficient for accurate promoter prediction for all tag cluster shapes. These elements form promoter signatures for transcription initiation. We present precise results on the usage of these elements and provide our Plant PEAT Peaks (3PEAT) model, which predicts the presence of PEAT tag clusters directly from sequence.

Online Data Access

Supplementary Materials

Supplementary Tables

  1. ROE.xls: ROE Tables for all the models
  2. GO.xls: Unique GO terms by peak shape
  3. DataSetStatistics.xls: Counts of tag clusters in each peak shape dataset used for 3PEAT model
  4. LogRegCoef.xls: Model Coefficients
  5. TATAlessPeaks.xls: TATA+/TATA- Data
  6. 3PEAT_Model_AnnotatedPeaks.xls: The locations, gene annotations, and initiation patterns of called peaks used to build the 3PEAT model.
  7. CrossValidationPerformance_With_Regparams.xls: auROC and auPRC statistics and L1 regularization parameter of each cross-validation fold for each 3PEAT model.
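
The per-fold auROC values reported in the cross-validation table summarize how well a classifier ranks true tag clusters above background sequences. As a minimal sketch of what that statistic measures, and not the code used in the study, the function below computes auROC from labels and scores using the rank-sum (Mann-Whitney U) identity. The function name `auroc` and the example inputs are illustrative assumptions.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 class labels (1 = positive, e.g. a tag cluster).
    scores: iterable of classifier scores, higher = more likely positive.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # U statistic for the positives, normalized by the number of pos/neg pairs.
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Illustrative (made-up) labels and scores, not data from the study.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 1.0 means every positive outscores every negative, and 0.5 corresponds to random ranking.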

Raw Data

Deposited under NCBI SRA Accession SRR1425301.

Genome Scans

Scans of 8 kb genomic regions surrounding TSS Test Sets by peak shape.

Additional Supplementary Data

  1. RawAnnotatedPEATPeaks.xlsx: The locations, gene annotations, and initiation patterns for raw, unfiltered PEAT peak calls.
  2. AnnotatedPEATPeaks.xlsx: The locations, gene annotations, and initiation patterns of all final called peaks in PEAT data.
  3. AnnotatedPEATPeaks.txt: The locations, gene annotations, and initiation patterns of all final called peaks in PEAT data as a GFF file.
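
For readers who want to work with the GFF version of the peak calls programmatically, the following is a minimal sketch of a reader for the standard nine-column, tab-separated GFF layout. It assumes the file follows that convention; the contents of the attributes column are not described here, so the sketch keeps that column as a raw string. The names `GffRecord` and `parse_gff` and the sample line are hypothetical, not part of the released tools.

```python
from typing import Iterable, Iterator, NamedTuple


class GffRecord(NamedTuple):
    seqid: str       # chromosome or contig name
    source: str      # program or data source that produced the feature
    feature: str     # feature type
    start: int       # 1-based start coordinate
    end: int         # 1-based inclusive end coordinate
    score: str       # kept as text because "." marks a missing score
    strand: str      # "+", "-", or "."
    phase: str       # "0", "1", "2", or "."
    attributes: str  # raw attribute column, left unparsed


def parse_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Yield one GffRecord per data line, skipping blanks and '#' comments."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(
                f"line {line_number}: expected 9 tab-separated columns, "
                f"got {len(fields)}"
            )
        seqid, source, feature, start, end, score, strand, phase, attrs = fields
        yield GffRecord(seqid, source, feature, int(start), int(end),
                        score, strand, phase, attrs)


# Hypothetical sample line for illustration only; not taken from the dataset.
sample = ["##gff-version 3\n", "Chr1\tPEAT\tpeak\t100\t150\t.\t+\t.\tID=peak1\n"]
records = list(parse_gff(sample))
```

To read the actual file, pass an open file handle, for example `parse_gff(open("AnnotatedPEATPeaks.txt"))`.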

The 3PEAT-TFBS-Scanner and 3PEAT Model Scanner software tools developed in this paper are made freely available under a non-commercial license.


Morton T, Petricka J, Corcoran DL, Li S, Winter CM, Carda A, Benfey PN, Ohler U, Megraw M. (2014). Paired-end analysis of transcription start sites in Arabidopsis reveals plant-specific promoter signatures. Plant Cell, 26:2746-60.