SPeakerDataset

All Single-Peak locations (TSSs from the set of 2399 CAGE Tag Clusters from which all training and test sets were derived):
FASTA file (gzipped) containing the sequence from 5 kb upstream to 5 kb downstream of each TSS
/sites/default/files/files/mouse_ALL_single_TSS_TCs_-5000_5000_fa(1).gz
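A minimal sketch of reading the gzipped FASTA file in Python, using only the standard library. The function names parse_fasta and read_gzipped_fasta are hypothetical, and the header format of the records is not specified here, so the header line is returned unparsed.

```python
import gzip


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        yield from parse_fasta(handle)
```

Calling read_gzipped_fasta on a local copy of the file yields one record per Tag Cluster sequence without decompressing the file to disk first.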

Training Sets (CAGE Tag Cluster IDs):
Annotation-Supported_TrainingSet.ids.txt
CAGE-Only-Supported_TrainingSet.ids.txt
CpG-Island_TrainingSet.ids.txt
Non-CpG-Island_TrainingSet.ids.txt

Test Sets (CAGE Tag Cluster IDs):
Annotation-Supported_TestSet.ids.txt
CAGE-Only-Supported_TestSet.ids.txt

Cross-validation Sets:
Each archive (tar-gzip) contains three subdirectories: TSS_set, IGC_set, and CDS_set, holding the positive data, negative intergenic data, and negative CDS data, respectively. Each subdirectory contains 10 FASTA files, one for each of the 10 cross-validation parts. The sequence for each example spans (-250, +50) relative to the example location. Note that because TSS and corresponding upstream intergenic examples must be extracted from mm5 (the genome build of the original CAGE Tag mappings), sequences will occasionally contain "N" characters at nucleotides that had not yet been identified in that build.
Annotation-Supported_Crossval.tgz
CAGE-Only-Supported_Crossval.tgz
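
The 10 parts in each subdirectory lend themselves to a standard 10-fold scheme in which each part is held out once. The Python sketch below illustrates that scheme under stated assumptions: the parts are already loaded as lists of sequences, and the helper names crossval_splits and n_fraction are hypothetical. It is not necessarily the procedure used to produce the original results. The n_fraction helper is one way to spot sequences affected by unidentified mm5 nucleotides.

```python
def crossval_splits(parts):
    """Yield (train, test) pairs, holding out each part exactly once.

    `parts` is a list of lists, one inner list per cross-validation part.
    """
    for i, held_out in enumerate(parts):
        train = [rec for j, part in enumerate(parts) if j != i for rec in part]
        yield train, list(held_out)


def n_fraction(seq):
    """Return the fraction of 'N' characters in a sequence (0.0 if empty)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0
```

With 10 parts, crossval_splits yields 10 splits, each training on 9 parts and testing on the remaining one.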

Notes:
A descriptor file containing the Tag Cluster IDs, genomic locations for the highest TSS in a cluster, and other detailed information can be downloaded at http://fantom31p.gsc.riken.jp/cage_analysis/export/mm5/tss_summary.tsv.bz2.
