Read more about the clustering process that yields our curated datasets of poly(A) sites.

A clustering procedure has been implemented to group together closely spaced poly(A) sites that most likely result from imprecision in cleavage or processing.

For this, individual cleavage positions in the genome are sorted, first by the number of samples in which a site has been identified and, for equal numbers of samples, by the total tags per million (TPM).

The list is then traversed from most supported sites to least supported ones and the clustering procedure is applied. We have determined that the number of clusters decreases rapidly up to a distance of 12 nucleotides around the most used cluster representative. Thus, we constructed clusters of sites by grouping sites with lower read support that were located from -12 to +12 nucleotides around sites with strong support.
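The ranking and the first clustering pass described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Site` record, its field names and the function `cluster_sites` are hypothetical, and the sketch assumes that sites are only grouped when they lie on the same chromosome and strand.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical record for one cleavage position.
    chrom: str
    strand: str
    pos: int
    n_samples: int  # number of samples in which the site was identified
    tpm: float      # total tags per million


def cluster_sites(sites, max_dist=12):
    """Greedy clustering: strongly supported sites absorb nearby weaker ones."""
    # Rank by number of supporting samples, then by total TPM (both descending).
    order = sorted(range(len(sites)),
                   key=lambda i: (-sites[i].n_samples, -sites[i].tpm))
    assigned = [False] * len(sites)
    clusters = []
    for i in order:
        if assigned[i]:
            continue  # already absorbed by a better supported site
        rep = sites[i]
        members = []
        for j in order:
            other = sites[j]
            if (not assigned[j]
                    and other.chrom == rep.chrom
                    and other.strand == rep.strand
                    and abs(other.pos - rep.pos) <= max_dist):
                assigned[j] = True
                members.append(other)  # includes the representative itself
        clusters.append((rep, members))
    return clusters


example = [
    Site("I", "+", 100, 5, 10.0),
    Site("I", "+", 108, 2, 1.0),
    Site("I", "+", 113, 1, 0.5),
    Site("I", "-", 105, 3, 2.0),
    Site("I", "+", 121, 4, 3.0),
]
for rep, members in cluster_sites(example):
    print(rep.pos, rep.strand, sorted(m.pos for m in members))
```

The quadratic inner loop keeps the sketch short; a real implementation would more likely index sites by position.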

Clusters that were flagged as 'putative internal priming clusters' (because one of their poly(A) sites resides within an A-rich poly(A) signal) were retained if

  a) they shared a poly(A) signal with a downstream non-IP (internal priming) cluster, in which case they were merged into the downstream cluster, or
  b) their most downstream poly(A) signal was at least 15 nucleotides upstream of the most distal poly(A) site.

In another pass, sites that were located within 25 nucleotides of each other were clustered together. Finally, for clusters with no annotated poly(A) signals, a more permissive distance of 50 nucleotides was used in clustering.


Caenorhabditis elegans - version: 2-0 (WBcel235)

Date of release: 2019-09-24

Total reads: 67,268,436

Protocols: PAT-seq, 3P-Seq

Number of poly(A) site clusters: 20,931

Percentage of clusters with poly(A) signal: 81
List of poly(A) signals.

Poly(A) signals are considered if they reside in the region from 60 nt upstream to 10 nt downstream of one of the poly(A) sites of a cluster:

  1. AATAAA
  2. ATTAAA
  3. TATAAA
  4. AGTAAA
  5. AATACA
  6. CATAAA
  7. AATATA
  8. GATAAA
  9. AATGAA
  10. AATAAT
  11. AAGAAA
  12. ACTAAA
  13. AATAGA
  14. ATTACA
  15. AACAAA
  16. ATTATA
  17. AACAAG
  18. AATAAG
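
As an illustration of the search window, the sketch below scans a sequence for the 18 hexamers listed above. It makes several assumptions that the text does not specify: the sequence is given in transcript orientation (already reverse-complemented for minus-strand sites), the window covers the 60 nt before the site up to 10 nt after it, a motif must lie entirely inside the window, and the reported offset is the motif start relative to the site. The function name `find_signals` is hypothetical.

```python
POLYA_SIGNALS = [
    "AATAAA", "ATTAAA", "TATAAA", "AGTAAA", "AATACA", "CATAAA",
    "AATATA", "GATAAA", "AATGAA", "AATAAT", "AAGAAA", "ACTAAA",
    "AATAGA", "ATTACA", "AACAAA", "ATTATA", "AACAAG", "AATAAG",
]


def find_signals(seq, site_index, upstream=60, downstream=10):
    """Return (motif, offset) pairs found in the window around a poly(A) site.

    seq        -- sequence in transcript orientation (assumption)
    site_index -- 0-based index of the poly(A) site within seq
    offset     -- motif start minus site_index (negative means upstream)
    """
    start = max(0, site_index - upstream)
    end = min(len(seq), site_index + downstream)
    window = seq[start:end].upper()
    hits = []
    for motif in POLYA_SIGNALS:
        pos = window.find(motif)
        while pos != -1:
            hits.append((motif, start + pos - site_index))
            pos = window.find(motif, pos + 1)  # also catch overlapping hits
    return sorted(hits, key=lambda hit: hit[1])


# Toy sequence with one AATAAA placed 28 nt upstream of the site at index 68.
toy = "C" * 40 + "AATAAA" + "C" * 22 + "G" * 20
print(find_signals(toy, 68))
```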

Cluster annotations

Terminal exon: 13,885
Other exon: 660
Intron: 567
Downstream of terminal exon: 5,023
Antisense exon: 78
Antisense intron: 326
Antisense upstream of a gene: 85
Intergenic: 307

Add as custom track @ UCSC genome browser

Atlas BED file

We follow the standard BED specification with 0-based coordinates and append extra columns.

The columns represent:

first - chromosome name

second and third - start and end positions of the poly(A) site cluster, respectively

fourth - unique cluster ID, composed of the chromosome name, the representative poly(A) site of the cluster and the strand. Note that this ID format is inspired by UCSC's position format, which uses 1-based coordinates, whereas the second and third columns use 0-based BED coordinates. Thus, to convert the position of the representative site to BED coordinates, subtract 1.

fifth - average expression (tags per million, tpm) across all samples

sixth - strand on which the cluster is encoded

seventh - percentage of samples that support the particular cluster

eighth - number of different 3' end sequencing protocols that support the particular cluster

ninth - average expression (tags per million, tpm) across all samples

tenth - two letter code for the cluster annotation (in order of decreasing priority: TE, terminal exon; EX, exonic; IN, intronic; DS, 1,000 nt downstream of an annotated terminal exon; AE, anti-sense to an exon; AI, anti-sense to an intron; AU, 1,000 nt upstream in anti-sense direction of a transcription start site; IG, intergenic)

eleventh - information about the poly(A) signal(s) that are present upstream of the poly(A) site, including the motif, the location with respect to the cleavage site and the genomic coordinate
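
The coordinate conversion described for the fourth column can be sketched as below. The exact ID layout is an assumption: the sketch treats the ID as `CHROM:POSITION:STRAND` separated by colons, and the ID `I:1001046:+`, the example line and all function names are hypothetical.

```python
def parse_cluster_id(cluster_id):
    """Split an ID assumed to look like 'CHROM:POSITION:STRAND'."""
    # rsplit keeps chromosome names that themselves contain ':' intact.
    chrom, pos, strand = cluster_id.rsplit(":", 2)
    return chrom, int(pos), strand


def representative_as_bed(cluster_id):
    """Return the representative site as a 0-based, half-open BED interval."""
    chrom, pos_1based, strand = parse_cluster_id(cluster_id)
    start = pos_1based - 1  # 1-based ID position -> 0-based BED start
    return chrom, start, start + 1, strand


def representative_inside_cluster(bed_line):
    """Check that the representative site lies within columns two and three."""
    fields = bed_line.rstrip("\n").split("\t")
    cluster_start, cluster_end = int(fields[1]), int(fields[2])
    _, rep_start, rep_end, _ = representative_as_bed(fields[3])
    return cluster_start <= rep_start and rep_end <= cluster_end


print(representative_as_bed("I:1001046:+"))  # ('I', 1001045, 1001046, '+')

# Made-up line holding only the first six columns.
line = "I\t1001040\t1001050\tI:1001046:+\t2.5\t+"
print(representative_inside_cluster(line))
```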

Alternatively, you can download our atlas with average TPMs for all contributing samples as a tab separated file:

Atlas with Samples TSV


The columns represent:

first - chromosome name

second and third - start and end positions of the poly(A) site cluster, respectively

fourth - unique cluster ID, composed of the chromosome name, the representative poly(A) site of the cluster and the strand. Note that this ID format is inspired by UCSC's position format, which uses 1-based coordinates, whereas the second and third columns use 0-based BED coordinates. Thus, to convert the position of the representative site to BED coordinates, subtract 1.

fifth - average expression (tags per million, tpm) across all samples

sixth - strand on which the cluster is encoded

seventh - representative poly(A) site of the cluster

eighth - percentage of samples that support the particular cluster

ninth - number of different 3' end sequencing protocols that support the particular cluster

tenth - two letter code for the cluster annotation (in order of decreasing priority: TE, terminal exon; EX, exonic; IN, intronic; DS, 1,000 nt downstream of an annotated terminal exon; AE, anti-sense to an exon; AI, anti-sense to an intron; AU, 1,000 nt upstream in anti-sense direction of a transcription start site; IG, intergenic)

eleventh - gene symbols for annotated genes overlapping with the cluster

twelfth - Ensembl gene IDs for annotated genes overlapping with the cluster

thirteenth - poly(A) signals in the region of -60 to +10 nucleotides around the representative site of the cluster with relative (e.g. @-28) and absolute position on the chromosome (e.g. @1001018)

fourteenth onwards - Sample information, consisting of SAMPLE_ID|PROTOCOL|SOURCE|TITLE|TREATMENT
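
The two composite fields of the TSV file can be unpacked as in the sketch below. The sample header layout `SAMPLE_ID|PROTOCOL|SOURCE|TITLE|TREATMENT` comes from the description above. The layout of the poly(A) signal field is an assumption: entries of the form `MOTIF@RELATIVE@ABSOLUTE` separated by semicolons. Function names and example values are hypothetical.

```python
SAMPLE_KEYS = ("sample_id", "protocol", "source", "title", "treatment")


def parse_sample_header(header):
    """Split a sample column header into its five named parts."""
    parts = header.split("|", 4)  # at most five parts
    if len(parts) != len(SAMPLE_KEYS):
        raise ValueError(f"expected 5 '|'-separated parts, got {len(parts)}")
    return dict(zip(SAMPLE_KEYS, parts))


def parse_signals(field):
    """Parse entries like 'AATAAA@-28@1001018' (assumed ';'-separated)."""
    signals = []
    for entry in field.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty fields and trailing separators
        motif, relative, absolute = entry.split("@")
        signals.append((motif, int(relative), int(absolute)))
    return signals


# Made-up example values.
print(parse_sample_header("S1|3P-Seq|GEO|wild type|none"))
print(parse_signals("AATAAA@-28@1001018;ATTAAA@-15@1001031"))
```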


Homo sapiens - version: 2-0 (GRCh38-96)

Date of release: 2019-08-13

Total reads: 1,104,077,259

3'-end sequencing libraries: 221
See all contributing samples.

Protocols: 3'-Seq (Mayr), 3'READS, DRS, QuantSeq_REV, SAPAS, PAPERCLIP, PolyA-seq, PAS-Seq, A-seq, 3P-Seq

Number of poly(A) site clusters: 569,005

Percentage of clusters with poly(A) signal: 76
List of poly(A) signals.

Poly(A) signals are considered if they reside in the region from 60 nt upstream to 10 nt downstream of one of the poly(A) sites of a cluster:

  1. AATAAA
  2. ATTAAA
  3. TATAAA
  4. AGTAAA
  5. AATACA
  6. CATAAA
  7. AATATA
  8. GATAAA
  9. AATGAA
  10. AATAAT
  11. AAGAAA
  12. ACTAAA
  13. AATAGA
  14. ATTACA
  15. AACAAA
  16. ATTATA
  17. AACAAG
  18. AATAAG

Cluster annotations

Terminal exon: 143,658
Other exon: 21,804
Intron: 165,859
Downstream of terminal exon: 19,865
Antisense exon: 16,135
Antisense intron: 78,441
Antisense upstream of a gene: 4,353
Intergenic: 118,890

Add as custom track @ UCSC genome browser

Atlas BED file

We follow the standard BED specification with 0-based coordinates and append extra columns.

The columns represent:

first - chromosome name

second and third - start and end positions of the poly(A) site cluster, respectively

fourth - unique cluster ID, composed of the chromosome name, the representative poly(A) site of the cluster and the strand. Note that this ID format is inspired by UCSC's position format, which uses 1-based coordinates, whereas the second and third columns use 0-based BED coordinates. Thus, to convert the position of the representative site to BED coordinates, subtract 1.

fifth - average expression (tags per million, tpm) across all samples

sixth - strand on which the cluster is encoded

seventh - percentage of samples that support the particular cluster

eighth - number of different 3' end sequencing protocols that support the particular cluster

ninth - average expression (tags per million, tpm) across all samples

tenth - two letter code for the cluster annotation (in order of decreasing priority: TE, terminal exon; EX, exonic; IN, intronic; DS, 1,000 nt downstream of an annotated terminal exon; AE, anti-sense to an exon; AI, anti-sense to an intron; AU, 1,000 nt upstream in anti-sense direction of a transcription start site; IG, intergenic)

eleventh - information about the poly(A) signal(s) that are present upstream of the poly(A) site, including the motif, the location with respect to the cleavage site and the genomic coordinate

Alternatively, you can download our atlas with average TPMs for all contributing samples as a tab separated file:

Atlas with Samples TSV


The columns represent:

first - chromosome name

second and third - start and end positions of the poly(A) site cluster, respectively

fourth - unique cluster ID, composed of the chromosome name, the representative poly(A) site of the cluster and the strand. Note that this ID format is inspired by UCSC's position format, which uses 1-based coordinates, whereas the second and third columns use 0-based BED coordinates. Thus, to convert the position of the representative site to BED coordinates, subtract 1.

fifth - average expression (tags per million, tpm) across all samples

sixth - strand on which the cluster is encoded

seventh - representative poly(A) site of the cluster

eighth - percentage of samples that support the particular cluster

ninth - number of different 3' end sequencing protocols that support the particular cluster

tenth - two letter code for the cluster annotation (in order of decreasing priority: TE, terminal exon; EX, exonic; IN, intronic; DS, 1,000 nt downstream of an annotated terminal exon; AE, anti-sense to an exon; AI, anti-sense to an intron; AU, 1,000 nt upstream in anti-sense direction of a transcription start site; IG, intergenic)

eleventh - gene symbols for annotated genes overlapping with the cluster

twelfth - Ensembl gene IDs for annotated genes overlapping with the cluster

thirteenth - poly(A) signals in the region of -60 to +10 nucleotides around the representative site of the cluster with relative (e.g. @-28) and absolute position on the chromosome (e.g. @1001018)

fourteenth onwards - Sample information, consisting of SAMPLE_ID|PROTOCOL|SOURCE|TITLE|TREATMENT
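
The priority order given for the two-letter annotation codes in the tenth column can be expressed as a simple ranking. The sketch below picks the highest-priority code among several candidate annotations for one cluster. How candidates are determined is not described here, so the function `pick_annotation` and its fallback to `IG` when no candidate applies are assumptions.

```python
# Codes in order of decreasing priority, as listed for the tenth column.
ANNOTATION_PRIORITY = ["TE", "EX", "IN", "DS", "AE", "AI", "AU", "IG"]
_RANK = {code: rank for rank, code in enumerate(ANNOTATION_PRIORITY)}


def pick_annotation(candidates):
    """Return the highest-priority annotation code among the candidates."""
    valid = [code for code in candidates if code in _RANK]
    if not valid:
        return "IG"  # assumption: nothing overlaps, so treat as intergenic
    return min(valid, key=_RANK.__getitem__)


print(pick_annotation(["IN", "AE"]))        # IN
print(pick_annotation(["AI", "TE", "EX"]))  # TE
```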


Mus musculus - version: 2-0 (GRCm38-96)

Date of release: 2019-08-13

Total reads: 1,167,552,603

3'-end sequencing libraries: 178
See all contributing samples.

Protocols: 3'READS, DRS, SAPAS, PAPERCLIP, 2P-Seq, PolyA-seq, PAS-Seq, A-seq, 3P-Seq

Number of poly(A) site clusters: 301,001

Percentage of clusters with poly(A) signal: 72
List of poly(A) signals.

Poly(A) signals are considered if they reside in the region from 60 nt upstream to 10 nt downstream of one of the poly(A) sites of a cluster:

  1. AATAAA
  2. ATTAAA
  3. TATAAA
  4. AGTAAA
  5. AATACA
  6. CATAAA
  7. AATATA
  8. GATAAA
  9. AATGAA
  10. AATAAT
  11. AAGAAA
  12. ACTAAA
  13. AATAGA
  14. ATTACA
  15. AACAAA
  16. ATTATA
  17. AACAAG
  18. AATAAG

Cluster annotations

Terminal exon: 101,531
Other exon: 13,079
Intron: 58,184
Downstream of terminal exon: 14,657
Antisense exon: 4,386
Antisense intron: 34,671
Antisense upstream of a gene: 4,124
Intergenic: 70,369

Add as custom track @ UCSC genome browser

Atlas BED file

We follow the standard BED specification with 0-based coordinates and append extra columns.

The columns represent:

first - chromosome name

second and third - start and end positions of the poly(A) site cluster, respectively

fourth - unique cluster ID, composed of the chromosome name, the representative poly(A) site of the cluster and the strand. Note that this ID format is inspired by UCSC's position format, which uses 1-based coordinates, whereas the second and third columns use 0-based BED coordinates. Thus, to convert the position of the representative site to BED coordinates, subtract 1.

fifth - average expression (tags per million, tpm) across all samples

sixth - strand on which the cluster is encoded

seventh - percentage of samples that support the particular cluster

eighth - number of different 3' end sequencing protocols that support the particular cluster

ninth - average expression (tags per million, tpm) across all samples

tenth - two letter code for the cluster annotation (in order of decreasing priority: TE, terminal exon; EX, exonic; IN, intronic; DS, 1,000 nt downstream of an annotated terminal exon; AE, anti-sense to an exon; AI, anti-sense to an intron; AU, 1,000 nt upstream in anti-sense direction of a transcription start site; IG, intergenic)

eleventh - information about the poly(A) signal(s) that are present upstream of the poly(A) site, including the motif, the location with respect to the cleavage site and the genomic coordinate

Alternatively, you can download our atlas with average TPMs for all contributing samples as a tab separated file:

Atlas with Samples TSV


The columns represent:

first - chromosome name

second and third - start and end positions of the poly(A) site cluster, respectively

fourth - unique cluster ID, composed of the chromosome name, the representative poly(A) site of the cluster and the strand. Note that this ID format is inspired by UCSC's position format, which uses 1-based coordinates, whereas the second and third columns use 0-based BED coordinates. Thus, to convert the position of the representative site to BED coordinates, subtract 1.

fifth - average expression (tags per million, tpm) across all samples

sixth - strand on which the cluster is encoded

seventh - representative poly(A) site of the cluster

eighth - percentage of samples that support the particular cluster

ninth - number of different 3' end sequencing protocols that support the particular cluster

tenth - two letter code for the cluster annotation (in order of decreasing priority: TE, terminal exon; EX, exonic; IN, intronic; DS, 1,000 nt downstream of an annotated terminal exon; AE, anti-sense to an exon; AI, anti-sense to an intron; AU, 1,000 nt upstream in anti-sense direction of a transcription start site; IG, intergenic)

eleventh - gene symbols for annotated genes overlapping with the cluster

twelfth - Ensembl gene IDs for annotated genes overlapping with the cluster

thirteenth - poly(A) signals in the region of -60 to +10 nucleotides around the representative site of the cluster with relative (e.g. @-28) and absolute position on the chromosome (e.g. @1001018)

fourteenth onwards - Sample information, consisting of SAMPLE_ID|PROTOCOL|SOURCE|TITLE|TREATMENT


Missing older versions of our atlas? Find them in our archive.