Read more about the clustering process that yields our curated datasets of PAS.

A clustering procedure has been implemented to group together closely-spaced cleavage sites that are likely to reflect imperfect precision in 3’ end cleavage.

For this, individual cleavage positions in the genome are sorted, first by the number of samples in which a site has been identified and, for equal numbers of samples, by the average reads per million (RPM).

The list is then traversed from most supported sites to least supported ones and the clustering procedure is applied. We have determined that the number of clusters decreases rapidly up to a distance of 25 nucleotides around the most used cluster representative. Thus, we constructed clusters of sites by grouping sites with lower read support that were located from -25 to +25 nucleotides around sites with strong support. Such clusters are then termed poly(A) sites (PAS).
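The sorting and traversal described above can be sketched roughly as follows. This is a simplified illustration and not the PolyASite pipeline itself: the function name and the dictionary keys (`chrom`, `strand`, `pos`, `n_samples`, `rpm`) are assumptions made for the example, and only the ranking order and the 25-nucleotide window are taken from the text.

```python
from collections import defaultdict


def cluster_cleavage_sites(sites, max_dist=25):
    """Greedily group cleavage sites around the best-supported ones.

    `sites` is a list of dicts with the hypothetical keys
    chrom, strand, pos, n_samples and rpm.
    """
    # Rank by number of supporting samples, then by average RPM (both descending).
    ranked = sorted(sites, key=lambda s: (-s["n_samples"], -s["rpm"]))

    # Index sites by chromosome and strand so only comparable sites are scanned.
    by_region = defaultdict(list)
    for site in sites:
        by_region[(site["chrom"], site["strand"])].append(site)

    assigned = set()  # (chrom, strand, pos) of sites already placed in a cluster
    clusters = []
    for rep in ranked:
        region = (rep["chrom"], rep["strand"])
        if region + (rep["pos"],) in assigned:
            continue  # already absorbed by a better-supported representative
        members = [
            s for s in by_region[region]
            if region + (s["pos"],) not in assigned
            and abs(s["pos"] - rep["pos"]) <= max_dist
        ]
        assigned.update(region + (s["pos"],) for s in members)
        positions = sorted(s["pos"] for s in members)
        clusters.append({
            "chrom": rep["chrom"],
            "strand": rep["strand"],
            "rep": rep["pos"],
            "start": positions[0],
            "end": positions[-1],
            "members": positions,
        })
    return clusters
```

In this sketch a site that lies within 25 nucleotides of two representatives is claimed by the better-supported one, because representatives are visited in ranked order.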

Note that in the current v3.0 release, complete sets of all PAS identified from polyadenylated reads in scRNA-seq data are provided, together with a stringency level for each PAS. The stringency level is assigned based on the presence of canonical polyadenylation motifs, and thus reflects the confidence that a PAS is genuine rather than a technical artifact. Users can filter the catalog by their stringency level of choice (column 9 in the atlas .bed files and column 14 in the atlas .tsv files) to balance sensitivity and specificity in their analysis. In contrast, the catalogs provided in the v2.0 PolyASite atlases were pre-filtered at a specific stringency level.
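Filtering an atlas .bed file by stringency level could look like the sketch below. It assumes that the stringency level in column 9 is numeric and that a PAS passes a chosen threshold when its level is greater than or equal to that threshold, which is how we read "the maximal motif percentage at which the PAS is retained". The function name is hypothetical.

```python
def filter_bed_by_stringency(lines, min_level):
    """Keep BED lines whose stringency level (column 9) is >= min_level."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip empty lines and possible header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if float(fields[8]) >= min_level:  # column 9 is index 8
            kept.append(line)
    return kept
```

For the .tsv files the same logic would apply to column 14 (index 13) after skipping the header row.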


Mus musculus: v3.0 (GRCm38.GENCODE_M25)

Date of release: 2024-09-15

3'-end sequencing libraries: 188
Explore contributing samples.

Protocols: 10X

Number of PAS: 1,750,661

Percentage of PAS with poly(A) signal: 31
List of poly(A) signals.

Poly(A) motif signals are considered if they reside in a region from 35 to 10 nt upstream of the representative cleavage site of the PAS.

  1. AATAAA
  2. ATTAAA
  3. TATAAA
  4. AGTAAA
  5. AATGAA
  6. CATAAA
  7. AATACA
  8. AATATA
  9. GATAAA
  10. ACTAAA
  11. AATAGA
  12. AAGAAA
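
A scan for these hexamers in the stated window might be sketched as follows. The sketch assumes that the sequence is given 5' to 3' on the strand of the PAS, that `cleavage_idx` is the 0-based index of the representative cleavage site within that sequence, and that a hexamer counts only if it lies entirely within positions -35 to -10 relative to the cleavage site. The exact boundary convention used to build the atlas is not stated here, so treat this as an illustration only.

```python
POLYA_SIGNALS = [
    "AATAAA", "ATTAAA", "TATAAA", "AGTAAA", "AATGAA", "CATAAA",
    "AATACA", "AATATA", "GATAAA", "ACTAAA", "AATAGA", "AAGAAA",
]


def find_polya_signals(seq, cleavage_idx, far=35, near=10):
    """Return (motif, offset) pairs for hexamers found upstream of the site.

    The offset is the position of the first motif base relative to the
    cleavage site, so it is negative for upstream hits.
    """
    start = max(0, cleavage_idx - far)
    end = max(0, cleavage_idx - near + 1)  # slice end is exclusive
    window = seq[start:end].upper()
    hits = []
    for motif in POLYA_SIGNALS:
        pos = window.find(motif)
        while pos != -1:
            hits.append((motif, start + pos - cleavage_idx))
            pos = window.find(motif, pos + 1)
    return hits
```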

PAS annotations

Terminal exon: 157,405
Exonic: 79,934
Intronic: 693,688
Alternatively spliced: 80,615
Downstream intergenic: 451,686
Upstream intergenic: 287,219
Ambiguous: 114

Go to PolyASite v3.0 UCSC trackhub

Atlas BED file

We follow the standard BED specification with 0-based coordinates. Additionally, to reflect new features of PAS inferred from scRNA-seq data, we appended extra column(s). Thus, the format is similar to the one used in PolyASite v2.0, but contains a number of updates. For more information click here.

The columns represent:

first - chromosome name

second and third - start and end positions of the PAS, respectively. Differences > 1 between start and end positions reflect the imprecision of the cleavage process.

fourth - unique PAS ID, composed of the chromosome name, the coordinate of the representative cleavage site of the PAS and the strand. Note that this ID format is inspired by that of UCSC's genome browser and uses 1-based coordinates, whereas the BED file format uses 0-based start coordinates. Thus, to obtain the BED start coordinate of the representative site, subtract 1 from the coordinate given in the ID.
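
As a small illustration of this conversion, the sketch below turns a PAS ID into a single-nucleotide BED interval for the representative site. It assumes that the three parts of the ID are separated by colons (for example `chr1:1000:+`); the separator and the function name are assumptions, not taken from the file specification.

```python
def pas_id_to_bed(pas_id):
    """Convert a 1-based PAS ID into (chrom, bed_start, bed_end, strand)."""
    # Split from the right so chromosome names containing ':' stay intact.
    chrom, coord, strand = pas_id.rsplit(":", 2)
    rep = int(coord)
    # BED start is 0-based, BED end is exclusive, so the site spans [rep-1, rep).
    return chrom, rep - 1, rep, strand


print(pas_id_to_bed("chr1:1000:+"))  # ('chr1', 999, 1000, '+')
```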

fifth - average expression (reads per million, RPM), calculated as the average of within-tissue mean expression values to avoid biases from the variable number of samples per tissue. Note that the meaning of this column changed since PolyASite v2.0, where each sample contributed equally to the average.
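
The difference between the two averaging schemes can be shown with a toy example. The tissue names and RPM values below are invented for illustration, and the function names are hypothetical.

```python
from statistics import mean


def tissue_balanced_mean(rpm_by_tissue):
    """Average of within-tissue means: every tissue counts once."""
    return mean(mean(values) for values in rpm_by_tissue.values())


def per_sample_mean(rpm_by_tissue):
    """Plain average over samples: every sample counts once."""
    return mean(v for values in rpm_by_tissue.values() for v in values)


# Made-up values: four samples from one tissue, one sample from another.
rpm = {"brain": [10.0, 10.0, 10.0, 10.0], "liver": [2.0]}
print(tissue_balanced_mean(rpm))  # 6.0
print(per_sample_mean(rpm))       # 8.4
```

With the per-sample average, the tissue with more samples dominates the result; the tissue-balanced average gives both tissues the same weight.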

sixth - strand on which the PAS is encoded

seventh - percentage of tissues that support the particular PAS

eighth - number of different experimental protocols that support the particular PAS (in this case 1, 10X genomics-based scRNA-seq)

ninth - stringency level, the maximal motif percentage at which the PAS is retained upon filtration. See Figure 1B and Supplementary Figure 1 of the PolyASite v3.0 paper for details. Note that the meaning of this column changed since PolyASite v2.0, where it contained the average expression, as does column 5.

tenth - two-letter code for the genomic class annotation of the PAS (TE, terminal exon; EX, exonic; IN, intronic; AL, alternative (exonic/intronic); DI, downstream intergenic; UI, upstream intergenic; NA, not available (ambiguous)). See Figure 1C of the PolyASite v3.0 paper for details. Note that the classification scheme changed since PolyASite v2.0. Specifically, in v2.0 each PAS could be assigned to several genes, including in antisense direction, and alternatively spliced (exonic/intronic) regions were not distinguished as a separate class. In v3.0, each PAS is uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest annotated 3’ end on the same strand.

eleventh - information about the poly(A) signal(s) (core hexamers) present in the region from 35 to 10 nt upstream of the representative cleavage site of the PAS. Note that PolyASite v2.0 used a larger window, from 60 nt upstream to 10 nt downstream of the cleavage site, and considered 18 motifs instead of the 12 used in PolyASite v3.0.
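
The eleven columns listed above can be read into a typed record along these lines. The class and field names are hypothetical, the numeric types are assumptions based on the column descriptions, and the poly(A) signal column is kept as a raw string because its internal format is not described here.

```python
from dataclasses import dataclass


@dataclass
class PasBedRecord:
    chrom: str
    start: int               # 0-based, as in the BED specification
    end: int
    pas_id: str              # uses a 1-based representative coordinate
    avg_rpm: float
    strand: str
    perc_tissues: float
    nr_protocols: int
    stringency_level: float  # assumed numeric
    annotation: str          # TE, EX, IN, AL, DI, UI or NA
    signals: str             # left unparsed


def parse_bed_line(line):
    """Parse one tab-separated atlas BED line into a PasBedRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(f)}")
    return PasBedRecord(
        f[0], int(f[1]), int(f[2]), f[3], float(f[4]), f[5],
        float(f[6]), int(f[7]), float(f[8]), f[9], f[10],
    )
```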

Alternatively, you can download our atlas with average RPMs for all contributing tissues as a tab-separated file:

Atlas with Tissues TSV

The format is similar to the one used in PolyASite v2.0, but contains a number of updates. For more information on columns click here.

The columns represent:

first (chrom) - chromosome name

second (chromStart) and third (chromEnd) - start and end positions of the PAS, respectively. Differences > 1 between start and end positions reflect the imprecision of the cleavage process.

fourth (name) - unique PAS ID, composed of the chromosome name, the coordinate of the representative cleavage site of the PAS and the strand. Note that this ID format is inspired by that of UCSC's genome browser and uses 1-based coordinates, whereas the BED file format uses 0-based start coordinates. Thus, to obtain the BED start coordinate of the representative site, subtract 1 from the coordinate given in the ID.

fifth (avg_expression) - average expression (reads per million, RPM), calculated as the average of within-tissue mean expression values to avoid biases from the variable number of samples per tissue. Note that the meaning of this column changed since PolyASite v2.0, where each sample contributed equally to the average.

sixth (strand) - strand on which the PAS is encoded

seventh (rep) - representative cleavage site of the cluster (genomic coordinate)

eighth (perc_tissues) - percentage of tissues that support the particular PAS

ninth (nr_prots) - number of different experimental protocols that support the particular PAS

tenth (annotation) - two-letter code for the genomic class annotation of the PAS (TE, terminal exon; EX, exonic; IN, intronic; AL, alternative (exonic/intronic); DI, downstream intergenic; UI, upstream intergenic; NA, not available (ambiguous)). See Figure 1C of the PolyASite v3.0 paper for details. Note that the classification scheme changed since PolyASite v2.0, where each PAS could be assigned to several genes, including in antisense direction, and alternative (exonic/intronic) regions were not distinguished as a separate class. Now, each PAS is uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest 3’ end on the same strand.

eleventh (gene_name) - gene symbol of the annotated gene to which the PAS is assigned. Note that each PAS is now uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest annotated 3’ end on the same strand. This classification scheme differs from that of PolyASite v2.0, where a PAS could be assigned to multiple overlapping genes and was associated with a gene only when its genomic coordinates directly overlapped those of the gene.

twelfth (gene_id) - gene ID of the annotated gene to which the PAS is assigned. See the description of column 11 for more details.

thirteenth (repSite_signals) - information about the poly(A) signal(s) (core hexamers) present in the region from 35 to 10 nt upstream of the representative cleavage site of the PAS. Note that PolyASite v2.0 used a larger window, from 60 nt upstream to 10 nt downstream of the cleavage site, and considered 18 motifs instead of the 12 used in PolyASite v3.0.

fourteenth (stringency_level) - stringency level, the maximal motif percentage at which the PAS is retained upon filtration. See Figure 1B and Supplementary Figure 1 of the PolyASite v3.0 paper for details. Note that the meaning of this column changed since PolyASite v2.0, where it contained the average expression, as does column 5.

fifteenth onwards - average expression (reads per million, RPM) within tissues. Note that the meaning of these columns changed since PolyASite v2.0, where these columns represented RPM values within each of the analyzed samples.
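
Because the number of tissue columns differs between atlases, it can be convenient to split each row into the 14 fixed columns and the per-tissue RPM values. The sketch below assumes that the file starts with a header row, that the tissue columns hold plain numeric values, and that the PAS ID is in the fourth column as described above. The function name is hypothetical.

```python
import csv
import io

N_FIXED_COLUMNS = 14  # chrom ... stringency_level, as listed above


def read_tissue_expression(tsv_text):
    """Return (tissue_names, {pas_id: {tissue: rpm}}) from atlas TSV text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    tissues = header[N_FIXED_COLUMNS:]
    expression = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        values = row[N_FIXED_COLUMNS:]
        # Column 4 (index 3) holds the unique PAS ID.
        expression[row[3]] = {t: float(v) for t, v in zip(tissues, values)}
    return tissues, expression
```

For a file on disk, the text could be obtained with `open(path).read()`; for the larger atlases, iterating over the file object directly would use less memory.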


Homo sapiens: v3.0 (GRCh38.GENCODE_42)

Date of release: 2024-09-15

3'-end sequencing libraries: 813
Explore contributing samples.

Protocols: 10X

Number of PAS: 18,432,135

Percentage of PAS with poly(A) signal: 20
List of poly(A) signals.

Poly(A) motif signals are considered if they reside in a region from 35 to 10 nt upstream of the representative cleavage site of the PAS.

  1. AATAAA
  2. ATTAAA
  3. TATAAA
  4. AGTAAA
  5. AATGAA
  6. CATAAA
  7. AATACA
  8. AATATA
  9. GATAAA
  10. ACTAAA
  11. AATAGA
  12. AAGAAA

PAS annotations

Terminal exon: 564,914
Exonic: 348,266
Intronic: 10,507,671
Alternatively spliced: 635,988
Downstream intergenic: 3,920,458
Upstream intergenic: 2,454,745
Ambiguous: 93

Go to PolyASite v3.0 UCSC trackhub

Atlas BED file

We follow the standard BED specification with 0-based coordinates. Additionally, to reflect new features of PAS inferred from scRNA-seq data, we appended extra column(s). Thus, the format is similar to the one used in PolyASite v2.0, but contains a number of updates. For more information click here.

The columns represent:

first - chromosome name

second and third - start and end positions of the PAS, respectively. Differences > 1 between start and end positions reflect the imprecision of the cleavage process.

fourth - unique PAS ID, composed of the chromosome name, the coordinate of the representative cleavage site of the PAS and the strand. Note that this ID format is inspired by that of UCSC's genome browser and uses 1-based coordinates, whereas the BED file format uses 0-based start coordinates. Thus, to obtain the BED start coordinate of the representative site, subtract 1 from the coordinate given in the ID.

fifth - average expression (reads per million, RPM), calculated as the average of within-tissue mean expression values to avoid biases from the variable number of samples per tissue. Note that the meaning of this column changed since PolyASite v2.0, where each sample contributed equally to the average.

sixth - strand on which the PAS is encoded

seventh - percentage of tissues that support the particular PAS

eighth - number of different experimental protocols that support the particular PAS (in this case 1, 10X genomics-based scRNA-seq)

ninth - stringency level, the maximal motif percentage at which the PAS is retained upon filtration. See Figure 1B and Supplementary Figure 1 of the PolyASite v3.0 paper for details. Note that the meaning of this column changed since PolyASite v2.0, where it contained the average expression, as does column 5.

tenth - two-letter code for the genomic class annotation of the PAS (TE, terminal exon; EX, exonic; IN, intronic; AL, alternative (exonic/intronic); DI, downstream intergenic; UI, upstream intergenic; NA, not available (ambiguous)). See Figure 1C of the PolyASite v3.0 paper for details. Note that the classification scheme changed since PolyASite v2.0. Specifically, in v2.0 each PAS could be assigned to several genes, including in antisense direction, and alternatively spliced (exonic/intronic) regions were not distinguished as a separate class. In v3.0, each PAS is uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest annotated 3’ end on the same strand.

eleventh - information about the poly(A) signal(s) (core hexamers) present in the region from 35 to 10 nt upstream of the representative cleavage site of the PAS. Note that PolyASite v2.0 used a larger window, from 60 nt upstream to 10 nt downstream of the cleavage site, and considered 18 motifs instead of the 12 used in PolyASite v3.0.

Alternatively, you can download our atlas with average RPMs for all contributing tissues as a tab-separated file:

Atlas with Tissues TSV

The format is similar to the one used in PolyASite v2.0, but contains a number of updates. For more information on columns click here.

The columns represent:

first (chrom) - chromosome name

second (chromStart) and third (chromEnd) - start and end positions of the PAS, respectively. Differences > 1 between start and end positions reflect the imprecision of the cleavage process.

fourth (name) - unique PAS ID, composed of the chromosome name, the coordinate of the representative cleavage site of the PAS and the strand. Note that this ID format is inspired by that of UCSC's genome browser and uses 1-based coordinates, whereas the BED file format uses 0-based start coordinates. Thus, to obtain the BED start coordinate of the representative site, subtract 1 from the coordinate given in the ID.

fifth (avg_expression) - average expression (reads per million, RPM), calculated as the average of within-tissue mean expression values to avoid biases from the variable number of samples per tissue. Note that the meaning of this column changed since PolyASite v2.0, where each sample contributed equally to the average.

sixth (strand) - strand on which the PAS is encoded

seventh (rep) - representative cleavage site of the cluster (genomic coordinate)

eighth (perc_tissues) - percentage of tissues that support the particular PAS

ninth (nr_prots) - number of different experimental protocols that support the particular PAS

tenth (annotation) - two-letter code for the genomic class annotation of the PAS (TE, terminal exon; EX, exonic; IN, intronic; AL, alternative (exonic/intronic); DI, downstream intergenic; UI, upstream intergenic; NA, not available (ambiguous)). See Figure 1C of the PolyASite v3.0 paper for details. Note that the classification scheme changed since PolyASite v2.0, where each PAS could be assigned to several genes, including in antisense direction, and alternative (exonic/intronic) regions were not distinguished as a separate class. Now, each PAS is uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest 3’ end on the same strand.

eleventh (gene_name) - gene symbol of the annotated gene to which the PAS is assigned. Note that each PAS is now uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest annotated 3’ end on the same strand. This classification scheme differs from that of PolyASite v2.0, where a PAS could be assigned to multiple overlapping genes and was associated with a gene only when its genomic coordinates directly overlapped those of the gene.

twelfth (gene_id) - gene ID of the annotated gene to which the PAS is assigned. See the description of column 11 for more details.

thirteenth (repSite_signals) - information about the poly(A) signal(s) (core hexamers) present in the region from 35 to 10 nt upstream of the representative cleavage site of the PAS. Note that PolyASite v2.0 used a larger window, from 60 nt upstream to 10 nt downstream of the cleavage site, and considered 18 motifs instead of the 12 used in PolyASite v3.0.

fourteenth (stringency_level) - stringency level, the maximal motif percentage at which the PAS is retained upon filtration. See Figure 1B and Supplementary Figure 1 of the PolyASite v3.0 paper for details. Note that the meaning of this column changed since PolyASite v2.0, where it contained the average expression, as does column 5.

fifteenth onwards - average expression (reads per million, RPM) within tissues. Note that the meaning of these columns changed since PolyASite v2.0, where these columns represented RPM values within each of the analyzed samples.


Caenorhabditis elegans: v3.0 (WBcel235.WormBase_WS293)

Date of release: 2024-09-15

3'-end sequencing libraries: 55
Explore contributing samples.

Protocols: 10X

Number of PAS: 66,458

Percentage of PAS with poly(A) signal: 51
List of poly(A) signals.

Poly(A) motif signals are considered if they reside in a region from 35 to 10 nt upstream of the representative cleavage site of the PAS.

  1. AATAAA
  2. ATTAAA
  3. TATAAA
  4. AGTAAA
  5. AATGAA
  6. CATAAA
  7. AATACA
  8. AATATA
  9. GATAAA
  10. ACTAAA
  11. AATAGA
  12. AAGAAA

PAS annotations

Terminal exon: 26,294
Exonic: 11,353
Intronic: 9,342
Alternatively spliced: 963
Downstream intergenic: 15,113
Upstream intergenic: 3,393
Ambiguous: 0

Go to PolyASite v3.0 UCSC trackhub

Atlas BED file

We follow the standard BED specification with 0-based coordinates. Additionally, to reflect new features of PAS inferred from scRNA-seq data, we appended extra column(s). Thus, the format is similar to the one used in PolyASite v2.0, but contains a number of updates. For more information click here.

The columns represent:

first - chromosome name

second and third - start and end positions of the PAS, respectively. Differences > 1 between start and end positions reflect the imprecision of the cleavage process.

fourth - unique PAS ID, composed of the chromosome name, the coordinate of the representative cleavage site of the PAS and the strand. Note that this ID format is inspired by that of UCSC's genome browser and uses 1-based coordinates, whereas the BED file format uses 0-based start coordinates. Thus, to obtain the BED start coordinate of the representative site, subtract 1 from the coordinate given in the ID.

fifth - average expression (reads per million, RPM) over samples

sixth - strand on which the PAS is encoded

seventh - percentage of samples that support the particular PAS

eighth - number of different experimental protocols that support the particular PAS (in this case 1, 10X genomics-based scRNA-seq)

ninth - stringency level, the maximal motif percentage at which the PAS is retained upon filtration. See Figure 1B and Supplementary Figure 1 of the PolyASite v3.0 paper for details. Note that the meaning of this column changed since PolyASite v2.0, where it contained the average expression, as does column 5.

tenth - two-letter code for the genomic class annotation of the PAS (TE, terminal exon; EX, exonic; IN, intronic; AL, alternative (exonic/intronic); DI, downstream intergenic; UI, upstream intergenic; NA, not available (ambiguous)). See Figure 1C of the PolyASite v3.0 paper for details. Note that the classification scheme changed since PolyASite v2.0. Specifically, in v2.0 each PAS could be assigned to several genes, including in antisense direction, and alternatively spliced (exonic/intronic) regions were not distinguished as a separate class. In v3.0, each PAS is uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest annotated 3’ end on the same strand.

eleventh - information about the poly(A) signal(s) (core hexamers) present in the region from 35 to 10 nt upstream of the representative cleavage site of the PAS. Note that PolyASite v2.0 used a larger window, from 60 nt upstream to 10 nt downstream of the cleavage site, and considered 18 motifs instead of the 12 used in PolyASite v3.0.

Alternatively, you can download our atlas with average RPMs for all contributing samples as a tab-separated file:

Atlas with Samples TSV

The format is similar to the one used in PolyASite v2.0, but contains a number of updates. For more information on columns click here.

The columns represent:

first (chrom) - chromosome name

second (chromStart) and third (chromEnd) - start and end positions of the PAS, respectively. Differences > 1 between start and end positions reflect the imprecision of the cleavage process.

fourth (name) - unique PAS ID, composed of the chromosome name, the coordinate of the representative cleavage site of the PAS and the strand. Note that this ID format is inspired by that of UCSC's genome browser and uses 1-based coordinates, whereas the BED file format uses 0-based start coordinates. Thus, to obtain the BED start coordinate of the representative site, subtract 1 from the coordinate given in the ID.

fifth (avg_expression) - average expression (reads per million, RPM) over samples

sixth (strand) - strand on which the PAS is encoded

seventh (rep) - representative cleavage site of the cluster (genomic coordinate)

eighth (perc_samples) - percentage of samples that support the particular PAS

ninth (nr_prots) - number of different experimental protocols that support the particular PAS

tenth (annotation) - two-letter code for the genomic class annotation of the PAS (TE, terminal exon; EX, exonic; IN, intronic; AL, alternative (exonic/intronic); DI, downstream intergenic; UI, upstream intergenic; NA, not available (ambiguous)). See Figure 1C of the PolyASite v3.0 paper for details. Note that the classification scheme changed since PolyASite v2.0, where each PAS could be assigned to several genes, including in antisense direction, and alternative (exonic/intronic) regions were not distinguished as a separate class. Now, each PAS is uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest 3’ end on the same strand.

eleventh (gene_name) - gene symbol of the annotated gene to which the PAS is assigned. Note that each PAS is now uniquely assigned to one gene. If the PAS is located in an intergenic region or in a region where multiple genes overlap on the same strand, it is assigned to the gene with the closest annotated 3’ end on the same strand. This classification scheme differs from that of PolyASite v2.0, where a PAS could be assigned to multiple overlapping genes and was associated with a gene only when its genomic coordinates directly overlapped those of the gene.

twelfth (gene_id) - gene ID of the annotated gene to which the PAS is assigned. See the description of column 11 for more details.

thirteenth (repSite_signals) - information about the poly(A) signal(s) (core hexamers) present in the region from 35 to 10 nt upstream of the representative cleavage site of the PAS. Note that PolyASite v2.0 used a larger window, from 60 nt upstream to 10 nt downstream of the cleavage site, and considered 18 motifs instead of the 12 used in PolyASite v3.0.

fourteenth (stringency_level) - stringency level, the maximal motif percentage at which the PAS is retained upon filtration. See Figure 1B and Supplementary Figure 1 of the PolyASite v3.0 paper for details. Note that the meaning of this column changed since PolyASite v2.0, where it contained the average expression, as does column 5.

fifteenth onwards - average expression (reads per million, RPM) within samples


Missing the bulk RNA 3' end sequencing versions of our atlas? No worries, it's still here.