Additional utilities
Swan now comes with several utilities that can be used to compute and output various metrics using data in the SwanGraph.
We'll be using the same SwanGraph as the rest of the tutorial pages to demonstrate these utilities. Load it using the following code:
Calculating TPM values
Swan lets users calculate the TPM of their data, grouped by a metadata column of their choice, using the calc_tpm() function. You can use it to calculate TPM for any of the AnnData objects in the SwanGraph (SwanGraph.adata for transcripts, SwanGraph.tss_adata for TSSs, SwanGraph.tes_adata for TESs, and SwanGraph.edge_adata for edges); see the Data structure FAQ page for more information on these tables.
First, we'll calculate the TPM for each transcript in each dataset:
tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
hepg2_1 | 196.138474 | 86.060760 | 8.005652 | 46.032497 | 0.0 | 16.011305 | 258.182281 | 60.042389 | 0.0 | 0.0 | ... | 0.000000 | 4.002826 | 2.001413 | 12.008478 | 0.000000 | 0.000000 | 4.002826 | 14.009891 | 8.005652 | 0.000000 |
hepg2_2 | 243.975174 | 77.789185 | 7.071744 | 61.288448 | 0.0 | 12.964864 | 380.695557 | 64.824318 | 0.0 | 0.0 | ... | 1.178624 | 14.143488 | 4.714496 | 7.071744 | 2.357248 | 8.250368 | 2.357248 | 11.786240 | 10.607616 | 1.178624 |
hffc6_1 | 131.320969 | 194.355042 | 0.000000 | 107.683197 | 0.0 | 6.566049 | 278.400452 | 0.000000 | 0.0 | 0.0 | ... | 6.566049 | 13.132097 | 9.192468 | 1.313210 | 6.566049 | 9.192468 | 6.566049 | 0.000000 | 15.758516 | 1.313210 |
hffc6_2 | 137.061584 | 242.395935 | 0.000000 | 124.370689 | 0.0 | 8.883621 | 219.552338 | 0.000000 | 0.0 | 0.0 | ... | 15.229064 | 10.152709 | 6.345443 | 8.883621 | 1.269089 | 10.152709 | 15.229064 | 0.000000 | 16.498154 | 8.883621 |
hffc6_3 | 147.986496 | 273.205841 | 3.252450 | 172.379868 | 0.0 | 9.757351 | 200.025696 | 1.626225 | 0.0 | 0.0 | ... | 14.636026 | 11.383576 | 8.131125 | 8.131125 | 11.383576 | 11.383576 | 6.504900 | 0.000000 | 24.393377 | 9.757351 |
5 rows × 208306 columns
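To make the normalization concrete, here is a minimal sketch of the arithmetic. It assumes TPM here means counts scaled so that each dataset sums to one million, with no length normalization. That assumption is consistent with the tables on this page (for hepg2_1, every TPM value equals the corresponding count multiplied by about 2.001413), but it is an inference, not a description of Swan's internals. The function name counts_to_tpm and the toy counts are hypothetical and not part of Swan.

```python
def counts_to_tpm(counts):
    """Scale one dataset's counts so that they sum to one million.

    `counts` maps a transcript ID to its raw count in a single dataset.
    Assumption: no transcript-length normalization is applied.
    """
    total = sum(counts.values())
    if total == 0:
        # Nothing was detected in this dataset; avoid dividing by zero.
        return {tid: 0.0 for tid in counts}
    return {tid: c * 1e6 / total for tid, c in counts.items()}


# Hypothetical toy counts for one dataset (not taken from the SwanGraph).
toy_counts = {"tx_a": 98, "tx_b": 43, "tx_c": 4, "tx_d": 55}
toy_tpm = counts_to_tpm(toy_counts)

for tid, value in toy_tpm.items():
    print(tid, round(value, 3))
```

Because every transcript in a dataset is scaled by the same factor, ratios between transcripts within one dataset are unchanged by this step.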
We can swap the first argument for any of the other AnnData structures in the SwanGraph. For instance, say we want to calculate the TPM of each TSS:
tss_id | ENSG00000000003.14_1 | ENSG00000000003.14_2 | ENSG00000000003.14_3 | ENSG00000000003.14_4 | ENSG00000000005.5_1 | ENSG00000000005.5_2 | ENSG00000000419.12_1 | ENSG00000000419.12_2 | ENSG00000000457.13_1 | ENSG00000000457.13_2 | ... | TALONG000085596_1 | TALONG000085799_1 | TALONG000085978_1 | TALONG000086022_1 | TALONG000086057_1 | TALONG000086218_1 | TALONG000086443_1 | TALONG000086539_1 | TALONG000086553_1 | TALONG000086766_1 |
hepg2_1 | 0.0 | 232.163910 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 54.038151 | 0.0 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 60.042389 | 0.000000 | 0.000000 | 6.004239 | 0.000000 | 0.000000 | 8.005652 |
hepg2_2 | 0.0 | 276.976654 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 103.718910 | 0.0 | 2.357248 | ... | 0.000000 | 0.000000 | 0.000000 | 95.468544 | 0.000000 | 0.000000 | 31.822847 | 0.000000 | 1.178624 | 10.607616 |
hffc6_1 | 0.0 | 45.962341 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 101.117149 | 0.0 | 0.000000 | ... | 2.626419 | 6.566049 | 9.192468 | 0.000000 | 7.879258 | 11.818888 | 233.751328 | 9.192468 | 6.566049 | 15.758516 |
hffc6_2 | 0.0 | 53.301723 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 85.028938 | 0.0 | 1.269089 | ... | 6.345443 | 1.269089 | 12.690886 | 0.000000 | 8.883621 | 20.305418 | 119.294334 | 12.690886 | 2.538177 | 16.498154 |
hffc6_3 | 0.0 | 68.301460 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 89.442383 | 0.0 | 1.626225 | ... | 8.131125 | 8.131125 | 11.383576 | 0.000000 | 11.383576 | 17.888477 | 134.976685 | 27.645828 | 8.131125 | 24.393377 |
5 rows × 130176 columns
Finally, we can compute TPM over an alternative metadata column. For instance, we can use the cell_line column:
tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
hepg2 | 226.245346 | 80.854897 | 7.417881 | 55.634102 | 0.0 | 14.093972 | 335.288177 | 63.051983 | 0.0 | 0.0 | ... | 0.741788 | 10.385033 | 3.70894 | 8.901457 | 1.483576 | 5.192516 | 2.967152 | 12.610396 | 9.643245 | 0.741788 |
hffc6 | 138.145737 | 234.247116 | 0.924052 | 132.139404 | 0.0 | 8.316465 | 234.709137 | 0.462026 | 0.0 | 0.0 | ... | 12.012672 | 11.550647 | 7.85444 | 6.006336 | 6.006336 | 10.164569 | 9.702543 | 0.000000 | 18.481035 | 6.468362 |
2 rows × 208306 columns
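One way to picture grouping by a metadata column is to pool the counts of all datasets that share a value and then normalize the pooled counts. The sketch below does exactly that. The cell_line values above appear consistent with pooling before normalizing (rather than averaging per-dataset TPMs), but treat that as an inference from the tables. The names grouped_tpm, toy_counts, and toy_groups are hypothetical.

```python
from collections import defaultdict


def grouped_tpm(counts_by_dataset, group_of):
    """Pool counts per group, then scale each group to one million.

    `counts_by_dataset` maps dataset name -> {transcript ID: count}.
    `group_of` maps dataset name -> group label (e.g. a cell line).
    """
    pooled = defaultdict(lambda: defaultdict(float))
    for dataset, counts in counts_by_dataset.items():
        group = group_of[dataset]
        for tid, c in counts.items():
            pooled[group][tid] += c

    result = {}
    for group, counts in pooled.items():
        total = sum(counts.values())
        result[group] = {
            tid: (c * 1e6 / total if total else 0.0)
            for tid, c in counts.items()
        }
    return result


# Hypothetical toy data: two replicates of one cell line, one of another.
toy_counts = {
    "hepg2_1": {"tx_a": 98, "tx_b": 2},
    "hepg2_2": {"tx_a": 207, "tx_b": 93},
    "hffc6_1": {"tx_a": 100, "tx_b": 100},
}
toy_groups = {"hepg2_1": "hepg2", "hepg2_2": "hepg2", "hffc6_1": "hffc6"}
toy_grouped = grouped_tpm(toy_counts, toy_groups)

for group, values in toy_grouped.items():
    print(group, values)
```

Pooling weights each dataset by its sequencing depth, so a deeply sequenced replicate contributes more to the group value than a shallow one.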
Calculating pi values
You can use the calc_pi() function to calculate percent isoform use (pi) per gene in nearly the same way as calc_tpm(): you can run it at the transcript, edge, TSS, or TES level, and you can choose the metadata variable to group by. The only difference is that calc_pi() also requires a DataFrame as its second argument that tells Swan which gene each entry comes from. The DataFrame to provide for each AnnData is listed below:
AnnData | DataFrame |
|
|
|
|
|
|
First, we'll calculate the pi value for each transcript in each dataset:
tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
hepg2_1 | 100.000000 | 100.0 | 100.000000 | 100.0 | 0.0 | 100.000000 | 100.0 | 93.750000 | 0.0 | 0.0 | ... | 0.000000 | 1.904762 | 6.666667 | 13.043478 | 0.000000 | 0.000000 | 1.333333 | 100.0 | 100.0 | 0.000000 |
hepg2_2 | 99.519226 | 100.0 | 60.000004 | 100.0 | 0.0 | 100.000000 | 100.0 | 80.882355 | 0.0 | 0.0 | ... | 5.263158 | 3.225806 | 13.793103 | 8.695652 | 0.884956 | 3.097345 | 0.884956 | 100.0 | 100.0 | 2.380952 |
hffc6_1 | 98.039215 | 100.0 | 0.000000 | 100.0 | 0.0 | 100.000000 | 100.0 | 0.000000 | 0.0 | 0.0 | ... | 2.604167 | 2.092050 | 16.279070 | 1.428571 | 0.854701 | 1.196581 | 0.854701 | 0.0 | 100.0 | 1.886792 |
hffc6_2 | 99.082573 | 100.0 | 0.000000 | 100.0 | 0.0 | 77.777779 | 100.0 | 0.000000 | 0.0 | 0.0 | ... | 4.285715 | 2.144772 | 11.627908 | 14.893617 | 0.166667 | 1.333333 | 2.000000 | 0.0 | 100.0 | 9.859155 |
hffc6_3 | 100.000000 | 100.0 | 100.000000 | 100.0 | 0.0 | 85.714287 | 100.0 | 100.000000 | 0.0 | 0.0 | ... | 4.326923 | 2.536232 | 15.151516 | 10.638298 | 1.711491 | 1.711491 | 0.977995 | 0.0 | 100.0 | 13.636364 |
5 rows × 208306 columns
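As a rough illustration of what a pi value expresses, the sketch below divides each isoform's count by the total count of its gene and multiplies by 100. The transcript-to-gene mapping plays the role of the additional DataFrame argument described above. The function percent_isoform and the toy data are hypothetical, and returning 0.0 for genes with no counts is an assumption made for this example.

```python
def percent_isoform(counts, gene_of):
    """Percent isoform use within one condition.

    `counts` maps transcript ID -> count in that condition.
    `gene_of` maps transcript ID -> gene ID.
    """
    # Total count per gene, summed over all of its isoforms.
    gene_totals = {}
    for tid, c in counts.items():
        gid = gene_of[tid]
        gene_totals[gid] = gene_totals.get(gid, 0) + c

    pi = {}
    for tid, c in counts.items():
        total = gene_totals[gene_of[tid]]
        # Assumption: an unexpressed gene yields 0.0 for all its isoforms.
        pi[tid] = 100.0 * c / total if total else 0.0
    return pi


# Hypothetical toy data: gene_1 has two isoforms, gene_3 is not expressed.
toy_gene_of = {"tx_a": "gene_1", "tx_b": "gene_1",
               "tx_c": "gene_2", "tx_d": "gene_3"}
toy_counts = {"tx_a": 30, "tx_b": 2, "tx_c": 7, "tx_d": 0}
toy_pi = percent_isoform(toy_counts, toy_gene_of)

for tid, value in toy_pi.items():
    print(tid, value)
```

Within a single condition the ratio is the same whether it is computed from counts or from TPM, because both numerator and denominator are scaled by the same factor.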
As a note, the calc_pi() function outputs not only a table of pi values but also a table of counts per isoform per condition, which is used as an intermediate during differential isoform expression (DIE) testing. It is output here to avoid recalculation.
| ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
hepg2_1 | 98.0 | 43.0 | 4.0 | 23.0 | 0.0 | 8.0 | 129.0 | 30.0 | 0.0 | 0.0 | ... | 0.0 | 2.0 | 1.0 | 6.0 | 0.0 | 0.0 | 2.0 | 7.0 | 4.0 | 0.0 |
hepg2_2 | 207.0 | 66.0 | 6.0 | 52.0 | 0.0 | 11.0 | 323.0 | 55.0 | 0.0 | 0.0 | ... | 1.0 | 12.0 | 4.0 | 6.0 | 2.0 | 7.0 | 2.0 | 10.0 | 9.0 | 1.0 |
hffc6_1 | 100.0 | 148.0 | 0.0 | 82.0 | 0.0 | 5.0 | 212.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 10.0 | 7.0 | 1.0 | 5.0 | 7.0 | 5.0 | 0.0 | 12.0 | 1.0 |
hffc6_2 | 108.0 | 191.0 | 0.0 | 98.0 | 0.0 | 7.0 | 173.0 | 0.0 | 0.0 | 0.0 | ... | 12.0 | 8.0 | 5.0 | 7.0 | 1.0 | 8.0 | 12.0 | 0.0 | 13.0 | 7.0 |
hffc6_3 | 91.0 | 168.0 | 2.0 | 106.0 | 0.0 | 6.0 | 123.0 | 1.0 | 0.0 | 0.0 | ... | 9.0 | 7.0 | 5.0 | 5.0 | 7.0 | 7.0 | 4.0 | 0.0 | 15.0 | 6.0 |
5 rows × 208306 columns
We can also calculate the pi value for the TSSs and TESs in each dataset:
We can also calculate pi values using a different metadata column, shown here with the cell_line column:
tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
hepg2 | 99.673203 | 100.0 | 71.428574 | 100.0 | 0.0 | 100.000000 | 100.0 | 85.0 | 0.0 | 0.0 | ... | 4.545455 | 2.935011 | 11.363637 | 10.434782 | 0.531915 | 1.861702 | 1.06383 | 100.0 | 100.0 | 1.785714 |
hffc6 | 99.006622 | 100.0 | 100.000000 | 100.0 | 0.0 | 85.714287 | 100.0 | 100.0 | 0.0 | 0.0 | ... | 3.823529 | 2.218279 | 14.285715 | 7.926829 | 0.815558 | 1.380176 | 1.31744 | 0.0 | 100.0 | 8.333334 |
2 rows × 208306 columns
Obtaining edge abundance information
In case you're interested in doing outside analyses on the edge level (for instance, using intron counting to assess alternative splicing), Swan provides a tool to output a DataFrame with edge abundance on the dataset level.
If you just want to access the edge abundance DataFrame, use the get_edge_abundance() function:
| strand | edge_type | annotation | chrom | start | stop | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
0 | + | exon | True | chr1 | 11869 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | + | exon | True | chr1 | 12010 | 12057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | + | intron | True | chr1 | 12057 | 12179 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | + | exon | True | chr1 | 12179 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | + | intron | True | chr1 | 12227 | 12613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
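As a small example of the kind of outside analysis mentioned above, the sketch below keeps only intron rows whose total abundance across datasets reaches a threshold. It works on plain dictionaries that mirror the columns shown in the table. The function expressed_introns, the min_total threshold, and the toy abundance values are all hypothetical.

```python
# Dataset columns as they appear in the edge abundance table above.
DATASETS = ["hepg2_1", "hepg2_2", "hffc6_1", "hffc6_2", "hffc6_3"]


def expressed_introns(rows, min_total=1.0):
    """Return intron rows whose summed abundance is at least `min_total`."""
    kept = []
    for row in rows:
        total = sum(row[d] for d in DATASETS)
        if row["edge_type"] == "intron" and total >= min_total:
            kept.append(row)
    return kept


# Hypothetical toy rows; the abundance values are invented for illustration.
toy_rows = [
    {"edge_type": "exon", "chrom": "chr1", "start": 12010, "stop": 12057,
     "hepg2_1": 3.0, "hepg2_2": 1.0, "hffc6_1": 0.0, "hffc6_2": 0.0, "hffc6_3": 2.0},
    {"edge_type": "intron", "chrom": "chr1", "start": 12057, "stop": 12179,
     "hepg2_1": 3.0, "hepg2_2": 1.0, "hffc6_1": 0.0, "hffc6_2": 0.0, "hffc6_3": 2.0},
    {"edge_type": "intron", "chrom": "chr1", "start": 12227, "stop": 12613,
     "hepg2_1": 0.0, "hepg2_2": 0.0, "hffc6_1": 0.0, "hffc6_2": 0.0, "hffc6_3": 0.0},
]
toy_introns = expressed_introns(toy_rows)

for row in toy_introns:
    print(row["chrom"], row["start"], row["stop"])
```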
You can also specify whether you want the data to be output as raw counts (kind='counts') or TPM (kind='tpm'). By default, this function returns counts. Here's an example with TPM:
| strand | edge_type | annotation | chrom | start | stop | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
0 | + | exon | True | chr1 | 11869 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | + | exon | True | chr1 | 12010 | 12057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | + | intron | True | chr1 | 12057 | 12179 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | + | exon | True | chr1 | 12179 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | + | intron | True | chr1 | 12227 | 12613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Finally, you can provide the function with a prefix value, which indicates that you want the output DataFrame to be saved in TSV form.
| strand | edge_type | annotation | chrom | start | stop | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
0 | + | exon | True | chr1 | 11869 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | + | exon | True | chr1 | 12010 | 12057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | + | intron | True | chr1 | 12057 | 12179 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | + | exon | True | chr1 | 12179 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | + | intron | True | chr1 | 12227 | 12613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The results will be saved in '{prefix}_edge_abundance.tsv'.
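A tab-separated file like this can be read by any tool that handles delimited text. The sketch below uses only the Python standard library: it first writes a toy file following the '{prefix}_edge_abundance.tsv' naming pattern so that the example is self-contained, then reads it back. The helper names and the toy rows are hypothetical, and the exact column layout of the file Swan writes is not shown on this page, so adjust the column names to match your file.

```python
import csv
import os
import tempfile


def save_edge_table(rows, prefix):
    """Write rows (a list of dicts) to '{prefix}_edge_abundance.tsv'."""
    path = f"{prefix}_edge_abundance.tsv"
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()),
                                delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return path


def load_edge_table(path):
    """Read a TSV back into a list of dicts (all values are strings)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical toy rows written to a temporary directory.
toy_rows = [
    {"edge_type": "exon", "chrom": "chr1", "start": 11869, "stop": 12227},
    {"edge_type": "intron", "chrom": "chr1", "start": 12057, "stop": 12179},
]
toy_prefix = os.path.join(tempfile.mkdtemp(), "toy")
toy_path = save_edge_table(toy_rows, toy_prefix)
toy_loaded = load_edge_table(toy_path)

print(os.path.basename(toy_path))
print(toy_loaded[1]["edge_type"])
```

Note that values read with the csv module come back as strings, so numeric columns such as start and stop need to be converted before doing arithmetic on them.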
Obtaining TSS/TES abundance information
Similarly, if you wish to do analysis involving your TSS or TES data, you can output it using the get_tss_abundance() and get_tes_abundance() functions, respectively. These have the same options as get_edge_abundance(), so they can output either counts or TPM and optionally save to an output file.
First, let's output the TSS TPM to a file:
| tss_id | gid | gname | vertex_id | tss_name | chrom | coord | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
0 | ENSG00000000003.14_1 | ENSG00000000003.14 | TSPAN6 | 926111 | TSPAN6_1 | chrX | 100636191 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
1 | ENSG00000000003.14_2 | ENSG00000000003.14 | TSPAN6 | 926112 | TSPAN6_2 | chrX | 100636608 | 232.16391 | 276.976654 | 45.962341 | 53.301723 | 68.30146 |
2 | ENSG00000000003.14_3 | ENSG00000000003.14 | TSPAN6 | 926114 | TSPAN6_3 | chrX | 100636793 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
3 | ENSG00000000003.14_4 | ENSG00000000003.14 | TSPAN6 | 926117 | TSPAN6_4 | chrX | 100639945 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
4 | ENSG00000000005.5_1 | ENSG00000000005.5 | TNMD | 926077 | TNMD_1 | chrX | 100585066 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
Now we'll get the counts of each TES without saving to a file:
| tes_id | gid | gname | vertex_id | tes_name | chrom | coord | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
0 | ENSG00000000003.14_1 | ENSG00000000003.14 | TSPAN6 | 926092 | TSPAN6_1 | chrX | 100627109 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | ENSG00000000003.14_2 | ENSG00000000003.14 | TSPAN6 | 926093 | TSPAN6_2 | chrX | 100628670 | 116.0 | 235.0 | 35.0 | 42.0 | 42.0 |
2 | ENSG00000000003.14_3 | ENSG00000000003.14 | TSPAN6 | 926097 | TSPAN6_3 | chrX | 100632063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | ENSG00000000003.14_4 | ENSG00000000003.14 | TSPAN6 | 926100 | TSPAN6_4 | chrX | 100632541 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | ENSG00000000003.14_5 | ENSG00000000003.14 | TSPAN6 | 926103 | TSPAN6_5 | chrX | 100633442 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |