Additional utilities
Swan comes with several utilities that can be used to compute and output various metrics using data in the SwanGraph.
We'll be using the same SwanGraph as the rest of the tutorial pages to demonstrate these utilities. Load it using the following code:
Calculating TPM values
Swan lets users calculate the TPM of their data, grouped by a metadata column of their choice, using the calc_tpm() function. You can use this to calculate TPM for any of the AnnData objects in the SwanGraph (SwanGraph.adata for transcripts, SwanGraph.tss_adata for TSSs, SwanGraph.tes_adata for TESs, and SwanGraph.edge_adata for edges); see the Data structure FAQ page for more information on these tables.
First, we'll calculate the TPM for each transcript in each dataset:
| tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hepg2_1 | 196.138474 | 86.060760 | 8.005652 | 46.032497 | 0.0 | 16.011305 | 258.182281 | 60.042389 | 0.0 | 0.0 | ... | 0.000000 | 4.002826 | 2.001413 | 12.008478 | 0.000000 | 0.000000 | 4.002826 | 14.009891 | 8.005652 | 0.000000 |
| hepg2_2 | 243.975174 | 77.789185 | 7.071744 | 61.288448 | 0.0 | 12.964864 | 380.695557 | 64.824318 | 0.0 | 0.0 | ... | 1.178624 | 14.143488 | 4.714496 | 7.071744 | 2.357248 | 8.250368 | 2.357248 | 11.786240 | 10.607616 | 1.178624 |
| hffc6_1 | 131.320969 | 194.355042 | 0.000000 | 107.683197 | 0.0 | 6.566049 | 278.400452 | 0.000000 | 0.0 | 0.0 | ... | 6.566049 | 13.132097 | 9.192468 | 1.313210 | 6.566049 | 9.192468 | 6.566049 | 0.000000 | 15.758516 | 1.313210 |
| hffc6_2 | 137.061584 | 242.395935 | 0.000000 | 124.370689 | 0.0 | 8.883621 | 219.552338 | 0.000000 | 0.0 | 0.0 | ... | 15.229064 | 10.152709 | 6.345443 | 8.883621 | 1.269089 | 10.152709 | 15.229064 | 0.000000 | 16.498154 | 8.883621 |
| hffc6_3 | 147.986496 | 273.205841 | 3.252450 | 172.379868 | 0.0 | 9.757351 | 200.025696 | 1.626225 | 0.0 | 0.0 | ... | 14.636026 | 11.383576 | 8.131125 | 8.131125 | 11.383576 | 11.383576 | 6.504900 | 0.000000 | 24.393377 | 9.757351 |
5 rows × 208306 columns
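To make the normalization concrete, here is a minimal pure-Python sketch of a TPM-style calculation: each count is divided by the total counts in its dataset and scaled to one million. This is an illustration only, not Swan's implementation; the function name `calc_tpm_sketch` and the toy counts are hypothetical. The TPM and count values shown on this page appear consistent with this kind of per-dataset scaling.

```python
def calc_tpm_sketch(counts):
    """Scale raw counts to transcripts per million within each dataset.

    counts: dict mapping dataset name -> dict of feature ID -> raw count.
    Hypothetical helper for illustration; not part of Swan.
    """
    tpm = {}
    for dataset, feature_counts in counts.items():
        total = sum(feature_counts.values())
        tpm[dataset] = {
            # Guard against an empty dataset to avoid dividing by zero.
            feature: (count * 1e6 / total if total else 0.0)
            for feature, count in feature_counts.items()
        }
    return tpm


# Hypothetical toy counts (not taken from the tutorial data).
toy_counts = {
    "sample_a": {"tx1": 30, "tx2": 10, "tx3": 60},
    "sample_b": {"tx1": 0, "tx2": 5, "tx3": 15},
}
toy_tpm = calc_tpm_sketch(toy_counts)

for dataset, values in toy_tpm.items():
    print(dataset, values)
```

Within each dataset the resulting values sum to one million, which is what makes TPM comparable across datasets of different sequencing depth.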
We can swap out the first argument with the different AnnData structures in the SwanGraph. For instance, say we want to calculate the TPM of each TSS:
| tss_id | ENSG00000000003.14_1 | ENSG00000000003.14_2 | ENSG00000000003.14_3 | ENSG00000000003.14_4 | ENSG00000000005.5_1 | ENSG00000000005.5_2 | ENSG00000000419.12_1 | ENSG00000000419.12_2 | ENSG00000000457.13_1 | ENSG00000000457.13_2 | ... | TALONG000085596_1 | TALONG000085799_1 | TALONG000085978_1 | TALONG000086022_1 | TALONG000086057_1 | TALONG000086218_1 | TALONG000086443_1 | TALONG000086539_1 | TALONG000086553_1 | TALONG000086766_1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hepg2_1 | 0.0 | 232.163910 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 54.038151 | 0.0 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 60.042389 | 0.000000 | 0.000000 | 6.004239 | 0.000000 | 0.000000 | 8.005652 |
| hepg2_2 | 0.0 | 276.976654 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 103.718910 | 0.0 | 2.357248 | ... | 0.000000 | 0.000000 | 0.000000 | 95.468544 | 0.000000 | 0.000000 | 31.822847 | 0.000000 | 1.178624 | 10.607616 |
| hffc6_1 | 0.0 | 45.962341 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 101.117149 | 0.0 | 0.000000 | ... | 2.626419 | 6.566049 | 9.192468 | 0.000000 | 7.879258 | 11.818888 | 233.751328 | 9.192468 | 6.566049 | 15.758516 |
| hffc6_2 | 0.0 | 53.301723 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 85.028938 | 0.0 | 1.269089 | ... | 6.345443 | 1.269089 | 12.690886 | 0.000000 | 8.883621 | 20.305418 | 119.294334 | 12.690886 | 2.538177 | 16.498154 |
| hffc6_3 | 0.0 | 68.301460 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 89.442383 | 0.0 | 1.626225 | ... | 8.131125 | 8.131125 | 11.383576 | 0.000000 | 11.383576 | 17.888477 | 134.976685 | 27.645828 | 8.131125 | 24.393377 |
5 rows × 130176 columns
And finally, we can use an alternative metadata column to compute TPM on. For instance, we can use the cell_line column:
| tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hepg2 | 226.245346 | 80.854897 | 7.417881 | 55.634102 | 0.0 | 14.093972 | 335.288177 | 63.051983 | 0.0 | 0.0 | ... | 0.741788 | 10.385033 | 3.70894 | 8.901457 | 1.483576 | 5.192516 | 2.967152 | 12.610396 | 9.643245 | 0.741788 |
| hffc6 | 138.145737 | 234.247116 | 0.924052 | 132.139404 | 0.0 | 8.316465 | 234.709137 | 0.462026 | 0.0 | 0.0 | ... | 12.012672 | 11.550647 | 7.85444 | 6.006336 | 6.006336 | 10.164569 | 9.702543 | 0.000000 | 18.481035 | 6.468362 |
2 rows × 208306 columns
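Grouping by a metadata column can be pictured as pooling the counts of all datasets that share a metadata value and then normalizing the pooled counts. The sketch below illustrates that idea in plain Python; it is an assumption about the general approach rather than Swan's code, and `grouped_tpm_sketch`, the dataset names, and the counts are all hypothetical. The cell_line values shown above appear consistent with pooling counts before normalizing, rather than averaging the per-dataset TPMs.

```python
from collections import defaultdict


def grouped_tpm_sketch(counts, metadata):
    """Pool counts by a metadata value, then scale each pool to one million.

    counts:   dict of dataset name -> dict of feature ID -> raw count.
    metadata: dict of dataset name -> group label (e.g. a cell line).
    Hypothetical helper for illustration; not part of Swan.
    """
    pooled = defaultdict(lambda: defaultdict(float))
    for dataset, feature_counts in counts.items():
        group = metadata[dataset]
        for feature, count in feature_counts.items():
            pooled[group][feature] += count

    tpm = {}
    for group, feature_counts in pooled.items():
        total = sum(feature_counts.values())
        tpm[group] = {
            feature: (count * 1e6 / total if total else 0.0)
            for feature, count in feature_counts.items()
        }
    return tpm


# Hypothetical toy data: two replicates of one line, one of another.
toy_counts = {
    "rep1": {"tx1": 30, "tx2": 70},
    "rep2": {"tx1": 10, "tx2": 90},
    "rep3": {"tx1": 5, "tx2": 15},
}
toy_metadata = {"rep1": "line_x", "rep2": "line_x", "rep3": "line_y"}

grouped = grouped_tpm_sketch(toy_counts, toy_metadata)
print(grouped["line_x"])
print(grouped["line_y"])
```

Pooling before normalizing weights each replicate by its sequencing depth, so a deeply sequenced replicate contributes more to the group value than a shallow one.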
Calculating pi values
You can use the calc_pi() function to calculate percent isoform use (pi) per gene in nearly the same way as calc_tpm(): you can run it at the transcript, edge, TSS, or TES level, and you can choose the metadata variable to group by. The only difference is that calc_pi() also requires a DataFrame as its second argument, which tells Swan which gene each entry comes from. The DataFrame to provide for each AnnData is listed here:
| AnnData | DataFrame |
| --- | --- |
| SwanGraph.adata | SwanGraph.t_df |
| SwanGraph.tss_adata | SwanGraph.tss_adata.var |
| SwanGraph.tes_adata | SwanGraph.tes_adata.var |
First, we'll calculate the pi value for each transcript in each dataset:
| tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hepg2_1 | 100.000000 | 100.0 | 100.000000 | 100.0 | 0.0 | 100.000000 | 100.0 | 93.750000 | 0.0 | 0.0 | ... | 0.000000 | 1.904762 | 6.666667 | 13.043478 | 0.000000 | 0.000000 | 1.333333 | 100.0 | 100.0 | 0.000000 |
| hepg2_2 | 99.519226 | 100.0 | 60.000004 | 100.0 | 0.0 | 100.000000 | 100.0 | 80.882355 | 0.0 | 0.0 | ... | 5.263158 | 3.225806 | 13.793103 | 8.695652 | 0.884956 | 3.097345 | 0.884956 | 100.0 | 100.0 | 2.380952 |
| hffc6_1 | 98.039215 | 100.0 | 0.000000 | 100.0 | 0.0 | 100.000000 | 100.0 | 0.000000 | 0.0 | 0.0 | ... | 2.604167 | 2.092050 | 16.279070 | 1.428571 | 0.854701 | 1.196581 | 0.854701 | 0.0 | 100.0 | 1.886792 |
| hffc6_2 | 99.082573 | 100.0 | 0.000000 | 100.0 | 0.0 | 77.777779 | 100.0 | 0.000000 | 0.0 | 0.0 | ... | 4.285715 | 2.144772 | 11.627908 | 14.893617 | 0.166667 | 1.333333 | 2.000000 | 0.0 | 100.0 | 9.859155 |
| hffc6_3 | 100.000000 | 100.0 | 100.000000 | 100.0 | 0.0 | 85.714287 | 100.0 | 100.000000 | 0.0 | 0.0 | ... | 4.326923 | 2.536232 | 15.151516 | 10.638298 | 1.711491 | 1.711491 | 0.977995 | 0.0 | 100.0 | 13.636364 |
5 rows × 208306 columns
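The role of the second argument is easier to see with a small worked example. The sketch below computes percent isoform use as each isoform's share of its gene's counts within a dataset, using a transcript-to-gene mapping in place of the DataFrame described above. It is a simplified illustration, not Swan's implementation: `calc_pi_sketch`, the IDs, and the counts are hypothetical, and returning 0.0 for a gene with no counts is an assumption made here for convenience.

```python
def calc_pi_sketch(counts, gene_of):
    """Percent isoform use: 100 * isoform counts / total counts of its gene.

    counts:  dict of dataset name -> dict of transcript ID -> raw count.
    gene_of: dict of transcript ID -> gene ID (stands in for the DataFrame
             that tells Swan which gene each entry belongs to).
    Hypothetical helper for illustration; not part of Swan.
    """
    pi = {}
    for dataset, tx_counts in counts.items():
        # First pass: total counts per gene in this dataset.
        gene_totals = {}
        for tx, count in tx_counts.items():
            gene = gene_of[tx]
            gene_totals[gene] = gene_totals.get(gene, 0) + count

        # Second pass: each isoform's percentage of its gene total.
        pi[dataset] = {}
        for tx, count in tx_counts.items():
            total = gene_totals[gene_of[tx]]
            # Assumption: report 0.0 when the gene has no counts at all.
            pi[dataset][tx] = 100.0 * count / total if total else 0.0
    return pi


# Hypothetical toy data: gene_a has two isoforms, gene_b has one.
toy_gene_of = {"tx1": "gene_a", "tx2": "gene_a", "tx3": "gene_b"}
toy_counts = {
    "sample_a": {"tx1": 3, "tx2": 1, "tx3": 0},
    "sample_b": {"tx1": 0, "tx2": 8, "tx3": 5},
}
toy_pi = calc_pi_sketch(toy_counts, toy_gene_of)
print(toy_pi)
```

Because pi is a within-gene proportion, the values for all isoforms of an expressed gene sum to 100 in each dataset, regardless of how highly the gene itself is expressed.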
Note that calc_pi() outputs not only a table of pi values but also a table of counts per isoform per condition, which is used as an intermediate during DIE testing. It is returned here so that it does not need to be recalculated.
|  | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hepg2_1 | 98.0 | 43.0 | 4.0 | 23.0 | 0.0 | 8.0 | 129.0 | 30.0 | 0.0 | 0.0 | ... | 0.0 | 2.0 | 1.0 | 6.0 | 0.0 | 0.0 | 2.0 | 7.0 | 4.0 | 0.0 |
| hepg2_2 | 207.0 | 66.0 | 6.0 | 52.0 | 0.0 | 11.0 | 323.0 | 55.0 | 0.0 | 0.0 | ... | 1.0 | 12.0 | 4.0 | 6.0 | 2.0 | 7.0 | 2.0 | 10.0 | 9.0 | 1.0 |
| hffc6_1 | 100.0 | 148.0 | 0.0 | 82.0 | 0.0 | 5.0 | 212.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 10.0 | 7.0 | 1.0 | 5.0 | 7.0 | 5.0 | 0.0 | 12.0 | 1.0 |
| hffc6_2 | 108.0 | 191.0 | 0.0 | 98.0 | 0.0 | 7.0 | 173.0 | 0.0 | 0.0 | 0.0 | ... | 12.0 | 8.0 | 5.0 | 7.0 | 1.0 | 8.0 | 12.0 | 0.0 | 13.0 | 7.0 |
| hffc6_3 | 91.0 | 168.0 | 2.0 | 106.0 | 0.0 | 6.0 | 123.0 | 1.0 | 0.0 | 0.0 | ... | 9.0 | 7.0 | 5.0 | 5.0 | 7.0 | 7.0 | 4.0 | 0.0 | 15.0 | 6.0 |
5 rows × 208306 columns
We can also calculate the pi value for the TSSs and TESs in each dataset:
And we can also choose to calculate pi values using a different metadata column, here shown on the cell_line column:
| tid | ENST00000000233.9 | ENST00000000412.7 | ENST00000000442.10 | ENST00000001008.5 | ENST00000001146.6 | ENST00000002125.8 | ENST00000002165.10 | ENST00000002501.10 | ENST00000002596.5 | ENST00000002829.7 | ... | TALONT000482711 | TALONT000482903 | TALONT000483195 | TALONT000483284 | TALONT000483315 | TALONT000483322 | TALONT000483327 | TALONT000483978 | TALONT000484004 | TALONT000484796 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hepg2 | 99.673203 | 100.0 | 71.428574 | 100.0 | 0.0 | 100.000000 | 100.0 | 85.0 | 0.0 | 0.0 | ... | 4.545455 | 2.935011 | 11.363637 | 10.434782 | 0.531915 | 1.861702 | 1.06383 | 100.0 | 100.0 | 1.785714 |
| hffc6 | 99.006622 | 100.0 | 100.000000 | 100.0 | 0.0 | 85.714287 | 100.0 | 100.0 | 0.0 | 0.0 | ... | 3.823529 | 2.218279 | 14.285715 | 7.926829 | 0.815558 | 1.380176 | 1.31744 | 0.0 | 100.0 | 8.333334 |
2 rows × 208306 columns
Obtaining edge abundance information
In case you're interested in doing outside analyses at the edge level (for instance, using intron counting to assess alternative splicing), Swan provides a tool to output a DataFrame with edge abundance on the dataset level.
If you just want access to the edge abundance DataFrame, use the get_edge_abundance() function:
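|  | strand | edge_type | annotation | chrom | start | stop | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | + | exon | True | chr1 | 11869 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | + | exon | True | chr1 | 12010 | 12057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | + | intron | True | chr1 | 12057 | 12179 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | + | exon | True | chr1 | 12179 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | + | intron | True | chr1 | 12227 | 12613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |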
You can also specify whether you want the data to be output in raw counts (kind='counts') or TPM (kind='tpm'). By default, this function returns counts. Here's an example with TPM:
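|  | strand | edge_type | annotation | chrom | start | stop | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | + | exon | True | chr1 | 11869 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | + | exon | True | chr1 | 12010 | 12057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | + | intron | True | chr1 | 12057 | 12179 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | + | exon | True | chr1 | 12179 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | + | intron | True | chr1 | 12227 | 12613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |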
And finally, if you wish, you can provide the function with a prefix value, which indicates that you want the output DataFrame to be saved as a TSV file.
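|  | strand | edge_type | annotation | chrom | start | stop | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | + | exon | True | chr1 | 11869 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | + | exon | True | chr1 | 12010 | 12057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | + | intron | True | chr1 | 12057 | 12179 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | + | exon | True | chr1 | 12179 | 12227 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | + | intron | True | chr1 | 12227 | 12613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |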
The results will be saved in '{prefix}_edge_abundance.tsv'.
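For outside analyses such as intron counting, the saved file can be read back with standard tools. The sketch below uses only the Python standard library to load a file named with the '{prefix}_edge_abundance.tsv' pattern and keep the intron rows. It assumes the TSV has a header row with the column names shown in the tables above; the exact layout Swan writes (for example, whether an index column is included) is not confirmed here, so the example writes its own stand-in file from the rows displayed on this page. The helper names are hypothetical.

```python
import csv
import os
import tempfile


def read_edge_abundance(prefix):
    """Read '{prefix}_edge_abundance.tsv' into a list of row dicts."""
    path = f"{prefix}_edge_abundance.tsv"
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def intron_rows(rows):
    """Keep only the rows whose edge_type is 'intron'."""
    return [row for row in rows if row["edge_type"] == "intron"]


# Stand-in file built from the five rows shown on this page.
# Assumption: header names match the DataFrame columns.
columns = ["strand", "edge_type", "annotation", "chrom", "start", "stop",
           "hepg2_1", "hepg2_2", "hffc6_1", "hffc6_2", "hffc6_3"]
zeros = ["0.0"] * 5
example_rows = [
    ["+", "exon", "True", "chr1", "11869", "12227"] + zeros,
    ["+", "exon", "True", "chr1", "12010", "12057"] + zeros,
    ["+", "intron", "True", "chr1", "12057", "12179"] + zeros,
    ["+", "exon", "True", "chr1", "12179", "12227"] + zeros,
    ["+", "intron", "True", "chr1", "12227", "12613"] + zeros,
]

with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "example")
    with open(f"{prefix}_edge_abundance.tsv", "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(columns)
        writer.writerows(example_rows)

    introns = intron_rows(read_edge_abundance(prefix))

for row in introns:
    print(row["chrom"], row["start"], row["stop"], row["hepg2_1"])
```

The csv module returns every field as a string, so abundance columns need to be converted with float() before any arithmetic.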
Obtaining TSS/TES abundance information
Similarly, if you wish to do analysis involving your TSS or TES data, you can output these using the get_tss_abundance() and get_tes_abundance() functions, respectively. These have the same options as get_edge_abundance(), so they can output either counts or TPM and optionally save to an output file.
First, let's output the TSS TPM to a file:
|  | tss_id | gid | gname | vertex_id | tss_name | chrom | coord | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000000003.14_1 | ENSG00000000003.14 | TSPAN6 | 926111 | TSPAN6_1 | chrX | 100636191 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| 1 | ENSG00000000003.14_2 | ENSG00000000003.14 | TSPAN6 | 926112 | TSPAN6_2 | chrX | 100636608 | 232.16391 | 276.976654 | 45.962341 | 53.301723 | 68.30146 |
| 2 | ENSG00000000003.14_3 | ENSG00000000003.14 | TSPAN6 | 926114 | TSPAN6_3 | chrX | 100636793 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| 3 | ENSG00000000003.14_4 | ENSG00000000003.14 | TSPAN6 | 926117 | TSPAN6_4 | chrX | 100639945 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| 4 | ENSG00000000005.5_1 | ENSG00000000005.5 | TNMD | 926077 | TNMD_1 | chrX | 100585066 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
Now we'll get the counts of each TES without saving to a file:
|  | tes_id | gid | gname | vertex_id | tes_name | chrom | coord | hepg2_1 | hepg2_2 | hffc6_1 | hffc6_2 | hffc6_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000000003.14_1 | ENSG00000000003.14 | TSPAN6 | 926092 | TSPAN6_1 | chrX | 100627109 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | ENSG00000000003.14_2 | ENSG00000000003.14 | TSPAN6 | 926093 | TSPAN6_2 | chrX | 100628670 | 116.0 | 235.0 | 35.0 | 42.0 | 42.0 |
| 2 | ENSG00000000003.14_3 | ENSG00000000003.14 | TSPAN6 | 926097 | TSPAN6_3 | chrX | 100632063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | ENSG00000000003.14_4 | ENSG00000000003.14 | TSPAN6 | 926100 | TSPAN6_4 | chrX | 100632541 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | ENSG00000000003.14_5 | ENSG00000000003.14 | TSPAN6 | 926103 | TSPAN6_5 | chrX | 100633442 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |