Analysis
Swan has several analysis options to use.
Differential expression tests
Swan's old differential gene and transcript expression tests using diffxpy
have now been deprecated as it seems that the library is unsupported. I recommend that users interested in running differential gene or transcript expression test either use PyDESeq2 or Scanpy's rank_genes_groups
test, which both support the AnnData format for simple compatibility with Swan's AnnData expression representation.
Using PyDESeq2
Please read the PyDESeq2 documentation for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example on how to find differentially expressed transcripts between cell lines.
tid
TALONT000296400
3.497574
2.416436
1.472616
1.640914
0.100815
0.193840
ENST00000591581.1
0.162115
0.624956
5.002765
0.124922
0.900585
NaN
ENST00000546893.5
8.173445
-0.053334
0.793179
-0.067241
0.946390
0.968527
ENST00000537289.1
5.140216
-1.248175
0.978927
-1.275044
0.202294
0.328286
ENST00000258382.9
2.012104
-0.011819
1.493866
-0.007912
0.993687
NaN
...
...
...
...
...
...
...
ENST00000506914.1
1.022692
0.947588
2.272813
0.416923
0.676735
NaN
ENST00000571080.1
0.473263
-2.824409
3.426473
-0.824291
0.409774
NaN
ENST00000378615.7
1.334873
-1.199810
1.790781
-0.669993
0.502862
NaN
ENST00000409586.7
0.338615
1.464364
4.023896
0.363917
0.715920
NaN
ENST00000370278.3
1.328549
-1.201246
1.962358
-0.612144
0.540443
NaN
34814 rows × 6 columns
tid
TALONT000296400
3.497574
2.416436
1.472616
1.640914
0.100815
0.193840
ENST00000591581.1
0.162115
0.624956
5.002765
0.124922
0.900585
NaN
ENST00000546893.5
8.173445
-0.053334
0.793179
-0.067241
0.946390
0.968527
ENST00000537289.1
5.140216
-1.248175
0.978927
-1.275044
0.202294
0.328286
ENST00000258382.9
2.012104
-0.011819
1.493866
-0.007912
0.993687
NaN
Isoform switching / Differential isoform expression testing
Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in Joglekar et. al., 2021. DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.
Pairwise comparisons can be set up using different columns in the metadata that was added to the SwanGraph with the obs_col
and obs_conditions
arguments.
dataset
hepg2_1
hepg2
1
hepg2_1
499647.0
hepg2_2
hepg2
2
hepg2_2
848447.0
hffc6_1
hffc6
1
hffc6_1
761493.0
hffc6_2
hffc6
2
hffc6_2
787967.0
hffc6_3
hffc6
3
hffc6_3
614921.0
The resultant table contains an entry for each gene with the p value (p_val
), adjusted p value (adj_p_val
), and change in percent isoform usage for the top two isoforms (dpi
), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in Joglekar et. al., 2021.
0
ENSG00000130175
0.206667
39.264992
TALONT000296399
NaN
39.264992
NaN
TALONT000296400
ENST00000589838.5
-20.116056
-10.638298
0.469420
PRKCSH
1
ENSG00000130202
0.000367
11.893251
ENST00000252485.8
ENST00000252483.9
9.684967
2.208281
TALONT000406668
ENST00000591581.1
-11.473083
-0.420168
0.003560
NECTIN2
2
ENSG00000111371
0.680435
9.401713
ENST00000398637.9
ENST00000439706.5
8.547012
0.854701
ENST00000546893.5
NaN
-9.401711
NaN
0.886243
SLC38A1
3
ENSG00000181924
0.028195
9.452934
ENST00000537289.1
ENST00000545127.1
7.298603
2.154328
ENST00000355693.4
NaN
-9.452934
NaN
0.122619
COA4
4
ENSG00000163468
0.255788
0.680048
ENST00000295688.7
TALONT000476055
0.366966
0.277693
ENST00000368259.6
ENST00000489870.1
-0.568503
-0.111545
0.525066
CCT3
Differential isoform expression testing results are stored automatically in sg.adata.uns
, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:
0
ENSG00000130175
0.206667
39.264992
TALONT000296399
NaN
39.264992
NaN
TALONT000296400
ENST00000589838.5
-20.116056
-10.638298
0.46942
PRKCSH
1
ENSG00000130202
0.000367
11.893251
ENST00000252485.8
ENST00000252483.9
9.684967
2.208281
TALONT000406668
ENST00000591581.1
-11.473083
-0.420168
0.00356
NECTIN2
Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds.
1
ENSG00000130202
3.674234e-04
11.893251
ENST00000252485.8
ENST00000252483.9
9.684967
2.208281
TALONT000406668
ENST00000591581.1
-11.473083
-0.420168
3.560035e-03
NECTIN2
8
ENSG00000025156
1.857411e-03
35.714287
ENST00000452194.5
ENST00000465214.2
25.714287
10.000000
ENST00000368455.8
NaN
-35.714287
NaN
1.405048e-02
HSF2
20
ENSG00000105254
4.779408e-38
25.419458
ENST00000585910.5
TALONT000366329
20.394853
4.702971
ENST00000221855.7
ENST00000589996.5
-25.016304
-0.403154
7.343219e-36
TBCB
23
ENSG00000105379
1.802782e-307
81.136333
ENST00000354232.8
NaN
81.136333
NaN
ENST00000309244.8
ENST00000596253.1
-80.857935
-0.278394
1.551113e-304
ETFB
28
ENSG00000148180
0.000000e+00
89.319039
ENST00000373818.8
TALONT000419680
85.245850
4.073189
TALONT000418752
ENST00000373808.8
-68.603180
-7.768958
0.000000e+00
GSN
Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the kind
argument to die_gene_test
.
0
ENSG00000000419
0.249561
10.002187
ENSG00000000419_2
ENSG00000000419_1
9.906052
0.096133
ENSG00000000419_3
ENSG00000000419_4
-7.218704
-2.783483
0.536906
DPM1
1
ENSG00000001461
0.852914
9.093739
ENSG00000001461_5
NaN
9.093739
NaN
ENSG00000001461_1
ENSG00000001461_3
-5.543442
-1.775148
1.000000
NIPAL3
2
ENSG00000001497
0.891496
3.846153
ENSG00000001497_1
NaN
3.846146
NaN
ENSG00000001497_2
NaN
-3.846153
NaN
1.000000
LAS1L
3
ENSG00000001630
0.286866
2.580168
ENSG00000001630_3
NaN
2.580168
NaN
ENSG00000001630_2
ENSG00000001630_1
-1.549232
-1.030928
0.582636
CYP51A1
4
ENSG00000002330
0.184635
12.817431
ENSG00000002330_2
ENSG00000002330_1
11.868484
0.948944
ENSG00000002330_4
ENSG00000002330_3
-12.603584
-0.213847
0.454053
BAD
To access the die results on the tss level, use die_kind='tss'
as input to make_uns_key()
.
0
ENSG00000000419
0.249561
10.002187
ENSG00000000419_2
ENSG00000000419_1
9.906052
0.096133
ENSG00000000419_3
ENSG00000000419_4
-7.218704
-2.783483
0.536906
DPM1
1
ENSG00000001461
0.852914
9.093739
ENSG00000001461_5
NaN
9.093739
NaN
ENSG00000001461_1
ENSG00000001461_3
-5.543442
-1.775148
1.000000
NIPAL3
And provide the kind='tss'
option to get_die_genes()
when trying to filter your test results.
10
ENSG00000003402
7.338098e-15
84.848486
ENSG00000003402_3
ENSG00000003402_7
77.272728
7.575758
ENSG00000003402_1
ENSG00000003402_5
-31.717173
-22.222223
4.390279e-13
CFLAR
11
ENSG00000003436
3.812087e-06
34.072281
ENSG00000003436_1
ENSG00000003436_3
28.073771
4.564083
ENSG00000003436_4
NaN
-34.072281
NaN
7.064168e-05
TFPI
16
ENSG00000004487
1.135407e-03
75.714287
ENSG00000004487_1
NaN
75.714287
NaN
ENSG00000004487_2
NaN
-75.714285
NaN
1.020405e-02
KDM1A
33
ENSG00000006282
4.544048e-03
19.819595
ENSG00000006282_2
ENSG00000006282_1
13.679245
6.140350
ENSG00000006282_3
NaN
-19.819595
NaN
3.236475e-02
SPATA20
43
ENSG00000007376
1.256379e-04
27.898551
ENSG00000007376_3
ENSG00000007376_1
23.454107
4.444445
ENSG00000007376_5
ENSG00000007376_7
-14.130434
-7.512077
1.525134e-03
RPUSD1
For TESs, use kind='tes'
as input to die_genes_test()
, die_kind='tes'
to make_uns_key()
, and kind='tes'
to get_die_genes()
.
0
ENSG00000000419
1.000000
0.096133
ENSG00000000419_2
NaN
0.096133
NaN
ENSG00000000419_1
NaN
-0.096130
NaN
1.000000
DPM1
1
ENSG00000001461
0.852914
9.093739
ENSG00000001461_3
NaN
9.093739
NaN
ENSG00000001461_4
ENSG00000001461_1
-5.543442
-1.775148
1.000000
NIPAL3
2
ENSG00000001630
0.286866
2.580168
ENSG00000001630_2
NaN
2.580168
NaN
ENSG00000001630_1
ENSG00000001630_3
-1.549232
-1.030928
0.618077
CYP51A1
3
ENSG00000002330
0.184635
12.817431
ENSG00000002330_1
ENSG00000002330_4
11.868484
0.948944
ENSG00000002330_2
ENSG00000002330_3
-12.603584
-0.213847
0.480860
BAD
4
ENSG00000002549
0.679148
0.694543
ENSG00000002549_2
NaN
0.694542
NaN
ENSG00000002549_1
NaN
-0.694543
NaN
0.926889
LAP3
0
ENSG00000000419
1.000000
0.096133
ENSG00000000419_2
NaN
0.096133
NaN
ENSG00000000419_1
NaN
-0.096130
NaN
1.0
DPM1
1
ENSG00000001461
0.852914
9.093739
ENSG00000001461_3
NaN
9.093739
NaN
ENSG00000001461_4
ENSG00000001461_1
-5.543442
-1.775148
1.0
NIPAL3
9
ENSG00000003402
7.338098e-15
84.848486
ENSG00000003402_5
ENSG00000003402_6
77.272728
7.575758
ENSG00000003402_9
ENSG00000003402_7
-31.717173
-22.222223
5.296535e-13
CFLAR
10
ENSG00000003436
2.772096e-07
32.637854
ENSG00000003436_1
ENSG00000003436_4
22.082711
10.555142
ENSG00000003436_2
ENSG00000003436_5
-30.435921
-1.818182
7.003007e-06
TFPI
14
ENSG00000004487
1.135407e-03
75.714287
ENSG00000004487_2
NaN
75.714287
NaN
ENSG00000004487_1
NaN
-75.714285
NaN
1.141621e-02
KDM1A
32
ENSG00000006282
6.858641e-03
19.819595
ENSG00000006282_2
NaN
19.819595
NaN
ENSG00000006282_1
NaN
-19.819595
NaN
4.915014e-02
SPATA20
60
ENSG00000010278
3.191221e-22
39.024387
ENSG00000010278_2
NaN
39.024387
NaN
ENSG00000010278_3
ENSG00000010278_1
-38.605976
-0.418410
3.486194e-20
CD9
Finally, for transcriptomes generated with Cerberus, Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with kind='ic'
as input to the die_gene_test()
function.
0
ENCODEMG000055991
4.100695e-01
6.071428
ENCODEMG000055991_2
ENCODEMG000055991_3
3.571428
2.500000
ENCODEMG000055991_1
NaN
-6.071426
NaN
0.707992
ENCODEMG000055991
1
ENCODEMG000055998
6.190162e-01
9.722223
ENCODEMG000055998_2
NaN
9.722223
NaN
ENCODEMG000055998_1
NaN
-9.722218
NaN
0.836200
ENCODEMG000055998
2
ENCODEMG000056718
8.081718e-07
6.289159
NaN
NaN
2.329770
1.953835
NaN
NaN
-3.339438
-2.949721
0.000020
ENCODEMG000056718
3
ENCODEMG000056804
2.107047e-03
27.863049
ENCODEMG000056804_1
NaN
27.863047
NaN
ENCODEMG000056804_2
NaN
-27.863049
NaN
0.021510
ENCODEMG000056804
4
ENCODEMG000063411
7.699701e-01
12.500000
ENCODEMG000063411_1
NaN
12.500000
NaN
ENCODEMG000063411_2
NaN
-12.500000
NaN
0.919808
ENCODEMG000063411
Combining metadata columns
What if none of the metadata columns you have summarize the comparisons you want to make? What if I want to find differentially expressed genes or transcripts, or find isoform-switching genes between hffc6 replicate 3 and hepg2 replicate 1?
dataset
hepg2_1
hepg2
1
hepg2_1
499647.0
hepg2_2
hepg2
2
hepg2_2
848447.0
hffc6_1
hffc6
1
hffc6_1
761493.0
hffc6_2
hffc6
2
hffc6_2
787967.0
hffc6_3
hffc6
3
hffc6_3
614921.0
Let's ignore for a moment the fact that the dataset
column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:
The added column in col_name
can then be used as the obs_col
input to die_gene_test()
as follows:
Exon skipping and intron retention
Swan can detect novel (unannotated) exon skipping and intron retention events.
To obtain a dataframe of novel exon skipping events, run the following code:
0
ENSG00000157916.19
TALONT000218256
952616
0
ENSG00000122406.13
TALONT000425229
952716
0
ENSG00000224093.5
TALONT000434035
952720
0
ENSG00000224093.5
TALONT000434035
952720
0
ENSG00000224093.5
TALONT000434035
952720
To obtain a list of genes containing novel intron retention events, run the following code:
You can pass gene IDs from es_df
into gen_report()
or plot_graph()
to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.
Last updated