Analysis
Swan has several analysis options to use.
Differential expression tests
Swan's old differential gene and transcript expression tests using diffxpy
have now been deprecated as it seems that the library is unsupported. I recommend that users interested in running differential gene or transcript expression test either use PyDESeq2 or Scanpy's rank_genes_groups
test, which both support the AnnData format for simple compatibility with Swan's AnnData expression representation.
Using PyDESeq2
Please read the PyDESeq2 documentation for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example on how to find differentially expressed transcripts between cell lines.
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
tid | ||||||
TALONT000296400 | 3.497574 | 2.416436 | 1.472616 | 1.640914 | 0.100815 | 0.193840 |
ENST00000591581.1 | 0.162115 | 0.624956 | 5.002765 | 0.124922 | 0.900585 | NaN |
ENST00000546893.5 | 8.173445 | -0.053334 | 0.793179 | -0.067241 | 0.946390 | 0.968527 |
ENST00000537289.1 | 5.140216 | -1.248175 | 0.978927 | -1.275044 | 0.202294 | 0.328286 |
ENST00000258382.9 | 2.012104 | -0.011819 | 1.493866 | -0.007912 | 0.993687 | NaN |
... | ... | ... | ... | ... | ... | ... |
ENST00000506914.1 | 1.022692 | 0.947588 | 2.272813 | 0.416923 | 0.676735 | NaN |
ENST00000571080.1 | 0.473263 | -2.824409 | 3.426473 | -0.824291 | 0.409774 | NaN |
ENST00000378615.7 | 1.334873 | -1.199810 | 1.790781 | -0.669993 | 0.502862 | NaN |
ENST00000409586.7 | 0.338615 | 1.464364 | 4.023896 | 0.363917 | 0.715920 | NaN |
ENST00000370278.3 | 1.328549 | -1.201246 | 1.962358 | -0.612144 | 0.540443 | NaN |
34814 rows × 6 columns
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
tid | ||||||
TALONT000296400 | 3.497574 | 2.416436 | 1.472616 | 1.640914 | 0.100815 | 0.193840 |
ENST00000591581.1 | 0.162115 | 0.624956 | 5.002765 | 0.124922 | 0.900585 | NaN |
ENST00000546893.5 | 8.173445 | -0.053334 | 0.793179 | -0.067241 | 0.946390 | 0.968527 |
ENST00000537289.1 | 5.140216 | -1.248175 | 0.978927 | -1.275044 | 0.202294 | 0.328286 |
ENST00000258382.9 | 2.012104 | -0.011819 | 1.493866 | -0.007912 | 0.993687 | NaN |
Isoform switching / Differential isoform expression testing
Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in Joglekar et. al., 2021. DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.
Pairwise comparisons can be set up using different columns in the metadata that was added to the SwanGraph with the obs_col
and obs_conditions
arguments.
cell_line | replicate | dataset | total_counts | |
---|---|---|---|---|
dataset | ||||
hepg2_1 | hepg2 | 1 | hepg2_1 | 499647.0 |
hepg2_2 | hepg2 | 2 | hepg2_2 | 848447.0 |
hffc6_1 | hffc6 | 1 | hffc6_1 | 761493.0 |
hffc6_2 | hffc6 | 2 | hffc6_2 | 787967.0 |
hffc6_3 | hffc6 | 3 | hffc6_3 | 614921.0 |
The resultant table contains an entry for each gene with the p value (p_val
), adjusted p value (adj_p_val
), and change in percent isoform usage for the top two isoforms (dpi
), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in Joglekar et. al., 2021.
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000130175 | 0.206667 | 39.264992 | TALONT000296399 | NaN | 39.264992 | NaN | TALONT000296400 | ENST00000589838.5 | -20.116056 | -10.638298 | 0.469420 | PRKCSH |
1 | ENSG00000130202 | 0.000367 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 0.003560 | NECTIN2 |
2 | ENSG00000111371 | 0.680435 | 9.401713 | ENST00000398637.9 | ENST00000439706.5 | 8.547012 | 0.854701 | ENST00000546893.5 | NaN | -9.401711 | NaN | 0.886243 | SLC38A1 |
3 | ENSG00000181924 | 0.028195 | 9.452934 | ENST00000537289.1 | ENST00000545127.1 | 7.298603 | 2.154328 | ENST00000355693.4 | NaN | -9.452934 | NaN | 0.122619 | COA4 |
4 | ENSG00000163468 | 0.255788 | 0.680048 | ENST00000295688.7 | TALONT000476055 | 0.366966 | 0.277693 | ENST00000368259.6 | ENST00000489870.1 | -0.568503 | -0.111545 | 0.525066 | CCT3 |
Differential isoform expression testing results are stored automatically in sg.adata.uns
, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000130175 | 0.206667 | 39.264992 | TALONT000296399 | NaN | 39.264992 | NaN | TALONT000296400 | ENST00000589838.5 | -20.116056 | -10.638298 | 0.46942 | PRKCSH |
1 | ENSG00000130202 | 0.000367 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 0.00356 | NECTIN2 |
Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds.
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ENSG00000130202 | 3.674234e-04 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 3.560035e-03 | NECTIN2 |
8 | ENSG00000025156 | 1.857411e-03 | 35.714287 | ENST00000452194.5 | ENST00000465214.2 | 25.714287 | 10.000000 | ENST00000368455.8 | NaN | -35.714287 | NaN | 1.405048e-02 | HSF2 |
20 | ENSG00000105254 | 4.779408e-38 | 25.419458 | ENST00000585910.5 | TALONT000366329 | 20.394853 | 4.702971 | ENST00000221855.7 | ENST00000589996.5 | -25.016304 | -0.403154 | 7.343219e-36 | TBCB |
23 | ENSG00000105379 | 1.802782e-307 | 81.136333 | ENST00000354232.8 | NaN | 81.136333 | NaN | ENST00000309244.8 | ENST00000596253.1 | -80.857935 | -0.278394 | 1.551113e-304 | ETFB |
28 | ENSG00000148180 | 0.000000e+00 | 89.319039 | ENST00000373818.8 | TALONT000419680 | 85.245850 | 4.073189 | TALONT000418752 | ENST00000373808.8 | -68.603180 | -7.768958 | 0.000000e+00 | GSN |
Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the kind
argument to die_gene_test
.
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 0.249561 | 10.002187 | ENSG00000000419_2 | ENSG00000000419_1 | 9.906052 | 0.096133 | ENSG00000000419_3 | ENSG00000000419_4 | -7.218704 | -2.783483 | 0.536906 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_5 | NaN | 9.093739 | NaN | ENSG00000001461_1 | ENSG00000001461_3 | -5.543442 | -1.775148 | 1.000000 | NIPAL3 |
2 | ENSG00000001497 | 0.891496 | 3.846153 | ENSG00000001497_1 | NaN | 3.846146 | NaN | ENSG00000001497_2 | NaN | -3.846153 | NaN | 1.000000 | LAS1L |
3 | ENSG00000001630 | 0.286866 | 2.580168 | ENSG00000001630_3 | NaN | 2.580168 | NaN | ENSG00000001630_2 | ENSG00000001630_1 | -1.549232 | -1.030928 | 0.582636 | CYP51A1 |
4 | ENSG00000002330 | 0.184635 | 12.817431 | ENSG00000002330_2 | ENSG00000002330_1 | 11.868484 | 0.948944 | ENSG00000002330_4 | ENSG00000002330_3 | -12.603584 | -0.213847 | 0.454053 | BAD |
To access the die results on the tss level, use die_kind='tss'
as input to make_uns_key()
.
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 0.249561 | 10.002187 | ENSG00000000419_2 | ENSG00000000419_1 | 9.906052 | 0.096133 | ENSG00000000419_3 | ENSG00000000419_4 | -7.218704 | -2.783483 | 0.536906 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_5 | NaN | 9.093739 | NaN | ENSG00000001461_1 | ENSG00000001461_3 | -5.543442 | -1.775148 | 1.000000 | NIPAL3 |
And provide the kind='tss'
option to get_die_genes()
when trying to filter your test results.
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | ENSG00000003402 | 7.338098e-15 | 84.848486 | ENSG00000003402_3 | ENSG00000003402_7 | 77.272728 | 7.575758 | ENSG00000003402_1 | ENSG00000003402_5 | -31.717173 | -22.222223 | 4.390279e-13 | CFLAR |
11 | ENSG00000003436 | 3.812087e-06 | 34.072281 | ENSG00000003436_1 | ENSG00000003436_3 | 28.073771 | 4.564083 | ENSG00000003436_4 | NaN | -34.072281 | NaN | 7.064168e-05 | TFPI |
16 | ENSG00000004487 | 1.135407e-03 | 75.714287 | ENSG00000004487_1 | NaN | 75.714287 | NaN | ENSG00000004487_2 | NaN | -75.714285 | NaN | 1.020405e-02 | KDM1A |
33 | ENSG00000006282 | 4.544048e-03 | 19.819595 | ENSG00000006282_2 | ENSG00000006282_1 | 13.679245 | 6.140350 | ENSG00000006282_3 | NaN | -19.819595 | NaN | 3.236475e-02 | SPATA20 |
43 | ENSG00000007376 | 1.256379e-04 | 27.898551 | ENSG00000007376_3 | ENSG00000007376_1 | 23.454107 | 4.444445 | ENSG00000007376_5 | ENSG00000007376_7 | -14.130434 | -7.512077 | 1.525134e-03 | RPUSD1 |
For TESs, use kind='tes'
as input to die_genes_test()
, die_kind='tes'
to make_uns_key()
, and kind='tes'
to get_die_genes()
.
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 1.000000 | 0.096133 | ENSG00000000419_2 | NaN | 0.096133 | NaN | ENSG00000000419_1 | NaN | -0.096130 | NaN | 1.000000 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_3 | NaN | 9.093739 | NaN | ENSG00000001461_4 | ENSG00000001461_1 | -5.543442 | -1.775148 | 1.000000 | NIPAL3 |
2 | ENSG00000001630 | 0.286866 | 2.580168 | ENSG00000001630_2 | NaN | 2.580168 | NaN | ENSG00000001630_1 | ENSG00000001630_3 | -1.549232 | -1.030928 | 0.618077 | CYP51A1 |
3 | ENSG00000002330 | 0.184635 | 12.817431 | ENSG00000002330_1 | ENSG00000002330_4 | 11.868484 | 0.948944 | ENSG00000002330_2 | ENSG00000002330_3 | -12.603584 | -0.213847 | 0.480860 | BAD |
4 | ENSG00000002549 | 0.679148 | 0.694543 | ENSG00000002549_2 | NaN | 0.694542 | NaN | ENSG00000002549_1 | NaN | -0.694543 | NaN | 0.926889 | LAP3 |
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000000419 | 1.000000 | 0.096133 | ENSG00000000419_2 | NaN | 0.096133 | NaN | ENSG00000000419_1 | NaN | -0.096130 | NaN | 1.0 | DPM1 |
1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_3 | NaN | 9.093739 | NaN | ENSG00000001461_4 | ENSG00000001461_1 | -5.543442 | -1.775148 | 1.0 | NIPAL3 |
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | ENSG00000003402 | 7.338098e-15 | 84.848486 | ENSG00000003402_5 | ENSG00000003402_6 | 77.272728 | 7.575758 | ENSG00000003402_9 | ENSG00000003402_7 | -31.717173 | -22.222223 | 5.296535e-13 | CFLAR |
10 | ENSG00000003436 | 2.772096e-07 | 32.637854 | ENSG00000003436_1 | ENSG00000003436_4 | 22.082711 | 10.555142 | ENSG00000003436_2 | ENSG00000003436_5 | -30.435921 | -1.818182 | 7.003007e-06 | TFPI |
14 | ENSG00000004487 | 1.135407e-03 | 75.714287 | ENSG00000004487_2 | NaN | 75.714287 | NaN | ENSG00000004487_1 | NaN | -75.714285 | NaN | 1.141621e-02 | KDM1A |
32 | ENSG00000006282 | 6.858641e-03 | 19.819595 | ENSG00000006282_2 | NaN | 19.819595 | NaN | ENSG00000006282_1 | NaN | -19.819595 | NaN | 4.915014e-02 | SPATA20 |
60 | ENSG00000010278 | 3.191221e-22 | 39.024387 | ENSG00000010278_2 | NaN | 39.024387 | NaN | ENSG00000010278_3 | ENSG00000010278_1 | -38.605976 | -0.418410 | 3.486194e-20 | CD9 |
Finally, for transcriptomes generated with Cerberus, Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with kind='ic'
as input to the die_gene_test()
function.
gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENCODEMG000055991 | 4.100695e-01 | 6.071428 | ENCODEMG000055991_2 | ENCODEMG000055991_3 | 3.571428 | 2.500000 | ENCODEMG000055991_1 | NaN | -6.071426 | NaN | 0.707992 | ENCODEMG000055991 |
1 | ENCODEMG000055998 | 6.190162e-01 | 9.722223 | ENCODEMG000055998_2 | NaN | 9.722223 | NaN | ENCODEMG000055998_1 | NaN | -9.722218 | NaN | 0.836200 | ENCODEMG000055998 |
2 | ENCODEMG000056718 | 8.081718e-07 | 6.289159 | NaN | NaN | 2.329770 | 1.953835 | NaN | NaN | -3.339438 | -2.949721 | 0.000020 | ENCODEMG000056718 |
3 | ENCODEMG000056804 | 2.107047e-03 | 27.863049 | ENCODEMG000056804_1 | NaN | 27.863047 | NaN | ENCODEMG000056804_2 | NaN | -27.863049 | NaN | 0.021510 | ENCODEMG000056804 |
4 | ENCODEMG000063411 | 7.699701e-01 | 12.500000 | ENCODEMG000063411_1 | NaN | 12.500000 | NaN | ENCODEMG000063411_2 | NaN | -12.500000 | NaN | 0.919808 | ENCODEMG000063411 |
Combining metadata columns
What if none of the metadata columns you have summarize the comparisons you want to make? What if I want to find differentially expressed genes or transcripts, or find isoform-switching genes between hffc6 replicate 3 and hepg2 replicate 1?
cell_line | replicate | dataset | total_counts | |
---|---|---|---|---|
dataset | ||||
hepg2_1 | hepg2 | 1 | hepg2_1 | 499647.0 |
hepg2_2 | hepg2 | 2 | hepg2_2 | 848447.0 |
hffc6_1 | hffc6 | 1 | hffc6_1 | 761493.0 |
hffc6_2 | hffc6 | 2 | hffc6_2 | 787967.0 |
hffc6_3 | hffc6 | 3 | hffc6_3 | 614921.0 |
Let's ignore for a moment the fact that the dataset
column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:
The added column in col_name
can then be used as the obs_col
input to die_gene_test()
as follows:
Exon skipping and intron retention
Swan can detect novel (unannotated) exon skipping and intron retention events.
To obtain a dataframe of novel exon skipping events, run the following code:
gid | tid | edge_id | |
---|---|---|---|
0 | ENSG00000157916.19 | TALONT000218256 | 952616 |
0 | ENSG00000122406.13 | TALONT000425229 | 952716 |
0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
To obtain a list of genes containing novel intron retention events, run the following code:
You can pass gene IDs from es_df
into gen_report()
or plot_graph()
to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.
Last updated