Analysis

Swan has several analysis options to use.

import swan_vis as swan

sg = swan.read('data/swan.p')
Read in graph from data/swan.p

Differential expression tests

Swan's old differential gene and transcript expression tests using diffxpy have now been deprecated as it seems that the library is unsupported. I recommend that users interested in running differential gene or transcript expression test either use PyDESeq2 or Scanpy's rank_genes_groups test, which both support the AnnData format for simple compatibility with Swan's AnnData expression representation.

Using PyDESeq2

Please read the PyDESeq2 documentation for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example on how to find differentially expressed transcripts between cell lines.

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import numpy as np

adata = sg.adata.copy()

# PyDESeq2 currently doesn't support column names with underscores, so change that
adata.obs.rename({'cell_line': 'cellline'}, axis=1, inplace=True)
obs_col = 'cellline'

threads = 8

# densify matrix
adata.X = np.array(adata.X.todense())

# run test
dds = DeseqDataSet(adata=adata,
               design_factors=obs_col,
               n_cpus=threads,
               refit_cooks=True)
dds.deseq2()
stat_res = DeseqStats(dds,
                  n_cpus=threads)
stat_res.summary()

df = stat_res.results_df
baseMean
log2FoldChange
lfcSE
stat
pvalue
padj

tid

TALONT000296400

3.497574

2.416436

1.472616

1.640914

0.100815

0.193840

ENST00000591581.1

0.162115

0.624956

5.002765

0.124922

0.900585

NaN

ENST00000546893.5

8.173445

-0.053334

0.793179

-0.067241

0.946390

0.968527

ENST00000537289.1

5.140216

-1.248175

0.978927

-1.275044

0.202294

0.328286

ENST00000258382.9

2.012104

-0.011819

1.493866

-0.007912

0.993687

NaN

...

...

...

...

...

...

...

ENST00000506914.1

1.022692

0.947588

2.272813

0.416923

0.676735

NaN

ENST00000571080.1

0.473263

-2.824409

3.426473

-0.824291

0.409774

NaN

ENST00000378615.7

1.334873

-1.199810

1.790781

-0.669993

0.502862

NaN

ENST00000409586.7

0.338615

1.464364

4.023896

0.363917

0.715920

NaN

ENST00000370278.3

1.328549

-1.201246

1.962358

-0.612144

0.540443

NaN

34814 rows × 6 columns

baseMean
log2FoldChange
lfcSE
stat
pvalue
padj

tid

TALONT000296400

3.497574

2.416436

1.472616

1.640914

0.100815

0.193840

ENST00000591581.1

0.162115

0.624956

5.002765

0.124922

0.900585

NaN

ENST00000546893.5

8.173445

-0.053334

0.793179

-0.067241

0.946390

0.968527

ENST00000537289.1

5.140216

-1.248175

0.978927

-1.275044

0.202294

0.328286

ENST00000258382.9

2.012104

-0.011819

1.493866

-0.007912

0.993687

NaN

Isoform switching / Differential isoform expression testing

Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in Joglekar et. al., 2021. DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.

Pairwise comparisons can be set up using different columns in the metadata that was added to the SwanGraph with the obs_col and obs_conditions arguments.

cell_line
replicate
dataset
total_counts

dataset

hepg2_1

hepg2

1

hepg2_1

499647.0

hepg2_2

hepg2

2

hepg2_2

848447.0

hffc6_1

hffc6

1

hffc6_1

761493.0

hffc6_2

hffc6

2

hffc6_2

787967.0

hffc6_3

hffc6

3

hffc6_3

614921.0

The resultant table contains an entry for each gene with the p value (p_val), adjusted p value (adj_p_val), and change in percent isoform usage for the top two isoforms (dpi), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in Joglekar et. al., 2021.

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

0

ENSG00000130175

0.206667

39.264992

TALONT000296399

NaN

39.264992

NaN

TALONT000296400

ENST00000589838.5

-20.116056

-10.638298

0.469420

PRKCSH

1

ENSG00000130202

0.000367

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

0.003560

NECTIN2

2

ENSG00000111371

0.680435

9.401713

ENST00000398637.9

ENST00000439706.5

8.547012

0.854701

ENST00000546893.5

NaN

-9.401711

NaN

0.886243

SLC38A1

3

ENSG00000181924

0.028195

9.452934

ENST00000537289.1

ENST00000545127.1

7.298603

2.154328

ENST00000355693.4

NaN

-9.452934

NaN

0.122619

COA4

4

ENSG00000163468

0.255788

0.680048

ENST00000295688.7

TALONT000476055

0.366966

0.277693

ENST00000368259.6

ENST00000489870.1

-0.568503

-0.111545

0.525066

CCT3

Differential isoform expression testing results are stored automatically in sg.adata.uns, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

0

ENSG00000130175

0.206667

39.264992

TALONT000296399

NaN

39.264992

NaN

TALONT000296400

ENST00000589838.5

-20.116056

-10.638298

0.46942

PRKCSH

1

ENSG00000130202

0.000367

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

0.00356

NECTIN2

Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds.

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

1

ENSG00000130202

3.674234e-04

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

3.560035e-03

NECTIN2

8

ENSG00000025156

1.857411e-03

35.714287

ENST00000452194.5

ENST00000465214.2

25.714287

10.000000

ENST00000368455.8

NaN

-35.714287

NaN

1.405048e-02

HSF2

20

ENSG00000105254

4.779408e-38

25.419458

ENST00000585910.5

TALONT000366329

20.394853

4.702971

ENST00000221855.7

ENST00000589996.5

-25.016304

-0.403154

7.343219e-36

TBCB

23

ENSG00000105379

1.802782e-307

81.136333

ENST00000354232.8

NaN

81.136333

NaN

ENST00000309244.8

ENST00000596253.1

-80.857935

-0.278394

1.551113e-304

ETFB

28

ENSG00000148180

0.000000e+00

89.319039

ENST00000373818.8

TALONT000419680

85.245850

4.073189

TALONT000418752

ENST00000373808.8

-68.603180

-7.768958

0.000000e+00

GSN

Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the kind argument to die_gene_test.

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

0

ENSG00000000419

0.249561

10.002187

ENSG00000000419_2

ENSG00000000419_1

9.906052

0.096133

ENSG00000000419_3

ENSG00000000419_4

-7.218704

-2.783483

0.536906

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_5

NaN

9.093739

NaN

ENSG00000001461_1

ENSG00000001461_3

-5.543442

-1.775148

1.000000

NIPAL3

2

ENSG00000001497

0.891496

3.846153

ENSG00000001497_1

NaN

3.846146

NaN

ENSG00000001497_2

NaN

-3.846153

NaN

1.000000

LAS1L

3

ENSG00000001630

0.286866

2.580168

ENSG00000001630_3

NaN

2.580168

NaN

ENSG00000001630_2

ENSG00000001630_1

-1.549232

-1.030928

0.582636

CYP51A1

4

ENSG00000002330

0.184635

12.817431

ENSG00000002330_2

ENSG00000002330_1

11.868484

0.948944

ENSG00000002330_4

ENSG00000002330_3

-12.603584

-0.213847

0.454053

BAD

To access the die results on the tss level, use die_kind='tss' as input to make_uns_key().

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

0

ENSG00000000419

0.249561

10.002187

ENSG00000000419_2

ENSG00000000419_1

9.906052

0.096133

ENSG00000000419_3

ENSG00000000419_4

-7.218704

-2.783483

0.536906

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_5

NaN

9.093739

NaN

ENSG00000001461_1

ENSG00000001461_3

-5.543442

-1.775148

1.000000

NIPAL3

And provide the kind='tss' option to get_die_genes() when trying to filter your test results.

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

10

ENSG00000003402

7.338098e-15

84.848486

ENSG00000003402_3

ENSG00000003402_7

77.272728

7.575758

ENSG00000003402_1

ENSG00000003402_5

-31.717173

-22.222223

4.390279e-13

CFLAR

11

ENSG00000003436

3.812087e-06

34.072281

ENSG00000003436_1

ENSG00000003436_3

28.073771

4.564083

ENSG00000003436_4

NaN

-34.072281

NaN

7.064168e-05

TFPI

16

ENSG00000004487

1.135407e-03

75.714287

ENSG00000004487_1

NaN

75.714287

NaN

ENSG00000004487_2

NaN

-75.714285

NaN

1.020405e-02

KDM1A

33

ENSG00000006282

4.544048e-03

19.819595

ENSG00000006282_2

ENSG00000006282_1

13.679245

6.140350

ENSG00000006282_3

NaN

-19.819595

NaN

3.236475e-02

SPATA20

43

ENSG00000007376

1.256379e-04

27.898551

ENSG00000007376_3

ENSG00000007376_1

23.454107

4.444445

ENSG00000007376_5

ENSG00000007376_7

-14.130434

-7.512077

1.525134e-03

RPUSD1

For TESs, use kind='tes' as input to die_genes_test(), die_kind='tes' to make_uns_key(), and kind='tes' to get_die_genes().

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

0

ENSG00000000419

1.000000

0.096133

ENSG00000000419_2

NaN

0.096133

NaN

ENSG00000000419_1

NaN

-0.096130

NaN

1.000000

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_3

NaN

9.093739

NaN

ENSG00000001461_4

ENSG00000001461_1

-5.543442

-1.775148

1.000000

NIPAL3

2

ENSG00000001630

0.286866

2.580168

ENSG00000001630_2

NaN

2.580168

NaN

ENSG00000001630_1

ENSG00000001630_3

-1.549232

-1.030928

0.618077

CYP51A1

3

ENSG00000002330

0.184635

12.817431

ENSG00000002330_1

ENSG00000002330_4

11.868484

0.948944

ENSG00000002330_2

ENSG00000002330_3

-12.603584

-0.213847

0.480860

BAD

4

ENSG00000002549

0.679148

0.694543

ENSG00000002549_2

NaN

0.694542

NaN

ENSG00000002549_1

NaN

-0.694543

NaN

0.926889

LAP3

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

0

ENSG00000000419

1.000000

0.096133

ENSG00000000419_2

NaN

0.096133

NaN

ENSG00000000419_1

NaN

-0.096130

NaN

1.0

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_3

NaN

9.093739

NaN

ENSG00000001461_4

ENSG00000001461_1

-5.543442

-1.775148

1.0

NIPAL3

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

9

ENSG00000003402

7.338098e-15

84.848486

ENSG00000003402_5

ENSG00000003402_6

77.272728

7.575758

ENSG00000003402_9

ENSG00000003402_7

-31.717173

-22.222223

5.296535e-13

CFLAR

10

ENSG00000003436

2.772096e-07

32.637854

ENSG00000003436_1

ENSG00000003436_4

22.082711

10.555142

ENSG00000003436_2

ENSG00000003436_5

-30.435921

-1.818182

7.003007e-06

TFPI

14

ENSG00000004487

1.135407e-03

75.714287

ENSG00000004487_2

NaN

75.714287

NaN

ENSG00000004487_1

NaN

-75.714285

NaN

1.141621e-02

KDM1A

32

ENSG00000006282

6.858641e-03

19.819595

ENSG00000006282_2

NaN

19.819595

NaN

ENSG00000006282_1

NaN

-19.819595

NaN

4.915014e-02

SPATA20

60

ENSG00000010278

3.191221e-22

39.024387

ENSG00000010278_2

NaN

39.024387

NaN

ENSG00000010278_3

ENSG00000010278_1

-38.605976

-0.418410

3.486194e-20

CD9

Finally, for transcriptomes generated with Cerberus, Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with kind='ic' as input to the die_gene_test() function.

gid
p_val
dpi
pos_iso_1
pos_iso_2
pos_iso_1_dpi
pos_iso_2_dpi
neg_iso_1
neg_iso_2
neg_iso_1_dpi
neg_iso_2_dpi
adj_p_val
gname

0

ENCODEMG000055991

4.100695e-01

6.071428

ENCODEMG000055991_2

ENCODEMG000055991_3

3.571428

2.500000

ENCODEMG000055991_1

NaN

-6.071426

NaN

0.707992

ENCODEMG000055991

1

ENCODEMG000055998

6.190162e-01

9.722223

ENCODEMG000055998_2

NaN

9.722223

NaN

ENCODEMG000055998_1

NaN

-9.722218

NaN

0.836200

ENCODEMG000055998

2

ENCODEMG000056718

8.081718e-07

6.289159

NaN

NaN

2.329770

1.953835

NaN

NaN

-3.339438

-2.949721

0.000020

ENCODEMG000056718

3

ENCODEMG000056804

2.107047e-03

27.863049

ENCODEMG000056804_1

NaN

27.863047

NaN

ENCODEMG000056804_2

NaN

-27.863049

NaN

0.021510

ENCODEMG000056804

4

ENCODEMG000063411

7.699701e-01

12.500000

ENCODEMG000063411_1

NaN

12.500000

NaN

ENCODEMG000063411_2

NaN

-12.500000

NaN

0.919808

ENCODEMG000063411

Combining metadata columns

What if none of the metadata columns you have summarize the comparisons you want to make? What if I want to find differentially expressed genes or transcripts, or find isoform-switching genes between hffc6 replicate 3 and hepg2 replicate 1?

cell_line
replicate
dataset
total_counts

dataset

hepg2_1

hepg2

1

hepg2_1

499647.0

hepg2_2

hepg2

2

hepg2_2

848447.0

hffc6_1

hffc6

1

hffc6_1

761493.0

hffc6_2

hffc6

2

hffc6_2

787967.0

hffc6_3

hffc6

3

hffc6_3

614921.0

Let's ignore for a moment the fact that the dataset column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:

The added column in col_name can then be used as the obs_col input to die_gene_test() as follows:

Exon skipping and intron retention

Swan can detect novel (unannotated) exon skipping and intron retention events.

To obtain a dataframe of novel exon skipping events, run the following code:

gid
tid
edge_id

0

ENSG00000157916.19

TALONT000218256

952616

0

ENSG00000122406.13

TALONT000425229

952716

0

ENSG00000224093.5

TALONT000434035

952720

0

ENSG00000224093.5

TALONT000434035

952720

0

ENSG00000224093.5

TALONT000434035

952720

To obtain a list of genes containing novel intron retention events, run the following code:

You can pass gene IDs from es_df into gen_report() or plot_graph() to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.

Last updated

Was this helpful?