Analysis

Swan has several analysis options to use.

import swan_vis as swan

sg = swan.read('data/swan.p')
Read in graph from data/swan.p

Differential expression tests

Swan's old differential gene and transcript expression tests using diffxpy have now been deprecated as it seems that the library is unsupported. I recommend that users interested in running differential gene or transcript expression test either use PyDESeq2 or Scanpy's rank_genes_groups test, which both support the AnnData format for simple compatibility with Swan's AnnData expression representation.

Using PyDESeq2

Please read the PyDESeq2 documentation for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example on how to find differentially expressed transcripts between cell lines.

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import numpy as np

adata = sg.adata.copy()

# PyDESeq2 currently doesn't support column names with underscores, so change that
adata.obs.rename({'cell_line': 'cellline'}, axis=1, inplace=True)
obs_col = 'cellline'

threads = 8

# densify matrix
adata.X = np.array(adata.X.todense())

# run test
dds = DeseqDataSet(adata=adata,
               design_factors=obs_col,
               n_cpus=threads,
               refit_cooks=True)
dds.deseq2()
stat_res = DeseqStats(dds,
                  n_cpus=threads)
stat_res.summary()

df = stat_res.results_df
Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 34.09 seconds.

Fitting dispersion trend curve...
... done in 12.01 seconds.

Fitting MAP dispersions...
... done in 13.97 seconds.

Fitting LFCs...
... done in 4.86 seconds.

Refitting 0 outliers.

Running Wald tests...
... done in 2.36 seconds.

Log2 fold change & Wald test p-value: cellline hffc6 vs hepg2
baseMeanlog2FoldChangelfcSEstatpvaluepadj

tid

TALONT000296400

3.497574

2.416436

1.472616

1.640914

0.100815

0.193840

ENST00000591581.1

0.162115

0.624956

5.002765

0.124922

0.900585

NaN

ENST00000546893.5

8.173445

-0.053334

0.793179

-0.067241

0.946390

0.968527

ENST00000537289.1

5.140216

-1.248175

0.978927

-1.275044

0.202294

0.328286

ENST00000258382.9

2.012104

-0.011819

1.493866

-0.007912

0.993687

NaN

...

...

...

...

...

...

...

ENST00000506914.1

1.022692

0.947588

2.272813

0.416923

0.676735

NaN

ENST00000571080.1

0.473263

-2.824409

3.426473

-0.824291

0.409774

NaN

ENST00000378615.7

1.334873

-1.199810

1.790781

-0.669993

0.502862

NaN

ENST00000409586.7

0.338615

1.464364

4.023896

0.363917

0.715920

NaN

ENST00000370278.3

1.328549

-1.201246

1.962358

-0.612144

0.540443

NaN

34814 rows × 6 columns

df.head()
baseMeanlog2FoldChangelfcSEstatpvaluepadj

tid

TALONT000296400

3.497574

2.416436

1.472616

1.640914

0.100815

0.193840

ENST00000591581.1

0.162115

0.624956

5.002765

0.124922

0.900585

NaN

ENST00000546893.5

8.173445

-0.053334

0.793179

-0.067241

0.946390

0.968527

ENST00000537289.1

5.140216

-1.248175

0.978927

-1.275044

0.202294

0.328286

ENST00000258382.9

2.012104

-0.011819

1.493866

-0.007912

0.993687

NaN

Isoform switching / Differential isoform expression testing

Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in Joglekar et. al., 2021. DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.

Pairwise comparisons can be set up using different columns in the metadata that was added to the SwanGraph with the obs_col and obs_conditions arguments.

# look at valid metadata options
sg.adata.obs
cell_linereplicatedatasettotal_counts

dataset

hepg2_1

hepg2

1

hepg2_1

499647.0

hepg2_2

hepg2

2

hepg2_2

848447.0

hffc6_1

hffc6

1

hffc6_1

761493.0

hffc6_2

hffc6

2

hffc6_2

787967.0

hffc6_3

hffc6

3

hffc6_3

614921.0

# find genes that exhibit DIE between HFFc6 and HepG2
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']
die_table, die_results = sg.die_gene_test(obs_col=obs_col,
                                          obs_conditions=obs_conditions,
                                          verbose=True)
Testing for DIE for each gene: 100%|██████████| 14684/14684 [03:50<00:00, 123.69it/s]

The resultant table contains an entry for each gene with the p value (p_val), adjusted p value (adj_p_val), and change in percent isoform usage for the top two isoforms (dpi), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in Joglekar et. al., 2021.

die_table.head()
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

0

ENSG00000130175

0.206667

39.264992

TALONT000296399

NaN

39.264992

NaN

TALONT000296400

ENST00000589838.5

-20.116056

-10.638298

0.469420

PRKCSH

1

ENSG00000130202

0.000367

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

0.003560

NECTIN2

2

ENSG00000111371

0.680435

9.401713

ENST00000398637.9

ENST00000439706.5

8.547012

0.854701

ENST00000546893.5

NaN

-9.401711

NaN

0.886243

SLC38A1

3

ENSG00000181924

0.028195

9.452934

ENST00000537289.1

ENST00000545127.1

7.298603

2.154328

ENST00000355693.4

NaN

-9.452934

NaN

0.122619

COA4

4

ENSG00000163468

0.255788

0.680048

ENST00000295688.7

TALONT000476055

0.366966

0.277693

ENST00000368259.6

ENST00000489870.1

-0.568503

-0.111545

0.525066

CCT3

Differential isoform expression testing results are stored automatically in sg.adata.uns, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

# die_iso - isoform level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

0

ENSG00000130175

0.206667

39.264992

TALONT000296399

NaN

39.264992

NaN

TALONT000296400

ENST00000589838.5

-20.116056

-10.638298

0.46942

PRKCSH

1

ENSG00000130202

0.000367

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

0.00356

NECTIN2

Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds.

test = sg.get_die_genes(obs_col=obs_col, obs_conditions=obs_conditions,
                       p=0.05, dpi=10)
test.head()
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

1

ENSG00000130202

3.674234e-04

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

3.560035e-03

NECTIN2

8

ENSG00000025156

1.857411e-03

35.714287

ENST00000452194.5

ENST00000465214.2

25.714287

10.000000

ENST00000368455.8

NaN

-35.714287

NaN

1.405048e-02

HSF2

20

ENSG00000105254

4.779408e-38

25.419458

ENST00000585910.5

TALONT000366329

20.394853

4.702971

ENST00000221855.7

ENST00000589996.5

-25.016304

-0.403154

7.343219e-36

TBCB

23

ENSG00000105379

1.802782e-307

81.136333

ENST00000354232.8

NaN

81.136333

NaN

ENST00000309244.8

ENST00000596253.1

-80.857935

-0.278394

1.551113e-304

ETFB

28

ENSG00000148180

0.000000e+00

89.319039

ENST00000373818.8

TALONT000419680

85.245850

4.073189

TALONT000418752

ENST00000373808.8

-68.603180

-7.768958

0.000000e+00

GSN

Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the kind argument to die_gene_test.

# find genes that exhibit DIE for TSSs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tss',
                                          obs_col=obs_col,
                                          obs_conditions=obs_conditions,
                                          verbose=True)
die_table.head()
Testing for DIE for each gene: 100%|█████████▉| 14674/14684 [02:18<00:00, 162.60it/s]
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

0

ENSG00000000419

0.249561

10.002187

ENSG00000000419_2

ENSG00000000419_1

9.906052

0.096133

ENSG00000000419_3

ENSG00000000419_4

-7.218704

-2.783483

0.536906

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_5

NaN

9.093739

NaN

ENSG00000001461_1

ENSG00000001461_3

-5.543442

-1.775148

1.000000

NIPAL3

2

ENSG00000001497

0.891496

3.846153

ENSG00000001497_1

NaN

3.846146

NaN

ENSG00000001497_2

NaN

-3.846153

NaN

1.000000

LAS1L

3

ENSG00000001630

0.286866

2.580168

ENSG00000001630_3

NaN

2.580168

NaN

ENSG00000001630_2

ENSG00000001630_1

-1.549232

-1.030928

0.582636

CYP51A1

4

ENSG00000002330

0.184635

12.817431

ENSG00000002330_2

ENSG00000002330_1

11.868484

0.948944

ENSG00000002330_4

ENSG00000002330_3

-12.603584

-0.213847

0.454053

BAD

To access the die results on the tss level, use die_kind='tss' as input to make_uns_key().

# die_iso - TSS level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions,
                            die_kind='tss')
test = sg.adata.uns[uns_key]
test.head(2)
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

0

ENSG00000000419

0.249561

10.002187

ENSG00000000419_2

ENSG00000000419_1

9.906052

0.096133

ENSG00000000419_3

ENSG00000000419_4

-7.218704

-2.783483

0.536906

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_5

NaN

9.093739

NaN

ENSG00000001461_1

ENSG00000001461_3

-5.543442

-1.775148

1.000000

NIPAL3

And provide the kind='tss' option to get_die_genes() when trying to filter your test results.

test = sg.get_die_genes(kind='tss', obs_col=obs_col,
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

10

ENSG00000003402

7.338098e-15

84.848486

ENSG00000003402_3

ENSG00000003402_7

77.272728

7.575758

ENSG00000003402_1

ENSG00000003402_5

-31.717173

-22.222223

4.390279e-13

CFLAR

11

ENSG00000003436

3.812087e-06

34.072281

ENSG00000003436_1

ENSG00000003436_3

28.073771

4.564083

ENSG00000003436_4

NaN

-34.072281

NaN

7.064168e-05

TFPI

16

ENSG00000004487

1.135407e-03

75.714287

ENSG00000004487_1

NaN

75.714287

NaN

ENSG00000004487_2

NaN

-75.714285

NaN

1.020405e-02

KDM1A

33

ENSG00000006282

4.544048e-03

19.819595

ENSG00000006282_2

ENSG00000006282_1

13.679245

6.140350

ENSG00000006282_3

NaN

-19.819595

NaN

3.236475e-02

SPATA20

43

ENSG00000007376

1.256379e-04

27.898551

ENSG00000007376_3

ENSG00000007376_1

23.454107

4.444445

ENSG00000007376_5

ENSG00000007376_7

-14.130434

-7.512077

1.525134e-03

RPUSD1

For TESs, use kind='tes' as input to die_genes_test(), die_kind='tes' to make_uns_key(), and kind='tes' to get_die_genes().

# find genes that exhibit DIE for TESs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tes',
                                          obs_col='cell_line',
                                          obs_conditions=['hepg2', 'hffc6'])
die_table.head()
Testing for DIE for each gene: 100%|██████████| 14684/14684 [02:19<00:00, 105.51it/s]
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

0

ENSG00000000419

1.000000

0.096133

ENSG00000000419_2

NaN

0.096133

NaN

ENSG00000000419_1

NaN

-0.096130

NaN

1.000000

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_3

NaN

9.093739

NaN

ENSG00000001461_4

ENSG00000001461_1

-5.543442

-1.775148

1.000000

NIPAL3

2

ENSG00000001630

0.286866

2.580168

ENSG00000001630_2

NaN

2.580168

NaN

ENSG00000001630_1

ENSG00000001630_3

-1.549232

-1.030928

0.618077

CYP51A1

3

ENSG00000002330

0.184635

12.817431

ENSG00000002330_1

ENSG00000002330_4

11.868484

0.948944

ENSG00000002330_2

ENSG00000002330_3

-12.603584

-0.213847

0.480860

BAD

4

ENSG00000002549

0.679148

0.694543

ENSG00000002549_2

NaN

0.694542

NaN

ENSG00000002549_1

NaN

-0.694543

NaN

0.926889

LAP3

# die_iso - TES level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions,
                            die_kind='tes')
test = sg.adata.uns[uns_key]
test.head(2)
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

0

ENSG00000000419

1.000000

0.096133

ENSG00000000419_2

NaN

0.096133

NaN

ENSG00000000419_1

NaN

-0.096130

NaN

1.0

DPM1

1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_3

NaN

9.093739

NaN

ENSG00000001461_4

ENSG00000001461_1

-5.543442

-1.775148

1.0

NIPAL3

test = sg.get_die_genes(kind='tes', obs_col=obs_col,
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

9

ENSG00000003402

7.338098e-15

84.848486

ENSG00000003402_5

ENSG00000003402_6

77.272728

7.575758

ENSG00000003402_9

ENSG00000003402_7

-31.717173

-22.222223

5.296535e-13

CFLAR

10

ENSG00000003436

2.772096e-07

32.637854

ENSG00000003436_1

ENSG00000003436_4

22.082711

10.555142

ENSG00000003436_2

ENSG00000003436_5

-30.435921

-1.818182

7.003007e-06

TFPI

14

ENSG00000004487

1.135407e-03

75.714287

ENSG00000004487_2

NaN

75.714287

NaN

ENSG00000004487_1

NaN

-75.714285

NaN

1.141621e-02

KDM1A

32

ENSG00000006282

6.858641e-03

19.819595

ENSG00000006282_2

NaN

19.819595

NaN

ENSG00000006282_1

NaN

-19.819595

NaN

4.915014e-02

SPATA20

60

ENSG00000010278

3.191221e-22

39.024387

ENSG00000010278_2

NaN

39.024387

NaN

ENSG00000010278_3

ENSG00000010278_1

-38.605976

-0.418410

3.486194e-20

CD9

Finally, for transcriptomes generated with Cerberus, Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with kind='ic' as input to the die_gene_test() function.

sg_brain = swan.read('swan_modelad.p')

# find genes that exhibit DIE for ICs between genotypes
die_table, die_results = sg_brain.die_gene_test(kind='ic',
                                                obs_col='genotype',
                                                obs_conditions=['b6n', '5xfad'])
die_table.head()
Read in graph from swan_modelad.p
gidp_valdpipos_iso_1pos_iso_2pos_iso_1_dpipos_iso_2_dpineg_iso_1neg_iso_2neg_iso_1_dpineg_iso_2_dpiadj_p_valgname

0

ENCODEMG000055991

4.100695e-01

6.071428

ENCODEMG000055991_2

ENCODEMG000055991_3

3.571428

2.500000

ENCODEMG000055991_1

NaN

-6.071426

NaN

0.707992

ENCODEMG000055991

1

ENCODEMG000055998

6.190162e-01

9.722223

ENCODEMG000055998_2

NaN

9.722223

NaN

ENCODEMG000055998_1

NaN

-9.722218

NaN

0.836200

ENCODEMG000055998

2

ENCODEMG000056718

8.081718e-07

6.289159

NaN

NaN

2.329770

1.953835

NaN

NaN

-3.339438

-2.949721

0.000020

ENCODEMG000056718

3

ENCODEMG000056804

2.107047e-03

27.863049

ENCODEMG000056804_1

NaN

27.863047

NaN

ENCODEMG000056804_2

NaN

-27.863049

NaN

0.021510

ENCODEMG000056804

4

ENCODEMG000063411

7.699701e-01

12.500000

ENCODEMG000063411_1

NaN

12.500000

NaN

ENCODEMG000063411_2

NaN

-12.500000

NaN

0.919808

ENCODEMG000063411

Combining metadata columns

What if none of the metadata columns you have summarize the comparisons you want to make? What if I want to find differentially expressed genes or transcripts, or find isoform-switching genes between hffc6 replicate 3 and hepg2 replicate 1?

sg.adata.obs.head()
cell_linereplicatedatasettotal_counts

dataset

hepg2_1

hepg2

1

hepg2_1

499647.0

hepg2_2

hepg2

2

hepg2_2

848447.0

hffc6_1

hffc6

1

hffc6_1

761493.0

hffc6_2

hffc6

2

hffc6_2

787967.0

hffc6_3

hffc6

3

hffc6_3

614921.0

Let's ignore for a moment the fact that the dataset column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:

col_name = sg.add_multi_groupby(['cell_line', 'replicate'])

print(col_name)
print(sg.adata.obs.head())
cell_line_replicate
        cell_line replicate  dataset  total_counts cell_line_replicate
dataset                                                               
hepg2_1     hepg2         1  hepg2_1      499647.0             hepg2_1
hepg2_2     hepg2         2  hepg2_2      848447.0             hepg2_2
hffc6_1     hffc6         1  hffc6_1      761493.0             hffc6_1
hffc6_2     hffc6         2  hffc6_2      787967.0             hffc6_2
hffc6_3     hffc6         3  hffc6_3      614921.0             hffc6_3

The added column in col_name can then be used as the obs_col input to die_gene_test() as follows:

obs_col = col_name
obs_conditions = ['hffc6_3', 'hepg2_1']


die_table, die_results = sg.die_gene_test(kind='iso',
                                          obs_col=obs_col,
                                          obs_conditions=obs_conditions)

Exon skipping and intron retention

Swan can detect novel (unannotated) exon skipping and intron retention events.

To obtain a dataframe of novel exon skipping events, run the following code:

# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel exon skipping events
es_df = sg.find_es_genes(verbose=True)
Testing each novel edge for exon skipping:   0%|          | 0/855 [00:00<?, ?it/s]

Analyzing 855 intronic edges for ES


Testing each novel edge for exon skipping: 100%|██████████| 855/855 [1:26:06<00:00,  6.13s/it]

Found 529 novel es events in 149 transcripts.
es_df.head()
gidtidedge_id

0

ENSG00000157916.19

TALONT000218256

952616

0

ENSG00000122406.13

TALONT000425229

952716

0

ENSG00000224093.5

TALONT000434035

952720

0

ENSG00000224093.5

TALONT000434035

952720

0

ENSG00000224093.5

TALONT000434035

952720

To obtain a list of genes containing novel intron retention events, run the following code:

# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel intron retaining events
ir_df = sg.find_ir_genes(verbose=True)
  0%|          | 0/1186 [00:00<?, ?it/s]
Testing each novel edge for intron retention:   0%|          | 0/1186 [00:00<?, ?it/s]

Analyzing 1186 exonic edges for IR
Testing each novel edge for intron retention: 100%|██████████| 1186/1186 [1:59:16<00:00,  5.97s/it]

Found 35 novel ir events in 27 transcripts.

You can pass gene IDs from es_df into gen_report() or plot_graph() to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.

Last updated