Analysis

Swan has several analysis options to use.

import swan_vis as swan

sg = swan.read('data/swan.p')

Read in graph from data/swan.p

Differential expression tests

Swan's old differential gene and transcript expression tests using diffxpy have now been deprecated as it seems that the library is unsupported. I recommend that users interested in running differential gene or transcript expression test either use PyDESeq2 or Scanpy's rank_genes_groups test, which both support the AnnData format for simple compatibility with Swan's AnnData expression representation.

Using PyDESeq2

Please read the PyDESeq2 documentation for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example on how to find differentially expressed transcripts between cell lines.

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import numpy as np

adata = sg.adata.copy()

# PyDESeq2 currently doesn't support column names with underscores, so change that
adata.obs.rename({'cell_line': 'cellline'}, axis=1, inplace=True)
obs_col = 'cellline'

threads = 8

# densify matrix
adata.X = np.array(adata.X.todense())

# run test
dds = DeseqDataSet(adata=adata,
               design_factors=obs_col,
               n_cpus=threads,
               refit_cooks=True)
dds.deseq2()
stat_res = DeseqStats(dds,
                  n_cpus=threads)
stat_res.summary()

df = stat_res.results_df

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 34.09 seconds.

Fitting dispersion trend curve...
... done in 12.01 seconds.

Fitting MAP dispersions...
... done in 13.97 seconds.

Fitting LFCs...
... done in 4.86 seconds.

Refitting 0 outliers.

Running Wald tests...
... done in 2.36 seconds.

Log2 fold change & Wald test p-value: cellline hffc6 vs hepg2

baseMean

log2FoldChange

lfcSE

stat

pvalue

padj

tid

TALONT000296400

3.497574

2.416436

1.472616

1.640914

0.100815

0.193840

ENST00000591581.1

0.162115

0.624956

5.002765

0.124922

0.900585

NaN

ENST00000546893.5

8.173445

-0.053334

0.793179

-0.067241

0.946390

0.968527

ENST00000537289.1

5.140216

-1.248175

0.978927

-1.275044

0.202294

0.328286

ENST00000258382.9

2.012104

-0.011819

1.493866

-0.007912

0.993687

NaN

...

ENST00000506914.1

1.022692

0.947588

2.272813

0.416923

0.676735

NaN

ENST00000571080.1

0.473263

-2.824409

3.426473

-0.824291

0.409774

NaN

ENST00000378615.7

1.334873

-1.199810

1.790781

-0.669993

0.502862

NaN

ENST00000409586.7

0.338615

1.464364

4.023896

0.363917

0.715920

NaN

ENST00000370278.3

1.328549

-1.201246

1.962358

-0.612144

0.540443

NaN

34814 rows × 6 columns

df.head()

baseMean

log2FoldChange

lfcSE

stat

pvalue

padj

tid

TALONT000296400

3.497574

2.416436

1.472616

1.640914

0.100815

0.193840

ENST00000591581.1

0.162115

0.624956

5.002765

0.124922

0.900585

NaN

ENST00000546893.5

8.173445

-0.053334

0.793179

-0.067241

0.946390

0.968527

ENST00000537289.1

5.140216

-1.248175

0.978927

-1.275044

0.202294

0.328286

ENST00000258382.9

2.012104

-0.011819

1.493866

-0.007912

0.993687

NaN

Isoform switching / Differential isoform expression testing

Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in Joglekar et. al., 2021. DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.

Pairwise comparisons can be set up using different columns in the metadata that was added to the SwanGraph with the obs_col and obs_conditions arguments.

# look at valid metadata options
sg.adata.obs

cell_line

replicate

dataset

total_counts

dataset

hepg2_1

hepg2

hepg2_1

499647.0

hepg2_2

hepg2

hepg2_2

848447.0

hffc6_1

hffc6

hffc6_1

761493.0

hffc6_2

hffc6

hffc6_2

787967.0

hffc6_3

hffc6

hffc6_3

614921.0

# find genes that exhibit DIE between HFFc6 and HepG2
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']
die_table, die_results = sg.die_gene_test(obs_col=obs_col,
                                          obs_conditions=obs_conditions,
                                          verbose=True)

Testing for DIE for each gene: 100%|██████████| 14684/14684 [03:50<00:00, 123.69it/s]

The resultant table contains an entry for each gene with the p value (p_val), adjusted p value (adj_p_val), and change in percent isoform usage for the top two isoforms (dpi), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in Joglekar et. al., 2021.

die_table.head()

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000130175

0.206667

39.264992

TALONT000296399

NaN

39.264992

NaN

TALONT000296400

ENST00000589838.5

-20.116056

-10.638298

0.469420

PRKCSH

ENSG00000130202

0.000367

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

0.003560

NECTIN2

ENSG00000111371

0.680435

9.401713

ENST00000398637.9

ENST00000439706.5

8.547012

0.854701

ENST00000546893.5

NaN

-9.401711

NaN

0.886243

SLC38A1

ENSG00000181924

0.028195

9.452934

ENST00000537289.1

ENST00000545127.1

7.298603

2.154328

ENST00000355693.4

NaN

-9.452934

NaN

0.122619

COA4

ENSG00000163468

0.255788

0.680048

ENST00000295688.7

TALONT000476055

0.366966

0.277693

ENST00000368259.6

ENST00000489870.1

-0.568503

-0.111545

0.525066

CCT3

Differential isoform expression testing results are stored automatically in sg.adata.uns, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

# die_iso - isoform level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000130175

0.206667

39.264992

TALONT000296399

NaN

39.264992

NaN

TALONT000296400

ENST00000589838.5

-20.116056

-10.638298

0.46942

PRKCSH

ENSG00000130202

0.000367

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

0.00356

NECTIN2

Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds.

test = sg.get_die_genes(obs_col=obs_col, obs_conditions=obs_conditions,
                       p=0.05, dpi=10)
test.head()

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000130202

3.674234e-04

11.893251

ENST00000252485.8

ENST00000252483.9

9.684967

2.208281

TALONT000406668

ENST00000591581.1

-11.473083

-0.420168

3.560035e-03

NECTIN2

ENSG00000025156

1.857411e-03

35.714287

ENST00000452194.5

ENST00000465214.2

25.714287

10.000000

ENST00000368455.8

NaN

-35.714287

NaN

1.405048e-02

HSF2

ENSG00000105254

4.779408e-38

25.419458

ENST00000585910.5

TALONT000366329

20.394853

4.702971

ENST00000221855.7

ENST00000589996.5

-25.016304

-0.403154

7.343219e-36

TBCB

ENSG00000105379

1.802782e-307

81.136333

ENST00000354232.8

NaN

81.136333

NaN

ENST00000309244.8

ENST00000596253.1

-80.857935

-0.278394

1.551113e-304

ETFB

ENSG00000148180

0.000000e+00

89.319039

ENST00000373818.8

TALONT000419680

85.245850

4.073189

TALONT000418752

ENST00000373808.8

-68.603180

-7.768958

0.000000e+00

GSN

Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the kind argument to die_gene_test.

# find genes that exhibit DIE for TSSs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tss',
                                          obs_col=obs_col,
                                          obs_conditions=obs_conditions,
                                          verbose=True)
die_table.head()

Testing for DIE for each gene: 100%|█████████▉| 14674/14684 [02:18<00:00, 162.60it/s]

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000000419

0.249561

10.002187

ENSG00000000419_2

ENSG00000000419_1

9.906052

0.096133

ENSG00000000419_3

ENSG00000000419_4

-7.218704

-2.783483

0.536906

DPM1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_5

NaN

9.093739

NaN

ENSG00000001461_1

ENSG00000001461_3

-5.543442

-1.775148

1.000000

NIPAL3

ENSG00000001497

0.891496

3.846153

ENSG00000001497_1

NaN

3.846146

NaN

ENSG00000001497_2

NaN

-3.846153

NaN

1.000000

LAS1L

ENSG00000001630

0.286866

2.580168

ENSG00000001630_3

NaN

2.580168

NaN

ENSG00000001630_2

ENSG00000001630_1

-1.549232

-1.030928

0.582636

CYP51A1

ENSG00000002330

0.184635

12.817431

ENSG00000002330_2

ENSG00000002330_1

11.868484

0.948944

ENSG00000002330_4

ENSG00000002330_3

-12.603584

-0.213847

0.454053

BAD

To access the die results on the tss level, use die_kind='tss' as input to make_uns_key().

# die_iso - TSS level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions,
                            die_kind='tss')
test = sg.adata.uns[uns_key]
test.head(2)

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000000419

0.249561

10.002187

ENSG00000000419_2

ENSG00000000419_1

9.906052

0.096133

ENSG00000000419_3

ENSG00000000419_4

-7.218704

-2.783483

0.536906

DPM1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_5

NaN

9.093739

NaN

ENSG00000001461_1

ENSG00000001461_3

-5.543442

-1.775148

1.000000

NIPAL3

And provide the kind='tss' option to get_die_genes() when trying to filter your test results.

test = sg.get_die_genes(kind='tss', obs_col=obs_col,
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000003402

7.338098e-15

84.848486

ENSG00000003402_3

ENSG00000003402_7

77.272728

7.575758

ENSG00000003402_1

ENSG00000003402_5

-31.717173

-22.222223

4.390279e-13

CFLAR

ENSG00000003436

3.812087e-06

34.072281

ENSG00000003436_1

ENSG00000003436_3

28.073771

4.564083

ENSG00000003436_4

NaN

-34.072281

NaN

7.064168e-05

TFPI

ENSG00000004487

1.135407e-03

75.714287

ENSG00000004487_1

NaN

75.714287

NaN

ENSG00000004487_2

NaN

-75.714285

NaN

1.020405e-02

KDM1A

ENSG00000006282

4.544048e-03

19.819595

ENSG00000006282_2

ENSG00000006282_1

13.679245

6.140350

ENSG00000006282_3

NaN

-19.819595

NaN

3.236475e-02

SPATA20

ENSG00000007376

1.256379e-04

27.898551

ENSG00000007376_3

ENSG00000007376_1

23.454107

4.444445

ENSG00000007376_5

ENSG00000007376_7

-14.130434

-7.512077

1.525134e-03

RPUSD1

For TESs, use kind='tes' as input to die_genes_test(), die_kind='tes' to make_uns_key(), and kind='tes' to get_die_genes().

# find genes that exhibit DIE for TESs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tes',
                                          obs_col='cell_line',
                                          obs_conditions=['hepg2', 'hffc6'])
die_table.head()

Testing for DIE for each gene: 100%|██████████| 14684/14684 [02:19<00:00, 105.51it/s]

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000000419

1.000000

0.096133

ENSG00000000419_2

NaN

0.096133

NaN

ENSG00000000419_1

NaN

-0.096130

NaN

1.000000

DPM1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_3

NaN

9.093739

NaN

ENSG00000001461_4

ENSG00000001461_1

-5.543442

-1.775148

1.000000

NIPAL3

ENSG00000001630

0.286866

2.580168

ENSG00000001630_2

NaN

2.580168

NaN

ENSG00000001630_1

ENSG00000001630_3

-1.549232

-1.030928

0.618077

CYP51A1

ENSG00000002330

0.184635

12.817431

ENSG00000002330_1

ENSG00000002330_4

11.868484

0.948944

ENSG00000002330_2

ENSG00000002330_3

-12.603584

-0.213847

0.480860

BAD

ENSG00000002549

0.679148

0.694543

ENSG00000002549_2

NaN

0.694542

NaN

ENSG00000002549_1

NaN

-0.694543

NaN

0.926889

LAP3

# die_iso - TES level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col,
                            obs_conditions=obs_conditions,
                            die_kind='tes')
test = sg.adata.uns[uns_key]
test.head(2)

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000000419

1.000000

0.096133

ENSG00000000419_2

NaN

0.096133

NaN

ENSG00000000419_1

NaN

-0.096130

NaN

1.0

DPM1

ENSG00000001461

0.852914

9.093739

ENSG00000001461_3

NaN

9.093739

NaN

ENSG00000001461_4

ENSG00000001461_1

-5.543442

-1.775148

1.0

NIPAL3

test = sg.get_die_genes(kind='tes', obs_col=obs_col,
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENSG00000003402

7.338098e-15

84.848486

ENSG00000003402_5

ENSG00000003402_6

77.272728

7.575758

ENSG00000003402_9

ENSG00000003402_7

-31.717173

-22.222223

5.296535e-13

CFLAR

ENSG00000003436

2.772096e-07

32.637854

ENSG00000003436_1

ENSG00000003436_4

22.082711

10.555142

ENSG00000003436_2

ENSG00000003436_5

-30.435921

-1.818182

7.003007e-06

TFPI

ENSG00000004487

1.135407e-03

75.714287

ENSG00000004487_2

NaN

75.714287

NaN

ENSG00000004487_1

NaN

-75.714285

NaN

1.141621e-02

KDM1A

ENSG00000006282

6.858641e-03

19.819595

ENSG00000006282_2

NaN

19.819595

NaN

ENSG00000006282_1

NaN

-19.819595

NaN

4.915014e-02

SPATA20

ENSG00000010278

3.191221e-22

39.024387

ENSG00000010278_2

NaN

39.024387

NaN

ENSG00000010278_3

ENSG00000010278_1

-38.605976

-0.418410

3.486194e-20

CD9

Finally, for transcriptomes generated with Cerberus, Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with kind='ic' as input to the die_gene_test() function.

sg_brain = swan.read('swan_modelad.p')

# find genes that exhibit DIE for ICs between genotypes
die_table, die_results = sg_brain.die_gene_test(kind='ic',
                                                obs_col='genotype',
                                                obs_conditions=['b6n', '5xfad'])
die_table.head()

Read in graph from swan_modelad.p

gid

p_val

dpi

pos_iso_1

pos_iso_2

pos_iso_1_dpi

pos_iso_2_dpi

neg_iso_1

neg_iso_2

neg_iso_1_dpi

neg_iso_2_dpi

adj_p_val

gname

ENCODEMG000055991

4.100695e-01

6.071428

ENCODEMG000055991_2

ENCODEMG000055991_3

3.571428

2.500000

ENCODEMG000055991_1

NaN

-6.071426

NaN

0.707992

ENCODEMG000055991

ENCODEMG000055998

6.190162e-01

9.722223

ENCODEMG000055998_2

NaN

9.722223

NaN

ENCODEMG000055998_1

NaN

-9.722218

NaN

0.836200

ENCODEMG000055998

ENCODEMG000056718

8.081718e-07

6.289159

NaN

2.329770

1.953835

NaN

-3.339438

-2.949721

0.000020

ENCODEMG000056718

ENCODEMG000056804

2.107047e-03

27.863049

ENCODEMG000056804_1

NaN

27.863047

NaN

ENCODEMG000056804_2

NaN

-27.863049

NaN

0.021510

ENCODEMG000056804

ENCODEMG000063411

7.699701e-01

12.500000

ENCODEMG000063411_1

NaN

12.500000

NaN

ENCODEMG000063411_2

NaN

-12.500000

NaN

0.919808

ENCODEMG000063411

Combining metadata columns

What if none of the metadata columns you have summarize the comparisons you want to make? What if I want to find differentially expressed genes or transcripts, or find isoform-switching genes between hffc6 replicate 3 and hepg2 replicate 1?

sg.adata.obs.head()

cell_line

replicate

dataset

total_counts

dataset

hepg2_1

hepg2

hepg2_1

499647.0

hepg2_2

hepg2

hepg2_2

848447.0

hffc6_1

hffc6

hffc6_1

761493.0

hffc6_2

hffc6

hffc6_2

787967.0

hffc6_3

hffc6

hffc6_3

614921.0

Let's ignore for a moment the fact that the dataset column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:

col_name = sg.add_multi_groupby(['cell_line', 'replicate'])

print(col_name)
print(sg.adata.obs.head())

cell_line_replicate
        cell_line replicate  dataset  total_counts cell_line_replicate
dataset                                                               
hepg2_1     hepg2         1  hepg2_1      499647.0             hepg2_1
hepg2_2     hepg2         2  hepg2_2      848447.0             hepg2_2
hffc6_1     hffc6         1  hffc6_1      761493.0             hffc6_1
hffc6_2     hffc6         2  hffc6_2      787967.0             hffc6_2
hffc6_3     hffc6         3  hffc6_3      614921.0             hffc6_3

The added column in col_name can then be used as the obs_col input to die_gene_test() as follows:

obs_col = col_name
obs_conditions = ['hffc6_3', 'hepg2_1']


die_table, die_results = sg.die_gene_test(kind='iso',
                                          obs_col=obs_col,
                                          obs_conditions=obs_conditions)

Exon skipping and intron retention

Swan can detect novel (unannotated) exon skipping and intron retention events.

To obtain a dataframe of novel exon skipping events, run the following code:

# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel exon skipping events
es_df = sg.find_es_genes(verbose=True)

Testing each novel edge for exon skipping:   0%|          | 0/855 [00:00<?, ?it/s]

Analyzing 855 intronic edges for ES


Testing each novel edge for exon skipping: 100%|██████████| 855/855 [1:26:06<00:00,  6.13s/it]

Found 529 novel es events in 149 transcripts.

es_df.head()

gid

tid

edge_id

ENSG00000157916.19

TALONT000218256

952616

ENSG00000122406.13

TALONT000425229

952716

ENSG00000224093.5

TALONT000434035

952720

ENSG00000224093.5

TALONT000434035

952720

ENSG00000224093.5

TALONT000434035

952720

To obtain a list of genes containing novel intron retention events, run the following code:

# returns a DataFrame of genes, transcripts, and specific edges in
# the SwanGraph with novel intron retaining events
ir_df = sg.find_ir_genes(verbose=True)

  0%|          | 0/1186 [00:00<?, ?it/s][A
Testing each novel edge for intron retention:   0%|          | 0/1186 [00:00<?, ?it/s][A

Analyzing 1186 exonic edges for IR
Testing each novel edge for intron retention: 100%|██████████| 1186/1186 [1:59:16<00:00,  5.97s/it][A

Found 35 novel ir events in 27 transcripts.

You can pass gene IDs from es_df into gen_report() or plot_graph() to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.

PreviousGetting started NextVisualization

Last updated 2 years ago

Was this helpful?