Additional utilities

Swan comes with several utilities for computing and outputting various metrics from the data in a SwanGraph.

We'll be using the same SwanGraph as the rest of the tutorial pages to demonstrate these utilities. Load it using the following code:

import swan_vis as swan

# code to download this data is in the Getting started tutorial
sg = swan.read('../tutorials/data/swan.p')
Read in graph from ../tutorials/data/swan.p

Calculating TPM values

Swan lets users calculate the TPM of their data, grouped by different metadata columns, using the calc_tpm() function. You can use it on any of the AnnData objects in the SwanGraph: SwanGraph.adata for transcripts, SwanGraph.tss_adata for TSSs, SwanGraph.tes_adata for TESs, and SwanGraph.edge_adata for edges. See the Data structure FAQ page for more information on these tables.

First, we'll calculate the TPM for each transcript in each dataset:

df = swan.calc_tpm(sg.adata)
df.head()

tid       ENST00000000233.9   ENST00000000412.7  ENST00000000442.10   ENST00000001008.5  \
hepg2_1          196.138474           86.060760            8.005652           46.032497
hepg2_2          243.975174           77.789185            7.071744           61.288448
hffc6_1          131.320969          194.355042            0.000000          107.683197
hffc6_2          137.061584          242.395935            0.000000          124.370689
hffc6_3          147.986496          273.205841            3.252450          172.379868

tid       ENST00000001146.6   ENST00000002125.8  ENST00000002165.10  ENST00000002501.10  \
hepg2_1                 0.0           16.011305          258.182281           60.042389
hepg2_2                 0.0           12.964864          380.695557           64.824318
hffc6_1                 0.0            6.566049          278.400452            0.000000
hffc6_2                 0.0            8.883621          219.552338            0.000000
hffc6_3                 0.0            9.757351          200.025696            1.626225

tid       ENST00000002596.5   ENST00000002829.7  ...  TALONT000482711  TALONT000482903  \
hepg2_1                 0.0                 0.0  ...         0.000000         4.002826
hepg2_2                 0.0                 0.0  ...         1.178624        14.143488
hffc6_1                 0.0                 0.0  ...         6.566049        13.132097
hffc6_2                 0.0                 0.0  ...        15.229064        10.152709
hffc6_3                 0.0                 0.0  ...        14.636026        11.383576

tid      TALONT000483195  TALONT000483284  TALONT000483315  TALONT000483322  \
hepg2_1         2.001413        12.008478         0.000000         0.000000
hepg2_2         4.714496         7.071744         2.357248         8.250368
hffc6_1         9.192468         1.313210         6.566049         9.192468
hffc6_2         6.345443         8.883621         1.269089        10.152709
hffc6_3         8.131125         8.131125        11.383576        11.383576

tid      TALONT000483327  TALONT000483978  TALONT000484004  TALONT000484796
hepg2_1         4.002826        14.009891         8.005652         0.000000
hepg2_2         2.357248        11.786240        10.607616         1.178624
hffc6_1         6.566049         0.000000        15.758516         1.313210
hffc6_2        15.229064         0.000000        16.498154         8.883621
hffc6_3         6.504900         0.000000        24.393377         9.757351

[5 rows x 208306 columns]
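
To make the TPM values more concrete, here is a minimal sketch of the normalization in plain Python. It assumes that TPM is computed as count * 1e6 / total counts in the dataset, with no length normalization; this is an assumption about calc_tpm(), not its actual implementation. The function name tpm_per_dataset and the toy counts are hypothetical.

```python
def tpm_per_dataset(counts):
    """Scale each dataset's counts so that they sum to one million.

    counts: dict mapping dataset name -> dict of transcript ID -> raw count.
    Assumption: TPM = count * 1e6 / total counts in that dataset.
    """
    tpm = {}
    for dataset, tx_counts in counts.items():
        total = sum(tx_counts.values())
        tpm[dataset] = {
            tid: (count * 1e6 / total if total else 0.0)
            for tid, count in tx_counts.items()
        }
    return tpm


# Toy counts (hypothetical, not taken from the SwanGraph above)
toy_counts = {
    'sample_a': {'tx1': 30, 'tx2': 10, 'tx3': 60},
    'sample_b': {'tx1': 0, 'tx2': 5, 'tx3': 15},
}

toy_tpm = tpm_per_dataset(toy_counts)
print(toy_tpm['sample_a'])
```

Under this assumption, the values in each row of the output table sum to one million across all transcripts.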

We can swap out the first argument with the different AnnData structures in the SwanGraph. For instance, say we want to calculate the TPM of each TSS:

df = swan.calc_tpm(sg.tss_adata)
df.head()

tss_id   ENSG00000000003.14_1  ENSG00000000003.14_2  ENSG00000000003.14_3  ENSG00000000003.14_4  \
hepg2_1                   0.0            232.163910                   0.0                   0.0
hepg2_2                   0.0            276.976654                   0.0                   0.0
hffc6_1                   0.0             45.962341                   0.0                   0.0
hffc6_2                   0.0             53.301723                   0.0                   0.0
hffc6_3                   0.0             68.301460                   0.0                   0.0

tss_id    ENSG00000000005.5_1   ENSG00000000005.5_2  ENSG00000000419.12_1  ENSG00000000419.12_2  \
hepg2_1                   0.0                   0.0                   0.0             54.038151
hepg2_2                   0.0                   0.0                   0.0            103.718910
hffc6_1                   0.0                   0.0                   0.0            101.117149
hffc6_2                   0.0                   0.0                   0.0             85.028938
hffc6_3                   0.0                   0.0                   0.0             89.442383

tss_id   ENSG00000000457.13_1  ENSG00000000457.13_2  ...  TALONG000085596_1  TALONG000085799_1  \
hepg2_1                   0.0              0.000000  ...           0.000000           0.000000
hepg2_2                   0.0              2.357248  ...           0.000000           0.000000
hffc6_1                   0.0              0.000000  ...           2.626419           6.566049
hffc6_2                   0.0              1.269089  ...           6.345443           1.269089
hffc6_3                   0.0              1.626225  ...           8.131125           8.131125

tss_id   TALONG000085978_1  TALONG000086022_1  TALONG000086057_1  TALONG000086218_1  \
hepg2_1           0.000000          60.042389           0.000000           0.000000
hepg2_2           0.000000          95.468544           0.000000           0.000000
hffc6_1           9.192468           0.000000           7.879258          11.818888
hffc6_2          12.690886           0.000000           8.883621          20.305418
hffc6_3          11.383576           0.000000          11.383576          17.888477

tss_id   TALONG000086443_1  TALONG000086539_1  TALONG000086553_1  TALONG000086766_1
hepg2_1           6.004239           0.000000           0.000000           8.005652
hepg2_2          31.822847           0.000000           1.178624          10.607616
hffc6_1         233.751328           9.192468           6.566049          15.758516
hffc6_2         119.294334          12.690886           2.538177          16.498154
hffc6_3         134.976685          27.645828           8.131125          24.393377

[5 rows x 130176 columns]

Finally, we can compute TPM over an alternative metadata column using the obs_col argument. For instance, we can use the cell_line column:

df = swan.calc_tpm(sg.adata, obs_col='cell_line')
df.head()

tid       ENST00000000233.9   ENST00000000412.7  ENST00000000442.10   ENST00000001008.5  \
hepg2            226.245346           80.854897            7.417881           55.634102
hffc6            138.145737          234.247116            0.924052          132.139404

tid       ENST00000001146.6   ENST00000002125.8  ENST00000002165.10  ENST00000002501.10  \
hepg2                   0.0           14.093972          335.288177           63.051983
hffc6                   0.0            8.316465          234.709137            0.462026

tid       ENST00000002596.5   ENST00000002829.7  ...  TALONT000482711  TALONT000482903  \
hepg2                   0.0                 0.0  ...         0.741788        10.385033
hffc6                   0.0                 0.0  ...        12.012672        11.550647

tid      TALONT000483195  TALONT000483284  TALONT000483315  TALONT000483322  \
hepg2            3.70894         8.901457         1.483576         5.192516
hffc6            7.85444         6.006336         6.006336        10.164569

tid      TALONT000483327  TALONT000483978  TALONT000484004  TALONT000484796
hepg2           2.967152        12.610396         9.643245         0.741788
hffc6           9.702543         0.000000        18.481035         6.468362

[2 rows x 208306 columns]
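
One plausible way to compute TPM for a metadata column is to pool the raw counts of all datasets that share a value in that column, and then normalize each pooled group. The sketch below assumes this pooling behavior; how calc_tpm() handles obs_col internally is not stated here. The function name grouped_tpm and the replicate names are hypothetical.

```python
def grouped_tpm(counts, groups):
    """Sum counts within each group, then scale each group to one million.

    counts: dict mapping dataset name -> dict of transcript ID -> raw count.
    groups: dict mapping dataset name -> group label (e.g. a cell line).
    Assumption: counts are pooled per group before normalizing.
    """
    pooled = {}
    for dataset, tx_counts in counts.items():
        group_counts = pooled.setdefault(groups[dataset], {})
        for tid, count in tx_counts.items():
            group_counts[tid] = group_counts.get(tid, 0) + count

    tpm = {}
    for group, tx_counts in pooled.items():
        total = sum(tx_counts.values())
        tpm[group] = {
            tid: (count * 1e6 / total if total else 0.0)
            for tid, count in tx_counts.items()
        }
    return tpm


# Hypothetical replicates and their group labels
rep_counts = {
    'line_a_1': {'tx1': 10, 'tx2': 30},
    'line_a_2': {'tx1': 30, 'tx2': 30},
    'line_b_1': {'tx1': 5, 'tx2': 15},
}
rep_groups = {'line_a_1': 'line_a', 'line_a_2': 'line_a', 'line_b_1': 'line_b'}

group_tpm = grouped_tpm(rep_counts, rep_groups)
print(group_tpm['line_a'])
```

Note that pooling counts first is not the same as averaging the per-dataset TPM values, because datasets with more reads contribute more to the pooled result.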

Calculating pi values

You can use the calc_pi() function to calculate percent isoform use (pi) per gene in nearly the same way as calc_tpm(): you can run it at the transcript, edge, TSS, or TES level, and you can choose the metadata variable to group by. The only difference is that calc_pi() also requires a DataFrame as its second argument that tells Swan which gene each entry comes from. The DataFrame to provide for each AnnData is listed below:

AnnData              DataFrame
SwanGraph.adata      SwanGraph.t_df
SwanGraph.tss_adata  SwanGraph.tss_adata.var
SwanGraph.tes_adata  SwanGraph.tes_adata.var

First, we'll calculate the pi value for each transcript in each dataset:

df, sums = swan.calc_pi(sg.adata, sg.t_df)
df.head()

tid       ENST00000000233.9   ENST00000000412.7  ENST00000000442.10   ENST00000001008.5  \
hepg2_1          100.000000               100.0          100.000000               100.0
hepg2_2           99.519226               100.0           60.000004               100.0
hffc6_1           98.039215               100.0            0.000000               100.0
hffc6_2           99.082573               100.0            0.000000               100.0
hffc6_3          100.000000               100.0          100.000000               100.0

tid       ENST00000001146.6   ENST00000002125.8  ENST00000002165.10  ENST00000002501.10  \
hepg2_1                 0.0          100.000000               100.0           93.750000
hepg2_2                 0.0          100.000000               100.0           80.882355
hffc6_1                 0.0          100.000000               100.0            0.000000
hffc6_2                 0.0           77.777779               100.0            0.000000
hffc6_3                 0.0           85.714287               100.0          100.000000

tid       ENST00000002596.5   ENST00000002829.7  ...  TALONT000482711  TALONT000482903  \
hepg2_1                 0.0                 0.0  ...         0.000000         1.904762
hepg2_2                 0.0                 0.0  ...         5.263158         3.225806
hffc6_1                 0.0                 0.0  ...         2.604167         2.092050
hffc6_2                 0.0                 0.0  ...         4.285715         2.144772
hffc6_3                 0.0                 0.0  ...         4.326923         2.536232

tid      TALONT000483195  TALONT000483284  TALONT000483315  TALONT000483322  \
hepg2_1         6.666667        13.043478         0.000000         0.000000
hepg2_2        13.793103         8.695652         0.884956         3.097345
hffc6_1        16.279070         1.428571         0.854701         1.196581
hffc6_2        11.627908        14.893617         0.166667         1.333333
hffc6_3        15.151516        10.638298         1.711491         1.711491

tid      TALONT000483327  TALONT000483978  TALONT000484004  TALONT000484796
hepg2_1         1.333333            100.0            100.0         0.000000
hepg2_2         0.884956            100.0            100.0         2.380952
hffc6_1         0.854701              0.0            100.0         1.886792
hffc6_2         2.000000              0.0            100.0         9.859155
hffc6_3         0.977995              0.0            100.0        13.636364

[5 rows x 208306 columns]

Note that calc_pi() returns not only a table of pi values but also a table of counts per isoform per condition, which is used as an intermediate during DIE testing. It is returned here to avoid recalculating it.

sums.head()

          ENST00000000233.9   ENST00000000412.7  ENST00000000442.10   ENST00000001008.5  \
hepg2_1                98.0                43.0                 4.0                23.0
hepg2_2               207.0                66.0                 6.0                52.0
hffc6_1               100.0               148.0                 0.0                82.0
hffc6_2               108.0               191.0                 0.0                98.0
hffc6_3                91.0               168.0                 2.0               106.0

          ENST00000001146.6   ENST00000002125.8  ENST00000002165.10  ENST00000002501.10  \
hepg2_1                 0.0                 8.0               129.0                30.0
hepg2_2                 0.0                11.0               323.0                55.0
hffc6_1                 0.0                 5.0               212.0                 0.0
hffc6_2                 0.0                 7.0               173.0                 0.0
hffc6_3                 0.0                 6.0               123.0                 1.0

          ENST00000002596.5   ENST00000002829.7  ...  TALONT000482711  TALONT000482903  \
hepg2_1                 0.0                 0.0  ...              0.0              2.0
hepg2_2                 0.0                 0.0  ...              1.0             12.0
hffc6_1                 0.0                 0.0  ...              5.0             10.0
hffc6_2                 0.0                 0.0  ...             12.0              8.0
hffc6_3                 0.0                 0.0  ...              9.0              7.0

         TALONT000483195  TALONT000483284  TALONT000483315  TALONT000483322  \
hepg2_1              1.0              6.0              0.0              0.0
hepg2_2              4.0              6.0              2.0              7.0
hffc6_1              7.0              1.0              5.0              7.0
hffc6_2              5.0              7.0              1.0              8.0
hffc6_3              5.0              5.0              7.0              7.0

         TALONT000483327  TALONT000483978  TALONT000484004  TALONT000484796
hepg2_1              2.0              7.0              4.0              0.0
hepg2_2              2.0             10.0              9.0              1.0
hffc6_1              5.0              0.0             12.0              1.0
hffc6_2             12.0              0.0             13.0              7.0
hffc6_3              4.0              0.0             15.0              6.0

[5 rows x 208306 columns]
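
As an illustration of how pi values relate to counts, the sketch below computes percent isoform use for a single dataset in plain Python: each transcript's count is divided by the total count of its gene and multiplied by 100. This is a simplified stand-in, not the calc_pi() implementation. The function name percent_isoform, the toy IDs, and the choice to report 0 for genes with no counts are all assumptions.

```python
def percent_isoform(counts, gene_of):
    """Percent isoform use: 100 * transcript count / total count of its gene.

    counts: dict mapping transcript ID -> count in one dataset or condition.
    gene_of: dict mapping transcript ID -> gene ID.
    """
    gene_totals = {}
    for tid, count in counts.items():
        gene = gene_of[tid]
        gene_totals[gene] = gene_totals.get(gene, 0) + count

    pi = {}
    for tid, count in counts.items():
        total = gene_totals[gene_of[tid]]
        # Assumption: genes with no counts get pi = 0 instead of dividing by zero
        pi[tid] = 100.0 * count / total if total else 0.0
    return pi


# Hypothetical transcripts: gene_a has two isoforms, gene_b has one
iso_counts = {'tx1': 207, 'tx2': 1, 'tx3': 66}
iso_genes = {'tx1': 'gene_a', 'tx2': 'gene_a', 'tx3': 'gene_b'}

iso_pi = percent_isoform(iso_counts, iso_genes)
print(iso_pi)
```

The gene mapping plays the role of the second argument to calc_pi(): without it, there is no way to know which transcripts should be summed together for the denominator.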

We can also calculate the pi value for the TSSs and TESs in each dataset:

df, sums = swan.calc_pi(sg.tss_adata, sg.tss_adata.var)
print(df.head())
print()

df, sums = swan.calc_pi(sg.tes_adata, sg.tes_adata.var)
print(df.head())
print()
tss_id   ENSG00000000003.14_1  ENSG00000000003.14_2  ENSG00000000003.14_3  \
hepg2_1                   0.0                 100.0                   0.0   
hepg2_2                   0.0                 100.0                   0.0   
hffc6_1                   0.0                 100.0                   0.0   
hffc6_2                   0.0                 100.0                   0.0   
hffc6_3                   0.0                 100.0                   0.0   

tss_id   ENSG00000000003.14_4  ENSG00000000005.5_1  ENSG00000000005.5_2  \
hepg2_1                   0.0                  0.0                  0.0   
hepg2_2                   0.0                  0.0                  0.0   
hffc6_1                   0.0                  0.0                  0.0   
hffc6_2                   0.0                  0.0                  0.0   
hffc6_3                   0.0                  0.0                  0.0   

tss_id   ENSG00000000419.12_1  ENSG00000000419.12_2  ENSG00000000457.13_1  \
hepg2_1                   0.0                 100.0                   0.0   
hepg2_2                   0.0                 100.0                   0.0   
hffc6_1                   0.0                 100.0                   0.0   
hffc6_2                   0.0                 100.0                   0.0   
hffc6_3                   0.0                 100.0                   0.0   

tss_id   ENSG00000000457.13_2  ...  TALONG000085596_1  TALONG000085799_1  \
hepg2_1                   0.0  ...                0.0                0.0   
hepg2_2                 100.0  ...                0.0                0.0   
hffc6_1                   0.0  ...              100.0              100.0   
hffc6_2                 100.0  ...              100.0              100.0   
hffc6_3                 100.0  ...              100.0              100.0   

tss_id   TALONG000085978_1  TALONG000086022_1  TALONG000086057_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0                0.0              100.0   
hffc6_2              100.0                0.0              100.0   
hffc6_3              100.0                0.0              100.0   

tss_id   TALONG000086218_1  TALONG000086443_1  TALONG000086539_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0              100.0              100.0   
hffc6_2              100.0              100.0              100.0   
hffc6_3              100.0              100.0              100.0   

tss_id   TALONG000086553_1  TALONG000086766_1  
hepg2_1                0.0              100.0  
hepg2_2              100.0              100.0  
hffc6_1              100.0              100.0  
hffc6_2              100.0              100.0  
hffc6_3              100.0              100.0  

[5 rows x 130176 columns]

tes_id   ENSG00000000003.14_1  ENSG00000000003.14_2  ENSG00000000003.14_3  \
hepg2_1                   0.0                 100.0                   0.0   
hepg2_2                   0.0                 100.0                   0.0   
hffc6_1                   0.0                 100.0                   0.0   
hffc6_2                   0.0                 100.0                   0.0   
hffc6_3                   0.0                 100.0                   0.0   

tes_id   ENSG00000000003.14_4  ENSG00000000003.14_5  ENSG00000000005.5_1  \
hepg2_1                   0.0                   0.0                  0.0   
hepg2_2                   0.0                   0.0                  0.0   
hffc6_1                   0.0                   0.0                  0.0   
hffc6_2                   0.0                   0.0                  0.0   
hffc6_3                   0.0                   0.0                  0.0   

tes_id   ENSG00000000005.5_2  ENSG00000000419.12_1  ENSG00000000419.12_2  \
hepg2_1                  0.0             92.592590                   0.0   
hepg2_2                  0.0             98.863640                   0.0   
hffc6_1                  0.0             98.701302                   0.0   
hffc6_2                  0.0             95.522385                   0.0   
hffc6_3                  0.0             98.181824                   0.0   

tes_id   ENSG00000000419.12_3  ...  TALONG000085596_1  TALONG000085799_1  \
hepg2_1              7.407407  ...                0.0                0.0   
hepg2_2              1.136364  ...                0.0                0.0   
hffc6_1              1.298701  ...              100.0              100.0   
hffc6_2              4.477612  ...              100.0              100.0   
hffc6_3              1.818182  ...              100.0              100.0   

tes_id   TALONG000085978_1  TALONG000086022_1  TALONG000086057_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0                0.0              100.0   
hffc6_2              100.0                0.0              100.0   
hffc6_3              100.0                0.0              100.0   

tes_id   TALONG000086218_1  TALONG000086443_1  TALONG000086539_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0              100.0              100.0   
hffc6_2              100.0              100.0              100.0   
hffc6_3              100.0              100.0              100.0   

tes_id   TALONG000086553_1  TALONG000086766_1  
hepg2_1                0.0              100.0  
hepg2_2              100.0              100.0  
hffc6_1              100.0              100.0  
hffc6_2              100.0              100.0  
hffc6_3              100.0              100.0  

[5 rows x 187454 columns]

And we can also choose to calculate pi values using a different metadata column, here shown on the cell_line column:

df, sums = swan.calc_pi(sg.adata, sg.t_df, obs_col='cell_line')
df.head()

tid       ENST00000000233.9   ENST00000000412.7  ENST00000000442.10   ENST00000001008.5  \
hepg2             99.673203               100.0           71.428574               100.0
hffc6             99.006622               100.0          100.000000               100.0

tid       ENST00000001146.6   ENST00000002125.8  ENST00000002165.10  ENST00000002501.10  \
hepg2                   0.0          100.000000               100.0                85.0
hffc6                   0.0           85.714287               100.0               100.0

tid       ENST00000002596.5   ENST00000002829.7  ...  TALONT000482711  TALONT000482903  \
hepg2                   0.0                 0.0  ...         4.545455         2.935011
hffc6                   0.0                 0.0  ...         3.823529         2.218279

tid      TALONT000483195  TALONT000483284  TALONT000483315  TALONT000483322  \
hepg2          11.363637        10.434782         0.531915         1.861702
hffc6          14.285715         7.926829         0.815558         1.380176

tid      TALONT000483327  TALONT000483978  TALONT000484004  TALONT000484796
hepg2            1.06383            100.0            100.0         1.785714
hffc6            1.31744              0.0            100.0         8.333334

[2 rows x 208306 columns]

Obtaining edge abundance information

In case you're interested in doing outside analyses at the edge level (for instance, using intron counting to assess alternative splicing), Swan provides a tool to output a DataFrame with edge abundance at the dataset level.

To access the edge abundance DataFrame, use the get_edge_abundance() function:

df = sg.get_edge_abundance()
df.head()

   strand  edge_type  annotation  chrom  start   stop  hepg2_1  hepg2_2  hffc6_1  hffc6_2  hffc6_3
0       +       exon        True   chr1  11869  12227      0.0      0.0      0.0      0.0      0.0
1       +       exon        True   chr1  12010  12057      0.0      0.0      0.0      0.0      0.0
2       +     intron        True   chr1  12057  12179      0.0      0.0      0.0      0.0      0.0
3       +       exon        True   chr1  12179  12227      0.0      0.0      0.0      0.0      0.0
4       +     intron        True   chr1  12227  12613      0.0      0.0      0.0      0.0      0.0

You can also specify whether you want the data to be output in raw counts (kind='counts') or TPM (kind='tpm'). By default, this function returns counts. Here's an example with TPM:

df = sg.get_edge_abundance(kind='tpm')
df.head()

   strand  edge_type  annotation  chrom  start   stop  hepg2_1  hepg2_2  hffc6_1  hffc6_2  hffc6_3
0       +       exon        True   chr1  11869  12227      0.0      0.0      0.0      0.0      0.0
1       +       exon        True   chr1  12010  12057      0.0      0.0      0.0      0.0      0.0
2       +     intron        True   chr1  12057  12179      0.0      0.0      0.0      0.0      0.0
3       +       exon        True   chr1  12179  12227      0.0      0.0      0.0      0.0      0.0
4       +     intron        True   chr1  12227  12613      0.0      0.0      0.0      0.0      0.0

Finally, you can provide the function with a prefix, which indicates that you also want the output DataFrame to be saved in TSV form:

df = sg.get_edge_abundance(kind='tpm', prefix='test')
df.head()

   strand  edge_type  annotation  chrom  start   stop  hepg2_1  hepg2_2  hffc6_1  hffc6_2  hffc6_3
0       +       exon        True   chr1  11869  12227      0.0      0.0      0.0      0.0      0.0
1       +       exon        True   chr1  12010  12057      0.0      0.0      0.0      0.0      0.0
2       +     intron        True   chr1  12057  12179      0.0      0.0      0.0      0.0      0.0
3       +       exon        True   chr1  12179  12227      0.0      0.0      0.0      0.0      0.0
4       +     intron        True   chr1  12227  12613      0.0      0.0      0.0      0.0      0.0

The results will be saved in '{prefix}_edge_abundance.tsv'.
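
If you want to pick the saved file back up outside of Swan, for example to keep only intron rows for intron counting, a sketch along these lines could work. It uses only the Python standard library and assumes that the TSV has a header row with the same column names as the DataFrame shown above; the exact layout of the saved file is not confirmed here. The helper names and the stand-in demo file are hypothetical.

```python
import csv
import os
import tempfile


def load_abundance_tsv(path):
    """Read a tab-separated abundance table into a list of row dicts."""
    with open(path, newline='') as handle:
        return list(csv.DictReader(handle, delimiter='\t'))


def intron_rows(rows):
    """Keep only rows whose edge_type column is 'intron'."""
    return [row for row in rows if row.get('edge_type') == 'intron']


# Stand-in file so the sketch runs on its own. With real output you would
# pass the path written by get_edge_abundance(), e.g. 'test_edge_abundance.tsv'.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_edge_abundance.tsv')
with open(demo_path, 'w', newline='') as handle:
    writer = csv.writer(handle, delimiter='\t')
    writer.writerow(['strand', 'edge_type', 'chrom', 'start', 'stop', 'hepg2_1'])
    writer.writerow(['+', 'exon', 'chr1', '11869', '12227', '0.0'])
    writer.writerow(['+', 'intron', 'chr1', '12057', '12179', '0.0'])

rows = load_abundance_tsv(demo_path)
introns = intron_rows(rows)
print(len(rows), len(introns))
```

All values come back as strings with csv.DictReader, so convert the abundance columns to float before doing arithmetic on them.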

Obtaining TSS/TES abundance information

Similarly, if you wish to do analysis involving your TSS or TES data, you can output these using the get_tss_abundance() and get_tes_abundance() functions, respectively. These take the same options as get_edge_abundance(), so they can output either counts or TPM and optionally save to an output file.

First, let's output the TSS TPM to a file:

df = sg.get_tss_abundance(kind='tpm', prefix='test')
df.head()

                 tss_id                 gid   gname  vertex_id  tss_name  chrom      coord  \
0  ENSG00000000003.14_1  ENSG00000000003.14  TSPAN6     926111  TSPAN6_1   chrX  100636191
1  ENSG00000000003.14_2  ENSG00000000003.14  TSPAN6     926112  TSPAN6_2   chrX  100636608
2  ENSG00000000003.14_3  ENSG00000000003.14  TSPAN6     926114  TSPAN6_3   chrX  100636793
3  ENSG00000000003.14_4  ENSG00000000003.14  TSPAN6     926117  TSPAN6_4   chrX  100639945
4   ENSG00000000005.5_1   ENSG00000000005.5    TNMD     926077    TNMD_1   chrX  100585066

     hepg2_1     hepg2_2    hffc6_1    hffc6_2   hffc6_3
0    0.00000    0.000000   0.000000   0.000000   0.00000
1  232.16391  276.976654  45.962341  53.301723  68.30146
2    0.00000    0.000000   0.000000   0.000000   0.00000
3    0.00000    0.000000   0.000000   0.000000   0.00000
4    0.00000    0.000000   0.000000   0.000000   0.00000

Now we'll get the counts of each TES without saving to a file:

df = sg.get_tes_abundance(kind='counts')
df.head()

                 tes_id                 gid   gname  vertex_id  tes_name  chrom      coord  \
0  ENSG00000000003.14_1  ENSG00000000003.14  TSPAN6     926092  TSPAN6_1   chrX  100627109
1  ENSG00000000003.14_2  ENSG00000000003.14  TSPAN6     926093  TSPAN6_2   chrX  100628670
2  ENSG00000000003.14_3  ENSG00000000003.14  TSPAN6     926097  TSPAN6_3   chrX  100632063
3  ENSG00000000003.14_4  ENSG00000000003.14  TSPAN6     926100  TSPAN6_4   chrX  100632541
4  ENSG00000000003.14_5  ENSG00000000003.14  TSPAN6     926103  TSPAN6_5   chrX  100633442

   hepg2_1  hepg2_2  hffc6_1  hffc6_2  hffc6_3
0      0.0      0.0      0.0      0.0      0.0
1    116.0    235.0     35.0     42.0     42.0
2      0.0      0.0      0.0      0.0      0.0
3      0.0      0.0      0.0      0.0      0.0
4      0.0      0.0      0.0      0.0      0.0
