Commit b945cc4

Updated README, minor changes to analysis notebook
1 parent bf9bf40 commit b945cc4

4 files changed: +36 -119 lines

README.md (+14 -9)

@@ -1,6 +1,6 @@
 # SCEMILA - README
 
-Welcome to the Github repository supplementing the publication "Predicting AML genetic subtypes and diagnostic cells with attention augmented multiple instance learning" (Hehr et al., 2021, currently under review).
+Welcome to the Github repository supplementing the publication "Explainable AI identifies diagnostic cells of genetic AML subtypes." (Hehr M, Sadafi A, Matek C, Lienemann P, Pohlkamp C, et al. (2023) PLOS Digital Health 2(3): e0000187. https://doi.org/10.1371/journal.pdig.0000187).
 
 ## Table of contents
 1. Description
@@ -24,15 +24,16 @@ Welcome to the Github repository supplementing the publication "Predicting AML g
 
 # 1. Description
 ## About
-This Repo contains both the machine learning algorithm and the necessary functions to analyze and plot the figures published in the paper "Predicting AML genetic subtypes and diagnostic cells with attention augmented multiple instance learning" (Hehr et al., 2021, currently under review).
+This Repository contains both the machine learning algorithm and the necessary functions to analyze and plot the figures published in the paper "Explainable AI identifies diagnostic cells of genetic AML subtypes." (Hehr M, Sadafi A, Matek C, Lienemann P, Pohlkamp C, et al. (2023) PLOS Digital Health 2(3): e0000187. https://doi.org/10.1371/journal.pdig.0000187).
 
 ## Contact
 For questions and issues regarding the code, feel free to contact [Matthias Hehr](https://www.linkedin.com/in/matthias-hehr/). Otherwise, please reach out to the corresponding authors.
 
 # 2. Getting started
 
 ## 2.1 Data
-The data will be published and available for download soon. To reproduce results, download the data and unzip it.
+To reproduce results, download the data and unzip it. The publication of our dataset is currently in progress, the data will be available at [The Cancer Imaging Archive (TCIA):](https://www.cancerimagingarchive.net/) https://doi.org/10.7937/6ppe-4020
+
 
 ## 2.2 Dependencies
 The pipeline and corresponding analysis requires a python environment with various packages. The [requirements file](requirements.txt) will be of help to build a functioning python environment.
@@ -72,24 +73,28 @@ This will create a new folder in your directory `TARGET_FOLDER` called `result_f
 To analyze the data generated and take a look at various visualizations, use the [analysis notebook](analysis/analysis_notebook.ipynb) and adjust the corresponding paths as mentioned in 2.3 (Code Setup).
 
 The notebook is designed to simplify analysis of the results generated with the pipeline, by automated plotting of most of the figures published in the paper. These figures are then exported directly into the [output folder](analysis/output).
+The last sections of our notebook require large amounts of RAM (we recommend 32GB), otherwise the pythonkernel might crash.
 
 # 3. Authors
 Major contributions were made by the following people:
 
-Matthias Hehr<sup>1,2,3</sup>, Ario Sadafi<sup>1,2</sup>, Christian Matek<sup>1,2,3</sup>, Christian Pohlkamp<sup>4</sup>, Torsten Haferlach<sup>4</sup>, Karsten Spiekermann<sup>3,5,6,+</sup> and Carsten Marr<sup>1,2,+</sup>
+Matthias Hehr<sup>1,2,3</sup>, Ario Sadafi<sup>1,2,4</sup>, Christian Matek<sup>1,2,3</sup>, Peter Lienemann<sup>1,3</sup>, Christian Pohlkamp<sup>5</sup>, Torsten Haferlach<sup>5</sup>, Karsten Spiekermann<sup>3,6,7</sup> and Carsten Marr<sup>1,2,*</sup>
 
 <sup>1</sup>Institute of AI for Health, Helmholtz Zentrum München – German Research Center for Environmental Health, Neuherberg, Germany
 <sup>2</sup>Institute of Computational Biology, Helmholtz Zentrum München – German Research Center for Environmental Health, Neuherberg, Germany
 <sup>3</sup>Laboratory of Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
-<sup>4</sup>Munich Leukemia Laboratory, Munich, Germany
-<sup>5</sup>German Cancer Consortium (DKTK), Heidelberg, Germany
-<sup>6</sup>German Cancer Research Center (DKFZ), Heidelberg, Germany
-<sup>+</sup>Corresponding authors
+<sup>4</sup>Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany
+<sup>5</sup>Munich Leukemia Laboratory, Munich, Germany
+<sup>6</sup>German Cancer Consortium (DKTK), Heidelberg, Germany
+<sup>7</sup>German Cancer Research Center (DKFZ), Heidelberg, Germany
+<sup>*</sup>Corresponding author: [email protected]
 
 
 
 # 4. Acknowledgements
 M.H. was supported by a José-Carreras-DGHO-Promotionsstipendium. C.M. has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 866411)
 
 # 5. License
-[See the license](LICENSE). If you use this code, please cite our original paper (further information about the citation will follow).
+[See the license](LICENSE). If you use this code, please cite our original paper:
+
+
+Hehr M, Sadafi A, Matek C, Lienemann P, Pohlkamp C, et al. (2023) Explainable AI identifies diagnostic cells of genetic AML subtypes. PLOS Digital Health 2(3): e0000187. https://doi.org/10.1371/journal.pdig.0000187
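Note: the README addition above recommends 32GB of RAM for the later notebook sections. A minimal, hypothetical pre-flight check one could run at the top of the notebook, assuming the psutil package is available (it is not part of the repository's stated requirements):

```python
# Hypothetical check before the memory-heavy notebook sections.
# Assumes psutil is installed; it is not necessarily listed in requirements.txt.
import psutil

REQUIRED_GB = 32  # recommendation taken from the README note above

available_gb = psutil.virtual_memory().total / 1024**3
if available_gb < REQUIRED_GB:
    print(f"Warning: only {available_gb:.1f} GB RAM detected; "
          f"the UMAP/occlusion sections may crash the kernel.")
```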

analysis/analysis_notebook.ipynb (+9 -97)

@@ -105,15 +105,6 @@
     "print(\"Images per patient (std): \", sc_df.index.value_counts().std())"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "patient_df"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -198,7 +189,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# entropy_plot.entropy_plot(patient_df)\n",
     "entropy_plot.entropy_vs_myb(patient_df)"
    ]
   },
@@ -219,12 +209,13 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Patients: algorithm performance\n",
     "### Define a patient\n",
-    "Put in a patient ID to look at the predictions"
+    "Enter any patient ID to look at the predictions"
    ]
   },
   {
@@ -299,11 +290,12 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Show swarmplot\n",
-    "Swarmplot is interactive and shows cells upon mouseover. Classes of cells can be excluded by clicking the corresponding label in the legend on the right. This cell automatically stores a vector graphic in the folder ```output/swarmplots```, and calculates the distribution of cells in each quartile (see dataframe below interactive figure)"
+    "The presented Swarmplot is interactive and shows cells upon mouseover. Classes of cells can be excluded by clicking the corresponding label in the legend on the right. This cell automatically stores a vector graphic in the folder ```output/swarmplots```, and calculates the distribution of cells in each quartile (see dataframe below interactive figure)"
    ]
   },
   {
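Note: the swarmplot cell above describes mouseover images and a click-to-hide legend. As an illustration only (this is not the repository's bokeh_wrapper code; the column names and image paths are invented), this is roughly how such behaviour is wired in Bokeh:

```python
# Illustrative Bokeh pattern for a mouseover-image scatter with a hideable legend.
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool

# dummy single-cell points; "img" would point to per-cell image files
source = ColumnDataSource(data=dict(
    x=[0.1, 0.4, 0.7], y=[1, 2, 3],
    img=["cell_001.png", "cell_002.png", "cell_003.png"],
    label=["NPM1", "PML_RARA", "control"],
))

p = figure(width=600, height=400, title="Swarmplot sketch")
p.scatter("x", "y", source=source, legend_field="label", size=8)

# show the single-cell image when hovering over a point
p.add_tools(HoverTool(tooltips='<img src="@img" height="80">'))
# clicking a legend entry hides that class, as described in the notebook cell
p.legend.click_policy = "hide"
show(p)
```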
@@ -384,20 +376,21 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# calculate occlusion and solitary cell predictions\n",
+    "# calculate occlusion and solitary cell predictions as predicted in Fig. 3b\n",
     "occlusion_values = sc_occlusion.calculate_change_on_occlusion(data_with_mappings_and_coordinates, result_folder_path, \n",
     "                    folders_cv_available, feature_prefix, lbl_conv_obj)\n",
     "\n",
     "bokeh_wrapper.init_sol_plot(occlusion_values)"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# UMAP embedding\n",
     "\n",
-    "The interactive UMAP figures require quite a lot of RAM and computing power. To calculate the occlusion values, the use of CUDA-capable GPUs is highly recommended and will greatly speed up the process.\n",
+    "The interactive UMAP figures require quite a lot of RAM and computing power. To calculate the occlusion values, the use of CUDA-capable GPUs is highly recommended and will greatly speed up the process. From here on out, we recommend 32GB of RAM, otherwise the kernel will most likely crash.\n",
     "\n",
     "1. Calculate or load the UMAP embedding (not necessary, if an old gzip file should be loaded)"
    ]
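Note: the occlusion cell above calls sc_occlusion.calculate_change_on_occlusion to measure how much each single cell drives the bag-level prediction. A minimal sketch of the general leave-one-cell-out idea, written against a hypothetical predict_bag callable rather than the repository's API:

```python
import numpy as np

def occlusion_scores(bag_features, predict_bag, target_class):
    """Hypothetical sketch of instance-level occlusion for one MIL bag.

    bag_features: (n_cells, n_features) array for one patient
    predict_bag:  assumed callable returning class probabilities for a bag
    target_class: index of the class whose probability change is tracked
    """
    baseline = predict_bag(bag_features)[target_class]
    scores = np.zeros(len(bag_features))
    for i in range(len(bag_features)):
        occluded = np.delete(bag_features, i, axis=0)  # drop one cell
        scores[i] = baseline - predict_bag(occluded)[target_class]
    return scores  # positive = removing the cell lowers the prediction
```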
@@ -411,10 +404,10 @@
    "outputs": [],
    "source": [
     "# sample cells randomly for embedding\n",
-    "fold_filter = 0\n",
+    "fold_filter = 2\n",
     "sc_umap_sample = sc_df.loc[sc_df['fold'] == fold_filter].sample(frac=1, random_state=1).copy()\n",
     "\n",
-    "sc_df_umap = umap_embedding.select_embedding(sc_umap_sample)"
+    "sc_df_umap = umap_embedding.generate_embedding(sc_umap_sample, save=False)"
    ]
   },
   {
@@ -587,15 +580,6 @@
     "image_excerpt.plot(tmp_frame, show_scalebar=False, show_coordinates=False, cols=cols, path_save=path_save, show_patient_class=False)"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "tmp_frame"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -662,78 +646,6 @@
     "                       bokeh_wrapper.export_umap(sc_prepared, data_column=\"solitary_softmax_{}\".format(class_lbl),grayscatter=True, \n",
     "                       dotsize=10, path_save=path_save)"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "patient_df"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import matplotlib.colors as mpt_colors\n",
-    "import matplotlib.pyplot as plt\n",
-    "from matplotlib import cm\n",
-    "import font_matching\n",
-    "\n",
-    "# filter by fold\n",
-    "patient_df_f0 = patient_df.loc[patient_df['fold'] == 0]\n",
-    "    \n",
-    "clusterplot_structure = {}\n",
-    "    \n",
-    "# for every entity, iterate in order:\n",
-    "for entity in ['AML-PML-RARA', 'AML-NPM1', 'AML-CBFB-MYH11', 'AML-RUNX1-RUNX1T1', 'SCD']:\n",
-    "    patient_df_f0_ent = patient_df_f0.loc[patient_df_f0['gt_label'] == entity]\n",
-    "    clusterplot_structure[entity] = patient_df_f0_ent.sort_values(by='mil_prediction_{}'.format(entity), ascending=False).index\n",
-    "    \n",
-    "# find maximum length for grid subplots\n",
-    "max_len = max([len(x) for x in clusterplot_structure.values()])\n",
-    "fig, ax = plt.subplots(max_len, 5, figsize=(10, max_len/2), constrained_layout=True)\n",
-    "\n",
-    "for a in ax.flatten():\n",
-    "    a.spines['top'].set_visible(False)\n",
-    "    a.spines['right'].set_visible(False)\n",
-    "    a.spines['bottom'].set_visible(False)\n",
-    "    a.spines['left'].set_visible(False)\n",
-    "    \n",
-    "    a.set_xticks([])\n",
-    "    a.set_yticks([])\n",
-    "\n",
-    "column_counter = 0\n",
-    "for entity in ['AML-PML-RARA', 'AML-NPM1', 'AML-CBFB-MYH11', 'AML-RUNX1-RUNX1T1', 'SCD']:\n",
-    "    row_counter = 0\n",
-    "    for patient in clusterplot_structure[entity]:\n",
-    "        bokeh_wrapper.sol_att_bar_plot(sc_prepared.loc[patient], ax[row_counter, column_counter])\n",
-    "        if(row_counter == 0):\n",
-    "            ax[row_counter, column_counter].set_title(font_matching.edit(entity))\n",
-    "        row_counter += 1\n",
-    "    \n",
-    "    column_counter += 1\n",
-    "print(\"Plotting...\") \n",
-    "plt.show()\n",
-    "print(\"Done!\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
 "metadata": {

analysis/functions/bokeh_wrapper.py (+11 -11)

@@ -1191,17 +1191,17 @@ def init_sol_plot(data):
 def sol_att_bar_plot(data, ax):
     '''plot a cumulative barh plot, optional stacked indicators for attention'''
 
-    pred_columns = ['mil_prediction_AML-RUNX1-RUNX1T1',
-                    'mil_prediction_AML-CBFB-MYH11',
-                    'mil_prediction_AML-PML-RARA',
-                    'mil_prediction_AML-NPM1',
-                    'mil_prediction_SCD']
-
-    CLASS_COLORS = {'solitary_softmax_AML-PML-RARA': ((1.0, 127 / 255, 14 / 255), 0),
-                    'solitary_softmax_AML-NPM1': ('red', 1),
-                    'solitary_softmax_AML-CBFB-MYH11': ('sienna', 2),
-                    'solitary_softmax_AML-RUNX1-RUNX1T1': ('dodgerblue', 3),
-                    'solitary_softmax_SCD': ('limegreen', 4)}
+    pred_columns = ['mil_prediction_RUNX1_RUNX1T1',
+                    'mil_prediction_CBFB_MYH11',
+                    'mil_prediction_PML_RARA',
+                    'mil_prediction_NPM1',
+                    'mil_prediction_control']
+
+    CLASS_COLORS = {'solitary_softmax_PML_RARA': ((1.0, 127 / 255, 14 / 255), 0),
+                    'solitary_softmax_NPM1': ('red', 1),
+                    'solitary_softmax_CBFB_MYH11': ('sienna', 2),
+                    'solitary_softmax_RUNX1_RUNX1T1': ('dodgerblue', 3),
+                    'solitary_softmax_control': ('limegreen', 4)}
 
     def col_transform(x): return CLASS_COLORS[x][0]
     def order_transform(x): return CLASS_COLORS[x][1]
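Note: in the renamed dictionaries above, each class maps to a (color, order) tuple that col_transform and order_transform unpack. A small sketch of how such a mapping can drive a stacked horizontal bar; the per-class fractions below are invented for illustration and are not repository data:

```python
import matplotlib.pyplot as plt

# (color, plotting order) per class, mirroring the renamed keys in the diff
CLASS_COLORS = {'solitary_softmax_PML_RARA': ((1.0, 127 / 255, 14 / 255), 0),
                'solitary_softmax_NPM1': ('red', 1),
                'solitary_softmax_CBFB_MYH11': ('sienna', 2),
                'solitary_softmax_RUNX1_RUNX1T1': ('dodgerblue', 3),
                'solitary_softmax_control': ('limegreen', 4)}

# invented per-class fractions for one patient, just to show the stacking
fractions = {'solitary_softmax_PML_RARA': 0.05,
             'solitary_softmax_NPM1': 0.60,
             'solitary_softmax_CBFB_MYH11': 0.10,
             'solitary_softmax_RUNX1_RUNX1T1': 0.05,
             'solitary_softmax_control': 0.20}

fig, ax = plt.subplots(figsize=(6, 1))
left = 0.0
for col in sorted(fractions, key=lambda c: CLASS_COLORS[c][1]):  # order_transform
    ax.barh(0, fractions[col], left=left, color=CLASS_COLORS[col][0])  # col_transform
    left += fractions[col]
ax.set_yticks([])
plt.show()
```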

analysis/functions/umap_embedding.py (+2 -2)

@@ -42,7 +42,7 @@ def select_embedding(sc_dataframe, fillup_unmatched=True):
             PATH_EMBEDDINGS, name_new + '.pkl'))
 
 
-def generate_embedding(sc_dataframe, path_target, save=True):
+def generate_embedding(sc_dataframe, path_target="", save=True):
     global scaler, reducer
 
     # create scalers and reducer
@@ -51,7 +51,7 @@ def generate_embedding(sc_dataframe, path_target, save=True):
     cell_data = sc_dataframe_embedding[FEATURES].values
     umap_scaler = StandardScaler().fit(cell_data)
     scaled_cell_data = umap_scaler.transform(cell_data)
-    umap_reducer = umap.UMAP(verbose=True).fit(scaled_cell_data)
+    umap_reducer = umap.UMAP(verbose=False).fit(scaled_cell_data)
     embedding = umap_reducer.transform(scaled_cell_data)
 
     sc_dataframe['x'] = embedding[..., 0]
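Note: the hunk above shows the core of generate_embedding, which now takes an optional path_target and runs UMAP quietly: standard-scale the single-cell features, then fit and apply a UMAP reducer. A self-contained sketch of that scale-then-embed pattern on dummy data, assuming scikit-learn and umap-learn are installed; the feature columns are made up:

```python
# Minimal scale-then-embed sketch mirroring generate_embedding's core steps.
import numpy as np
import pandas as pd
import umap
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
sc_dataframe = pd.DataFrame(rng.normal(size=(500, 16)),
                            columns=[f"ft_{i}" for i in range(16)])  # dummy features

cell_data = sc_dataframe.values
scaled_cell_data = StandardScaler().fit_transform(cell_data)   # scale features
reducer = umap.UMAP(verbose=False).fit(scaled_cell_data)       # as in the diff
embedding = reducer.transform(scaled_cell_data)

sc_dataframe['x'] = embedding[..., 0]                          # 2D coordinates
sc_dataframe['y'] = embedding[..., 1]
print(sc_dataframe[['x', 'y']].head())
```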
