Commit b945cc4

Updated README, minor changes to analysis notebook
1 parent bf9bf40 commit b945cc4

4 files changed: +36 -119 lines

README.md (+14 -9)

@@ -1,6 +1,6 @@
 # SCEMILA - README
 
-Welcome to the Github repository supplementing the publication "Predicting AML genetic subtypes and diagnostic cells with attention augmented multiple instance learning" (Hehr et al., 2021, currently under review).
+Welcome to the Github repository supplementing the publication "Explainable AI identifies diagnostic cells of genetic AML subtypes." (Hehr M, Sadafi A, Matek C, Lienemann P, Pohlkamp C, et al. (2023) PLOS Digital Health 2(3): e0000187. https://doi.org/10.1371/journal.pdig.0000187).
 
 ## Table of contents
 1. Description
@@ -24,15 +24,16 @@ Welcome to the Github repository supplementing the publication "Predicting AML g
 
 # 1. Description
 ## About
-This Repo contains both the machine learning algorithm and the necessary functions to analyze and plot the figures published in the paper "Predicting AML genetic subtypes and diagnostic cells with attention augmented multiple instance learning" (Hehr et al., 2021, currently under review).
+This Repository contains both the machine learning algorithm and the necessary functions to analyze and plot the figures published in the paper "Explainable AI identifies diagnostic cells of genetic AML subtypes." (Hehr M, Sadafi A, Matek C, Lienemann P, Pohlkamp C, et al. (2023) PLOS Digital Health 2(3): e0000187. https://doi.org/10.1371/journal.pdig.0000187).
 
 ## Contact
 For questions and issues regarding the code, feel free to contact [Matthias Hehr](https://www.linkedin.com/in/matthias-hehr/). Otherwise, please reach out to the corresponding authors.
 
 # 2. Getting started
 
 ## 2.1 Data
-The data will be published and available for download soon. To reproduce results, download the data and unzip it.
+To reproduce results, download the data and unzip it. The publication of our dataset is currently in progress, the data will be available at [The Cancer Imaging Archive (TCIA):](https://www.cancerimagingarchive.net/) https://doi.org/10.7937/6ppe-4020
+
 
 ## 2.2 Dependencies
 The pipeline and corresponding analysis requires a python environment with various packages. The [requirements file](requirements.txt) will be of help to build a functioning python environment.
@@ -72,24 +73,28 @@ This will create a new folder in your directory `TARGET_FOLDER` called `result_f
 To analyze the data generated and take a look at various visualizations, use the [analysis notebook](analysis/analysis_notebook.ipynb) and adjust the corresponding paths as mentioned in 2.3 (Code Setup).
 
 The notebook is designed to simplify analysis of the results generated with the pipeline, by automated plotting of most of the figures published in the paper. These figures are then exported directly into the [output folder](analysis/output).
+The last sections of our notebook require large amounts of RAM (we recommend 32GB), otherwise the pythonkernel might crash.
 
 # 3. Authors
 Major contributions were made by the following people:
 
-Matthias Hehr<sup>1,2,3</sup>, Ario Sadafi<sup>1,2</sup>, Christian Matek<sup>1,2,3</sup>, Christian Pohlkamp<sup>4</sup>, Torsten Haferlach<sup>4</sup>, Karsten Spiekermann<sup>3,5,6,+</sup> and Carsten Marr<sup>1,2,+</sup>
+Matthias Hehr<sup>1,2,3</sup>, Ario Sadafi<sup>1,2,4</sup>, Christian Matek<sup>1,2,3</sup>, Peter Lienemann<sup>1,3</sup>, Christian Pohlkamp<sup>5</sup>, Torsten Haferlach<sup>5</sup>, Karsten Spiekermann<sup>3,6,7</sup> and Carsten Marr<sup>1,2,*</sup>
 
 <sup>1</sup>Institute of AI for Health, Helmholtz Zentrum München – German Research Center for Environmental Health, Neuherberg, Germany
 <sup>2</sup>Institute of Computational Biology, Helmholtz Zentrum München – German Research Center for Environmental Health, Neuherberg, Germany
 <sup>3</sup>Laboratory of Leukemia Diagnostics, Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
-<sup>4</sup>Munich Leukemia Laboratory, Munich, Germany
-<sup>5</sup>German Cancer Consortium (DKTK), Heidelberg, Germany
-<sup>6</sup>German Cancer Research Center (DKFZ), Heidelberg, Germany
-<sup>+</sup>Corresponding authors
+<sup>4</sup>Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany
+<sup>5</sup>Munich Leukemia Laboratory, Munich, Germany
+<sup>6</sup>German Cancer Consortium (DKTK), Heidelberg, Germany
+<sup>7</sup>German Cancer Research Center (DKFZ), Heidelberg, Germany
+<sup>*</sup>Corresponding author: [email protected]
 
 
 
 # 4. Acknowledgements
 M.H. was supported by a José-Carreras-DGHO-Promotionsstipendium. C.M. has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 866411)
 
 # 5. License
-[See the license](LICENSE). If you use this code, please cite our original paper (further information about the citation will follow).
+[See the license](LICENSE). If you use this code, please cite our original paper:
+
+
+Hehr M, Sadafi A, Matek C, Lienemann P, Pohlkamp C, et al. (2023) Explainable AI identifies diagnostic cells of genetic AML subtypes. PLOS Digital Health 2(3): e0000187. https://doi.org/10.1371/journal.pdig.0000187
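Note: the README addition above recommends 32GB of RAM for the later notebook sections. A minimal, hypothetical pre-flight check one could run at the top of the notebook, assuming the psutil package is available (it is not part of the repository's stated requirements):

```python
# Hypothetical check before the memory-heavy notebook sections.
# Assumes psutil is installed; it is not necessarily listed in requirements.txt.
import psutil

REQUIRED_GB = 32  # recommendation taken from the README note above

available_gb = psutil.virtual_memory().total / 1024**3
if available_gb < REQUIRED_GB:
    print(f"Warning: only {available_gb:.1f} GB RAM detected; "
          f"the UMAP/occlusion sections may crash the kernel.")
```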

analysis/analysis_notebook.ipynb (+9 -97)

@@ -105,15 +105,6 @@
     "print(\"Images per patient (std): \", sc_df.index.value_counts().std())"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "patient_df"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -198,7 +189,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# entropy_plot.entropy_plot(patient_df)\n",
     "entropy_plot.entropy_vs_myb(patient_df)"
    ]
   },
@@ -219,12 +209,13 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Patients: algorithm performance\n",
     "### Define a patient\n",
-    "Put in a patient ID to look at the predictions"
+    "Enter any patient ID to look at the predictions"
    ]
   },
   {
@@ -299,11 +290,12 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Show swarmplot\n",
-    "Swarmplot is interactive and shows cells upon mouseover. Classes of cells can be excluded by clicking the corresponding label in the legend on the right. This cell automatically stores a vector graphic in the folder ```output/swarmplots```, and calculates the distribution of cells in each quartile (see dataframe below interactive figure)"
+    "The presented Swarmplot is interactive and shows cells upon mouseover. Classes of cells can be excluded by clicking the corresponding label in the legend on the right. This cell automatically stores a vector graphic in the folder ```output/swarmplots```, and calculates the distribution of cells in each quartile (see dataframe below interactive figure)"
    ]
   },
   {
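Note: the swarmplot cell above describes mouseover images and a click-to-hide legend. As an illustration only (this is not the repository's bokeh_wrapper code; the column names and image paths are invented), this is roughly how such behaviour is wired in Bokeh:

```python
# Illustrative Bokeh pattern for a mouseover-image scatter with a hideable legend.
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool

# dummy single-cell points; "img" would point to per-cell image files
source = ColumnDataSource(data=dict(
    x=[0.1, 0.4, 0.7], y=[1, 2, 3],
    img=["cell_001.png", "cell_002.png", "cell_003.png"],
    label=["NPM1", "PML_RARA", "control"],
))

p = figure(width=600, height=400, title="Swarmplot sketch")
p.scatter("x", "y", source=source, legend_field="label", size=8)

# show the single-cell image when hovering over a point
p.add_tools(HoverTool(tooltips='<img src="@img" height="80">'))
# clicking a legend entry hides that class, as described in the notebook cell
p.legend.click_policy = "hide"
show(p)
```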
@@ -384,20 +376,21 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# calculate occlusion and solitary cell predictions\n",
+    "# calculate occlusion and solitary cell predictions as predicted in Fig. 3b\n",
     "occlusion_values = sc_occlusion.calculate_change_on_occlusion(data_with_mappings_and_coordinates, result_folder_path, \n",
     "                    folders_cv_available, feature_prefix, lbl_conv_obj)\n",
     "\n",
     "bokeh_wrapper.init_sol_plot(occlusion_values)"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# UMAP embedding\n",
     "\n",
-    "The interactive UMAP figures require quite a lot of RAM and computing power. To calculate the occlusion values, the use of CUDA-capable GPUs is highly recommended and will greatly speed up the process.\n",
+    "The interactive UMAP figures require quite a lot of RAM and computing power. To calculate the occlusion values, the use of CUDA-capable GPUs is highly recommended and will greatly speed up the process. From here on out, we recommend 32GB of RAM, otherwise the kernel will most likely crash.\n",
     "\n",
     "1. Calculate or load the UMAP embedding (not necessary, if an old gzip file should be loaded)"
    ]
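Note: the occlusion cell above calls sc_occlusion.calculate_change_on_occlusion to measure how much each single cell drives the bag-level prediction. A minimal sketch of the general leave-one-cell-out idea, written against a hypothetical predict_bag callable rather than the repository's API:

```python
import numpy as np

def occlusion_scores(bag_features, predict_bag, target_class):
    """Hypothetical sketch of instance-level occlusion for one MIL bag.

    bag_features: (n_cells, n_features) array for one patient
    predict_bag:  assumed callable returning class probabilities for a bag
    target_class: index of the class whose probability change is tracked
    """
    baseline = predict_bag(bag_features)[target_class]
    scores = np.zeros(len(bag_features))
    for i in range(len(bag_features)):
        occluded = np.delete(bag_features, i, axis=0)  # drop one cell
        scores[i] = baseline - predict_bag(occluded)[target_class]
    return scores  # positive = removing the cell lowers the prediction
```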
@@ -411,10 +404,10 @@
    "outputs": [],
    "source": [
     "# sample cells randomly for embedding\n",
-    "fold_filter = 0\n",
+    "fold_filter = 2\n",
     "sc_umap_sample = sc_df.loc[sc_df['fold'] == fold_filter].sample(frac=1, random_state=1).copy()\n",
     "\n",
-    "sc_df_umap = umap_embedding.select_embedding(sc_umap_sample)"
+    "sc_df_umap = umap_embedding.generate_embedding(sc_umap_sample, save=False)"
    ]
   },
   {
@@ -587,15 +580,6 @@
     "image_excerpt.plot(tmp_frame, show_scalebar=False, show_coordinates=False, cols=cols, path_save=path_save, show_patient_class=False)"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "tmp_frame"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -662,78 +646,6 @@
     "                       bokeh_wrapper.export_umap(sc_prepared, data_column=\"solitary_softmax_{}\".format(class_lbl),grayscatter=True, \n",
     "                       dotsize=10, path_save=path_save)"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "patient_df"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import matplotlib.colors as mpt_colors\n",
-    "import matplotlib.pyplot as plt\n",
-    "from matplotlib import cm\n",
-    "import font_matching\n",
-    "\n",
-    "# filter by fold\n",
-    "patient_df_f0 = patient_df.loc[patient_df['fold'] == 0]\n",
-    "    \n",
-    "clusterplot_structure = {}\n",
-    "    \n",
-    "# for every entity, iterate in order:\n",
-    "for entity in ['AML-PML-RARA', 'AML-NPM1', 'AML-CBFB-MYH11', 'AML-RUNX1-RUNX1T1', 'SCD']:\n",
-    "    patient_df_f0_ent = patient_df_f0.loc[patient_df_f0['gt_label'] == entity]\n",
-    "    clusterplot_structure[entity] = patient_df_f0_ent.sort_values(by='mil_prediction_{}'.format(entity), ascending=False).index\n",
-    "    \n",
-    "# find maximum length for grid subplots\n",
-    "max_len = max([len(x) for x in clusterplot_structure.values()])\n",
-    "fig, ax = plt.subplots(max_len, 5, figsize=(10, max_len/2), constrained_layout=True)\n",
-    "\n",
-    "for a in ax.flatten():\n",
-    "    a.spines['top'].set_visible(False)\n",
-    "    a.spines['right'].set_visible(False)\n",
-    "    a.spines['bottom'].set_visible(False)\n",
-    "    a.spines['left'].set_visible(False)\n",
-    "    \n",
-    "    a.set_xticks([])\n",
-    "    a.set_yticks([])\n",
-    "\n",
-    "column_counter = 0\n",
-    "for entity in ['AML-PML-RARA', 'AML-NPM1', 'AML-CBFB-MYH11', 'AML-RUNX1-RUNX1T1', 'SCD']:\n",
-    "    row_counter = 0\n",
-    "    for patient in clusterplot_structure[entity]:\n",
-    "        bokeh_wrapper.sol_att_bar_plot(sc_prepared.loc[patient], ax[row_counter, column_counter])\n",
-    "        if(row_counter == 0):\n",
-    "            ax[row_counter, column_counter].set_title(font_matching.edit(entity))\n",
-    "        row_counter += 1\n",
-    "    \n",
-    "    column_counter += 1\n",
-    "print(\"Plotting...\") \n",
-    "plt.show()\n",
-    "print(\"Done!\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
 "metadata": {

analysis/functions/bokeh_wrapper.py (+11 -11)

@@ -1191,17 +1191,17 @@ def init_sol_plot(data):
 def sol_att_bar_plot(data, ax):
     '''plot a cumulative barh plot, optional stacked indicators for attention'''
 
-    pred_columns = ['mil_prediction_AML-RUNX1-RUNX1T1',
-                    'mil_prediction_AML-CBFB-MYH11',
-                    'mil_prediction_AML-PML-RARA',
-                    'mil_prediction_AML-NPM1',
-                    'mil_prediction_SCD']
-
-    CLASS_COLORS = {'solitary_softmax_AML-PML-RARA': ((1.0, 127 / 255, 14 / 255), 0),
-                    'solitary_softmax_AML-NPM1': ('red', 1),
-                    'solitary_softmax_AML-CBFB-MYH11': ('sienna', 2),
-                    'solitary_softmax_AML-RUNX1-RUNX1T1': ('dodgerblue', 3),
-                    'solitary_softmax_SCD': ('limegreen', 4)}
+    pred_columns = ['mil_prediction_RUNX1_RUNX1T1',
+                    'mil_prediction_CBFB_MYH11',
+                    'mil_prediction_PML_RARA',
+                    'mil_prediction_NPM1',
+                    'mil_prediction_control']
+
+    CLASS_COLORS = {'solitary_softmax_PML_RARA': ((1.0, 127 / 255, 14 / 255), 0),
+                    'solitary_softmax_NPM1': ('red', 1),
+                    'solitary_softmax_CBFB_MYH11': ('sienna', 2),
+                    'solitary_softmax_RUNX1_RUNX1T1': ('dodgerblue', 3),
+                    'solitary_softmax_control': ('limegreen', 4)}
 
     def col_transform(x): return CLASS_COLORS[x][0]
     def order_transform(x): return CLASS_COLORS[x][1]
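Note: in the renamed dictionaries above, each class maps to a (color, order) tuple that col_transform and order_transform unpack. A small sketch of how such a mapping can drive a stacked horizontal bar; the per-class fractions below are invented for illustration and are not repository data:

```python
import matplotlib.pyplot as plt

# (color, plotting order) per class, mirroring the renamed keys in the diff
CLASS_COLORS = {'solitary_softmax_PML_RARA': ((1.0, 127 / 255, 14 / 255), 0),
                'solitary_softmax_NPM1': ('red', 1),
                'solitary_softmax_CBFB_MYH11': ('sienna', 2),
                'solitary_softmax_RUNX1_RUNX1T1': ('dodgerblue', 3),
                'solitary_softmax_control': ('limegreen', 4)}

# invented per-class fractions for one patient, just to show the stacking
fractions = {'solitary_softmax_PML_RARA': 0.05,
             'solitary_softmax_NPM1': 0.60,
             'solitary_softmax_CBFB_MYH11': 0.10,
             'solitary_softmax_RUNX1_RUNX1T1': 0.05,
             'solitary_softmax_control': 0.20}

fig, ax = plt.subplots(figsize=(6, 1))
left = 0.0
for col in sorted(fractions, key=lambda c: CLASS_COLORS[c][1]):  # order_transform
    ax.barh(0, fractions[col], left=left, color=CLASS_COLORS[col][0])  # col_transform
    left += fractions[col]
ax.set_yticks([])
plt.show()
```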

analysis/functions/umap_embedding.py (+2 -2)

@@ -42,7 +42,7 @@ def select_embedding(sc_dataframe, fillup_unmatched=True):
             PATH_EMBEDDINGS, name_new + '.pkl'))
 
 
-def generate_embedding(sc_dataframe, path_target, save=True):
+def generate_embedding(sc_dataframe, path_target="", save=True):
     global scaler, reducer
 
     # create scalers and reducer
@@ -51,7 +51,7 @@ def generate_embedding(sc_dataframe, path_target, save=True):
     cell_data = sc_dataframe_embedding[FEATURES].values
     umap_scaler = StandardScaler().fit(cell_data)
     scaled_cell_data = umap_scaler.transform(cell_data)
-    umap_reducer = umap.UMAP(verbose=True).fit(scaled_cell_data)
+    umap_reducer = umap.UMAP(verbose=False).fit(scaled_cell_data)
     embedding = umap_reducer.transform(scaled_cell_data)
 
     sc_dataframe['x'] = embedding[..., 0]
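Note: the hunk above shows the core of generate_embedding, which now takes an optional path_target and runs UMAP quietly: standard-scale the single-cell features, then fit and apply a UMAP reducer. A self-contained sketch of that scale-then-embed pattern on dummy data, assuming scikit-learn and umap-learn are installed; the feature columns are made up:

```python
# Minimal scale-then-embed sketch mirroring generate_embedding's core steps.
import numpy as np
import pandas as pd
import umap
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
sc_dataframe = pd.DataFrame(rng.normal(size=(500, 16)),
                            columns=[f"ft_{i}" for i in range(16)])  # dummy features

cell_data = sc_dataframe.values
scaled_cell_data = StandardScaler().fit_transform(cell_data)   # scale features
reducer = umap.UMAP(verbose=False).fit(scaled_cell_data)       # as in the diff
embedding = reducer.transform(scaled_cell_data)

sc_dataframe['x'] = embedding[..., 0]                          # 2D coordinates
sc_dataframe['y'] = embedding[..., 1]
print(sc_dataframe[['x', 'y']].head())
```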
