There are multiple “disease labels” in the pbta-histologies.tsv
file, including (from most broad to most narrow) broad_histology
, cancer_group
, and harmonized_diagnosis
. For context, it is helpful to note that an individual cancer_group
will be nested under a single broad_histology
and that cancer_group
is a shorter form of harmonized_diagnosis
with the following edits:
cancer_group
It is often useful to use color to indicate disease label in a plot where multiple groups are visualized when we can not rely particularly heavily on labels (e.g., scatter plots). Unfortunately, there are too many potential labels for us to generate an effective color palette (e.g., of sufficiently distinct colors). In addition, some groupings will contain very few samples.
The purpose of this notebook is to create color palettes for the following:
broad_histology
values, where a broad_histology
contains at least one cancer_group
with n >= 10cancer_group
values with n >= 10You may find #1174 to be helpful context.
This notebook can be run via the command line from the top directory of the repository as follows:
Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd',
clean = TRUE)"
library(tidyverse)
── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0 ✔ purrr 0.3.2
✔ tibble 2.1.3 ✔ dplyr 0.8.3
✔ tidyr 0.8.3 ✔ stringr 1.4.0
✔ readr 1.3.1 ✔ forcats 0.4.0
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
We should perform some minimal checking to make sure the provided release is at least at version 23.
minimum_allowed_release <- 23
release_version <- as.numeric(stringr::str_match(params$release, "-v(\\d+)-")[[2]])
if (release_version < minimum_allowed_release) {
stop("This notebook should only be used with data release version 23 or above.")
}
# Path to input directory
input_dir <- file.path("..", "data", params$release)
output_dir <- "palettes"
Let’s read in the pbta-histologies.tsv
file.
histologies_df <-
readr::read_tsv(file.path(input_dir, "pbta-histologies.tsv"), guess_max = 10000)
Parsed with column specification:
cols(
.default = col_character(),
OS_days = col_double(),
age_last_update_days = col_double(),
normal_fraction = col_double(),
tumor_fraction = col_double(),
tumor_ploidy = col_double()
)
See spec(...) for full column specifications.
We will use cancer_group
with n >= 10 to guide what values to include in both our cancer_group
and broad_histology
palettes.
included_labels_df <- histologies_df %>%
# Exclude normal samples
filter(sample_type == "Tumor") %>%
# Filter to unique sample--"disease label" pairs
select(sample_id,
broad_histology,
cancer_group) %>%
distinct() %>%
# Count samples (e.g., sample_id)
group_by(broad_histology, cancer_group) %>%
tally() %>%
# Add a column called included which is a logical that can be used as
# a sample size filter & also to drop the NA values
filter(n >= 10,
!is.na(cancer_group))
included_labels_df
In a later step, we will recode certain cancer groups, e.g., “High-grade glioma astrocytoma” will become “Other high-grade gliomas.” There are other cancer groups that would fit in that category despite not meeting the sample size threshold above. For now, this applies to LGG, HGG, and embryonal tumors.
other_cg_df <- histologies_df %>%
filter(sample_type == "Tumor", # Exclude normal samples
!is.na(cancer_group),
broad_histology %in% c("Diffuse astrocytic and oligodendroglial tumor",
"Low-grade astrocytic tumor",
"Embryonal tumor")) %>%
select(broad_histology,
cancer_group) %>%
distinct() %>%
# Drop everything that meets the sample size requirement above
anti_join(included_labels_df)
Joining, by = c("broad_histology", "cancer_group")
other_cg_df
We’ll create some vectors we’ll use in later steps to group different cancer groups under the same “Other” cancer group display category.
# To become "Other low-grade gliomas"
other_lgg <- c(
"Low-grade glioma astrocytoma",
other_cg_df %>%
filter(broad_histology == "Low-grade astrocytic tumor") %>%
pull(cancer_group)
)
other_lgg
[1] "Low-grade glioma astrocytoma" "Gliomatosis cerebri"
[3] "Diffuse fibrillary astrocytoma" "Oligodendroglioma"
Low-grade glioma astrocytoma
Gliomatosis cerebri
Diffuse fibrillary astrocytoma
Oligodendroglioma
# To become "Other high-grade gliomas"
other_hgg <- c(
"High-grade glioma astrocytoma",
other_cg_df %>%
filter(broad_histology == "Diffuse astrocytic and oligodendroglial tumor") %>%
pull(cancer_group)
)
other_hgg
[1] "High-grade glioma astrocytoma" "Oligodendroglioma"
High-grade glioma astrocytoma
Oligodendroglioma
# To become "Other embryonal tumors"
other_embryonal <- c(
"CNS Embryonal tumor", # Has n >= 10, but we're trying to limit # of colors
other_cg_df %>%
filter(broad_histology == "Embryonal tumor") %>%
pull(cancer_group)
)
other_embryonal
[1] "CNS Embryonal tumor"
[2] "Embryonal tumor with multilayer rosettes"
[3] "Neuroblastoma"
[4] "Ganglioneuroblastoma"
[5] "CNS neuroblastoma"
CNS Embryonal tumor
Embryonal tumor with multilayer rosettes
Neuroblastoma
Ganglioneuroblastoma
CNS neuroblastoma
Outside of this notebook, we’ve done quite a bit of work to identify suitable palettes using http://phrogz.net/css/distinct-colors.html as a reference/starting point. Check out the discussion on #1174.
broad_histology
We will start by creating a palette for broad histology display values. Broad histology values will be recoded specifically for display in figures in later step.
broad_histology_df <- data.frame(
broad_histology_display = c("Benign tumor",
"High-grade glioma",
"Embryonal tumor",
# Replace instances of "Ependymal tumor" per reviewer comment
# See: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1652
"Ependymoma",
"Germ cell tumor",
"Low-grade glioma",
"Meningioma",
"Mesenchymal non-meningothelial tumor",
"Neuronal and mixed neuronal-glial tumor",
"Tumor of cranial and paraspinal nerves",
"Tumor of sellar region"),
broad_histology_hex = c("#590024",
"#ff80e5",
"#220040",
"#2200ff",
"#0074d9",
"#8f8fbf",
"#2db398",
"#7fbf00",
"#685815",
"#ffaa00",
"#b2502d"),
stringsAsFactors = FALSE
)
# value for "other" histologies
broad_histology_other_hex <- "#808080"
Now to create a legend with legend()
(h/t this StackOverflow answer)
plot(NULL, xaxt = "n", yaxt = "n", bty = "n", ylab = "", xlab = "",
xlim = 0:1, ylim = 0:1)
legend("topleft",
legend = c(broad_histology_df$broad_histology_display, "Other"),
col = c(broad_histology_df$broad_histology_hex,
broad_histology_other_hex),
pch = 15, pt.cex = 2, cex = 1, bty = "n")
mtext("Broad Histology", at = 0.135, cex = 1.5)
cancer_group
There are 20 cancer_group
values that we need to account for. These are best used in conjunction with labels in figures, but are intended to allow readers to “track” labels across figures.
Where there’s a 1:1 mapping between broad_histology
and cancer_group
, the hex codes will be the same.
cancer_group_df <- data.frame(
cancer_group_display = c("Choroid plexus papilloma",
"Diffuse intrinsic pontine glioma",
"Diffuse midline glioma",
"Other high-grade glioma",
"Atypical Teratoid Rhabdoid Tumor",
"Medulloblastoma",
"Other embryonal tumor",
"Ependymoma",
"Teratoma",
"Ganglioglioma",
"Pilocytic astrocytoma",
"Pleomorphic xanthoastrocytoma",
"Subependymal Giant Cell Astrocytoma",
"Other low-grade glioma",
"Meningioma",
"Ewing sarcoma",
"Dysembryoplastic neuroepithelial tumor",
"Neurofibroma Plexiform",
"Schwannoma",
"Craniopharyngioma"),
cancer_group_abbreviation = c("CPP",
"DIPG",
"DMG",
"Other HGG",
"ATRT",
"MB",
"Other ET",
"EPN",
"Teratoma",
"GNG",
"PA",
"PXA",
"SEGA",
"Other LGG",
"Meningioma",
"EWS",
"DNET",
"PNF",
"Schwannoma",
"CRANIO"),
cancer_group_hex = c("#4d2635",
"#bf0099",
"#ff40d9",
"#ffccf5",
"#4d0d85",
"#a340ff",
"#b08ccf",
"#2200ff",
"#058aff",
"#c4c4ff",
"#8c8cff",
"#4e4ede",
"#2828a1",
"#000047",
"#2db398",
"#9fbf60",
"#614e01",
"#e6ac39",
"#ab7200",
"#b33000"),
stringsAsFactors = FALSE
)
# Value for "other" groups
cancer_group_other_hex <- "#b5b5b5"
And again, we’ll create a legend with legend()
# Combine cancer groups with their abbreviations, WHEN an abbreviation exists for use in legend
legend_names <- cancer_group_df %>%
mutate(cancer_group_legend = if_else(cancer_group_display == cancer_group_abbreviation,
cancer_group_display,
as.character(glue::glue("{cancer_group_display} ({cancer_group_abbreviation})")))
) %>%
pull(cancer_group_legend)
plot(NULL, xaxt = "n", yaxt = "n", bty = "n", ylab = "", xlab = "",
xlim = 0:1, ylim = 0:1)
legend("topleft",
legend = c(legend_names, "Other"),
col = c(cancer_group_df$cancer_group_hex, cancer_group_other_hex),
pch = 15, pt.cex = 1.5, cex = 0.75, bty = "n")
mtext("Cancer Group", at = 0.0625, cex = 1)
We can create a data frame that contains both palettes with a series of left joins, where we will then fill the NA values with a single (gray) hex code per column (#808080 for broad_histology
, #b5b5b5 for cancer_group
.)
When multiple values are using the same color, it can be helpful to have a separate value for the legend, e.g., for all #808080
broad histologies, we may want to display Other
. We’ll add a couple columns for legend-making convenience. We also are recoding some of the disease labels for display per #1403 and discussion on #1401.
palette_df <- histologies_df %>%
# Exclude normal samples
filter(sample_type == "Tumor") %>%
# Filter to unique broad histology--cancer group pairs
select(broad_histology,
cancer_group) %>%
distinct() %>%
# For display specifically, recode broad histology label for LGG, HGG, and EPN
mutate(broad_histology_display = case_when(
broad_histology == "Low-grade astrocytic tumor" ~ "Low-grade glioma",
broad_histology == "Diffuse astrocytic and oligodendroglial tumor" ~ "High-grade glioma",
broad_histology == "Tumors of sellar region" ~ "Tumor of sellar region",
broad_histology == "Ependymal tumor" ~ "Ependymoma", # https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1652
TRUE ~ broad_histology
),
# For display, group some cancer groups together into a specific "Other"
# category
cancer_group_display = case_when(
# We must take into account broad histology for (at least) individual
# Oligodendroglioma samples to get color coded correctly
(broad_histology == "Diffuse astrocytic and oligodendroglial tumor") & (cancer_group %in% other_hgg) ~ "Other high-grade glioma",
(broad_histology == "Low-grade astrocytic tumor") & (cancer_group %in% other_lgg) ~ "Other low-grade glioma",
cancer_group %in% other_embryonal ~ "Other embryonal tumor",
TRUE ~ cancer_group
)
) %>%
# Add broad histology palette
left_join(broad_histology_df, by = "broad_histology_display") %>%
# Add cancer group palette
left_join(cancer_group_df, by = "cancer_group_display") %>%
# Fill all other values with gray colors
replace_na(list(broad_histology_hex = broad_histology_other_hex,
cancer_group_hex = cancer_group_other_hex)) %>%
# The exception being - if cancer_group == NA, so should cancer_group_hex!
mutate(cancer_group_hex = if_else(is.na(cancer_group),
NA_character_,
cancer_group_hex)) %>%
# When a cancer group or broad histology display group does not get its
# own hex code, we will group those samples into an "Other" category in
# figures
mutate(cancer_group_display = if_else(cancer_group_hex == cancer_group_other_hex,
"Other",
cancer_group_display),
broad_histology_display = if_else(broad_histology_hex == broad_histology_other_hex,
"Other",
broad_histology_display)) %>%
# Sort by broad_histology for easy browsing
arrange(broad_histology)
And now let’s take a look!
palette_df
broad_histology_order
Previously, we had a concept known as display_order
where we ordered categories based on their number of samples (from large to small). Now that we’ve dropped display_group
, let’s apply this same concept to broad_histology
.
broad_histology_order_df <- histologies_df %>%
# Exclude normal samples
filter(sample_type == "Tumor",
# Only count histologies that we'll have a hex code for
broad_histology %in% included_labels_df$broad_histology) %>%
# Need to add broad_histology_display here because that's what we need to
# order/count
inner_join(select(palette_df,
broad_histology,
broad_histology_display)) %>%
# Filter to unique sample--broad_histology_display pairs
select(sample_id,
broad_histology_display) %>%
distinct() %>%
# Count samples within a broad histology
count(broad_histology_display) %>%
# Add Other placeholder
bind_rows(data.frame(broad_histology_display = "Other",
n = 0,
stringsAsFactors = FALSE)) %>%
# Reorder based on sample size except Benign tumor and Other should come last
# And then add numeric column with the order
mutate(broad_histology_display = forcats::fct_reorder(broad_histology_display,
n,
.desc = TRUE) %>%
forcats::fct_relevel("Benign tumor",
"Other",
after = Inf),
broad_histology_order = as.numeric(broad_histology_display)) %>%
# No longer require the sample size
select(-n) %>%
# Sort for easy browsing
arrange(broad_histology_display)
Joining, by = "broad_histology"
broad_histology_order_df
And now we’re ready to add this to the palette data frame.
palette_df <- palette_df %>%
left_join(broad_histology_order_df,
by = "broad_histology_display")
Warning: Column `broad_histology_display` joining character vector and
factor, coercing into character vector
oncoprint_group
and oncoprint_main
For most plots that make use of the cancer_group
palette, such as a box or violin plot, we will rely heavily on labels and therefore using the gray hex code for multiple groups will not be a problem.
We will have four panels of individual oncoprints, where many broad_histology
values will get grouped together into the Other CNS
panel which you can see here. We can move the information about which cancer groups will be included in each panel into our palette data frame.
# Taken from the current plot oncoprint script as of the writing of this
# See permalink above
other_cns_broad_histologies <- c(
"Ependymal tumor",
"Tumors of sellar region",
"Neuronal and mixed neuronal-glial tumor",
"Tumor of cranial and paraspinal nerves",
"Meningioma",
"Mesenchymal non-meningothelial tumor",
"Germ cell tumor",
"Choroid plexus tumor",
"Histiocytic tumor",
"Tumor of pineal region",
"Metastatic tumors",
"Other astrocytic tumor",
"Lymphoma",
"Melanocytic tumor",
"Other tumor"
)
palette_df <- palette_df %>%
mutate(oncoprint_group = case_when(
broad_histology %in% other_cns_broad_histologies ~ "Other CNS",
broad_histology %in% c(
"Low-grade astrocytic tumor",
"Embryonal tumor",
"Diffuse astrocytic and oligodendroglial tumor"
) ~ broad_histology,
TRUE ~ NA_character_
))
Now, we need to establish oncoprint_main
, which is used to delineate whether “Other” cancer groups belong in the main text (TRUE
value) or in the supplementary information (FALSE
) oncoprint plots. Groups that belong in the supplement (oncoprint_main
will be FALSE
) should be “Other CNS” oncoprint groups that have an “Other” cancer_group_display
value.
palette_df <- palette_df %>%
mutate(
oncoprint_main = case_when(
oncoprint_group == "Other CNS" & cancer_group_display == "Other" ~ FALSE,
oncoprint_group == "Other CNS" & cancer_group_display != "Other" ~ TRUE,
TRUE ~ NA
)
)
# For browsing
palette_df
readr::write_tsv(palette_df,
file.path(output_dir,
"broad_histology_cancer_group_palette.tsv"))
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RColorBrewer_1.1-2 forcats_0.4.0 stringr_1.4.0
[4] dplyr_0.8.3 purrr_0.3.2 readr_1.3.1
[7] tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.0
[10] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.4.2 compiler_3.6.0
[5] base64enc_0.1-3 tools_3.6.0 digest_0.6.20 lubridate_1.7.4
[9] jsonlite_1.6 evaluate_0.14 nlme_3.1-140 gtable_0.3.0
[13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0
[17] rstudioapi_0.10 yaml_2.2.0 haven_2.1.1 xfun_0.8
[21] withr_2.1.2 xml2_1.2.0 httr_1.4.0 knitr_1.23
[25] generics_0.0.2 hms_0.4.2 grid_3.6.0 tidyselect_0.2.5
[29] glue_1.3.1 R6_2.4.0 readxl_1.3.1 rmarkdown_1.13
[33] modelr_0.1.4 magrittr_1.5 ellipsis_0.2.0.1 backports_1.1.4
[37] scales_1.0.0 htmltools_0.3.6 rvest_0.3.4 assertthat_0.2.1
[41] colorspace_1.4-1 stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
[45] broom_0.5.2 crayon_1.3.4