OpenPedCan-analysis

Data Formats in Data Download

The release notes for each release are provided in the release-notes.md file that accompanies the data files. A table with brief descriptions for each data file is provided in the data-files-description.md file included in the download.

Processed Data Files

Processed data files are all files derived from samples (e.g., tumors, cell lines) that are processed upstream of this repository and are not the product of any analysis code in the AlexsLemonade/OpenPBTA-analysis or PediatricOpenTargets/OpenPedCan-analysis repository.

Consensus Somatic Variant Data

Somatic calls are retained if they are supported by at least two callers, are marked as HotSpotAllele because they overlap SNVs/INDELs considered Cancer Hotspots, or are TERT promoter SNVs. Please find additional information here
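The retention rule above can be sketched as a simple predicate. This is a minimal illustration, not the pipeline's code: the record fields `n_callers`, `hotspot_allele`, and `tert_promoter` are hypothetical names chosen for the example and are not column names from the released files.

```python
from dataclasses import dataclass


@dataclass
class SomaticCall:
    # All field names are hypothetical, chosen for illustration only.
    variant_id: str
    n_callers: int          # number of callers supporting the call
    hotspot_allele: bool    # overlaps an SNV/indel considered a Cancer Hotspot
    tert_promoter: bool     # is a TERT promoter SNV


def is_retained(call: SomaticCall) -> bool:
    """Return True if the call meets any one of the three retention criteria."""
    return call.n_callers >= 2 or call.hotspot_allele or call.tert_promoter


calls = [
    SomaticCall("var1", n_callers=3, hotspot_allele=False, tert_promoter=False),
    SomaticCall("var2", n_callers=1, hotspot_allele=True, tert_promoter=False),
    SomaticCall("var3", n_callers=1, hotspot_allele=False, tert_promoter=True),
    SomaticCall("var4", n_callers=1, hotspot_allele=False, tert_promoter=False),
]
retained = [c.variant_id for c in calls if is_retained(c)]
print(retained)
```

Only `var4` is dropped here, because it has a single supporting caller and neither of the two rescue flags.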

Somatic Copy Number Variant (CNV) Data

Somatic Copy Number Variant (CNV) data are provided in a modified SEG format for each of the applied software packages and denoted with the cnv prefix. Somatic copy number data are generated only for whole genome sequencing (WGS) samples.

A Note on Ploidy

The copy number in the CNVkit SEG file is annotated with respect to ploidy 2. However, the status in the ControlFreeC TSV file is annotated with respect to the ploidy inferred by the algorithm, which is recorded in the histologies.tsv file.
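To illustrate why the reference ploidy matters, the sketch below labels the same absolute copy number against two different baselines. The function name and the gain/loss/neutral labels are assumptions made for this example; they are not taken from CNVkit or ControlFreeC output.

```python
def status_relative_to(copy_number: int, baseline_ploidy: int) -> str:
    """Label a segment as gain, loss, or neutral relative to a baseline ploidy."""
    if copy_number > baseline_ploidy:
        return "gain"
    if copy_number < baseline_ploidy:
        return "loss"
    return "neutral"


copy_number = 3
# Against a fixed ploidy of 2, three copies is a gain.
fixed = status_relative_to(copy_number, baseline_ploidy=2)
# Against an inferred tumor ploidy of 4 (hypothetical value), it is a loss.
inferred = status_relative_to(copy_number, baseline_ploidy=4)
print(fixed, inferred)
```

When comparing calls across the two files, it may therefore help to look up the inferred ploidy for the sample before interpreting a status.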

Gene Expression Data

Gene expression estimates from the applied software packages are provided as a feature (e.g., gene or transcript) by sample matrix. Gene expression estimates are available in multiple forms in the following files:

See the data description file for more information about the individual gene expression files.

If your analysis requires de-duplicated gene symbols as row names, please use the collapsed matrices provided as part of the data download (see below).
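As described under Collapsed Expression Matrices, those matrices keep, for each gene symbol, the Ensembl gene identifier with the highest mean FPKM. A minimal pure-Python sketch of that rule follows; the input layout (a mapping from an (Ensembl ID, gene symbol) pair to per-sample FPKM values) and the toy values are assumptions for illustration, and the actual module may handle ties and file parsing differently.

```python
from statistics import mean

# Toy expression values keyed by (Ensembl gene ID, gene symbol); IDs and values are made up.
fpkm = {
    ("ENSG_A", "GENE1"): [1.0, 2.0, 3.0],
    ("ENSG_B", "GENE1"): [5.0, 6.0, 7.0],
    ("ENSG_C", "GENE2"): [0.5, 0.5, 0.5],
}


def collapse_by_max_mean(matrix):
    """Keep one row per gene symbol: the row with the highest mean FPKM."""
    best = {}  # symbol -> (mean FPKM, per-sample values)
    for (_ensembl_id, symbol), values in matrix.items():
        row_mean = mean(values)
        if symbol not in best or row_mean > best[symbol][0]:
            best[symbol] = (row_mean, values)
    return {symbol: values for symbol, (_, values) in best.items()}


collapsed = collapse_by_max_mean(fpkm)
print(collapsed)
```

In this toy input, `GENE1` appears twice and the row with the higher mean is kept, so the result has unique gene symbols as keys.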

RNA Splice Events Data

The rMATS RNA splice events file generated from the established pipeline is provided as:

Derived Fusion Files

The filtered and prioritized fusion and downstream files are a product of the analyses/fusion_filtering analysis module.

Binary matrices for the presence of tumor-specific fusions across all RNA biospecimens are the product of fusion-summary.

Structural Variant Data

Structural variant data produced by the Manta package are provided as:

Proteomic Data

Whole cell proteomic and phosphorylation proteomic data from project HOPE and CPTAC are provided as:

| File name | Data type | Data source | Data description |
|---|---|---|---|
| cptac-protein-imputed-phospho-expression-log2-ratio.tsv.gz | Processed data | CPTAC pediatric brain tumor phospho-proteomics expression | Imputed phospho-protein expression, log2 abundance |
| cptac-protein-imputed-prot-expression-abundance.tsv.gz | Processed data | CPTAC pediatric brain tumor protein expression | Imputed whole cell protein expression, total abundance |
| cptac-protein-imputed-prot-expression-log2-ratio.tsv.gz | Processed data | CPTAC pediatric brain tumor protein expression | Imputed whole cell protein expression, log2 abundance |
| gbm-protein-imputed-phospho-expression-abundance.tsv.gz | Processed data | CPTAC adult GBM brain tumor phospho-proteomics expression | Imputed phospho-protein expression, total abundance |
| gbm-protein-imputed-prot-expression-abundance.tsv.gz | Processed data | CPTAC adult GBM brain tumor protein expression | Imputed whole cell expression, total abundance |
| hope-protein-imputed-phospho-expression-abundance.tsv.gz | Processed data | Adult and Young Adolescent (AYA) brain tumor phospho-proteomics expression (Project HOPE) | Imputed phospho-protein expression, total abundance |
| hope-protein-imputed-prot-expression-abundance.tsv.gz | Processed data | Adult and Young Adolescent (AYA) brain tumor protein expression (Project HOPE) | Imputed whole cell protein expression, total abundance |

Harmonized Clinical Data

Harmonized clinical data are released as tab-separated values in the following files:

Independent Sample Lists

Independent sample lists are released as tab-separated values in the following files. In the file names, wgswxspanel indicates that the list includes all experimental strategies for DNA sequencing, rnaseqpanel indicates that it includes all experimental strategies for RNA sequencing, and eachcohort indicates that the selection is cohort-based. Additionally, .prefer.wxs (or .prefer.wgs) indicates that WXS (or WGS) samples were preferentially selected when both WGS and WXS are available for a particular participant.
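The naming convention can be decoded programmatically. The sketch below splits a file name on `.` and reports which of the tokens described above are present; the function name and the example file name are hypothetical and only illustrate the convention.

```python
def describe_sample_list(filename: str) -> dict:
    """Decode the naming tokens of an independent sample list file name."""
    tokens = filename.split(".")
    info = {
        "dna_all_strategies": "wgswxspanel" in tokens,
        "rna_all_strategies": "rnaseqpanel" in tokens,
        "cohort_based": "eachcohort" in tokens,
        "preferred": None,
    }
    # ".prefer.wxs" or ".prefer.wgs": the token after "prefer" names the preferred strategy.
    if "prefer" in tokens:
        idx = tokens.index("prefer")
        if idx + 1 < len(tokens) and tokens[idx + 1] in ("wxs", "wgs"):
            info["preferred"] = tokens[idx + 1].upper()
    return info


# Hypothetical file name, used only to show the convention.
example = describe_sample_list("sample-list.wgswxspanel.eachcohort.prefer.wxs.tsv")
print(example)
```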

Analysis Files

Analysis files are created by a script in analyses/*. They can be viewed as derivatives of Processed data files.

Collapsed Expression Matrices

Collapsed expression matrices are products of the analyses/collapse-rnaseq analysis module. In cases where more than one Ensembl gene identifier maps to the same gene symbol, the instance of the gene symbol with the maximum mean FPKM in the RSEM FPKM file is retained to produce the following files:

Additionally, the available TCGA and GTEx gene expression files in the same format are included:

Derived Copy Number Files

Consensus Copy Number File

Copy number consensus calls from the copy number and structural variant callers are a product of the analyses/copy_number_consensus_call analysis module.

Focal Copy Number Files

Focal copy number files map the consensus calls (genomic segments) in WGS samples to genes for downstream analysis and are a product of the analyses/focal-cn-file-preparation analysis module. Note: these files contain biospecimens and genes with copy number changes.
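Mapping segments to genes amounts to an interval-overlap lookup. The sketch below shows the idea on toy data; the coordinates, gene names, and tuple layout are made up for illustration, and the actual module may use different overlap criteria and annotations.

```python
# Toy consensus segments: (chromosome, start, end, copy number). Values are made up.
segments = [
    ("chr1", 100, 500, 3),
    ("chr2", 1000, 2000, 1),
]
# Toy gene annotation: (chromosome, start, end, gene symbol). Values are made up.
genes = [
    ("chr1", 450, 600, "GENE_A"),
    ("chr1", 700, 900, "GENE_B"),
    ("chr2", 1200, 1300, "GENE_C"),
]


def map_segments_to_genes(segments, genes):
    """Return (gene symbol, copy number) for every gene overlapping a segment."""
    hits = []
    for seg_chrom, seg_start, seg_end, copy_number in segments:
        for gene_chrom, gene_start, gene_end, symbol in genes:
            same_chrom = seg_chrom == gene_chrom
            # Closed intervals overlap when each starts before the other ends.
            overlaps = seg_start <= gene_end and gene_start <= seg_end
            if same_chrom and overlaps:
                hits.append((symbol, copy_number))
    return hits


gene_level = map_segments_to_genes(segments, genes)
print(gene_level)
```

`GENE_B` does not overlap any segment, so it is absent from the result, in line with the note that only genes with copy number changes appear in these files.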

Focal copy number files for WXS samples use only the results from CNVkit, so no consensus calling is required. They are also a product of the analyses/focal-cn-file-preparation analysis module. Note: these files contain biospecimens and genes with copy number changes.

Additionally, the autosomes files and the x_and_y files from WGS and WXS are combined to generate the following two files:

These two files are then merged to generate: