Myeloma gene-expression dataset (CTA benchmark)

A data frame with 256 observations and 19 variables, formatted for use with cta_fit and oda_fit. Derived from the publicly available myeloma gene-expression dataset (GEO accession GSE4581), as distributed in the survminer package.

Format

A data frame with 256 rows and 19 columns:

V1: Survival event indicator (0 = censored, 1 = event). Used as the class variable y in CTA/ODA.
V2: Case weight (observation time in months). Pass as w to cta_fit; exclude rows with V2 == 0 before fitting.
V3: CCND1 gene expression.
V4: CRIM1 gene expression.
V5: DEPDC1 gene expression.
V6: IRF4 gene expression.
V7: TP53 gene expression.
V8: WHSC1 gene expression.
V9: Molecular group: Cyclin D-1 (binary).
V10: Molecular group: Cyclin D-2 (binary).
V11: Molecular group: Hyperdiploid (binary).
V12: Molecular group: Low bone disease (binary).
V13: Molecular group: MAF (binary).
V14: Molecular group: MMSET (binary).
V15: Molecular group: Proliferation (binary).
V16: Chr1q21 status: 2 copies (binary).
V17: Chr1q21 status: 3 copies (binary).
V18: Chr1q21 status: 4+ copies (binary).
V19: Chr1q21 status: NA-coded (binary).

Missing values are coded as -9 (miss_codes = -9).
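
The V2 and missing-code conventions above can be checked with base R before fitting. The sketch below is illustrative only: it uses a small made-up data frame (toy, with invented values) in the same V1/V2/V3 layout instead of the real data, and it is not taken from the package.

```r
# Toy stand-in for the real data: same V1/V2/V3 layout, invented values.
toy <- data.frame(
  V1 = c(0, 1, 1, 0, 1),             # event indicator (class variable)
  V2 = c(12.5, 0, 30.1, 45.0, 8.2),  # case weight; one row has weight 0
  V3 = c(1.2, -9, 3.4, -9, 0.7)      # predictor with two -9 missing codes
)

# Drop rows whose case weight is zero, as the V2 description advises.
kept <- toy[toy$V2 != 0, , drop = FALSE]

# Count -9 codes per predictor column (V1 and V2 are skipped).
n_missing <- colSums(kept[, -(1:2), drop = FALSE] == -9)

nrow(kept)
n_missing
```

The -9 codes are left in place here instead of being converted to NA, since the documentation says to declare them through miss_codes.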

Source

Derived from the myeloma dataset in the survminer package. Original data: NCBI GEO accession GSE4581. The data contain no protected health information (PHI) and no institutional data. See tests/testthat/fixtures/myeloma/README.md in the source tree.

Details

This dataset is used throughout the oda documentation and vignettes to illustrate weighted CTA, MINDENOM constraints, LOO STABLE validation, and missing-code handling. Reference CTA.exe golden outputs for MINDENOM = 1, 30, and 56 are used as regression anchors.

Use miss_codes = -9 and w = myeloma$V2 when calling cta_fit. With mindenom = 1, the enumerated CTA tree is rooted at V14 with a V15 child (OVERALL ESS = 26.32%, WEIGHTED ESS = 27.69%). With mindenom = 30, the selected tree is a V17 stump (WEIGHTED ESS = 16.51%). With mindenom = 56, no admissible tree exists.
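
The call described above might be assembled roughly as follows. This is a hedged sketch, not the package's documented usage: y, w, miss_codes, and mindenom are the names mentioned on this page, but the predictor argument name (x) is an assumption, and the real myeloma data is replaced by a tiny invented stand-in so the snippet runs without the package installed.

```r
# Invented stand-in for myeloma: V1 = class, V2 = weight, V3 = one predictor.
myeloma <- data.frame(V1 = c(0, 1, 1), V2 = c(10, 0, 25), V3 = c(1.1, -9, 2.3))

# Exclude zero-weight rows before fitting.
myeloma <- myeloma[myeloma$V2 != 0, , drop = FALSE]

fit_args <- list(
  x = myeloma[, -(1:2), drop = FALSE],  # ASSUMPTION: argument name `x` is hypothetical
  y = myeloma$V1,                       # class variable
  w = myeloma$V2,                       # case weights
  miss_codes = -9,                      # declare -9 as the missing code
  mindenom = 1                          # try 30 or 56 for the other anchors
)

# Only attempt the call when a cta_fit function is actually available;
# the real signature may differ from the argument list assumed here.
if (exists("cta_fit", mode = "function")) {
  fit <- do.call(cta_fit, fit_args)
}
```

Keeping the arguments in a list makes it easy to rerun the same call with mindenom set to 30 or 56 and compare against the results quoted above.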