Classification Tree Analysis (CTA) grows a binary classification tree in which every internal node is an ODA model. Each split divides observations into two branches based on a single attribute cutpoint, and the process recurs until no further splits pass the MINDENOM, LOO STABLE, and significance criteria. This article introduces the key concepts and parameters.
CTA as sequential ODA
CTA answers the question: after the best single ODA rule is applied and observations are divided into two branches, is there additional discriminating signal in either branch?
Each node in the tree is an independent ODA fit on the observations routed to that node. The tree grows top-down, one split at a time. Growth stops at a node when:
- The minimum endpoint size mindenom is not met by any candidate split.
- No candidate split passes the LOO STABLE gate (LOO ESS ~ training ESS).
- No candidate split reaches significance (MC p < alpha_split).
After growing, a backward pruning step removes nodes whose sub-tree does not improve the full-tree ESS relative to the Sidak-adjusted significance threshold.
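The significance criterion in the list above can be illustrated with a small permutation test in base R. This is a minimal sketch, not the package's implementation: the helpers ess2(), best_cut_ess(), and mc_p_value() are hypothetical names, ESS is assumed to be the mean per-class accuracy rescaled so that chance is 0% and perfect classification is 100%, and the add-one p-value estimate is an assumption rather than something documented for cta_fit().

```r
# Illustrative sketch only; none of these helpers are part of the oda package.

# Assumed two-class ESS: mean per-class accuracy, rescaled so chance = 0, perfect = 100.
ess2 <- function(actual, predicted) {
  sens <- vapply(sort(unique(actual)),
                 function(k) mean(predicted[actual == k] == k),
                 numeric(1))
  100 * (mean(sens) - 0.5) / 0.5
}

# Best single-cutpoint rule on one attribute, trying both class directions.
best_cut_ess <- function(x, y) {
  xs <- sort(unique(x))
  cutpoints <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between adjacent values
  classes <- sort(unique(y))
  best <- -Inf
  for (cutpoint in cutpoints) {
    for (lo_class in classes) {
      hi_class <- setdiff(classes, lo_class)
      pred <- ifelse(x <= cutpoint, lo_class, hi_class)
      best <- max(best, ess2(y, pred))
    }
  }
  best
}

# Permutation p-value: how often does a shuffled class variable do at least as well?
mc_p_value <- function(x, y, iter = 200L, seed = 42L) {
  set.seed(seed)
  observed <- best_cut_ess(x, y)
  null <- replicate(iter, best_cut_ess(x, sample(y)))
  (sum(null >= observed) + 1) / (iter + 1)  # add-one estimate (assumption)
}

set.seed(1L)
x <- c(rnorm(30, mean = 2), rnorm(30, mean = 5))
y <- c(rep(1L, 30), rep(2L, 30))
p <- mc_p_value(x, y)
p < 0.05  # would this split pass alpha_split = 0.05?
```

Because the search over cutpoints is repeated inside every permutation, the null distribution reflects the same optimisation that produced the observed ESS.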
cta_fit() arguments
| Argument | Default | Purpose |
|---|---|---|
| X | - | Data frame of attribute columns |
| y | - | Integer class variable |
| mindenom | 10L | Minimum observations at any terminal endpoint |
| mc_iter | 5000L | Monte Carlo permutations for significance screening |
| mc_seed | 42L | Seed for MC reproducibility |
| alpha_split | 0.05 | Significance threshold for accepting a split |
| loo | "stable" | "stable" = LOO STABLE gate; "off" = skip LOO |
| priors_on | TRUE | Balance class frequencies in objective |
| attr_names | NULL | Optional names for attribute columns |
MINDENOM: controlling endpoint size
mindenom sets the minimum number of observations at any
terminal leaf. It is the primary complexity control in CTA: a larger
mindenom forces broader, simpler trees; a smaller value
allows deeper, more granular trees.
library(oda)
set.seed(1L)
n <- 60L
X <- data.frame(
  x1 = c(rnorm(30, mean = 2), rnorm(30, mean = 5)),
  x2 = c(rnorm(30, mean = 1), rnorm(30, mean = 3))
)
y <- c(rep(1L, 30), rep(2L, 30))
# mindenom = 15: at least 15 obs per terminal endpoint
tree <- cta_fit(X, y,
  mindenom = 15L,
  mc_iter = 300L,
  mc_seed = 42L,
  loo = "off",
  priors_on = TRUE,
  attr_names = c("x1", "x2")
)
print(tree)
#>
#> CTA Tree alpha_split=0.050 mindenom=15 prune=1.000 max_depth=10 loo=off
#>
#> ATTRIBUTE NODE LEV OBS p ESS WESS LOO MODEL
#> ----------------------------------------------------------------------------------
#> x1 1 1 60 .000 100.00% 100.00% OFF <=3.60911-->0; >3.60911-->1
#> Node-local split confusion (this rule only, observations at this node)
#> 0 1
#> --------------
#> 0 | 30 0 | 100.00%
#> 1 | 0 30 | 100.00%
#> --------------
#> NP | 30 30
#>
#> Nodes: 3 total (1 split 2 leaf)
#>
#> Terminal endpoints (*):
#> * endpoint 1 node 2: path=x1<=3.60911 n=30 counts=[0:30 1:0] predicted=0 target_prop=0.0%
#> * endpoint 2 node 3: path=x1>3.60911 n=30 counts=[0:0 1:30] predicted=1 target_prop=100.0%
#> ESS: 100.00% D: 0.0000 strata: 2 min_denom: 30
Increasing mindenom produces simpler (shallower) trees.
Decreasing it allows more splits but risks overfitting and
small-endpoint instability. The MINDENOM descendant family (see
articles/mdsa-family) systematically explores this
tradeoff.
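To see why a larger mindenom leaves fewer ways to split, the sketch below counts the cutpoints on a single attribute that keep at least mindenom observations on each side. The helper admissible_cuts() is hypothetical and written in base R. It only checks the two sides of one split, whereas the package applies the constraint to terminal endpoints of the whole tree.

```r
# Hypothetical helper, not an oda function: cutpoints on one attribute that
# leave at least `mindenom` observations on both sides of the split.
admissible_cuts <- function(x, mindenom) {
  xs <- sort(unique(x))
  cutpoints <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between adjacent values
  n_left <- vapply(cutpoints, function(cp) sum(x <= cp), integer(1))
  keep <- n_left >= mindenom & (length(x) - n_left) >= mindenom
  cutpoints[keep]
}

x <- 1:20
length(admissible_cuts(x, mindenom = 5L))   # 11
length(admissible_cuts(x, mindenom = 10L))  # 1
length(admissible_cuts(x, mindenom = 11L))  # 0
```

With 20 observations, raising the threshold from 5 to 10 shrinks the candidate set from eleven cutpoints to one, and any value above half the node size rules out splitting altogether.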
ENUMERATE: root candidate selection
CTA does not simply split on the best attribute and recurse. Instead, it uses an ENUMERATE phase to evaluate every valid root candidate:
- Expanded phase: for each candidate root attribute, CTA grows the full HO-CTA tree below that root and scores the complete tree.
- Stump phase: each root candidate is also scored as a two-leaf stump (root split only, path-local scoring).
- The root with the highest overall ESS (or WESS when weights are active) wins.
This ensures the selected root is globally best for the full-tree objective, not just locally best at the root node. ENUMERATE evaluates root candidates only; it does not exhaustively enumerate every possible recursive tree.
LOO STABLE: generalisability gate
loo = "stable" (the default) screens candidate splits
using the jackknife:
- The full-node model is fit (training ESS).
- For each observation at the node, the model is refit on the remaining n - 1 cases, and the held-out observation is predicted.
- If |LOO ESS - training ESS| > 1 pp, the split is UNSTABLE and rejected.
- Only STABLE splits are eligible for inclusion in the tree.
This prevents the tree from retaining splits that depend critically on individual observations.
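The jackknife steps above can be sketched in base R for a single attribute. This is an illustration under stated assumptions, not the code path inside cta_fit(): fit_cut(), ess_from_pred(), and loo_stable() are hypothetical helpers, classes are coded 0/1, the rule direction is fixed (low values predict class 0), and ESS is assumed to be mean per-class accuracy rescaled to the 0-100 range.

```r
# Hypothetical sketch; the real gate is applied inside cta_fit().

# Best midpoint cut for the fixed rule "x <= cut -> class 0, x > cut -> class 1".
fit_cut <- function(x, y) {
  xs <- sort(unique(x))
  cutpoints <- (head(xs, -1) + tail(xs, -1)) / 2
  score <- vapply(cutpoints, function(cp) {
    pred <- as.integer(x > cp)
    mean(c(mean(pred[y == 0L] == 0L), mean(pred[y == 1L] == 1L)))
  }, numeric(1))
  cutpoints[which.max(score)]
}

# Assumed two-class ESS in percentage points.
ess_from_pred <- function(y, pred) {
  sens <- c(mean(pred[y == 0L] == 0L), mean(pred[y == 1L] == 1L))
  100 * (mean(sens) - 0.5) / 0.5
}

loo_stable <- function(x, y, tol_pp = 1) {
  train_ess <- ess_from_pred(y, as.integer(x > fit_cut(x, y)))
  loo_pred <- vapply(seq_along(x), function(i) {
    cut_i <- fit_cut(x[-i], y[-i])  # refit on the remaining n - 1 cases
    as.integer(x[i] > cut_i)        # predict the held-out observation
  }, integer(1))
  loo_ess <- ess_from_pred(y, loo_pred)
  list(train_ess = train_ess, loo_ess = loo_ess,
       stable = abs(loo_ess - train_ess) <= tol_pp)
}

y <- rep(0:1, each = 4)
wide_gap  <- loo_stable(c(1, 2, 3, 4, 10, 11, 12, 13), y)
tight_gap <- loo_stable(c(1, 2, 3, 6.9, 7.1, 11, 12, 13), y)
wide_gap$stable   # TRUE
tight_gap$stable  # FALSE
```

Both toy attributes separate the classes perfectly in training. In the second, the two boundary cases sit so close together that removing either one moves the refitted cutpoint past it, so the held-out case is misclassified and the LOO ESS drops to 50%.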
tree_loo <- cta_fit(X, y,
  mindenom = 15L,
  mc_iter = 300L,
  mc_seed = 42L,
  loo = "stable", # default - LOO STABLE gate active
  priors_on = TRUE,
  attr_names = c("x1", "x2")
)
print(tree_loo)
#>
#> CTA Tree alpha_split=0.050 mindenom=15 prune=1.000 max_depth=10 loo=stable
#>
#> No tree found (leaf-only): no valid split passed significance, LOO gate, and MINDENOM constraints.
Pruning: backward complexity reduction
After the tree is grown, CTA applies a backward pruning step. Starting from the deepest nodes, each non-terminal sub-tree is removed if it does not improve the full-tree ESS beyond what a Sidak-adjusted significance threshold requires. This is analogous to cost-complexity pruning in CART but uses ESS gain and the Sidak correction for the number of splits evaluated.
The returned cta_tree represents the final
pruned tree. The full pre-pruning HO tree is not stored
as a separate public tree object; pruning provenance is summarized in
prune_info.
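For reference, the Sidak adjustment used in pruning has a simple closed form: with k tests and a family-wise level alpha, each test is held to 1 - (1 - alpha)^(1/k). The helper sidak_alpha() below is a hypothetical illustration in base R. How CTA counts the splits that make up k is not specified here, so k is left to the caller.

```r
# Sidak-adjusted per-test threshold for k tests at family-wise level alpha.
# Hypothetical helper; how CTA determines k is an assumption left to the caller.
sidak_alpha <- function(alpha, k) 1 - (1 - alpha)^(1 / k)

sidak_alpha(0.05, 1)  # 0.05
sidak_alpha(0.05, 3)  # about 0.01695
```

The threshold tightens as more splits are evaluated, so a sub-tree in a larger tree needs stronger evidence to survive pruning.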
predict.cta_tree(): path-local missingness
Prediction in CTA is path-local: an observation is routed at each node using only the attribute at that node. If the attribute is missing for an observation at the node it is actually traversing, two strategies are available:
preds_na <- predict(tree_loo, X, missing_action = "na") # NA for missing
preds_maj <- predict(tree_loo, X, missing_action = "majority") # legacy majority route
# All non-NA predictions agree:
agree <- sum(preds_na == preds_maj, na.rm = TRUE)
cat("Predictions agree (non-NA):", agree, "of", sum(!is.na(preds_na)), "\n")
#> Predictions agree (non-NA): 0 of 0
- "na" (canonical): observations missing the attribute at a node on their path receive NA_integer_. They are excluded from scoring.
- "majority" (legacy): missing observations are routed to the majority class at the node, so all observations receive a prediction.
For scoring and confusion tables the canonical "na" mode
is used; the observation count reported by
cta_confusion_table() excludes unclassified
observations.
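The difference between the two modes is easiest to see on a toy tree. The sketch below is not the cta_tree structure or the package's predict method: toy_tree and route_one() are hypothetical, and the "majority" mode is interpreted here as following the branch that holds most of the node's training observations, which is an assumption about the legacy behaviour.

```r
# Hypothetical toy tree as nested lists; not the cta_tree object layout.
toy_tree <- list(
  attr = "x1", cut = 3.6, majority = "left",
  left  = list(class = 0L),
  right = list(attr = "x2", cut = 2.0, majority = "right",
               left = list(class = 0L), right = list(class = 1L))
)

# Route one observation; only attributes on the traversed path are consulted.
route_one <- function(node, row, missing_action = c("na", "majority")) {
  missing_action <- match.arg(missing_action)
  while (is.null(node[["class"]])) {
    value <- row[[node[["attr"]]]]
    if (is.na(value)) {
      if (missing_action == "na") return(NA_integer_)
      branch <- node[["majority"]]  # assumed meaning of the legacy mode
    } else {
      branch <- if (value <= node[["cut"]]) "left" else "right"
    }
    node <- node[[branch]]
  }
  node[["class"]]
}

rows <- data.frame(x1 = c(2, 5, 5, NA), x2 = c(NA, 3, NA, 1))
idx <- seq_len(nrow(rows))
pred_na  <- vapply(idx, function(i) route_one(toy_tree, rows[i, ], "na"), integer(1))
pred_maj <- vapply(idx, function(i) route_one(toy_tree, rows[i, ], "majority"), integer(1))
pred_na   # 0 1 NA NA
pred_maj  # 0 1 1 0
```

The first row has a missing x2 but still gets a prediction in both modes, because its path ends at the left leaf before x2 is ever consulted.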
plot.cta_tree(): structural diagram
plot() dispatches to the base-R tree renderer. It shows
the split rule, node size, and terminal class predictions at each
leaf:
plot(tree_loo)
For ggplot2 renderers, see plot_cta_tree() (requires the
ggplot2 package):
pd <- cta_plot_data(tree_loo)
plot_cta_tree(pd)
Confusion table
cta_confusion_table() returns the training confusion as
a tidy long-format data frame (one row per actual x predicted cell),
extracted from the stored confusion without refitting. For the raw 2 x 2
integer matrix directly, use cta_confusion_matrix():
cta_confusion_table(tree_loo)
#> [1] actual predicted n
#> <0 rows> (or 0-length row.names)
Canonical benchmark: CTA_DEMO
The CTA_DEMO dataset (n = 200, 5 attributes, uniform weights) is the
primary regression fixture for the package. The canonical result with
MINDENOM = 1 and mc_iter = 25000 is:
- Root: V2, cut = 4.5
- Overall ESS = 52.63%
- Attributes V1, V3-V5 are not selected
These values are verified against CTA.exe output in
tests/testthat/test-cta.R. See
vignettes/articles/myeloma-cta for a fully worked MINDENOM
family walkthrough on a real dataset with case weights.
Further reading
- articles/myeloma-cta - full MINDENOM family example (when precomputed artifacts are available)
- articles/cta-translation - endpoint staging and the translation pipeline
- articles/mdsa-family - MINDENOM family and minimum-D model selection
- articles/cta-graphics - tree diagram renderers (base-R and ggplot2)
- docs/CTA_CANON.md - canonical CTA behavior specification