CTA basics with cta_fit()

Classification Tree Analysis (CTA) grows a binary classification tree in which every internal node is an ODA model. Each split divides observations into two branches based on a single attribute cutpoint, and the process recurs until no further splits pass the MINDENOM, LOO STABLE, and significance criteria. This article introduces the key concepts and parameters.

CTA as sequential ODA

CTA answers the question: after the best single ODA rule is applied and observations are divided into two branches, is there additional discriminating signal in either branch?

Each node in the tree is an independent ODA fit on the observations routed to that node. The tree grows top-down, one split at a time. Growth stops at a node when:

The minimum endpoint size mindenom is not met by any candidate split.
No candidate split passes the LOO STABLE gate (LOO ESS within 1 percentage point of training ESS).
No candidate split reaches significance (Monte Carlo p < alpha_split).
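
The three stopping conditions can be read as one admissibility check on a candidate split. The sketch below is a base-R illustration only: `split_admissible()` and its arguments are hypothetical names, not part of the package, the input values are made up, and the 1-percentage-point tolerance is taken from the LOO STABLE section of this article.

```r
# Hypothetical helper (not an oda function): TRUE if a candidate split
# clears the MINDENOM, LOO STABLE, and significance gates.
split_admissible <- function(n_left, n_right, train_ess, loo_ess, p_mc,
                             mindenom = 10L, alpha_split = 0.05,
                             loo = c("stable", "off")) {
  loo <- match.arg(loo)
  size_ok <- min(n_left, n_right) >= mindenom                 # MINDENOM
  loo_ok  <- loo == "off" || abs(loo_ess - train_ess) <= 1    # LOO STABLE (pp)
  sig_ok  <- p_mc < alpha_split                               # MC significance
  size_ok && loo_ok && sig_ok
}

# Illustrative inputs only
split_admissible(30, 30, train_ess = 80, loo_ess = 79.5, p_mc = 0.01)  # TRUE
split_admissible(30,  8, train_ess = 80, loo_ess = 79.5, p_mc = 0.01)  # FALSE: endpoint too small
split_admissible(30, 30, train_ess = 80, loo_ess = 70.0, p_mc = 0.01)  # FALSE: LOO unstable
split_admissible(30, 30, train_ess = 80, loo_ess = 79.5, p_mc = 0.20)  # FALSE: not significant
```

Growth stops at a node when no candidate split returns TRUE under a check of this kind.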

After growing, a backward pruning step removes nodes whose sub-tree does not improve the full-tree ESS relative to the Sidak-adjusted significance threshold.

`cta_fit()` arguments

Argument	Default	Purpose
`X`	-	Data frame of attribute columns
`y`	-	Integer class variable
`mindenom`	10L	Minimum observations at any terminal endpoint
`mc_iter`	5000L	Monte Carlo permutations for significance screening
`mc_seed`	42L	Seed for MC reproducibility
`alpha_split`	0.05	Significance threshold for accepting a split
`loo`	`"stable"`	`"stable"` = LOO STABLE gate; `"off"` = skip LOO
`priors_on`	TRUE	Balance class frequencies in objective
`attr_names`	NULL	Optional names for attribute columns
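
To make the roles of `mc_iter` and `mc_seed` concrete, here is a minimal base-R sketch of a Monte Carlo permutation p-value. It is not the package's implementation: `best_ess()` and `mc_pvalue()` are hypothetical helpers, the statistic is a toy single-attribute ESS search with classes coded 0/1, and the `+1` correction in the p-value is an assumption.

```r
# Toy statistic: best two-class ESS over all midpoint cuts on one attribute.
# Assumes y is coded 0/1 and both classes are present.
best_ess <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  ess_at <- function(cp) {
    pred <- as.integer(x > cp)
    sens <- c(mean(pred[y == 0L] == 0L), mean(pred[y == 1L] == 1L))
    abs(100 * (mean(sens) - 0.5) / 0.5)   # abs(): allow either rule direction
  }
  max(vapply(cuts, ess_at, numeric(1)))
}

# Permutation p-value: share of label shuffles whose best ESS is at least
# as large as the observed one. `iter` plays the role of mc_iter and
# `seed` the role of mc_seed.
mc_pvalue <- function(x, y, iter = 199L, seed = 42L) {
  set.seed(seed)   # fixed seed makes the p-value reproducible
  observed <- best_ess(x, y)
  permuted <- vapply(seq_len(iter),
                     function(b) best_ess(x, sample(y)), numeric(1))
  (1 + sum(permuted >= observed)) / (iter + 1)
}

x <- 1:20
y <- rep(0:1, each = 10)   # perfectly separable toy data
p <- mc_pvalue(x, y)
p < 0.05                   # compare against alpha_split
```

More iterations give a finer p-value resolution at the cost of run time, which is why the examples in this article lower `mc_iter` for speed.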

MINDENOM: controlling endpoint size

mindenom sets the minimum number of observations at any terminal leaf. It is the primary complexity control in CTA: a larger mindenom forces shallower, simpler trees; a smaller value allows deeper, more granular trees.

library(oda)

set.seed(1L)
n <- 60L  # total observations, n / 2 per class
X <- data.frame(
  x1 = c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 5)),
  x2 = c(rnorm(n / 2, mean = 1), rnorm(n / 2, mean = 3))
)
y <- c(rep(1L, n / 2), rep(2L, n / 2))

# mindenom = 15: at least 15 obs per terminal endpoint
tree <- cta_fit(X, y,
  mindenom    = 15L,
  mc_iter     = 300L,
  mc_seed     = 42L,
  loo         = "off",
  priors_on   = TRUE,
  attr_names  = c("x1", "x2")
)
print(tree)
#> 
#> CTA Tree  alpha_split=0.050  mindenom=15  prune=1.000  max_depth=10  loo=off
#> 
#> ATTRIBUTE      NODE  LEV    OBS       p      ESS     WESS      LOO  MODEL
#> ---------------------------------------------------------------------------------- 
#> x1                1    1     60    .000  100.00%  100.00%      OFF  <=3.60911-->0; >3.60911-->1
#>   Node-local split confusion (this rule only, observations at this node)
#>                    0      1 
#>              -------------- 
#>        0  |      30      0 | 100.00%
#>        1  |       0     30 | 100.00%
#>              -------------- 
#>       NP  |      30     30
#> 
#> Nodes: 3 total  (1 split  2 leaf)
#> 
#> Terminal endpoints (*):
#> * endpoint 1  node 2:  path=x1<=3.60911  n=30  counts=[0:30 1:0]  predicted=0  target_prop=0.0%
#> * endpoint 2  node 3:  path=x1>3.60911  n=30  counts=[0:0 1:30]  predicted=1  target_prop=100.0%
#> ESS: 100.00%  D: 0.0000  strata: 2  min_denom: 30

Increasing mindenom produces simpler (shallower) trees. Decreasing it allows more splits but risks overfitting and small-endpoint instability. The MINDENOM descendant family (see articles/mdsa-family) systematically explores this tradeoff.
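
One way to see why a larger mindenom yields simpler trees is to count how many cutpoints remain admissible on a single attribute. The following base-R sketch is illustrative only: `admissible_cuts()` is a hypothetical helper, and treating candidate cuts as midpoints between adjacent distinct values is an assumption rather than a documented detail of `cta_fit()`.

```r
# Hypothetical helper (not an oda function): cutpoints on one numeric
# attribute that leave at least `mindenom` observations on each side.
# Assumption: candidate cuts are midpoints between adjacent distinct values.
admissible_cuts <- function(x, mindenom) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  n_left <- vapply(cuts, function(cp) sum(x <= cp), integer(1))
  cuts[n_left >= mindenom & (length(x) - n_left) >= mindenom]
}

x <- 1:10
length(admissible_cuts(x, 1L))  # 9
length(admissible_cuts(x, 3L))  # 5
length(admissible_cuts(x, 6L))  # 0: ten cases cannot leave 6 on both sides
```

As the node size shrinks with depth, the admissible set empties sooner for larger mindenom values, which stops growth earlier.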

ENUMERATE: root candidate selection

CTA does not simply split on the best attribute and recurse. Instead, it uses an ENUMERATE phase to evaluate every valid root candidate:

Expanded phase: for each candidate root attribute, CTA grows the full hierarchically optimal CTA (HO-CTA) tree below that root and scores the complete tree.
Stump phase: each root candidate is also scored as a two-leaf stump (root split only, path-local scoring).
The root with the highest overall ESS (or WESS when weights are active) wins.

This ensures the selected root is globally best for the full-tree objective, not just locally best at the root node. ENUMERATE evaluates root candidates only; it does not exhaustively enumerate every possible recursive tree.

LOO STABLE: generalisability gate

loo = "stable" (the default) screens candidate splits using the jackknife:

The full-node model is fit (training ESS).
For each observation at the node, the model is refit on the remaining n - 1 cases, and the held-out observation is predicted.
If |LOO ESS - training ESS| > 1 percentage point, the split is marked UNSTABLE and rejected.
Only STABLE splits are eligible for inclusion in the tree.

This prevents the tree from retaining splits that depend critically on individual observations.
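
The jackknife steps above can be sketched in base R for a single attribute. This is a toy illustration under stated assumptions, not the package's code: `ess2()`, `fit_cut()`, `predict_cut()`, and `loo_stable()` are hypothetical helpers, classes are coded 0/1, candidate cuts are midpoints, and ties are broken by taking the first best cut.

```r
# Two-class ESS: mean sensitivity rescaled so chance = 0 and perfect = 100.
ess2 <- function(actual, predicted) {
  sens <- vapply(c(0L, 1L),
                 function(k) mean(predicted[actual == k] == k), numeric(1))
  100 * (mean(sens) - 0.5) / 0.5
}

# Toy one-attribute rule: midpoint cut and direction that maximise ESS.
fit_cut <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  best <- list(cut = NA_real_, hi = NA_integer_, ess = -Inf)
  for (cp in cuts) for (hi in c(1L, 0L)) {
    e <- ess2(y, ifelse(x > cp, hi, 1L - hi))
    if (e > best$ess) best <- list(cut = cp, hi = hi, ess = e)
  }
  best
}

predict_cut <- function(m, x) ifelse(x > m$cut, m$hi, 1L - m$hi)

# Refit on n - 1 cases, predict the held-out case, compare ESS values.
loo_stable <- function(x, y, tol_pp = 1) {
  train <- fit_cut(x, y)
  held <- vapply(seq_along(x), function(i) {
    predict_cut(fit_cut(x[-i], y[-i]), x[i])
  }, integer(1))
  loo <- ess2(y, held)
  list(train_ess = train$ess, loo_ess = loo,
       stable = abs(loo - train$ess) <= tol_pp)
}

# Well-separated classes: the held-out predictions match the training fit.
loo_stable(c(1, 2, 3, 4, 6, 7, 8, 9), c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L))$stable  # TRUE

# Overlapping classes: the cut moves when single cases are dropped.
loo_stable(1:6, c(0L, 0L, 1L, 0L, 1L, 1L))$stable  # FALSE
```

In the second call the training ESS is about 66.7% but the LOO ESS is about 33.3%, so the split would be rejected under the 1-percentage-point rule.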

tree_loo <- cta_fit(X, y,
  mindenom    = 15L,
  mc_iter     = 300L,
  mc_seed     = 42L,
  loo         = "stable",   # default: LOO STABLE gate active
  priors_on   = TRUE,
  attr_names  = c("x1", "x2")
)
print(tree_loo)
#> 
#> CTA Tree  alpha_split=0.050  mindenom=15  prune=1.000  max_depth=10  loo=stable
#> 
#> No tree found (leaf-only): no valid split passed significance, LOO gate, and MINDENOM constraints.

Pruning: backward complexity reduction

After the tree is grown, CTA applies a backward pruning step. Starting from the deepest nodes, each non-terminal sub-tree is removed if it does not improve the full-tree ESS beyond what a Sidak-adjusted significance threshold requires. This is analogous to cost-complexity pruning in CART but uses ESS gain and the Sidak correction for the number of splits evaluated.

The returned cta_tree represents the final pruned tree. The full pre-pruning HO tree is not stored as a separate public tree object; pruning provenance is summarized in prune_info.
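
For reference, the standard Sidak adjustment for k comparisons is alpha_adj = 1 - (1 - alpha)^(1/k). The base-R sketch below shows how quickly the per-split threshold tightens. `sidak_alpha()` is a hypothetical helper, and how the package counts k during pruning is not specified here, so treat the k values as illustrative.

```r
# Sidak-adjusted per-comparison threshold for k comparisons at
# familywise level alpha. Hypothetical helper, not an oda function.
sidak_alpha <- function(alpha, k) 1 - (1 - alpha)^(1 / k)

sidak_alpha(0.05, 1)    # 0.05: a single split needs no adjustment
sidak_alpha(0.05, 3)    # about 0.0170
sidak_alpha(0.05, 10)   # about 0.0051

# Sanity check: k tests at the adjusted level recover the familywise alpha.
1 - (1 - sidak_alpha(0.05, 3))^3   # 0.05
```

A split deep in a large tree therefore has to clear a stricter threshold than alpha_split alone to survive pruning.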

`predict.cta_tree()`: path-local missingness

Prediction in CTA is path-local: an observation is routed at each node using only that node's attribute, so a missing value matters only when it occurs at a node the observation actually traverses. Two strategies for that case are available through missing_action:

preds_na  <- predict(tree_loo, X, missing_action = "na")       # NA for missing
preds_maj <- predict(tree_loo, X, missing_action = "majority") # legacy majority route

# Count agreement among non-NA predictions. tree_loo is leaf-only here and
# preds_na contains no non-NA values, so the count is 0 of 0:
agree <- sum(preds_na == preds_maj, na.rm = TRUE)
cat("Predictions agree (non-NA):", agree, "of", sum(!is.na(preds_na)), "\n")
#> Predictions agree (non-NA): 0 of 0

"na" (canonical): observations whose attribute is missing at a node on their path receive NA_integer_. They are excluded from scoring.
"majority" (legacy): observations with a missing attribute value are routed to the majority class at the node, so all observations receive a prediction.

For scoring and confusion tables the canonical "na" mode is used; the observation count reported by cta_confusion_table() excludes unclassified observations.
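
Path-local routing can be illustrated with a toy tree in base R. Everything in this sketch is hypothetical: `toy_tree` is a nested list, not the internal structure of a cta_tree, `route()` is not a package function, and returning the node's majority class under "majority" is one reading of the legacy behaviour.

```r
# Hypothetical two-level tree as nested lists (not the cta_tree structure).
toy_tree <- list(
  attr = "x1", cut = 3.5, majority = 0L,
  left  = list(leaf = TRUE, class = 0L),
  right = list(attr = "x2", cut = 2.0, majority = 1L,
               left  = list(leaf = TRUE, class = 0L),
               right = list(leaf = TRUE, class = 1L))
)

# Route one observation; only attributes on its own path are inspected.
route <- function(node, obs, missing_action = c("na", "majority")) {
  missing_action <- match.arg(missing_action)
  while (!isTRUE(node$leaf)) {
    v <- obs[[node$attr]]
    if (is.na(v)) {
      return(if (missing_action == "na") NA_integer_ else node$majority)
    }
    node <- if (v <= node$cut) node$left else node$right
  }
  node$class
}

route(toy_tree, list(x1 = 2, x2 = NA))              # 0: x2 is never consulted
route(toy_tree, list(x1 = 5, x2 = NA), "na")        # NA
route(toy_tree, list(x1 = 5, x2 = NA), "majority")  # 1
```

The first call shows the path-local property: a missing x2 does not matter for an observation that ends at the left leaf of the root.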

`plot.cta_tree()`: structural diagram

plot() dispatches to the base-R tree renderer. It shows the split rule, node size, and terminal class predictions at each leaf:

plot(tree_loo)

For ggplot2 renderers, see plot_cta_tree() (requires the ggplot2 package):

pd <- cta_plot_data(tree_loo)
plot_cta_tree(pd)

Confusion table

cta_confusion_table() returns the training confusion as a tidy long-format data frame (one row per actual × predicted cell), extracted from the stored confusion without refitting. For the raw 2 × 2 integer matrix directly, use cta_confusion_matrix():

cta_confusion_table(tree_loo)
#> [1] actual    predicted n        
#> <0 rows> (or 0-length row.names)
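
As a cross-check on reported ESS values, ESS can be recomputed by hand from a confusion matrix. The sketch below assumes the usual ODA definition (mean per-class sensitivity rescaled so that chance is 0% and perfect classification is 100%) and assumes rows index the actual class. `ess_from_confusion()` is a hypothetical helper, and the first matrix is typed in from the `tree` printout earlier in this article rather than extracted from a fitted object.

```r
# Hypothetical helper (not an oda function): ESS from a square confusion
# matrix with actual classes in rows and predicted classes in columns.
ess_from_confusion <- function(cm) {
  sens <- diag(cm) / rowSums(cm)   # per-class sensitivity
  chance <- 1 / nrow(cm)
  100 * (mean(sens) - chance) / (1 - chance)
}

# Node-local confusion printed for `tree` above: 30/0 and 0/30.
cm <- matrix(c(30L, 0L,
               0L, 30L), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
ess_from_confusion(cm)   # 100

# Illustrative imperfect table: sensitivities 0.8 and 0.6.
cm2 <- matrix(c(40L, 10L,
                20L, 30L), nrow = 2, byrow = TRUE)
ess_from_confusion(cm2)  # 40
```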

Canonical benchmark: CTA_DEMO

The CTA_DEMO dataset (n = 200, 5 attributes, uniform weights) is the primary regression fixture for the package. The canonical result with MINDENOM = 1 and mc_iter = 25000 is:

Root: V2, cut = 4.5
Overall ESS = 52.63%
Attributes V1, V3-V5 are not selected

These values are verified against CTA.exe output in tests/testthat/test-cta.R. See vignettes/articles/myeloma-cta for a fully worked MINDENOM family walkthrough on a real dataset with case weights.