Assign observations to CTA terminal endpoints

Traverses the fitted cta_tree for each row of newdata and returns the terminal leaf reached, expressed as both its stored node identifier (endpoint_node_id) and its sequential endpoint index (endpoint_id) matching cta_endpoint_summary.

No endpoint membership is stored at fit time. This function performs the traversal on demand so the cta_tree object remains lean. The returned endpoint_id can be joined with the output of cta_propensity_weights to assign endpoint-level stabilized weights to individual observations.

Column order requirement: newdata must have the same attribute column order as the X matrix passed to oda_cta_fit. Traversal uses the stored integer column positions (attr_col) from the fit, not column names. If both names(newdata) and tree$attr_names are non-NULL, a warning is issued when they disagree at the split attribute positions.
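
Because traversal is positional, it can help to reorder newdata by name before the call. The following is a minimal base-R sketch, assuming tree$attr_names is non-NULL and holds the training column names in their original order. align_columns is a hypothetical helper, not part of the package, and the attr_names vector below is a stand-in for tree$attr_names.

```r
# Hypothetical helper (not part of the package): reorder `newdata` so its
# columns follow `attr_names`, failing loudly if any are absent.
align_columns <- function(newdata, attr_names) {
  missing_cols <- setdiff(attr_names, names(newdata))
  if (length(missing_cols) > 0L) {
    stop("newdata lacks column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Extra columns are dropped; the remaining ones follow the training order.
  newdata[, attr_names, drop = FALSE]
}

# Toy demonstration with a stand-in for tree$attr_names:
attr_names <- c("cyl", "disp", "hp", "wt")
shuffled   <- mtcars[, c("wt", "mpg", "hp", "cyl", "disp")]
aligned    <- align_columns(shuffled, attr_names)
names(aligned)
```

Under the same assumption, the aligned frame could then be passed on as cta_assign_endpoints(tree, align_columns(newdata, tree$attr_names)).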

Missingness:

"na" (default): Canonical path-local behaviour. When a split attribute on the observation's actual traversal path is NA or equals a stored miss-code, the row returns NA for both output columns. Missing values in attributes that are not on that path have no effect. This matches the canonical missing_action = "na" semantics of predict.
"majority": Routes the observation to the child subtree with the larger n_obs, then continues traversal to a terminal leaf. Ties are resolved by selecting the first child.
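
The two modes can be illustrated with a toy traversal. This is a sketch under stated assumptions, not the package's implementation: the node layout (fields leaf, attr_col, cut, left, right, n_obs), the "value <= cut goes left" rule, the treatment of the left child as the "first" child, and the route_row function are all hypothetical. Stored miss-codes are not modelled; only plain NA is handled.

```r
# Hypothetical toy tree: a named list of nodes keyed by node id.
# Internal nodes split on column position `attr_col` at `cut`;
# values <= cut go to `left`, all others to `right` (assumed rule).
nodes <- list(
  "1" = list(leaf = FALSE, attr_col = 1L, cut = 5, left = 2L, right = 3L, n_obs = 100L),
  "2" = list(leaf = TRUE,  n_obs = 40L),
  "3" = list(leaf = FALSE, attr_col = 2L, cut = 0, left = 4L, right = 5L, n_obs = 60L),
  "4" = list(leaf = TRUE,  n_obs = 25L),
  "5" = list(leaf = TRUE,  n_obs = 35L)
)

route_row <- function(x, nodes, missing_action = c("na", "majority")) {
  missing_action <- match.arg(missing_action)
  id <- 1L  # start at the root
  repeat {
    node <- nodes[[as.character(id)]]
    if (node$leaf) return(id)
    value <- x[[node$attr_col]]
    if (is.na(value)) {
      # Path-local: only a missing value at a split actually visited matters.
      if (missing_action == "na") return(NA_integer_)
      # "majority": follow the child with the larger n_obs; ties go to the
      # first (here: left) child.
      n_left  <- nodes[[as.character(node$left)]]$n_obs
      n_right <- nodes[[as.character(node$right)]]$n_obs
      id <- if (n_right > n_left) node$right else node$left
    } else {
      id <- if (value <= node$cut) node$left else node$right
    }
  }
}

r1 <- route_row(c(3, NA), nodes, "na")         # NA in column 2 is off the path
r2 <- route_row(c(NA, -1), nodes, "na")        # NA at the root split
r3 <- route_row(c(NA, -1), nodes, "majority")  # root NA routed to the larger child
```

In this toy tree r1 reaches leaf 2 because the missing second column is never consulted, r2 is NA, and r3 is sent to node 3 (60 observations versus 40) and then continues to leaf 4.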

Usage

cta_assign_endpoints(tree, newdata, missing_action = c("na", "majority"))

Arguments

tree: A cta_tree from oda_cta_fit.
newdata: A data.frame (or coercible object) with the same column order as the training X supplied to oda_cta_fit.
missing_action: Character; one of "na" (default) or "majority". See Description.

Value

A data.frame with one row per row of newdata and columns:

row_id: Integer; positional row index in newdata (1 to nrow(newdata)).
endpoint_node_id: Integer; node_id of the terminal leaf reached by traversal. NA_integer_ when the observation cannot be routed to a terminal leaf (missing split attribute with missing_action = "na", or no-tree fit).
endpoint_id: Integer; sequential endpoint index matching cta_endpoint_summary. NA_integer_ under the same conditions as endpoint_node_id.

For no-tree fits all rows have endpoint_node_id = NA_integer_ and endpoint_id = NA_integer_.
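
A quick way to check how many rows were routed is to tabulate the result. The sketch below uses a hand-built stand-in (ep_demo, a hypothetical name) with the documented columns rather than a real fit, so it runs without a fitted tree; the values are illustrative only.

```r
# Stand-in for the data.frame returned by cta_assign_endpoints():
ep_demo <- data.frame(
  row_id           = 1:5,
  endpoint_node_id = c(3L, 3L, NA, 2L, 2L),
  endpoint_id      = c(2L, 2L, NA, 1L, 1L)
)

# Rows that could not be routed to a terminal leaf:
n_unrouted <- sum(is.na(ep_demo$endpoint_id))

# Rows per endpoint, with unrouted rows counted under NA:
counts <- table(ep_demo$endpoint_id, useNA = "ifany")
```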

Details

Observation-level propensity weights (workflow sketch):


ep  <- cta_assign_endpoints(tree, X_train, missing_action = "na")
pw  <- cta_propensity_weights(tree, target_class = 1L, adjusted = TRUE)

# One row per classified training observation with its weight:
obs <- merge(
  data.frame(row_id = seq_len(nrow(X_train)),
             class  = as.character(y_train)),
  merge(ep, pw[, c("endpoint_id", "class", "adjusted_propensity_weight")],
        by = "endpoint_id"),
  by = c("row_id", "class")
)
# Rows with NA endpoint_id (a missing split attribute on the traversal
# path) are dropped by the inner merge.

Observation-level propensity weight expansion is intentionally left to the caller so that the cta_tree object stores no observation indices.

Examples

data(mtcars)
X    <- mtcars[, c("cyl", "disp", "hp", "wt")]
y    <- as.integer(mtcars$am)
tree <- oda_cta_fit(X, y, mindenom = 5L, mc_iter = 500L, mc_seed = 42L)
ep   <- cta_assign_endpoints(tree, X)
head(ep)
#>   row_id endpoint_node_id endpoint_id
#> 1      1                3           2
#> 2      2                3           2
#> 3      3                3           2
#> 4      4                2           1
#> 5      5                2           1
#> 6      6                2           1
