Multiclass ODA: Convergent Validity of Protein Classification Methods
oda
2026-06-15
Source: vignettes/protein-type-multiclass-oda.Rmd
Research question
Nishikawa, Kubota, and Ooi (1983) independently classified 325 proteins into one of four mutually exclusive types using two different methods: one based on biological characteristics and one based on amino acid composition.1 Because the two methods should theoretically converge on the same type for each protein, the directional hypothesis is that protein type codes are identical across methods - demonstrating convergent validity.
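Optimal Data Analysis (MultiODA) tests whether amino acid composition type discriminates biological type. The a priori prediction is that type codes match across methods; the model below is nevertheless fitted with a nondirectional search, so the identity mapping is recovered from the data rather than imposed.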
Data
Biological type (1-4) is the class variable; amino acid composition type (1-4) is the attribute. Published cell frequencies are reconstructed directly into observation-level vectors - no external data file is required.
library(oda)
# Cross-classification: rows = biological type, cols = amino acid type.
# (column-major reconstruction matches published Table 1)
#                 AA=1  AA=2  AA=3  AA=4  total
# Biological=1      98    16     5     3    122
# Biological=2      13    50     2     8     73
# Biological=3       6     4    23    12     45
# Biological=4       7    19    14    45     85
# total            124    89    44    68    325
biological_type <- c(
  rep(1L, 98), rep(2L, 13), rep(3L,  6), rep(4L,  7),  # amino_acid = 1
  rep(1L, 16), rep(2L, 50), rep(3L,  4), rep(4L, 19),  # amino_acid = 2
  rep(1L,  5), rep(2L,  2), rep(3L, 23), rep(4L, 14),  # amino_acid = 3
  rep(1L,  3), rep(2L,  8), rep(3L, 12), rep(4L, 45)   # amino_acid = 4
)
amino_acid_type <- c(rep(1L, 124), rep(2L, 89), rep(3L, 44), rep(4L, 68))
table(amino_acid_type, biological_type,
      dnn = c("Amino Acid Type (1-4)", "Biological Type (1-4)"))
#>                      Biological Type (1-4)
#> Amino Acid Type (1-4)  1  2  3  4
#>                     1 98 13  6  7
#>                     2 16 50  4 19
#>                     3  5  2 23 14
#>                     4  3  8 12 45
Fit the ODA model
Amino acid type is a four-category nominal variable. ODA searches all
possible mappings from the four amino-acid-type categories to the four
biological-type classes and selects the mapping that maximises ESS. No
a priori direction is supplied; the search is nondirectional
(Hypothesis: NONDIRECTIONAL in MegaODA output).
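The search described above can be illustrated with a brute-force sketch in base R. This is not the package's implementation: it assumes "all possible mappings" means every one of the 4^4 = 256 assignments of an amino-acid category to a biological class, and it computes ESS as mean PAC rescaled against the 25% chance benchmark. The names `counts`, `ess_for_mapping`, and `mappings` are hypothetical and used only for illustration.

```r
# Published cell frequencies: rows = biological type, cols = amino acid type
counts <- matrix(
  c(98, 16,  5,  3,
    13, 50,  2,  8,
     6,  4, 23, 12,
     7, 19, 14, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(bio = 1:4, aa = 1:4)
)

# ESS for one candidate mapping; map[a] is the class predicted for category a
ess_for_mapping <- function(map, counts) {
  n_classes <- nrow(counts)
  correct <- numeric(n_classes)
  for (a in seq_along(map)) {
    cls <- map[[a]]
    correct[cls] <- correct[cls] + counts[cls, a]
  }
  pac <- 100 * correct / rowSums(counts)   # classes never predicted get PAC = 0
  chance <- 100 / n_classes
  (mean(pac) - chance) / (100 - chance) * 100
}

# Enumerate every assignment of 4 categories to 4 classes (assumed search space)
mappings <- as.matrix(expand.grid(rep(list(1:4), 4)))
ess <- apply(mappings, 1, ess_for_mapping, counts = counts)

best_map <- unname(mappings[which.max(ess), ])
best_ess <- max(ess)
print(best_map)
print(round(best_ess, 2))
```

Under these assumptions the enumeration selects the identity mapping, in agreement with the rule reported in the fitted model's output.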
Leave-one-out (LOO) jackknife validity is requested via
loo = "on"; LOO confusion and ESS are reported. No LOO
p-value is given because no canonical Fisher-exact LOO p-value is
defined for C > 2 multicategorical class problems.
# Canonical reference run (mc_iter = 25000L; not evaluated in CRAN vignette)
fit <- oda_fit(
  x         = amino_acid_type,
  y         = biological_type,
  attr_type = "categorical",
  mc_iter   = 25000L,
  loo       = "on"
)

# CRAN-safe run: mc_iter = 500L for vignette rendering speed.
# Training rule, ESS, and confusion matrix are identical to the canonical run.
fit <- oda_fit(
  x         = amino_acid_type,
  y         = biological_type,
  attr_type = "categorical",
  mc_iter   = 500L,
  mc_seed   = 42L,
  loo       = "on"
)
Rule and confusion matrix
print(fit)
#>
#> ODA (multiclass) attr_type=categorical priors=TRUE n=325
#>
#> Rule: 1 --> 1 | 2 --> 2 | 3 --> 3 | 4 --> 4
#>
#> CLASS PAC
#> 1 80.3%
#> 2 68.5%
#> 3 51.1%
#> 4 52.9%
#>
#> Mean PAC: 63.22% ESS: 50.96% p(MC): < .001
#>
#> -- LOO --
#> CLASS PAC
#> 1 80.3%
#> 2 68.5%
#> 3 51.1%
#> 4 52.9%
#>
#> LOO Mean PAC: 63.22% LOO ESS: 50.96%
#> p(LOO): not reported for multicategorical ODA
ODA’s nondirectional search identified the identity mapping as the optimal categorical partition:
- If amino acid type = 1 -> predict biological type = 1
- If amino acid type = 2 -> predict biological type = 2
- If amino acid type = 3 -> predict biological type = 3
- If amino acid type = 4 -> predict biological type = 4
# Confusion matrix (actual x predicted); strip dimnames for clean display
conf_mat <- unname(fit$confusion)
rownames(conf_mat) <- paste0("Bio=", 1:4)
colnames(conf_mat) <- paste0("Pred=", 1:4)
print(conf_mat)
#>       Pred=1 Pred=2 Pred=3 Pred=4
#> Bio=1     98     16      5      3
#> Bio=2     13     50      2      8
#> Bio=3      6      4     23     12
#> Bio=4      7     19     14     45
ESS / PAC / PV interpretation
summary(fit)
#>
#> ODA Summary (multiclass) status=valid n=325
#> attr_type=categorical priors=TRUE weights=FALSE
#> Rule: 1 --> 1 | 2 --> 2 | 3 --> 3 | 4 --> 4
#>
#> -- Train --
#> Mean PAC: 63.22% ESS: 50.96%
#> p(MC): < .001 [MC permutation, two-tailed]
#> -- LOO --
#> CLASS PAC
#> 1 80.3%
#> 2 68.5%
#> 3 51.1%
#> 4 52.9%
#> LOO ESS: 50.96%
#> LOO Mean PAC: 63.22%
#> p(LOO): not reported for multicategorical ODA
m <- oda_metrics(fit)
# PAC (sensitivity) per class - pac_by_class is already on percentage scale
cat("PAC by biological type:\n")
#> PAC by biological type:
cat(" Type 1:", round(m$pac_by_class[1], 1), "%\n")
#> Type 1: 80.3 %
cat(" Type 2:", round(m$pac_by_class[2], 1), "%\n")
#> Type 2: 68.5 %
cat(" Type 3:", round(m$pac_by_class[3], 1), "%\n")
#> Type 3: 51.1 %
cat(" Type 4:", round(m$pac_by_class[4], 1), "%\n")
#> Type 4: 52.9 %
# Predictive value: diagonal / column sums
pv <- diag(fit$confusion) / colSums(fit$confusion) * 100
cat("\nPV by biological type:\n")
#>
#> PV by biological type:
cat(" Type 1:", round(pv[1], 1), "%\n")
#> Type 1: 79 %
cat(" Type 2:", round(pv[2], 1), "%\n")
#> Type 2: 56.2 %
cat(" Type 3:", round(pv[3], 1), "%\n")
#> Type 3: 52.3 %
cat(" Type 4:", round(pv[4], 1), "%\n")
#> Type 4: 66.2 %
- PAC (sensitivity per class): 80.3%, 68.5%, 51.1%, and 52.9% for protein types 1 through 4, respectively. With four classes, 25% correct per class is expected by chance, so every type is classified well above chance. Each PAC also exceeds 50%, indicating majority-accurate classification within each class.
- ESS = 50.96% indicates a relatively strong effect.2 ESS rescales mean PAC (63.22%) so that 0 corresponds to the 25% chance benchmark and 100 to perfect classification.
- PV: When the model predicts type 1 it is correct ~79.0% of the time; type 2, ~56.2%; type 3, ~52.3%; type 4, ~66.2%. All predictive values exceed both the 25% benchmark and the base rate of the corresponding biological type (37.5%, 22.5%, 13.8%, and 26.2%).
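The figures in this list can be checked by hand from the confusion matrix alone. The following base-R sketch does not call the package; it assumes the actual-by-predicted layout printed earlier and the usual ESS rescaling of mean PAC against chance (100 / C). The object names (`conf`, `base_rate`, and so on) are illustrative only.

```r
# Confusion matrix: rows = actual biological type, cols = predicted type
conf <- matrix(
  c(98, 16,  5,  3,
    13, 50,  2,  8,
     6,  4, 23, 12,
     7, 19, 14, 45),
  nrow = 4, byrow = TRUE
)

pac <- 100 * diag(conf) / rowSums(conf)   # sensitivity: correct / actual class size
pv  <- 100 * diag(conf) / colSums(conf)   # predictive value: correct / times predicted

mean_pac <- mean(pac)
chance   <- 100 / nrow(conf)              # 25% for a four-class problem
ess      <- (mean_pac - chance) / (100 - chance) * 100

# Base rate of each biological type, the PV expected from uninformed prediction
base_rate <- 100 * rowSums(conf) / sum(conf)

print(round(pac, 1))
print(round(pv, 1))
print(round(c(mean_pac = mean_pac, ess = ess), 2))
print(round(base_rate, 1))
```

Comparing `pv` with `base_rate` is one way to make "exceeds chance" concrete for predictive values, since the 25% benchmark applies to PAC rather than to PV.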
Monte Carlo and LOO validity
The MC p-value and LOO results are shown in the print
and summary output above.
- MC p-value: The printed p(MC) is a nondirectional Fisher-randomization p-value. Each permutation searches for the best categorical mapping of the permuted labels, matching the nondirectional search used for the training model. Interpret by decision threshold (e.g., p < 0.05).
- LOO jackknife: Leave-one-out ESS and Mean PAC are shown. Each fold holds out one observation, searches for the optimal categorical mapping on the remaining n-1 observations (nondirectional, equal priors), and classifies the held-out case using that fold’s rule. Because the identity mapping is the globally optimal partition for these data, every fold recovers the same rule, and LOO ESS equals training ESS exactly. This confirms the model is stable across folds and no single observation drives the result.
- LOO p-value: No LOO Fisher-exact p-value is reported for multicategorical (C > 2) class problems. No canonical reference distribution is defined for the C x C LOO confusion matrix in this context. For binary class ODA, a one-tailed Fisher exact p-value is available (MPE p. 34).
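The leave-one-out procedure in the list above can be sketched in base R to check the claim that every fold recovers the identity rule. This is an illustration, not the package's algorithm: it assumes the fold-level search maximises mean PAC over unconstrained mappings, in which case the best class for each category is the one with the highest within-class rate. The helper `best_map` is hypothetical.

```r
# Rebuild observation-level vectors from the published cell frequencies
cell_counts <- c(98, 13, 6, 7,  16, 50, 4, 19,  5, 2, 23, 14,  3, 8, 12, 45)
bio <- rep(rep(1:4, times = 4), times = cell_counts)  # biological type (class)
aa  <- rep(rep(1:4, each  = 4), times = cell_counts)  # amino acid type (attribute)

# Assumed search: assign each category to the class with the largest share
# of its own members in that category (maximises mean PAC, hence ESS)
best_map <- function(tab) {
  rates <- tab / rowSums(tab)            # rows = class, cols = category
  unname(apply(rates, 2, which.max))
}

n <- length(bio)
loo_pred  <- integer(n)
same_rule <- logical(n)
for (i in seq_len(n)) {
  tab_i <- table(factor(bio[-i], levels = 1:4), factor(aa[-i], levels = 1:4))
  map_i <- best_map(tab_i)               # rule fitted without observation i
  same_rule[i] <- all(map_i == 1:4)      # did this fold recover the identity map?
  loo_pred[i]  <- map_i[aa[i]]           # classify the held-out observation
}

loo_conf <- table(factor(bio, levels = 1:4), factor(loo_pred, levels = 1:4))
loo_pac  <- 100 * diag(loo_conf) / rowSums(loo_conf)
loo_ess  <- (mean(loo_pac) - 25) / 75 * 100

print(all(same_rule))
print(round(loo_ess, 2))
```

When every fold returns the identity map, the held-out predictions coincide with the training predictions, which is why LOO ESS and training ESS agree for these data.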
Notes on reproducibility
Fixture parity. The training rule, confusion matrix,
and ESS are verified against MegaODA.exe output in the package test
suite (tests/testthat/test-fixture-vignettes.R, Example
3).
MC p-value calibration. The MC p shown here reflects
mc_iter = 500L in this CRAN vignette. Use the canonical run
with mc_iter = 25000L (chunk fit-canonical,
eval=FALSE) for publication-quality results.
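To see why the iteration count matters, a Monte Carlo permutation p-value can be sketched in base R. This is a simplified stand-in rather than the package's procedure: it assumes class labels are shuffled, the best mapping is re-searched on each shuffle, and the p-value uses the common (count + 1) / (B + 1) estimator, which may differ from what `oda_fit` reports. The names `searched_ess` and `mc_p` are hypothetical.

```r
set.seed(42)  # arbitrary seed for this sketch

cell_counts <- c(98, 13, 6, 7,  16, 50, 4, 19,  5, 2, 23, 14,  3, 8, 12, 45)
bio <- rep(rep(1:4, times = 4), times = cell_counts)  # class
aa  <- rep(rep(1:4, each  = 4), times = cell_counts)  # attribute

# ESS of the best unconstrained mapping for labels y and attribute x
searched_ess <- function(y, x, k = 4) {
  # k x k count table, rows = class, cols = category
  tab <- matrix(tabulate((x - 1) * k + y, nbins = k * k), nrow = k)
  rates <- tab / rowSums(tab)
  mean_pac <- 100 * sum(apply(rates, 2, max)) / k
  (mean_pac - 100 / k) / (100 - 100 / k) * 100
}

observed <- searched_ess(bio, aa)

# Permutation p-value with n_iter shuffles of the class labels
mc_p <- function(n_iter) {
  null_ess <- replicate(n_iter, searched_ess(sample(bio), aa))
  (sum(null_ess >= observed) + 1) / (n_iter + 1)
}

p_500 <- mc_p(500)

# Smallest value this estimator can return for a given iteration count
floor_500   <- 1 / (500 + 1)
floor_25000 <- 1 / (25000 + 1)

print(round(observed, 2))
print(p_500)
print(c(floor_500 = floor_500, floor_25000 = floor_25000))
```

With 500 iterations the estimate cannot fall below roughly 0.002, whereas 25000 iterations resolve p-values down to about 0.00004, so the larger run gives a more precise estimate for small p-values.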
Nondirectional search. No direction
argument is supplied. ODA evaluates all possible mappings from the four
amino-acid categories to the four biological-type classes and selects
the mapping that maximises ESS. This matches the MegaODA.exe gold run
(Hypothesis: NONDIRECTIONAL).
Optional directional analysis. A researcher with an
a priori convergent-validity hypothesis (amino acid type i
predicts biological type i) can supply
direction = "ascending" for a constrained identity-map
analysis (MPE Chapter 4 Phase 6C). For this dataset the two analyses
yield identical ESS and confusion because the identity mapping happens
to be the global optimum; they differ in MC interpretation (directional
vs. nondirectional p-value).
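The difference in MC interpretation can be illustrated with a base-R sketch that evaluates the same label permutations two ways: with the identity map held fixed (a directional test) and with the best map re-searched each time (a nondirectional test). This does not use the package's `direction` argument, whose internals are not shown here; `fixed_ess` and `searched_ess` are hypothetical helpers, and the search assumes unconstrained mappings that maximise mean PAC.

```r
set.seed(42)  # arbitrary seed for this sketch

cell_counts <- c(98, 13, 6, 7,  16, 50, 4, 19,  5, 2, 23, 14,  3, 8, 12, 45)
bio <- rep(rep(1:4, times = 4), times = cell_counts)  # class
aa  <- rep(rep(1:4, each  = 4), times = cell_counts)  # attribute

count_table <- function(y, x, k = 4) {
  matrix(tabulate((x - 1) * k + y, nbins = k * k), nrow = k)  # rows = class
}
to_ess <- function(mean_pac, k = 4) (mean_pac - 100 / k) / (100 - 100 / k) * 100

# Directional: the identity map is imposed, so only the diagonal counts as correct
fixed_ess <- function(y, x) {
  tab <- count_table(y, x)
  to_ess(mean(100 * diag(tab) / rowSums(tab)))
}

# Nondirectional: the best class is re-chosen for every category
searched_ess <- function(y, x) {
  rates <- count_table(y, x)
  rates <- rates / rowSums(rates)
  to_ess(100 * sum(apply(rates, 2, max)) / nrow(rates))
}

n_iter <- 300
perms <- replicate(n_iter, sample(bio))          # one shuffled label vector per column
null_fixed  <- apply(perms, 2, fixed_ess,    x = aa)
null_search <- apply(perms, 2, searched_ess, x = aa)

obs_fixed  <- fixed_ess(bio, aa)
obs_search <- searched_ess(bio, aa)

p_directional    <- (sum(null_fixed  >= obs_fixed)  + 1) / (n_iter + 1)
p_nondirectional <- (sum(null_search >= obs_search) + 1) / (n_iter + 1)

print(round(c(obs_fixed = obs_fixed, obs_search = obs_search), 2))
print(round(c(mean_null_fixed = mean(null_fixed),
              mean_null_search = mean(null_search)), 2))
print(c(p_directional = p_directional, p_nondirectional = p_nondirectional))
```

The observed ESS is the same under both analyses, but the searched null distribution sits at or above the fixed one on every permutation because the search can only improve on the identity map. The directional test is therefore the less conservative of the two whenever the a priori mapping is correct.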