Multiclass ODA: Convergent Validity of Protein Classification Methods

Research question

Nishikawa, Kubota, and Ooi (1983) independently classified each of 325 proteins into one of four mutually exclusive types using two different methods: one based on biological characteristics and one based on amino acid composition.¹ Because the two methods should theoretically converge on the same type for each protein, the directional hypothesis is that protein type codes are identical across methods, which would demonstrate convergent validity.

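Optimal Data Analysis (MultiODA) tests whether amino acid composition type discriminates biological type. The a priori prediction is that type codes match across methods; the model below is nevertheless fit without imposing that mapping, so the data can either confirm or contradict it.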

Data

Biological type (1-4) is the class variable; amino acid composition type (1-4) is the attribute. The published cell frequencies are expanded directly into observation-level vectors, so no external data file is required.

library(oda)

# Cross-classification: rows = biological type, cols = amino acid type.
# (column-major reconstruction matches published Table 1)
#                  AA=1  AA=2  AA=3  AA=4   total
#  Biological=1     98    16     5     3      122
#  Biological=2     13    50     2     8       73
#  Biological=3      6     4    23    12       45
#  Biological=4      7    19    14    45       85
#  total           124    89    44    68      325

biological_type <- c(
  rep(1L, 98), rep(2L, 13), rep(3L,  6), rep(4L,  7),  # amino_acid = 1
  rep(1L, 16), rep(2L, 50), rep(3L,  4), rep(4L, 19),  # amino_acid = 2
  rep(1L,  5), rep(2L,  2), rep(3L, 23), rep(4L, 14),  # amino_acid = 3
  rep(1L,  3), rep(2L,  8), rep(3L, 12), rep(4L, 45)   # amino_acid = 4
)
amino_acid_type <- c(rep(1L, 124), rep(2L, 89), rep(3L, 44), rep(4L, 68))

table(amino_acid_type, biological_type,
      dnn = c("Amino Acid Type (1-4)", "Biological Type (1-4)"))
#>                      Biological Type (1-4)
#> Amino Acid Type (1-4)  1  2  3  4
#>                     1 98 13  6  7
#>                     2 16 50  4 19
#>                     3  5  2 23 14
#>                     4  3  8 12 45
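
As an optional cross-check, the same observation-level vectors can be rebuilt from a 4 x 4 count matrix and compared against the published margins. This is a minimal base-R sketch; the names counts, bio_chk, and aa_chk are illustrative and are not part of the oda package.

```r
# Published cell counts: rows = biological type, cols = amino acid type
counts <- matrix(
  c(98, 16,  5,  3,
    13, 50,  2,  8,
     6,  4, 23, 12,
     7, 19, 14, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(bio = as.character(1:4), aa = as.character(1:4))
)

# Expand each cell into that many observations (column-major order,
# i.e. all amino_acid = 1 cases first, as in the vectors above)
bio_chk <- rep(as.vector(row(counts)), times = as.vector(counts))
aa_chk  <- rep(as.vector(col(counts)), times = as.vector(counts))

# The expansion must reproduce the sample size, both margins, and every cell
stopifnot(
  length(bio_chk) == 325L,
  all(rowSums(counts) == c(122, 73, 45, 85)),
  all(colSums(counts) == c(124, 89, 44, 68)),
  all(as.vector(table(bio_chk, aa_chk)) == as.vector(counts))
)
```

If any count were mistyped, stopifnot() would halt with the failing condition, which is a cheap guard before fitting.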

Fit the ODA model

Amino acid type is a four-category nominal variable. ODA searches all possible mappings from the four amino-acid-type categories to the four biological-type classes and selects the mapping that maximises the effect strength for sensitivity (ESS), a chance-corrected rescaling of mean per-class accuracy. No a priori direction is supplied, so the search is nondirectional (Hypothesis: NONDIRECTIONAL in MegaODA output). Leave-one-out (LOO) jackknife validity is requested via loo = "on", which reports the LOO confusion matrix and LOO ESS. No LOO p-value is given because no canonical Fisher-exact LOO p-value is defined for multicategorical class problems (C > 2).

# Canonical reference run (mc_iter = 25000L; not evaluated in CRAN vignette)
fit <- oda_fit(
  x         = amino_acid_type,
  y         = biological_type,
  attr_type = "categorical",
  mc_iter   = 25000L,
  loo       = "on"
)

# CRAN-safe run: mc_iter = 500L for vignette rendering speed.
# Training rule, ESS, and confusion matrix are identical to the canonical run.
fit <- oda_fit(
  x         = amino_acid_type,
  y         = biological_type,
  attr_type = "categorical",
  mc_iter   = 500L,
  mc_seed   = 42L,
  loo       = "on"
)
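
The mapping search described above can be sketched in base R without the package. This is an illustration only, not the oda implementation: it assumes every one of the 4^4 = 256 category-to-class assignments is admissible (including many-to-one mappings), and the names counts, maps, and ess_all are hypothetical.

```r
# Published cell counts: rows = biological type (class), cols = amino acid type
counts <- matrix(
  c(98, 16,  5,  3,
    13, 50,  2,  8,
     6,  4, 23, 12,
     7, 19, 14, 45),
  nrow = 4, byrow = TRUE
)
n_class <- rowSums(counts)

# Every assignment of the four attribute categories to the four classes;
# row i of maps gives the predicted class for categories 1..4
maps <- as.matrix(expand.grid(rep(list(1:4), 4)))

# Mean per-class accuracy (PAC, in percent) for each candidate mapping
mean_pac <- apply(maps, 1, function(map) {
  # Cases of class k that are correctly classified: cells (k, a) with map[a] == k
  correct <- vapply(1:4, function(k) sum(counts[k, map == k]), numeric(1))
  mean(100 * correct / n_class)
})

# ESS rescales mean PAC so that chance (25% for four classes) maps to 0
ess_all <- 100 * (mean_pac - 25) / (100 - 25)

best <- which.max(ess_all)
unname(maps[best, ])      # 1 2 3 4  (the identity mapping)
round(ess_all[best], 2)   # 50.96
```

Under these assumptions the exhaustive search lands on the identity mapping with the same ESS that the fitted model reports.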

Rule and confusion matrix

print(fit)
#> 
#> ODA (multiclass)  attr_type=categorical  priors=TRUE  n=325
#> 
#> Rule: 1 --> 1   |   2 --> 2   |   3 --> 3   |   4 --> 4
#> 
#>   CLASS     PAC
#>       1   80.3%
#>       2   68.5%
#>       3   51.1%
#>       4   52.9%
#> 
#>   Mean PAC: 63.22%   ESS: 50.96%  p(MC): < .001
#> 
#>   -- LOO --
#>   CLASS     PAC
#>       1   80.3%
#>       2   68.5%
#>       3   51.1%
#>       4   52.9%
#> 
#>   LOO Mean PAC: 63.22%   LOO ESS: 50.96%
#>   p(LOO): not reported for multicategorical ODA

ODA’s nondirectional search identified the identity mapping as the optimal categorical partition:

If amino acid type = 1 -> predict biological type = 1
If amino acid type = 2 -> predict biological type = 2
If amino acid type = 3 -> predict biological type = 3
If amino acid type = 4 -> predict biological type = 4

# Confusion matrix (actual x predicted); strip dimnames for clean display
conf_mat <- unname(fit$confusion)
rownames(conf_mat) <- paste0("Bio=", 1:4)
colnames(conf_mat) <- paste0("Pred=", 1:4)
print(conf_mat)
#>       Pred=1 Pred=2 Pred=3 Pred=4
#> Bio=1     98     16      5      3
#> Bio=2     13     50      2      8
#> Bio=3      6      4     23     12
#> Bio=4      7     19     14     45

ESS / PAC / PV interpretation

summary(fit)
#> 
#> ODA Summary (multiclass)  status=valid  n=325
#>   attr_type=categorical  priors=TRUE  weights=FALSE
#>   Rule: 1 --> 1   |   2 --> 2   |   3 --> 3   |   4 --> 4
#> 
#>   -- Train --
#>     Mean PAC: 63.22%   ESS: 50.96%
#>     p(MC): < .001  [MC permutation, two-tailed]
#>   -- LOO --
#>     CLASS     PAC
#>         1   80.3%
#>         2   68.5%
#>         3   51.1%
#>         4   52.9%
#>     LOO ESS: 50.96%
#>     LOO Mean PAC: 63.22%
#>     p(LOO): not reported for multicategorical ODA

m <- oda_metrics(fit)

# PAC (sensitivity) per class - pac_by_class is already on percentage scale
cat("PAC by biological type:\n")
#> PAC by biological type:
cat("  Type 1:", round(m$pac_by_class[1], 1), "%\n")
#>   Type 1: 80.3 %
cat("  Type 2:", round(m$pac_by_class[2], 1), "%\n")
#>   Type 2: 68.5 %
cat("  Type 3:", round(m$pac_by_class[3], 1), "%\n")
#>   Type 3: 51.1 %
cat("  Type 4:", round(m$pac_by_class[4], 1), "%\n")
#>   Type 4: 52.9 %

# Predictive value: diagonal / column sums
pv <- diag(fit$confusion) / colSums(fit$confusion) * 100
cat("\nPV by biological type:\n")
#> 
#> PV by biological type:
cat("  Type 1:", round(pv[1], 1), "%\n")
#>   Type 1: 79 %
cat("  Type 2:", round(pv[2], 1), "%\n")
#>   Type 2: 56.2 %
cat("  Type 3:", round(pv[3], 1), "%\n")
#>   Type 3: 52.3 %
cat("  Type 4:", round(pv[4], 1), "%\n")
#>   Type 4: 66.2 %

PAC (sensitivity per class): 80.3%, 68.5%, 51.1%, and 52.9% for protein types 1 through 4, respectively. With four classes, 25% correct per class is expected by chance, so every type is classified well above chance. Each PAC also exceeds 50%, meaning a majority of the proteins in every class are classified correctly.
ESS = 50.96% indicates a relatively strong effect.² ESS rescales the mean PAC (63.22%) so that 0% corresponds to chance-level accuracy (25% here) and 100% to perfect classification.
PV: When the model predicts type 1 it is correct ~79.0% of the time; type 2, ~56.2%; type 3, ~52.3%; type 4, ~66.2%. Each predictive value exceeds the corresponding class base rate in the sample (37.5%, 22.5%, 13.8%, and 26.2%), which is the accuracy expected if that label were assigned at random.
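
The figures quoted above follow from simple arithmetic on the confusion matrix. The base-R sketch below reproduces them by hand; the object names are illustrative, and the matrix is typed in from the printed output rather than taken from fit.

```r
# Confusion matrix from the output above: rows = actual, cols = predicted
conf <- matrix(
  c(98, 16,  5,  3,
    13, 50,  2,  8,
     6,  4, 23, 12,
     7, 19, 14, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = paste0("Bio=", 1:4), predicted = paste0("Pred=", 1:4))
)

pac       <- 100 * diag(conf) / rowSums(conf)   # sensitivity per actual class
pv        <- 100 * diag(conf) / colSums(conf)   # predictive value per predicted class
base_rate <- 100 * rowSums(conf) / sum(conf)    # share of each class in the sample

# ESS: mean PAC rescaled so chance (100 / C) maps to 0 and perfect to 100
chance <- 100 / nrow(conf)
ess    <- 100 * (mean(pac) - chance) / (100 - chance)

round(pac, 1)        # 80.3 68.5 51.1 52.9
round(mean(pac), 2)  # 63.22
round(ess, 2)        # 50.96
round(pv, 1)         # 79.0 56.2 52.3 66.2
round(base_rate, 1)  # 37.5 22.5 13.8 26.2
```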

Monte Carlo and LOO validity

The MC p-value and LOO results are shown in the print and summary output above.

MC p-value: The printed p(MC) is a nondirectional Fisher-randomization p-value. Each permutation searches for the best categorical mapping of the permuted labels, matching the nondirectional search used for the training model. Interpret by decision threshold (e.g., p < 0.05).
LOO jackknife: Leave-one-out ESS and Mean PAC are shown. Each fold holds out one observation, searches for the optimal categorical mapping on the remaining n-1 observations (nondirectional, equal priors), and classifies the held-out case using that fold’s rule. Because the identity mapping remains optimal when any single observation is removed, every fold recovers the same rule, each held-out case is classified exactly as in training, and LOO ESS equals training ESS exactly. This confirms the model is stable across folds and no single observation drives the result.
LOO p-value: No LOO Fisher-exact p-value is reported for multicategorical (C > 2) class problems. No canonical reference distribution is defined for the C x C LOO confusion matrix in this context. For binary class ODA, a one-tailed Fisher exact p-value is available (MPE p. 34).
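
Both procedures can be sketched in base R to make the logic concrete. This is not the package's implementation: it assumes unconstrained category-to-class mappings, uses only 200 permutations for speed, and all function and object names (best_rule, ess_of, and so on) are hypothetical.

```r
# Published counts: rows = biological type (class), cols = amino acid type
counts <- matrix(c(98, 16,  5,  3,
                   13, 50,  2,  8,
                    6,  4, 23, 12,
                    7, 19, 14, 45), nrow = 4, byrow = TRUE)

# Mean PAC is a sum over attribute categories, so the best unconstrained
# mapping sends each category to the class with the largest within-class share
best_rule <- function(tab) unname(apply(tab / rowSums(tab), 2, which.max))

# ESS of a rule: chance-corrected mean per-class accuracy on a 0-100 scale
ess_of <- function(tab, rule) {
  k   <- nrow(tab)
  pac <- vapply(seq_len(k),
                function(cl) sum(tab[cl, rule == cl]) / sum(tab[cl, ]),
                numeric(1))
  100 * (mean(pac) - 1 / k) / (1 - 1 / k)
}

train_ess <- ess_of(counts, best_rule(counts))

# LOO: cases within a cell are identical, so one fold per non-empty cell
# stands in for all counts[cl, a] folds that hold out a case from that cell
loo_conf <- matrix(0, 4, 4)
for (cl in 1:4) for (a in 1:4) {
  if (counts[cl, a] == 0) next
  fold <- counts
  fold[cl, a] <- fold[cl, a] - 1          # hold out one case
  pred <- best_rule(fold)[a]              # classify it with the fold's rule
  loo_conf[cl, pred] <- loo_conf[cl, pred] + counts[cl, a]
}
loo_ess <- ess_of(loo_conf, 1:4)          # rows = actual, cols = predicted

# MC: permute class labels and repeat the full mapping search each time
bio <- rep(as.vector(row(counts)), times = as.vector(counts))
aa  <- rep(as.vector(col(counts)), times = as.vector(counts))
set.seed(42)
perm_ess <- replicate(200, {
  tab <- table(factor(sample(bio), levels = 1:4), factor(aa, levels = 1:4))
  ess_of(tab, best_rule(tab))
})
p_mc <- mean(perm_ess >= train_ess)       # share of permutations at least as extreme
```

In this sketch loo_conf matches the training confusion matrix cell for cell, so loo_ess equals train_ess, and the permuted ESS values fall far below the observed one.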

Notes on reproducibility

Fixture parity. The training rule, confusion matrix, and ESS are verified against MegaODA.exe output in the package test suite (tests/testthat/test-fixture-vignettes.R, Example 3).

MC p-value calibration. The MC p shown here reflects mc_iter = 500L in this CRAN vignette. Use the canonical run with mc_iter = 25000L (chunk fit-canonical, eval=FALSE) for publication-quality results.
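
To see why the iteration count matters, consider the case where no permuted ESS reaches the observed value (an assumption here; the printed p(MC) < .001 suggests it but the output does not state the count). The exact one-sided binomial upper bound on the true p-value is then 1 - alpha^(1 / mc_iter). The helper below is a hypothetical illustration, not an oda function.

```r
# Upper confidence bound on the true permutation p-value when zero of
# mc_iter permutations are at least as extreme as the observed statistic
mc_upper_bound <- function(mc_iter, conf_level = 0.95) {
  1 - (1 - conf_level)^(1 / mc_iter)
}

bounds <- mc_upper_bound(c(500, 25000))
signif(bounds, 2)   # roughly 0.006 for 500 iterations, 0.00012 for 25000
```

With 500 iterations the data cannot by themselves support a claim much below p = .006, whereas 25000 iterations tighten the bound by a factor of about 50.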

Nondirectional search. No direction argument is supplied. ODA evaluates all possible mappings from the four amino-acid categories to the four biological-type classes and selects the mapping that maximises ESS. This matches the MegaODA.exe gold run (Hypothesis: NONDIRECTIONAL).

Optional directional analysis. A researcher with an a priori convergent-validity hypothesis (amino acid type i predicts biological type i) can supply direction = "ascending" for a constrained identity-map analysis (MPE Chapter 4 Phase 6C). For this dataset the two analyses yield identical ESS and confusion because the identity mapping happens to be the global optimum; they differ only in the MC interpretation (directional vs. nondirectional p-value).
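
The difference in MC interpretation can be illustrated with a base-R sketch of the two null distributions. It makes no claim about how oda implements direction = "ascending": it simply assumes the directional test scores the fixed identity mapping on each permuted dataset, while the nondirectional test scores the best unconstrained mapping. All names are hypothetical.

```r
# Published counts: rows = biological type (class), cols = amino acid type
counts <- matrix(c(98, 16,  5,  3,
                   13, 50,  2,  8,
                    6,  4, 23, 12,
                    7, 19, 14, 45), nrow = 4, byrow = TRUE)
bio     <- rep(as.vector(row(counts)), times = as.vector(counts))
aa      <- rep(as.vector(col(counts)), times = as.vector(counts))
class_n <- as.vector(rowSums(counts))   # class sizes are fixed under permutation

# ESS for four classes from a mean PAC expressed as a proportion
ess_from_mean_pac <- function(mean_pac) 100 * (mean_pac - 0.25) / 0.75

observed <- ess_from_mean_pac(mean(diag(counts / class_n)))

set.seed(42)
null_ess <- t(replicate(200, {
  tab  <- table(factor(sample(bio), levels = 1:4), factor(aa, levels = 1:4))
  prop <- unclass(tab) / class_n        # within-class proportions
  c(directional    = ess_from_mean_pac(mean(diag(prop))),
    nondirectional = ess_from_mean_pac(sum(apply(prop, 2, max)) / 4))
}))

# Upper tail of each null distribution, and the resulting p-value estimates
apply(null_ess, 2, quantile, probs = 0.95)
p_dir    <- mean(null_ess[, "directional"]    >= observed)
p_nondir <- mean(null_ess[, "nondirectional"] >= observed)
```

Because the nondirectional statistic is a maximum over mappings, its null values are never smaller than the directional ones, so the nondirectional p-value is the more conservative of the two.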
