The perceptron partitions the input space with a hyperplane:
\[\mathbf{w}^\top \mathbf{x} + b = 0\]
Note
In 2D (\(x_1, x_2\)), the boundary is a line \(w_1 x_1 + w_2 x_2 + b = 0\), i.e. \[x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}\] Points on one side are classified as 1, on the other as 0.
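As a quick numeric sketch (with hypothetical weights `w_demo = (1, 1)` and bias `b_demo = -1.5`, chosen to match the ideal AND boundary \(x_1 + x_2 - 1.5 = 0\) discussed below), the slope and intercept follow directly from the formula:

```r
# Hypothetical 2D weights and bias (any w2 != 0 works the same way)
w_demo <- c(1, 1)
b_demo <- -1.5

slope     <- -w_demo[1] / w_demo[2]   # -w1 / w2
intercept <- -b_demo / w_demo[2]      # -b / w2

# Any point on the line x2 = slope * x1 + intercept lies on the boundary
x1 <- 0.5
x2 <- slope * x1 + intercept
w_demo[1] * x1 + w_demo[2] * x2 + b_demo   # 0 (up to floating point)
```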
The perceptron convergence theorem guarantees that the algorithm finds a separating hyperplane in a finite number of steps — if one exists.
This is linearly separable — a single line can separate the one positive case \((1,1)\) from the three negatives. The ideal boundary: \(x_1 + x_2 - 1.5 = 0\).
# The CFA Institute's monograph on AI by Joseph Simonian, PhD gave me this idea
# to understand better and share my insights about the perceptron and neural
# networks in investment management.

# Input space: all combinations of two binary inputs
X <- as.matrix(expand.grid(x1 = c(0, 1), x2 = c(0, 1)))

# AND gate target
y <- apply(X, 1, function(x) as.numeric(x[1] & x[2]))

data.frame(x1 = X[, 1], x2 = X[, 2], y = y)
In matrix form across all \(n\) samples simultaneously:
\[\hat{\mathbf{y}} = \mathbb{I}(X\mathbf{w} + b \geq 0) \in \{0,1\}^n\]
# Beautiful matrix algebra: one line implements the Heaviside function
predict_perceptron <- function(X, w, b) {
  as.numeric((X %*% w + b) >= 0)
}

# Trace through with our AND gate data
z     <- X %*% weights + bias
fired <- z >= 0
preds <- as.numeric(fired)

data.frame(x1 = X[, 1], x2 = X[, 2],
           z = round(z, 3),
           `z >= 0` = fired,
           `ŷ` = preds,
           y = y,
           correct = preds == y,
           check.names = FALSE)
The fit function loops over all samples for a fixed number of epochs (complete passes through the data):
for each epoch:
    for each sample (xᵢ, yᵢ):
        ŷᵢ ← φ(w⊤xᵢ + b)          # predict
        δᵢ ← ŷᵢ − yᵢ              # error
        if δᵢ ≠ 0:
            w ← w − η · δᵢ · xᵢ   # update weights
            b ← b − η · δᵢ        # update bias
    if no errors this epoch → STOP (converged)
Early stopping: if an entire epoch passes with zero misclassifications, the data is perfectly separated and training terminates.
fit_perceptron <- function(X, y, w, b, lr = 0.1, epochs = 100) {
  history <- numeric(epochs)
  for (e in seq_len(epochs)) {
    err <- 0
    for (i in seq_len(nrow(X))) {
      v     <- sum(X[i, ] * w) + b    # net input
      y_hat <- ifelse(v < 0, 0, 1)    # step function
      delta <- y_hat - y[i]           # error
      if (delta != 0) {
        err <- err + 1
        w <- w - lr * delta * X[i, ]  # weight update
        b <- b - lr * delta           # bias update
      }
    }
    history[e] <- err
    if (err == 0) { cat("Converged at epoch:", e, "\n"); break }
  }
  list(w = w, b = b, history = history[1:e])
}
After training, the perceptron correctly classifies all four AND gate combinations.
Learned parameters (example, see R code):
| Parameter | Value  |
|-----------|-------:|
| \(w_1\)   | +0.14  |
| \(w_2\)   | +0.17  |
| \(b\)     | −0.241 |
The implied decision boundary \(w_1 x_1 + w_2 x_2 + b = 0\) correctly places \((1,1)\) above the line (output = 1) and all other points below it (output = 0).
The AND gate is linearly separable, so the perceptron is guaranteed to converge.
set.seed(123)
w <- matrix(rnorm(2), 2, 1)
b <- rnorm(1)
model_and <- fit_perceptron(X, y, w, b, lr = 0.1, epochs = 100)
The perceptron finds a straight line separating the single positive case \((1,1)\) from the three negatives.
Blue region → predicted 0 (no fire)
Red region → predicted 1 (fire)
The line is the learned decision boundary \(w_1 x_1 + w_2 x_2 + b = 0\)
The AND gate is a toy problem chosen because it is guaranteed to be linearly separable. Real credit data is far more complex — but the same mathematical machinery applies.
library(ggplot2)
library(RColorBrewer)

# Set1 gives strongly opposing colours -- see https://colorbrewer2.org/
pal <- brewer.pal(3, "Set1")

grid_and <- expand.grid(x1 = seq(-0.2, 1.2, 0.01),
                        x2 = seq(-0.2, 1.2, 0.01))
grid_and$pred <- as.numeric(as.matrix(grid_and) %*% model_and$w + model_and$b >= 0)

pts <- data.frame(x1 = X[, 1], x2 = X[, 2],
                  y = factor(y, labels = c("0 (no fire)", "1 (fire)")))

ggplot() +
  geom_tile(data = grid_and,
            aes(x1, x2, fill = factor(pred)), alpha = 0.25) +
  geom_point(data = pts, aes(x1, x2, colour = y, shape = y), size = 6) +
  geom_abline(intercept = -model_and$b / model_and$w[2],
              slope = -model_and$w[1] / model_and$w[2],
              colour = "white", linewidth = 1, linetype = "dashed") +
  scale_fill_manual(values = c("0" = pal[2], "1" = pal[1]),
                    labels = c("Pred: 0", "Pred: 1"), name = "Region") +
  scale_colour_manual(values = c("0 (no fire)" = pal[2],
                                 "1 (fire)" = pal[1]), name = "Target") +
  scale_shape_manual(values = c(1, 3)) +
  coord_fixed(xlim = c(-0.2, 1.2), ylim = c(-0.2, 1.2)) +
  labs(title = "AND gate — Learned decision boundary",
       x = "x1", y = "x2")
The perceptron draws a straight line in the DSCR × Debt/EBITDA plane — exactly as it did in the AND gate, but now on real financial data.
Blue zone → predicted: No default
Red zone → predicted: Default
Points in the wrong zone are misclassifications. The overlap of the two groups is irreducible with a linear classifier — motivating more powerful models (logistic regression, random forest).
As many as 98.4% of actual defaulters are caught; in credit risk, missed defaults are the costliest errors. However, only 44.8% of non-defaulters are correctly classified. Can adding more features help?
We enrich the model with two additional financial ratios:
| Variable | Description                 | No Default | Default |
|----------|-----------------------------|-----------:|--------:|
| DSCR     | Debt Service Coverage Ratio | 1.95       | 1.04    |
| DEBITDA  | Debt / EBITDA multiple      | 4.48       | 7.61    |
| ROS      | Return on Sales (%)         | 10.2       | 4.6     |
| DBR      | Debt-to-Revenue ratio (%)   | 38.6       | 71.8    |
The decision boundary is now a hyperplane in 4D — no longer directly visualisable, but the perceptron still finds a linear combination of the four ratios.
Note
The fit_perceptron function requires no changes — only the number of input columns grows from 2 to 4. The learning rule is identical.
Comparing training errors per epoch for both models:
Both curves drop sharply in the first 20–30 epochs
The 4-predictor model (blue) settles at a lower error floor
Neither reaches zero — the classes are not linearly separable in real data
This non-zero floor is the signature of irreducible overlap in the feature space. It motivates more flexible classifiers — the topic of the next presentation.
The biggest gain is recall — the 4-predictor model catches 1.6 percentage points more actual defaulters. This is the metric that matters most in credit risk management, though the improvement here is modest.
Dominant risk factor: excessive leverage relative to earnings
| Variable | Weight  | Interpretation                                            |
|----------|--------:|-----------------------------------------------------------|
| ROS      | −0.0765 | Profitable firms generate cash to service obligations      |
| DBR      | +0.360  | High debt burden relative to revenues is a warning signal  |
This structure is essentially a linear scorecard — identical in spirit to Altman’s Z-score or traditional expert-designed credit rating models, but learned entirely from data.
Note
Under EBA IRB guidelines and IFRS 9, model explainability is mandatory. The perceptron’s weights satisfy this requirement directly.
Linear boundary only — cannot capture interaction effects
No probability output — just 0/1 (no PD estimate)
Precision still limited (~26.9%) on real credit data
Sensitive to feature scaling
Cannot solve XOR-type problems
In the next part
The same step-by-step logic — applied to a Multi-Layer Perceptron (MLP):
| Addition           | What it unlocks         |
|--------------------|-------------------------|
| Hidden layer       | Non-linear boundaries   |
| Sigmoid activation | Probability output (PD) |
| Backpropagation    | Gradient-based learning |
| Multiple layers    | Deep representations    |
Regulatory reminder
The perceptron is fully auditable — its weights are the model. More complex models (XGBoost, MLP) require post-hoc explainability tools (SHAP values) to meet EBA and IFRS 9 requirements.
Key takeaways
The perceptron is a single linear classifier: weighted inputs + threshold = binary prediction
The AND gate shows the mechanics: same algorithm, 4 observations, guaranteed convergence
On 690 real loans, the same code achieves ~50% accuracy and ~100% recall with no feature engineering
2 predictors give a visible 2D boundary; 4 predictors raise default recall slightly
Debt/EBITDA is the dominant default signal — the model discovers this automatically
The weight vector is a data-driven scorecard — interpretable, auditable, regulatory-compliant
The non-zero error floor signals non-linear overlap → motivation for the MLP (next session)
“Start simple. A model that can be explained to a regulator is worth more than one that cannot.”

“Understanding the simplest model deeply is more valuable than applying complex models blindly.”
Part V: From perceptron to neural network
Why one neuron is not enough
The perceptron left us with two problems:
Problem 1 — Linear boundary only
Real credit data is not linearly separable. Defaulters and non-defaulters overlap in every feature space. A single hyperplane cannot capture that complexity.
Problem 2 — No probability output
The perceptron outputs 0 or 1. Credit decisions require a probability of default (PD) — a number in \([0,1]\) that can be thresholded, stress-tested, and reported under IFRS 9.
Compare with the perceptron: 5 parameters (\(4\) weights + \(1\) bias). The hidden layer adds just 8 more — but unlocks non-linear boundaries and probability outputs.
The Sigmoid activation
We replace the Heaviside step function \(\phi\) with the sigmoid:
\[\sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)\]
Why sigmoid instead of step?
| Property                   | Step \(\phi\) | Sigmoid \(\sigma\) |
|----------------------------|---------------|--------------------|
| Output range               | \(\{0, 1\}\)  | \((0, 1)\)         |
| Differentiable             | ✗             | ✓                  |
| Probability interpretation | ✗             | ✓                  |
| Gradient-based learning    | ✗             | ✓                  |
The sigmoid is smooth and differentiable — this allows us to compute gradients and use backpropagation to train the network.
Key values to remember:
\[\sigma(0) = 0.5 \qquad \text{(decision threshold)}\]
\[\sigma(z) \to 1 \text{ as } z \to +\infty\]
\[\sigma(z) \to 0 \text{ as } z \to -\infty\]
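Expressed in code (a minimal sketch; `sigmoid_deriv` anticipates the derivative \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) needed later for backpropagation):

```r
# Sigmoid and its derivative
sigmoid <- function(z) 1 / (1 + exp(-z))

# Derivative used by backpropagation: sigma'(z) = sigma(z) * (1 - sigma(z))
sigmoid_deriv <- function(z) sigmoid(z) * (1 - sigmoid(z))

sigmoid(0)        # 0.5 -- the decision threshold
sigmoid(10)       # ~1
sigmoid(-10)      # ~0
sigmoid_deriv(0)  # 0.25 -- the gradient is largest at the threshold
```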
Why not just count errors? The step function makes error counts non-differentiable — gradients are zero everywhere except at the threshold. BCE is smooth, always differentiable, and penalises confident wrong predictions much more harshly than uncertain ones.
| Prediction \(\hat{y}\) | True label \(y\) | BCE contribution |
|------------------------|------------------|------------------|
| 0.95 | 1 | \(-\log(0.95) \approx 0.05\) — small, correct |
| 0.50 | 1 | \(-\log(0.50) \approx 0.69\) — uncertain |
| 0.05 | 1 | \(-\log(0.05) \approx 3.00\) — severely penalised |
| 0.05 | 0 | \(-\log(0.95) \approx 0.05\) — small, correct |
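The per-observation contributions above can be reproduced with a small helper (a sketch; the `eps` clipping guards against `log(0)`):

```r
# Binary cross-entropy for a single prediction
bce <- function(y_hat, y, eps = 1e-12) {
  y_hat <- pmin(pmax(y_hat, eps), 1 - eps)  # avoid log(0)
  -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

round(bce(0.95, 1), 2)  # 0.05 -- confident and correct
round(bce(0.50, 1), 2)  # 0.69 -- uncertain
round(bce(0.05, 1), 2)  # 3.00 -- confident and wrong: severely penalised
round(bce(0.05, 0), 2)  # 0.05 -- confident and correct
```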
Backpropagation — the key equations
Training minimises \(\mathcal{L}\) via gradient descent. We need \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1}\), \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}_2}\), etc.
The chain rule in action: the output error \(\boldsymbol{\delta}^{(2)}\) is propagated backwards through \(\mathbf{W}_2\), scaled by the sigmoid derivative at the hidden layer. This is why the algorithm is called backpropagation.
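A minimal numeric sketch of these equations, assuming the 4→2→1 shapes used later (\(\mathbf{W}_1\): 2×4, \(\mathbf{W}_2\): 1×2) and toy random data rather than the credit set:

```r
# One forward and backward pass through a 4-2-1 sigmoid network
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
W1 <- matrix(rnorm(8, sd = 0.5), nrow = 2, ncol = 4)
b1 <- rnorm(2, sd = 0.5)
W2 <- matrix(rnorm(2, sd = 0.5), nrow = 1, ncol = 2)
b2 <- rnorm(1, sd = 0.5)
x  <- c(0.3, 0.7, 0.1, 0.5)   # one normalised sample (toy values)
y  <- 1                       # true label

# Forward pass
h     <- sigmoid(W1 %*% x + b1)   # hidden activations (2x1)
y_hat <- sigmoid(W2 %*% h + b2)   # output PD (1x1)

# Backward pass: with a sigmoid output and BCE loss, the output-layer
# error simplifies to delta2 = y_hat - y
delta2  <- y_hat - y
grad_W2 <- delta2 %*% t(h)                  # dL/dW2 (1x2)
grad_b2 <- as.numeric(delta2)               # dL/db2
delta1  <- t(W2) %*% delta2 * h * (1 - h)   # error pushed back through W2
grad_W1 <- delta1 %*% t(x)                  # dL/dW1 (2x4)
grad_b1 <- as.numeric(delta1)               # dL/db1
```

The line computing `delta1` is the chain rule from the text: the output error travels backwards through \(\mathbf{W}_2\) and is scaled by the sigmoid derivative \(h(1-h)\) at the hidden layer.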
The same four predictors used in the 4-predictor perceptron:
| Variable | Description                 |
|----------|-----------------------------|
| DSCR     | Debt Service Coverage Ratio |
| DEBITDA  | Debt / EBITDA multiple      |
| ROS      | Return on Sales (%)         |
| DBR      | Debt-to-Revenue ratio (%)   |
Pre-processing steps (identical to the perceptron):
Normalise each feature to \([0, 1]\) using min-max scaling
Use the pre-split Train / Test sets (345 / 345)
Label: Si = 1 (default), No = 0 (no default)
The neural network uses the same normalised matrices \(X_{\text{train}}\), \(X_{\text{test}}\) already prepared for the 4-predictor perceptron — no additional pre-processing needed.
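For reference, the min-max scaling from step 1 can be sketched as follows (illustrative helper names; in practice the min/max are fitted on the training set only and reused on the test set to avoid leakage):

```r
# Min-max scaling to [0, 1]
minmax_fit <- function(X) {
  list(min = apply(X, 2, min), max = apply(X, 2, max))
}
minmax_apply <- function(X, f) {
  sweep(sweep(X, 2, f$min, "-"), 2, f$max - f$min, "/")
}

# Toy 2x2 example with made-up DSCR / DEBITDA values
X_raw <- matrix(c(1.95, 1.04, 4.48, 7.61), nrow = 2,
                dimnames = list(NULL, c("DSCR", "DEBITDA")))
scaler   <- minmax_fit(X_raw)
X_scaled <- minmax_apply(X_raw, scaler)
X_scaled   # each column now spans [0, 1]
```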
# Reuse the 4-predictor normalised matrices from Part IV
# (X_tr4, y_tr4, X_te4, y_te4 are already in the environment)
cat("Training set: ", nrow(X_tr4), "observations,",
    ncol(X_tr4), "features\n")
Weights initialised with Xavier scaling: \(\mathbf{w} \sim \mathcal{N}(0, 2/d_{\text{in}})\)
Xavier initialisation sets the variance of initial weights inversely proportional to the input dimension — this prevents activations from vanishing or exploding at the start of training.
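In R this initialisation can be sketched as follows (note that `rnorm()` takes a standard deviation, hence `sqrt(2 / d_in)`; the variable names are illustrative):

```r
# Initial weights drawn as w ~ N(0, 2 / d_in), per the scaling above
set.seed(42)
d_in <- 4; d_hidden <- 2
W1 <- matrix(rnorm(d_hidden * d_in, mean = 0, sd = sqrt(2 / d_in)),
             nrow = d_hidden, ncol = d_in)
b1 <- rep(0, d_hidden)   # biases are commonly initialised at zero
round(W1, 3)
```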
After 500 epochs the network converges to a stable loss, with accuracy well above the perceptron baseline.
Two panels show how the network learns over 500 epochs:
Left — accuracy curve: training accuracy rises to ~85% and stabilises. Compare with the perceptron’s ~50%; the neural network’s advantage is clearest in recall on defaults.
Right — loss curve: binary cross-entropy drops steeply in the first ~100 epochs, then gradually flattens. The smooth descent (no oscillation) reflects gradient descent with a well-chosen learning rate.
Tip
A smooth, monotonically decreasing loss is the hallmark of a well-behaved network. Oscillation or divergence would signal that the learning rate \(\eta\) is too large.
The two hidden neurons transform the four inputs into a new 2D space \((h_1, h_2)\).
Key insight: the hidden layer learns a new representation of the borrowers — a non-linear projection into 2D that makes the default/no-default separation easier for the output neuron.
Plot interpretation:
Each point is one borrower from the test set
Blue circles = no default, red crosses = default
Notice the better clustering compared to the raw DSCR × Debt/EBITDA scatter — this is the network learning to compress and reorganise the information
Unlike the perceptron (which outputs 0 or 1), the neural network outputs a continuous PD score \(\hat{y} \in (0,1)\).
This enables:
Risk tiering: rank borrowers from lowest to highest PD
IFRS 9 staging: Stage 1 / 2 / 3 thresholds based on PD level
Stress testing: shift the PD distribution under adverse scenarios
Expected loss: \(EL = PD \times LGD \times EAD\)
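As a toy illustration of the expected-loss formula (illustrative numbers, not taken from the dataset):

```r
# Expected loss from a predicted PD: EL = PD * LGD * EAD
pd  <- 0.08   # model's predicted probability of default
lgd <- 0.45   # loss given default: share of exposure lost if default occurs
ead <- 1e6    # exposure at default, in currency units
el  <- pd * lgd * ead
el   # ~36,000
```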
The distribution of predicted PDs shows clear separation between the two groups — defaulters cluster toward higher scores, non-defaulters toward lower scores.
The nnet package (a recommended package shipped with every R distribution) implements exactly our 4→2→1 architecture with a single function call.
Key arguments:
| Argument | Value     | Meaning                              |
|----------|-----------|--------------------------------------|
| formula  | label ~ . | predict label from all other columns |
| size     | 2         | number of hidden neurons             |
| linout   | FALSE     | sigmoid output (logistic)            |
| decay    | 0.01      | L2 regularisation (weight decay)     |
| maxit    | 500       | maximum iterations                   |
Weight decay (L2 regularisation) adds a penalty \(\lambda \|\mathbf{W}\|^2\) to the loss — it shrinks weights toward zero and reduces overfitting. Our hand-coded network did not include this; nnet handles it automatically.
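The penalty term is simple enough to compute by hand (a sketch with a made-up weight matrix, matching nnet's `decay = 0.01`):

```r
# L2 weight-decay penalty added to the loss: lambda * ||W||^2
lambda  <- 0.01                               # nnet's decay argument
W       <- matrix(c(0.5, -1.2, 0.3, 0.8), 2, 2)  # hypothetical weights
penalty <- lambda * sum(W^2)
penalty   # ~0.0242
```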
library(nnet)

# Prepare data frames (nnet prefers data frames with named columns)
train_df <- data.frame(X_tr4, label = y_tr4)
test_df  <- data.frame(X_te4, label = y_te4)

# Case weights: upweight the minority class (defaults) ~5x
case_w <- ifelse(train_df$label == 1, 5, 1)

set.seed(42)
nn_nnet <- nnet(
  label ~ .,
  data    = train_df,
  weights = case_w,   # class imbalance correction
  size    = 2,        # 2 hidden neurons
  linout  = FALSE,    # sigmoid output
  decay   = 0.01,     # L2 regularisation
  maxit   = 500,      # max iterations
  trace   = FALSE     # suppress iteration output
)
cat("Network summary:\n")
Network summary:
print(nn_nnet)
a 4-2-1 network with 13 weights
inputs: DSCR DEBITDA ROS DBR
output(s): label
options were - decay=0.01
The nnet model also outputs continuous PD scores. The distribution plot (below) shows the same qualitative separation as our hand-coded network:
Non-defaulters (blue circles) cluster near PD = 0
Defaulters (red crosses) cluster near PD = 1
The overlap region (PD ≈ 0.3–0.7) represents genuinely ambiguous borrowers — cases where even a well-trained model cannot be certain
Under IFRS 9, these borderline borrowers (PD in the grey zone) would typically be placed in Stage 2 — significant increase in credit risk — and provisioned accordingly.
The neural network’s hidden layer introduces a black-box element — the weights \(\mathbf{W}_1\) do not map directly to a credit rule. In regulated applications, SHAP values or LIME are used to explain individual predictions. The perceptron remains the gold standard for explainability; the neural network trades some of that for better predictive power.
Key takeaways — Part V
A single hidden layer with 2 sigmoid neurons is enough to move from a linear to a non-linear decision boundary
The sigmoid replaces the step function: outputs PD ∈ (0,1) instead of \(\{0,1\}\) — essential for IFRS 9 provisioning
Backpropagation is the chain rule applied layer by layer: output error propagated backwards, scaled by sigmoid derivatives
Our hand-coded network and nnet reach the same result — nnet adds L2 regularisation and a faster optimiser
Recall on defaults improves from 41.7% → 50%: the non-linear boundary catches borderline borrowers the perceptron misses
The hidden layer is a learned representation — it compresses 4 financial ratios into 2 activations that best predict default
More complex models trade interpretability for performance — SHAP values are the regulatory bridge
“The perceptron asks: does this borrower cross the line? The neural network asks: where in the space does this borrower truly belong?”
Part VI: Model Comparison — ROC Curves
Why ROC Curves?
All four models were evaluated so far at a fixed threshold (0.5 for neural networks, sign of the linear score for perceptrons).
But threshold choice is a business decision:
A conservative credit officer sets a low threshold → catches more defaults, but rejects more good borrowers
A growth-oriented officer sets a high threshold → approves more, but misses more defaults
The ROC curve shows model performance across all possible thresholds simultaneously — separating model quality from threshold choice.
The ROC curve requires a continuous ranking score, not a hard 0/1 prediction.
| Model | Ranking score used | Rationale |
|-------|--------------------|-----------|
| Perceptron 2-pred | \(s = \mathbf{w}^\top \mathbf{x} + b\) | Raw linear activation before the step function |
| Perceptron 4-pred | \(s = \mathbf{w}^\top \mathbf{x} + b\) | Raw linear activation before the step function |
| NN hand-coded | \(\hat{y} = \sigma(z^{(2)})\) | Sigmoid output — already in \((0,1)\) |
| NN nnet | \(\hat{y} = \sigma(z^{(2)})\) | Sigmoid output via predict(..., type = "raw") |
Note
Using the raw linear score \(\mathbf{w}^\top\mathbf{x} + b\) for the perceptrons is the correct approach for ROC analysis. Applying the step function first collapses the score to \(\{0,1\}\), which produces a degenerate ROC curve with only three points.
We compute ROC curves from scratch — no external packages needed. The algorithm:
1. Sort observations by score (descending)
2. For each unique threshold t:
   a. Predict: label = 1 if score ≥ t, else 0
   b. Compute TPR = TP / (TP + FN)
   c. Compute FPR = FP / (FP + TN)
3. Plot TPR vs FPR
4. AUC = area under the curve (trapezoidal rule)
The diagonal (FPR = TPR) represents a random classifier — AUC = 0.5. Any model above the diagonal has discriminatory power. The top-left corner is the ideal point (TPR = 1, FPR = 0).
# ROC curve and AUC from scratch
roc_curve <- function(scores, labels) {
  # Sort by descending score
  ord    <- order(scores, decreasing = TRUE)
  scores <- scores[ord]
  labels <- labels[ord]
  n_pos  <- sum(labels)
  n_neg  <- sum(1 - labels)

  # Unique thresholds (include -Inf so the last point is (1, 1))
  thresholds <- c(Inf, unique(scores), -Inf)
  tpr <- fpr <- numeric(length(thresholds))
  for (k in seq_along(thresholds)) {
    pred   <- as.integer(scores >= thresholds[k])
    tp     <- sum(pred == 1 & labels == 1)
    fp     <- sum(pred == 1 & labels == 0)
    tpr[k] <- tp / n_pos
    fpr[k] <- fp / n_neg
  }

  # AUC via the trapezoidal rule
  auc <- abs(sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2))
  data.frame(fpr = fpr, tpr = tpr, auc = auc)
}

# Compute for all four models
roc_p2   <- roc_curve(score_p2, y_te2)
roc_p4   <- roc_curve(score_p4, y_te4)
roc_nn   <- roc_curve(score_nn, y_te4)
roc_nnet <- roc_curve(score_nnet, y_te4)

cat(sprintf("AUC — Perceptron 2-pred: %.3f\n", roc_p2$auc[1]))
The ROC plot shows all four models on the same axes:
Steeper initial rise → model ranks true defaulters near the top of its score list
Larger area → better discrimination across all operating points
The dashed diagonal is the random baseline (AUC = 0.5)
Key observations:
Both neural networks clearly dominate the perceptrons at most operating points
The 4-predictor perceptron outperforms the 2-predictor version — extra features help even in a linear model
The gap between perceptrons and NNs is largest in the high-sensitivity region (TPR > 0.6) — the non-linear boundary helps most when trying to catch the majority of defaulters
The two neural networks perform similarly — nnet’s L2 regularisation trades a small amount of raw performance for better generalisation
library(ggplot2)
library(dplyr)

# Combine into one data frame
roc_all <- bind_rows(
  roc_p2   |> mutate(Model = sprintf("Perceptron 2-pred (AUC = %.3f)", auc[1])),
  roc_p4   |> mutate(Model = sprintf("Perceptron 4-pred (AUC = %.3f)", auc[1])),
  roc_nn   |> mutate(Model = sprintf("NN hand-coded (AUC = %.3f)", auc[1])),
  roc_nnet |> mutate(Model = sprintf("NN nnet (AUC = %.3f)", auc[1]))
)

# Consistent colour palette, in the same order as bind_rows() above
model_colours <- c("Perceptron 2-pred" = "#e67e22",
                   "Perceptron 4-pred" = "#e74c3c",
                   "NN hand-coded"     = "#3498db",
                   "NN nnet"           = "#2ecc71")

# Re-key the colours to the actual legend labels (which include the AUC)
names(model_colours) <- unique(roc_all$Model)

ggplot(roc_all, aes(fpr, tpr, colour = Model)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey60", linewidth = 0.7) +
  scale_colour_manual(values = model_colours) +
  annotate("text", x = 0.395, y = 0.375,
           label = "Random classifier (AUC = 0.5)",
           colour = "grey60", size = 3.2, hjust = 0) +
  labs(title = "ROC curves — four models on the sample test set",
       subtitle = "345 test observations | 64 actual defaults",
       x = "False positive rate (1 − specificity)",
       y = "True positive rate (recall / sensitivity)",
       colour = NULL) +
  coord_equal()
Under EBA IRB guidelines, the Gini coefficient (\(= 2 \times \text{AUC} - 1\)) is the primary discriminatory power metric for PD models. A Gini below 0.40 (AUC < 0.70) is generally considered insufficient for supervisory approval of internal rating models. Our neural networks comfortably exceed this threshold.
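The conversion is a one-liner; the example values below are illustrative, close to the AUCs reported for this exercise:

```r
# Gini coefficient from AUC: Gini = 2 * AUC - 1
gini <- function(auc) 2 * auc - 1

gini(0.84)   # 0.68 -- comfortably above the ~0.40 supervisory floor
gini(0.70)   # 0.40 -- the approximate minimum for IRB model approval
gini(0.50)   # 0.00 -- a random classifier has no discriminatory power
```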
The ROC curve helps us choose the operating threshold that best matches the business objective.
Two common criteria:
Youden’s J statistic — maximises the sum of sensitivity and specificity: \[J = \text{TPR} - \text{FPR} = \text{Sensitivity} + \text{Specificity} - 1\]
F1 score — balances precision and recall (useful when false negatives are costly): \[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
In credit risk, false negatives (missed defaults) are typically more costly than false positives (wrongly rejected good borrowers). This argues for a threshold below 0.5 — accepting more false alarms to catch more true defaults.
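The F1 criterion can be scanned over thresholds in much the same way as Youden's J (a self-contained sketch; `score` and `y` are toy stand-ins for the model's PD scores and test labels):

```r
# F1 score at a given threshold
f1_at <- function(score, y, t) {
  pred      <- as.integer(score >= t)
  tp        <- sum(pred == 1 & y == 1)
  precision <- if (sum(pred) == 0) 0 else tp / sum(pred)
  recall    <- if (sum(y) == 0) 0 else tp / sum(y)
  if (precision + recall == 0) 0
  else 2 * precision * recall / (precision + recall)
}

score <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)  # toy PD scores
y     <- c(1,   1,   0,   1,   0,   0)    # toy labels
ts    <- seq(0.05, 0.95, by = 0.05)
f1s   <- sapply(ts, function(t) f1_at(score, y, t))
ts[which.max(f1s)]   # threshold with the best F1 on this toy data
```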
# Find the optimal threshold by Youden's J for the nnet model
thresholds <- seq(0.01, 0.99, by = 0.01)
youden <- sapply(thresholds, function(t) {
  pred <- as.integer(score_nnet >= t)
  cm   <- table(factor(y_te4, levels = c(0, 1)),
                factor(pred,  levels = c(0, 1)))
  tpr <- cm["1", "1"] / sum(cm["1", ])
  fpr <- cm["0", "1"] / sum(cm["0", ])
  tpr - fpr   # Youden's J
})
opt_t <- thresholds[which.max(youden)]
cat(sprintf("Optimal threshold (Youden's J) for nnet: %.2f\n", opt_t))
Zooming in on the nnet ROC curve with the optimal operating point marked.
The optimal point (by Youden’s J) represents the threshold that maximises the sum of sensitivity and specificity — it is the point on the ROC curve furthest from the diagonal.
In practice, the choice of operating point is a business and regulatory decision:
A Basel III-compliant internal model might require recall ≥ 70% on defaults
A portfolio growth strategy might accept higher FPR to avoid rejecting good borrowers
An IFRS 9 Stage 2 model might use a very low threshold to capture all significant credit deterioration early
The ROC curve makes all these trade-offs explicit and auditable.
The ROC curve separates model quality from threshold choice — essential for regulatory reporting and business calibration
AUC is the probability that the model ranks a random defaulter above a random non-defaulter — the standard EBA discriminatory power metric
The perceptron’s raw linear score (\(\mathbf{w}^\top\mathbf{x} + b\)) must be used for ROC analysis — the hard 0/1 output is not a ranking
Neural networks achieve AUC ≈ 0.83–0.84 vs 0.72–0.76 for perceptrons — a meaningful uplift that justifies the added complexity
The Gini coefficient (\(= 2 \times \text{AUC} - 1\)) is the Basel/EBA standard: our NNs achieve Gini ≈ 0.66–0.68, well above the 0.40 minimum
Youden’s J provides a principled, threshold-agnostic way to select the operating point — but the final choice is always a business and regulatory decision
The ROC framework makes the performance–interpretability trade-off explicit and auditable — the right language for a conversation with a supervisor or a risk committee
“A model without a ROC curve is a model without a voice. The curve is how you speak to a regulator.”