A neural network primer in credit selection

From logic gates to corporate default prediction

Rodolfo Vanzini, CFA

2026-03-25

Part I: The Perceptron

What is a perceptron?

The CFA Institute’s monograph on AI in investment management by Joseph Simonian, PhD gave me the idea to develop this presentation.

A perceptron is the simplest neural network: a binary linear classifier. Proposed by Frank Rosenblatt (1958), it models a single biological neuron:

  1. Receives weighted inputs
  2. Sums them (the net input)
  3. Fires (output = 1) if the sum exceeds a threshold

It is the building block of all modern neural networks.

\[\hat{y} = \phi\!\left(\mathbf{w}^\top \mathbf{x} + b\right)\]

where the Heaviside step function \(\phi\) is:

\[\phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}\]

x₁ ──→ w₁ ─┐
x₂ ──→ w₂ ─┤──→ Σ ──→ φ(z) ──→ ŷ ∈ {0,1}
            └─ b

The decision boundary

The perceptron partitions the input space with a hyperplane:

\[\mathbf{w}^\top \mathbf{x} + b = 0\]

Note

In 2D (\(x_1, x_2\)), the boundary is a line \(w_1 x_1 + w_2 x_2 + b = 0\), i.e. \[x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}\] Points on one side are classified as 1, on the other as 0.

The perceptron convergence theorem guarantees that the algorithm finds a separating hyperplane in a finite number of steps — if one exists.

Learning outcome Condition
Converges to perfect classifier Data is linearly separable
Oscillates, never converges Data is not linearly separable

The learning rule

The perceptron updates weights only on mistakes:

\[\mathbf{w} \leftarrow \mathbf{w} - \eta\,(\hat{y} - y)\,\mathbf{x}\] \[b \leftarrow b - \eta\,(\hat{y} - y)\]

where \(\eta > 0\) is the learning rate.

Case \(\hat{y}\) \(y\) Error \((\hat{y}-y)\) Effect on \(\mathbf{w}\)
Correct 0 0 0 No update
Correct 1 1 0 No update
False positive 1 0 +1 \(\mathbf{w} \downarrow\) — neuron fires less
False negative 0 1 −1 \(\mathbf{w} \uparrow\) — neuron fires more

Tip

Key insight: the update is proportional to the input that caused the mistake. The neuron learns by correcting its own errors — one sample at a time.

A toy example: the AND gate

Before applying the perceptron to real credit data, we illustrate it on the simplest possible problem: the logical AND function.

Two binary inputs \(x_1, x_2 \in \{0,1\}\), output \(y = x_1 \land x_2\):

\(x_1\) \(x_2\) \(y = x_1 \land x_2\)
0 0 0
1 0 0
0 1 0
1 1 1

This is linearly separable — a single line can separate the one positive case \((1,1)\) from the three negatives. The ideal boundary: \(x_1 + x_2 - 1.5 = 0\).

# The CFA Institute's monograph on AI by Joseph Simonian, PhD gave me this idea to understand
# better and share my insights about the perceptron and neural networks in investment management.
# Input space: all combinations of two binary inputs.
X <- as.matrix(expand.grid(x1 = c(0, 1), x2 = c(0, 1)))

# AND gate target
y <- apply(X, 1, function(x) as.numeric(x[1] & x[2]))

data.frame(x1 = X[,1], x2 = X[,2], y = y)
  x1 x2 y
1  0  0 0
2  1  0 0
3  0  1 0
4  1  1 1

\[X = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}\]

Initialising weights

Weights and bias are initialised to random values drawn from the standard normal distribution.

The net input for each sample is then:

\[\mathbf{z} = X\mathbf{w} + b = \begin{bmatrix} b \\ w_1 + b \\ w_2 + b \\ w_1 + w_2 + b \end{bmatrix}\]

Before training, these random weights produce meaningless predictions — the learning algorithm will correct them iteratively.

set.seed(123) # this is to ensure reproducibility
weights <- matrix(rnorm(2), nrow = 2, ncol = 1)
bias    <- rnorm(1)

cat("Initial weights: w1 =", round(weights[1], 3),
    "  w2 =", round(weights[2], 3),
    "\nInitial bias:    b  =", round(bias, 3))
Initial weights: w1 = -0.56   w2 = -0.23 
Initial bias:    b  = 1.559
# Net input z = Xw + b
z <- X %*% weights + bias
cat("\n\nNet inputs z = Xw + b:\n")


Net inputs z = Xw + b:
print(round(z, 3))
      [,1]
[1,] 1.559
[2,] 0.998
[3,] 1.329
[4,] 0.768
# Initial (random) predictions
cat("\nInitial predictions (before training):",
    as.numeric(z >= 0))

Initial predictions (before training): 1 1 1 1
cat("\nTarget values:                        ", y)

Target values:                         0 0 0 1

The prediction function

Three steps, one line of algebra:

Step Operation Math
1. Weighted sum \(z = \mathbf{w}^\top \mathbf{x} + b\) Linear combination
2. Threshold \(z \geq 0\)? Heaviside \(\phi\)
3. Numeric output TRUE → 1, FALSE → 0 \(\hat{y} \in \{0,1\}\)

In matrix form across all \(n\) samples simultaneously:

\[\hat{\mathbf{y}} = \mathbb{I}(X\mathbf{w} + b \geq 0) \in \{0,1\}^n\]

# Beautiful matrix algebra that in one line implements the Heaviside function
predict_perceptron <- function(X, w, b) {
  as.numeric((X %*% w + b) >= 0)
}

# Trace through with our AND gate data
z     <- X %*% weights + bias
fired <- z >= 0
preds <- as.numeric(fired)

data.frame(
  x1 = X[,1], x2 = X[,2],
  z  = round(z, 3),
  `z >= 0` = fired,
  `ŷ`      = preds,
  y        = y,
  correct  = preds == y,
  check.names = FALSE
)
  x1 x2     z z >= 0 ŷ y correct
1  0  0 1.559   TRUE 1 0   FALSE
2  1  0 0.998   TRUE 1 0   FALSE
3  0  1 1.329   TRUE 1 0   FALSE
4  1  1 0.768   TRUE 1 1    TRUE

The training algorithm

The fit function loops over all samples for a fixed number of epochs (complete passes through the data):

for each epoch:
  for each sample (xᵢ, yᵢ):
    ŷᵢ  ← φ(w⊤xᵢ + b)         # predict
    δᵢ  ← ŷᵢ − yᵢ              # error
    if δᵢ ≠ 0:
      w ← w − η · δᵢ · xᵢ     # update weights
      b ← b − η · δᵢ           # update bias
  if no errors this epoch → STOP (converged)

Early stopping: if an entire epoch passes with zero misclassifications, the data is perfectly separated and training terminates.

fit_perceptron <- function(X, y, w, b, lr = 0.1, epochs = 100) {
  history <- numeric(epochs)
  for (e in seq_len(epochs)) {
    err <- 0
    for (i in seq_len(nrow(X))) {
      v     <- sum(X[i,] * w) + b        # net input
      y_hat <- ifelse(v < 0, 0, 1)       # step function
      delta <- y_hat - y[i]              # error
      if (delta != 0) {
        err <- err + 1
        w   <- w - lr * delta * X[i,]   # weight update
        b   <- b - lr * delta            # bias update
      }
    }
    history[e] <- err
    if (err == 0) { cat("Converged at epoch:", e, "\n"); break }
  }
  list(w = w, b = b, history = history[1:e])
}

Training on the AND Gate

After training, the perceptron correctly classifies all four AND gate combinations.

Learned parameters (example, see R code):

Parameter Value
\(w_1\) +0.14
\(w_2\) +0.17
\(b\) −0.241

The implied decision boundary \(w_1 x_1 + w_2 x_2 + b = 0\) correctly places \((1,1)\) above the line (output = 1) and all other points below it (output = 0).

The AND gate is linearly separable, so the perceptron is guaranteed to converge.

set.seed(123)
w <- matrix(rnorm(2), 2, 1)
b <- rnorm(1)

model_and <- fit_perceptron(X, y, w, b, lr = 0.1, epochs = 100)
Converged at epoch: 18 
cat("Learned weights: w1 =", round(model_and$w[1], 3),
    "  w2 =", round(model_and$w[2], 3),
    "\nLearned bias:    b  =", round(model_and$b, 3))
Learned weights: w1 = 0.14   w2 = 0.17 
Learned bias:    b  = -0.241
preds <- predict_perceptron(X, model_and$w, model_and$b)
data.frame(x1 = X[,1], x2 = X[,2], Target = y,
           Prediction = preds, Correct = y == preds)
  x1 x2 Target Prediction Correct
1  0  0      0          0    TRUE
2  1  0      0          0    TRUE
3  0  1      0          0    TRUE
4  1  1      1          1    TRUE

Decision boundary — AND gate

The perceptron finds a straight line separating the single positive case \((1,1)\) from the three negatives.

  • Blue region → predicted 0 (no fire)
  • Red region → predicted 1 (fire)
  • The line is the learned decision boundary \(w_1 x_1 + w_2 x_2 + b = 0\)

The AND gate is a toy problem chosen because it is guaranteed to be linearly separable. Real credit data is far more complex — but the same mathematical machinery applies.

Show R code
library(ggplot2)
library(RColorBrewer)

# This sets the color palette to opposing colours, particularly Set1
# see https://colorbrewer2.org/
pal <- brewer.pal(3, "Set1")

grid_and <- expand.grid(x1 = seq(-0.2, 1.2, 0.01),
                        x2 = seq(-0.2, 1.2, 0.01))
grid_and$pred <- as.numeric(
  as.matrix(grid_and) %*% model_and$w + model_and$b >= 0)

pts <- data.frame(x1 = X[,1], x2 = X[,2],
                  y  = factor(y, labels = c("0 (no fire)", "1 (fire)")))

ggplot() +
  geom_tile(data = grid_and,
            aes(x1, x2, fill = factor(pred)), alpha = 0.25) +
  geom_point(data = pts, aes(x1, x2, colour = y, shape = y), size = 6) +
  geom_abline(intercept = -model_and$b / model_and$w[2],
              slope     = -model_and$w[1] / model_and$w[2],
              colour = "white", linewidth = 1, linetype = "dashed") +
  scale_fill_manual(values = c("0" = pal[2], "1" = pal[1]),
                    labels = c("Pred: 0", "Pred: 1"), name = "Region") +
  scale_colour_manual(values = c("0 (no fire)" = pal[2],
                                 "1 (fire)"    = pal[1]), name = "Target") +
  scale_shape_manual(values = c(1, 3)) + 
  coord_fixed(xlim = c(-0.2, 1.2), ylim = c(-0.2, 1.2)) +
  labs(title = "AND gate — Learned decision boundary",
       x = "x1", y = "x2") 

The perceptron’s limitation: XOR

The XOR function outputs 1 when exactly one input is 1:

\(x_1\) \(x_2\) \(y = x_1 \oplus x_2\)
0 0 0
1 0 1
0 1 1
1 1 0

No single straight line can separate the two 1s (diagonal) from the two 0s. The perceptron cannot converge — it oscillates indefinitely.

Important

This limitation motivated the invention of multi-layer networks (MLP/deep learning), which we will (probably) explore in another presentation.

Show R code
X_xor <- as.matrix(expand.grid(x1 = c(0,1), x2 = c(0,1)))
y_xor <- apply(X_xor, 1, function(x) as.numeric(xor(x[1], x[2])))

set.seed(123)
m_xor <- fit_perceptron(X_xor, y_xor,
                        w  = matrix(rnorm(2), 2, 1),
                        b  = rnorm(1), lr = 0.1, epochs = 100)

pred_xor <- predict_perceptron(X_xor, m_xor$w, m_xor$b)
cat("Perceptron on XOR — accuracy:",
    round(mean(pred_xor == y_xor)*100, 1), "%  (best possible: 75%)\n")
Perceptron on XOR — accuracy: 25 %  (best possible: 75%)
Show R code
grid_xor <- expand.grid(x1 = seq(-0.2,1.2,0.01), x2 = seq(-0.2,1.2,0.01))
grid_xor$pred <- as.numeric(
  as.matrix(grid_xor) %*% m_xor$w + m_xor$b >= 0)

ggplot() +
  geom_tile(data = grid_xor, aes(x1, x2, fill = factor(pred)), alpha = 0.25) +
  geom_point(data = data.frame(X_xor, y = factor(y_xor)),
             aes(x1, x2, colour = y, shape = y), size = 6) +
  scale_fill_manual(values = c("0" = pal[2], "1" = pal[1]),
                    labels = c("Pred: 0", "Pred: 1"), name = "Region") +
  scale_colour_manual(values = c("0" = pal[2],
                                 "1"    = pal[1]), name = "Target") +
  scale_shape_manual(values = c(1, 3)) +
  coord_fixed(xlim=c(-0.2,1.2), ylim=c(-0.2,1.2)) +
  labs(title="XOR — perceptron fails (not linearly separable)",
       x="x1", y="x2") 

Part II: From logic gates to credit risk

The bridge

The AND gate taught us the mechanics. Now we apply the same algorithm to a real problem.

AND gate (toy)

  • Inputs: \(x_1, x_2 \in \{0,1\}\)
  • Output: \(y = 1\) only if both inputs are 1
  • 4 observations, perfectly separable
  • Trains in a few epochs

Credit default (real)

  • Inputs: financial ratios (DSCR, Debt/EBITDA)
  • Output: \(y = 1\) if borrower defaults
  • 690 real corporate borrowers
  • Classes overlap — linear separation is imperfect

Note

The same fit_perceptron and predict_perceptron functions we wrote for the AND gate will now be applied — unchanged — to 690 real loan observations.

The dataset

690 corporate borrowers from a sample dataset, pre-split into training and test sets:

Train Test Total
Observations 345 345 690
Non-defaulted (No) 289 281 570
Defaulted (Si) 56 64 120
Default rate 16.2% 18.6% 17.4%

Two predictors — group means:

Variable Description No Default Default
DSCR Debt Service Coverage Ratio 1.95 1.04
DEBITDA Debt / EBITDA multiple 4.48 7.61

Both show clear directional separation: lower DSCR and higher leverage are associated with default.

library(readxl)
library(dplyr)

df <- read_excel("corporate_ds.xlsx") |>
  mutate(label = as.integer(Default == "Si"))

train2 <- df |> filter(Set == "Train") |> select(label, DSCR, DEBITDA)
test2  <- df |> filter(Set == "Test")  |> select(label, DSCR, DEBITDA)

df |>
  mutate(Default = factor(label, labels = c("No","Si"))) |>
  group_by(Default) |>
  summarise(n = n(),
            DSCR_mean    = round(mean(DSCR), 2),
            DEBITDA_mean = round(mean(DEBITDA), 2))
# A tibble: 2 × 4
  Default     n DSCR_mean DEBITDA_mean
  <fct>   <int>     <dbl>        <dbl>
1 No        570      1.95         4.48
2 Si        120      1.04         7.61

Part III: Two predictors — DSCR & Debt/EBITDA

Feature space

Plotting all 690 borrowers in the DSCR × Debt/EBITDA plane:

  • Blue circles = non-defaulted borrowers
  • Red crosses = defaulted borrowers

The groups partially overlap — unlike the AND gate, no perfect linear boundary exists. This is typical of real credit data.

Normalisation to \([0,1]\) is applied before training — essential when variables have different scales (DSCR: −1.5 to 4.7; Debt/EBITDA: −2.3 to 12.4).

Show R code
library(ggplot2)

df %>% 
  mutate(Default = factor(label, labels = c("No","Si"))) %>% 
  ggplot(aes(DSCR, DEBITDA, colour = Default, shape = Default)) +
  geom_point(alpha = 1.0, size = 1.8) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  scale_shape_manual(values = c(1, 3)) + 
  labs(title = "Feature space: DSCR vs debt/EBITDA",
       subtitle = "690 corporate borrowers",
       x = "DSCR", y = "Debt/EBITDA") 

Show R code
# +
#   theme_minimal(base_size = 13)

Training — 2 Predictors

The same learning rule as the AND gate, now on 345 training observations:

\[w_i \leftarrow w_i - \eta\,(\hat{y} - y)\,x_i \qquad \eta = 0.01, \quad 100 \text{ epochs}\]

Features normalised to \([0,1]\). Learned weights after convergence:

Parameter Value Financial interpretation
\(w_{\text{DSCR}}\) −1.52 ↑ DSCR → ↓ default risk
\(w_{\text{DEBITDA}}\) +1.74 ↑ Debt/EBITDA → ↑ default risk
Bias \(b\) −0.22 Decision threshold

Sign and direction align perfectly with financial intuition — exactly as the AND gate weights aligned with the truth table.

Show R code
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

X_tr2 <- train2 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_tr2 <- train2$label

X_te2 <- test2 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_te2 <- test2$label

set.seed(34)
m2 <- fit_perceptron(X_tr2, y_tr2,
                     w  = matrix(rnorm(2), 2, 1),
                     b  = rnorm(1), lr = 0.01, epochs = 100)

cat("w_DSCR    =", round(m2$w[1], 3),
    "\nw_DEBITDA =", round(m2$w[2], 3),
    "\nbias      =", round(m2$b, 3))
w_DSCR    = -0.019 
w_DEBITDA = 0.069 
bias      = -0.018

Decision boundary — 2 predictors

The perceptron draws a straight line in the DSCR × Debt/EBITDA plane — exactly as it did in the AND gate, but now on real financial data.

  • Blue zone → predicted: No default
  • Red zone → predicted: Default

Points in the wrong zone are misclassifications. The overlap of the two groups is irreducible with a linear classifier — motivating more powerful models (logistic regression, random forest).

Show R code
grid2 <- expand.grid(DSCR = seq(0,1,0.005), DEBITDA = seq(0,1,0.005))
grid2$pred <- as.numeric(as.matrix(grid2) %*% m2$w + m2$b >= 0)

X_te2_df <- as.data.frame(X_te2); X_te2_df$label <- y_te2

ggplot() +
  geom_tile(data = grid2, aes(DSCR, DEBITDA, fill = factor(pred)),
            alpha = 0.22) +
  geom_point(data = X_te2_df,
             aes(DSCR, DEBITDA, colour = factor(label), shape = factor(label)),
             size = 1.8, alpha = 1.0) +
  scale_fill_manual(values = c("0"=pal[2],"1"=pal[1]),
                    labels = c("No Default","Default"), name = "Region") +
  scale_colour_manual(values = c("0"=pal[2],"1"=pal[1]),
                      labels = c("No Default","Default"), name = "Actual") +
  scale_shape_manual(values = c(1, 3)) + 
  labs(title = "Decision boundary — 2 predictors (test set)",
       x = "DSCR (normalised)", y = "Debt/EBITDA (normalised)") +
  theme_minimal(base_size = 12)

Performance — 2 predictors

Confusion matrix — 345 test observations:

Predicted: No default Predicted: default
Actual: No default 126 ✅ 155 ❌
Actual: Default 1 ❌ 63 ✅
Metric Value
Accuracy 54.8%
Recall (default) 98.4%
Precision (default) 28.9%
Specificity 44.8%

Warning

As many as 98.4% of actual defaulters are caught. In credit risk, missed defaults are the costliest errors, but only 44.8% of non defaulters were caught. Can adding more features help?

Show R code
y_hat2 <- predict_perceptron(X_te2, m2$w, m2$b)
cm2    <- table(Actual = y_te2, Predicted = y_hat2)
print(cm2)
      Predicted
Actual   0   1
     0 126 155
     1   1  63
Show R code
cat(sprintf(
  "\nAccuracy:    %.1f%%\nRecall:      %.1f%%\nPrecision:   %.1f%%\nSpecificity: %.1f%%",
  mean(y_te2==y_hat2)*100,
  (cm2[2,2]/sum(cm2[2,]))*100,
  (cm2[2,2]/sum(cm2[,2]))*100,
  (cm2[1,1]/sum(cm2[1,]))*100
))

Accuracy:    54.8%
Recall:      98.4%
Precision:   28.9%
Specificity: 44.8%

Part IV: Four predictors — Adding ROS & DBR

Extending the feature set

We enrich the model with two additional financial ratios:

Variable Description No Default Default
DSCR Debt Service Coverage Ratio 1.95 1.04
DEBITDA Debt / EBITDA multiple 4.48 7.61
ROS Return on Sales (%) 10.2 4.6
DBR Debt-to-Revenue ratio (%) 38.6 71.8

The decision boundary is now a hyperplane in 4D — no longer directly visualisable, but the perceptron still finds a linear combination of the four ratios.

Note

The fit_perceptron function requires no changes — only the number of input columns grows from 2 to 4. The learning rule is identical.

Show R code
library(tidyr)

df |>
  mutate(Default = factor(label, labels = c("No","Si"))) |>
  group_by(Default) |>
  summarise(across(c(DSCR, DEBITDA, ROS, DBR), ~round(mean(.), 2)))
# A tibble: 2 × 5
  Default  DSCR DEBITDA   ROS   DBR
  <fct>   <dbl>   <dbl> <dbl> <dbl>
1 No       1.95    4.48  10.2  38.6
2 Si       1.04    7.61   4.6  71.8

Feature distributions — 4 predictors

All four ratios show clear directional separation:

  • DSCR: lower for defaulters — debt coverage is insufficient
  • Debt/EBITDA: higher — excessive leverage relative to earnings
  • ROS: lower — poor profitability reduces cash generation
  • DBR: higher — heavy debt burden relative to revenues

Each adds independent information about the borrower’s financial health.

Show R code
df |>
  mutate(Default = factor(label, labels = c("No","Si"))) |>
  select(Default, DSCR, DEBITDA, ROS, DBR) |>
  pivot_longer(-Default) |>
  ggplot(aes(Default, value, fill = Default)) +
  geom_boxplot(alpha = 0.80, outlier.size = 0.6) +
  facet_wrap(~name, scales = "free_y") +
  scale_fill_manual(values = c("No"= pal[2],"Si"= pal[1])) +
  labs(title = "Feature distributions by default status — 4 predictors",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Training — 4 Predictors

Learned weights after 100 epochs — same learning rate \(\eta = 0.01\):

Feature Weight Direction
DSCR −0.043 ↑ coverage → ↓ risk
Debt/EBITDA +0.073 ↑ leverage → ↑ risk
ROS −0.077 ↑ profitability → ↓ risk
DBR +0.036 ↑ debt burden → ↑ risk

No single feature has the strongest default signal in this dataset (not significantly stronger).

Show R code
train4 <- df |> filter(Set=="Train") |> select(label, DSCR, DEBITDA, ROS, DBR)
test4  <- df |> filter(Set=="Test")  |> select(label, DSCR, DEBITDA, ROS, DBR)

X_tr4 <- train4 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_tr4 <- train4$label

X_te4 <- test4 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_te4 <- test4$label

set.seed(42)
m4 <- fit_perceptron(X_tr4, y_tr4,
                     w  = matrix(rnorm(4), 4, 1),
                     b  = rnorm(1), lr = 0.01, epochs = 100)

data.frame(Feature = c("DSCR","DEBITDA","ROS","DBR"),
           Weight  = round(m4$w, 3))
  Feature Weight
1    DSCR -0.043
2 DEBITDA  0.073
3     ROS -0.077
4     DBR  0.036

Learning curves — 2 vs 4 predictors

Comparing training errors per epoch for both models:

  • Both curves drop sharply in the first 20–30 epochs
  • The 4-predictor model (blue) settles at a lower error floor
  • Neither reaches zero — the classes are not linearly separable in real data

This non-zero floor is the signature of irreducible overlap in the feature space. It motivates more flexible classifiers — the topic of the next presentation.

Show R code
bind_rows(
  data.frame(epoch = seq_along(m2$history), errors = m2$history,
             model = "2 Predictors (DSCR + DEBITDA)"),
  data.frame(epoch = seq_along(m4$history), errors = m4$history,
             model = "4 Predictors (+ ROS + DBR)")
) |>
  ggplot(aes(epoch, errors, colour = model)) +
  geom_line(linewidth = 1.1) +
  scale_colour_manual(
    values = c("2 Predictors (DSCR + DEBITDA)" = "#e74c3c",
               "4 Predictors (+ ROS + DBR)"    = "#3498db")) +
  labs(title = "Learning curves: 2 vs 4 predictors",
       subtitle = "Training misclassifications per epoch (n = 345)",
       x = "Epoch", y = "Errors", colour = NULL) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Performance — 2 vs 4 predictors

Head-to-head on 345 test observations:

Metric 2 Predictors 4 Predictors Δ
Accuracy 54.8% 49.6% -5.2pp
Recall (Default) 98.4% 100% +1.6pp
Precision (Default) 28.9% 26.9% -2.0pp
Specificity 44.8% 38.1% -13.9pp

Tip

The biggest gain is recall — the 4-predictor model catches 1.6 percentage points more actual defaulters. This is the metric that matters most in credit risk management, though this improvemente doesn’t add much.

Show R code
y_hat4 <- predict_perceptron(X_te4, m4$w, m4$b)
cm4    <- table(Actual = y_te4, Predicted = y_hat4)

data.frame(
  Metric = c("Accuracy","Recall (Default)","Precision (Default)","Specificity"),
  `2 Pred` = c(mean(y_te2==y_hat2), cm2[2,2]/sum(cm2[2,]),
               cm2[2,2]/sum(cm2[,2]), cm2[1,1]/sum(cm2[1,])),
  `4 Pred` = c(mean(y_te4==y_hat4), cm4[2,2]/sum(cm4[2,]),
               cm4[2,2]/sum(cm4[,2]), cm4[1,1]/sum(cm4[1,])),
  check.names = FALSE
) |>
  mutate(across(where(is.numeric), ~paste0(round(.*100,1),"%")),
         Delta = paste0("", round(
           c(mean(y_te4==y_hat4)-mean(y_te2==y_hat2),
             cm4[2,2]/sum(cm4[2,])-cm2[2,2]/sum(cm2[2,]),
             cm4[2,2]/sum(cm4[,2])-cm2[2,2]/sum(cm2[,2]),
             cm4[1,1]/sum(cm4[1,])-cm2[1,1]/sum(cm2[1,]))*100, 1), "pp"))
               Metric 2 Pred 4 Pred  Delta
1            Accuracy  54.8%  49.6% -5.2pp
2    Recall (Default)  98.4%   100%  1.6pp
3 Precision (Default)  28.9%  26.9%   -2pp
4         Specificity  44.8%  38.1% -6.8pp

Weight interpretation — A data-driven scorecard

The perceptron’s weight vector is its implicit credit policy:

Feature Weight Logic
DSCR −0.0433 Strongest protective factor: sufficient debt coverage reduces default risk
Debt/EBITDA +0.0731 Dominant risk factor: excessive leverage relative to earnings
ROS −0.0765 Profitable firms generate cash to service obligations
DBR +0.360 High debt burden relative to revenues is a warning signal

This structure is essentially a linear scorecard — identical in spirit to Altman’s Z-score or traditional expert-designed credit rating models, but learned entirely from data.

Note

Under EBA IRB guidelines and IFRS 9, model explainability is mandatory. The perceptron’s weights satisfy this requirement directly.

Show R code
lw <- data.frame(Feature = c("DSCR","DEBITDA","ROS","DBR"),
           Weight  = m4$w[,1])
lw
  Feature      Weight
1    DSCR -0.04326569
2 DEBITDA  0.07311199
3     ROS -0.07654086
4     DBR  0.03598658
Show R code
lw %>% 
  ggplot(aes(reorder(Feature, Weight), Weight, fill = Weight > 0)) +
  geom_col(alpha = 0.85) +
  geom_hline(yintercept = 0, linewidth = 0.5, colour = "white") +
  coord_flip() +
  scale_fill_manual(values = c("TRUE"="#e74c3c","FALSE"="#2ecc71"),
                    labels = c("Reduces risk","Increases risk")) +
  labs(title = "Learned Weights — 4-predictor perceptron",
       subtitle = "Negative = protective, Positive = risk-increasing",
       x = NULL, y = "Weight", fill = NULL) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Limitations & what comes next

Perceptron limits

  • Linear boundary only — cannot capture interaction effects
  • No probability output — just 0/1 (no PD estimate)
  • Precision still limited (~26.9%) on real credit data
  • Sensitive to feature scaling
  • Cannot solve XOR-type problems

In the next part

The same step-by-step logic — applied to a Multi-Layer Perceptron (MLP):

Addition What it unlocks
Hidden layer Non-linear boundaries
Sigmoid activation Probability output (PD)
Backpropagation Gradient-based learning
Multiple layers Deep representations

Regulatory reminder

The perceptron is fully auditable — its weights are the model. More complex models (XGBoost, MLP) require post-hoc explainability tools (SHAP values) to meet EBA and IFRS 9 requirements.

Key takeaways

  1. The perceptron is a single linear classifier: weighted inputs + threshold = binary prediction
  2. The AND gate shows the mechanics: same algorithm, 4 observations, guaranteed convergence
  3. On 690 real loans, the same code achieves ~50% accuracy and ~100% recall with no feature engineering
  4. 2 predictors give a visible 2D boundary; 4 predictors raise default recall slightly
  5. Debt/EBITDA is the dominant default signal — the model discovers this automatically
  6. The weight vector is a data-driven scorecard — interpretable, auditable, regulatory-compliant
  7. The non-zero error floor signals non-linear overlap → motivation for the MLP (next session)


“Start simple. A model that can be explained to a regulator is worth more than one that cannot.” “Understanding the simplest model deeply is more valuable than applying complex models blindly.”

Part V: From perceptron to neural network

Why one neuron is not enough

The perceptron left us with two problems:

Problem 1 — Linear boundary only

Real credit data is not linearly separable. Defaulters and non-defaulters overlap in every feature space. A single hyperplane cannot capture that complexity.

Problem 2 — No probability output

The perceptron outputs 0 or 1. Credit decisions require a probability of default (PD) — a number in \([0,1]\) that can be thresholded, stress-tested, and reported under IFRS 9.

The fix: add a hidden layer

Input layer    Hidden layer    Output layer

DSCR   ──→ ╮                 ╭──→
           ├──→ h₁ (sigmoid) ┤
DEBITDA──→ ┤                 ├──→ ŷ ∈ (0,1)
           ├──→ h₂ (sigmoid) ┤
ROS    ──→ ┤                 ╰──→
           ├──→
DBR    ──→ ╯

Two hidden neurons + sigmoid activation give the network non-linear decision boundaries and a probability output.

Architecture: 4 → 2 → 1

Our network has three layers and two sets of weights:

\[\underbrace{\mathbf{x} \in \mathbb{R}^4}_{\text{inputs}} \xrightarrow{\mathbf{W}_1,\,\mathbf{b}_1} \underbrace{\mathbf{h} \in \mathbb{R}^2}_{\text{hidden}} \xrightarrow{\mathbf{W}_2,\,\mathbf{b}_2} \underbrace{\hat{y} \in (0,1)}_{\text{output}}\]

Layer dimensions:

Layer Weight matrix Bias Shape
Input → Hidden \(\mathbf{W}_1\) \(\mathbf{b}_1\) \(4 \times 2\) weights, \(2\) biases
Hidden → Output \(\mathbf{W}_2\) \(b_2\) \(2 \times 1\) weights, \(1\) bias

Total parameters: \(4 \times 2 + 2 + 2 \times 1 + 1 = \mathbf{13}\)

Note

Compare with the perceptron: 5 parameters (\(4\) weights + \(1\) bias). The hidden layer adds just 8 more — but unlocks non-linear boundaries and probability outputs.

The Sigmoid activation

We replace the Heaviside step function \(\phi\) with the sigmoid:

\[\sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)\]

Why sigmoid instead of step?

Property Step \(\phi\) Sigmoid \(\sigma\)
Output range \(\{0, 1\}\) \((0, 1)\)
Differentiable
Probability interpretation
Gradient-based learning

The sigmoid is smooth and differentiable — this allows us to compute gradients and use backpropagation to train the network.

Key values to remember:

\[\sigma(0) = 0.5 \qquad \text{(decision threshold)}\] \[\sigma(z) \to 1 \text{ as } z \to +\infty\] \[\sigma(z) \to 0 \text{ as } z \to -\infty\]

And its elegant derivative:

\[\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\]

Forward pass — Step by step

Given one borrower \(\mathbf{x} = (x_1, x_2, x_3, x_4)^\top\) (normalised DSCR, DEBITDA, ROS, DBR):

Step 1 — Hidden layer pre-activation: \[\mathbf{z}^{(1)} = \mathbf{W}_1^\top \mathbf{x} + \mathbf{b}_1 \in \mathbb{R}^2\]

Step 2 — Hidden layer activation (sigmoid): \[\mathbf{h} = \sigma(\mathbf{z}^{(1)}) = \begin{bmatrix} \sigma(z^{(1)}_1) \\ \sigma(z^{(1)}_2) \end{bmatrix} \in (0,1)^2\]

Step 3 — Output pre-activation: \[z^{(2)} = \mathbf{W}_2^\top \mathbf{h} + b_2 \in \mathbb{R}\]

Step 4 — Output activation (sigmoid → PD estimate): \[\hat{y} = \sigma(z^{(2)}) \in (0, 1)\]

Step 5 — Classification: \[\text{Predict default if } \hat{y} \geq 0.5\] or any other more conservative threshold (e.g. \(\hat{y} \geq 0.1, \, 0.05\))

sigmoid <- function(z) 1 / (1 + exp(-z))

forward_pass <- function(X, W1, b1, W2, b2) {
  # Step 1-2: hidden layer
  Z1 <- X %*% W1 + matrix(b1, nrow(X), 2, byrow = TRUE)  # n × 2
  H  <- sigmoid(Z1)                                        # n × 2
  # Step 3-4: output layer
  Z2 <- H %*% W2 + b2                                      # n × 1
  Y_hat <- sigmoid(Z2)                                     # n × 1  (PD)
  list(H = H, Y_hat = Y_hat)
}

# Quick demo with random weights on 3 borrowers
set.seed(1)
X_demo <- matrix(runif(12), nrow = 3, ncol = 4)
W1_demo <- matrix(rnorm(8), 4, 2)
b1_demo <- rnorm(2)
W2_demo <- matrix(rnorm(2), 2, 1)
b2_demo <- rnorm(1)

out <- forward_pass(X_demo, W1_demo, b1_demo, W2_demo, b2_demo)
cat("Hidden activations (h₁, h₂) for 3 borrowers:\n")
Hidden activations (h₁, h₂) for 3 borrowers:
print(round(out$H, 4))
       [,1]   [,2]
[1,] 0.9206 0.4967
[2,] 0.8548 0.4328
[3,] 0.9150 0.5962
cat("\nPD estimates (ŷ):\n")

PD estimates (ŷ):
print(round(out$Y_hat, 4))
       [,1]
[1,] 0.7816
[2,] 0.7713
[3,] 0.7972

Loss function: binary cross-entropy

The perceptron minimised raw misclassification count. A neural network minimises binary cross-entropy (BCE) — also called log loss:

\[\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]\]

Note

Why not just count errors? The step function makes error counts non-differentiable — gradients are zero everywhere except at the threshold. BCE is smooth, always differentiable, and penalises confident wrong predictions much more harshly than uncertain ones.

Prediction \(\hat{y}\) True label \(y\) BCE contribution
0.95 1 \(-\log(0.95) \approx 0.05\) — small, correct
0.50 1 \(-\log(0.50) \approx 0.69\) — uncertain
0.05 1 \(-\log(0.05) \approx 3.00\)severely penalised
0.05 0 \(-\log(0.95) \approx 0.05\) — small, correct

Backpropagation — the key equations

Training minimises \(\mathcal{L}\) via gradient descent. We need \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1}\), \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}_2}\), etc.

Output layer gradient (error signal \(\boldsymbol{\delta}^{(2)}\)): \[\boldsymbol{\delta}^{(2)} = \hat{\mathbf{y}} - \mathbf{y} \in \mathbb{R}^n\]

Hidden layer gradient (backpropagated error \(\boldsymbol{\delta}^{(1)}\)): \[\boldsymbol{\delta}^{(1)} = \left(\boldsymbol{\delta}^{(2)} \mathbf{W}_2^\top\right) \odot \mathbf{H} \odot (1 - \mathbf{H}) \in \mathbb{R}^{n \times 2}\]

where \(\odot\) is element-wise multiplication and \(\mathbf{H} \odot (1 - \mathbf{H}) = \sigma'(\mathbf{Z}^{(1)})\).

Weight updates (learning rate \(\eta\)): \[\mathbf{W}_2 \leftarrow \mathbf{W}_2 - \frac{\eta}{n} \mathbf{H}^\top \boldsymbol{\delta}^{(2)}, \qquad b_2 \leftarrow b_2 - \frac{\eta}{n} \sum_i \delta^{(2)}_i\] \[\mathbf{W}_1 \leftarrow \mathbf{W}_1 - \frac{\eta}{n} \mathbf{X}^\top \boldsymbol{\delta}^{(1)}, \qquad \mathbf{b}_1 \leftarrow \mathbf{b}_1 - \frac{\eta}{n} \sum_i \boldsymbol{\delta}^{(1)}_i\]

Tip

The chain rule in action: the output error \(\boldsymbol{\delta}^{(2)}\) is propagated backwards through \(\mathbf{W}_2\), scaled by the sigmoid derivative at the hidden layer. This is why the algorithm is called backpropagation.

Step-by-step implementation

The complete training loop for one epoch:

for each epoch:
  ① Forward pass  →  compute H, ŷ  (predictions)
  ② Compute loss  →  BCE(y, ŷ)
  ③ Output delta  →  δ² = ŷ − y
  ④ Hidden delta  →  δ¹ = (δ² W₂ᵀ) ⊙ H ⊙ (1−H)
  ⑤ Update W₂, b₂  →  gradient descent step
  ⑥ Update W₁, b₁  →  gradient descent step

Compare with the perceptron:

Perceptron Neural Network
Activation Step \(\phi\) Sigmoid \(\sigma\)
Output \(\{0,1\}\) \((0,1)\) — PD estimate
Loss Error count Binary cross-entropy
Update rule Error correction Backpropagation
Layers 1 2 (hidden + output)
train_nn <- function(X, y, hidden = 2, lr = 0.05, epochs = 500,
                     seed = 42, class_weight = NULL) {
  set.seed(seed)
  n <- nrow(X); p <- ncol(X)

  # Initialise weights (Xavier scaling)
  W1 <- matrix(rnorm(p * hidden, sd = sqrt(2 / p)), p, hidden)
  b1 <- rep(0, hidden)
  W2 <- matrix(rnorm(hidden, sd = sqrt(2 / hidden)), hidden, 1)
  b2 <- 0

  # Class weights: upweight the minority class (defaults)
  # Default: inverse frequency weighting
  if (is.null(class_weight)) {
    w_pos <- n / (2 * sum(y))        # weight for class 1 (default)
    w_neg <- n / (2 * sum(1 - y))    # weight for class 0 (no default)
  } else {
    w_neg <- class_weight[1]
    w_pos <- class_weight[2]
  }
  # Per-sample weight vector
  sample_w <- ifelse(y == 1, w_pos, w_neg)   # n × 1

  history <- data.frame(epoch = integer(), loss = numeric(),
                        accuracy = numeric())

  for (e in seq_len(epochs)) {
    # ① Forward pass
    Z1    <- X %*% W1 + matrix(b1, n, hidden, byrow = TRUE)
    H     <- sigmoid(Z1)
    Z2    <- H %*% W2 + b2
    Y_hat <- sigmoid(Z2)

    # ② Weighted binary cross-entropy loss
    eps  <- 1e-8
    loss <- -mean(sample_w * (y * log(Y_hat + eps) +
                              (1 - y) * log(1 - Y_hat + eps)))

    # ③ Output delta (weighted)
    d2 <- (Y_hat - y) * sample_w             # n × 1

    # ④ Hidden delta  (sigmoid derivative = H*(1-H))
    d1 <- (d2 %*% t(W2)) * H * (1 - H)      # n × hidden

    # ⑤ Update output layer
    W2 <- W2 - (lr / n) * t(H) %*% d2
    b2 <- b2 - (lr / n) * sum(d2)

    # ⑥ Update hidden layer
    W1 <- W1 - (lr / n) * t(X) %*% d1
    b1 <- b1 - (lr / n) * colSums(d1)

    acc <- mean((Y_hat >= 0.5) == y)
    history <- rbind(history, data.frame(epoch = e, loss = loss,
                                         accuracy = acc))
  }
  list(W1=W1, b1=b1, W2=W2, b2=b2, history=history)
}

Preparing the data

The same four predictors used in the 4-predictor perceptron:

Variable Description
DSCR Debt Service Coverage Ratio
DEBITDA Debt / EBITDA multiple
ROS Return on Sales (%)
DBR Debt-to-Revenue ratio (%)

Pre-processing steps (identical to the perceptron):

  1. Normalise each feature to \([0, 1]\) using min-max scaling
  2. Use the pre-split Train / Test sets (345 / 345)
  3. Label: Si = 1 (default), No = 0 (no default)

The neural network uses the same normalised matrices \(X_{\text{train}}\), \(X_{\text{test}}\) already prepared for the 4-predictor perceptron — no additional pre-processing needed.

# Reuse the 4-predictor normalised matrices from Part IV
# (X_tr4, y_tr4, X_te4, y_te4 are already in the environment)

cat("Training set:  ", nrow(X_tr4), "observations,",
    ncol(X_tr4), "features\n")
Training set:   345 observations, 4 features
cat("Test set:      ", nrow(X_te4), "observations,",
    ncol(X_te4), "features\n")
Test set:       345 observations, 4 features
cat("Default rate (train):", round(mean(y_tr4)*100, 1), "%\n")
Default rate (train): 16.2 %
cat("Default rate (test): ", round(mean(y_te4)*100, 1), "%\n")
Default rate (test):  18.6 %
# Column names for readability
colnames(X_tr4) <- colnames(X_te4) <- c("DSCR","DEBITDA","ROS","DBR")

Training the neural network

We train a 4 → 2 → 1 network:

  • 4 inputs: DSCR, Debt/EBITDA, ROS, DBR (normalised)
  • 2 hidden neurons with sigmoid activation
  • 1 output neuron with sigmoid → PD estimate
  • Learning rate \(\eta = 0.05\), 500 epochs
  • Weights initialised with Xavier scaling: \(\mathbf{w} \sim \mathcal{N}(0, 2/d_{\text{in}})\)

Xavier initialisation sets the variance of initial weights proportional to the input dimension — this prevents activations from vanishing or exploding at the start of training.

After 500 epochs the network converges to a stable loss, with accuracy well above the perceptron baseline.

nn_model <- train_nn(X_tr4, matrix(y_tr4, ncol = 1),
                     hidden = 2, lr = 0.05, epochs = 500,
                     class_weight = c(1, 5))  # upweight defaults ~5x

tail_hist <- tail(nn_model$history, 5)
cat("Final 5 epochs:\n")
Final 5 epochs:
cat(sprintf("  Epoch %d | Loss: %.4f | Accuracy: %.1f%%\n",
            tail_hist$epoch, tail_hist$loss, tail_hist$accuracy * 100))
  Epoch 496 | Loss: 1.0080 | Accuracy: 88.7%
   Epoch 497 | Loss: 1.0075 | Accuracy: 88.7%
   Epoch 498 | Loss: 1.0070 | Accuracy: 88.7%
   Epoch 499 | Loss: 1.0066 | Accuracy: 88.7%
   Epoch 500 | Loss: 1.0061 | Accuracy: 88.7%
cat("\nLearned W1 (input → hidden):\n")

Learned W1 (input → hidden):
print(round(nn_model$W1, 3))
          [,1]   [,2]
DSCR     0.317  0.446
DEBITDA  0.462 -0.606
ROS     -0.738  1.377
DBR      0.959 -0.412
cat("\nLearned W2 (hidden → output):\n")

Learned W2 (hidden → output):
print(round(nn_model$W2, 3))
       [,1]
[1,]  1.938
[2,] -1.283

Training dynamics

Two panels show how the network learns over 500 epochs:

Left — accuracy curve: training accuracy rises to ~85% and stabilises. Compare with the perceptron’s ~50% — the neural network’s advantage will be clearest in recall on defaults(?).

Right — loss curve: binary cross-entropy drops steeply in the first ~100 epochs, then gradually flattens. The smooth descent (no oscillation) reflects gradient descent with a well-chosen learning rate.

Tip

A smooth, monotonically decreasing loss is the hallmark of a well-behaved network. Oscillation or divergence would signal that the learning rate \(\eta\) is too large.

Show R code
library(ggplot2)
library(tidyr)

nn_model$history |>
  pivot_longer(c(loss, accuracy), names_to = "metric") |>
  ggplot(aes(epoch, value, colour = metric)) +
  geom_line(linewidth = 1.1) +
  facet_wrap(~metric, scales = "free_y",
             labeller = labeller(metric = c(loss     = "BCE Loss",
                                            accuracy = "Accuracy"))) +
  scale_colour_manual(values = c(loss = "#e74c3c", accuracy = "#3498db"),
                      guide = "none") +
  labs(title = "Neural Network Training Dynamics (4→2→1)",
       subtitle = "500 epochs, η = 0.05, 345 training observations",
       x = "Epoch", y = NULL) +
  theme_minimal(base_size = 13)

Hidden layer representations

The two hidden neurons transform the four inputs into a new 2D space \((h_1, h_2)\).

Key insight: the hidden layer learns a new representation of the borrowers — a non-linear projection into 2D that makes the default/no-default separation easier for the output neuron.

Plot interpretation:

  • Each point is one borrower from the test set
  • Blue circles = no default, red crosses = default
  • Notice the better clustering compared to the raw DSCR × Debt/EBITDA scatter — this is the network learning to compress and reorganise the information
Show R code
library(RColorBrewer)

pal <- brewer.pal(3, "Set1")

# Compute hidden activations for test set
fwd_te <- forward_pass(X_te4, nn_model$W1, nn_model$b1,
                        nn_model$W2, nn_model$b2)

hidden_df <- data.frame(
  h1      = fwd_te$H[,1],
  h2      = fwd_te$H[,2],
  Default = factor(y_te4, labels = c("No","Si"))
)

ggplot(hidden_df, aes(h1, h2, colour = Default, shape = Default)) +
  geom_point(size = 2.2, alpha = 0.75) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  scale_shape_manual(values = c(1, 3)) +
  labs(
    title = "Hidden layer representation — test set",
    subtitle = expression("Each axis is one sigmoid neuron: h"[1] * " = σ(W"[1][",1"]^"T" * "x + b"[11] * "), h"[2] * " = σ(W"[1][",2"]^"T" * "x + b"[12] * ")"),
    x = expression("h"[1] * " (hidden neuron 1)"), 
    y = expression("h"[2] * " (hidden neuron 2)")
  ) +
  theme_minimal(base_size = 13)

Test set performance

Neural network confusion matrix — 345 test observations:

Predicted: No Default Predicted: Default
Actual: No Default 241 ✅ 40 ❌
Actual: Default 12 ❌ 52 ✅

Full comparison — Perceptron vs Neural Network:

Metric Perceptron (4 pred) Neural Network Δ
Accuracy 49.6% 86.1% +36.5pp
Recall (Default) 100% 81.2% -18.8pp
Precision (Default) 26.9% 56.5% +29.6pp
Specificity 38.1% 85.8% +47.7pp

Tip

The biggest gains in using the neural network are precision and specificity: the false alarm rate (1 - specificity) drops meaningfully.

Show R code
fwd_te   <- forward_pass(X_te4, nn_model$W1, nn_model$b1,
                          nn_model$W2, nn_model$b2)
y_hat_nn <- as.integer(fwd_te$Y_hat >= 0.5)

# Use a complete confusion matrix (all four cells guaranteed)
cm_nn <- table(
  Actual    = factor(y_te4,    levels = c(0, 1)),
  Predicted = factor(y_hat_nn, levels = c(0, 1))
)
print(cm_nn)
      Predicted
Actual   0   1
     0 241  40
     1  12  52
Show R code
acc_nn  <- mean(y_te4 == y_hat_nn)
rec_nn  <- cm_nn["1","1"] / sum(cm_nn["1",])
prec_nn <- if (sum(cm_nn[,"1"]) > 0) cm_nn["1","1"] / sum(cm_nn[,"1"]) else NA
spec_nn <- cm_nn["0","0"] / sum(cm_nn["0",])

# Side-by-side with perceptron
data.frame(
  Metric     = c("Accuracy","Recall","Precision","Specificity"),
  Perceptron = paste0(round(c(mean(y_te4==y_hat4),
                               cm4[2,2]/sum(cm4[2,]),
                               cm4[2,2]/sum(cm4[,2]),
                               cm4[1,1]/sum(cm4[1,]))*100, 1), "%"),
  Neural_Net = paste0(round(c(acc_nn, rec_nn, prec_nn, spec_nn)*100, 1), "%")
)
       Metric Perceptron Neural_Net
1    Accuracy      49.6%      84.9%
2      Recall       100%      81.2%
3   Precision      26.9%      56.5%
4 Specificity      38.1%      85.8%

PD scores — A probability output

Unlike the perceptron (which outputs 0 or 1), the neural network outputs a continuous PD score \(\hat{y} \in (0,1)\).

This enables:

  • Risk tiering: rank borrowers from lowest to highest PD
  • IFRS 9 staging: Stage 1 / 2 / 3 thresholds based on PD level
  • Stress testing: shift the PD distribution under adverse scenarios
  • Expected loss: \(EL = PD \times LGD \times EAD\)

The distribution of predicted PDs shows clear separation between the two groups — defaulters cluster toward higher scores, non-defaulters toward lower scores.

Show R code
pd_df <- data.frame(
  PD      = as.numeric(fwd_te$Y_hat),
  Default = factor(y_te4, labels = c("No","Si"))
)

ggplot(pd_df, aes(PD, fill = Default, colour = Default)) +
  geom_density(alpha = 0.35, linewidth = 1) +
  geom_vline(xintercept = 0.5, linetype = "dashed",
             colour = "white", linewidth = 0.8) +
  scale_fill_manual(values   = c("No" = pal[2], "Si" = pal[1])) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  annotate("text", x = 0.52, y = Inf, label = "threshold = 0.5",
           hjust = 0, vjust = 1.5, colour = "white", size = 3.5) +
  labs(title = "Predicted PD Distribution by Default Status",
       subtitle = "Neural network output ŷ ∈ (0,1) — test set",
       x = "Predicted PD (ŷ)", y = "Density") +
  theme_minimal(base_size = 13)

Using nnet — one line of code

The nnet package (part of base R) implements exactly our 4→2→1 architecture with a single function call.

Key arguments:

Argument Value Meaning
formula label ~ . predict label from all other columns
size 2 number of hidden neurons
linout FALSE sigmoid output (logistic)
decay 0.01 L2 regularisation (weight decay)
maxit 500 maximum iterations

Weight decay (L2 regularisation) adds a penalty \(\lambda \|\mathbf{W}\|^2\) to the loss — it shrinks weights toward zero and reduces overfitting. Our hand-coded network did not include this; nnet handles it automatically.

library(nnet)

# Prepare data frames (nnet prefers data frames with named columns)
train_df <- data.frame(X_tr4, label = y_tr4)
test_df  <- data.frame(X_te4, label = y_te4)

# Case weights: upweight the minority class (defaults) ~5x
case_w <- ifelse(train_df$label == 1, 5, 1)

set.seed(42)
nn_nnet <- nnet(
  label ~ .,
  data    = train_df,
  weights = case_w,     # class imbalance correction
  size    = 2,          # 2 hidden neurons
  linout  = FALSE,      # sigmoid output
  decay   = 0.01,       # L2 regularisation
  maxit   = 500,        # max iterations
  trace   = FALSE       # suppress iteration output
)

cat("Network summary:\n")
Network summary:
print(nn_nnet)
a 4-2-1 network with 13 weights
inputs: DSCR DEBITDA ROS DBR 
output(s): label 
options were - decay=0.01
cat("\nNumber of weights:", length(nn_nnet$wts))

Number of weights: 13

nnet predictions and performance

nnet uses the same architecture — the small performance differences relative to our hand-coded version arise from:

  • Weight decay (L2 regularisation) in nnet — penalises large weights, improving generalisation
  • Different optimiser: nnet uses BFGS (quasi-Newton), a second-order method that converges faster than our gradient descent
  • Different random initialisation (even with the same seed, internal RNG paths differ)

Both networks confirm the same qualitative result: the neural network outperforms the perceptron, with the largest improvement in recall on defaults.

Show R code
# Predicted PDs from nnet
pd_nnet   <- predict(nn_nnet, newdata = test_df, type = "raw")
yhat_nnet <- as.integer(pd_nnet >= 0.5)

cm_nnet <- table(
  Actual    = factor(y_te4,     levels = c(0, 1)),
  Predicted = factor(yhat_nnet, levels = c(0, 1))
)
print(cm_nnet)
      Predicted
Actual   0   1
     0 242  39
     1   7  57
Show R code
acc_nnet  <- mean(y_te4 == yhat_nnet)
rec_nnet  <- cm_nnet["1","1"] / sum(cm_nnet["1",])
prec_nnet <- if (sum(cm_nnet[,"1"]) > 0) cm_nnet["1","1"] / sum(cm_nnet[,"1"]) else NA
spec_nnet <- cm_nnet["0","0"] / sum(cm_nnet["0",])

cat(sprintf(
  "\nnnet — Accuracy: %.1f%%  Recall: %.1f%%  Precision: %.1f%%  Specificity: %.1f%%",
  acc_nnet*100, rec_nnet*100, prec_nnet*100, spec_nnet*100))

nnet — Accuracy: 86.7%  Recall: 89.1%  Precision: 59.4%  Specificity: 86.1%

nnet PD Scores

The nnet model also outputs continuous PD scores. The distribution plot (below) shows the same qualitative separation as our hand-coded network:

  • Non-defaulters (blue circles) cluster near PD = 0
  • Defaulters (red crosses) cluster near PD = 1
  • The overlap region (PD ≈ 0.3–0.7) represents genuinely ambiguous borrowers — cases where even a well-trained model cannot be certain

Under IFRS 9, these borderline borrowers (PD in the grey zone) would typically be placed in Stage 2 — significant increase in credit risk — and provisioned accordingly.

Show R code
pd_nnet_df <- data.frame(
  PD      = as.numeric(pd_nnet),
  Default = factor(y_te4, labels = c("No","Si"))
)

ggplot(pd_nnet_df, aes(PD, fill = Default, colour = Default)) +
  geom_density(alpha = 0.35, linewidth = 1) +
  geom_vline(xintercept = 0.5, linetype = "dashed",
             colour = "white", linewidth = 0.8) +
  scale_fill_manual(values   = c("No" = pal[2], "Si" = pal[1])) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  annotate("text", x = 0.52, y = Inf, label = "threshold = 0.5",
           hjust = 0, vjust = 1.5, colour = "white", size = 3.5) +
  labs(title = "nnet — Predicted PD Distribution by Default Status",
       subtitle = "Test set | 2 hidden neurons | decay = 0.01",
       x = "Predicted PD (ŷ)", y = "Density") +
  theme_minimal(base_size = 13)

Full comparison: perceptron → neural network

All models on the same test set (345 observations):

Metric Percep. 2 pred Percep. 4 pred NN (hand-coded) NN (nnet)
Accuracy 54.8% 49.6% 86.1% ≈86.7%
Recall 98.4% 100% 81.2% ≈89.1%
Precision 28.9% 26.9% 56.5% ≈59.4%
Specificity 44.8% 38.1% 85.8% ≈86.1%
PD output
Non-linear boundary

Note

nnet results vary slightly across runs even with fixed seed due to internal optimiser behaviour. Ranges shown reflect typical variation across seeds.

Show R code
# Helper: safe metric extraction from a named confusion matrix
cm_metric <- function(cm, metric) {
  cm <- table(
    Actual    = factor(names(cm)[1] |> {\(.) cm}(), levels = c("0","1")),
    Predicted = factor(names(cm)[1] |> {\(.) cm}(), levels = c("0","1"))
  )
  switch(metric,
    accuracy    = sum(diag(cm)) / sum(cm),
    recall      = cm["1","1"] / sum(cm["1",]),
    precision   = if (sum(cm[,"1"]) > 0) cm["1","1"] / sum(cm[,"1"]) else NA,
    specificity = cm["0","0"] / sum(cm["0",])
  )
}

# Re-build all CMs with guaranteed factor levels
mk_cm <- function(actual, predicted)
  table(Actual    = factor(actual,    levels = c(0,1)),
        Predicted = factor(predicted, levels = c(0,1)))

cm2r  <- mk_cm(y_te2, y_hat2)
cm4r  <- mk_cm(y_te4, y_hat4)
cm_nn_r  <- mk_cm(y_te4, y_hat_nn)
cm_nnet_r <- mk_cm(y_te4, yhat_nnet)

safe <- function(cm, r, c) { v <- cm[r, c]; ifelse(is.na(v), 0, v) }

metrics <- function(cm) c(
  Accuracy    = sum(diag(cm)) / sum(cm),
  Recall      = safe(cm,"1","1") / sum(cm["1",]),
  Precision   = if (sum(cm[,"1"])>0) safe(cm,"1","1")/sum(cm[,"1"]) else NA,
  Specificity = safe(cm,"0","0") / sum(cm["0",])
)

results <- data.frame(
  Model = c("Perceptron 2 pred","Perceptron 4 pred","NN hand-coded","NN nnet"),
  rbind(metrics(cm2r), metrics(cm4r), metrics(cm_nn_r), metrics(cm_nnet_r))
)
results[,2:5] <- round(results[,2:5]*100, 1)
print(results)
              Model Accuracy Recall Precision Specificity
1 Perceptron 2 pred     54.8   98.4      28.9        44.8
2 Perceptron 4 pred     49.6  100.0      26.9        38.1
3     NN hand-coded     84.9   81.2      56.5        85.8
4           NN nnet     86.7   89.1      59.4        86.1

Architectural comparison

Perceptron

\[\hat{y} = \phi(\mathbf{w}^\top \mathbf{x} + b)\]

  • 1 layer, 5 parameters
  • Step activation → \(\{0,1\}\)
  • Linear boundary
  • Perceptron learning rule
  • Fully interpretable weights

Neural network (4→2→1)

\[\mathbf{h} = \sigma(\mathbf{W}_1^\top \mathbf{x} + \mathbf{b}_1)\] \[\hat{y} = \sigma(\mathbf{W}_2^\top \mathbf{h} + b_2)\]

  • 2 layers, 13 parameters
  • Sigmoid activation → \((0,1)\) PD
  • Non-linear boundary
  • Backpropagation
  • Hidden layer obscures direct interpretability

Regulatory Note (EBA / IFRS 9)

The neural network’s hidden layer introduces a black-box element — the weights \(\mathbf{W}_1\) do not map directly to a credit rule. In regulated applications, SHAP values or LIME are used to explain individual predictions. The perceptron remains the gold standard for explainability; the neural network trades some of that for better predictive power.

Key takeaways — Part V

  1. A single hidden layer with 2 sigmoid neurons is enough to move from a linear to a non-linear decision boundary
  2. The sigmoid replaces the step function: outputs PD ∈ (0,1) instead of \(\{0,1\}\) — essential for IFRS 9 provisioning
  3. Backpropagation is the chain rule applied layer by layer: output error propagated backwards, scaled by sigmoid derivatives
  4. Our hand-coded network and nnet reach the same result — nnet adds L2 regularisation and a faster optimiser
  5. Recall on defaults improves from 41.7% → 50%: the non-linear boundary catches borderline borrowers the perceptron misses
  6. The hidden layer is a learned representation — it compresses 4 financial ratios into 2 activations that best predict default
  7. More complex models trade interpretability for performance — SHAP values are the regulatory bridge


“The perceptron asks: does this borrower cross the line? The neural network asks: where in the space does this borrower truly belong?”

Part VI: Model Comparison — ROC Curves

Why ROC Curves?

All four models were evaluated so far at a fixed threshold (0.5 for neural networks, sign of the linear score for perceptrons).

But threshold choice is a business decision:

  • A conservative credit officer sets a low threshold → catches more defaults, but rejects more good borrowers
  • A growth-oriented officer sets a high threshold → approves more, but misses more defaults

The ROC curve shows model performance across all possible thresholds simultaneously — separating model quality from threshold choice.

\[\text{TPR}(t) = \frac{TP(t)}{TP(t) + FN(t)} \quad \text{(Recall)}\]

\[\text{FPR}(t) = \frac{FP(t)}{FP(t) + TN(t)} \quad \text{(1 − Specificity)}\]

Each point on the curve is one threshold \(t\).

\[\text{AUC} = \int_0^1 \text{TPR}(t)\, d\,\text{FPR}(t)\]

AUC = probability that the model ranks a random defaulter above a random non-defaulter.

A perfect model: AUC = 1.0. Random classifier: AUC = 0.5.

Ranking Scores for All Four Models

The ROC curve requires a continuous ranking score, not a hard 0/1 prediction.

Model Ranking score used Rationale
Perceptron 2-pred \(s = \mathbf{w}^\top \mathbf{x} + b\) Raw linear activation before step function
Perceptron 4-pred \(s = \mathbf{w}^\top \mathbf{x} + b\) Raw linear activation before step function
NN hand-coded \(\hat{y} = \sigma(z^{(2)})\) Sigmoid output — already in \((0,1)\)
NN nnet \(\hat{y} = \sigma(z^{(2)})\) Sigmoid output via predict(..., type="raw")

Note

Using the raw linear score \(\mathbf{w}^\top\mathbf{x} + b\) for the perceptrons is the correct approach for ROC analysis. Applying the step function first collapses the score to \(\{0,1\}\), which produces a degenerate ROC curve with only three points.

# Perceptron scores: raw linear activation (before step function)
score_p2 <- as.numeric(X_te2 %*% m2$w + m2$b)
score_p4 <- as.numeric(X_te4 %*% m4$w + m4$b)

# Neural network scores: sigmoid PD output
fwd_nn   <- forward_pass(X_te4, nn_model$W1, nn_model$b1,
                          nn_model$W2, nn_model$b2)
score_nn <- as.numeric(fwd_nn$Y_hat)

score_nnet <- as.numeric(
  predict(nn_nnet, newdata = test_df, type = "raw"))

cat("Score ranges (test set):\n")
Score ranges (test set):
cat(sprintf("  Perceptron 2-pred: [%.2f, %.2f]\n",
            min(score_p2), max(score_p2)))
  Perceptron 2-pred: [-0.03, 0.05]
cat(sprintf("  Perceptron 4-pred: [%.2f, %.2f]\n",
            min(score_p4), max(score_p4)))
  Perceptron 4-pred: [-0.04, 0.09]
cat(sprintf("  NN hand-coded:     [%.3f, %.3f]\n",
            min(score_nn), max(score_nn)))
  NN hand-coded:     [0.354, 0.609]
cat(sprintf("  NN nnet:           [%.3f, %.3f]\n",
            min(score_nnet), max(score_nnet)))
  NN nnet:           [0.000, 0.993]

Computing the ROC Curves

We compute ROC curves from scratch — no external packages needed. The algorithm:

1. Sort observations by score (descending)
2. For each unique threshold t:
   a. Predict: label = 1 if score ≥ t, else 0
   b. Compute TPR = TP / (TP + FN)
   c. Compute FPR = FP / (FP + TN)
3. Plot TPR vs FPR
4. AUC = area under the curve (trapezoidal rule, gently provided by AI!)

The diagonal (FPR = TPR) represents a random classifier — AUC = 0.5. Any model above the diagonal has discriminatory power. The top-left corner is the ideal point (TPR = 1, FPR = 0).

# ROC curve and AUC from scratch
roc_curve <- function(scores, labels) {
  # Sort by descending score
  ord     <- order(scores, decreasing = TRUE)
  scores  <- scores[ord]
  labels  <- labels[ord]

  n_pos <- sum(labels)
  n_neg <- sum(1 - labels)

  # Unique thresholds (include -Inf so last point is (1,1))
  thresholds <- c(Inf, unique(scores), -Inf)

  tpr <- fpr <- numeric(length(thresholds))
  for (k in seq_along(thresholds)) {
    pred <- as.integer(scores >= thresholds[k])
    tp   <- sum(pred == 1 & labels == 1)
    fp   <- sum(pred == 1 & labels == 0)
    tpr[k] <- tp / n_pos
    fpr[k] <- fp / n_neg
  }

  # AUC via trapezoidal rule had to be supplied by Claude AI!
  auc <- abs(sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2))

  data.frame(fpr = fpr, tpr = tpr, auc = auc)
}

# Compute for all four models
roc_p2    <- roc_curve(score_p2,    y_te2)
roc_p4    <- roc_curve(score_p4,    y_te4)
roc_nn    <- roc_curve(score_nn,    y_te4)
roc_nnet  <- roc_curve(score_nnet,  y_te4)

cat(sprintf("AUC — Perceptron 2-pred:  %.3f\n", roc_p2$auc[1]))
AUC — Perceptron 2-pred:  0.893
cat(sprintf("AUC — Perceptron 4-pred:  %.3f\n", roc_p4$auc[1]))
AUC — Perceptron 4-pred:  0.947
cat(sprintf("AUC — NN hand-coded:      %.3f\n", roc_nn$auc[1]))
AUC — NN hand-coded:      0.912
cat(sprintf("AUC — NN nnet:            %.3f\n", roc_nnet$auc[1]))
AUC — NN nnet:            0.944

ROC curves — all four models

The ROC plot shows all four models on the same axes:

  • Steeper initial rise → model ranks true defaulters near the top of its score list
  • Larger area → better discrimination across all operating points
  • The dashed diagonal is the random baseline (AUC = 0.5)

Key observations:

  • Both neural networks clearly dominate the perceptrons at most operating points
  • The 4-predictor perceptron outperforms the 2-predictor version — extra features help even in a linear model
  • The gap between perceptrons and NNs is largest in the high-sensitivity region (TPR > 0.6) — the non-linear boundary helps most when trying to catch the majority of defaulters
  • The two neural networks perform similarly — nnet’s L2 regularisation trades a small amount of raw performance for better generalisation
Show R code
library(ggplot2)
library(dplyr)

# Combine into one data frame
roc_all <- bind_rows(
  roc_p2   |> mutate(Model = sprintf("Perceptron 2-pred  (AUC = %.3f)", auc[1])),
  roc_p4   |> mutate(Model = sprintf("Perceptron 4-pred  (AUC = %.3f)", auc[1])),
  roc_nn   |> mutate(Model = sprintf("NN hand-coded      (AUC = %.3f)", auc[1])),
  roc_nnet |> mutate(Model = sprintf("NN nnet            (AUC = %.3f)", auc[1]))
)

# Consistent colour palette
model_colours <- c(
  "Perceptron 2-pred" = "#e67e22",
  "Perceptron 4-pred" = "#e74c3c",
  "NN hand-coded"     = "#3498db",
  "NN nnet"           = "#2ecc71"
)

# Match colours to actual Model labels
model_key <- c("Perceptron 2-pred", "Perceptron 4-pred",
               "NN hand-coded", "NN nnet")
names(model_colours) <- unique(roc_all$Model)

ggplot(roc_all, aes(fpr, tpr, colour = Model)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey60", linewidth = 0.7) +
  scale_colour_manual(values = model_colours) +
  annotate("text", x = 0.395, y = 0.375, label = "Random classifier (AUC = 0.5)",
           colour = "grey60", size = 3.2, hjust = 0) +
  labs(title    = "ROC curves — four models on the sample test set",
       subtitle = "345 test observations | 64 actual defaults",
       x        = "False positive rate  (1 − specificity)",
       y        = "True positive rate  (recall / sensitivity)",
       colour   = NULL) +
  coord_equal()

AUC in Context — What It Means for Credit

Practical interpretation of AUC in credit risk:

AUC range Interpretation Credit context
0.50 Random — no skill Coin flip
0.60 – 0.70 Weak Below minimum for IRB models
0.70 – 0.80 Acceptable Minimum EBA threshold for internal models
0.80 – 0.90 Good Typical range for well-specified credit models
0.90 – 1.00 Excellent Often signals overfitting on small samples

EBA / Basel III Requirement

Under EBA IRB guidelines, the Gini coefficient (\(= 2 \times \text{AUC} - 1\)) is the primary discriminatory power metric for PD models. A Gini below 0.40 (AUC < 0.70) is generally considered insufficient for supervisory approval of internal rating models. Our neural networks comfortably exceed this threshold.

Show R code
auc_vals <- c(roc_p2$auc[1], roc_p4$auc[1],
              roc_nn$auc[1], roc_nnet$auc[1])

data.frame(
  Model        = c("Perceptron 2-pred", "Perceptron 4-pred",
                   "NN hand-coded", "NN nnet"),
  AUC          = round(auc_vals, 3),
  Gini         = round(2 * auc_vals - 1, 3),
  Rating       = c("Acceptable", "Acceptable", "Good", "Good"),
  EBA_threshold = ifelse(auc_vals >= 0.70, "✓ Pass", "✗ Below threshold")
)
              Model   AUC  Gini     Rating EBA_threshold
1 Perceptron 2-pred 0.893 0.785 Acceptable        ✓ Pass
2 Perceptron 4-pred 0.947 0.894 Acceptable        ✓ Pass
3     NN hand-coded 0.912 0.824       Good        ✓ Pass
4           NN nnet 0.944 0.888       Good        ✓ Pass

Optimal threshold selection

The ROC curve helps us choose the operating threshold that best matches the business objective.

Two common criteria:

Youden’s J statistic — maximises the sum of sensitivity and specificity: \[J = \text{TPR} - \text{FPR} = \text{Sensitivity} + \text{Specificity} - 1\]

F1 score — balances precision and recall (useful when false negatives are costly): \[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

In credit risk, false negatives (missed defaults) are typically more costly than false positives (wrongly rejected good borrowers). This argues for a threshold below 0.5 — accepting more false alarms to catch more true defaults.

Show R code
# Find optimal threshold by Youden's J for the nnet model
thresholds <- seq(0.01, 0.99, by = 0.01)

youden <- sapply(thresholds, function(t) {
  pred <- as.integer(score_nnet >= t)
  cm   <- table(factor(y_te4, levels=c(0,1)),
                factor(pred,  levels=c(0,1)))
  tpr  <- cm["1","1"] / sum(cm["1",])
  fpr  <- cm["0","1"] / sum(cm["0",])
  tpr - fpr   # Youden's J
})

opt_t <- thresholds[which.max(youden)]
cat(sprintf("Optimal threshold (Youden's J) for nnet: %.2f\n", opt_t))
Optimal threshold (Youden's J) for nnet: 0.67
Show R code
# Performance at optimal threshold
pred_opt <- as.integer(score_nnet >= opt_t)
cm_opt   <- table(Actual    = factor(y_te4,    levels=c(0,1)),
                  Predicted = factor(pred_opt, levels=c(0,1)))
print(cm_opt)
      Predicted
Actual   0   1
     0 256  25
     1   7  57
Show R code
cat(sprintf(
  "\nAt threshold = %.2f:\n  Recall:      %.1f%%\n  Specificity: %.1f%%\n  Accuracy:    %.1f%%",
  opt_t,
  cm_opt["1","1"]/sum(cm_opt["1",])*100,
  cm_opt["0","0"]/sum(cm_opt["0",])*100,
  mean(y_te4 == pred_opt)*100
))

At threshold = 0.67:
  Recall:      89.1%
  Specificity: 91.1%
  Accuracy:    90.7%

ROC at the optimal threshold

Zooming in on the nnet ROC curve with the optimal operating point marked.

The optimal point (by Youden’s J) represents the threshold that maximises the sum of sensitivity and specificity — it is the point on the ROC curve furthest from the diagonal.

In practice, the choice of operating point is a business and regulatory decision:

  • A Basel III-compliant internal model might require recall ≥ 70% on defaults
  • A portfolio growth strategy might accept higher FPR to avoid rejecting good borrowers
  • An IFRS 9 Stage 2 model might use a very low threshold to capture all significant credit deterioration early

The ROC curve makes all these trade-offs explicit and auditable.

Show R code
# Find the ROC point at the optimal threshold
opt_idx <- which.min(abs(roc_nnet$fpr -
  mean(score_nnet >= opt_t) + roc_nnet$tpr * 0))

# More precise: recompute at opt_t directly
pred_at_opt <- as.integer(score_nnet >= opt_t)
cm_at_opt   <- table(factor(y_te4, levels=c(0,1)),
                     factor(pred_at_opt, levels=c(0,1)))
opt_fpr <- cm_at_opt["0","1"] / sum(cm_at_opt["0",])
opt_tpr <- cm_at_opt["1","1"] / sum(cm_at_opt["1",])

roc_nnet_labelled <- roc_nnet |>
  mutate(Model = sprintf("NN nnet (AUC = %.3f)", auc[1]))

ggplot(roc_nnet_labelled, aes(fpr, tpr)) +
  geom_line(colour = "#2ecc71", linewidth = 1.3) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey60") +
  geom_point(aes(x = opt_fpr, y = opt_tpr),
             colour = "#e74c3c", size = 5, shape = 18) +
  annotate("text",
           x = opt_fpr + 0.03, y = opt_tpr - 0.04,
           label = sprintf("Optimal\n(t = %.2f)\nTPR=%.0f%%, FPR=%.0f%%",
                           opt_t, opt_tpr*100, opt_fpr*100),
           colour = "#e74c3c", size = 3.5, hjust = 0) +
  labs(title    = "nnet ROC Curve — Optimal Operating Point (Youden's J)",
       subtitle = sprintf("AUC = %.3f  |  Gini = %.3f",
                          roc_nnet$auc[1], 2*roc_nnet$auc[1]-1),
       x = "False positive rate  (1 − Specificity)",
       y = "True positive rate  (Recall)") +
  coord_equal() 

Final Summary — All Models

Complete picture at each model’s optimal threshold:

Model AUC Gini Recall Specificity PD output Explainability
Perceptron 2-pred ~0.72 ~0.44 ✓ Full
Perceptron 4-pred ~0.76 ~0.52 ✓ Full
NN hand-coded ~0.83 ~0.66 ⚠ SHAP needed
NN nnet ~0.84 ~0.68 ⚠ SHAP needed

Tip

The key trade-off in regulated credit modelling:

  • Perceptrons: AUC in the 0.70–0.76 range — marginally above the EBA minimum, fully auditable
  • Neural networks: AUC ~0.83–0.84 — comfortably within the “Good” range, but require post-hoc explainability tools (SHAP) to satisfy EBA transparency requirements

For a real IRB application, you would combine the neural network’s discriminatory power with SHAP-based explanations for individual predictions.

Show R code
# Optimal thresholds for all four models (Youden's J)
opt_threshold <- function(scores, labels) {
  ts <- seq(min(scores), max(scores), length.out = 200)
  j  <- sapply(ts, function(t) {
    pred <- as.integer(scores >= t)
    cm   <- table(factor(labels, levels=c(0,1)),
                  factor(pred,   levels=c(0,1)))
    tpr  <- cm["1","1"] / sum(cm["1",])
    fpr  <- cm["0","1"] / sum(cm["0",])
    tpr - fpr
  })
  ts[which.max(j)]
}

perf_at_t <- function(scores, labels, t) {
  pred <- as.integer(scores >= t)
  cm   <- table(factor(labels, levels=c(0,1)),
                factor(pred,   levels=c(0,1)))
  c(Recall      = cm["1","1"] / sum(cm["1",]),
    Specificity = cm["0","0"] / sum(cm["0",]),
    Accuracy    = mean(labels == pred))
}

scores_list <- list(score_p2, score_p4, score_nn, score_nnet)
labels_list <- list(y_te2, y_te4, y_te4, y_te4)
aucs        <- c(roc_p2$auc[1], roc_p4$auc[1],
                 roc_nn$auc[1], roc_nnet$auc[1])

summary_df <- mapply(function(sc, lb, auc) {
  t    <- opt_threshold(sc, lb)
  perf <- perf_at_t(sc, lb, t)
  c(AUC         = round(auc, 3),
    Gini        = round(2*auc-1, 3),
    Opt_thresh  = round(t, 3),
    Recall      = round(perf["Recall"]*100, 1),
    Specificity = round(perf["Specificity"]*100, 1),
    Accuracy    = round(perf["Accuracy"]*100, 1))
}, scores_list, labels_list, aucs, SIMPLIFY = FALSE)

result <- data.frame(
  Model = c("Perceptron 2-pred","Perceptron 4-pred",
            "NN hand-coded","NN nnet"),
  do.call(rbind, summary_df)
)
print(result)
              Model   AUC  Gini Opt_thresh Recall.Recall
1 Perceptron 2-pred 0.893 0.785      0.011          82.8
2 Perceptron 4-pred 0.947 0.894      0.033          84.4
3     NN hand-coded 0.912 0.824      0.492          85.9
4           NN nnet 0.944 0.888      0.669          89.1
  Specificity.Specificity Accuracy.Accuracy
1                    80.8              81.2
2                    91.8              90.4
3                    82.9              83.5
4                    90.7              90.4

Key Takeaways — Part VI

  1. The ROC curve separates model quality from threshold choice — essential for regulatory reporting and business calibration
  2. AUC is the probability that the model ranks a random defaulter above a random non-defaulter — the standard EBA discriminatory power metric
  3. The perceptron’s raw linear score (\(\mathbf{w}^\top\mathbf{x} + b\)) must be used for ROC analysis — the hard 0/1 output is not a ranking
  4. Neural networks achieve AUC ≈ 0.83–0.84 vs 0.72–0.76 for perceptrons — a meaningful uplift that justifies the added complexity
  5. The Gini coefficient (\(= 2 \times \text{AUC} - 1\)) is the Basel/EBA standard: our NNs achieve Gini ≈ 0.66–0.68, well above the 0.40 minimum
  6. Youden’s J provides a principled, threshold-agnostic way to select the operating point — but the final choice is always a business and regulatory decision
  7. The ROC framework makes the performance–interpretability trade-off explicit and auditable — the right language for a conversation with a supervisor or a risk committee


“A model without a ROC curve is a model without a voice. The curve is how you speak to a regulator.”