A neural network primer in credit selection

From logic gates to corporate default prediction

Rodolfo Vanzini, CFA

2026-03-25

Part I: The Perceptron

What is a perceptron?

The CFA Institute’s monograph on AI in investment management by Joseph Simonian, PhD gave me the idea to develop this presentation.

A perceptron is the simplest neural network: a binary linear classifier. Proposed by Frank Rosenblatt (1958), it models a single biological neuron:

Receives weighted inputs
Sums them (the net input)
Fires (output = 1) if the sum exceeds a threshold

It is the building block of all modern neural networks.

\[\hat{y} = \phi\!\left(\mathbf{w}^\top \mathbf{x} + b\right)\]

where the Heaviside step function \(\phi\) is:

\[\phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}\]

x₁ ──→ w₁ ─┐
x₂ ──→ w₂ ─┤──→ Σ ──→ φ(z) ──→ ŷ ∈ {0,1}
            └─ b

The decision boundary

The perceptron partitions the input space with a hyperplane:

\[\mathbf{w}^\top \mathbf{x} + b = 0\]

Note

In 2D (\(x_1, x_2\)), the boundary is a line \(w_1 x_1 + w_2 x_2 + b = 0\), i.e. \[x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}\] Points on one side are classified as 1, on the other as 0.

The perceptron convergence theorem guarantees that the algorithm finds a separating hyperplane in a finite number of steps — if one exists.

Learning outcome	Condition
Converges to perfect classifier	Data is linearly separable
Oscillates, never converges	Data is not linearly separable

The learning rule

The perceptron updates weights only on mistakes:

\[\mathbf{w} \leftarrow \mathbf{w} - \eta\,(\hat{y} - y)\,\mathbf{x}\] \[b \leftarrow b - \eta\,(\hat{y} - y)\]

where \(\eta > 0\) is the learning rate.

Case	\(\hat{y}\)	\(y\)	Error \((\hat{y}-y)\)	Effect on \(\mathbf{w}\)
Correct	0	0	0	No update
Correct	1	1	0	No update
False positive	1	0	+1	\(\mathbf{w} \downarrow\) — neuron fires less
False negative	0	1	−1	\(\mathbf{w} \uparrow\) — neuron fires more

Tip

Key insight: the update is proportional to the input that caused the mistake. The neuron learns by correcting its own errors — one sample at a time.

Before applying the perceptron to real credit data, we illustrate it on the simplest possible problem: the logical AND function.

Two binary inputs \(x_1, x_2 \in \{0,1\}\), output \(y = x_1 \land x_2\):

\(x_1\)	\(x_2\)	\(y = x_1 \land x_2\)
0	0	0
1	0	0
0	1	0
1	1	1

This is linearly separable — a single line can separate the one positive case \((1,1)\) from the three negatives. The ideal boundary: \(x_1 + x_2 - 1.5 = 0\).

# The CFA Institute's monograph on AI by Joseph Simonian, PhD gave me this idea to understand
# better and share my insights about the perceptron and neural networks in investment management.
# Input space: all combinations of two binary inputs.
X <- as.matrix(expand.grid(x1 = c(0, 1), x2 = c(0, 1)))

# AND gate target
y <- apply(X, 1, function(x) as.numeric(x[1] & x[2]))

data.frame(x1 = X[,1], x2 = X[,2], y = y)

\[X = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}\]

Initialising weights

Context
R code

Weights and bias are initialised to random values drawn from the standard normal distribution.

The net input for each sample is then:

\[\mathbf{z} = X\mathbf{w} + b = \begin{bmatrix} b \\ w_1 + b \\ w_2 + b \\ w_1 + w_2 + b \end{bmatrix}\]

Before training, these random weights produce meaningless predictions — the learning algorithm will correct them iteratively.

set.seed(123) # this is to ensure reproducibility
weights <- matrix(rnorm(2), nrow = 2, ncol = 1)
bias    <- rnorm(1)

cat("Initial weights: w1 =", round(weights[1], 3),
    "  w2 =", round(weights[2], 3),
    "\nInitial bias:    b  =", round(bias, 3))

Initial weights: w1 = -0.56   w2 = -0.23 
Initial bias:    b  = 1.559

# Net input z = Xw + b
z <- X %*% weights + bias
cat("\n\nNet inputs z = Xw + b:\n")



Net inputs z = Xw + b:

print(round(z, 3))

      [,1]
[1,] 1.559
[2,] 0.998
[3,] 1.329
[4,] 0.768

# Initial (random) predictions
cat("\nInitial predictions (before training):",
    as.numeric(z >= 0))


Initial predictions (before training): 1 1 1 1

cat("\nTarget values:                        ", y)


Target values:                         0 0 0 1

The prediction function

Context
R code

Three steps, one line of algebra:

Step	Operation	Math
1. Weighted sum	\(z = \mathbf{w}^\top \mathbf{x} + b\)	Linear combination
2. Threshold	\(z \geq 0\)?	Heaviside \(\phi\)
3. Numeric output	TRUE → 1, FALSE → 0	\(\hat{y} \in \{0,1\}\)

In matrix form across all \(n\) samples simultaneously:

\[\hat{\mathbf{y}} = \mathbb{I}(X\mathbf{w} + b \geq 0) \in \{0,1\}^n\]

# Beautiful matrix algebra that in one line implements the Heaviside function
predict_perceptron <- function(X, w, b) {
  as.numeric((X %*% w + b) >= 0)
}

# Trace through with our AND gate data
z     <- X %*% weights + bias
fired <- z >= 0
preds <- as.numeric(fired)

data.frame(
  x1 = X[,1], x2 = X[,2],
  z  = round(z, 3),
  `z >= 0` = fired,
  `ŷ`      = preds,
  y        = y,
  correct  = preds == y,
  check.names = FALSE
)

  x1 x2     z z >= 0 ŷ y correct
1  0  0 1.559   TRUE 1 0   FALSE
2  1  0 0.998   TRUE 1 0   FALSE
3  0  1 1.329   TRUE 1 0   FALSE
4  1  1 0.768   TRUE 1 1    TRUE

The training algorithm

Context
R code

The fit function loops over all samples for a fixed number of epochs (complete passes through the data):

for each epoch:
  for each sample (xᵢ, yᵢ):
    ŷᵢ  ← φ(w⊤xᵢ + b)         # predict
    δᵢ  ← ŷᵢ − yᵢ              # error
    if δᵢ ≠ 0:
      w ← w − η · δᵢ · xᵢ     # update weights
      b ← b − η · δᵢ           # update bias
  if no errors this epoch → STOP (converged)

Early stopping: if an entire epoch passes with zero misclassifications, the data is perfectly separated and training terminates.

fit_perceptron <- function(X, y, w, b, lr = 0.1, epochs = 100) {
  history <- numeric(epochs)
  for (e in seq_len(epochs)) {
    err <- 0
    for (i in seq_len(nrow(X))) {
      v     <- sum(X[i,] * w) + b        # net input
      y_hat <- ifelse(v < 0, 0, 1)       # step function
      delta <- y_hat - y[i]              # error
      if (delta != 0) {
        err <- err + 1
        w   <- w - lr * delta * X[i,]   # weight update
        b   <- b - lr * delta            # bias update
      }
    }
    history[e] <- err
    if (err == 0) { cat("Converged at epoch:", e, "\n"); break }
  }
  list(w = w, b = b, history = history[1:e])
}

Training on the AND Gate

Context
R code

After training, the perceptron correctly classifies all four AND gate combinations.

Learned parameters (example, see R code):

Parameter	Value
\(w_1\)	+0.14
\(w_2\)	+0.17
\(b\)	−0.241

The implied decision boundary \(w_1 x_1 + w_2 x_2 + b = 0\) correctly places \((1,1)\) above the line (output = 1) and all other points below it (output = 0).

The AND gate is linearly separable, so the perceptron is guaranteed to converge.

set.seed(123)
w <- matrix(rnorm(2), 2, 1)
b <- rnorm(1)

model_and <- fit_perceptron(X, y, w, b, lr = 0.1, epochs = 100)

Converged at epoch: 18

cat("Learned weights: w1 =", round(model_and$w[1], 3),
    "  w2 =", round(model_and$w[2], 3),
    "\nLearned bias:    b  =", round(model_and$b, 3))

Learned weights: w1 = 0.14   w2 = 0.17 
Learned bias:    b  = -0.241

preds <- predict_perceptron(X, model_and$w, model_and$b)
data.frame(x1 = X[,1], x2 = X[,2], Target = y,
           Prediction = preds, Correct = y == preds)

  x1 x2 Target Prediction Correct
1  0  0      0          0    TRUE
2  1  0      0          0    TRUE
3  0  1      0          0    TRUE
4  1  1      1          1    TRUE

Decision boundary — AND gate

Context
R code

The perceptron finds a straight line separating the single positive case \((1,1)\) from the three negatives.

Blue region → predicted 0 (no fire)
Red region → predicted 1 (fire)
The line is the learned decision boundary \(w_1 x_1 + w_2 x_2 + b = 0\)

The AND gate is a toy problem chosen because it is guaranteed to be linearly separable. Real credit data is far more complex — but the same mathematical machinery applies.

Show R code

library(ggplot2)
library(RColorBrewer)

# This sets the color palette to opposing colours, particularly Set1
# see https://colorbrewer2.org/
pal <- brewer.pal(3, "Set1")

grid_and <- expand.grid(x1 = seq(-0.2, 1.2, 0.01),
                        x2 = seq(-0.2, 1.2, 0.01))
grid_and$pred <- as.numeric(
  as.matrix(grid_and) %*% model_and$w + model_and$b >= 0)

pts <- data.frame(x1 = X[,1], x2 = X[,2],
                  y  = factor(y, labels = c("0 (no fire)", "1 (fire)")))

ggplot() +
  geom_tile(data = grid_and,
            aes(x1, x2, fill = factor(pred)), alpha = 0.25) +
  geom_point(data = pts, aes(x1, x2, colour = y, shape = y), size = 6) +
  geom_abline(intercept = -model_and$b / model_and$w[2],
              slope     = -model_and$w[1] / model_and$w[2],
              colour = "white", linewidth = 1, linetype = "dashed") +
  scale_fill_manual(values = c("0" = pal[2], "1" = pal[1]),
                    labels = c("Pred: 0", "Pred: 1"), name = "Region") +
  scale_colour_manual(values = c("0 (no fire)" = pal[2],
                                 "1 (fire)"    = pal[1]), name = "Target") +
  scale_shape_manual(values = c(1, 3)) + 
  coord_fixed(xlim = c(-0.2, 1.2), ylim = c(-0.2, 1.2)) +
  labs(title = "AND gate — Learned decision boundary",
       x = "x1", y = "x2")

The perceptron’s limitation: XOR

Context
R code

The XOR function outputs 1 when exactly one input is 1:

\(x_1\)	\(x_2\)	\(y = x_1 \oplus x_2\)
0	0	0
1	0	1
0	1	1
1	1	0

No single straight line can separate the two 1s (diagonal) from the two 0s. The perceptron cannot converge — it oscillates indefinitely.

Important

This limitation motivated the invention of multi-layer networks (MLP/deep learning), which we will (probably) explore in another presentation.

Show R code

X_xor <- as.matrix(expand.grid(x1 = c(0,1), x2 = c(0,1)))
y_xor <- apply(X_xor, 1, function(x) as.numeric(xor(x[1], x[2])))

set.seed(123)
m_xor <- fit_perceptron(X_xor, y_xor,
                        w  = matrix(rnorm(2), 2, 1),
                        b  = rnorm(1), lr = 0.1, epochs = 100)

pred_xor <- predict_perceptron(X_xor, m_xor$w, m_xor$b)
cat("Perceptron on XOR — accuracy:",
    round(mean(pred_xor == y_xor)*100, 1), "%  (best possible: 75%)\n")

Perceptron on XOR — accuracy: 25 %  (best possible: 75%)

Show R code

grid_xor <- expand.grid(x1 = seq(-0.2,1.2,0.01), x2 = seq(-0.2,1.2,0.01))
grid_xor$pred <- as.numeric(
  as.matrix(grid_xor) %*% m_xor$w + m_xor$b >= 0)

ggplot() +
  geom_tile(data = grid_xor, aes(x1, x2, fill = factor(pred)), alpha = 0.25) +
  geom_point(data = data.frame(X_xor, y = factor(y_xor)),
             aes(x1, x2, colour = y, shape = y), size = 6) +
  scale_fill_manual(values = c("0" = pal[2], "1" = pal[1]),
                    labels = c("Pred: 0", "Pred: 1"), name = "Region") +
  scale_colour_manual(values = c("0" = pal[2],
                                 "1"    = pal[1]), name = "Target") +
  scale_shape_manual(values = c(1, 3)) +
  coord_fixed(xlim=c(-0.2,1.2), ylim=c(-0.2,1.2)) +
  labs(title="XOR — perceptron fails (not linearly separable)",
       x="x1", y="x2")

Part II: From logic gates to credit risk

The bridge

The AND gate taught us the mechanics. Now we apply the same algorithm to a real problem.

AND gate (toy)

Inputs: \(x_1, x_2 \in \{0,1\}\)
Output: \(y = 1\) only if both inputs are 1
4 observations, perfectly separable
Trains in a few epochs

Credit default (real)

Inputs: financial ratios (DSCR, Debt/EBITDA)
Output: \(y = 1\) if borrower defaults
690 real corporate borrowers
Classes overlap — linear separation is imperfect

Note

The same fit_perceptron and predict_perceptron functions we wrote for the AND gate will now be applied — unchanged — to 690 real loan observations.

The dataset

Context
R code

690 corporate borrowers from a sample dataset, pre-split into training and test sets:

	Train	Test	Total
Observations	345	345	690
Non-defaulted (`No`)	289	281	570
Defaulted (`Si`)	56	64	120
Default rate	16.2%	18.6%	17.4%

Two predictors — group means:

Variable	Description	No Default	Default
`DSCR`	Debt Service Coverage Ratio	1.95	1.04
`DEBITDA`	Debt / EBITDA multiple	4.48	7.61

Both show clear directional separation: lower DSCR and higher leverage are associated with default.

library(readxl)
library(dplyr)

df <- read_excel("corporate_ds.xlsx") |>
  mutate(label = as.integer(Default == "Si"))

train2 <- df |> filter(Set == "Train") |> select(label, DSCR, DEBITDA)
test2  <- df |> filter(Set == "Test")  |> select(label, DSCR, DEBITDA)

df |>
  mutate(Default = factor(label, labels = c("No","Si"))) |>
  group_by(Default) |>
  summarise(n = n(),
            DSCR_mean    = round(mean(DSCR), 2),
            DEBITDA_mean = round(mean(DEBITDA), 2))

# A tibble: 2 × 4
  Default     n DSCR_mean DEBITDA_mean
  <fct>   <int>     <dbl>        <dbl>
1 No        570      1.95         4.48
2 Si        120      1.04         7.61

Part III: Two predictors — DSCR & Debt/EBITDA

Feature space

Context
R code

Plotting all 690 borrowers in the DSCR × Debt/EBITDA plane:

Blue circles = non-defaulted borrowers
Red crosses = defaulted borrowers

The groups partially overlap — unlike the AND gate, no perfect linear boundary exists. This is typical of real credit data.

Normalisation to \([0,1]\) is applied before training — essential when variables have different scales (DSCR: −1.5 to 4.7; Debt/EBITDA: −2.3 to 12.4).

Show R code

library(ggplot2)

df %>% 
  mutate(Default = factor(label, labels = c("No","Si"))) %>% 
  ggplot(aes(DSCR, DEBITDA, colour = Default, shape = Default)) +
  geom_point(alpha = 1.0, size = 1.8) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  scale_shape_manual(values = c(1, 3)) + 
  labs(title = "Feature space: DSCR vs debt/EBITDA",
       subtitle = "690 corporate borrowers",
       x = "DSCR", y = "Debt/EBITDA")

Show R code

# +
#   theme_minimal(base_size = 13)

Training — 2 Predictors

Context
R code

The same learning rule as the AND gate, now on 345 training observations:

\[w_i \leftarrow w_i - \eta\,(\hat{y} - y)\,x_i \qquad \eta = 0.01, \quad 100 \text{ epochs}\]

Features normalised to \([0,1]\). Learned weights after convergence:

Parameter	Value	Financial interpretation
\(w_{\text{DSCR}}\)	−1.52	↑ DSCR → ↓ default risk
\(w_{\text{DEBITDA}}\)	+1.74	↑ Debt/EBITDA → ↑ default risk
Bias \(b\)	−0.22	Decision threshold

Sign and direction align perfectly with financial intuition — exactly as the AND gate weights aligned with the truth table.

Show R code

normalise <- function(x) (x - min(x)) / (max(x) - min(x))

X_tr2 <- train2 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_tr2 <- train2$label

X_te2 <- test2 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_te2 <- test2$label

set.seed(34)
m2 <- fit_perceptron(X_tr2, y_tr2,
                     w  = matrix(rnorm(2), 2, 1),
                     b  = rnorm(1), lr = 0.01, epochs = 100)

cat("w_DSCR    =", round(m2$w[1], 3),
    "\nw_DEBITDA =", round(m2$w[2], 3),
    "\nbias      =", round(m2$b, 3))

w_DSCR    = -0.019 
w_DEBITDA = 0.069 
bias      = -0.018

Decision boundary — 2 predictors

Context
R code

The perceptron draws a straight line in the DSCR × Debt/EBITDA plane — exactly as it did in the AND gate, but now on real financial data.

Blue zone → predicted: No default
Red zone → predicted: Default

Points in the wrong zone are misclassifications. The overlap of the two groups is irreducible with a linear classifier — motivating more powerful models (logistic regression, random forest).

Show R code

grid2 <- expand.grid(DSCR = seq(0,1,0.005), DEBITDA = seq(0,1,0.005))
grid2$pred <- as.numeric(as.matrix(grid2) %*% m2$w + m2$b >= 0)

X_te2_df <- as.data.frame(X_te2); X_te2_df$label <- y_te2

ggplot() +
  geom_tile(data = grid2, aes(DSCR, DEBITDA, fill = factor(pred)),
            alpha = 0.22) +
  geom_point(data = X_te2_df,
             aes(DSCR, DEBITDA, colour = factor(label), shape = factor(label)),
             size = 1.8, alpha = 1.0) +
  scale_fill_manual(values = c("0"=pal[2],"1"=pal[1]),
                    labels = c("No Default","Default"), name = "Region") +
  scale_colour_manual(values = c("0"=pal[2],"1"=pal[1]),
                      labels = c("No Default","Default"), name = "Actual") +
  scale_shape_manual(values = c(1, 3)) + 
  labs(title = "Decision boundary — 2 predictors (test set)",
       x = "DSCR (normalised)", y = "Debt/EBITDA (normalised)") +
  theme_minimal(base_size = 12)

Performance — 2 predictors

Context
R code

Confusion matrix — 345 test observations:

	Predicted: No default	Predicted: default
Actual: No default	126 ✅	155 ❌
Actual: Default	1 ❌	63 ✅

Metric	Value
Accuracy	54.8%
Recall (default)	98.4%
Precision (default)	28.9%
Specificity	44.8%

Warning

As many as 98.4% of actual defaulters are caught. In credit risk, missed defaults are the costliest errors, but only 44.8% of non defaulters were caught. Can adding more features help?

Show R code

y_hat2 <- predict_perceptron(X_te2, m2$w, m2$b)
cm2    <- table(Actual = y_te2, Predicted = y_hat2)
print(cm2)

      Predicted
Actual   0   1
     0 126 155
     1   1  63

Show R code

cat(sprintf(
  "\nAccuracy:    %.1f%%\nRecall:      %.1f%%\nPrecision:   %.1f%%\nSpecificity: %.1f%%",
  mean(y_te2==y_hat2)*100,
  (cm2[2,2]/sum(cm2[2,]))*100,
  (cm2[2,2]/sum(cm2[,2]))*100,
  (cm2[1,1]/sum(cm2[1,]))*100
))


Accuracy:    54.8%
Recall:      98.4%
Precision:   28.9%
Specificity: 44.8%

Part IV: Four predictors — Adding ROS & DBR

Extending the feature set

Context
R code

We enrich the model with two additional financial ratios:

Variable	Description	No Default	Default
`DSCR`	Debt Service Coverage Ratio	1.95	1.04
`DEBITDA`	Debt / EBITDA multiple	4.48	7.61
`ROS`	Return on Sales (%)	10.2	4.6
`DBR`	Debt-to-Revenue ratio (%)	38.6	71.8

The decision boundary is now a hyperplane in 4D — no longer directly visualisable, but the perceptron still finds a linear combination of the four ratios.

Note

The fit_perceptron function requires no changes — only the number of input columns grows from 2 to 4. The learning rule is identical.

Show R code

library(tidyr)

df |>
  mutate(Default = factor(label, labels = c("No","Si"))) |>
  group_by(Default) |>
  summarise(across(c(DSCR, DEBITDA, ROS, DBR), ~round(mean(.), 2)))

# A tibble: 2 × 5
  Default  DSCR DEBITDA   ROS   DBR
  <fct>   <dbl>   <dbl> <dbl> <dbl>
1 No       1.95    4.48  10.2  38.6
2 Si       1.04    7.61   4.6  71.8

Feature distributions — 4 predictors

Context
R code

All four ratios show clear directional separation:

DSCR: lower for defaulters — debt coverage is insufficient
Debt/EBITDA: higher — excessive leverage relative to earnings
ROS: lower — poor profitability reduces cash generation
DBR: higher — heavy debt burden relative to revenues

Each adds independent information about the borrower’s financial health.

Show R code

df |>
  mutate(Default = factor(label, labels = c("No","Si"))) |>
  select(Default, DSCR, DEBITDA, ROS, DBR) |>
  pivot_longer(-Default) |>
  ggplot(aes(Default, value, fill = Default)) +
  geom_boxplot(alpha = 0.80, outlier.size = 0.6) +
  facet_wrap(~name, scales = "free_y") +
  scale_fill_manual(values = c("No"= pal[2],"Si"= pal[1])) +
  labs(title = "Feature distributions by default status — 4 predictors",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Training — 4 Predictors

Context
R code

Learned weights after 100 epochs — same learning rate \(\eta = 0.01\):

Feature	Weight	Direction
DSCR	−0.043	↑ coverage → ↓ risk
Debt/EBITDA	+0.073	↑ leverage → ↑ risk
ROS	−0.077	↑ profitability → ↓ risk
DBR	+0.036	↑ debt burden → ↑ risk

No single feature has the strongest default signal in this dataset (not significantly stronger).

Show R code

train4 <- df |> filter(Set=="Train") |> select(label, DSCR, DEBITDA, ROS, DBR)
test4  <- df |> filter(Set=="Test")  |> select(label, DSCR, DEBITDA, ROS, DBR)

X_tr4 <- train4 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_tr4 <- train4$label

X_te4 <- test4 |> select(-label) |>
  mutate(across(everything(), normalise)) |> as.matrix()
y_te4 <- test4$label

set.seed(42)
m4 <- fit_perceptron(X_tr4, y_tr4,
                     w  = matrix(rnorm(4), 4, 1),
                     b  = rnorm(1), lr = 0.01, epochs = 100)

data.frame(Feature = c("DSCR","DEBITDA","ROS","DBR"),
           Weight  = round(m4$w, 3))

  Feature Weight
1    DSCR -0.043
2 DEBITDA  0.073
3     ROS -0.077
4     DBR  0.036

Learning curves — 2 vs 4 predictors

Context
R code

Comparing training errors per epoch for both models:

Both curves drop sharply in the first 20–30 epochs
The 4-predictor model (blue) settles at a lower error floor
Neither reaches zero — the classes are not linearly separable in real data

This non-zero floor is the signature of irreducible overlap in the feature space. It motivates more flexible classifiers — the topic of the next presentation.

Show R code

bind_rows(
  data.frame(epoch = seq_along(m2$history), errors = m2$history,
             model = "2 Predictors (DSCR + DEBITDA)"),
  data.frame(epoch = seq_along(m4$history), errors = m4$history,
             model = "4 Predictors (+ ROS + DBR)")
) |>
  ggplot(aes(epoch, errors, colour = model)) +
  geom_line(linewidth = 1.1) +
  scale_colour_manual(
    values = c("2 Predictors (DSCR + DEBITDA)" = "#e74c3c",
               "4 Predictors (+ ROS + DBR)"    = "#3498db")) +
  labs(title = "Learning curves: 2 vs 4 predictors",
       subtitle = "Training misclassifications per epoch (n = 345)",
       x = "Epoch", y = "Errors", colour = NULL) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Performance — 2 vs 4 predictors

Context
R code

Head-to-head on 345 test observations:

Metric	2 Predictors	4 Predictors	Δ
Accuracy	54.8%	49.6%	-5.2pp
Recall (Default)	98.4%	100%	+1.6pp
Precision (Default)	28.9%	26.9%	-2.0pp
Specificity	44.8%	38.1%	-13.9pp

Tip

The biggest gain is recall — the 4-predictor model catches 1.6 percentage points more actual defaulters. This is the metric that matters most in credit risk management, though this improvemente doesn’t add much.

Show R code

y_hat4 <- predict_perceptron(X_te4, m4$w, m4$b)
cm4    <- table(Actual = y_te4, Predicted = y_hat4)

data.frame(
  Metric = c("Accuracy","Recall (Default)","Precision (Default)","Specificity"),
  `2 Pred` = c(mean(y_te2==y_hat2), cm2[2,2]/sum(cm2[2,]),
               cm2[2,2]/sum(cm2[,2]), cm2[1,1]/sum(cm2[1,])),
  `4 Pred` = c(mean(y_te4==y_hat4), cm4[2,2]/sum(cm4[2,]),
               cm4[2,2]/sum(cm4[,2]), cm4[1,1]/sum(cm4[1,])),
  check.names = FALSE
) |>
  mutate(across(where(is.numeric), ~paste0(round(.*100,1),"%")),
         Delta = paste0("", round(
           c(mean(y_te4==y_hat4)-mean(y_te2==y_hat2),
             cm4[2,2]/sum(cm4[2,])-cm2[2,2]/sum(cm2[2,]),
             cm4[2,2]/sum(cm4[,2])-cm2[2,2]/sum(cm2[,2]),
             cm4[1,1]/sum(cm4[1,])-cm2[1,1]/sum(cm2[1,]))*100, 1), "pp"))

               Metric 2 Pred 4 Pred  Delta
1            Accuracy  54.8%  49.6% -5.2pp
2    Recall (Default)  98.4%   100%  1.6pp
3 Precision (Default)  28.9%  26.9%   -2pp
4         Specificity  44.8%  38.1% -6.8pp

Weight interpretation — A data-driven scorecard

Context
R code

The perceptron’s weight vector is its implicit credit policy:

Feature	Weight	Logic
DSCR	−0.0433	Strongest protective factor: sufficient debt coverage reduces default risk
Debt/EBITDA	+0.0731	Dominant risk factor: excessive leverage relative to earnings
ROS	−0.0765	Profitable firms generate cash to service obligations
DBR	+0.360	High debt burden relative to revenues is a warning signal

This structure is essentially a linear scorecard — identical in spirit to Altman’s Z-score or traditional expert-designed credit rating models, but learned entirely from data.

Note

Under EBA IRB guidelines and IFRS 9, model explainability is mandatory. The perceptron’s weights satisfy this requirement directly.

Show R code

lw <- data.frame(Feature = c("DSCR","DEBITDA","ROS","DBR"),
           Weight  = m4$w[,1])
lw

  Feature      Weight
1    DSCR -0.04326569
2 DEBITDA  0.07311199
3     ROS -0.07654086
4     DBR  0.03598658

Show R code

lw %>% 
  ggplot(aes(reorder(Feature, Weight), Weight, fill = Weight > 0)) +
  geom_col(alpha = 0.85) +
  geom_hline(yintercept = 0, linewidth = 0.5, colour = "white") +
  coord_flip() +
  scale_fill_manual(values = c("TRUE"="#e74c3c","FALSE"="#2ecc71"),
                    labels = c("Reduces risk","Increases risk")) +
  labs(title = "Learned Weights — 4-predictor perceptron",
       subtitle = "Negative = protective, Positive = risk-increasing",
       x = NULL, y = "Weight", fill = NULL) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Limitations & what comes next

Perceptron limits

Linear boundary only — cannot capture interaction effects
No probability output — just 0/1 (no PD estimate)
Precision still limited (~26.9%) on real credit data
Sensitive to feature scaling
Cannot solve XOR-type problems

In the next part

The same step-by-step logic — applied to a Multi-Layer Perceptron (MLP):

Addition	What it unlocks
Hidden layer	Non-linear boundaries
Sigmoid activation	Probability output (PD)
Backpropagation	Gradient-based learning
Multiple layers	Deep representations

Regulatory reminder

The perceptron is fully auditable — its weights are the model. More complex models (XGBoost, MLP) require post-hoc explainability tools (SHAP values) to meet EBA and IFRS 9 requirements.

Key takeaways

The perceptron is a single linear classifier: weighted inputs + threshold = binary prediction
The AND gate shows the mechanics: same algorithm, 4 observations, guaranteed convergence
On 690 real loans, the same code achieves ~50% accuracy and ~100% recall with no feature engineering
2 predictors give a visible 2D boundary; 4 predictors raise default recall slightly
Debt/EBITDA is the dominant default signal — the model discovers this automatically
The weight vector is a data-driven scorecard — interpretable, auditable, regulatory-compliant
The non-zero error floor signals non-linear overlap → motivation for the MLP (next session)

“Start simple. A model that can be explained to a regulator is worth more than one that cannot.” “Understanding the simplest model deeply is more valuable than applying complex models blindly.”

Part V: From perceptron to neural network

Why one neuron is not enough

The perceptron left us with two problems:

Problem 1 — Linear boundary only

Real credit data is not linearly separable. Defaulters and non-defaulters overlap in every feature space. A single hyperplane cannot capture that complexity.

Problem 2 — No probability output

The perceptron outputs 0 or 1. Credit decisions require a probability of default (PD) — a number in \([0,1]\) that can be thresholded, stress-tested, and reported under IFRS 9.

The fix: add a hidden layer

Input layer    Hidden layer    Output layer

DSCR   ──→ ╮                 ╭──→
           ├──→ h₁ (sigmoid) ┤
DEBITDA──→ ┤                 ├──→ ŷ ∈ (0,1)
           ├──→ h₂ (sigmoid) ┤
ROS    ──→ ┤                 ╰──→
           ├──→
DBR    ──→ ╯

Two hidden neurons + sigmoid activation give the network non-linear decision boundaries and a probability output.

Architecture: 4 → 2 → 1

Our network has three layers and two sets of weights:

\[\underbrace{\mathbf{x} \in \mathbb{R}^4}_{\text{inputs}} \xrightarrow{\mathbf{W}_1,\,\mathbf{b}_1} \underbrace{\mathbf{h} \in \mathbb{R}^2}_{\text{hidden}} \xrightarrow{\mathbf{W}_2,\,\mathbf{b}_2} \underbrace{\hat{y} \in (0,1)}_{\text{output}}\]

Layer dimensions:

Layer	Weight matrix	Bias	Shape
Input → Hidden	\(\mathbf{W}_1\)	\(\mathbf{b}_1\)	\(4 \times 2\) weights, \(2\) biases
Hidden → Output	\(\mathbf{W}_2\)	\(b_2\)	\(2 \times 1\) weights, \(1\) bias

Total parameters: \(4 \times 2 + 2 + 2 \times 1 + 1 = \mathbf{13}\)

Note

Compare with the perceptron: 5 parameters (\(4\) weights + \(1\) bias). The hidden layer adds just 8 more — but unlocks non-linear boundaries and probability outputs.

The Sigmoid activation

We replace the Heaviside step function \(\phi\) with the sigmoid:

\[\sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)\]

Why sigmoid instead of step?

Property	Step \(\phi\)	Sigmoid \(\sigma\)
Output range	\(\{0, 1\}\)	\((0, 1)\)
Differentiable	✗	✓
Probability interpretation	✗	✓
Gradient-based learning	✗	✓

The sigmoid is smooth and differentiable — this allows us to compute gradients and use backpropagation to train the network.

Key values to remember:

\[\sigma(0) = 0.5 \qquad \text{(decision threshold)}\] \[\sigma(z) \to 1 \text{ as } z \to +\infty\] \[\sigma(z) \to 0 \text{ as } z \to -\infty\]

And its elegant derivative:

\[\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\]

Forward pass — Step by step

Context
R code

Given one borrower \(\mathbf{x} = (x_1, x_2, x_3, x_4)^\top\) (normalised DSCR, DEBITDA, ROS, DBR):

Step 1 — Hidden layer pre-activation: \[\mathbf{z}^{(1)} = \mathbf{W}_1^\top \mathbf{x} + \mathbf{b}_1 \in \mathbb{R}^2\]

Step 2 — Hidden layer activation (sigmoid): \[\mathbf{h} = \sigma(\mathbf{z}^{(1)}) = \begin{bmatrix} \sigma(z^{(1)}_1) \\ \sigma(z^{(1)}_2) \end{bmatrix} \in (0,1)^2\]

Step 3 — Output pre-activation: \[z^{(2)} = \mathbf{W}_2^\top \mathbf{h} + b_2 \in \mathbb{R}\]

Step 4 — Output activation (sigmoid → PD estimate): \[\hat{y} = \sigma(z^{(2)}) \in (0, 1)\]

Step 5 — Classification: \[\text{Predict default if } \hat{y} \geq 0.5\] or any other more conservative threshold (e.g. \(\hat{y} \geq 0.1, \, 0.05\))

sigmoid <- function(z) 1 / (1 + exp(-z))

forward_pass <- function(X, W1, b1, W2, b2) {
  # Step 1-2: hidden layer
  Z1 <- X %*% W1 + matrix(b1, nrow(X), 2, byrow = TRUE)  # n × 2
  H  <- sigmoid(Z1)                                        # n × 2
  # Step 3-4: output layer
  Z2 <- H %*% W2 + b2                                      # n × 1
  Y_hat <- sigmoid(Z2)                                     # n × 1  (PD)
  list(H = H, Y_hat = Y_hat)
}

# Quick demo with random weights on 3 borrowers
set.seed(1)
X_demo <- matrix(runif(12), nrow = 3, ncol = 4)
W1_demo <- matrix(rnorm(8), 4, 2)
b1_demo <- rnorm(2)
W2_demo <- matrix(rnorm(2), 2, 1)
b2_demo <- rnorm(1)

out <- forward_pass(X_demo, W1_demo, b1_demo, W2_demo, b2_demo)
cat("Hidden activations (h₁, h₂) for 3 borrowers:\n")

Hidden activations (h₁, h₂) for 3 borrowers:

print(round(out$H, 4))

       [,1]   [,2]
[1,] 0.9206 0.4967
[2,] 0.8548 0.4328
[3,] 0.9150 0.5962

cat("\nPD estimates (ŷ):\n")


PD estimates (ŷ):

print(round(out$Y_hat, 4))

       [,1]
[1,] 0.7816
[2,] 0.7713
[3,] 0.7972

Loss function: binary cross-entropy

The perceptron minimised raw misclassification count. A neural network minimises binary cross-entropy (BCE) — also called log loss:

\[\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]\]

Note

Why not just count errors? The step function makes error counts non-differentiable — gradients are zero everywhere except at the threshold. BCE is smooth, always differentiable, and penalises confident wrong predictions much more harshly than uncertain ones.

Prediction \(\hat{y}\)	True label \(y\)	BCE contribution
0.95	1	\(-\log(0.95) \approx 0.05\) — small, correct
0.50	1	\(-\log(0.50) \approx 0.69\) — uncertain
0.05	1	\(-\log(0.05) \approx 3.00\) — severely penalised
0.05	0	\(-\log(0.95) \approx 0.05\) — small, correct

Backpropagation — the key equations

Training minimises \(\mathcal{L}\) via gradient descent. We need \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1}\), \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}_2}\), etc.

Output layer gradient (error signal \(\boldsymbol{\delta}^{(2)}\)): \[\boldsymbol{\delta}^{(2)} = \hat{\mathbf{y}} - \mathbf{y} \in \mathbb{R}^n\]

Hidden layer gradient (backpropagated error \(\boldsymbol{\delta}^{(1)}\)): \[\boldsymbol{\delta}^{(1)} = \left(\boldsymbol{\delta}^{(2)} \mathbf{W}_2^\top\right) \odot \mathbf{H} \odot (1 - \mathbf{H}) \in \mathbb{R}^{n \times 2}\]

where \(\odot\) is element-wise multiplication and \(\mathbf{H} \odot (1 - \mathbf{H}) = \sigma'(\mathbf{Z}^{(1)})\).

Weight updates (learning rate \(\eta\)): \[\mathbf{W}_2 \leftarrow \mathbf{W}_2 - \frac{\eta}{n} \mathbf{H}^\top \boldsymbol{\delta}^{(2)}, \qquad b_2 \leftarrow b_2 - \frac{\eta}{n} \sum_i \delta^{(2)}_i\] \[\mathbf{W}_1 \leftarrow \mathbf{W}_1 - \frac{\eta}{n} \mathbf{X}^\top \boldsymbol{\delta}^{(1)}, \qquad \mathbf{b}_1 \leftarrow \mathbf{b}_1 - \frac{\eta}{n} \sum_i \boldsymbol{\delta}^{(1)}_i\]

Tip

The chain rule in action: the output error \(\boldsymbol{\delta}^{(2)}\) is propagated backwards through \(\mathbf{W}_2\), scaled by the sigmoid derivative at the hidden layer. This is why the algorithm is called backpropagation.

Step-by-step implementation

Context
R code

The complete training loop for one epoch:

for each epoch:
  ① Forward pass  →  compute H, ŷ  (predictions)
  ② Compute loss  →  BCE(y, ŷ)
  ③ Output delta  →  δ² = ŷ − y
  ④ Hidden delta  →  δ¹ = (δ² W₂ᵀ) ⊙ H ⊙ (1−H)
  ⑤ Update W₂, b₂  →  gradient descent step
  ⑥ Update W₁, b₁  →  gradient descent step

Compare with the perceptron:

	Perceptron	Neural Network
Activation	Step \(\phi\)	Sigmoid \(\sigma\)
Output	\(\{0,1\}\)	\((0,1)\) — PD estimate
Loss	Error count	Binary cross-entropy
Update rule	Error correction	Backpropagation
Layers	1	2 (hidden + output)

train_nn <- function(X, y, hidden = 2, lr = 0.05, epochs = 500,
                     seed = 42, class_weight = NULL) {
  set.seed(seed)
  n <- nrow(X); p <- ncol(X)

  # Initialise weights (Xavier scaling)
  W1 <- matrix(rnorm(p * hidden, sd = sqrt(2 / p)), p, hidden)
  b1 <- rep(0, hidden)
  W2 <- matrix(rnorm(hidden, sd = sqrt(2 / hidden)), hidden, 1)
  b2 <- 0

  # Class weights: upweight the minority class (defaults)
  # Default: inverse frequency weighting
  if (is.null(class_weight)) {
    w_pos <- n / (2 * sum(y))        # weight for class 1 (default)
    w_neg <- n / (2 * sum(1 - y))    # weight for class 0 (no default)
  } else {
    w_neg <- class_weight[1]
    w_pos <- class_weight[2]
  }
  # Per-sample weight vector
  sample_w <- ifelse(y == 1, w_pos, w_neg)   # n × 1

  history <- data.frame(epoch = integer(), loss = numeric(),
                        accuracy = numeric())

  for (e in seq_len(epochs)) {
    # ① Forward pass
    Z1    <- X %*% W1 + matrix(b1, n, hidden, byrow = TRUE)
    H     <- sigmoid(Z1)
    Z2    <- H %*% W2 + b2
    Y_hat <- sigmoid(Z2)

    # ② Weighted binary cross-entropy loss
    eps  <- 1e-8
    loss <- -mean(sample_w * (y * log(Y_hat + eps) +
                              (1 - y) * log(1 - Y_hat + eps)))

    # ③ Output delta (weighted)
    d2 <- (Y_hat - y) * sample_w             # n × 1

    # ④ Hidden delta  (sigmoid derivative = H*(1-H))
    d1 <- (d2 %*% t(W2)) * H * (1 - H)      # n × hidden

    # ⑤ Update output layer
    W2 <- W2 - (lr / n) * t(H) %*% d2
    b2 <- b2 - (lr / n) * sum(d2)

    # ⑥ Update hidden layer
    W1 <- W1 - (lr / n) * t(X) %*% d1
    b1 <- b1 - (lr / n) * colSums(d1)

    acc <- mean((Y_hat >= 0.5) == y)
    history <- rbind(history, data.frame(epoch = e, loss = loss,
                                         accuracy = acc))
  }
  list(W1=W1, b1=b1, W2=W2, b2=b2, history=history)
}

Preparing the data

Context
R code

The same four predictors used in the 4-predictor perceptron:

Variable	Description
`DSCR`	Debt Service Coverage Ratio
`DEBITDA`	Debt / EBITDA multiple
`ROS`	Return on Sales (%)
`DBR`	Debt-to-Revenue ratio (%)

Pre-processing steps (identical to the perceptron):

Normalise each feature to \([0, 1]\) using min-max scaling
Use the pre-split Train / Test sets (345 / 345)
Label: Si = 1 (default), No = 0 (no default)

The neural network uses the same normalised matrices \(X_{\text{train}}\), \(X_{\text{test}}\) already prepared for the 4-predictor perceptron — no additional pre-processing needed.

# Reuse the 4-predictor normalised matrices from Part IV
# (X_tr4, y_tr4, X_te4, y_te4 are already in the environment)

cat("Training set:  ", nrow(X_tr4), "observations,",
    ncol(X_tr4), "features\n")

Training set:   345 observations, 4 features

cat("Test set:      ", nrow(X_te4), "observations,",
    ncol(X_te4), "features\n")

Test set:       345 observations, 4 features

cat("Default rate (train):", round(mean(y_tr4)*100, 1), "%\n")

Default rate (train): 16.2 %

cat("Default rate (test): ", round(mean(y_te4)*100, 1), "%\n")

Default rate (test):  18.6 %

# Column names for readability
colnames(X_tr4) <- colnames(X_te4) <- c("DSCR","DEBITDA","ROS","DBR")

Training the neural network

Context
R code

We train a 4 → 2 → 1 network:

4 inputs: DSCR, Debt/EBITDA, ROS, DBR (normalised)
2 hidden neurons with sigmoid activation
1 output neuron with sigmoid → PD estimate
Learning rate \(\eta = 0.05\), 500 epochs
Weights initialised with Xavier scaling: \(\mathbf{w} \sim \mathcal{N}(0, 2/d_{\text{in}})\)

Xavier initialisation sets the variance of initial weights proportional to the input dimension — this prevents activations from vanishing or exploding at the start of training.

After 500 epochs the network converges to a stable loss, with accuracy well above the perceptron baseline.

nn_model <- train_nn(X_tr4, matrix(y_tr4, ncol = 1),
                     hidden = 2, lr = 0.05, epochs = 500,
                     class_weight = c(1, 5))  # upweight defaults ~5x

tail_hist <- tail(nn_model$history, 5)
cat("Final 5 epochs:\n")

Final 5 epochs:

cat(sprintf("  Epoch %d | Loss: %.4f | Accuracy: %.1f%%\n",
            tail_hist$epoch, tail_hist$loss, tail_hist$accuracy * 100))

  Epoch 496 | Loss: 1.0080 | Accuracy: 88.7%
   Epoch 497 | Loss: 1.0075 | Accuracy: 88.7%
   Epoch 498 | Loss: 1.0070 | Accuracy: 88.7%
   Epoch 499 | Loss: 1.0066 | Accuracy: 88.7%
   Epoch 500 | Loss: 1.0061 | Accuracy: 88.7%

cat("\nLearned W1 (input → hidden):\n")


Learned W1 (input → hidden):

print(round(nn_model$W1, 3))

          [,1]   [,2]
DSCR     0.317  0.446
DEBITDA  0.462 -0.606
ROS     -0.738  1.377
DBR      0.959 -0.412

cat("\nLearned W2 (hidden → output):\n")


Learned W2 (hidden → output):

print(round(nn_model$W2, 3))

       [,1]
[1,]  1.938
[2,] -1.283

Training dynamics

Context
R code

Two panels show how the network learns over 500 epochs:

Left — accuracy curve: training accuracy rises to ~85% and stabilises. Compare with the perceptron’s ~50% — the neural network’s advantage will be clearest in recall on defaults(?).

Right — loss curve: binary cross-entropy drops steeply in the first ~100 epochs, then gradually flattens. The smooth descent (no oscillation) reflects gradient descent with a well-chosen learning rate.

Tip

A smooth, monotonically decreasing loss is the hallmark of a well-behaved network. Oscillation or divergence would signal that the learning rate \(\eta\) is too large.

Show R code

library(ggplot2)
library(tidyr)

nn_model$history |>
  pivot_longer(c(loss, accuracy), names_to = "metric") |>
  ggplot(aes(epoch, value, colour = metric)) +
  geom_line(linewidth = 1.1) +
  facet_wrap(~metric, scales = "free_y",
             labeller = labeller(metric = c(loss     = "BCE Loss",
                                            accuracy = "Accuracy"))) +
  scale_colour_manual(values = c(loss = "#e74c3c", accuracy = "#3498db"),
                      guide = "none") +
  labs(title = "Neural Network Training Dynamics (4→2→1)",
       subtitle = "500 epochs, η = 0.05, 345 training observations",
       x = "Epoch", y = NULL) +
  theme_minimal(base_size = 13)

Hidden layer representations

Context
R code

The two hidden neurons transform the four inputs into a new 2D space \((h_1, h_2)\).

Key insight: the hidden layer learns a new representation of the borrowers — a non-linear projection into 2D that makes the default/no-default separation easier for the output neuron.

Plot interpretation:

Each point is one borrower from the test set
Blue circles = no default, red crosses = default
Notice the better clustering compared to the raw DSCR × Debt/EBITDA scatter — this is the network learning to compress and reorganise the information

Show R code

library(RColorBrewer)

pal <- brewer.pal(3, "Set1")

# Compute hidden activations for test set
fwd_te <- forward_pass(X_te4, nn_model$W1, nn_model$b1,
                        nn_model$W2, nn_model$b2)

hidden_df <- data.frame(
  h1      = fwd_te$H[,1],
  h2      = fwd_te$H[,2],
  Default = factor(y_te4, labels = c("No","Si"))
)

ggplot(hidden_df, aes(h1, h2, colour = Default, shape = Default)) +
  geom_point(size = 2.2, alpha = 0.75) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  scale_shape_manual(values = c(1, 3)) +
  labs(
    title = "Hidden layer representation — test set",
    subtitle = expression("Each axis is one sigmoid neuron: h"[1] * " = σ(W"[1][",1"]^"T" * "x + b"[11] * "), h"[2] * " = σ(W"[1][",2"]^"T" * "x + b"[12] * ")"),
    x = expression("h"[1] * " (hidden neuron 1)"), 
    y = expression("h"[2] * " (hidden neuron 2)")
  ) +
  theme_minimal(base_size = 13)

Test set performance

Context
R code

Neural network confusion matrix — 345 test observations:

	Predicted: No Default	Predicted: Default
Actual: No Default	241 ✅	40 ❌
Actual: Default	12 ❌	52 ✅

Full comparison — Perceptron vs Neural Network:

Metric	Perceptron (4 pred)	Neural Network	Δ
Accuracy	49.6%	86.1%	+36.5pp
Recall (Default)	100%	81.2%	-18.8pp
Precision (Default)	26.9%	56.5%	+29.6pp
Specificity	38.1%	85.8%	+47.7pp

Tip

The biggest gains in using the neural network are precision and specificity: the false alarm rate (1 - specificity) drops meaningfully.

Show R code

fwd_te   <- forward_pass(X_te4, nn_model$W1, nn_model$b1,
                          nn_model$W2, nn_model$b2)
y_hat_nn <- as.integer(fwd_te$Y_hat >= 0.5)

# Use a complete confusion matrix (all four cells guaranteed)
cm_nn <- table(
  Actual    = factor(y_te4,    levels = c(0, 1)),
  Predicted = factor(y_hat_nn, levels = c(0, 1))
)
print(cm_nn)

      Predicted
Actual   0   1
     0 241  40
     1  12  52

Show R code

acc_nn  <- mean(y_te4 == y_hat_nn)
rec_nn  <- cm_nn["1","1"] / sum(cm_nn["1",])
prec_nn <- if (sum(cm_nn[,"1"]) > 0) cm_nn["1","1"] / sum(cm_nn[,"1"]) else NA
spec_nn <- cm_nn["0","0"] / sum(cm_nn["0",])

# Side-by-side with perceptron
data.frame(
  Metric     = c("Accuracy","Recall","Precision","Specificity"),
  Perceptron = paste0(round(c(mean(y_te4==y_hat4),
                               cm4[2,2]/sum(cm4[2,]),
                               cm4[2,2]/sum(cm4[,2]),
                               cm4[1,1]/sum(cm4[1,]))*100, 1), "%"),
  Neural_Net = paste0(round(c(acc_nn, rec_nn, prec_nn, spec_nn)*100, 1), "%")
)

       Metric Perceptron Neural_Net
1    Accuracy      49.6%      84.9%
2      Recall       100%      81.2%
3   Precision      26.9%      56.5%
4 Specificity      38.1%      85.8%

PD scores — A probability output

Context
R code

Unlike the perceptron (which outputs 0 or 1), the neural network outputs a continuous PD score \(\hat{y} \in (0,1)\).

This enables:

Risk tiering: rank borrowers from lowest to highest PD
IFRS 9 staging: Stage 1 / 2 / 3 thresholds based on PD level
Stress testing: shift the PD distribution under adverse scenarios
Expected loss: \(EL = PD \times LGD \times EAD\)

The distribution of predicted PDs shows clear separation between the two groups — defaulters cluster toward higher scores, non-defaulters toward lower scores.

Show R code

pd_df <- data.frame(
  PD      = as.numeric(fwd_te$Y_hat),
  Default = factor(y_te4, labels = c("No","Si"))
)

ggplot(pd_df, aes(PD, fill = Default, colour = Default)) +
  geom_density(alpha = 0.35, linewidth = 1) +
  geom_vline(xintercept = 0.5, linetype = "dashed",
             colour = "white", linewidth = 0.8) +
  scale_fill_manual(values   = c("No" = pal[2], "Si" = pal[1])) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  annotate("text", x = 0.52, y = Inf, label = "threshold = 0.5",
           hjust = 0, vjust = 1.5, colour = "white", size = 3.5) +
  labs(title = "Predicted PD Distribution by Default Status",
       subtitle = "Neural network output ŷ ∈ (0,1) — test set",
       x = "Predicted PD (ŷ)", y = "Density") +
  theme_minimal(base_size = 13)

Using `nnet` — one line of code

Context
R code

The nnet package (part of base R) implements exactly our 4→2→1 architecture with a single function call.

Key arguments:

Argument	Value	Meaning
`formula`	`label ~ .`	predict `label` from all other columns
`size`	`2`	number of hidden neurons
`linout`	`FALSE`	sigmoid output (logistic)
`decay`	`0.01`	L2 regularisation (weight decay)
`maxit`	`500`	maximum iterations

Weight decay (L2 regularisation) adds a penalty \(\lambda \|\mathbf{W}\|^2\) to the loss — it shrinks weights toward zero and reduces overfitting. Our hand-coded network did not include this; nnet handles it automatically.

library(nnet)

# Prepare data frames (nnet prefers data frames with named columns)
train_df <- data.frame(X_tr4, label = y_tr4)
test_df  <- data.frame(X_te4, label = y_te4)

# Case weights: upweight the minority class (defaults) ~5x
case_w <- ifelse(train_df$label == 1, 5, 1)

set.seed(42)
nn_nnet <- nnet(
  label ~ .,
  data    = train_df,
  weights = case_w,     # class imbalance correction
  size    = 2,          # 2 hidden neurons
  linout  = FALSE,      # sigmoid output
  decay   = 0.01,       # L2 regularisation
  maxit   = 500,        # max iterations
  trace   = FALSE       # suppress iteration output
)

cat("Network summary:\n")

Network summary:

print(nn_nnet)

a 4-2-1 network with 13 weights
inputs: DSCR DEBITDA ROS DBR 
output(s): label 
options were - decay=0.01

cat("\nNumber of weights:", length(nn_nnet$wts))


Number of weights: 13

`nnet` predictions and performance

Context
R code

nnet uses the same architecture — the small performance differences relative to our hand-coded version arise from:

Weight decay (L2 regularisation) in nnet — penalises large weights, improving generalisation
Different optimiser: nnet uses BFGS (quasi-Newton), a second-order method that converges faster than our gradient descent
Different random initialisation (even with the same seed, internal RNG paths differ)

Both networks confirm the same qualitative result: the neural network outperforms the perceptron, with the largest improvement in recall on defaults.

Show R code

# Predicted PDs from nnet
pd_nnet   <- predict(nn_nnet, newdata = test_df, type = "raw")
yhat_nnet <- as.integer(pd_nnet >= 0.5)

cm_nnet <- table(
  Actual    = factor(y_te4,     levels = c(0, 1)),
  Predicted = factor(yhat_nnet, levels = c(0, 1))
)
print(cm_nnet)

      Predicted
Actual   0   1
     0 242  39
     1   7  57

Show R code

acc_nnet  <- mean(y_te4 == yhat_nnet)
rec_nnet  <- cm_nnet["1","1"] / sum(cm_nnet["1",])
prec_nnet <- if (sum(cm_nnet[,"1"]) > 0) cm_nnet["1","1"] / sum(cm_nnet[,"1"]) else NA
spec_nnet <- cm_nnet["0","0"] / sum(cm_nnet["0",])

cat(sprintf(
  "\nnnet — Accuracy: %.1f%%  Recall: %.1f%%  Precision: %.1f%%  Specificity: %.1f%%",
  acc_nnet*100, rec_nnet*100, prec_nnet*100, spec_nnet*100))


nnet — Accuracy: 86.7%  Recall: 89.1%  Precision: 59.4%  Specificity: 86.1%

`nnet` PD Scores

Context
R code

The nnet model also outputs continuous PD scores. The distribution plot (below) shows the same qualitative separation as our hand-coded network:

Non-defaulters (blue circles) cluster near PD = 0
Defaulters (red crosses) cluster near PD = 1
The overlap region (PD ≈ 0.3–0.7) represents genuinely ambiguous borrowers — cases where even a well-trained model cannot be certain

Under IFRS 9, these borderline borrowers (PD in the grey zone) would typically be placed in Stage 2 — significant increase in credit risk — and provisioned accordingly.

Show R code

pd_nnet_df <- data.frame(
  PD      = as.numeric(pd_nnet),
  Default = factor(y_te4, labels = c("No","Si"))
)

ggplot(pd_nnet_df, aes(PD, fill = Default, colour = Default)) +
  geom_density(alpha = 0.35, linewidth = 1) +
  geom_vline(xintercept = 0.5, linetype = "dashed",
             colour = "white", linewidth = 0.8) +
  scale_fill_manual(values   = c("No" = pal[2], "Si" = pal[1])) +
  scale_colour_manual(values = c("No" = pal[2], "Si" = pal[1])) +
  annotate("text", x = 0.52, y = Inf, label = "threshold = 0.5",
           hjust = 0, vjust = 1.5, colour = "white", size = 3.5) +
  labs(title = "nnet — Predicted PD Distribution by Default Status",
       subtitle = "Test set | 2 hidden neurons | decay = 0.01",
       x = "Predicted PD (ŷ)", y = "Density") +
  theme_minimal(base_size = 13)

Full comparison: perceptron → neural network

Context
R code

All models on the same test set (345 observations):

Metric	Percep. 2 pred	Percep. 4 pred	NN (hand-coded)	NN (`nnet`)
Accuracy	54.8%	49.6%	86.1%	≈86.7%
Recall	98.4%	100%	81.2%	≈89.1%
Precision	28.9%	26.9%	56.5%	≈59.4%
Specificity	44.8%	38.1%	85.8%	≈86.1%
PD output	✗	✗	✓	✓
Non-linear boundary	✗	✗	✓	✓

Note

nnet results vary slightly across runs even with fixed seed due to internal optimiser behaviour. Ranges shown reflect typical variation across seeds.

Show R code

# Helper: safe metric extraction from a named confusion matrix
cm_metric <- function(cm, metric) {
  cm <- table(
    Actual    = factor(names(cm)[1] |> {\(.) cm}(), levels = c("0","1")),
    Predicted = factor(names(cm)[1] |> {\(.) cm}(), levels = c("0","1"))
  )
  switch(metric,
    accuracy    = sum(diag(cm)) / sum(cm),
    recall      = cm["1","1"] / sum(cm["1",]),
    precision   = if (sum(cm[,"1"]) > 0) cm["1","1"] / sum(cm[,"1"]) else NA,
    specificity = cm["0","0"] / sum(cm["0",])
  )
}

# Re-build all CMs with guaranteed factor levels
mk_cm <- function(actual, predicted)
  table(Actual    = factor(actual,    levels = c(0,1)),
        Predicted = factor(predicted, levels = c(0,1)))

cm2r  <- mk_cm(y_te2, y_hat2)
cm4r  <- mk_cm(y_te4, y_hat4)
cm_nn_r  <- mk_cm(y_te4, y_hat_nn)
cm_nnet_r <- mk_cm(y_te4, yhat_nnet)

safe <- function(cm, r, c) { v <- cm[r, c]; ifelse(is.na(v), 0, v) }

metrics <- function(cm) c(
  Accuracy    = sum(diag(cm)) / sum(cm),
  Recall      = safe(cm,"1","1") / sum(cm["1",]),
  Precision   = if (sum(cm[,"1"])>0) safe(cm,"1","1")/sum(cm[,"1"]) else NA,
  Specificity = safe(cm,"0","0") / sum(cm["0",])
)

results <- data.frame(
  Model = c("Perceptron 2 pred","Perceptron 4 pred","NN hand-coded","NN nnet"),
  rbind(metrics(cm2r), metrics(cm4r), metrics(cm_nn_r), metrics(cm_nnet_r))
)
results[,2:5] <- round(results[,2:5]*100, 1)
print(results)

              Model Accuracy Recall Precision Specificity
1 Perceptron 2 pred     54.8   98.4      28.9        44.8
2 Perceptron 4 pred     49.6  100.0      26.9        38.1
3     NN hand-coded     84.9   81.2      56.5        85.8
4           NN nnet     86.7   89.1      59.4        86.1

Architectural comparison

Perceptron

\[\hat{y} = \phi(\mathbf{w}^\top \mathbf{x} + b)\]

1 layer, 5 parameters
Step activation → \(\{0,1\}\)
Linear boundary
Perceptron learning rule
Fully interpretable weights

Neural network (4→2→1)

\[\mathbf{h} = \sigma(\mathbf{W}_1^\top \mathbf{x} + \mathbf{b}_1)\] \[\hat{y} = \sigma(\mathbf{W}_2^\top \mathbf{h} + b_2)\]

2 layers, 13 parameters
Sigmoid activation → \((0,1)\) PD
Non-linear boundary
Backpropagation
Hidden layer obscures direct interpretability

Regulatory Note (EBA / IFRS 9)

The neural network’s hidden layer introduces a black-box element — the weights \(\mathbf{W}_1\) do not map directly to a credit rule. In regulated applications, SHAP values or LIME are used to explain individual predictions. The perceptron remains the gold standard for explainability; the neural network trades some of that for better predictive power.

Key takeaways — Part V

A single hidden layer with 2 sigmoid neurons is enough to move from a linear to a non-linear decision boundary
The sigmoid replaces the step function: outputs PD ∈ (0,1) instead of \(\{0,1\}\) — essential for IFRS 9 provisioning
Backpropagation is the chain rule applied layer by layer: output error propagated backwards, scaled by sigmoid derivatives
Our hand-coded network and nnet reach the same result — nnet adds L2 regularisation and a faster optimiser
Recall on defaults improves from 41.7% → 50%: the non-linear boundary catches borderline borrowers the perceptron misses
The hidden layer is a learned representation — it compresses 4 financial ratios into 2 activations that best predict default
More complex models trade interpretability for performance — SHAP values are the regulatory bridge

“The perceptron asks: does this borrower cross the line? The neural network asks: where in the space does this borrower truly belong?”

Part VI: Model Comparison — ROC Curves

Why ROC Curves?

All four models were evaluated so far at a fixed threshold (0.5 for neural networks, sign of the linear score for perceptrons).

But threshold choice is a business decision:

A conservative credit officer sets a low threshold → catches more defaults, but rejects more good borrowers
A growth-oriented officer sets a high threshold → approves more, but misses more defaults

The ROC curve shows model performance across all possible thresholds simultaneously — separating model quality from threshold choice.

\[\text{TPR}(t) = \frac{TP(t)}{TP(t) + FN(t)} \quad \text{(Recall)}\]

\[\text{FPR}(t) = \frac{FP(t)}{FP(t) + TN(t)} \quad \text{(1 − Specificity)}\]

Each point on the curve is one threshold \(t\).

\[\text{AUC} = \int_0^1 \text{TPR}(t)\, d\,\text{FPR}(t)\]

AUC = probability that the model ranks a random defaulter above a random non-defaulter.

A perfect model: AUC = 1.0. Random classifier: AUC = 0.5.

Ranking Scores for All Four Models

Context
R code

The ROC curve requires a continuous ranking score, not a hard 0/1 prediction.

Model	Ranking score used	Rationale
Perceptron 2-pred	\(s = \mathbf{w}^\top \mathbf{x} + b\)	Raw linear activation before step function
Perceptron 4-pred	\(s = \mathbf{w}^\top \mathbf{x} + b\)	Raw linear activation before step function
NN hand-coded	\(\hat{y} = \sigma(z^{(2)})\)	Sigmoid output — already in \((0,1)\)
NN `nnet`	\(\hat{y} = \sigma(z^{(2)})\)	Sigmoid output via `predict(..., type="raw")`

Note

Using the raw linear score \(\mathbf{w}^\top\mathbf{x} + b\) for the perceptrons is the correct approach for ROC analysis. Applying the step function first collapses the score to \(\{0,1\}\), which produces a degenerate ROC curve with only three points.

# Perceptron scores: raw linear activation (before step function)
score_p2 <- as.numeric(X_te2 %*% m2$w + m2$b)
score_p4 <- as.numeric(X_te4 %*% m4$w + m4$b)

# Neural network scores: sigmoid PD output
fwd_nn   <- forward_pass(X_te4, nn_model$W1, nn_model$b1,
                          nn_model$W2, nn_model$b2)
score_nn <- as.numeric(fwd_nn$Y_hat)

score_nnet <- as.numeric(
  predict(nn_nnet, newdata = test_df, type = "raw"))

cat("Score ranges (test set):\n")

Score ranges (test set):

cat(sprintf("  Perceptron 2-pred: [%.2f, %.2f]\n",
            min(score_p2), max(score_p2)))

  Perceptron 2-pred: [-0.03, 0.05]

cat(sprintf("  Perceptron 4-pred: [%.2f, %.2f]\n",
            min(score_p4), max(score_p4)))

  Perceptron 4-pred: [-0.04, 0.09]

cat(sprintf("  NN hand-coded:     [%.3f, %.3f]\n",
            min(score_nn), max(score_nn)))

  NN hand-coded:     [0.354, 0.609]

cat(sprintf("  NN nnet:           [%.3f, %.3f]\n",
            min(score_nnet), max(score_nnet)))

  NN nnet:           [0.000, 0.993]

Computing the ROC Curves

Context
R code

We compute ROC curves from scratch — no external packages needed. The algorithm:

1. Sort observations by score (descending)
2. For each unique threshold t:
   a. Predict: label = 1 if score ≥ t, else 0
   b. Compute TPR = TP / (TP + FN)
   c. Compute FPR = FP / (FP + TN)
3. Plot TPR vs FPR
4. AUC = area under the curve (trapezoidal rule, gently provided by AI!)

The diagonal (FPR = TPR) represents a random classifier — AUC = 0.5. Any model above the diagonal has discriminatory power. The top-left corner is the ideal point (TPR = 1, FPR = 0).

# ROC curve and AUC from scratch
roc_curve <- function(scores, labels) {
  # Sort by descending score
  ord     <- order(scores, decreasing = TRUE)
  scores  <- scores[ord]
  labels  <- labels[ord]

  n_pos <- sum(labels)
  n_neg <- sum(1 - labels)

  # Unique thresholds (include -Inf so last point is (1,1))
  thresholds <- c(Inf, unique(scores), -Inf)

  tpr <- fpr <- numeric(length(thresholds))
  for (k in seq_along(thresholds)) {
    pred <- as.integer(scores >= thresholds[k])
    tp   <- sum(pred == 1 & labels == 1)
    fp   <- sum(pred == 1 & labels == 0)
    tpr[k] <- tp / n_pos
    fpr[k] <- fp / n_neg
  }

  # AUC via trapezoidal rule had to be supplied by Claude AI!
  auc <- abs(sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2))

  data.frame(fpr = fpr, tpr = tpr, auc = auc)
}

# Compute for all four models
roc_p2    <- roc_curve(score_p2,    y_te2)
roc_p4    <- roc_curve(score_p4,    y_te4)
roc_nn    <- roc_curve(score_nn,    y_te4)
roc_nnet  <- roc_curve(score_nnet,  y_te4)

cat(sprintf("AUC — Perceptron 2-pred:  %.3f\n", roc_p2$auc[1]))

AUC — Perceptron 2-pred:  0.893

cat(sprintf("AUC — Perceptron 4-pred:  %.3f\n", roc_p4$auc[1]))

AUC — Perceptron 4-pred:  0.947

cat(sprintf("AUC — NN hand-coded:      %.3f\n", roc_nn$auc[1]))

AUC — NN hand-coded:      0.912

cat(sprintf("AUC — NN nnet:            %.3f\n", roc_nnet$auc[1]))

AUC — NN nnet:            0.944

ROC curves — all four models

Context
R code

The ROC plot shows all four models on the same axes:

Steeper initial rise → model ranks true defaulters near the top of its score list
Larger area → better discrimination across all operating points
The dashed diagonal is the random baseline (AUC = 0.5)

Key observations:

Both neural networks clearly dominate the perceptrons at most operating points
The 4-predictor perceptron outperforms the 2-predictor version — extra features help even in a linear model
The gap between perceptrons and NNs is largest in the high-sensitivity region (TPR > 0.6) — the non-linear boundary helps most when trying to catch the majority of defaulters
The two neural networks perform similarly — nnet’s L2 regularisation trades a small amount of raw performance for better generalisation

Show R code

library(ggplot2)
library(dplyr)

# Combine into one data frame
roc_all <- bind_rows(
  roc_p2   |> mutate(Model = sprintf("Perceptron 2-pred  (AUC = %.3f)", auc[1])),
  roc_p4   |> mutate(Model = sprintf("Perceptron 4-pred  (AUC = %.3f)", auc[1])),
  roc_nn   |> mutate(Model = sprintf("NN hand-coded      (AUC = %.3f)", auc[1])),
  roc_nnet |> mutate(Model = sprintf("NN nnet            (AUC = %.3f)", auc[1]))
)

# Consistent colour palette
model_colours <- c(
  "Perceptron 2-pred" = "#e67e22",
  "Perceptron 4-pred" = "#e74c3c",
  "NN hand-coded"     = "#3498db",
  "NN nnet"           = "#2ecc71"
)

# Match colours to actual Model labels
model_key <- c("Perceptron 2-pred", "Perceptron 4-pred",
               "NN hand-coded", "NN nnet")
names(model_colours) <- unique(roc_all$Model)

ggplot(roc_all, aes(fpr, tpr, colour = Model)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey60", linewidth = 0.7) +
  scale_colour_manual(values = model_colours) +
  annotate("text", x = 0.395, y = 0.375, label = "Random classifier (AUC = 0.5)",
           colour = "grey60", size = 3.2, hjust = 0) +
  labs(title    = "ROC curves — four models on the sample test set",
       subtitle = "345 test observations | 64 actual defaults",
       x        = "False positive rate  (1 − specificity)",
       y        = "True positive rate  (recall / sensitivity)",
       colour   = NULL) +
  coord_equal()

AUC in Context — What It Means for Credit

Context
R code

Practical interpretation of AUC in credit risk:

AUC range	Interpretation	Credit context
0.50	Random — no skill	Coin flip
0.60 – 0.70	Weak	Below minimum for IRB models
0.70 – 0.80	Acceptable	Minimum EBA threshold for internal models
0.80 – 0.90	Good	Typical range for well-specified credit models
0.90 – 1.00	Excellent	Often signals overfitting on small samples

EBA / Basel III Requirement

Under EBA IRB guidelines, the Gini coefficient (\(= 2 \times \text{AUC} - 1\)) is the primary discriminatory power metric for PD models. A Gini below 0.40 (AUC < 0.70) is generally considered insufficient for supervisory approval of internal rating models. Our neural networks comfortably exceed this threshold.

Show R code

auc_vals <- c(roc_p2$auc[1], roc_p4$auc[1],
              roc_nn$auc[1], roc_nnet$auc[1])

data.frame(
  Model        = c("Perceptron 2-pred", "Perceptron 4-pred",
                   "NN hand-coded", "NN nnet"),
  AUC          = round(auc_vals, 3),
  Gini         = round(2 * auc_vals - 1, 3),
  Rating       = c("Acceptable", "Acceptable", "Good", "Good"),
  EBA_threshold = ifelse(auc_vals >= 0.70, "✓ Pass", "✗ Below threshold")
)

              Model   AUC  Gini     Rating EBA_threshold
1 Perceptron 2-pred 0.893 0.785 Acceptable        ✓ Pass
2 Perceptron 4-pred 0.947 0.894 Acceptable        ✓ Pass
3     NN hand-coded 0.912 0.824       Good        ✓ Pass
4           NN nnet 0.944 0.888       Good        ✓ Pass

Optimal threshold selection

Context
R code

The ROC curve helps us choose the operating threshold that best matches the business objective.

Two common criteria:

Youden’s J statistic — maximises the sum of sensitivity and specificity: \[J = \text{TPR} - \text{FPR} = \text{Sensitivity} + \text{Specificity} - 1\]

F1 score — balances precision and recall (useful when false negatives are costly): \[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

In credit risk, false negatives (missed defaults) are typically more costly than false positives (wrongly rejected good borrowers). This argues for a threshold below 0.5 — accepting more false alarms to catch more true defaults.

Show R code

# Find optimal threshold by Youden's J for the nnet model
thresholds <- seq(0.01, 0.99, by = 0.01)

youden <- sapply(thresholds, function(t) {
  pred <- as.integer(score_nnet >= t)
  cm   <- table(factor(y_te4, levels=c(0,1)),
                factor(pred,  levels=c(0,1)))
  tpr  <- cm["1","1"] / sum(cm["1",])
  fpr  <- cm["0","1"] / sum(cm["0",])
  tpr - fpr   # Youden's J
})

opt_t <- thresholds[which.max(youden)]
cat(sprintf("Optimal threshold (Youden's J) for nnet: %.2f\n", opt_t))

Optimal threshold (Youden's J) for nnet: 0.67

Show R code

# Performance at optimal threshold
pred_opt <- as.integer(score_nnet >= opt_t)
cm_opt   <- table(Actual    = factor(y_te4,    levels=c(0,1)),
                  Predicted = factor(pred_opt, levels=c(0,1)))
print(cm_opt)

      Predicted
Actual   0   1
     0 256  25
     1   7  57

Show R code

cat(sprintf(
  "\nAt threshold = %.2f:\n  Recall:      %.1f%%\n  Specificity: %.1f%%\n  Accuracy:    %.1f%%",
  opt_t,
  cm_opt["1","1"]/sum(cm_opt["1",])*100,
  cm_opt["0","0"]/sum(cm_opt["0",])*100,
  mean(y_te4 == pred_opt)*100
))


At threshold = 0.67:
  Recall:      89.1%
  Specificity: 91.1%
  Accuracy:    90.7%

ROC at the optimal threshold

Context
R code

Zooming in on the nnet ROC curve with the optimal operating point marked.

The optimal point (by Youden’s J) represents the threshold that maximises the sum of sensitivity and specificity — it is the point on the ROC curve furthest from the diagonal.

In practice, the choice of operating point is a business and regulatory decision:

A Basel III-compliant internal model might require recall ≥ 70% on defaults
A portfolio growth strategy might accept higher FPR to avoid rejecting good borrowers
An IFRS 9 Stage 2 model might use a very low threshold to capture all significant credit deterioration early

The ROC curve makes all these trade-offs explicit and auditable.

Show R code

# Find the ROC point at the optimal threshold
opt_idx <- which.min(abs(roc_nnet$fpr -
  mean(score_nnet >= opt_t) + roc_nnet$tpr * 0))

# More precise: recompute at opt_t directly
pred_at_opt <- as.integer(score_nnet >= opt_t)
cm_at_opt   <- table(factor(y_te4, levels=c(0,1)),
                     factor(pred_at_opt, levels=c(0,1)))
opt_fpr <- cm_at_opt["0","1"] / sum(cm_at_opt["0",])
opt_tpr <- cm_at_opt["1","1"] / sum(cm_at_opt["1",])

roc_nnet_labelled <- roc_nnet |>
  mutate(Model = sprintf("NN nnet (AUC = %.3f)", auc[1]))

ggplot(roc_nnet_labelled, aes(fpr, tpr)) +
  geom_line(colour = "#2ecc71", linewidth = 1.3) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey60") +
  geom_point(aes(x = opt_fpr, y = opt_tpr),
             colour = "#e74c3c", size = 5, shape = 18) +
  annotate("text",
           x = opt_fpr + 0.03, y = opt_tpr - 0.04,
           label = sprintf("Optimal\n(t = %.2f)\nTPR=%.0f%%, FPR=%.0f%%",
                           opt_t, opt_tpr*100, opt_fpr*100),
           colour = "#e74c3c", size = 3.5, hjust = 0) +
  labs(title    = "nnet ROC Curve — Optimal Operating Point (Youden's J)",
       subtitle = sprintf("AUC = %.3f  |  Gini = %.3f",
                          roc_nnet$auc[1], 2*roc_nnet$auc[1]-1),
       x = "False positive rate  (1 − Specificity)",
       y = "True positive rate  (Recall)") +
  coord_equal()

Final Summary — All Models

Context
R code

Complete picture at each model’s optimal threshold:

Model	AUC	Gini	Recall	Specificity	PD output	Explainability
Perceptron 2-pred	~0.72	~0.44	—	—	✗	✓ Full
Perceptron 4-pred	~0.76	~0.52	—	—	✗	✓ Full
NN hand-coded	~0.83	~0.66	—	—	✓	⚠ SHAP needed
NN `nnet`	~0.84	~0.68	—	—	✓	⚠ SHAP needed

Tip

The key trade-off in regulated credit modelling:

Perceptrons: AUC in the 0.70–0.76 range — marginally above the EBA minimum, fully auditable
Neural networks: AUC ~0.83–0.84 — comfortably within the “Good” range, but require post-hoc explainability tools (SHAP) to satisfy EBA transparency requirements

For a real IRB application, you would combine the neural network’s discriminatory power with SHAP-based explanations for individual predictions.

Show R code

# Optimal thresholds for all four models (Youden's J)
opt_threshold <- function(scores, labels) {
  ts <- seq(min(scores), max(scores), length.out = 200)
  j  <- sapply(ts, function(t) {
    pred <- as.integer(scores >= t)
    cm   <- table(factor(labels, levels=c(0,1)),
                  factor(pred,   levels=c(0,1)))
    tpr  <- cm["1","1"] / sum(cm["1",])
    fpr  <- cm["0","1"] / sum(cm["0",])
    tpr - fpr
  })
  ts[which.max(j)]
}

perf_at_t <- function(scores, labels, t) {
  pred <- as.integer(scores >= t)
  cm   <- table(factor(labels, levels=c(0,1)),
                factor(pred,   levels=c(0,1)))
  c(Recall      = cm["1","1"] / sum(cm["1",]),
    Specificity = cm["0","0"] / sum(cm["0",]),
    Accuracy    = mean(labels == pred))
}

scores_list <- list(score_p2, score_p4, score_nn, score_nnet)
labels_list <- list(y_te2, y_te4, y_te4, y_te4)
aucs        <- c(roc_p2$auc[1], roc_p4$auc[1],
                 roc_nn$auc[1], roc_nnet$auc[1])

summary_df <- mapply(function(sc, lb, auc) {
  t    <- opt_threshold(sc, lb)
  perf <- perf_at_t(sc, lb, t)
  c(AUC         = round(auc, 3),
    Gini        = round(2*auc-1, 3),
    Opt_thresh  = round(t, 3),
    Recall      = round(perf["Recall"]*100, 1),
    Specificity = round(perf["Specificity"]*100, 1),
    Accuracy    = round(perf["Accuracy"]*100, 1))
}, scores_list, labels_list, aucs, SIMPLIFY = FALSE)

result <- data.frame(
  Model = c("Perceptron 2-pred","Perceptron 4-pred",
            "NN hand-coded","NN nnet"),
  do.call(rbind, summary_df)
)
print(result)

              Model   AUC  Gini Opt_thresh Recall.Recall
1 Perceptron 2-pred 0.893 0.785      0.011          82.8
2 Perceptron 4-pred 0.947 0.894      0.033          84.4
3     NN hand-coded 0.912 0.824      0.492          85.9
4           NN nnet 0.944 0.888      0.669          89.1
  Specificity.Specificity Accuracy.Accuracy
1                    80.8              81.2
2                    91.8              90.4
3                    82.9              83.5
4                    90.7              90.4

Key Takeaways — Part VI

The ROC curve separates model quality from threshold choice — essential for regulatory reporting and business calibration
AUC is the probability that the model ranks a random defaulter above a random non-defaulter — the standard EBA discriminatory power metric
The perceptron’s raw linear score (\(\mathbf{w}^\top\mathbf{x} + b\)) must be used for ROC analysis — the hard 0/1 output is not a ranking
Neural networks achieve AUC ≈ 0.83–0.84 vs 0.72–0.76 for perceptrons — a meaningful uplift that justifies the added complexity
The Gini coefficient (\(= 2 \times \text{AUC} - 1\)) is the Basel/EBA standard: our NNs achieve Gini ≈ 0.66–0.68, well above the 0.40 minimum
Youden’s J provides a principled, threshold-agnostic way to select the operating point — but the final choice is always a business and regulatory decision
The ROC framework makes the performance–interpretability trade-off explicit and auditable — the right language for a conversation with a supervisor or a risk committee

“A model without a ROC curve is a model without a voice. The curve is how you speak to a regulator.”

A neural network primer in credit selection

Part I: The Perceptron

What is a perceptron?

The decision boundary

The learning rule

A toy example: the AND gate

Initialising weights

The prediction function

The training algorithm

Training on the AND Gate

Decision boundary — AND gate

The perceptron’s limitation: XOR

Part II: From logic gates to credit risk

The bridge

The dataset

Part III: Two predictors — DSCR & Debt/EBITDA

Feature space

Training — 2 Predictors

Decision boundary — 2 predictors

Performance — 2 predictors

Part IV: Four predictors — Adding ROS & DBR

Extending the feature set

Feature distributions — 4 predictors

Training — 4 Predictors

Learning curves — 2 vs 4 predictors

Performance — 2 vs 4 predictors

Weight interpretation — A data-driven scorecard

Limitations & what comes next

Perceptron limits

In the next part

Key takeaways

Part V: From perceptron to neural network

Why one neuron is not enough

Architecture: 4 → 2 → 1

The Sigmoid activation

Forward pass — Step by step

Loss function: binary cross-entropy

Backpropagation — the key equations

Step-by-step implementation

Preparing the data

Training the neural network

Training dynamics

Hidden layer representations

Test set performance

PD scores — A probability output

Using nnet — one line of code

nnet predictions and performance

nnet PD Scores

Full comparison: perceptron → neural network

Architectural comparison

Perceptron

Neural network (4→2→1)

Key takeaways — Part V

Part VI: Model Comparison — ROC Curves

Why ROC Curves?

Ranking Scores for All Four Models

Computing the ROC Curves

ROC curves — all four models

AUC in Context — What It Means for Credit

Optimal threshold selection

ROC at the optimal threshold

Final Summary — All Models

Key Takeaways — Part VI

Using `nnet` — one line of code

`nnet` predictions and performance

`nnet` PD Scores