Both the gradient vector and Hessian matrix are used in optimisation, particularly in continuous optimisation problems.

 

The gradient vector is a vector of the first partial derivatives of a scalar-valued function f(x) with respect to its n variables:

$$ \nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^T $$

 

The gradient vector is used to find the direction of steepest ascent or descent of a scalar-valued function, depending on the problem (minimisation or maximisation). Especially in machine learning, it is essential for training models via backpropagation, where it guides parameter updates to minimise loss functions.
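As a concrete illustration, here is a minimal sketch (Python/NumPy, using a hypothetical quadratic objective f that is not part of the discussion above) of how the negative gradient gives a descent direction:

```python
import numpy as np

def f(x):
    # Hypothetical objective: f(x) = x1^2 + 3*x2^2 (a simple convex bowl)
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    # Analytical gradient of f: [2*x1, 6*x2]
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 1.0])
alpha = 0.1  # learning rate

# One steepest-descent step: move against the gradient to decrease f
x_new = x - alpha * grad_f(x)
print(f(x), f(x_new))  # f decreases after the step
```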

 

The Hessian matrix is an n × n matrix of the second partial derivatives of the function:

$$ H_f(x) = \nabla^2 f(x) = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right]_{i,j} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} $$

 

In optimisation, the Hessian matrix is used to determine whether a point is a local minimum, local maximum, or saddle point, and to understand the curvature of the objective function for better convergence. Methods such as Newton's method and quasi-Newton methods use the Hessian to update parameters, leveraging both first and second derivatives to find optima more efficiently than gradient-only methods.
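To make this concrete, below is a minimal sketch of a single Newton step on the same hypothetical quadratic objective as above, assuming the gradient and Hessian are available analytically; it is an illustrative example rather than a production implementation:

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = x1^2 + 3*x2^2
    return np.array([2 * x[0], 6 * x[1]])

def hess_f(x):
    # Hessian of the same f: constant for a quadratic
    return np.array([[2.0, 0.0],
                     [0.0, 6.0]])

x = np.array([1.0, 1.0])

# Newton update: x_{t+1} = x_t - H^{-1} * grad
# (solve the linear system rather than explicitly inverting H)
step = np.linalg.solve(hess_f(x), grad_f(x))
x_new = x - step
print(x_new)  # lands exactly at the minimum [0, 0] for a quadratic objective
```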

 

To generalise the gradient for functions with multiple outputs, the Jacobian matrix is used. It is a matrix of all first-order partial derivatives of a vector-valued function.

 

A vector-valued function f(x) is defined as:

$$ f(x)=\begin{bmatrix} f_1(x_1,x_2,...,x_n) \\ f_2(x_1,x_2,...,x_n) \\ \vdots  \\ f_m(x_1,x_2,...,x_n) \end{bmatrix} $$

 

The Jacobian matrix of f(x) is:

$$ J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} $$

 

The Jacobian plays a key role in backpropagation (especially in layers with multiple outputs) in machine learning, and is also used in Newton's method for solving systems of nonlinear equations involving vector-valued functions.
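As an illustration, the sketch below applies Newton's method to a hypothetical two-equation nonlinear system (NumPy only; the functions f and jacobian_f are made up for this example):

```python
import numpy as np

def f(x):
    # Hypothetical vector-valued function f: R^2 -> R^2
    return np.array([x[0]**2 + x[1] - 3,
                     x[0] + x[1]**2 - 5])

def jacobian_f(x):
    # Analytical Jacobian: row i holds the partial derivatives of f_i
    return np.array([[2 * x[0], 1.0],
                     [1.0, 2 * x[1]]])

x = np.array([1.0, 1.0])

# Newton iteration for the system f(x) = 0: x_{t+1} = x_t - J^{-1} f(x_t)
for _ in range(10):
    x = x - np.linalg.solve(jacobian_f(x), f(x))

print(x, f(x))  # f(x) should be close to zero after a few iterations
```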


Everything is made of atoms, which are the smallest fundamental units of matter. Each atom has a centre called the nucleus, which contains protons and neutrons. Protons carry a positive charge, while neutrons are electrically neutral and help stabilise the nucleus by reducing the repulsive forces between protons.

 

Unlike protons and neutrons, electrons are located outside the nucleus but remain part of the atom. They carry a negative charge, orbit the nucleus, and are responsible for the atom's chemical properties.

 

Atomic structure. Source: GeeksForGeeks (https://www.geeksforgeeks.org/chemistry/proton/)

 

Electrons and neutrons both interact with protons, but their roles differ:

  • Inside the nucleus, neutrons are uncharged and help prevent protons from repelling each other by exerting nuclear forces.
    • The nuclear force holds protons and neutrons together in the nucleus. Maintaining the right balance of protons and neutrons keeps the nucleus stable; if an atom has too many or too few neutrons, it can become unstable and radioactive, undergoing decay by emitting radiation.
  • Outside the nucleus, electrons interact with protons through their negative electric charge, balancing the overall charge of the atom.

At our very core, we're all made of atoms. In a sense, we are light ourselves, because so much of what happens in our atoms involves electrons releasing energy as light.

  • The light we observe is emitted when electrons within these atoms transition from higher to lower energy levels, releasing energy in the form of light.

With protons, neutrons, and electrons, an atom holds the key to life itself.

SGD (Stochastic Gradient Descent) and Adam (Adaptive Moment Estimation) are both widely used optimisation algorithms for training machine learning and deep learning models. While both use mini-batches to update parameters, Adam is an advanced adaptive optimisation algorithm, not a simple variant of SGD.

 

The key idea behind mini-batch SGD is to approximate the gradient of the loss function using a small, randomly selected mini-batch of data at a time, rather than the entire dataset. At each training iteration, the algorithm performs the following steps:

  1. Pick a random mini-batch B of b training examples
  2. Compute the gradient of the loss function with respect to the model parameters over this mini-batch
    $$ \nabla_{\text{mini-batch}} J(\theta_t) = \frac{1}{b} \sum_{(x^{(j)}, y^{(j)}) \in B} \nabla L(h_{\theta_t}(x^{(j)}), y^{(j)}) = g_t $$
  3. Update the parameters in the opposite direction of the gradient, scaled by the learning rate α, to minimise the loss (see the sketch after this list)
    $$ \theta_{t+1} = \theta_t - \alpha \nabla_{\text{mini-batch}} J(\theta_t) $$
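A minimal sketch of this loop (Python/NumPy, with a toy linear-regression model standing in for a generic hypothesis h; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2*x + noise; theta is a single weight for illustration
X = rng.normal(size=(1000, 1))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=1000)

theta = np.zeros(1)
alpha, b = 0.1, 32  # learning rate and mini-batch size

for t in range(200):
    # 1. Pick a random mini-batch of b examples
    idx = rng.choice(len(X), size=b, replace=False)
    X_b, y_b = X[idx], y[idx]

    # 2. Gradient of the mean squared-error loss w.r.t. theta
    residual = X_b @ theta - y_b
    g_t = (2 / b) * X_b.T @ residual

    # 3. Move against the gradient, scaled by the learning rate
    theta = theta - alpha * g_t

print(theta)  # should approach [2.0]
```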

Adam combines the best of both SGD with momentum and adaptive learning rate methods. It maintains two exponentially decaying moving averages for each parameter:

  • The gradients (the first moment, similar to momentum)
  • The squared gradients (the second moment, similar to RMSprop)

The update rule for each training iteration is as follows:

  1. Compute the gradient of the loss function with respect to the parameters
  2. Update the biased moving averages of the first moment (m) and the second moment (v)
    $$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$
    $$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
    Here m_t is an exponentially decaying average of the gradients, and v_t is an exponentially decaying average of the squared gradients.
  3. Compute bias-corrected estimates for both moments. This step is necessary because the moments are initialised to zero, which biases them towards zero, especially during early training steps
    $$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$
    $$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
  4. Update the parameters using the bias-corrected moment estimates (see the sketch after this list)
    $$ \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$
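Putting the four steps together, here is a minimal sketch of the Adam update on the same toy problem as the SGD sketch above; the hyperparameter values are the commonly used defaults, chosen here only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=1000)

theta = np.zeros(1)
m = np.zeros_like(theta)   # first moment: moving average of gradients
v = np.zeros_like(theta)   # second moment: moving average of squared gradients
alpha, beta1, beta2, eps, b = 0.01, 0.9, 0.999, 1e-8, 32

for t in range(1, 1001):
    # 1. Mini-batch gradient of the squared-error loss
    idx = rng.choice(len(X), size=b, replace=False)
    X_b, y_b = X[idx], y[idx]
    g = (2 / b) * X_b.T @ (X_b @ theta - y_b)

    # 2. Update biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2

    # 3. Bias correction (the moments start at zero)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # 4. Parameter update with the bias-corrected estimates
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # should approach [2.0]
```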

HN Adam is a modified Adam algorithm introduced by Reyad, M., Sarhan, A.M., and Arafa, M. in their 2023 paper, "A modified Adam algorithm for deep neural network optimization" [1]. The algorithm is based on a hybrid approach that combines the original Adam algorithm with the AMSGrad algorithm and incorporates an adaptive norm technique. The letters "H" and "N" in the name refer to the 'hybrid' and 'adaptive norm' components, respectively.

 

This algorithm's key feature is its adaptive norm technique, which adjusts the learning rate step size. According to the paper, the norm value is adaptively computed based on the maximum of the absolute value of the current gradient and the exponential moving average of past gradients. The threshold value of the norm is also randomly chosen.

 

 

The researchers found that this hybrid mechanism enhanced the generalisation performance and achieved a high speed of convergence for most deep neural network architectures.

HN Adam was selected for my final project on breast cancer detection using deep learning. The algorithm's potential to enhance the performance of a modified U-Net model was a key factor in its selection, particularly due to the need for a more dynamic learning rate strategy. This was necessary to address a challenge where the model’s evaluation metrics were stagnating around a specific value during training.

To validate the model's effectiveness, a comprehensive evaluation was conducted using both the metrics reported in the original paper and additional performance indicators, including the Dice coefficient, Jaccard index, and various confusion matrix-based metrics. Among these, the Dice coefficient and Jaccard index served as the primary evaluation criteria.

After training for 1,000 epochs, the model achieved its best performance on the validation dataset as follows:

  • Dice Coefficient: 97.56%
  • Jaccard Index: 95.56%
  • Precision: 98%
  • Recall: 97%
  • F1-Measure: 98%
  • Accuracy: 99.97%
  • Loss: 0.033495

The project is still in progress, and additional details and final results will be shared upon completion.

 

 

Reference:

[1] Reyad, M., Sarhan, A.M. and Arafa, M. (2023). A modified Adam algorithm for deep neural network optimization.

Dice Loss is designed to maximise the Dice Coefficient during training by minimising its complement (one minus the Dice Coefficient).

  • Minimising Dice Loss encourages greater overlap between predicted and true masks.
  • It inherently addresses class imbalance.

$$ \text{Dice Loss} = 1 - \text{Dice Coefficient} $$
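A minimal sketch of a soft Dice loss for binary masks (Python/NumPy; the smoothing constant is an added implementation detail for numerical stability, not part of the definition above):

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1e-6):
    """Soft Dice loss for binary segmentation masks.

    y_true: ground-truth mask (0/1); y_pred: predicted probabilities in [0, 1].
    """
    y_true = y_true.ravel()
    y_pred = y_pred.ravel()
    intersection = np.sum(y_true * y_pred)
    dice = (2 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return 1 - dice
```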

Binary Cross-Entropy Loss (BCE) measures the error between predicted probabilities and actual binary labels for each pixel.

  • It is used for pixel-wise classification tasks, which is how segmentation can be viewed (each pixel is classified as foreground or background).

$$ \text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$

Combined Loss (BCE + Dice Loss) leverages the strengths of BCE Loss and Dice Loss.

  • BCE guides the model to learn pixel-wise classification accurately, pushing predicted probabilities towards 0 for background and 1 for foreground. It's good for overall learning!
  • Dice Loss directly optimises for overlap, which is crucial for segmentation performance, especially for small or imbalanced foreground objects. It helps the model prioritise getting the object boundaries right rather than being overwhelmed by the background.

$$ \text{Combined Loss} = \text{BCE} + \text{Dice Loss} $$
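A minimal sketch of the combined loss (Python/NumPy), reusing the dice_loss helper from the sketch above and adding a pixel-wise BCE; clipping the predictions to avoid log(0) is an added implementation detail:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    # Pixel-wise binary cross-entropy, averaged over all pixels
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def combined_loss(y_true, y_pred):
    # BCE drives per-pixel accuracy; Dice drives overlap with the true mask
    # (dice_loss is the helper defined in the sketch above)
    return bce_loss(y_true, y_pred) + dice_loss(y_true, y_pred)
```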

 

Tversky Loss extends Dice Loss by introducing two coefficients, 𝛼 and 𝛽, which control the relative importance of False Positives (FP) and False Negatives (FN).

  • By weighting FP and FN differently, it becomes especially useful for imbalanced segmentation tasks, where either precision or recall needs to be emphasised.

$$ \text{Tversky Index} = \frac{TP}{TP + \alpha FP + \beta FN} $$

 

Focal Tversky Loss combines the ideas of Tversky Loss and Focal Loss.

  • Focal Loss was introduced to address the problem of class imbalance by down-weighting the loss contribution from easy examples (well-classified instances) and focusing more on hard examples (misclassified or difficult-to-classify instances).

$$ \text{Focal Tversky Loss} = (1 - \text{Tversky Index})^{\gamma} = \left(1 - \frac{TP + \epsilon}{TP + \alpha FP + \beta FN + \epsilon}\right)^{\gamma} $$
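A minimal sketch of the Focal Tversky loss for binary masks (Python/NumPy); the α, β, γ, and ε values are illustrative choices, not values prescribed in this post:

```python
import numpy as np

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-6):
    """Focal Tversky loss for binary segmentation masks.

    alpha weights false positives, beta weights false negatives (matching the
    Tversky index above), and gamma focuses the loss on harder examples.
    """
    y_true = y_true.ravel()
    y_pred = y_pred.ravel()
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1 - y_true) * y_pred)
    fn = np.sum(y_true * (1 - y_pred))
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1 - tversky) ** gamma
```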
