Investigating ferromagnetic transitions using ML (Midway progress)

Team members: Ashish Panigrahi and S. Gautameshwar.

For the project proposal, see here. All the code used for this project is hosted on GitHub at this link. The slides for the presentation can be found here.[1]

Overview of the Ising Model

As presented in the project proposal, we deal specifically with one-dimensional and two-dimensional Ising spin models consisting of a lattice of spin-\(1/2\) particles (either spin-\(\uparrow\) or spin-\(\downarrow\)).

The Hamiltonian of the Ising model is given by

\[ H(\sigma) = - \sum_{\langle i, j \rangle} J_{ij} \sigma_i \sigma_j \]

where the summation is carried out over the first nearest neighbours, i.e. \(j = i+1\) or \(j = i-1\). Hence the equivalent total "eigenenergy" of the lattice is similarly given by

\[ E(\sigma) = - \sum_{\langle i, j \rangle} J_{ij} \sigma_i \sigma_j \]

where \(\sigma_i = \pm 1\), giving a scalar value for the energy.
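As a quick worked example, take a periodic chain of four spins \([1, 1, -1, -1]\) with \(J_{ij} = 1\) on nearest-neighbour bonds (counting each bond once):

\[ E = -(\sigma_1\sigma_2 + \sigma_2\sigma_3 + \sigma_3\sigma_4 + \sigma_4\sigma_1) = -(1 - 1 + 1 - 1) = 0 \]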

The 1D Ising Model

Consider a spin configuration of a 1D spin lattice as below

\[ \text{Lattice: } [1, 1, -1, 1, -1, 1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1]\\ \text{Energy: 8 units} \]

What is the extent of interaction of our spins with each other?

Clearly, just by looking at a bunch of these configurations and energies, it is almost impossible to tell how much the first spin talks to the second, or the third, or the spins after them. It is possible, however, to extract this information from the data: we train a machine learning model to answer this very question.

Data generation for one dimension

Taking the case of 50 particles, we define a one-dimensional lattice by assigning each site a random value of either +1 or -1. We assume that the temperature of the lattice is very high, so as not to invoke the complexities of Boltzmann statistics. Hence the generated data has no preference for lower, more stable energies over higher ones: all lattice configurations are equally probable.

The total energy of the 1D lattice is calculated using the procedure described before, i.e. considering interactions between nearest neighbours only. We also assume periodic boundary conditions, i.e. the neighbours of the 1\(^{st}\) particle are the 2\(^{nd}\) and the last particle.

Training a model to find coupling matrix J

Linear regression approach

Model parameters: \(w_{ij} = J_{ij}\), \(x_{ij} = \sigma_i \sigma_j\)

Prediction: \(E_p^{(n)} = J_{ij} x_{ij}^{(n)}\)

Minimization function:

\[ \theta = \frac{1}{N} \sum_{n=1}^{N} (E_p^{(n)} - E_{label}^{(n)})^2 \]

Using the 50-particle lattice data described above, we train our model on a total dataset of 10000 lattices with a training/testing split of 70:30. The Python code used is as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# function to return the energy of our lattice
def energy(l):
    e = 0
    j = 1
    # our energy assumes periodic boundary conditions
    for i in range(l.size - 1):
        e += -j * l[i] * (l[i - 1] + l[i + 1])

    e += -j * l[l.size - 1] * (l[l.size - 2] + l[0])
    return e


# length of our lattice
d = 50

# number of lattice configurations in the dataset
n = 10000

# defining our lattice
l = np.random.choice([-1, 1], size=(n, d))

# array to store the labels
label = np.zeros(n)

# creating the labels
for i in range(n):
    label[i] = energy(l[i])

# partition into training and testing sets
m = int(7 * n / 10)

train = l[:m]
test = l[m:]

train_label = label[:m]
test_label = label[m:]

# data generated!!

# setting up linear regression
leastsq = linear_model.LinearRegression()

# features for training: computing the outer products S_i * S_j
train_states = np.einsum("...i,...j->...ij", train, train)
shape = train_states.shape

# each row is reshaped from a d x d (50x50) matrix of S_i*S_j
train_states = train_states.reshape((shape[0], shape[1] * shape[2]))

# features for testing: computing the outer products S_i * S_j
test_states = np.einsum("...i,...j->...ij", test, test)
shape = test_states.shape

# each row is reshaped from a d x d (50x50) matrix of S_i*S_j
test_states = test_states.reshape((shape[0], shape[1] * shape[2]))

leastsq.fit(train_states, train_label)
J = leastsq.coef_.reshape((shape[1], shape[2]))

mark1 = leastsq.score(train_states, train_label) * 100
mark2 = leastsq.score(test_states, test_label) * 100

print("n: %s" % n, "train-test: ", m, "-", n - m)
print("Training score: %s /100" % mark1)
print("Testing score: %s /100" % mark2, "\n")

plt.figure()
plt.title("n=%s" % n)
c = plt.imshow(
    J, vmin=-1, vmax=1, cmap="Spectral_r", interpolation="nearest", origin="lower"
)
plt.colorbar(c)
n: 10000 train-test:  7000 - 3000
Training score: 100.0/100
Testing score: 100.0/100 (Too good!)

Clearly something is wrong here. Even though the predicted model is nowhere close to the actual \(J\), the test score reports 100% accuracy! The perfect training score can be explained by our model overfitting the data. But why a perfect test score?

Taking a few steps back...

Let us train the model for smaller values of \(n\), say \(n = 500, 1700, 2500\).

n: 500 train-test:  350 - 150
Training score: 100.0 /100
Testing score: 38.64463490802842 /100

n: 1700 train-test:  1190 - 510
Training score: 100.0 /100
Testing score: 96.71980203823644 /100

n: 2500 train-test:  1750 - 750
Training score: 100.0 /100
Testing score: 100.0 /100

One can notice that for greater values of \(n\), the accuracy score on the testing data increases. We proceed to plot this variation of accuracy with increasing \(n\).

Figure: variation of training and testing score.

Looking at the above plot, after a certain number of datasets \(n\), the training and testing scores become almost identical. Consider the following cases:

Implementing regularization to avoid an overfitting model

Lasso regression approach

Model parameters: \(w_{ij} = J_{ij}\), \(x_{ij} = \sigma_i \sigma_j\)

Prediction: \(E_p^{(n)} = J_{ij} x_{ij}^{(n)}\)

Minimization function:

\[ \theta = \lambda |J| + \frac{1}{N} \sum_{n=1}^{N} (E_p^{(n)} - E_{label}^{(n)})^2 \]

The goal is to minimize the above function. A regularization parameter \(\lambda\) is introduced, which, when chosen properly, gives a balanced fit.

One can make the following observations from the above GIF (jif?):

λ: 0.1
train-test: 350 - 150
Training score: 99.71/100
Testing score: 99.53/100

λ: 0.01
train-test: 1050 - 450
Training score: 99.99/100
Testing score: 99.99/100

λ: 0.1
train-test: 3500 - 1500
Training score: 99.76/100
Testing score: 99.75/100

So, our Lasso model accurately predicts the \(J\) matrix at these sweet spots. Unlike linear regression, it does not have any issue of overfitting for large values of \(n\).
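A minimal sketch of how such a Lasso fit might be set up with scikit-learn, reusing the train_states, test_states, and label arrays built in the linear-regression code above (the value of lam is illustrative, and scikit-learn's alpha plays the role of \(\lambda\) up to its own normalization conventions):

from sklearn import linear_model

# lam plays the role of the regularization parameter lambda above
lam = 0.01
lasso = linear_model.Lasso(alpha=lam)

# same outer-product features and energy labels as before
lasso.fit(train_states, train_label)
J_lasso = lasso.coef_.reshape((d, d))

print("Training score: %.2f /100" % (lasso.score(train_states, train_label) * 100))
print("Testing score: %.2f /100" % (lasso.score(test_states, test_label) * 100))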

Ridge regression approach

Model parameters: \(w_{ij} = J_{ij}\), \(x_{ij} = \sigma_i \sigma_j\)

Prediction: \(E_p^{(n)} = J_{ij} x_{ij}^{(n)}\)

Minimization function:

\[ \theta = \lambda |J|^2 + \frac{1}{N} \sum_{n=1}^{N} (E_p^{(n)} - E_{label}^{(n)})^2 \]

The goal is to minimize the above function. A regularization parameter \(\lambda\) is introduced, which, when chosen properly, gives a balanced fit, serving the same purpose as in Lasso regression.
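A minimal Ridge sketch on the same outer-product features looks almost identical (again, lam is an illustrative value mapped onto scikit-learn's alpha):

from sklearn import linear_model

lam = 0.001
ridge = linear_model.Ridge(alpha=lam)

ridge.fit(train_states, train_label)
J_ridge = ridge.coef_.reshape((d, d))

print("Training score: %.2f /100" % (ridge.score(train_states, train_label) * 100))
print("Testing score: %.2f /100" % (ridge.score(test_states, test_label) * 100))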

A few points to note:

Figure: λ = 0.001 for n = 500
λ: 0.001
train-test: 350 - 150
Training score: 99.99/100
Testing score: 17.14/100

Figure: λ = 0.001 for n = 1500
λ: 0.001
train-test: 1050 - 450
Training score: 99.99/100
Testing score: 85.24/100

Figure: λ = 0.001 for n = 5000
λ: 0.001
train-test: 3500 - 1500
Training score: 99.99/100
Testing score: 99.99/100

Remark on Lasso and Ridge regression

Why do we end up with different results for Lasso and Ridge regression despite them being very similar?

\[ J_{\text{ridge}} = \frac{J_{\text{linear}}}{1+\lambda} \]

Hence, Ridge always scales \(J\) down by the regularization parameter \(\lambda\): the larger its value, the flatter our "slope" becomes. But even for large \(\lambda\), \(J\) only tends to 0 asymptotically (close but never equal to 0). Lasso's \(|J|\) penalty, in contrast, can set the irrelevant couplings exactly to zero, which is why it recovers the sparse \(J\) matrix more cleanly.
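In this one-parameter picture (a single coupling \(J\) with features \(x = \pm 1\)), the two minimization functions above have simple closed-form minimizers: Ridge gives the scaling just shown, while Lasso soft-thresholds, \(J_{\text{lasso}} = \mathrm{sign}(J_{\text{linear}}) \max(|J_{\text{linear}}| - \lambda/2,\ 0)\). A tiny numerical illustration (the values are arbitrary):

import numpy as np

J_linear = 1.0  # least-squares estimate of a single coupling
for lam in (0.5, 1.0, 2.0, 5.0):
    J_ridge = J_linear / (1 + lam)                                   # shrinks, but never exactly 0
    J_lasso = np.sign(J_linear) * max(abs(J_linear) - lam / 2, 0.0)  # exactly 0 once lam >= 2
    print("lambda = %.1f: ridge = %.3f, lasso = %.3f" % (lam, J_ridge, J_lasso))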

Beyond nearest neighbours

We considered instances of lattices where only interactions between the first nearest neighbours mattered. What happens if we involve the second nearest neighbour interactions as well? Does our ML model predict their physical nature as well?

Ridge regression for next nearest neighbour interaction picture

We define our coupling constant \(J_{ij}\) in this case as

\[ J_{ij} = \begin{cases} -1 & \text{if } j = i \pm 1 \\ 0.5 & \text{if } j = i \pm 2 \\ 0 & \text{otherwise.} \end{cases} \]
Figures: actual model vs. predicted model.
n: 5000
train-test: 2500 - 1500
Training score: 99.99/100
Testing score: 99.99/100

Considering an alternate model:

\[ J_{ij} = \begin{cases} -1 & \text{if } j = i \pm 1 \\ -0.5 & \text{if } j = i \pm 2 \\ 0 & \text{otherwise.} \end{cases} \]
Figures: actual model vs. predicted model.
n: 5000
train-test: 3500 - 1500
Training score: 99.99/100
Testing score: 99.99/100

Thus, Ridge regression not only detects the existence of a second-neighbour interaction, but also gives us the accurate coupling constants between the first and second nearest neighbours.
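A minimal sketch (not the project's actual code) of how such next-nearest-neighbour data might be generated and fitted, assuming the energy \(E = \sum_{ij} J_{ij}\sigma_i\sigma_j\) used by the regression model, periodic boundaries, and an illustrative coupling matrix J_true:

import numpy as np
from sklearn import linear_model

d, n = 50, 5000

# assumed coupling matrix: -1 on first neighbours, 0.5 on second neighbours
J_true = np.zeros((d, d))
for i in range(d):
    J_true[i, (i + 1) % d] = J_true[i, (i - 1) % d] = -1.0
    J_true[i, (i + 2) % d] = J_true[i, (i - 2) % d] = 0.5

# random configurations and their energies E_n = sum_ij J_ij * s_i * s_j
lattices = np.random.choice([-1, 1], size=(n, d))
energies = np.einsum("ni,ij,nj->n", lattices, J_true, lattices)

# same outer-product features as before, followed by a Ridge fit
features = np.einsum("...i,...j->...ij", lattices, lattices).reshape(n, d * d)
m = int(0.7 * n)
ridge = linear_model.Ridge(alpha=0.001)
ridge.fit(features[:m], energies[:m])
J_pred = ridge.coef_.reshape(d, d)  # should closely resemble J_true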


Data generation for two dimensions

For this we require some knowledge of statistical mechanics. Consider a system in thermal equilibrium with its surroundings. The probability of it being in a configuration \(\mu\) with corresponding energy \(E_{\mu}\) is given by

\[ p_{\mu} = \frac{1}{Z} e^{-\beta E_{\mu}} \]

where \(Z = \sum_{\mu} e^{-\beta E_{\mu}}\) is the partition function (normalizing factor). At equilibrium, the following must be true

\[ \sum_{\nu} p_{\mu} P(\mu \rightarrow \nu) = \sum_{\nu} p_{\nu} P(\nu \rightarrow \mu) \]

where \(P(\mu \rightarrow \nu)\) is the probability of transitioning from state \(\mu\) to state \(\nu\). However, from a practical point of view, enforcing this condition numerically is difficult. Hence we impose the stronger detailed balance condition

\[ p_{\mu} P(\mu \rightarrow \nu) = p_{\nu} P(\nu \rightarrow \mu) \]

Recall our energy Hamiltonian:

\[ E(\mu) = - \sum_{\langle i, j \rangle} J \sigma_i \sigma_j \]

where the sum is over the first nearest neighbours and \(\sigma_i\) is the spin of the i\(^{th}\) particle (either +1 or -1). \(\mu\) denotes a particular configuration of the spin lattice system. Hence, the detailed balance condition can be written as

\[ \frac{P(\mu \rightarrow \nu)}{P(\nu \rightarrow \mu)} = \frac{p_{\nu}}{p_{\mu}} = e^{-\beta (E_\nu - E_\mu)} \]
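One standard way of satisfying this ratio, and the one used in the algorithm below, is to accept a proposed move with probability

\[ P(\mu \rightarrow \nu) = \min\left(1,\ e^{-\beta (E_\nu - E_\mu)}\right) \]

so that moves which lower the energy are always accepted, while moves which raise it are accepted with the corresponding Boltzmann factor.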

Metropolis algorithm

The basic idea behind this algorithm is pretty simple. The goal is to find an equilibrium state for the two-dimensional spin lattice system at a particular temperature dictated by \(\beta\) where

\[ \beta = \frac{1}{k_B T} \]

where \(k_B\) is the Boltzmann constant and \(T\) is the temperature of the lattice.

The algorithm is as follows:

  1. We start with a random state \(\mu\) for the lattice configuration.

  2. Pick any random lattice site and flip the sign of its spin. We call this new state \(\nu\). The goal is to find the probability \(P(\mu \rightarrow \nu)\) with which we accept this new state.

  3. Evaluate the total energy of these states and call them \(E_\mu\) and \(E_\nu\) respectively.

    • If \(E_\nu > E_\mu\) then we set \(P(\nu \rightarrow \mu) = 1\) and thus by using the detailed balance equation we get \(P(\mu \rightarrow \nu) = e^{-\beta (E_\nu - E_\mu)}\).

    • Similarly, if \(E_\mu > E_\nu\), then set \(P(\mu \rightarrow \nu) = 1\). Correspondingly, we see \(P(\nu \rightarrow \mu) = e^{-\beta (E_\mu - E_\nu)}\).

  4. Change to state \(\nu\) (i.e. flip the spin of the particle) with the probabilities as shown above.

  5. Iterate by returning to step 2 with the current state. We repeat this procedure as many times as required; eventually we end up with an equilibrium state.

Hence, we see that we only need to evaluate the quantity (for \(E_\nu > E_\mu\))

\[ -\beta (E_\nu - E_\mu) = -2\beta J \, \sigma_i \sum_{k=1}^{4} \sigma_k \]

where \(\sigma_i\) is the spin of the \(i^{th}\) particle before flipping and \(\sigma_k\) are the spins of its four neighbouring particles (imagine a square lattice in 2D). However, if the flipped particle is at the boundary then there may be fewer than 4 neighbours, since in the two-dimensional case we do not impose periodic boundary conditions.
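A minimal sketch of this update rule for a two-dimensional square lattice with open (non-periodic) boundaries is given below; the function name, lattice size, and the choice \(J = 1\) are illustrative assumptions rather than the project's final code.

import numpy as np

def metropolis_sweep(lattice, beta, J=1.0, rng=None):
    # attempt L*L single-spin flips on a 2D lattice with open (non-periodic) boundaries
    rng = np.random.default_rng() if rng is None else rng
    L = lattice.shape[0]
    for _ in range(L * L):
        # step 2: pick a random site whose spin we consider flipping
        i, j = rng.integers(0, L, size=2)

        # sum the spins of the (up to four) existing neighbours
        s = 0
        if i > 0:
            s += lattice[i - 1, j]
        if i < L - 1:
            s += lattice[i + 1, j]
        if j > 0:
            s += lattice[i, j - 1]
        if j < L - 1:
            s += lattice[i, j + 1]

        # energy change on flipping spin (i, j): E_nu - E_mu = 2 * J * sigma_ij * (sum of neighbours)
        dE = 2.0 * J * lattice[i, j] * s

        # steps 3-4: accept the flip with probability min(1, exp(-beta * dE))
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            lattice[i, j] *= -1
    return lattice

# usage: relax a random 20x20 lattice by repeated sweeps (step 5)
rng = np.random.default_rng(0)
lat = rng.choice([-1, 1], size=(20, 20))
for _ in range(200):
    metropolis_sweep(lat, beta=0.6, rng=rng)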

This is where our midway progress ends. The next milestone is training models for the two-dimensional Ising model to predict the critical temperature at which the transition from the ferromagnetic to the paramagnetic phase occurs.

Bibliography

  1. Mehta, Pankaj, et al. "A high-bias, low-variance introduction to machine learning for physicists." Physics Reports 810 (2019): 1-124.

  2. Beichl, Isabel, and Francis Sullivan. "The Metropolis algorithm." Computing in Science & Engineering 2.1 (2000): 65-69.

  3. Hastings, W. K. "Monte Carlo sampling methods using Markov chains and their applications." Biometrika 57.1 (1970): 97-109.

  4. Deisenroth, Marc Peter, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020.

[1] The original slides were written in a Mathematica notebook, and the conversion to PDF does not preserve the original formatting.