Investigating ferromagnetic transitions using ML (Midway progress)

Team members: Ashish Panigrahi and S. Gautameshwar.

For the project proposal, see here. All the code used for this project is hosted on GitHub at this link. The slides for the presentation can be found here.[1]

Overview of the Ising Model

As presented in the project proposal, we deal specifically with one-dimensional and two-dimensional Ising spin models consisting of a lattice of spin-\(1/2\) particles (either spin-\(\uparrow\) or spin-\(\downarrow\)).

The Hamiltonian of the Ising model is given by

\[ H(\sigma) = - \sum_{\langle i, j \rangle} J_{ij} \sigma_i \sigma_j \]

where the summation is carried out over the first nearest neighbours, i.e. \(j = i+1\) or \(j = i-1\). Hence the equivalent total "eigenenergy" of the lattice is similarly given by

\[ E(\sigma) = - \sum_{\langle i, j \rangle} J_{ij} \sigma_i \sigma_j \]

where \(\sigma_i = \pm 1\), giving a scalar value for the energy.
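As a quick worked example, take a periodic chain of four spins \([1, 1, -1, -1]\) with \(J_{ij} = 1\) on nearest-neighbour bonds (counting each bond once):

\[ E = -(\sigma_1\sigma_2 + \sigma_2\sigma_3 + \sigma_3\sigma_4 + \sigma_4\sigma_1) = -(1 - 1 + 1 - 1) = 0 \]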

The 1D Ising Model

Consider a spin configuration of a 1D spin lattice as below

\[ \text{Lattice: } [1, 1, -1, 1, -1, 1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1]\\ \text{Energy: 8 units} \]

What is the extent of interaction of our spins with each other?

Clearly, just by looking at a bunch of these configurations and energies, it is almost impossible to tell how much the first spin talks to the second, or the third, or the spins after them. It is possible, however, to extract this information from the data: we train a machine learning model to answer this very question.

Data generation for one dimension

Taking the case of 50 particles, we define a one-dimensional lattice by assigning each site a random value of either +1 or -1. We assume that the temperature of the lattice is very high, so as not to invoke the complexities of Boltzmann statistics. Hence the generated data has no preference for lower, more stable energies over higher ones: all lattice configurations are equally probable.

The total energy of the 1D lattice is calculated using the procedure described before, i.e. considering interactions between nearest neighbours only. We also assume periodic boundary conditions, i.e. the neighbours of the 1\(^{st}\) particle are the 2\(^{nd}\) and the last particle.

Training a model to find coupling matrix J

Linear regression approach

Model parameters: \(w_{ij} = J_{ij}\), \(x_{ij} = \sigma_i \sigma_j\)

Prediction: \(E_p^{(n)} = J_{ij} x_{ij}^{(n)}\)

Minimization function:

\[ \theta = \frac{1}{N} \sum_{n=1}^{N} (E_p^{(n)} - E_{label}^{(n)})^2 \]

Using the 50-particle lattice data described above, we train our model on a total dataset of 10000 lattices with a training/testing split of 70:30. The Python code used is as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# function to return the energy of our lattice
def energy(l):
    e = 0
    j = 1
    # our energy assumes periodic boundary conditions
    for i in range(l.size - 1):
        e += -j * l[i] * (l[i - 1] + l[i + 1])

    e += -j * l[l.size - 1] * (l[l.size - 2] + l[0])
    return e


# length of our lattice
d = 50

# number of lattice configurations in the dataset
n = 10000

# defining our lattice
l = np.random.choice([-1, 1], size=(n, d))

# array to store the labels
label = np.zeros(n)

# creating the labels
for i in range(n):
    label[i] = energy(l[i])

# partition into training and testing sets
m = int(7 * n / 10)

train = l[:m]
test = l[m:]

train_label = label[:m]
test_label = label[m:]

# data generated!!

# setting up linear regression
leastsq = linear_model.LinearRegression()

# features for training: computing the outer products S_i * S_j
train_states = np.einsum("...i,...j->...ij", train, train)
shape = train_states.shape

# each row is reshaped from a d x d (50x50) matrix of S_i*S_j
train_states = train_states.reshape((shape[0], shape[1] * shape[2]))

# features for testing: computing the outer products S_i * S_j
test_states = np.einsum("...i,...j->...ij", test, test)
shape = test_states.shape

# each row is reshaped from a d x d (50x50) matrix of S_i*S_j
test_states = test_states.reshape((shape[0], shape[1] * shape[2]))

leastsq.fit(train_states, train_label)
J = leastsq.coef_.reshape((shape[1], shape[2]))

mark1 = leastsq.score(train_states, train_label) * 100
mark2 = leastsq.score(test_states, test_label) * 100

print("n: %s" % n, "train-test: ", m, "-", n - m)
print("Training score: %s /100" % mark1)
print("Testing score: %s /100" % mark2, "\n")

plt.figure()
plt.title("n=%s" % n)
c = plt.imshow(
    J, vmin=-1, vmax=1, cmap="Spectral_r", interpolation="nearest", origin="lower"
)
plt.colorbar(c)
n: 10000 train-test:  7000 - 3000
Training score: 100.0/100
Testing score: 100.0/100 (Too good!)

Clearly something is wrong here. Even though the predicted model is nowhere close to the actual \(J\), the test score reports 100% accuracy! The perfect training score can be explained by our model overfitting the data. But why a perfect test score?

Taking a few steps back...

Let us train the model for smaller values of \(n\), say \(n = 500, 1700, 2500\).

n: 500 train-test:  350 - 150
Training score: 100.0 /100
Testing score: 38.64463490802842 /100

n: 1700 train-test:  1190 - 510
Training score: 100.0 /100
Testing score: 96.71980203823644 /100

n: 2500 train-test:  1750 - 750
Training score: 100.0 /100
Testing score: 100.0 /100

One can notice that for greater values of \(n\), the accuracy score on the testing data increases. We proceed to plot this variation of accuracy with increasing \(n\).

Figure: variation of training and testing score.

Looking at the above plot, after a certain number of datasets \(n\), the training and testing scores become almost identical. Consider the following cases:

Implementing regularization to avoid an overfitting model

Lasso regression approach

Model parameters: \(w_{ij} = J_{ij}\), \(x_{ij} = \sigma_i \sigma_j\)

Prediction: \(E_p^{(n)} = J_{ij} x_{ij}^{(n)}\)

Minimization function:

\[ \theta = \lambda |J| + \frac{1}{N} \sum_{n=1}^{N} (E_p^{(n)} - E_{label}^{(n)})^2 \]

The goal is to minimize the above function. A regularization parameter \(\lambda\) is introduced, which, when chosen properly, gives a balanced fit.

One can make the following observations from the above GIF (jif?):

λ: 0.1
train-test: 350 - 150
Training score: 99.71/100
Testing score: 99.53/100

λ: 0.01
train-test: 1050 - 450
Training score: 99.99/100
Testing score: 99.99/100

λ: 0.1
train-test: 3500 - 1500
Training score: 99.76/100
Testing score: 99.75/100

So, our Lasso model accurately predicts the \(J\) matrix at these sweet spots. Unlike linear regression, it does not have any issue of overfitting for large values of \(n\).
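A minimal sketch of how such a Lasso fit might be set up with scikit-learn, reusing the train_states, test_states, and label arrays built in the linear-regression code above (the value of lam is illustrative, and scikit-learn's alpha plays the role of \(\lambda\) up to its own normalization conventions):

from sklearn import linear_model

# lam plays the role of the regularization parameter lambda above
lam = 0.01
lasso = linear_model.Lasso(alpha=lam)

# same outer-product features and energy labels as before
lasso.fit(train_states, train_label)
J_lasso = lasso.coef_.reshape((d, d))

print("Training score: %.2f /100" % (lasso.score(train_states, train_label) * 100))
print("Testing score: %.2f /100" % (lasso.score(test_states, test_label) * 100))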

Ridge regression approach

Model parameters: \(w_{ij} = J_{ij}\), \(x_{ij} = \sigma_i \sigma_j\)

Prediction: \(E_p^{(n)} = J_{ij} x_{ij}^{(n)}\)

Minimization function:

\[ \theta = \lambda |J|^2 + \frac{1}{N} \sum_{n=1}^{N} (E_p^{(n)} - E_{label}^{(n)})^2 \]

The goal is to minimize the above function. A regularization parameter \(\lambda\) is introduced, which, when chosen properly, gives a balanced fit, serving the same purpose as in Lasso regression.
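A minimal Ridge sketch on the same outer-product features looks almost identical (again, lam is an illustrative value mapped onto scikit-learn's alpha):

from sklearn import linear_model

lam = 0.001
ridge = linear_model.Ridge(alpha=lam)

ridge.fit(train_states, train_label)
J_ridge = ridge.coef_.reshape((d, d))

print("Training score: %.2f /100" % (ridge.score(train_states, train_label) * 100))
print("Testing score: %.2f /100" % (ridge.score(test_states, test_label) * 100))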

A few points to note:

Figure: λ = 0.001 for n = 500
λ: 0.001
train-test: 350 - 150
Training score: 99.99/100
Testing score: 17.14/100

Figure: λ = 0.001 for n = 1500
λ: 0.001
train-test: 1050 - 450
Training score: 99.99/100
Testing score: 85.24/100

Figure: λ = 0.001 for n = 5000
λ: 0.001
train-test: 3500 - 1500
Training score: 99.99/100
Testing score: 99.99/100

Remark on Lasso and Ridge regression

Why do we end up with different results for Lasso and Ridge regression despite them being very similar?

\[ J_{\text{ridge}} = \frac{J_{\text{linear}}}{1+\lambda} \]

Hence, Ridge always scales \(J\) down by the regularization parameter \(\lambda\): the larger its value, the flatter our "slope" becomes. But even for large \(\lambda\), \(J\) only tends to 0 asymptotically (close but never equal to 0). Lasso's \(|J|\) penalty, in contrast, can set the irrelevant couplings exactly to zero, which is why it recovers the sparse \(J\) matrix more cleanly.
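In this one-parameter picture (a single coupling \(J\) with features \(x = \pm 1\)), the two minimization functions above have simple closed-form minimizers: Ridge gives the scaling just shown, while Lasso soft-thresholds, \(J_{\text{lasso}} = \mathrm{sign}(J_{\text{linear}}) \max(|J_{\text{linear}}| - \lambda/2,\ 0)\). A tiny numerical illustration (the values are arbitrary):

import numpy as np

J_linear = 1.0  # least-squares estimate of a single coupling
for lam in (0.5, 1.0, 2.0, 5.0):
    J_ridge = J_linear / (1 + lam)                                   # shrinks, but never exactly 0
    J_lasso = np.sign(J_linear) * max(abs(J_linear) - lam / 2, 0.0)  # exactly 0 once lam >= 2
    print("lambda = %.1f: ridge = %.3f, lasso = %.3f" % (lam, J_ridge, J_lasso))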

Beyond nearest neighbours

We considered instances of lattices where only interactions between the first nearest neighbours mattered. What happens if we involve the second nearest neighbour interactions as well? Does our ML model predict their physical nature as well?

Ridge regression for next nearest neighbour interaction picture

We define our coupling constant \(J_{ij}\) in this case as

\[ J_{ij} = \begin{cases} -1 & \text{if } j = i \pm 1 \\ 0.5 & \text{if } j = i \pm 2 \\ 0 & \text{otherwise.} \end{cases} \]
Figures: actual model vs. predicted model.
n: 5000
train-test: 2500 - 1500
Training score: 99.99/100
Testing score: 99.99/100

Considering an alternate model:

\[ J_{ij} = \begin{cases} -1 & \text{if } j = i \pm 1 \\ -0.5 & \text{if } j = i \pm 2 \\ 0 & \text{otherwise.} \end{cases} \]
Figures: actual model vs. predicted model.
n: 5000
train-test: 3500 - 1500
Training score: 99.99/100
Testing score: 99.99/100

Thus, Ridge regression not only detects the existence of a second-neighbour interaction, but also gives us the accurate coupling constants between the first and second nearest neighbours.
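A minimal sketch (not the project's actual code) of how such next-nearest-neighbour data might be generated and fitted, assuming the energy \(E = \sum_{ij} J_{ij}\sigma_i\sigma_j\) used by the regression model, periodic boundaries, and an illustrative coupling matrix J_true:

import numpy as np
from sklearn import linear_model

d, n = 50, 5000

# assumed coupling matrix: -1 on first neighbours, 0.5 on second neighbours
J_true = np.zeros((d, d))
for i in range(d):
    J_true[i, (i + 1) % d] = J_true[i, (i - 1) % d] = -1.0
    J_true[i, (i + 2) % d] = J_true[i, (i - 2) % d] = 0.5

# random configurations and their energies E_n = sum_ij J_ij * s_i * s_j
lattices = np.random.choice([-1, 1], size=(n, d))
energies = np.einsum("ni,ij,nj->n", lattices, J_true, lattices)

# same outer-product features as before, followed by a Ridge fit
features = np.einsum("...i,...j->...ij", lattices, lattices).reshape(n, d * d)
m = int(0.7 * n)
ridge = linear_model.Ridge(alpha=0.001)
ridge.fit(features[:m], energies[:m])
J_pred = ridge.coef_.reshape(d, d)  # should closely resemble J_true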


Data generation for two dimensions

For this we require some knowledge of statistical mechanics. Consider a system in thermal equilibrium with its surroundings. The probability of it being in a configuration \(\mu\) with corresponding energy \(E_{\mu}\) is given by

\[ p_{\mu} = \frac{1}{Z} e^{-\beta E_{\mu}} \]

where \(Z = \sum_{\mu} e^{-\beta E_{\mu}}\) is the partition function (normalizing factor). At equilibrium, the following must be true

\[ \sum_{\nu} p_{\mu} P(\mu \rightarrow \nu) = \sum_{\nu} p_{\nu} P(\nu \rightarrow \mu) \]

where \(P(\mu \rightarrow \nu)\) is the probability of transitioning from state \(\mu\) to state \(\nu\). However, from a practical point of view, enforcing this condition numerically is difficult. Hence we impose the stronger detailed balance condition

\[ p_{\mu} P(\mu \rightarrow \nu) = p_{\nu} P(\nu \rightarrow \mu) \]

Recall our energy Hamiltonian:

\[ E(\mu) = - \sum_{\langle i, j \rangle} J \sigma_i \sigma_j \]

where the sum is over the first nearest neighbours and \(\sigma_i\) is the spin of the i\(^{th}\) particle (either +1 or -1). \(\mu\) denotes a particular configuration of the spin lattice system. Hence, the detailed balance condition can be written as

\[ \frac{P(\mu \rightarrow \nu)}{P(\nu \rightarrow \mu)} = \frac{p_{\nu}}{p_{\mu}} = e^{-\beta (E_\nu - E_\mu)} \]
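One standard way of satisfying this ratio, and the one used in the algorithm below, is to accept a proposed move with probability

\[ P(\mu \rightarrow \nu) = \min\left(1,\ e^{-\beta (E_\nu - E_\mu)}\right) \]

so that moves which lower the energy are always accepted, while moves which raise it are accepted with the corresponding Boltzmann factor.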

Metropolis algorithm

The basic idea behind this algorithm is pretty simple. The goal is to find an equilibrium state for the two-dimensional spin lattice system at a particular temperature dictated by \(\beta\) where

\[ \beta = \frac{1}{k_B T} \]

where \(k_B\) is the Boltzmann constant and \(T\) is the temperature of the lattice.

The algorithm is as follows:

  1. We start with a random state \(\mu\) for the lattice configuration.

  2. Pick any random lattice site and flip the sign of its spin. We call this new state \(\nu\). The goal is to find the probability \(P(\mu \rightarrow \nu)\) with which we accept this new state.

  3. Evaluate the total energy of these states and call them \(E_\mu\) and \(E_\nu\) respectively.

    • If \(E_\nu > E_\mu\) then we set \(P(\nu \rightarrow \mu) = 1\) and thus by using the detailed balance equation we get \(P(\mu \rightarrow \nu) = e^{-\beta (E_\nu - E_\mu)}\).

    • Similarly, if \(E_\mu > E_\nu\), then set \(P(\mu \rightarrow \nu) = 1\). Correspondingly, we see \(P(\nu \rightarrow \mu) = e^{-\beta (E_\mu - E_\nu)}\).

  4. Change to state \(\nu\) (i.e. flip the spin of the particle) with the probabilities as shown above.

  5. Iterate by returning to step 2 with the current state. We repeat this procedure as many times as required; eventually we end up with an equilibrium state.

Hence, we see that we only need to evaluate the quantity (for \(E_\nu > E_\mu\))

\[ -\beta (E_\nu - E_\mu) = -2\beta J \, \sigma_i \sum_{k=1}^{4} \sigma_k \]

where \(\sigma_i\) is the spin of the \(i^{th}\) particle before flipping and \(\sigma_k\) are the spins of its four neighbouring particles (imagine a square lattice in 2D). However, if the flipped particle is at the boundary then there may be fewer than 4 neighbours, since in the two-dimensional case we do not impose periodic boundary conditions.
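A minimal sketch of this update rule for a two-dimensional square lattice with open (non-periodic) boundaries is given below; the function name, lattice size, and the choice \(J = 1\) are illustrative assumptions rather than the project's final code.

import numpy as np

def metropolis_sweep(lattice, beta, J=1.0, rng=None):
    # attempt L*L single-spin flips on a 2D lattice with open (non-periodic) boundaries
    rng = np.random.default_rng() if rng is None else rng
    L = lattice.shape[0]
    for _ in range(L * L):
        # step 2: pick a random site whose spin we consider flipping
        i, j = rng.integers(0, L, size=2)

        # sum the spins of the (up to four) existing neighbours
        s = 0
        if i > 0:
            s += lattice[i - 1, j]
        if i < L - 1:
            s += lattice[i + 1, j]
        if j > 0:
            s += lattice[i, j - 1]
        if j < L - 1:
            s += lattice[i, j + 1]

        # energy change on flipping spin (i, j): E_nu - E_mu = 2 * J * sigma_ij * (sum of neighbours)
        dE = 2.0 * J * lattice[i, j] * s

        # steps 3-4: accept the flip with probability min(1, exp(-beta * dE))
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            lattice[i, j] *= -1
    return lattice

# usage: relax a random 20x20 lattice by repeated sweeps (step 5)
rng = np.random.default_rng(0)
lat = rng.choice([-1, 1], size=(20, 20))
for _ in range(200):
    metropolis_sweep(lat, beta=0.6, rng=rng)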

This is where our midway progress ends. The next milestone is training models for the two-dimensional Ising model to predict the critical temperature at which the transition from the ferromagnetic to the paramagnetic phase occurs.

Bibliography

  1. Mehta, Pankaj, et al. "A high-bias, low-variance introduction to machine learning for physicists." Physics Reports 810 (2019): 1-124.

  2. Beichl, Isabel, and Francis Sullivan. "The Metropolis algorithm." Computing in Science & Engineering 2.1 (2000): 65-69.

  3. Hastings, W. K. "Monte Carlo sampling methods using Markov chains and their applications." Biometrika 57.1 (1970): 97-109.

  4. Deisenroth, Marc Peter, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020.

[1] The original slides were written in a Mathematica notebook, and the conversion to PDF does not preserve the original formatting.