To generate the successive few frames in a video using
Graph Neural Networks.
Course Project, CS460, Fall 2021
Group 6: Shivam Raj (1711127), Shrishti Barethiya (1711128)
https://github.com/raj2022/21cs460_group06
GOAL OR MOTIVATION:
To predict the next few frames of a video using a Graph Neural Network for spatio-temporal graphs.
To determine the positions of the nodes a few frames later.
PROJECT PROPOSAL
Graph neural networks are a highly scalable class of models that can learn on graph-structured data. A Graph Neural Network is a neural network that can be applied directly to graphs. It provides a convenient way to perform node-level, edge-level, and graph-level prediction tasks. Many fundamental relationships among data in several areas of science and engineering can be represented as graphs.
There are mainly three types of graph neural networks:
Recurrent Graph Neural Network
Spatial Convolutional Network
Spectral Convolutional Network
A graph G is an ordered pair of a set V of vertices and a set E of edges, written G = (V, E), where V is the set of nodes or vertices and E is the set of edges between them, as shown in Fig. 1.
Figure 1: General Graph representation of a problem
GNNs have many applications, including natural language processing, computer vision, human behavior detection, traffic control, molecular structure study, recommender systems, program verification, logical reasoning, social influence prediction, and adversarial attack prevention.
In Fig. 1, let the edge connecting two vertices A ($v_i$) and B ($v_j$) be represented as $e_{ij} = (v_i, v_j)$; n is the number of nodes in the graph and m is the number of edges. If A is the adjacency matrix of the graph, then the dimension of A is $n \times n$, with $A_{ij} = 1$ if $e_{ij} \in E$. The neighborhood N(v) of a node v is defined as $N(v) = \{u \in V \mid (u, v) \in E\}$.
Graph data exists in many fields. For example, in medicine and pharmacy, atoms or molecules can be taken as nodes (vertices) and their bond distances as edges; in social media networks, people can be taken as nodes and their age or gender as attributes.
0.1 What can we do with graph data in machine learning?
We can perform node-level prediction: given a graph with unlabeled nodes, we want to predict attributes of these nodes or classify them, making use of information from the other nodes. We can also perform link prediction, or edge-level prediction, to determine connections between nodes.
0.2 Problems with graph data?
The size and shape of graphs may change within a dataset, so we need a method that handles arbitrary input shapes. Moreover, two graphs that look different can be structurally identical; this issue is addressed by permutation invariance.
GNN tasks can be broadly divided into node classification, link prediction, and graph classification. In node classification, the objective is to determine the labels of samples by looking at the labels of their neighbors. In link prediction, the task is to understand the relationship between entities in a graph and predict whether two entities are connected. In graph classification, the task is to classify the whole graph into different categories; it is similar to image classification, but the target changes to the graph domain. Graph visualization and graph clustering are further examples of GNN applications.
0.3 Image Segmentation
Segmentation subdivides an image into constituent regions or objects: it is the process of partitioning a digital image into multiple regions and extracting the meaningful region, known as the region of interest (ROI).
Image segmentation is based on:
1 Similarity principle: the objective is to group pixels based on a common property in order to extract a coherent region.
2 Discontinuity principle: the objective is to extract regions that differ in properties such as intensity, color, or texture.
Image segmentation can represent the image as a graph. How? Every pixel in an image is a vertex, and edges connect pairs of pixels that are close to each other (not all pairs). So we have a graph G = (V, E), where V and E are the sets of vertices and edges, respectively. Each edge is weighted by the affinity or similarity between its two vertices.
How do we measure affinity? Let i and j be two pixels with feature vectors $f_i$ and $f_j$, connected by an edge.

Pixel dissimilarity:

$$S(f_i, f_j) = \sqrt{\sum_k (f_{ik} - f_{jk})^2}$$

Pixel affinity:

$$w(i, j) = A(f_i, f_j) = e^{-\frac{1}{2\sigma^2} S(f_i, f_j)}$$

where $A(f_i, f_j)$ is the affinity. The smaller the dissimilarity $S(f_i, f_j)$, the larger the affinity; this relationship is controlled by the parameter σ. The affinity is the weight w(i, j) associated with the edge.
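For illustration, a minimal NumPy sketch of this affinity construction, assuming grayscale intensities as the pixel features and a 4-connected neighborhood; the helper name affinity_matrix and the σ value are our illustrative choices.

    import numpy as np

    def affinity_matrix(image, sigma=0.1):
        """Affinity graph over 4-connected pixels:
        w(i, j) = exp(-S(f_i, f_j) / (2 * sigma**2))."""
        h, w = image.shape
        W = np.zeros((h * w, h * w))
        for r in range(h):
            for c in range(w):
                i = r * w + c
                for dr, dc in ((0, 1), (1, 0)):    # right and down neighbours
                    rr, cc = r + dr, c + dc
                    if rr < h and cc < w:
                        j = rr * w + cc
                        s = abs(float(image[r, c]) - float(image[rr, cc]))  # S(f_i, f_j)
                        W[i, j] = W[j, i] = np.exp(-s / (2 * sigma ** 2))
        return W

    W = affinity_matrix(np.random.default_rng(0).random((8, 8)))  # 64x64 affinity matrix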
Graph cut: a cut C = ($V_A$, $V_B$) is a partition of the vertices V of a graph G = (V, E) into two disjoint subsets $V_A$ and $V_B$. Cut-set: the set of edges whose vertices lie in different subsets of the partition. Cost of the cut: the sum of the weights of the cut-set edges,

$$cut(V_A, V_B) = \sum_{u \in V_A,\, v \in V_B} w(u, v)$$
Algorithm: use the following criteria. A pair of vertices (pixels) within a subgraph should have high affinity, and a pair of vertices from two different subgraphs should have low affinity. That is, minimize the cost of the cut, also called Min-Cut. But there is a problem with the Min-Cut algorithm: it is biased toward cutting small, isolated segments. For instance, suppose three weak-affinity edges form the desired cut; if the cost of cutting these three edges is higher than the cost of cutting a single edge to an isolated vertex, the isolated vertex will be cut off instead of the three weak edges.
There is a simple solution: normalize the cut so that larger subgraphs are favored. That requires a measure of the size of a subgraph; one way is to measure how strongly the vertices of the subgraph $V_A$ are associated with the full graph V,

$$assoc(V_A, V) = \sum_{u \in V_A,\, v \in V} w(u, v)$$
Here, assoc() is the sum of the weights of all edges connecting $V_A$ to the whole graph. Therefore, take the cost of the cut and normalize it with respect to the association: minimize the cost of the normalized cut (NCut) during partition,

$$NCut(V_A, V_B) = \frac{cut(V_A, V_B)}{assoc(V_A, V)} + \frac{cut(V_A, V_B)}{assoc(V_B, V)}$$
Minimizing NCut exactly is hard, so a fast eigenvector-based approximation, known as spectral methods, is used. By applying this algorithm, an image represented as a graph can be segmented.
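As an illustration, scikit-learn's spectral_clustering performs exactly this kind of eigenvector-based (normalized-cut style) partition of a pixel affinity graph. The sketch below uses a toy synthetic image; the affinity conversion and all parameter values are illustrative choices, not settings from this project.

    import numpy as np
    from sklearn.cluster import spectral_clustering
    from sklearn.feature_extraction.image import img_to_graph

    # Toy image: a bright square on a dark, slightly noisy background.
    rng = np.random.default_rng(0)
    image = np.zeros((30, 30))
    image[8:22, 8:22] = 1.0
    image += 0.05 * rng.standard_normal(image.shape)

    # Pixel graph with gradient-based edge weights; convert gradients to
    # affinities in the spirit of w = exp(-S / (2 * sigma^2)).
    graph = img_to_graph(image)
    sigma = graph.data.std()
    graph.data = np.exp(-graph.data / (2 * sigma ** 2))

    # Eigenvector-based (normalized-cut style) bipartition of the pixel graph.
    labels = spectral_clustering(graph, n_clusters=2, eigen_solver="arpack",
                                 random_state=0)
    segmentation = labels.reshape(image.shape)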
0.4 How will we proceed?
Here we will use a Graph Neural Network (GNN) to determine the next few video frames. The multiple frames add a temporal aspect to the graph; hence, the graph generated from a number of frames is a spatio-temporal graph. In general we will try to use STGCN. An image can be represented as a graph by dividing it into regions and interconnecting neighboring regions. A video consists of multiple frames, and using the above methods each frame can be converted into a graph. For video prediction we will mostly focus on the spectral convolutional network, which is built on graph signal processing theory and on simplification and approximation of graph convolution. By Chebyshev polynomial approximation (Hammond et al. 2011), the graph convolution can be simplified to the form below:
$$g_\theta \star x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})\, x$$
After further simplification, the GCN paper [6] suggests a 2-layer neural network structure, which can be described in one equation as:
$$Z = f(X, A) = \mathrm{softmax}\!\left(\hat{A}\,\mathrm{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
where $\hat{A}$ is the normalized adjacency matrix (with self-loops added). Graph auto-encoders aim at reconstructing the original adjacency matrix.
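A minimal NumPy sketch of this two-layer forward pass. It assumes the renormalization $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ from the GCN paper; the toy graph and random weights are purely illustrative.

    import numpy as np

    def normalize_adjacency(A):
        """Renormalization trick: A_hat = D^{-1/2} (A + I) D^{-1/2}."""
        A_tilde = A + np.eye(A.shape[0])
        d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
        return d_inv_sqrt @ A_tilde @ d_inv_sqrt

    def gcn_forward(A, X, W0, W1):
        """Two-layer GCN: Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
        A_hat = normalize_adjacency(A)
        H = np.maximum(A_hat @ X @ W0, 0.0)              # ReLU
        logits = A_hat @ H @ W1
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)          # row-wise softmax

    # Toy example: 4-node path graph, 3 input features, 2 output classes.
    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    X = rng.standard_normal((4, 3))
    Z = gcn_forward(A, X, rng.standard_normal((3, 8)), rng.standard_normal((8, 2)))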
Since a large amount of video data is captured, video summarization is used to condense the original video into a summary while preserving the main content. Many video summarization approaches have been proposed, but a fully convolutional structure along the time axis has gained significant attention. We can compare video prediction with the traffic speed forecasting problem: forecasting the speed, volume, or density of roads in traffic networks is fundamentally important in a smart transportation system. The traffic prediction problem can be addressed using STGNNs or STCNs, considering the traffic network as a spatio-temporal graph where the nodes are sensors installed on roads, the edges are weighted by the distance between pairs of nodes, and each node has the average traffic speed within a window as dynamic input features.
0.5 Mid-way Goals.
By midway we will try to understand how STGCN works and how it can be implemented for video prediction. By the end we will try to obtain results using STGCN.
MIDWAY
Zhanxing Zhu and co-workers propose a novel deep learning framework, the Spatio-Temporal Graph Convolutional Network (STGCN), to tackle time-series prediction problems such as those in the traffic domain. Instead of applying regular convolutional and recurrent units, they formulate the problem on graphs and build the model with complete convolutional structures. Purely convolutional structures are applied to extract spatio-temporal features simultaneously from graph-structured data, in video prediction as was done for time series in the traffic study. The model, built with complete convolutional structures, enables much faster training with fewer parameters. [1]
To take full advantage of spatial features, we use a convolutional neural network (CNN) to capture adjacency relations among the frames of a video, along with a recurrent neural network (RNN) applied along the time axis. To handle the inherent deficiencies of recurrent networks, we employ a fully convolutional structure on the time axis.
In today's world there are cameras on every street, so an immense amount of video data is available online, and automated tools are needed to handle it. Here, video summarization becomes very helpful: it condenses the original video into a short summary while preserving the important information. Many summary formats are available. In paper [2], Xuelong Li and co-workers use video skimming, where the summary is formed from several key shots. Many approaches have been proposed for video summarization, but Recurrent Neural Networks (RNNs) have provided significant advantages. These approaches take video data as a sequence of frames and summarize the video using temporal dependencies. An RNN is able to capture local dependencies but fails to capture global dependencies, as it is easily distracted by noise. Video data is layered as frames and shots; shots are intermediate states between the video and the frame, formed from several consecutive frames. The frames in a shot are easily modeled as a temporal sequence using an RNN, but information between adjacent shots can vary widely, and sequence models that take only neighborhood dependencies into account can suffer interference. A sequence is itself a special graph in which only consecutive items are connected. To capture both local and global dependencies, it is better to model the video shots as a complete graph: all shots are connected as graph nodes, and the dependencies are calculated from the interaction between two shots. Paper [2] is very helpful for understanding the basics of frames and shots and introduces useful video techniques; combined with other algorithms, these techniques can be useful in video prediction.
For our project we have to do video prediction. Video frame prediction is the task of predicting subsequent
frames, given a sequence of video frames. Formally, it can be defined as follows. Let $X_t \in \mathbb{R}^{w \times h \times c}$ be the t-th frame in the input video sequence $X = (X_{t-n}, \ldots, X_{t-1}, X_t)$ consisting of n frames, where w, h, and c denote the width, height, and number of channels respectively. The target is to predict the subsequent m frames $Y = (Y_{t+1}, Y_{t+2}, \ldots, Y_{t+m})$ with clarity and high precision, using the input X. [4]
0.6 Problem Formulation
Video frame prediction is a typical time-series prediction problem, i.e., predicting the most likely video frames in the next H time steps given the previous M observations (previous video frames):
$$\hat{v}_{t+1}, \ldots, \hat{v}_{t+H} = \underset{v_{t+1}, \ldots, v_{t+H}}{\arg\max}\ \log P\!\left(v_{t+1}, \ldots, v_{t+H} \mid v_{t-M+1}, \ldots, v_t\right) \qquad (1)$$

where $v_t \in \mathbb{R}^n$ is an observation vector of n image frames at time step t.
In this project, we define the network on a graph and focus on structured image frames of the videos. The observations $v_t$ are not independent but connected by pairwise connections in the graph. Therefore, the data point $v_t$ can be regarded as a graph signal defined on an undirected (or directed) graph G with weights $w_{ij}$, as shown in Fig. 2.
Figure 2: Graph-structured traffic data: each $v_t$ indicates a frame of the current traffic status at time step t, recorded in a graph-structured data matrix.
At the t-th time step, in the graph $G_t = (V_t, \mathcal{E}, W)$, $V_t$ is a finite set of vertices corresponding to the observations from n monitor stations in a traffic network; $\mathcal{E}$ is a set of edges indicating the connectedness between stations; and $W \in \mathbb{R}^{n \times n}$ denotes the weighted adjacency matrix of $G_t$.
0.7 Convolutions on Graph
There are two basic approaches currently exploring how to generalize CNNs to structured data forms. One is to expand the spatial definition of a convolution [Niepert et al., 2016], and the other is to manipulate in the spectral domain with graph Fourier transforms [Bruna et al., 2013]. The former approach rearranges the vertices into certain grid forms which can be processed by normal convolutional operations. The latter introduces the spectral framework to apply convolutions in spectral domains, often named the spectral graph convolution. Several follow-up studies make the graph convolution more promising by reducing the computational complexity from O(n²) to linear. Here we use the notion of the graph convolution operator $*_{\mathcal{G}}$, based on the conception of spectral graph convolution, as the multiplication of a signal $x \in \mathbb{R}^n$ with a kernel Θ:
$$\Theta *_{\mathcal{G}} x = \Theta(L)x = \Theta(U \Lambda U^T)x = U\,\Theta(\Lambda)\,U^T x$$
where the graph Fourier basis $U \in \mathbb{R}^{n \times n}$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_n - D^{-1/2} W D^{-1/2} = U \Lambda U^T \in \mathbb{R}^{n \times n}$ ($I_n$ is the identity matrix, D is the diagonal degree matrix, and Λ is the diagonal matrix of eigenvalues of L); $\Theta(\Lambda)$ is also a diagonal matrix. By this definition, a graph signal x is filtered by a kernel Θ through multiplication between Θ and the graph Fourier transform $U^T x$.
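This definition can be written out directly in NumPy. The sketch below assumes a dense adjacency matrix and an illustrative low-pass choice of Θ; both the function name and the toy graph are our own.

    import numpy as np

    def spectral_graph_conv(W, x, theta):
        """Spectral filtering: Theta *_G x = U Theta(Lambda) U^T x,
        where L = I_n - D^{-1/2} W D^{-1/2} = U Lambda U^T."""
        n = W.shape[0]
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized graph Laplacian
        lam, U = np.linalg.eigh(L)                    # eigendecomposition of L
        return U @ np.diag(theta(lam)) @ U.T @ x      # filter in the spectral domain

    # Example with an illustrative low-pass kernel Theta(lambda) = exp(-2*lambda).
    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    x = np.array([1.0, -1.0, 2.0, 0.5])
    y = spectral_graph_conv(W, x, theta=lambda lam: np.exp(-2.0 * lam))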
0.8 Spatio-Temporal Graph Neural Networks
The spatio-temporal systems have dependency on both time and space. In this section, we elaborate on
the proposed architecture of spatio-temporal graph convolutional networks (STGCN). As shown in Figure 3,
STGCN is composed of several spatiotemporal convolutional blocks, each of which is formed as a sandwich
structure with two gated sequential convolution layers and one spatial graph convolution layer in between.
We now summarize the main characteristics of our model STGCN in the following points:
STGCN is a universal framework for processing structured time series. It is not only able to tackle image modeling and prediction issues but can also be applied to more general spatio-temporal sequence learning tasks.
The spatio-temporal block combines graph convolutions and gated temporal convolutions, which can
extract the most useful spatial features and capture the most essential temporal features coherently.
Figure 3: Architecture of the spatio-temporal graph convolutional network. The STGCN framework consists of two spatio-temporal convolutional blocks (ST-Conv blocks) and a fully-connected output layer at the end. Each ST-Conv block contains two temporal gated convolution layers and one spatial graph convolution layer in the middle. A residual connection and a bottleneck strategy are applied inside each block. The input $v_{t-M+1}, \ldots, v_t$ is uniformly processed by the ST-Conv blocks to explore spatial and temporal dependencies coherently. Comprehensive features are integrated by the output layer to generate the final prediction $\hat{v}$.
The model is entirely composed of convolutional structures and therefore achieves parallelization over
input with fewer parameters and faster training speed. More importantly, this economic architecture
allows the model to handle large-scale networks with more efficiency.
The key building block of this architecture is the spatio-temporal convolutional block.
0.9 Spatio-temporal Convolutional Block
To use features from both the spatial and temporal domains, the spatio-temporal convolutional block (ST-Conv block) is constructed to jointly process graph-structured time series. The block itself can be stacked or extended based on the scale and complexity of particular cases.
As in Fig. 3, the spatial layer in the middle bridges two temporal layers, which achieves fast spatial-state propagation from graph convolution through temporal convolutions. The sandwich structure also helps the network apply the bottleneck strategy, achieving scale compression and feature squeezing by down-scaling and up-scaling the channels C through the graph convolutional layer. Moreover, layer normalization is used within every ST-Conv block to prevent over-fitting. The input and output of ST-Conv blocks are all 3-D tensors.
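A simplified PyTorch sketch of one ST-Conv block is shown below. The GLU-gated temporal convolutions and the first-order graph convolution follow the sandwich description above, but the class names, channel sizes, dense adjacency, and omission of the residual/bottleneck details are our simplifying assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class TemporalGatedConv(nn.Module):
        """1-D convolution along the time axis with a GLU gate (simplified)."""
        def __init__(self, c_in, c_out, kernel_t=3):
            super().__init__()
            self.conv = nn.Conv2d(c_in, 2 * c_out, kernel_size=(kernel_t, 1))

        def forward(self, x):                      # x: (batch, c_in, time, nodes)
            p, q = self.conv(x).chunk(2, dim=1)
            return p * torch.sigmoid(q)            # GLU gating

    class STConvBlock(nn.Module):
        """Sandwich structure: temporal conv -> graph conv -> temporal conv."""
        def __init__(self, c_in, c_spatial, c_out, A_hat, kernel_t=3):
            super().__init__()
            self.t1 = TemporalGatedConv(c_in, c_spatial, kernel_t)
            self.theta = nn.Linear(c_spatial, c_spatial, bias=False)
            self.t2 = TemporalGatedConv(c_spatial, c_out, kernel_t)
            self.norm = nn.LayerNorm(c_out)
            self.register_buffer("A_hat", A_hat)   # normalized adjacency matrix

        def forward(self, x):                      # x: (batch, c_in, time, nodes)
            h = self.t1(x).permute(0, 2, 3, 1)     # -> (batch, time, nodes, channels)
            # First-order spatial graph convolution: mix node features via A_hat.
            h = torch.relu(torch.einsum("uv,btvc->btuc", self.A_hat, self.theta(h)))
            h = self.t2(h.permute(0, 3, 1, 2))     # back to (batch, ch, time, nodes)
            return self.norm(h.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    # Toy usage: 2 videos, 1 input channel, 12 frames, 5 graph nodes.
    block = STConvBlock(c_in=1, c_spatial=16, c_out=32, A_hat=torch.eye(5))
    y = block(torch.randn(2, 1, 12, 5))            # -> (2, 32, 8, 5)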
0.10 Future Frame Prediction of a Video Sequence
In this paper [4], the authors address the ambiguous nature of the problem: there may be multiple possible future sequences for the same input video, and a model trained to make a single prediction tends to average the possible futures. Two different approaches have attempted to address this problem:
Use of latent variable models that represent the underlying stochasticity.
Adversarially trained models that aim to produce sharper images.
The former approach fails to produce realistic results, and the latter fails to produce diverse predictions; the two have complementary strengths and weaknesses. Thus, Sukhendu Das and co-workers combined both approaches and proposed a multi-scale architecture. In this paper, the Moving MNIST, UCF101, and Sports-1M datasets are used. We will try to apply STGCN to these datasets.
Prediction of future events is necessary for any decision-making system. It is relatively easy for humans, who can make a staggering number of predictions about the future just by looking at a series of pictures. For instance, take the example of the short clip shown in Fig. 4.
Figure 4: A bowling ball moving towards a bowling pin. Our aim is to predict whether there will be a collision or
not. Prediction of such collisions is very important in the context of video frames.
Here, the aim is to predict whether the ball will collide with the bowling pin. Since the ball is at a distance from the pin, it will not collide; as humans, we can easily predict that the ball approaching the pin will not collide with it. There are many cases where it is necessary to predict the future, and several approaches have been proposed.
0.10.1 Proposed Architecture
The SAVP architecture, as proposed in "Stochastic Adversarial Video Prediction", combines a GAN and a VAE. GANs are trained with randomly drawn input, whereas VAEs only observe ground-truth images and are never trained on random input, which may lead to a train-test mismatch.
Modifications Proposed
Replace Convolutional LSTM with E3D-LSTM
A convolutional LSTM predicts future frames based on sequentially updated memory states: when the memory cell is refreshed, older memories are discarded. In contrast, the proposed E3D-LSTM model maintains a list of historical memory records and revokes them when necessary, thereby facilitating long-range video reasoning. The Eidetic 3D LSTM (E3D-LSTM) has a new state, the RECALL state, which, unlike a standard LSTM, can recall the long-term past; the RECALL mechanism helps retrieve temporally distant memories.
Replace GAN with MGGAN. Generative adversarial networks are susceptible to mode collapse. To alleviate it, a Manifold Guided Generative Adversarial Network (MGGAN) is used. MGGAN adds a guidance network to the existing GAN architecture; the guidance network consists of an encoder and a discriminator. To address mode collapse, the encoder is designed so that its output distribution is the best approximation of $p_{data}$, capturing all modes of the data. To this end, the encoder of a pre-trained autoencoder is adopted as the manifold mapping function: because the autoencoder aims to reconstruct all samples in the dataset, its encoder should not ignore minor modes. Consequently, the generator avoids missing modes during training, because it receives feedback on minor modes of the data distribution from the guidance network.
Modified SAVP Architecture. The modified SAVP architecture is shown in Figure 5; the modifications made are encircled. The architecture has two MGGANs and a VAE, with the decoder of the VAE also acting as a generator. MGGAN integrates a guidance network into the existing GAN architecture. The guidance network encourages the generator to learn all modes of $p_{data}$; it comprises an encoder for manifold mapping and a discriminator for measuring the dissimilarity between the distributions of $p_{data}$ and $p_{model}$ in the manifold space. To cope with mode collapse, the guidance network uses the encoder layer of a pre-trained autoencoder. This autoencoder is optimized for reconstructing all samples of real images and is kept fixed after pre-training, so it is not updated during the training of the GAN [4].
Figure 5: Modified SAVP Architecture
DATASETS
We have planned to use a few available datasets:
PeMSD7:
This traffic dataset is used for traffic prediction in California; the task is similar to video prediction. PeMSD7 was collected from the Caltrans Performance Measurement System (PeMS) in real time by over 39,000 sensor stations deployed across the major metropolitan areas of the California state highway system [Chen et al., 2001]. The dataset is aggregated into 5-minute intervals from 30-second data samples. A medium and a large scale were randomly selected from District 7 of California, containing 228 and 1,026 stations, labeled PeMSD7(M) and PeMSD7(L), respectively.
Moving MNIST dataset: This dataset randomly samples two digits from the original MNIST dataset, floating and bouncing off the boundaries at constant velocity and angle inside a 64 × 64 patch. New sequences can be generated whenever required, making the dataset an almost infinite source of video sequences. 8000 videos are used for training and 2000 for testing.
UCF101: This dataset contains 13,320 annotated action videos accumulated from YouTube, covering 101 different action categories. The videos have a resolution of 320 × 240 pixels and a frame rate of 25 fps. 10,000 videos are used for training and 3,320 for testing. The data can be downloaded from
https://www.crcv.ucf.edu/data/UCF101.php
Sports-1M Dataset: The Sports-1M dataset is licensed under Creative Commons 3.0 and contains 1,133,158
video URLs which have been annotated automatically with 487 Sports labels using the YouTube Topics
API.
All the datasets can be found here,
https://drive.google.com/drive/folders/1SRRc22E045PcgBY7N5GS1s-3fn5OJeny?usp=sharing
For these datasets, we need to find the weighted adjacency matrix, which can be computed using a thresholded Gaussian kernel:

$$a_{ij} = \begin{cases} \exp\!\left(-\dfrac{dist(v_i, v_j)^2}{\sigma^2}\right), & \text{if } dist(v_i, v_j) \le k, \\ 0, & \text{otherwise,} \end{cases}$$
where $a_{ij}$ represents the edge weight between $v_i$ and $v_j$, $dist(v_i, v_j)$ denotes the physical distance between vertices $v_i$ and $v_j$, σ is the standard deviation of the distances, and k is the threshold.
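A small NumPy sketch of this thresholded Gaussian kernel; the pairwise-distance matrix and the helper name weighted_adjacency are illustrative.

    import numpy as np

    def weighted_adjacency(dist, k):
        """Thresholded Gaussian kernel:
        a_ij = exp(-dist_ij^2 / sigma^2) if dist_ij <= k, else 0."""
        sigma = dist.std()
        A = np.exp(-(dist ** 2) / (sigma ** 2))
        A[dist > k] = 0.0
        np.fill_diagonal(A, 0.0)        # no self-loops
        return A

    # Toy pairwise-distance matrix for 4 stations (illustrative values).
    dist = np.array([[0, 1, 4, 9],
                     [1, 0, 2, 7],
                     [4, 2, 0, 3],
                     [9, 7, 3, 0]], dtype=float)
    A = weighted_adjacency(dist, k=5.0)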
FUTURE WORKS
In the next few days we will try to predict the next frame of the video using the proposed model, and also to find the positions of the nodes after some number of frames.
FINAL REPORT
The extract_patches_2d function from sklearn extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis, as shown in Figure 6. For rebuilding an image from all its patches, we have used reconstruct_from_patches_2d.
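A short usage sketch of these two scikit-learn functions; the 8×8 toy image and 4×4 patch size are illustrative choices.

    import numpy as np
    from sklearn.feature_extraction.image import (extract_patches_2d,
                                                  reconstruct_from_patches_2d)

    # Toy 8x8 grayscale image; the 4x4 patch size is an illustrative choice.
    image = np.arange(64, dtype=float).reshape(8, 8)

    patches = extract_patches_2d(image, (4, 4))
    print(patches.shape)             # (25, 4, 4): every 4x4 patch of an 8x8 image

    # Rebuild the image by averaging the overlapping patches.
    rebuilt = reconstruct_from_patches_2d(patches, (8, 8))
    assert np.allclose(rebuilt, image)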
0.11 Algorithm for STGCN on custom skeleton-based dataset
Suppose we have a small video set of three 10-second clips (taken from the UCF101 dataset) in a folder named dataset_example, with the structure:
folder name/dataset_example
skateboarding.mp4
clean_and_jerk.mp4
ta_chi.mp4
Now, convert them into skeleton data using the following annotation format:
{
    "info": {
        "video_name": "skateboarding.mp4",
        "resolution": [340, 256],
        "num_frame": 300,
        "num_keypoints": 17,
        "keypoint_channels": ["x", "y", "score"],
        "version": "1.0"
    },
    "annotations": [
        {
            "frame_index": 0,
            "id": 0,
            "person_id": null,
            "keypoints": [[x, y, score], [x, y, score], ...]
        },
        ...
    ],
    "category_id": 0
}
Further, train the STGCN model and also test the model.
0.12 STGCN on MMSkeleton
MMSkeleton is an open-source toolbox for skeleton-based human understanding. It is part of the open-mmlab project maintained by the Multimedia Laboratory, CUHK. We ran MMSkeleton on a video (GIF) of a skater skating over a staircase; the graph can be constructed from a given frame of the video, as shown in Fig. 6.
Figure 6: ST-GCN used for recognition of skeleton-based behavior
Figure 7: Action segmentation results using Stacked-STGCN on the CAD120 dataset. Green: correct detection
and red: erroneous detection.
0.13 Result from STGCN on Traffic Forecasting
For the PeMSD7 dataset, running the given STGCN code, we obtained:

Model          t = 15   t = 30   t = 45
STGCN(Cheb)     2.23     2.98     3.58
STGCN(1st)      2.28     3.06     3.79
0.14 SAVP video prediction
This method, which was proposed at midway, produced significant results on the datasets given in the paper [5]. A generative model is a powerful class of algorithms in which the model learns from any kind of data distribution using unsupervised learning. Two common and efficient approaches are the variational autoencoder (VAE) and the generative adversarial network (GAN).
0.15 Variational Autoencoder (VAE)
Autoencoders (AE) are neural networks that aim to copy their inputs to their outputs. An AE takes input data with very high dimensionality, runs it through a neural network, and tries to compress the data into a smaller representation. It works by compressing the input into a latent-space representation and then reconstructing the output from this representation. The network is composed of two parts:
1 Encoder: compresses the input into a latent representation.
2 Decoder: reconstructs the input from the latent-space representation.
First, a brief note on representation learning: it is defined as a set of techniques that allow a system to discover the representations needed for feature detection or classification from raw data. The representation must capture the features of the original data.
Figure 8: Architecture of Autoencoders.
The encoder takes an input sample and converts its information into a vector; the decoder takes that vector and expands it to reconstruct the input sample. Why do this if the output is just the reconstructed image? The vector constructed in the middle, as shown in Fig. 8, is important because it is a representation of the input image or audio in a form the computer understands. An AE learns to transform an input into a vector by minimizing the reconstruction loss: during training, it tries to minimize the difference between the original and reconstructed images. The simplest encoder is the vanilla autoencoder, a three-layer net, i.e. a neural network with one hidden layer. There is also the convolutional autoencoder, which uses convolutions instead of fully connected layers and works on 3-D image tensors instead of flat 1-D vectors; the image dimension is decreased via a latent representation of smaller dimension, forcing the autoencoder to learn a compressed version of the images. The VAE likewise learns a hidden representation of the input and can generate new data. A VAE generates images by minimizing the sum of a reconstruction loss and a latent loss.
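For illustration, a minimal PyTorch VAE showing the reparametrization trick and the reconstruction-plus-KL objective described above; the layer sizes are arbitrary choices and inputs are assumed to lie in [0, 1].

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        """Minimal fully-connected VAE; layer sizes are illustrative."""
        def __init__(self, x_dim=784, h_dim=256, z_dim=16):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.logvar = nn.Linear(h_dim, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim), nn.Sigmoid())

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparametrization
            return self.dec(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        """Reconstruction loss plus KL divergence to the unit-Gaussian prior."""
        recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

    # Usage: inputs must lie in [0, 1] for the BCE reconstruction term.
    model = VAE()
    x = torch.rand(8, 784)
    x_hat, mu, logvar = model(x)
    loss = vae_loss(x, x_hat, mu, logvar)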
0.16 Generative Adversarial Network (GAN)
A GAN is a generative model that frames the problem as a supervised learning problem. There are two sub-models:
1 Generator: generates candidate outputs.
2 Discriminator: classifies an output as real (an original image) or fake (generated by the generator).
The generator model: The generator model takes a fixed-length random vector as input and generates a sample in the domain. The vector is drawn randomly from a Gaussian distribution and is used to seed the generative process. After training, points in this multidimensional vector space correspond to points in the problem domain, forming a compressed representation of the data distribution.
The discriminator model: The discriminator is a normal classification model; it performs binary classification.
The two models, the generator and discriminator, are trained together. The generator generates a batch of samples, and these, along with real examples from the domain, are provided to the discriminator and classified as real or fake. The discriminator is then updated to get better at discriminating real and fake samples in the next round, and, importantly, the generator is updated based on how well, or not, the generated samples fooled the discriminator.
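A condensed PyTorch sketch of one such alternating update, assuming a generator G that maps noise to samples and a discriminator D that outputs probabilities of shape (batch, 1); the tiny fully-connected models and all dimensions are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def gan_step(G, D, opt_g, opt_d, real, z_dim=100):
        """One alternating GAN update: D learns real-vs-fake, G learns to fool D."""
        b = real.size(0)
        fake = G(torch.randn(b, z_dim))        # generator seeds from Gaussian noise

        # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
        opt_d.zero_grad()
        d_loss = (F.binary_cross_entropy(D(real), torch.ones(b, 1)) +
                  F.binary_cross_entropy(D(fake.detach()), torch.zeros(b, 1)))
        d_loss.backward()
        opt_d.step()

        # Generator update: push D(fake) toward 1 (fool the discriminator).
        opt_g.zero_grad()
        g_loss = F.binary_cross_entropy(D(fake), torch.ones(b, 1))
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()

    # Tiny fully-connected G and D for illustration (28x28 images, flattened).
    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                      nn.Linear(256, 1), nn.Sigmoid())
    losses = gan_step(G, D, torch.optim.Adam(G.parameters()),
                      torch.optim.Adam(D.parameters()),
                      real=torch.rand(8, 784) * 2 - 1)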
0.17 Stochastic Adversarial Video Prediction (SAVP) [5]
As we know, the SAVP architecture combines GAN and VAE. GANs are trained with randomly drawn input,
whereas VAEs only observe ground truth images and are never trained on random input which may lead to a
train-test mismatch.
VAE: The generator specifies a distribution $p(x_t \mid x_{0:t-1}, z_{0:t-1})$, parametrized as a fixed-variance Laplacian distribution with mean $\hat{x}_t = G(x_0, z_{0:t-1})$. The likelihood of the data $p(x_{1:T} \mid x_0)$ cannot be maximized directly, since it involves marginalizing over the latent variables, so the variational lower bound of the log-likelihood is maximized instead. The posterior is approximated with a recognition model $q(z_t \mid x_{t:t+1})$, parametrized as a conditionally Gaussian distribution $\mathcal{N}(\mu_{z_t}, \sigma^2_{z_t})$ represented by a deep network $E(x_{t:t+1})$. The encoder E is conditioned on the frames of the adjacent time steps so that the latent variable $z_t$ can encode any ambiguous information about the transition between frames $x_t$ and $x_{t+1}$. During training, the latent code is sampled from $q(z_t \mid x_{t:t+1})$. The generation of each frame can be thought of as the reconstruction of frame $\hat{x}_{t+1}$. To allow back-propagation through the encoder, the reconstruction term is rewritten using the re-parametrization trick:

$$\mathcal{L}_1(G, E) = \mathbb{E}_{x_{0:T},\ z_t \sim E(x_{t:t+1})\big|_{t=0}^{T-1}} \left[ \sum_{t=1}^{T} \left\| x_t - G(x_0, z_{0:t-1}) \right\|_1 \right]$$
To enable sampling from the prior at test time, a regularization term encourages the approximate posterior to stay close to the prior distribution:

$$\mathcal{L}_{KL}(E) = \mathbb{E}_{x_{0:T}} \left[ \sum_{t=1}^{T} D_{KL}\!\left( E(x_{t-1:t}) \,\|\, p(z_{t-1}) \right) \right]$$
The VAE optimization involves minimizing the following objective, where the relative weighting of $\lambda_1$ and $\lambda_{KL}$ is determined by the (fixed) variance of the Laplacian distribution:

$$G^*, E^* = \underset{G, E}{\arg\min}\ \lambda_1 \mathcal{L}_1(G, E) + \lambda_{KL} \mathcal{L}_{KL}(E)$$
GAN: Given a classifier D capable of distinguishing generated videos $\hat{x}_{1:T}$ from real videos $x_{1:T}$, the generator can be trained to match the statistics of the real data distribution using the binary cross-entropy loss:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x_{1:T}}\!\left[\log D(x_{0:T-1})\right] + \mathbb{E}_{x_{1:T},\ z_t \sim p(z_t)\big|_{t=0}^{T-1}}\!\left[\log\!\left(1 - D(G(x_0, z_{0:T-1}))\right)\right]$$
The classifier, which is not known a priori and is problem-specific, can be realized as a deep discriminator network that is learned adversarially:

$$G^* = \underset{G}{\arg\min}\ \underset{D}{\max}\ \mathcal{L}_{GAN}(G, D)$$
Therefore, the objective function of the SAVP model is

$$G^*, E^* = \underset{G, E}{\arg\min}\ \underset{D,\, D^{VAE}}{\max}\ \lambda_1 \mathcal{L}_1(G, E) + \lambda_{KL} \mathcal{L}_{KL}(E) + \mathcal{L}_{GAN}(G, D) + \mathcal{L}^{VAE}_{GAN}(G, E, D^{VAE})$$
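As a sketch of how the VAE part of this objective can be computed in PyTorch (the adversarial terms would be added as in the GAN loss above); the function name, tensor shapes, and λ values are our illustrative assumptions, not the paper's tuned settings.

    import torch
    import torch.nn.functional as F

    def savp_vae_terms(x, x_hat, mu, logvar, lambda_1=100.0, lambda_kl=0.1):
        """VAE part of the SAVP objective: lambda_1 * L1 + lambda_kl * L_KL.

        x, x_hat   : (batch, T, C, H, W) ground-truth and predicted frames.
        mu, logvar : per-step Gaussian posterior parameters from the encoder E.
        """
        l1 = F.l1_loss(x_hat, x, reduction="mean")   # Laplacian likelihood -> L1 loss
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return lambda_1 * l1 + lambda_kl * kl

    # Toy shapes: 2 videos, 5 predicted frames of 3x32x32, 8-dim latents per step.
    x, x_hat = torch.rand(2, 5, 3, 32, 32), torch.rand(2, 5, 3, 32, 32)
    mu, logvar = torch.zeros(2, 5, 8), torch.zeros(2, 5, 8)
    loss = savp_vae_terms(x, x_hat, mu, logvar)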
Network: A convolutional LSTM serves as the generator, predicting pixel-space transformations between the
current and future frame. The network is conditioned by the current frame and latent code at each time step. The
network is conditioned on its own predictions after the initial frames. Concatenating the latent codes along the
channel dimension to the inputs of all the convolutional layers of the convolutional LSTM achieves conditioning
on the latent codes. At each time step, the encoder uses a feed-forward convolutional network to encode pairs of
pictures. The video discriminator is an SNGAN-based feed-forward convolutional network with 3D filters, with
the filters "inflated" from 2D to 3D.
0.18 Results of SAVP
This method appears more promising and efficient in video prediction, compared to ST-GCN. We cannot apply
this on our each dataset. Few results were obtained, using the datasets available from the paper.
0.19 Future work
Due to the complexity of this project, we are yet to finish it and have not obtained the desired outcome. Together with our mentor on this project, Rucha Bhalchandra Joshi, we have planned to finish it by the end of December.
0.20 Relevant Papers:
1. Bing Yu, Haoteng Yin, Zhanxing Zhu. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. 10.24963/ijcai.2018/
2. Bin Zhao, Haopeng Li, Xiaoqiang Lu, Xuelong Li. Reconstructive Sequence-Graph Network for Video Summarization. 10.1109/TPAMI.2021.3072117.
3. Rucha Bhalchandra Joshi, Subhankar Mishra. Learning Graph Representations.
4. Jasmeen Kaur, Sukhendu Das. Future Frame Prediction of a Video Sequence.
5. Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, Sergey Levine. Stochastic Adversarial Video Prediction. arXiv preprint arXiv:1804.01523.
6. Thomas N. Kipf, Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907, 2017.
7. https://scikit-learn.org/stable/modules/feature_extraction.html#image-feature-extraction
8. Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, Sergey Levine. Stochastic Variational Video Prediction. arXiv:1710.11252, 2017.