To generate the successive few frames in a video using
Graph Neural Networks.
Course Project, CS460, Fall 2021
Group 6: Shivam Raj (1711127), Shrishti Barethiya (1711128)
https://github.com/raj2022/21cs460_group06
GOAL OR MOTIVATION:
To predict the next few frames of a video using a Graph Neural Network for spatio-temporal graphs.
To determine the positions of the nodes a few frames later.
PROJECT PROPOSAL
Graph neural networks are a highly scalable class of models that can learn on graph-structured data. A Graph Neural Network is a neural network that can be applied directly to graphs. It provides a convenient way to perform node-level, edge-level, and graph-level prediction tasks. Many fundamental relationships among data in several areas of science and engineering can be represented as graphs.
There are mainly three types of graph neural networks:
Recurrent Graph Neural Network
Spatial Convolutional Network
Spectral Convolutional Network
A graph G is an ordered pair of a set V of vertices and a set E of edges, written G = (V, E), where V is the set of nodes or vertices and E is the set of edges between them, as shown in Fig. 1.
Figure 1: General Graph representation of a problem
GNNs have many applications, including natural language processing, computer vision, human behavior detection, traffic control, molecular structure study, recommender systems, program verification, logical reasoning, social influence prediction, and adversarial attack prevention.
In Fig. 1, let the edge connecting two vertices A ($v_i$) and B ($v_j$) be represented as $e_{ij} = (v_i, v_j)$; n is the number of nodes in the graph and m is the number of edges. If A is the adjacency matrix of the graph, then the dimension of A is $n \times n$, with $A_{ij} = 1$ if $e_{ij} \in E$. The neighborhood N(v) of a node v is defined as $N(v) = \{u \in V \mid (u, v) \in E\}$.
Graph data exists in many fields. For example, in medicine and pharmacy, atoms or molecules can be taken as nodes (vertices) and their bond distances as edges; in social media networks, people can be taken as nodes and their age or gender as attributes.
0.1 What can we do with graph data in machine learning?
We can perform node-level prediction: given a graph with unlabeled nodes, we want to predict attributes of these nodes or classify them, making use of information from the other nodes. We can also perform link prediction, or edge-level prediction, to determine connections between nodes.
0.2 Problems with graph data?
The size and shape of graphs may change within a dataset, so we need a method that handles arbitrary input shapes. Moreover, two graphs that look different can be structurally identical; this issue is addressed by permutation invariance.
GNN tasks can be broadly divided into node classification, link prediction, and graph classification. In node classification, the objective is to determine the labels of samples by looking at the labels of their neighbors. In link prediction, the task is to understand the relationship between entities in a graph and predict whether two entities are connected. In graph classification, the task is to classify the whole graph into different categories; it is similar to image classification, but the target changes to the graph domain. Graph visualization and graph clustering are further examples of GNN applications.
0.3 Image Segmentation
Segmentation subdivides an image into constituent regions or objects: it is the process of partitioning a digital image into multiple regions and extracting the meaningful region, known as the region of interest (ROI).
Image segmentation is based on:
1 Similarity principle: the objective is to group pixels based on a common property in order to extract a coherent region.
2 Discontinuity principle: the objective is to extract regions that differ in properties such as intensity, color, or texture.
Image segmentation can represent the image as a graph. How? Every pixel in an image is a vertex, and edges connect pairs of pixels that are close to each other (not all pairs). So we have a graph G = (V, E), where V and E are the sets of vertices and edges, respectively. Each edge is weighted by the affinity or similarity between its two vertices.
How do we measure affinity? Let i and j be two pixels with feature vectors $f_i$ and $f_j$, connected by an edge.

Pixel dissimilarity:

$$S(f_i, f_j) = \sqrt{\sum_k (f_{ik} - f_{jk})^2}$$

Pixel affinity:

$$w(i, j) = A(f_i, f_j) = e^{-\frac{1}{2\sigma^2} S(f_i, f_j)}$$

where $A(f_i, f_j)$ is the affinity. The smaller the dissimilarity $S(f_i, f_j)$, the larger the affinity; this relationship is controlled by the parameter σ. The affinity is the weight w(i, j) associated with the edge.
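For illustration, a minimal NumPy sketch of this affinity construction, assuming grayscale intensities as the pixel features and a 4-connected neighborhood; the helper name affinity_matrix and the σ value are our illustrative choices.

    import numpy as np

    def affinity_matrix(image, sigma=0.1):
        """Affinity graph over 4-connected pixels:
        w(i, j) = exp(-S(f_i, f_j) / (2 * sigma**2))."""
        h, w = image.shape
        W = np.zeros((h * w, h * w))
        for r in range(h):
            for c in range(w):
                i = r * w + c
                for dr, dc in ((0, 1), (1, 0)):    # right and down neighbours
                    rr, cc = r + dr, c + dc
                    if rr < h and cc < w:
                        j = rr * w + cc
                        s = abs(float(image[r, c]) - float(image[rr, cc]))  # S(f_i, f_j)
                        W[i, j] = W[j, i] = np.exp(-s / (2 * sigma ** 2))
        return W

    W = affinity_matrix(np.random.default_rng(0).random((8, 8)))  # 64x64 affinity matrix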
Graph cut: a cut C = ($V_A$, $V_B$) is a partition of the vertices V of a graph G = (V, E) into two disjoint subsets $V_A$ and $V_B$. Cut-set: the set of edges whose vertices lie in different subsets of the partition. Cost of the cut: the sum of the weights of the cut-set edges,

$$cut(V_A, V_B) = \sum_{u \in V_A,\, v \in V_B} w(u, v)$$
Algorithm: use the following criteria. A pair of vertices (pixels) within a subgraph should have high affinity, and a pair of vertices from two different subgraphs should have low affinity. That is, minimize the cost of the cut, also called Min-Cut. But there is a problem with the Min-Cut algorithm: it is biased toward cutting small, isolated segments. For instance, suppose three weak-affinity edges form the desired cut; if the cost of cutting these three edges is higher than the cost of cutting a single edge to an isolated vertex, the isolated vertex will be cut off instead of the three weak edges.
There is a simple solution: normalize the cut so that larger subgraphs are favored. That requires a measure of the size of a subgraph; one way is to measure how strongly the vertices of the subgraph $V_A$ are associated with the full graph V,

$$assoc(V_A, V) = \sum_{u \in V_A,\, v \in V} w(u, v)$$
Here, assoc() is the sum of the weights of all edges connecting $V_A$ to the whole graph. Therefore, take the cost of the cut and normalize it with respect to the association: minimize the cost of the normalized cut (NCut) during partition,

$$NCut(V_A, V_B) = \frac{cut(V_A, V_B)}{assoc(V_A, V)} + \frac{cut(V_A, V_B)}{assoc(V_B, V)}$$
Minimizing NCut exactly is hard, so a fast eigenvector-based approximation, known as spectral methods, is used. By applying this algorithm, an image represented as a graph can be segmented.
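As an illustration, scikit-learn's spectral_clustering performs exactly this kind of eigenvector-based (normalized-cut style) partition of a pixel affinity graph. The sketch below uses a toy synthetic image; the affinity conversion and all parameter values are illustrative choices, not settings from this project.

    import numpy as np
    from sklearn.cluster import spectral_clustering
    from sklearn.feature_extraction.image import img_to_graph

    # Toy image: a bright square on a dark, slightly noisy background.
    rng = np.random.default_rng(0)
    image = np.zeros((30, 30))
    image[8:22, 8:22] = 1.0
    image += 0.05 * rng.standard_normal(image.shape)

    # Pixel graph with gradient-based edge weights; convert gradients to
    # affinities in the spirit of w = exp(-S / (2 * sigma^2)).
    graph = img_to_graph(image)
    sigma = graph.data.std()
    graph.data = np.exp(-graph.data / (2 * sigma ** 2))

    # Eigenvector-based (normalized-cut style) bipartition of the pixel graph.
    labels = spectral_clustering(graph, n_clusters=2, eigen_solver="arpack",
                                 random_state=0)
    segmentation = labels.reshape(image.shape)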
0.4 How will we proceed?
Here we will use a Graph Neural Network (GNN) to determine the next few video frames. The multiple frames add a temporal aspect to the graph; hence, the graph generated from a number of frames is a spatio-temporal graph. In general we will try to use STGCN. An image can be represented as a graph by dividing it into regions and interconnecting neighboring regions. A video consists of multiple frames, and using the above methods each frame can be converted into a graph. For video prediction we will mostly focus on the spectral convolutional network, which is built on graph signal processing theory and on simplification and approximation of graph convolution. By Chebyshev polynomial approximation (Hammond et al. 2011), the graph convolution can be simplified to the form below:
$$g_\theta \star x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})\, x$$
After further simplification, the GCN paper [6] suggests a 2-layer neural network structure, which can be described in one equation as:
$$Z = f(X, A) = \mathrm{softmax}\!\left(\hat{A}\,\mathrm{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
where $\hat{A}$ is the normalized adjacency matrix (with self-loops added). Graph auto-encoders aim at reconstructing the original adjacency matrix.
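A minimal NumPy sketch of this two-layer forward pass. It assumes the renormalization $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ from the GCN paper; the toy graph and random weights are purely illustrative.

    import numpy as np

    def normalize_adjacency(A):
        """Renormalization trick: A_hat = D^{-1/2} (A + I) D^{-1/2}."""
        A_tilde = A + np.eye(A.shape[0])
        d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
        return d_inv_sqrt @ A_tilde @ d_inv_sqrt

    def gcn_forward(A, X, W0, W1):
        """Two-layer GCN: Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
        A_hat = normalize_adjacency(A)
        H = np.maximum(A_hat @ X @ W0, 0.0)              # ReLU
        logits = A_hat @ H @ W1
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)          # row-wise softmax

    # Toy example: 4-node path graph, 3 input features, 2 output classes.
    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    X = rng.standard_normal((4, 3))
    Z = gcn_forward(A, X, rng.standard_normal((3, 8)), rng.standard_normal((8, 2)))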
Since a large amount of video data is captured, video summarization is used to condense the original video into a summary while preserving the main content. Many video summarization approaches have been proposed, but a fully convolutional structure along the time axis has gained significant attention. We can compare video prediction with the traffic speed forecasting problem: forecasting the speed, volume, or density of roads in traffic networks is fundamentally important in a smart transportation system. The traffic prediction problem can be addressed using STGNNs or STCNs, considering the traffic network as a spatio-temporal graph where the nodes are sensors installed on roads, the edges are weighted by the distance between pairs of nodes, and each node has the average traffic speed within a window as dynamic input features.
0.5 Mid-way Goals.
By midway we will try to understand how STGCN works and how it can be implemented for video prediction. By the end we will try to obtain results using STGCN.
MIDWAY
Zhanxing Zhu and co-workers propose a novel deep learning framework, the Spatio-Temporal Graph Convolutional Network (STGCN), to tackle time-series prediction problems such as those in the traffic domain. Instead of applying regular convolutional and recurrent units, they formulate the problem on graphs and build the model with complete convolutional structures. Purely convolutional structures are applied to extract spatio-temporal features simultaneously from graph-structured data, in video prediction as was done for time series in the traffic study. The model, built with complete convolutional structures, enables much faster training with fewer parameters. [1]
To take full advantage of spatial features, we use a convolutional neural network (CNN) to capture adjacency relations among the frames of a video, along with a recurrent neural network (RNN) applied along the time axis. To handle the inherent deficiencies of recurrent networks, we employ a fully convolutional structure on the time axis.
In today's world there are cameras on every street, so an immense amount of video data is available online, and automated tools are needed to handle it. Here, video summarization becomes very helpful: it condenses the original video into a short summary while preserving the important information. Many summary formats are available. In paper [2], Xuelong Li and co-workers use video skimming, where the summary is formed from several key shots. Many approaches have been proposed for video summarization, but Recurrent Neural Networks (RNNs) have provided significant advantages. These approaches take video data as a sequence of frames and summarize the video using temporal dependencies. An RNN is able to capture local dependencies but fails to capture global dependencies, as it is easily distracted by noise. Video data is layered as frames and shots; shots are intermediate states between the video and the frame, formed from several consecutive frames. The frames in a shot are easily modeled as a temporal sequence using an RNN, but information between adjacent shots can vary widely, and sequence models that take only neighborhood dependencies into account can suffer interference. A sequence is itself a special graph in which only consecutive items are connected. To capture both local and global dependencies, it is better to model the video shots as a complete graph: all shots are connected as graph nodes, and the dependencies are calculated from the interaction between two shots. Paper [2] is very helpful for understanding the basics of frames and shots and introduces useful video techniques; combined with other algorithms, these techniques can be useful in video prediction.
For our project we have to do video prediction. Video frame prediction is the task of predicting subsequent
frames, given a sequence of video frames. Formally, it can be defined as follows. Let $X_t \in \mathbb{R}^{w \times h \times c}$ be the t-th frame in the input video sequence $X = (X_{t-n}, \ldots, X_{t-1}, X_t)$ consisting of n frames, where w, h, and c denote the width, height, and number of channels respectively. The target is to predict the subsequent m frames $Y = (Y_{t+1}, Y_{t+2}, \ldots, Y_{t+m})$ with clarity and high precision, using the input X. [4]
0.6 Problem Formulation
Video frame prediction is a typical time-series prediction problem, i.e., predicting the most likely video frames in the next H time steps given the previous M observations (previous video frames):
$$\hat{v}_{t+1}, \ldots, \hat{v}_{t+H} = \underset{v_{t+1}, \ldots, v_{t+H}}{\arg\max}\ \log P\!\left(v_{t+1}, \ldots, v_{t+H} \mid v_{t-M+1}, \ldots, v_t\right) \qquad (1)$$

where $v_t \in \mathbb{R}^n$ is an observation vector of n image frames at time step t.
In this project, we define the network on a graph and focus on structured image frames of the videos. The observations $v_t$ are not independent but connected by pairwise connections in the graph. Therefore, the data point $v_t$ can be regarded as a graph signal defined on an undirected (or directed) graph G with weights $w_{ij}$, as shown in Fig. 2.
Figure 2: Graph-structured traffic data: each $v_t$ indicates a frame of the current traffic status at time step t, recorded in a graph-structured data matrix.
At the t-th time step, in the graph $G_t = (V_t, \mathcal{E}, W)$, $V_t$ is a finite set of vertices corresponding to the observations from n monitor stations in a traffic network; $\mathcal{E}$ is a set of edges indicating the connectedness between stations; and $W \in \mathbb{R}^{n \times n}$ denotes the weighted adjacency matrix of $G_t$.
0.7 Convolutions on Graph
There are two basic approaches currently exploring how to generalize CNNs to structured data forms. One is to expand the spatial definition of a convolution [Niepert et al., 2016], and the other is to manipulate in the spectral domain with graph Fourier transforms [Bruna et al., 2013]. The former approach rearranges the vertices into certain grid forms which can be processed by normal convolutional operations. The latter introduces the spectral framework to apply convolutions in spectral domains, often named the spectral graph convolution. Several follow-up studies make the graph convolution more promising by reducing the computational complexity from O(n²) to linear. Here we use the notion of the graph convolution operator $*_{\mathcal{G}}$, based on the conception of spectral graph convolution, as the multiplication of a signal $x \in \mathbb{R}^n$ with a kernel Θ:
$$\Theta *_{\mathcal{G}} x = \Theta(L)x = \Theta(U \Lambda U^T)x = U\,\Theta(\Lambda)\,U^T x$$
where the graph Fourier basis $U \in \mathbb{R}^{n \times n}$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_n - D^{-1/2} W D^{-1/2} = U \Lambda U^T \in \mathbb{R}^{n \times n}$ ($I_n$ is the identity matrix, D is the diagonal degree matrix, and Λ is the diagonal matrix of eigenvalues of L); $\Theta(\Lambda)$ is also a diagonal matrix. By this definition, a graph signal x is filtered by a kernel Θ through multiplication between Θ and the graph Fourier transform $U^T x$.
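This definition can be written out directly in NumPy. The sketch below assumes a dense adjacency matrix and an illustrative low-pass choice of Θ; both the function name and the toy graph are our own.

    import numpy as np

    def spectral_graph_conv(W, x, theta):
        """Spectral filtering: Theta *_G x = U Theta(Lambda) U^T x,
        where L = I_n - D^{-1/2} W D^{-1/2} = U Lambda U^T."""
        n = W.shape[0]
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized graph Laplacian
        lam, U = np.linalg.eigh(L)                    # eigendecomposition of L
        return U @ np.diag(theta(lam)) @ U.T @ x      # filter in the spectral domain

    # Example with an illustrative low-pass kernel Theta(lambda) = exp(-2*lambda).
    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    x = np.array([1.0, -1.0, 2.0, 0.5])
    y = spectral_graph_conv(W, x, theta=lambda lam: np.exp(-2.0 * lam))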
0.8 Spatio-Temporal Graph Neural Networks
The spatio-temporal systems have dependency on both time and space. In this section, we elaborate on
the proposed architecture of spatio-temporal graph convolutional networks (STGCN). As shown in Figure 3,
STGCN is composed of several spatiotemporal convolutional blocks, each of which is formed as a sandwich
structure with two gated sequential convolution layers and one spatial graph convolution layer in between.
We now summarize the main characteristics of our model STGCN in the following points:
STGCN is a universal framework for processing structured time series. It is not only able to tackle image modeling and prediction issues but can also be applied to more general spatio-temporal sequence learning tasks.
The spatio-temporal block combines graph convolutions and gated temporal convolutions, which can
extract the most useful spatial features and capture the most essential temporal features coherently.
Figure 3: Architecture of the spatio-temporal graph convolutional network. The STGCN framework consists of two spatio-temporal convolutional blocks (ST-Conv blocks) and a fully-connected output layer at the end. Each ST-Conv block contains two temporal gated convolution layers and one spatial graph convolution layer in the middle. A residual connection and a bottleneck strategy are applied inside each block. The input $v_{t-M+1}, \ldots, v_t$ is uniformly processed by the ST-Conv blocks to explore spatial and temporal dependencies coherently. Comprehensive features are integrated by the output layer to generate the final prediction $\hat{v}$.
The model is entirely composed of convolutional structures and therefore achieves parallelization over
input with fewer parameters and faster training speed. More importantly, this economic architecture
allows the model to handle large-scale networks with more efficiency.
The key building block of this architecture is the spatio-temporal convolutional block.
0.9 Spatio-temporal Convolutional Block
To use features from both the spatial and temporal domains, the spatio-temporal convolutional block (ST-Conv block) is constructed to jointly process graph-structured time series. The block itself can be stacked or extended based on the scale and complexity of particular cases.
As in Fig. 3, the spatial layer in the middle bridges two temporal layers, which achieves fast spatial-state propagation from graph convolution through temporal convolutions. The sandwich structure also helps the network apply the bottleneck strategy, achieving scale compression and feature squeezing by down-scaling and up-scaling the channels C through the graph convolutional layer. Moreover, layer normalization is used within every ST-Conv block to prevent over-fitting. The input and output of ST-Conv blocks are all 3-D tensors.
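A simplified PyTorch sketch of one ST-Conv block is shown below. The GLU-gated temporal convolutions and the first-order graph convolution follow the sandwich description above, but the class names, channel sizes, dense adjacency, and omission of the residual/bottleneck details are our simplifying assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class TemporalGatedConv(nn.Module):
        """1-D convolution along the time axis with a GLU gate (simplified)."""
        def __init__(self, c_in, c_out, kernel_t=3):
            super().__init__()
            self.conv = nn.Conv2d(c_in, 2 * c_out, kernel_size=(kernel_t, 1))

        def forward(self, x):                      # x: (batch, c_in, time, nodes)
            p, q = self.conv(x).chunk(2, dim=1)
            return p * torch.sigmoid(q)            # GLU gating

    class STConvBlock(nn.Module):
        """Sandwich structure: temporal conv -> graph conv -> temporal conv."""
        def __init__(self, c_in, c_spatial, c_out, A_hat, kernel_t=3):
            super().__init__()
            self.t1 = TemporalGatedConv(c_in, c_spatial, kernel_t)
            self.theta = nn.Linear(c_spatial, c_spatial, bias=False)
            self.t2 = TemporalGatedConv(c_spatial, c_out, kernel_t)
            self.norm = nn.LayerNorm(c_out)
            self.register_buffer("A_hat", A_hat)   # normalized adjacency matrix

        def forward(self, x):                      # x: (batch, c_in, time, nodes)
            h = self.t1(x).permute(0, 2, 3, 1)     # -> (batch, time, nodes, channels)
            # First-order spatial graph convolution: mix node features via A_hat.
            h = torch.relu(torch.einsum("uv,btvc->btuc", self.A_hat, self.theta(h)))
            h = self.t2(h.permute(0, 3, 1, 2))     # back to (batch, ch, time, nodes)
            return self.norm(h.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    # Toy usage: 2 videos, 1 input channel, 12 frames, 5 graph nodes.
    block = STConvBlock(c_in=1, c_spatial=16, c_out=32, A_hat=torch.eye(5))
    y = block(torch.randn(2, 1, 12, 5))            # -> (2, 32, 8, 5)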
0.10 Future Frame Prediction of a Video Sequence
In this paper [4], the authors address the ambiguous nature of the problem: there may be multiple possible future sequences for the same input video, and a model trained to make a single prediction tends to average the possible futures. Two different approaches have attempted to address this problem:
Use of latent variable models that represent the underlying stochasticity.
Adversarially trained models that aim to produce sharper images.
The former approach fails to produce realistic results, and the latter fails to produce diverse predictions; the two have complementary strengths and weaknesses. Thus, Sukhendu Das and co-workers combined both approaches and proposed a multi-scale architecture. In this paper, the Moving MNIST, UCF101, and Sports-1M datasets are used. We will try to apply STGCN to these datasets.
Prediction of future events is necessary for any decision-making system. It is relatively easy for humans, who can make a staggering number of predictions about the future just by looking at a series of pictures. For instance, take the example of the short clip shown in Fig. 4.
Figure 4: A bowling ball moving towards a bowling pin. Our aim is to predict whether there will be a collision or
not. Prediction of such collisions is very important in the context of video frames.
Here, the aim is to predict whether the ball will collide with the bowling pin. Since the ball is at a distance from the pin, it will not collide; as humans, we can easily predict that the ball approaching the pin will not collide with it. There are many cases where it is necessary to predict the future, and several approaches have been proposed.
0.10.1 Proposed Architecture
The SAVP architecture, as proposed in "Stochastic Adversarial Video Prediction", combines a GAN and a VAE. GANs are trained with randomly drawn input, whereas VAEs only observe ground-truth images and are never trained on random input, which may lead to a train-test mismatch.
Modifications Proposed
Replace Convolutional LSTM with E3D-LSTM
A convolutional LSTM predicts future frames based on sequentially updated memory states: when the memory cell is refreshed, older memories are discarded. In contrast, the proposed E3D-LSTM model maintains a list of historical memory records and revokes them when necessary, thereby facilitating long-range video reasoning. The Eidetic 3D LSTM (E3D-LSTM) has a new state, the RECALL state, which, unlike a standard LSTM, can recall the long-term past; the RECALL mechanism helps retrieve temporally distant memories.
Replace GAN with MGGAN. Generative adversarial networks are susceptible to mode collapse. To alleviate it, a Manifold Guided Generative Adversarial Network (MGGAN) is used. MGGAN adds a guidance network to the existing GAN architecture; the guidance network consists of an encoder and a discriminator. To address mode collapse, the encoder is designed so that its output distribution is the best approximation of $p_{data}$, capturing all modes of the data. To this end, the encoder of a pre-trained autoencoder is adopted as the manifold mapping function: because the autoencoder aims to reconstruct all samples in the dataset, its encoder should not ignore minor modes. Consequently, the generator avoids missing modes during training, because it receives feedback on minor modes of the data distribution from the guidance network.
Modified SAVP Architecture. The modified SAVP architecture is shown in Figure 5; the modifications made are encircled. The architecture has two MGGANs and a VAE, with the decoder of the VAE also acting as a generator. MGGAN integrates a guidance network into the existing GAN architecture. The guidance network encourages the generator to learn all modes of $p_{data}$; it comprises an encoder for manifold mapping and a discriminator for measuring the dissimilarity between the distributions of $p_{data}$ and $p_{model}$ in the manifold space. To cope with mode collapse, the guidance network uses the encoder layer of a pre-trained autoencoder. This autoencoder is optimized for reconstructing all samples of real images and is kept fixed after pre-training, so it is not updated during the training of the GAN [4].
Figure 5: Modified SAVP Architecture
DATASETS
We have planned to use a few available datasets:
PeMSD7:
This traffic dataset is used for traffic prediction in California; the task is similar to video prediction. PeMSD7 was collected from the Caltrans Performance Measurement System (PeMS) in real time by over 39,000 sensor stations deployed across the major metropolitan areas of the California state highway system [Chen et al., 2001]. The dataset is aggregated into 5-minute intervals from 30-second data samples. A medium and a large scale were randomly selected from District 7 of California, containing 228 and 1,026 stations, labeled PeMSD7(M) and PeMSD7(L), respectively.
Moving MNIST dataset: This dataset randomly samples two digits from the original MNIST dataset, floating and bouncing off the boundaries at constant velocity and angle inside a 64 × 64 patch. New sequences can be generated whenever required, making the dataset an almost infinite source of video sequences. 8000 videos are used for training and 2000 for testing.
UCF101: This dataset contains 13,320 annotated action videos accumulated from YouTube, covering 101 different action categories. The videos have a resolution of 320 × 240 pixels and a frame rate of 25 fps. 10,000 videos are used for training and 3,320 for testing. The data can be downloaded from
https://www.crcv.ucf.edu/data/UCF101.php
Sports-1M Dataset: The Sports-1M dataset is licensed under Creative Commons 3.0 and contains 1,133,158
video URLs which have been annotated automatically with 487 Sports labels using the YouTube Topics
API.
All the datasets can be found here,
https://drive.google.com/drive/folders/1SRRc22E045PcgBY7N5GS1s-3fn5OJeny?usp=sharing
For these datasets, we need to find the weighted adjacency matrix, which can be computed using a thresholded Gaussian kernel:

$$a_{ij} = \begin{cases} \exp\!\left(-\dfrac{dist(v_i, v_j)^2}{\sigma^2}\right), & \text{if } dist(v_i, v_j) \le k, \\ 0, & \text{otherwise,} \end{cases}$$
where $a_{ij}$ represents the edge weight between $v_i$ and $v_j$, $dist(v_i, v_j)$ denotes the physical distance between vertices $v_i$ and $v_j$, σ is the standard deviation of the distances, and k is the threshold.
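A small NumPy sketch of this thresholded Gaussian kernel; the pairwise-distance matrix and the helper name weighted_adjacency are illustrative.

    import numpy as np

    def weighted_adjacency(dist, k):
        """Thresholded Gaussian kernel:
        a_ij = exp(-dist_ij^2 / sigma^2) if dist_ij <= k, else 0."""
        sigma = dist.std()
        A = np.exp(-(dist ** 2) / (sigma ** 2))
        A[dist > k] = 0.0
        np.fill_diagonal(A, 0.0)        # no self-loops
        return A

    # Toy pairwise-distance matrix for 4 stations (illustrative values).
    dist = np.array([[0, 1, 4, 9],
                     [1, 0, 2, 7],
                     [4, 2, 0, 3],
                     [9, 7, 3, 0]], dtype=float)
    A = weighted_adjacency(dist, k=5.0)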
FUTURE WORKS
In the next few days we will try to predict the next frame of the video using the proposed model, and also to find the positions of the nodes after some number of frames.
FINAL REPORT
The extract_patches_2d function from sklearn extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis, as shown in Figure 6. For rebuilding an image from all its patches, we have used reconstruct_from_patches_2d.
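A short usage sketch of these two scikit-learn functions; the 8×8 toy image and 4×4 patch size are illustrative choices.

    import numpy as np
    from sklearn.feature_extraction.image import (extract_patches_2d,
                                                  reconstruct_from_patches_2d)

    # Toy 8x8 grayscale image; the 4x4 patch size is an illustrative choice.
    image = np.arange(64, dtype=float).reshape(8, 8)

    patches = extract_patches_2d(image, (4, 4))
    print(patches.shape)             # (25, 4, 4): every 4x4 patch of an 8x8 image

    # Rebuild the image by averaging the overlapping patches.
    rebuilt = reconstruct_from_patches_2d(patches, (8, 8))
    assert np.allclose(rebuilt, image)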
0.11 Algorithm for STGCN on custom skeleton-based dataset
Suppose we have a small video set of three 10-second clips (taken from the UCF101 dataset) in a folder named dataset_example, with the structure:
folder name/dataset_example
skateboarding.mp4
clean_and_jerk.mp4
ta_chi.mp4
Now, convert them into skeleton data using the following annotation format:
{
    "info": {
        "video_name": "skateboarding.mp4",
        "resolution": [340, 256],
        "num_frame": 300,
        "num_keypoints": 17,
        "keypoint_channels": ["x", "y", "score"],
        "version": "1.0"
    },
    "annotations": [
        {
            "frame_index": 0,
            "id": 0,
            "person_id": null,
            "keypoints": [[x, y, score], [x, y, score], ...]
        },
        ...
    ],
    "category_id": 0
}
Further, train the STGCN model and also test the model.
0.12 STGCN on MMSkeleton
MMSkeleton is an open-source toolbox for skeleton-based human understanding. It is part of the open-mmlab project maintained by the Multimedia Laboratory, CUHK. We ran MMSkeleton on a video (GIF) of a skater skating over a staircase; the graph can be constructed from a given frame of the video, as shown in Fig. 6.
Figure 6: ST-GCN used for recognition of skeleton-based behavior
Figure 7: Action segmentation results using Stacked-STGCN on the CAD120 dataset. Green: correct detection
and red: erroneous detection.
0.13 Result from STGCN on Traffic Forecasting
For the PeMSD7 dataset, running the given STGCN code, we obtained:

Model          t = 15   t = 30   t = 45
STGCN(Cheb)     2.23     2.98     3.58
STGCN(1st)      2.28     3.06     3.79
0.14 SAVP video prediction
This method, which was proposed at midway, produced significant results on the datasets given in the paper [5]. A generative model is a powerful class of algorithms in which the model learns from any kind of data distribution using unsupervised learning. Two common and efficient approaches are the variational autoencoder (VAE) and the generative adversarial network (GAN).
0.15 Variational Autoencoder (VAE)
Autoencoders (AE) are neural networks that aim to copy their inputs to their outputs. An AE takes input data with very high dimensionality, runs it through a neural network, and tries to compress the data into a smaller representation. It works by compressing the input into a latent-space representation and then reconstructing the output from this representation. The network is composed of two parts:
1 Encoder: compresses the input into a latent representation.
2 Decoder: reconstructs the input from the latent-space representation.
First, a brief note on representation learning: it is defined as a set of techniques that allow a system to discover the representations needed for feature detection or classification from raw data. The representation must capture the features of the original data.
Figure 8: Architecture of Autoencoders.
The encoder takes an input sample and converts its information into a vector; the decoder takes that vector and expands it to reconstruct the input sample. Why do this if the output is just the reconstructed image? The vector constructed in the middle, as shown in Fig. 8, is important because it is a representation of the input image or audio in a form the computer understands. An AE learns to transform an input into a vector by minimizing the reconstruction loss: during training, it tries to minimize the difference between the original and reconstructed images. The simplest encoder is the vanilla autoencoder, a three-layer net, i.e. a neural network with one hidden layer. There is also the convolutional autoencoder, which uses convolutions instead of fully connected layers and works on 3-D image tensors instead of flat 1-D vectors; the image dimension is decreased via a latent representation of smaller dimension, forcing the autoencoder to learn a compressed version of the images. The VAE likewise learns a hidden representation of the input and can generate new data. A VAE generates images by minimizing the sum of a reconstruction loss and a latent loss.
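For illustration, a minimal PyTorch VAE showing the reparametrization trick and the reconstruction-plus-KL objective described above; the layer sizes are arbitrary choices and inputs are assumed to lie in [0, 1].

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        """Minimal fully-connected VAE; layer sizes are illustrative."""
        def __init__(self, x_dim=784, h_dim=256, z_dim=16):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.logvar = nn.Linear(h_dim, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim), nn.Sigmoid())

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparametrization
            return self.dec(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        """Reconstruction loss plus KL divergence to the unit-Gaussian prior."""
        recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

    # Usage: inputs must lie in [0, 1] for the BCE reconstruction term.
    model = VAE()
    x = torch.rand(8, 784)
    x_hat, mu, logvar = model(x)
    loss = vae_loss(x, x_hat, mu, logvar)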
0.16 Generative Adversarial Network (GAN)
A GAN is a generative model that frames the problem as a supervised learning problem. There are two sub-models:
1 Generator: generates candidate outputs.
2 Discriminator: classifies an output as real (an original image) or fake (generated by the generator).
The generator model: The generator model takes a fixed-length random vector as input and generates a sample in the domain. The vector is drawn randomly from a Gaussian distribution and is used to seed the generative process. After training, points in this multidimensional vector space correspond to points in the problem domain, forming a compressed representation of the data distribution.
The discriminator model: The discriminator is a normal classification model; it performs binary classification.
The two models, the generator and discriminator, are trained together. The generator generates a batch of samples, and these, along with real examples from the domain, are provided to the discriminator and classified as real or fake. The discriminator is then updated to get better at discriminating real and fake samples in the next round, and, importantly, the generator is updated based on how well, or not, the generated samples fooled the discriminator.
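A condensed PyTorch sketch of one such alternating update, assuming a generator G that maps noise to samples and a discriminator D that outputs probabilities of shape (batch, 1); the tiny fully-connected models and all dimensions are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def gan_step(G, D, opt_g, opt_d, real, z_dim=100):
        """One alternating GAN update: D learns real-vs-fake, G learns to fool D."""
        b = real.size(0)
        fake = G(torch.randn(b, z_dim))        # generator seeds from Gaussian noise

        # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
        opt_d.zero_grad()
        d_loss = (F.binary_cross_entropy(D(real), torch.ones(b, 1)) +
                  F.binary_cross_entropy(D(fake.detach()), torch.zeros(b, 1)))
        d_loss.backward()
        opt_d.step()

        # Generator update: push D(fake) toward 1 (fool the discriminator).
        opt_g.zero_grad()
        g_loss = F.binary_cross_entropy(D(fake), torch.ones(b, 1))
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()

    # Tiny fully-connected G and D for illustration (28x28 images, flattened).
    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                      nn.Linear(256, 1), nn.Sigmoid())
    losses = gan_step(G, D, torch.optim.Adam(G.parameters()),
                      torch.optim.Adam(D.parameters()),
                      real=torch.rand(8, 784) * 2 - 1)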
0.17 Stochastic Adversarial Video Prediction (SAVP) [5]
As we know, the SAVP architecture combines GAN and VAE. GANs are trained with randomly drawn input,
whereas VAEs only observe ground truth images and are never trained on random input which may lead to a
train-test mismatch.
VAE: The generator specifies a distribution $p(x_t \mid x_{0:t-1}, z_{0:t-1})$, parametrized as a fixed-variance Laplacian distribution with mean $\hat{x}_t = G(x_0, z_{0:t-1})$. The likelihood of the data $p(x_{1:T} \mid x_0)$ cannot be maximized directly, since it involves marginalizing over the latent variables, so the variational lower bound of the log-likelihood is maximized instead. The posterior is approximated with a recognition model $q(z_t \mid x_{t:t+1})$, parametrized as a conditionally Gaussian distribution $\mathcal{N}(\mu_{z_t}, \sigma^2_{z_t})$ represented by a deep network $E(x_{t:t+1})$. The encoder E is conditioned on the frames of the adjacent time steps so that the latent variable $z_t$ can encode any ambiguous information about the transition between frames $x_t$ and $x_{t+1}$. During training, the latent code is sampled from $q(z_t \mid x_{t:t+1})$. The generation of each frame can be thought of as the reconstruction of frame $\hat{x}_{t+1}$. To allow back-propagation through the encoder, the reconstruction term is rewritten using the re-parametrization trick:

$$\mathcal{L}_1(G, E) = \mathbb{E}_{x_{0:T},\ z_t \sim E(x_{t:t+1})\big|_{t=0}^{T-1}} \left[ \sum_{t=1}^{T} \left\| x_t - G(x_0, z_{0:t-1}) \right\|_1 \right]$$
To enable sampling from the prior at test time, a regularization term encourages the approximate posterior to stay close to the prior distribution:

$$\mathcal{L}_{KL}(E) = \mathbb{E}_{x_{0:T}} \left[ \sum_{t=1}^{T} D_{KL}\!\left( E(x_{t-1:t}) \,\|\, p(z_{t-1}) \right) \right]$$
The VAE optimization involves minimizing the following objective, where the relative weighting of $\lambda_1$ and $\lambda_{KL}$ is determined by the (fixed) variance of the Laplacian distribution:

$$G^*, E^* = \underset{G, E}{\arg\min}\ \lambda_1 \mathcal{L}_1(G, E) + \lambda_{KL} \mathcal{L}_{KL}(E)$$
GAN: Given a classifier D capable of distinguishing generated videos $\hat{x}_{1:T}$ from real videos $x_{1:T}$, the generator can be trained to match the statistics of the real data distribution using the binary cross-entropy loss:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x_{1:T}}\!\left[\log D(x_{0:T-1})\right] + \mathbb{E}_{x_{1:T},\ z_t \sim p(z_t)\big|_{t=0}^{T-1}}\!\left[\log\!\left(1 - D(G(x_0, z_{0:T-1}))\right)\right]$$
The classifier, which is not known a priori and is problem-specific, can be realized as a deep discriminator network that is learned adversarially:

$$G^* = \underset{G}{\arg\min}\ \underset{D}{\max}\ \mathcal{L}_{GAN}(G, D)$$
Therefore, the objective function of the SAVP model is

$$G^*, E^* = \underset{G, E}{\arg\min}\ \underset{D,\, D^{VAE}}{\max}\ \lambda_1 \mathcal{L}_1(G, E) + \lambda_{KL} \mathcal{L}_{KL}(E) + \mathcal{L}_{GAN}(G, D) + \mathcal{L}^{VAE}_{GAN}(G, E, D^{VAE})$$
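As a sketch of how the VAE part of this objective can be computed in PyTorch (the adversarial terms would be added as in the GAN loss above); the function name, tensor shapes, and λ values are our illustrative assumptions, not the paper's tuned settings.

    import torch
    import torch.nn.functional as F

    def savp_vae_terms(x, x_hat, mu, logvar, lambda_1=100.0, lambda_kl=0.1):
        """VAE part of the SAVP objective: lambda_1 * L1 + lambda_kl * L_KL.

        x, x_hat   : (batch, T, C, H, W) ground-truth and predicted frames.
        mu, logvar : per-step Gaussian posterior parameters from the encoder E.
        """
        l1 = F.l1_loss(x_hat, x, reduction="mean")   # Laplacian likelihood -> L1 loss
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return lambda_1 * l1 + lambda_kl * kl

    # Toy shapes: 2 videos, 5 predicted frames of 3x32x32, 8-dim latents per step.
    x, x_hat = torch.rand(2, 5, 3, 32, 32), torch.rand(2, 5, 3, 32, 32)
    mu, logvar = torch.zeros(2, 5, 8), torch.zeros(2, 5, 8)
    loss = savp_vae_terms(x, x_hat, mu, logvar)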
Network: A convolutional LSTM serves as the generator, predicting pixel-space transformations between the
current and future frame. The network is conditioned by the current frame and latent code at each time step. The
network is conditioned on its own predictions after the initial frames. Concatenating the latent codes along the
channel dimension to the inputs of all the convolutional layers of the convolutional LSTM achieves conditioning
on the latent codes. At each time step, the encoder uses a feed-forward convolutional network to encode pairs of
pictures. The video discriminator is an SNGAN-based feed-forward convolutional network with 3D filters, with
the filters "inflated" from 2D to 3D.
0.18 Results of SAVP
This method appears more promising and efficient in video prediction, compared to ST-GCN. We cannot apply
this on our each dataset. Few results were obtained, using the datasets available from the paper.
0.19 Future work
Due to the complexity of this project, we are yet to finish it and have not obtained the desired outcome. Together with our mentor on this project, Rucha Bhalchandra Joshi, we have planned to finish it by the end of December.
0.20 Relevant Papers:
1. Bing Yu, Haoteng Yin, Zhanxing Zhu. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. 10.24963/ijcai.2018/
2. Bin Zhao, Haopeng Li, Xiaoqiang Lu, Xuelong Li. Reconstructive Sequence-Graph Network for Video Summarization. 10.1109/TPAMI.2021.3072117.
3. Rucha Bhalchandra Joshi, Subhankar Mishra. Learning Graph Representations.
4. Jasmeen Kaur, Sukhendu Das. Future Frame Prediction of a Video Sequence.
5. Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, Sergey Levine. Stochastic Adversarial Video Prediction. arXiv preprint arXiv:1804.01523.
6. Thomas N. Kipf, Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907, 2017.
7. https://scikit-learn.org/stable/modules/feature_extraction.html#image-feature-extraction
8. Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, Sergey Levine. Stochastic Variational Video Prediction. arXiv:1710.11252, 2017.