
0.5 Mid-way Goals.
By midway we will try to understand how STGCN works and how it can be implemented for video prediction.
By the end we will try to obtain results using STGCN.
MIDWAY
Zhanxing Zhu and co-workers propose a novel deep learning framework, the Spatio-Temporal Graph Convolutional Network (STGCN), to tackle time-series prediction problems such as those in the traffic domain. Instead of applying regular convolutional and recurrent units, they formulate the problem on graphs and build the model with complete convolutional structures. Purely convolutional structures are applied to extract spatio-temporal features simultaneously from graph-structured data; we adopt this idea for video prediction, as it was done for time series in the traffic study. The fully convolutional structure enables much faster training with fewer parameters. [1]
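As an illustrative sketch (ours, not the authors' implementation), the graph-convolutional part of such a model can be approximated by a first-order graph convolution. The NumPy function below assumes node features X, an adjacency matrix A and a weight matrix W; the actual layers in [1] use Chebyshev polynomial approximations.

import numpy as np

def graph_conv(X, A, W):
    """One first-order graph convolution: aggregate each node's neighbours
    through the normalised adjacency, then apply a shared linear map W.
    X: (num_nodes, in_features), A: (num_nodes, num_nodes), W: (in_features, out_features)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalisation
    return np.maximum(A_norm @ X @ W, 0.0)      # propagate and apply ReLU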
To take full advantage of spatial features, we use a convolutional neural network (CNN) to capture the adjacency relations among the frames of a video, while a recurrent neural network (RNN) would normally be applied along the time axis. To handle the inherent deficiencies of recurrent networks, we instead employ a fully convolutional structure along the time axis.
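A minimal sketch of this idea, assuming per-frame feature vectors have already been extracted by a CNN: a 1-D convolution along the time axis replaces the recurrent unit. The layer sizes and PyTorch implementation below are illustrative, not taken from [1].

import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Fully convolutional block along the time axis: a 1-D convolution over
    per-frame feature vectors, used in place of a recurrent unit."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Conv1d expects input of shape (batch, channels, time)
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):
        # x: (batch, in_channels, n_frames) -> (batch, out_channels, n_frames - kernel_size + 1)
        return torch.relu(self.conv(x))

# Hypothetical usage: 64-dim CNN features for 10 frames, batch of 2
frames = torch.randn(2, 64, 10)
out = TemporalConvBlock(64, 32)(frames)   # shape: (2, 32, 8)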
In today’s world there are cameras on every street, so an immense amount of video data is available online, and automated tools are needed to handle such large-scale video data. Here, video summarization becomes very helpful: it condenses the original video into a short summary while preserving the important information. Many summary formats are available. In paper [2], Xuelong Li and co-workers use the methodology of video skimming, where the summary is formed by several key-shots. Many approaches have been proposed for video summarization, but recurrent neural networks (RNNs) have provided significant advantages. These approaches take video data as a sequence of frames and summarize the video by exploiting temporal dependencies. An RNN is able to capture local dependencies but fails to capture global dependencies, as it is easily distracted by noise. Video data is layered as frames and shots: shots are the intermediate level between the video and the frame, formed from several consecutive frames. The frames within a shot are easily modeled as a temporal sequence using an RNN, but the information between adjacent shots can vary largely, so sequence models that only take neighborhood dependencies into account can cause interference. A sequence is itself a special type of graph in which only consecutive items are connected. To capture both local and global dependencies it is better to model the video shots as a complete graph: all shots are connected as graph nodes, and the dependencies are calculated from the interaction between each pair of shots (a sketch of this is given below). Paper [2] is really helpful for understanding the basics of frames and shots and for learning about useful video techniques. These techniques, combined with other algorithms, can be useful in video prediction.
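To illustrate the complete-graph view of shots, the following sketch (our illustration, not the method of [2]) builds a fully connected affinity matrix over shot feature vectors, using cosine similarity as the pairwise interaction.

import numpy as np

def shot_affinity(shot_features):
    """Build a complete graph over shots: every pair of shots is connected,
    with edge weight given by the cosine similarity of their feature vectors.
    shot_features: (num_shots, feature_dim)."""
    norms = np.linalg.norm(shot_features, axis=1, keepdims=True)
    normalised = shot_features / np.maximum(norms, 1e-8)
    return normalised @ normalised.T        # (num_shots, num_shots) affinity matrix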
For our project the task is video prediction. Video frame prediction is the task of predicting subsequent frames, given a sequence of video frames. Formally, it can be defined as follows. Let $X_t \in \mathbb{R}^{w \times h \times c}$ be the $t$-th frame in the input video sequence $X = (X_{t-n}, \ldots, X_{t-1}, X_t)$ consisting of $n$ frames. Here, $w$, $h$, and $c$ denote the width, height, and number of channels respectively. The target is to predict the subsequent $m$ frames $Y = (Y_{t+1}, Y_{t+2}, \ldots, Y_{t+m})$ with clarity and high precision, using the input $X$. [4]
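As a concrete illustration of these shapes (the values of n, m, w, h, and c below are hypothetical):

import numpy as np

n, m = 10, 5                 # number of observed and predicted frames (example values)
w, h, c = 64, 64, 3          # frame width, height, and channels (example values)

X = np.zeros((n, w, h, c))   # input sequence  (X_{t-n}, ..., X_t)
Y = np.zeros((m, w, h, c))   # target sequence (Y_{t+1}, ..., Y_{t+m})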
0.6 Problem Formulation
Video frame prediction is a typical time-series prediction problem, i.e. predicting the most likely video frames in the next $H$ time steps given the previous $M$ frame observations, as
$$\hat{v}_{t+1}, \ldots, \hat{v}_{t+H} = \arg\max_{v_{t+1}, \ldots, v_{t+H}} \log P(v_{t+1}, \ldots, v_{t+H} \mid v_{t-M+1}, \ldots, v_t), \qquad (1)$$
where $v_t \in \mathbb{R}^n$ is an observation vector of $n$ image frames at time step $t$.
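Under a commonly used simplifying assumption (ours, not stated in Eq. (1)), the conditional distribution factorises over time and each factor is Gaussian with fixed variance, so maximising the log-likelihood reduces to minimising the squared error between predicted and true frames:
$$\log P(v_{t+1}, \ldots, v_{t+H} \mid v_{t-M+1}, \ldots, v_t)
  = \sum_{h=1}^{H} \log P\bigl(v_{t+h} \mid v_{t-M+1}, \ldots, v_{t+h-1}\bigr)
  \;\propto\; -\sum_{h=1}^{H} \bigl\lVert v_{t+h} - \hat{v}_{t+h} \bigr\rVert_2^2 + \text{const},$$
hence such a model can be trained with a mean-squared-error loss over the $H$ predicted frames.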
In this project, we define the prediction problem on a graph and focus on structured image frames of the videos. The observations $v_t$ are not independent but connected by pairwise connections in the graph. Therefore, the data point