How to Deal with Graph Data in Deep Learning using Graph Convolutional Networks (GCN)
- Introduction to Graph Convolutional Networks (Why GCN?)
- A Brief History of GCN
- Definition: What is a graph?
- What does a GCN do?
- PyTorch implementation of GCN
In a previous post, I explained the generation of molecules using the SMILES dataset. Here I want to explore what happens when we have to work on graph data directly.
SMILES strings are generated from a graph-based representation of molecules, so working in the original graph space has the benefit of removing this additional overhead. With recent progress in deep learning on graphs, training deep generative models directly on graph representations has become a feasible alternative that has been explored in a range of recent works.
Why GCN? Recently there has been increasing interest in extending deep learning approaches to graph data. Driven by the success of deep learning, researchers have borrowed ideas from convolutional networks, recurrent networks, and deep autoencoders to design the architectures of graph neural networks. While deep learning has achieved great success on Euclidean data, there is an increasing number of applications where data are generated from non-Euclidean domains and need to be effectively analyzed. For instance, in e-commerce, a graph-based learning system can exploit the interactions between users and products to make highly accurate recommendations. In chemistry, molecules are modeled as graphs and their bio-activity needs to be identified for drug discovery. User data on social networks, gene data on biological regulatory networks, log data on telecommunication networks, and text documents on word embeddings are important examples of data lying on irregular or non-Euclidean domains that can be structured as graphs, which are universal representations of heterogeneous pairwise relationships. Graphs can encode complex geometric structures and can be studied with strong mathematical tools such as spectral graph theory. The complexity of graph data has imposed significant challenges on existing machine learning algorithms, because graph data are irregular: each graph has a variable number of unordered nodes, and each node has a different number of neighbors, so some important operations (e.g., convolutions) that are easy to compute in the image domain are not directly applicable to the graph domain. To handle this complexity, new generalizations and definitions of important operations have been developed rapidly over the past few years. The figure below illustrates how a form of graph convolution is inspired by a standard 2D convolution. The rest of this post gives a brief overview of these ideas before moving to a PyTorch implementation.

(a) 2D Convolution. Analogous to a graph, each pixel in an image can be taken as a node whose neighbors are determined by the filter size. The 2D convolution takes a weighted average of the pixel values of the red node and its neighbors; the neighbors of a node are ordered and have a fixed size. (b) Graph Convolution. To obtain a hidden representation of the red node, a simple graph convolution operation takes the average of the node features of the red node and its neighbors. Unlike image data, the neighbors of a node are unordered and variable in number.
A Brief History of Graph Convolutional Networks: GCNs are a very powerful neural network architecture for machine learning on graphs. They perform convolution directly in the graph domain by aggregating the information of neighboring nodes. Together with sampling strategies, the computation can be performed on a batch of nodes instead of the whole graph, which has the potential to improve efficiency. In addition to graph convolutional networks, many alternative graph neural networks have been developed in the past few years, including graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks.
Graph neural networks vs. network embedding: research on graph neural networks is closely related to graph embedding, or network embedding. Network embedding aims to represent network vertices in a low-dimensional vector space while preserving both the network topology and node content information, so that subsequent graph analytics tasks such as classification, clustering, and recommendation can be performed with simple off-the-shelf machine learning algorithms (e.g., support vector machines for classification). Many network embedding algorithms are unsupervised and can be broadly classified into three groups: matrix factorization, random walks, and deep learning approaches. The deep learning approaches for network embedding also belong to graph neural networks; they include graph autoencoder-based algorithms (e.g., DNGR and SDNE) and graph convolutional networks with unsupervised training (e.g., GraphSAGE).
DEFINITION – What is a graph? A graph $G = (V, E)$ consists of a set of vertices $V$ and a set of edges $E$. The vertices are often called nodes; we write $v_i \in V$ to denote a node and $e_{i,j} = (v_i, v_j) \in E$ to denote an edge. Edges can be either directed or undirected, depending on whether there are directional dependencies between vertices. The degree of a node is the number of edges connected to it.
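As a concrete toy example (the graph and its values are made up for illustration only), a small undirected graph can be represented by its vertex set, edge list, adjacency matrix, and node degrees:
import numpy as np

# A toy undirected graph with 4 nodes and 4 edges (illustrative only).
V = [0, 1, 2, 3]
E = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Build the adjacency matrix A: A[i, j] = 1 if (v_i, v_j) is an edge.
A = np.zeros((len(V), len(V)))
for i, j in E:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph, so A is symmetric

degrees = A.sum(axis=1)  # the degree of a node is the number of edges touching it
print(degrees)           # [2. 2. 3. 1.]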
A GCN takes as input:
- a feature vector $x_i$ for every node $i$, summarized in an $N \times D$ feature matrix $X$ ($N$: number of nodes, $D$: number of input features);
- a description of the graph structure in matrix form, typically an adjacency matrix $A$ (or some function thereof). The adjacency matrix $A$ has size $N \times N$, with $A_{i,j} = 1$ if there is an edge from vertex $v_i$ to vertex $v_j$ and $A_{i,j} = 0$ otherwise. In this case we say that vertex $v_i$ has position $i$ in $A$, and if $A_{i,j} = 1$ we say that $v_i$ and $v_j$ are adjacent.
Every neural network layer can then be written as a non-linear function
$H^{(l+1)} = f(H^{(l)}, A),$
with $H^{(0)} = X$ and $H^{(L)} = Z$ (or $Z$ for graph-level outputs), where $L$ is the number of layers and $f$ is a propagation rule. Each layer $H^{(l)}$ corresponds to an $N \times D^{(l)}$ feature matrix in which each row is a feature representation of a node.
Like any other stacked neural network, a GCN can have multiple layers. For each layer,
$H^{(l+1)} = f(H^{(l)}, A),$
where the input matrix $A$ is the adjacency matrix: the element $A_{i,j}$ is 1 if there is a connection between nodes $i$ and $j$, and 0 otherwise.
$H^{(l)}$ is the output of layer $l$. In other words, to generate the output of the next layer, we take the current layer's output together with the adjacency matrix and apply a non-linear function $f$.
The final output of the GCN at the last layer is a matrix $Z$ of shape $N \times F$, where $N$ is the number of nodes and $F$ is the number of output features per node.
Initially, $H^{(0)} = X$, where $X$ is the matrix of node features with shape $N \times D$ and $D$ is the number of input features. So to start the GCN we provide $A$ and $X$, and the final output $Z$ provides a meaningful embedding for each node in the graph. A simple choice for the function is to apply a ReLU and add a weight matrix $W^{(l)}$ to each layer, so that the rule becomes $H^{(l+1)} = \sigma(A H^{(l)} W^{(l)})$.
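As a minimal sketch of this naive rule (reusing the toy adjacency matrix from above, with arbitrary feature and weight values), one propagation step looks like this:
import numpy as np

np.random.seed(0)

X = np.array([[0., 1.],
              [1., 0.],
              [1., 1.],
              [2., 0.]])                    # N x D node feature matrix
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # N x N adjacency matrix
W = np.random.randn(2, 3)                   # D x F weight matrix for this layer

# Naive propagation rule: H(1) = ReLU(A X W)
H1 = np.maximum(A @ X @ W, 0)
print(H1.shape)                             # (4, 3): each node now has 3 hidden features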
At each layer, these features are aggregated to form the next layer's features using the propagation rule $f$; in this way the features become increasingly abstract at each consecutive layer. Here $W^{(l)}$ is the weight matrix of the $l$-th neural network layer and $\sigma(\cdot)$ is a non-linear activation function such as the ReLU. The weight matrix has dimensions $F^{(l)} \times F^{(l+1)}$; in other words, the size of its second dimension determines the number of features at the next layer. However, simply multiplying by the adjacency matrix $A$ leads to two problems. The first is that $A$ does not take the node itself into account unless it is self-connected: multiplication with $A$ means that, for every node, we sum up the feature vectors of all neighboring nodes but not the node's own features (unless the graph has self-loops). In this case we force the diagonal elements to be 1, i.e. we add an identity matrix $I$ to $A$: $\hat{A} = A + I$. Since each node is now a neighbor of itself, its own features are included when summing up the features of its neighbors.
The second major limitation is that $A$ is typically not normalized, so multiplication with $A$ will completely change the scale of the feature vectors. The feature representations can be normalized by node degree, transforming the adjacency matrix $A$ by multiplying it with the inverse of the degree matrix $D$:
D = np.asarray(A.sum(axis=0)).flatten()   # degree of each node
D = np.diag(D)                            # diagonal degree matrix
Normalizing $A$ such that all rows sum to one, i.e. computing $D^{-1}A$ where $D$ is the diagonal node degree matrix, gets rid of this problem. Multiplying with $D^{-1}A$ now corresponds to taking the average of neighboring node features, turning the aggregation into a mean whose scale is invariant to node degree: the representation of each node (each row) is now the mean of its neighbors' features.
In other words, the graph convolutional layer represents each node as an aggregate of its neighborhood.
In practice, dynamics get more interesting when we use a symmetric normalization, i.e. $\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$ (as this no longer amounts to mere averaging of neighboring nodes). Combining these two tricks, we essentially arrive at the propagation rule:
$f(H^{(l)}, A) = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}),$
with $\hat{A} = A + I$, where $I$ is the identity matrix and $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$. The weight matrix $W^{(l)}$ is learned during training.
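A minimal NumPy sketch of the two tricks (self-loops plus symmetric normalization), continuing the toy adjacency matrix from above:
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_hat = A + np.eye(A.shape[0])              # add self-loops: A^ = A + I
d_hat = A_hat.sum(axis=1)                   # degrees of A^
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_hat))  # D^(-1/2)

A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # D^(-1/2) A^ D^(-1/2)
# A_norm @ H @ W is exactly the propagation used by the GCN layer implemented below.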
What does a GCN do?
To my understanding, a GCN extracts features for each node by representing it with a vector, much like word embeddings represent words; each node is represented by looking at its neighboring nodes, i.e. at the structure around it. Graph convolutional networks play a central role in building up many other, more complex graph neural network models, including autoencoder-based models, generative models, and spatial-temporal networks.
Definition 2 (Directed graph): a directed graph is a graph with all edges pointing from one node to another. For a directed graph, $A_{i,j} \neq A_{j,i}$ in general.
An undirected graph is a graph with all edges undirected. For an undirected graph, $A_{i,j} = A_{j,i}$.
Semi-supervised learning for node-level classification: given a single network with some nodes labeled and the others unlabeled, GCNs can learn a robust model that effectively identifies the class labels of the unlabeled nodes. To this end, an end-to-end framework can be built by stacking a couple of GCN layers followed by a softmax layer for multi-class classification.
Supervised learning for graph-level classification: given a graph dataset, graph-level classification aims to predict the class label(s) of an entire graph. End-to-end learning for this task can be done with a framework that combines graph convolutional layers and a pooling procedure. Specifically, by applying graph convolutional layers we obtain a representation with a fixed number of dimensions for each node in every graph. We then obtain the representation of the entire graph through pooling, which summarizes the representation vectors of all nodes in the graph. Finally, by applying linear layers and a softmax layer, we can build an end-to-end framework for graph classification. A sketch of this graph-level pipeline is given below.
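The following is only a rough sketch of that graph-level pipeline, not code from the original implementation: the class name GraphClassifier, the mean pooling, and the use of nn.Linear as a stand-in for a full graph convolutional layer are all illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphClassifier(nn.Module):
    """Sketch: GCN-style layer -> mean pooling over nodes -> linear + softmax."""
    def __init__(self, nfeat, nhid, nclass):
        super().__init__()
        self.W1 = nn.Linear(nfeat, nhid, bias=False)  # stand-in for a GCN layer weight
        self.out = nn.Linear(nhid, nclass)

    def forward(self, X, A_norm):
        # A_norm is assumed to be the normalized adjacency matrix as a dense tensor.
        H = F.relu(A_norm @ self.W1(X))  # node-level representations, shape N x nhid
        g = H.mean(dim=0)                # pooling: summarize all nodes into one graph vector
        return F.log_softmax(self.out(g), dim=-1)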

Unsupervised learning for graph embedding: when no class labels are available, we can learn the graph embedding in a purely unsupervised, end-to-end way. These algorithms exploit edge-level information in two ways. One simple way is to adopt an autoencoder framework in which the encoder uses graph convolutional layers to embed the graph into a latent representation, and a decoder is used to reconstruct the graph structure. Another way is to use negative sampling, which samples a portion of node pairs as negative pairs while node pairs with existing links in the graph are the positive pairs; a logistic regression layer is then applied after the convolutional layers for end-to-end learning. A sketch of the autoencoder variant is given below.
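Again as a hedged sketch (the class name GraphAutoEncoder and the single-layer encoder are assumptions; the idea follows the graph autoencoder approach described above), the autoencoder variant could look like:
import torch
import torch.nn as nn

class GraphAutoEncoder(nn.Module):
    """Sketch: GCN-style encoder + inner-product decoder that reconstructs the adjacency."""
    def __init__(self, nfeat, nembed):
        super().__init__()
        self.W = nn.Linear(nfeat, nembed, bias=False)

    def forward(self, X, A_norm):
        Z = A_norm @ self.W(X)            # encoder: one GCN-style layer gives node embeddings
        A_rec = torch.sigmoid(Z @ Z.t())  # decoder: inner product gives edge probabilities
        return Z, A_rec

# Training would minimize a reconstruction loss, e.g.
# torch.nn.functional.binary_cross_entropy(A_rec, A), so that A_rec matches the true adjacency A.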
PyTorch implementation of GCN: to implement a GCN in PyTorch, we first load the dataset, apply some simple preprocessing, build the adjacency matrix, and normalize it.
import numpy as np
import scipy.sparse as sp
import torch
def encode_onehot(labels):
    classes = set(labels)
    classes_dict = {c: np.identity(len(classes))[i, :] for i, c in enumerate(classes)}
    labels_onehot = np.array(list(map(classes_dict.get, labels)), dtype=np.int32)
    return labels_onehot
def load_data(dataset="cora"):
    # each row: node id, sparse bag-of-words features, class label
    idx_features_labels = np.genfromtxt("{}.content".format(dataset),
                                        dtype=np.dtype(str))
    features = sp.csr_matrix(idx_features_labels[:, 1:-1], dtype=np.float32)
    labels = encode_onehot(idx_features_labels[:, -1])

    # build graph
    idx = np.array(idx_features_labels[:, 0], dtype=np.int32)
    idx_map = {j: i for i, j in enumerate(idx)}
    edges_unordered = np.genfromtxt("{}.cites".format(dataset),
                                    dtype=np.int32)
    edges = np.array(list(map(idx_map.get, edges_unordered.flatten())),
                     dtype=np.int32).reshape(edges_unordered.shape)
    adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])),
                        shape=(labels.shape[0], labels.shape[0]),
                        dtype=np.float32)

    # build symmetric adjacency matrix
    adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)

    features = normalize(features)
    adj = normalize(adj + sp.eye(adj.shape[0]))  # add self-loops, then normalize

    idx_train = range(140)
    idx_val = range(200, 500)
    idx_test = range(500, 1500)

    features = torch.FloatTensor(np.array(features.todense()))
    labels = torch.LongTensor(np.where(labels)[1])
    adj = sparse_mx_to_torch_sparse_tensor(adj)
    idx_train = torch.LongTensor(idx_train)
    idx_val = torch.LongTensor(idx_val)
    idx_test = torch.LongTensor(idx_test)
    return adj, features, labels, idx_train, idx_val, idx_test
def normalize(mx):
    """Row-normalize sparse matrix."""
    rowsum = np.array(mx.sum(1))
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    mx = r_mat_inv.dot(mx)
    return mx
def accuracy(output, labels):
    """Fraction of predictions that match the true labels."""
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)
def sparse_mx_to_torch_sparse_tensor(sparse_mx):
    """Convert a scipy sparse matrix to a torch sparse tensor."""
    sparse_mx = sparse_mx.tocoo().astype(np.float32)
    indices = torch.from_numpy(
        np.vstack((sparse_mx.row, sparse_mx.col)).astype(np.int64))
    values = torch.from_numpy(sparse_mx.data)
    shape = torch.Size(sparse_mx.shape)
    return torch.sparse.FloatTensor(indices, values, shape)
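With these helpers defined, loading the data is a one-liner (a usage sketch; it assumes the Cora files cora.content and cora.cites are in the working directory):
adj, features, labels, idx_train, idx_val, idx_test = load_data()
print(features.shape)           # torch.Size([2708, 1433]) for Cora
print(adj.shape)                # torch.Size([2708, 2708])
print(labels.max().item() + 1)  # 7 classes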
Next, we build a simple GCN layer:
import math
import torch
from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module
class GraphConvolution(Module):
    """Simple GCN layer: output = adj @ (input @ weight) + bias."""
    def __init__(self, in_features, out_features, bias=True):
        super(GraphConvolution, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.FloatTensor(in_features, out_features))
        if bias:
            self.bias = Parameter(torch.FloatTensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input, adj):
        support = torch.mm(input, self.weight)   # X W
        output = torch.spmm(adj, support)        # A_hat (X W), adj is a sparse tensor
        if self.bias is not None:
            return output + self.bias
        else:
            return output

    def __repr__(self):
        return self.__class__.__name__ + ' (' \
               + str(self.in_features) + ' -> ' \
               + str(self.out_features) + ')'
The GCN model (model.py):
#model
import torch.nn as nn
import torch.nn.functional as F
class GCN(nn.Module):
    def __init__(self, nfeat, nhid, nclass, dropout):
        super(GCN, self).__init__()
        self.gc1 = GraphConvolution(nfeat, nhid)
        self.gc2 = GraphConvolution(nhid, nclass)
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gc2(x, adj)
        return F.log_softmax(x, dim=1)
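With the model in place, a minimal training and evaluation loop looks roughly like the following (a sketch only; the hyperparameter values such as 16 hidden units, learning rate 0.01, weight decay 5e-4, and 200 epochs are illustrative):
import torch.optim as optim

adj, features, labels, idx_train, idx_val, idx_test = load_data()

model = GCN(nfeat=features.shape[1], nhid=16,
            nclass=labels.max().item() + 1, dropout=0.5)
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    output = model(features, adj)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer.step()
    print("Epoch {:03d} loss {:.4f} acc {:.4f}".format(
        epoch + 1, loss_train.item(), acc_train.item()))

# Evaluate on the test split
model.eval()
output = model(features, adj)
print("Test loss: {:.4f}".format(F.nll_loss(output[idx_test], labels[idx_test]).item()))
print("Test accuracy: {:.4f}".format(accuracy(output[idx_test], labels[idx_test]).item()))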
As shown above, after implementing the GCN model we declare the hyperparameters; for training we use the Adam optimizer and compute the loss and accuracy at every epoch, and at test time we report the test loss and accuracy. For the full code, please visit GitHub. We used a GCN in the MolGAN model to deal with graph data during molecule generation.