Model Configuration

The model configuration in XGCN is a Python dict containing all the setting parameters. XGCN can parse model configurations from command line arguments and .yaml files. You can also write the dict by hand in a Python script.

Configuration Template

The config/ directory includes .yaml configuration file templates for all the models. Each file contains all the arguments needed to run a model. A typical .yaml configuration file looks like this:

# Dataset/Results root
data_root: ""       # root of the dataset instance
results_root: ""    # root for model outputs, training record, and evaluation results

# Trainer configuration
epochs: 200
use_validation_for_early_stop: 1
val_freq: 1
key_score_metric: r100
convergence_threshold: 20
val_method: ""
val_batch_size: 256
file_val_set: ""

# Testing configuration  (required for model.test())
test_method: ""
test_batch_size: 256
file_test_set: ""

# DataLoader configuration
Dataset_type: BlockDataset
num_workers: 0
num_gcn_layers: 2
train_num_layer_sample: "[10, 20]"
NodeListDataset_type: LinkDataset
pos_sampler: ObservedEdges_Sampler
neg_sampler: RandomNeg_Sampler
num_neg: 1
BatchSampleIndicesGenerator_type: SampleIndicesWithReplacement
train_batch_size: 1024
str_num_total_samples: num_edges
epoch_sample_ratio: 0.1

# Model configuration
model: GraphSAGE
seed: 1999
graph_device: "cuda:0"
emb_table_device: "cuda:0"
gnn_device: "cuda:0"
out_emb_table_device: "cuda:0"
forward_mode: sample
infer_num_layer_sample: "[10, 20]"
from_pretrained: 0
file_pretrained_emb: ""
freeze_emb: 0
use_sparse: 0
emb_dim: 64
emb_init_std: 0.1
emb_lr: 0.005
gnn_arch: "[{'in_feats': 64, 'out_feats': 64, 'aggregator_type': 'pool', 'activation': torch.tanh}, {'in_feats': 64, 'out_feats': 64, 'aggregator_type': 'pool'}]"
gnn_lr: 0.01
loss_type: bpr
L2_reg_weight: 0.0

The configuration consists of five parts:

  1. Dataset/Results root

  2. Trainer configuration

  3. Testing configuration

  4. DataLoader configuration

  5. Model configuration

Dataset/Results root

This part only has two arguments:

  • data_root: (str) The dataset instance root (for dataset instance generation, please refer to Data Preparation). This argument specifies which dataset to use.

  • results_root: (str) The directory in which outputs are saved during model training.

Note that when calling the XGCN.create_model(config) function, the results_root directory will be automatically created if it does not exist.
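
As a minimal sketch of this part, the two roots can be set in a plain Python dict before the model is created. The paths below are placeholders, and the os.makedirs call only imitates the directory creation described above; it is not XGCN's own code.

```python
import os
import tempfile

# Placeholder locations; substitute your own dataset instance and output paths.
base = tempfile.mkdtemp()
config = {
    "data_root": os.path.join(base, "my_dataset_instance"),
    "results_root": os.path.join(base, "my_results"),
}

# Stand-in for the behavior described for XGCN.create_model(config):
# create results_root when it is missing, and leave it alone otherwise.
os.makedirs(config["results_root"], exist_ok=True)

print(os.path.isdir(config["results_root"]))
```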

Trainer configuration

This part controls the training loop:

  • epochs: (int) The maximum number of epochs to run.

  • use_validation_for_early_stop: (bool: 0 or 1) Whether to use validation scores for early stop. If this argument is 1, then the following 6 arguments are required.

  • val_freq: (int) Evaluate the model on the validation set every val_freq epochs.

  • key_score_metric: (str) The metric used for early stop. Once a better result on the key_score_metric is achieved on the validation set, the whole model will be saved. For available metrics, please refer to Model Evaluation.

  • convergence_threshold: (int) If the key_score_metric has not increased for convergence_threshold epochs, then we consider the training has already converged and the early stop is triggered (stop training).

  • val_method: (str) Evaluation method for validation. For evaluation methods, please refer to Model Evaluation.

  • val_batch_size: (int) Batch size for validation.

  • file_val_set: (str) The file of the validation set.
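
The interaction of epochs, val_freq, and convergence_threshold can be pictured with a small simulation. This is a sketch under assumptions, not XGCN's trainer: the function name and the validation scores are made up, patience is counted in epochs since the last improvement, and a tie does not count as an improvement.

```python
def run_with_early_stop(val_scores, epochs, val_freq, convergence_threshold):
    """Simulate the training loop control.

    val_scores[e - 1] is the key_score_metric the model would reach on the
    validation set after epoch e (made-up numbers in this sketch).
    Returns (last_epoch_run, best_epoch, best_score).
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch in range(1, epochs + 1):
        # ... one epoch of training would happen here ...
        if epoch % val_freq != 0:
            continue  # validate only every val_freq epochs
        score = val_scores[epoch - 1]
        if score > best_score:
            # A better key score: this is where the model would be saved.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= convergence_threshold:
            # No improvement for convergence_threshold epochs: stop early.
            return epoch, best_epoch, best_score
    return epochs, best_epoch, best_score


scores = [0.10, 0.12, 0.15, 0.14, 0.15, 0.13, 0.12, 0.11]
print(run_with_early_stop(scores, epochs=len(scores), val_freq=1,
                          convergence_threshold=3))
```

With these numbers the best score is reached at epoch 3, and training stops at epoch 6, three epochs later.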

Testing configuration

This part is optional for model training (i.e. model.fit()) but required for the model.test() function. For more information about testing, please refer to Model Evaluation.

  • test_method: (str) Evaluation method for testing.

  • test_batch_size: (int) Batch size for testing.

  • file_test_set: (str) The file of the test set.
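
Before calling model.fit() or model.test(), it can be useful to check that the arguments each step needs have been filled in. The helper below is a hypothetical sketch, not part of XGCN; treating an empty string (the template's placeholder value) as "not set" is an assumption.

```python
# Argument names are taken from the template above.
VAL_KEYS = ["val_freq", "key_score_metric", "convergence_threshold",
            "val_method", "val_batch_size", "file_val_set"]
TEST_KEYS = ["test_method", "test_batch_size", "file_test_set"]


def missing_keys(config, need_test=False):
    """Return the required arguments that are absent or left empty."""
    required = []
    if config.get("use_validation_for_early_stop"):
        required += VAL_KEYS   # needed only when early stopping is enabled
    if need_test:
        required += TEST_KEYS  # needed only for testing
    return [k for k in required if config.get(k) in (None, "")]


config = {
    "use_validation_for_early_stop": 1,
    "val_freq": 1,
    "key_score_metric": "r100",
    "convergence_threshold": 20,
    "val_method": "",
    "val_batch_size": 256,
    "file_val_set": "",
}
print(missing_keys(config))
print(missing_keys(config, need_test=True))
```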

DataLoader configuration

In general, we consider two types of dataloaders for GNN training:

  1. node-only dataloader: In each mini-batch, returns the needed node IDs: (source nodes, positive nodes, negative nodes).

  2. block dataloader: Not only returns node IDs, but also returns the DGL’s “blocks” (also known as “message flow graph” (MFG)).
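
To make the node-only case concrete, the sketch below builds one mini-batch of (source nodes, positive nodes, negative nodes) from a toy edge list. It is an illustration only: make_batch and its parameters are hypothetical, and the strict flag loosely mirrors the difference between the random and strict negative samplers described later in this section.

```python
import random


def make_batch(edges, num_nodes, batch_size, num_neg=1, strict=False, rng=random):
    """Return (src, pos, neg) ID lists for one mini-batch.

    edges: list of (src, dst) pairs, the observed edges.
    strict=False: negatives are drawn from all nodes (may hit a true neighbor).
    strict=True: negatives are redrawn until they are not a neighbor of src
                 (assumes every node has at least one non-neighbor).
    """
    neighbors = {}
    for s, d in edges:
        neighbors.setdefault(s, set()).add(d)

    src, pos, neg = [], [], []
    for _ in range(batch_size):
        s, d = edges[rng.randrange(len(edges))]  # pick an observed edge
        negs = []
        while len(negs) < num_neg:
            n = rng.randrange(num_nodes)
            if strict and n in neighbors[s]:
                continue  # reject nodes that s has interacted with
            negs.append(n)
        src.append(s)
        pos.append(d)
        neg.append(negs)
    return src, pos, neg


edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
src, pos, neg = make_batch(edges, num_nodes=5, batch_size=4,
                           strict=True, rng=random.Random(0))
print(src, pos, neg)
```

A block dataloader would return the same three ID lists plus the sampled DGL blocks needed to compute their embeddings.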

The node-only dataloader is used in the following cases:

  1. The GNN’s message-passing is performed on the full graph, i.e. embeddings of all the nodes are inferred in a mini-batch.

  2. Additional graph information is not needed. For example, the PPRGo model uses the top-k PPR neighbors of each node, and the neighbors are held by the model itself. As another example, the UltraGCN model does not use message-passing, so the node IDs are enough for batch training.

The block dataloader is used for graph sampling when training on large graphs (please refer to DGL docs: Chapter 6: Stochastic Training on Large Graphs for more information). In each mini-batch, it returns node IDs and the needed DGL “blocks”.

For some GNNs, XGCN provides both “full graph message-passing” and “block message-passing” training methods. Their configuration templates are included in the config/ directory. For example:

config
├── LightGCN-full_graph-config.yaml
├── LightGCN-block-config.yaml
├── GraphSAGE-full_graph-config.yaml
├── GraphSAGE-block-config.yaml
...

The “full graph message-passing” training uses the node-only dataloader, and the “block message-passing” training uses the block dataloader.

The configuration arguments of the two dataloaders are as follows:

####### for node-only dataloader #######
# DataLoader configuration
Dataset_type: NodeListDataset  # fixed
num_workers: 0
NodeListDataset_type: LinkDataset  # fixed
pos_sampler: ObservedEdges_Sampler
neg_sampler: RandomNeg_Sampler
num_neg: 1
BatchSampleIndicesGenerator_type: SampleIndicesWithReplacement
train_batch_size: 1024
str_num_total_samples: num_edges
epoch_sample_ratio: 0.1

####### for block dataloader #######
# DataLoader configuration
Dataset_type: BlockDataset  # fixed
num_workers: 0
num_gcn_layers: 2
train_num_layer_sample: "[10, 20]"
NodeListDataset_type: LinkDataset  # fixed
pos_sampler: ObservedEdges_Sampler
neg_sampler: RandomNeg_Sampler
num_neg: 1
BatchSampleIndicesGenerator_type: SampleIndicesWithReplacement
train_batch_size: 1024
str_num_total_samples: num_edges
epoch_sample_ratio: 0.1

The meanings of the arguments are as follows:

  • Dataset_type: (str) This argument is fixed as “NodeListDataset” for node-only dataloader, and is fixed as “BlockDataset” for block dataloader.

  • NodeListDataset_type: (str) This field is fixed as “LinkDataset”.

  • num_workers: (int) Number of workers for dataloading. 0 means loading data in the main process. Set to 0 if the graph is on GPU.

  • num_gcn_layers: (int) Number of GNN(GCN) layers. This argument is required for the block dataloader.

  • train_num_layer_sample: (str) Number of nodes to sample in each layer during training. For example, “[10, 20]” means 10 nodes in the first layer and 20 nodes in the second layer. This argument is required for the block dataloader.

  • pos_sampler: (str) Positive sampler. Available options:

    • “ObservedEdges_Sampler”: given edge IDs, return the edges.

    • “NodeBased_ObservedEdges_Sampler”: given node IDs, sample a neighbor for each node.

  • neg_sampler: (str) Negative sampler. Available options:

    • “RandomNeg_Sampler”: random sampling from all the nodes (from all the item nodes for user-item graphs).

    • “StrictNeg_Sampler”: sample strictly un-interacted nodes.

  • num_neg: (int) Number of negative samples for each positive sample.

  • str_num_total_samples: (str) The total number of IDs from which samples are generated. Available options:

    • “num_edges”: sample from all the edges for training, this is required by “ObservedEdges_Sampler”;

    • “num_nodes”: first sample a node, then sample a neighbor from it. This is required by “NodeBased_ObservedEdges_Sampler”;

    • “num_users”: This is required by the “NodeBased_ObservedEdges_Sampler” when the graph is a user-item network.

  • epoch_sample_ratio: (float) The number given by str_num_total_samples can be large, e.g. the number of edges in a graph. Setting epoch_sample_ratio to a value between 0 and 1 shrinks the number of samples in an epoch to epoch_sample_ratio times that number. Setting it larger than 1 expands the number of samples.

  • BatchSampleIndicesGenerator_type: (str) The way to generate sample IDs in a batch. Available options:

    • “SampleIndicesWithReplacement”: sampling with replacement, e.g. sampling from all the edges with replacement;

    • “SampleIndicesWithoutReplacement”: sampling without replacement, e.g. all the edges are guaranteed to be sampled within a number of epochs.

  • train_batch_size: (int) training batch size.
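
The arithmetic behind epoch_sample_ratio and the two index generators can be sketched as follows. All names here are hypothetical and the code is not taken from XGCN; rounding the per-epoch sample count down is an assumption.

```python
import random


def samples_per_epoch(num_total_samples, epoch_sample_ratio):
    # Rounding down is an assumption of this sketch.
    return int(num_total_samples * epoch_sample_ratio)


def draw_with_replacement(num_total, n, rng):
    """Every index is drawn independently, so repeats can occur."""
    return [rng.randrange(num_total) for _ in range(n)]


class DrawWithoutReplacement:
    """Consume a shuffled pool of all indices; reshuffle only when it is empty."""

    def __init__(self, num_total, rng):
        self.num_total = num_total
        self.rng = rng
        self.pool = []

    def draw(self, n):
        out = []
        while len(out) < n:
            if not self.pool:
                self.pool = list(range(self.num_total))
                self.rng.shuffle(self.pool)
            out.append(self.pool.pop())
        return out


rng = random.Random(0)
num_edges = 10                            # stands in for "num_edges"
n = samples_per_epoch(num_edges, 0.5)     # 5 samples per epoch

sampler = DrawWithoutReplacement(num_edges, rng)
epoch_1, epoch_2 = sampler.draw(n), sampler.draw(n)
print(sorted(epoch_1 + epoch_2))          # every edge index exactly once

print(draw_with_replacement(num_edges, n, rng))  # may contain duplicates
```

With a ratio of 0.5, two epochs without replacement cover every edge once, whereas sampling with replacement gives no such guarantee.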

Model configuration

This part specifies the model configuration such as hyper-parameters. Please refer to Supported Models for the detailed explanation of each model.

Load config from yaml file

We can load a .yaml configuration file with the XGCN.data.io module:

import XGCN
from XGCN.data import io

config = io.load_yaml('config.yaml')  # load template
config['data_root'] = ...             # add/modify some configurations
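
Note that list-valued arguments such as train_num_layer_sample are quoted in the template, so a standard YAML parser reads them as strings. The sketch below assumes the loaded dict keeps them as strings and shows one way to turn such a value into a Python list when inspecting a configuration. How XGCN converts these values internally is not shown here.

```python
import ast

# What a loaded template might contain for these two arguments.
config = {
    "train_num_layer_sample": "[10, 20]",
    "num_gcn_layers": 2,
}

# The quoted value is a string, not a list.
print(type(config["train_num_layer_sample"]).__name__)

# ast.literal_eval safely parses Python literals (lists, numbers, strings).
layer_sample = ast.literal_eval(config["train_num_layer_sample"])
print(layer_sample)

# One entry per GNN layer is expected for the block dataloader.
print(len(layer_sample) == config["num_gcn_layers"])
```

ast.literal_eval only accepts literals, so it cannot parse a value like the template's gnn_arch, which refers to torch.tanh.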

Parse config from command line

We also provide a parse_arguments() function to parse command line arguments:

import XGCN
from XGCN.utils.parse_arguments import parse_arguments

config = parse_arguments()

You can specify a .yaml configuration file with --config_file. A configuration file is not required by parse_arguments(). When the same argument is given both in the file and on the command line, the command line value takes priority.
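
The priority rule can be illustrated with a small standalone sketch. It is not XGCN's parse_arguments(): merge_config is a hypothetical helper, only three arguments from the template are declared, and a plain dict stands in for the contents of the .yaml file.

```python
import argparse


def merge_config(file_config, argv):
    """Overlay command line values on top of values read from a file."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--emb_lr", type=float)
    parser.add_argument("--results_root", type=str)
    args = parser.parse_args(argv)

    config = dict(file_config)  # start from the file values
    for key, value in vars(args).items():
        if value is not None:   # only arguments actually given on the command line
            config[key] = value
    return config


file_config = {"epochs": 200, "emb_lr": 0.005, "results_root": ""}
config = merge_config(file_config, ["--epochs", "50", "--results_root", "out/run1"])
print(config)
```

Here epochs and results_root come from the command line, while emb_lr keeps the value from the file.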