Example: amazon-book ====================== Here we present a user-item graph example. We run LightGCN and xGCN on the amazon-book dataset, which is used in the LightGCN paper. The results are as follows: +-----------+-----------+----------+----------------+ | | Recall@20 | NDCG@20 | Training Time | +===========+===========+==========+================+ | LightGCN | 0.0409 | 0.0316 | 69,954s | +-----------+-----------+----------+----------------+ | xGCN | 0.0452 | 0.0355 | 8,135s | +-----------+-----------+----------+----------------+ (Training time: time for an epoch \* number of epochs used to achieve the best score in the validation.) The amazon-book dataset can be found in our XGCN repository: ``data/raw_amazon-book/``, which is copied from LightGCN's official code repository: https://github.com/gusye1234/LightGCN-PyTorch. --------------------- Data Preparation --------------------- Before getting started ------------------------- We recommend to arrange the data with a clear directory structure. Before getting started, you may manually setup an ``XGCN_data`` (or other names you like) directory as follows: (It's recommended to put your ``XGCN_data`` somewhere else than in this repository.) .. code:: XGCN_data └── dataset └── raw_amazon-book ├── train.txt └── test.txt We'll use this directory to hold all the different datasets and models outputs. We refer to its path as ``all_data_root`` in our scripts. Dataset instance generation ----------------------------- First, let's process the graph: .. code:: shell ###### process graph for training # set to your own path: file_input_graph='/home/xxx/XGCN_data/dataset/raw_amazon-book/train.txt' data_root='/home/xxx/XGCN_data/dataset/instance_amazon-book' mkdir -p $data_root # make sure to setup the directory graph_type='homo' graph_format='edge_list' python -m XGCN.data.process.process_int_graph \ --file_input_graph $file_input_graph --data_root $data_root \ --graph_type $graph_type --graph_format $graph_format \ Then we process the test set (the LightGCN paper does not provide a validation set): .. code:: shell ###### process test set file_input='/home/xxx/XGCN_data/dataset/raw_amazon-book/test.txt' file_output='/home/xxx/XGCN_data/dataset/instance_amazon-book/test.pkl' evaluation_method='multi_pos_whole_graph' python -m XGCN.data.process.process_evaluation_set \ --file_input $file_input --file_output $file_output \ --evaluation_method $evaluation_method \ The test set is large (52643 source nodes) and the testing process is time-consuming. So here we sample some source nodes for quick validation during the model training. Here is the python script: .. code:: python # script/examples/amazon-book/sample_from_test_set_for_validation.py from XGCN.data import io from XGCN.utils.parse_arguments import parse_arguments import numpy as np def main(): config = parse_arguments() file_input = config['file_input'] file_output = config['file_output'] num_sample = config['num_sample'] test_set = io.load_pickle(file_input) src = test_set['src'] pos_list = test_set['pos_list'] print("number of souce node in the test set:", len(src)) print("num_sample:", num_sample) np.random.seed(1999) idx = np.arange(len(src)) np.random.shuffle(idx) sampled_idx = idx[:num_sample] val_src = src[sampled_idx] val_pos_list = [] pos_list = test_set['pos_list'] for i in sampled_idx: val_pos_list.append(pos_list[i]) val_set = { 'src': val_src, 'pos_list': val_pos_list } io.save_pickle(file_output, val_set) if __name__ == '__main__': main() Here is the corresponding shell script: .. code:: shell ###### sample from the test set python sample_from_test_set_for_validation.py \ --file_input $all_data_root"/dataset/instance_amazon-book/test.pkl" \ --file_output $all_data_root"/dataset/instance_amazon-book/val.pkl" \ --num_sample 3000 \ After the above processing, your data directory will look like this: .. code:: XGCN_data └── dataset ├── raw_amazon-book | ├── train.txt | └── test.txt └── instance_amazon-book ├── info.yaml ├── indices.pkl ├── indptr.pkl ├── val.pkl └── test.pkl The whole processing script can be found in ``script/examples/amazon-book/00-instance_generation.sh``. ----------------- Run LightGCN ----------------- XGCN provides a simple module - ``XGCN.main.run_model`` - to run models from command line. It has the following contents: .. code:: python import XGCN from XGCN.data import io from XGCN.utils.parse_arguments import parse_arguments import os.path as osp def main(): config = parse_arguments() model = XGCN.create_model(config) model.fit() test_results = model.test() print("test:", test_results) io.save_json(osp.join(config['results_root'], 'test_results.json'), test_results) if __name__ == '__main__': main() The following shell script runs LightGCN with ``XGCN.main.run_model`` module and reproduce the results on the amazon-book dataset: .. code:: shell # script/examples/amazon-book/01-run_LightGCN.sh # The results of the following running should be around: # r20:0.0409 || r50:0.0792 || r100:0.1252 || r300:0.2367 || n20:0.0316 || n50:0.0458 || n100:0.0606 || n300:0.0911 # 'r' for 'Recall@', 'n' for 'NDCG@' # set to your own path: all_data_root='/home/sxr/code/XGCN_and_data/XGCN_data' config_file_root='/home/sxr/code/XGCN_and_data/XGCN_library/config' dataset=amazon-book model=LightGCN seed=0 device="cuda:0" graph_device=$device emb_table_device=$device gnn_device=$device out_emb_table_device=$device data_root=$all_data_root/dataset/instance_$dataset results_root=$all_data_root/model_output/$dataset/$model/[seed$seed] # In LightGCN's official code (https://github.com/gusye1234/LightGCN-PyTorch), # for each epoch, there are num_edges samples. For each sample, firstly, a user # is randomly sampled. Then a neighbor (item) of the user is sampled as the positive node. # The amazon-book dataset has 52643 users and 2380730 interactions (edges). # 2380730 / 52643 = 45.22 # To reproduce the LightGCN's setting, in XGCN, we use the # NodeBased_ObservedEdges_Sampler, and set: # str_num_total_samples=num_users # epoch_sample_ratio=45.22 python -m XGCN.main.run_model --seed $seed \ --config_file $config_file_root/$model-full_graph-config.yaml \ --data_root $data_root --results_root $results_root \ --val_method multi_pos_whole_graph \ --file_val_set $data_root/test.pkl \ --test_method multi_pos_whole_graph \ --file_test_set $data_root/test.pkl \ --str_num_total_samples num_users \ --pos_sampler NodeBased_ObservedEdges_Sampler \ --neg_sampler StrictNeg_Sampler \ --epoch_sample_ratio 45.22 \ --num_gcn_layers 2 \ --L2_reg_weight 1e-4 --use_ego_emb_L2_reg 1 \ --emb_lr 0.001 \ --emb_dim 64 \ --train_batch_size 2048 \ --epochs 1000 --val_freq 20 \ --key_score_metric r20 --convergence_threshold 100 \ --graph_device $graph_device --emb_table_device $emb_table_device \ --gnn_device $gnn_device --out_emb_table_device $out_emb_table_device \ ----------------- Run xGCN ----------------- The following shell script runs xGCN with ``XGCN.main.run_model``: .. code:: shell # script/examples/amazon-book/01-run_xGCN.sh # The results of the following running should be around: # r20:0.0452 || r50:0.0844 || r100:0.1302 || r300:0.2398 || n20:0.0355 || n50:0.0501 || n100:0.0650 || n300:0.0951 # 'r' for 'Recall@', 'n' for 'NDCG@' # set to your own path: all_data_root='/home/sxr/code/XGCN_and_data/XGCN_data' config_file_root='/home/sxr/code/XGCN_and_data/XGCN_library/config' dataset=amazon-book model=xGCN seed=0 device='cuda:0' emb_table_device=$device forward_device=$device out_emb_table_device=$device data_root=$all_data_root/dataset/instance_$dataset results_root=$all_data_root/model_output/$dataset/$model/[seed$seed][epoch_sample_ratio1.0] python -m XGCN.main.run_model --seed $seed \ --config_file $config_file_root/$model-config.yaml \ --data_root $data_root --results_root $results_root \ --val_method multi_pos_whole_graph \ --file_val_set $data_root/val.pkl \ --test_method multi_pos_whole_graph \ --file_test_set $data_root/test.pkl \ --emb_table_device $emb_table_device \ --forward_device $forward_device \ --out_emb_table_device $out_emb_table_device \ --epochs 1000 --val_freq 1 --convergence_threshold 100 \ --key_score_metric r20 \ --epoch_sample_ratio 1.0 \ --dnn_arch "[nn.Linear(64, 1024), nn.Tanh(), nn.Linear(1024, 64)]" \ --use_scale_net 0 \ --L2_reg_weight 1e-4 \ --num_gcn_layers 1 \ --stack_layers 1 \ --renew_by_loading_best 1 \ --T 5 \ --K 99999 \ --tolerance 5 \ ----------------------- The Complete Scripts ----------------------- All the scripts of this running example can be found in ``script/examples/amazon-book``. Remember to modify ``all_data_root`` and ``config_file_root`` in the shell scripts to your own paths. After the raw data preparation, you can run all the code by: .. code:: bash cd script/examples/amazon-book bash 00-instance_generation.sh bash 01-run_LightGCN.sh bash 02-run_xGCN.sh