.. _user_guide-usage_examples-facbook:

Example: facebook
======================

Let's begin with a social network dataset: facebook.
It is very small (only 4039 nodes and 88234 edges), which makes it well suited to trying out the API.
Even without a powerful server, you can process it and run models on your laptop.

The data is included in the XGCN repository: ``data/raw_facebook/facebook_combined.txt``.
You can also download it from SNAP: http://snap.stanford.edu/data/facebook_combined.txt.gz .

---------------------
Data Preparation
---------------------

Before getting started
-------------------------

We recommend arranging the data in a clear directory structure.
Before getting started, you may manually set up an ``XGCN_data`` directory (or any other name you like) as follows.
It is recommended to put ``XGCN_data`` somewhere outside this repository.

.. code::

    XGCN_data
    └── dataset
        └── raw_facebook
            └── facebook_combined.txt

We'll use this directory to hold all the datasets and model outputs.
We refer to its path as ``all_data_root`` in our scripts.

Evaluation sets generation
----------------------------

At first we only have the graph data, ``facebook_combined.txt``, and we want to generate a validation set and a test set.
This can be done with the ``XGCN.data.process.evaluation_set_generation`` module.

Suppose we want a 'one_pos_k_neg' evaluation set for fast validation during training.
We sample 500 positive edges, and 300 negative nodes for each source node.
The script is as follows (**remember to change the paths in the scripts to your own**):

.. code:: shell

    # script/examples/facebook/00-val_set_generation.sh
    # set to your own path:
    file_input_graph='/home/xxx/XGCN_data/dataset/raw_facebook/facebook_combined.txt'
    graph_type='homo'
    graph_format='edge_list'

    seed=1999               # random seed
    num_edge_samples=500    # number of edges to split
    min_src_out_degree=3    # guarantee the minimum out-degree of a source node after the split
    min_dst_in_degree=3     # guarantee the minimum in-degree of a destination node after the split

    # available evaluation_method: 'one_pos_k_neg', 'one_pos_whole_graph', 'multi_pos_whole_graph'
    eval_method='one_pos_k_neg'
    num_neg=300  # the num_neg argument is required when the eval_method='one_pos_k_neg'

    # the output graph will be saved as a text file in edge list format
    # set to your own path:
    file_output_graph='/home/xxx/XGCN_data/dataset/raw_facebook/train.txt'
    file_output_eval_set='/home/xxx/XGCN_data/dataset/raw_facebook/val-one_pos_k_neg.txt'

    python -m XGCN.data.process.evaluation_set_generation \
        --file_input_graph $file_input_graph \
        --file_output_graph $file_output_graph \
        --file_output_eval_set $file_output_eval_set \
        --seed $seed --graph_type $graph_type --graph_format $graph_format \
        --num_edge_samples $num_edge_samples \
        --min_src_out_degree $min_src_out_degree \
        --min_dst_in_degree $min_dst_in_degree \
        --eval_method $eval_method \
        --num_neg $num_neg

After running the script above, you'll get two new files: ``train.txt`` and ``val-one_pos_k_neg.txt``:

.. code::

    XGCN_data
    └── dataset
        └── raw_facebook
            ├── facebook_combined.txt
            ├── train.txt
            └── val-one_pos_k_neg.txt

We further split some edges from ``train.txt`` and generate a test set.
Note that the input graph is now ``train.txt``, so that the validation edges removed in the previous step stay out of the training graph:

.. code:: shell

    # script/examples/facebook/01-test_set_generation.sh
    # set to your own path:
    # the input is the train.txt produced by the previous step
    file_input_graph='/home/xxx/XGCN_data/dataset/raw_facebook/train.txt'
    graph_type='homo'
    graph_format='edge_list'

    seed=2000               # random seed
    num_edge_samples=8000   # number of edges to split
    min_src_out_degree=3    # guarantee the minimum out-degree of a source node after the split
    min_dst_in_degree=3     # guarantee the minimum in-degree of a destination node after the split

    # available evaluation_method: 'one_pos_k_neg', 'one_pos_whole_graph', 'multi_pos_whole_graph'
    eval_method='multi_pos_whole_graph'
    # num_neg=300  # the num_neg argument is required when the eval_method='one_pos_k_neg'

    # the output graph will be saved as a text file in edge list format
    # set to your own path:
    file_output_graph='/home/xxx/XGCN_data/dataset/raw_facebook/train.txt'
    file_output_eval_set='/home/xxx/XGCN_data/dataset/raw_facebook/test-multi_pos_whole_graph.txt'

    python -m XGCN.data.process.evaluation_set_generation \
        --file_input_graph $file_input_graph \
        --file_output_graph $file_output_graph \
        --file_output_eval_set $file_output_eval_set \
        --seed $seed --graph_type $graph_type --graph_format $graph_format \
        --num_edge_samples $num_edge_samples \
        --min_src_out_degree $min_src_out_degree \
        --min_dst_in_degree $min_dst_in_degree \
        --eval_method $eval_method \
        # --num_neg $num_neg \

This time we use the 'multi_pos_whole_graph' evaluation method and split off 8000 edges for fine-grained testing.
The output ``train.txt`` overwrites the previous one, so in the end we have three files:
``train.txt``, ``val-one_pos_k_neg.txt``, and ``test-multi_pos_whole_graph.txt``:

.. code::

    XGCN_data
    └── dataset
        └── raw_facebook
            ├── facebook_combined.txt
            ├── train.txt
            ├── val-one_pos_k_neg.txt
            └── test-multi_pos_whole_graph.txt

Dataset instance generation
-----------------------------

Now we have the complete train/val/test text data and are ready to process it into a dataset instance.

First, let's process the graph:

.. code:: shell

    # in script/examples/facebook/02-instance_generation.sh
    ###### process graph for training
    # set to your own path:
    file_input_graph='/home/xxx/XGCN_data/dataset/raw_facebook/train.txt'
    data_root='/home/xxx/XGCN_data/dataset/instance_facebook'

    mkdir -p $data_root  # make sure the directory exists

    graph_type='homo'
    graph_format='edge_list'

    python -m XGCN.data.process.process_int_graph \
        --file_input_graph $file_input_graph --data_root $data_root \
        --graph_type $graph_type --graph_format $graph_format

Next, we process the validation set and the test set:

.. code:: shell

    # in script/examples/facebook/02-instance_generation.sh
    ###### process validation set
    file_input='/home/xxx/XGCN_data/dataset/raw_facebook/val-one_pos_k_neg.txt'
    file_output='/home/xxx/XGCN_data/dataset/instance_facebook/val-one_pos_k_neg.pkl'
    evaluation_method='one_pos_k_neg'

    python -m XGCN.data.process.process_evaluation_set \
        --file_input $file_input --file_output $file_output \
        --evaluation_method $evaluation_method

    ###### process test set
    file_input='/home/xxx/XGCN_data/dataset/raw_facebook/test-multi_pos_whole_graph.txt'
    file_output='/home/xxx/XGCN_data/dataset/instance_facebook/test-multi_pos_whole_graph.pkl'
    evaluation_method='multi_pos_whole_graph'

    python -m XGCN.data.process.process_evaluation_set \
        --file_input $file_input --file_output $file_output \
        --evaluation_method $evaluation_method

If all the steps above succeeded, your data directory will look like this:

.. code::

    XGCN_data
    └── dataset
        ├── raw_facebook
        |   ├── facebook_combined.txt
        |   ├── train.txt
        |   ├── val-one_pos_k_neg.txt
        |   └── test-multi_pos_whole_graph.txt
        └── instance_facebook
            ├── info.yaml
            ├── indices.pkl
            ├── indptr.pkl
            ├── val-one_pos_k_neg.pkl
            └── test-multi_pos_whole_graph.pkl

Congratulations! Now we have a complete dataset instance and are able to run any model in XGCN!

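
The file names ``indptr.pkl`` and ``indices.pkl`` in the instance directory suggest that the graph is stored in CSR (compressed sparse row) form.
That is an assumption based on the names only; this guide does not describe the file contents.
Under that assumption, the following minimal sketch shows how an edge list maps to the two CSR arrays.
The function names ``edge_list_to_csr`` and ``neighbors`` are illustrative and are not part of XGCN.

.. code:: python

    def edge_list_to_csr(edges, num_nodes):
        """Convert (src, dst) pairs into CSR arrays (indptr, indices)."""
        # Count the out-degree of every source node.
        degree = [0] * num_nodes
        for src, _ in edges:
            degree[src] += 1

        # indices[indptr[i]:indptr[i + 1]] holds the neighbors of node i.
        indptr = [0] * (num_nodes + 1)
        for node in range(num_nodes):
            indptr[node + 1] = indptr[node] + degree[node]

        # Fill indices, tracking the next free slot of each source node.
        indices = [0] * len(edges)
        next_slot = indptr[:-1].copy()
        for src, dst in edges:
            indices[next_slot[src]] = dst
            next_slot[src] += 1
        return indptr, indices


    def neighbors(indptr, indices, node):
        """Return the out-neighbors of a node from the CSR arrays."""
        return indices[indptr[node]:indptr[node + 1]]


    # A toy directed graph with 4 nodes (not the facebook data).
    toy_edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
    indptr, indices = edge_list_to_csr(toy_edges, num_nodes=4)
    print(indptr)   # [0, 2, 3, 3, 4]
    print(indices)  # [1, 2, 2, 0]
    print(neighbors(indptr, indices, 0))  # [1, 2]

The appeal of this layout is that looking up the neighbors of a node is a single slice, which is convenient for neighbor sampling on large graphs.
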
---------------------
Model Running
---------------------

XGCN provides a simple module, ``XGCN.main.run_model``, to run models from the command line.
It has the following contents:

.. code:: python

    import XGCN
    from XGCN.data import io
    from XGCN.utils.parse_arguments import parse_arguments

    import os.path as osp


    def main():

        config = parse_arguments()

        model = XGCN.create_model(config)

        model.fit()

        test_results = model.test()
        print("test:", test_results)
        io.save_json(osp.join(config['results_root'], 'test_results.json'), test_results)


    if __name__ == '__main__':

        main()

The directory ``script/examples/facebook`` contains shell scripts to run all the models.
For example, ``run_xGCN.sh``:

.. code:: shell

    # set to your own path:
    all_data_root='/home/sxr/code/XGCN_and_data/XGCN_data'
    config_file_root='/home/sxr/code/XGCN_and_data/XGCN_library/config'

    dataset=facebook
    model=xGCN
    seed=0
    device='cuda:0'
    emb_table_device=$device
    forward_device=$device
    out_emb_table_device=$device

    data_root=$all_data_root/dataset/instance_$dataset
    results_root=$all_data_root/model_output/$dataset/$model/[seed$seed]

    # file_pretrained_emb=$all_data_root/model_output/$dataset/Node2vec/[seed$seed]/model/out_emb_table.pt

    python -m XGCN.main.run_model --seed $seed \
        --config_file $config_file_root/$model-config.yaml \
        --data_root $data_root --results_root $results_root \
        --val_method one_pos_k_neg \
        --file_val_set $data_root/val-one_pos_k_neg.pkl \
        --key_score_metric r20 \
        --test_method multi_pos_whole_graph \
        --file_test_set $data_root/test-multi_pos_whole_graph.pkl \
        --emb_table_device $emb_table_device \
        --forward_device $forward_device \
        --out_emb_table_device $out_emb_table_device \
        # --from_pretrained 1 --file_pretrained_emb $file_pretrained_emb \

Change ``all_data_root`` and ``config_file_root`` to your own paths, and then you can run it!
The ``results_root`` directory is created automatically.
When training and testing are complete, you'll get the following contents:

.. code::

    XGCN_data
    └── model_output
        └── facebook
            └── xGCN
                └── [seed0]
                    ├── model (directory)       # the best model on the validation set
                    ├── config.yaml             # configurations of the run
                    ├── mean_time.json          # time consumption information in seconds
                    ├── test_results.json       # test results
                    ├── train_record_best.json  # validation results of the best epoch
                    └── train_record.txt        # validation results of all the epochs

-----------------------
The Complete Scripts
-----------------------

All the scripts for the running examples can be found in ``script/examples/facebook``.
Remember to modify the paths in the scripts.

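
Once a run has finished, you may want to read ``test_results.json`` back from the ``results_root`` directory, for example to compare several seeds.
The sketch below rebuilds the ``results_root`` path in the same way as ``run_xGCN.sh`` and loads the file with the standard ``json`` module.
The helper names are hypothetical, and the ``{"r20": 0.0}`` content is a placeholder written to a temporary directory for demonstration only; it is not an actual result, and the real keys of the file may differ.

.. code:: python

    import json
    import os
    import os.path as osp
    import tempfile


    def build_results_root(all_data_root, dataset, model, seed):
        """Mirror the results_root variable defined in run_xGCN.sh."""
        return osp.join(all_data_root, "model_output", dataset, model, f"[seed{seed}]")


    def load_test_results(results_root):
        """Load test_results.json from a results_root directory."""
        with open(osp.join(results_root, "test_results.json")) as f:
            return json.load(f)


    # Demonstration on a temporary directory with placeholder content.
    with tempfile.TemporaryDirectory() as all_data_root:
        root = build_results_root(all_data_root, "facebook", "xGCN", seed=0)
        os.makedirs(root)
        with open(osp.join(root, "test_results.json"), "w") as f:
            json.dump({"r20": 0.0}, f)  # placeholder value, not a real score

        loaded = load_test_results(root)
        print(loaded)

To use it on a real run, pass your own ``all_data_root`` to ``build_results_root`` and skip the temporary-directory part.
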