Training
=======================

Train Voxel-Dynamics graph constructor
----------------------------------------

The voxelized space(-time) Voronoi division & connectivity ``C-matrix`` estimator adaptively takes into account the dynamics of particle tracks (equations of motion) as described by the simulations (training data), where they appear in space-time, and the detector geometry. For example:

.. code-block:: none

    python src/trackml_pretrain_voxdyn.py --node2node hyper --ncell 65536 131072 --device cpu

Once trained, the estimator's true (false) edge efficiency is pile-up invariant (!) by construction, but the purity is not. The Voronoi division quality saturates quickly with the training sample size (number of tracks ~ hits), whereas the connectivity ``C-matrix`` estimate gains significantly in accuracy with larger samples.

Train neural GNN + Transformer model
----------------------------------------

The training first proceeds only with the GNN edge predictor part of the network; once the AUC exceeds ``--auc_threshold``, the neural clustering transformer training is activated end-to-end. The training for the track reconstruction problem is executed with:

.. code-block:: none

    python src/trackml_train.py \
    --param tune-5 \
    --cluster "transformer" \
    --soft_reset 0 \
    --learning_rate 5e-4 \
    --scheduler_type "warm-cos" \
    --optimizer "AdamW" \
    --epoch -1 \
    --save_tag f-0p1-hyper-5 \
    --rfactor_start 0.1 --rfactor_end 0.1 --noise_ratio 0.05 \
    --node2node hyper --ncell 262144 \
    --validate 0 \
    --batch_size 1 \
    --fp_dtype "float32"

See more examples under ``/tests/train*`` for different pile-up scenarios.

Training will take days (weeks) on a single Nvidia V100 GPU, and VRAM requirements grow at least linearly with the track density if ``--ncell`` is scaled up accordingly. A GPU with 32+ GB of VRAM is recommended (in practice a must) in order to increase the pile-up scenario via ``--rfactor_start`` and ``--rfactor_end``, to use deeper models, and possibly a higher ``--batch_size`` in the neural training.

Training parameters
------------------------------

The training steering program parameters, such as the starting learning rate, can be listed with ``--help``; the rest are described under ``/hypertrack/models/global_.py``.

Model hyperparameters
-------------------------------

The model hyperparameters are encoded in the file ``/hypertrack/models/global_.py``. One can create several different tunes or custom models by choosing a ``--save_tag`` name and copying existing model files into a new pair:

.. code-block:: none

    /hypertrack/models/global_.py
    /hypertrack/models/models_.py

before starting the training.

Ground truth graph topology
-------------------------------------------------------------

The ground truth adjacency of graph nodes (hits) is controlled with the command line parameter ``--node2node``. This needs to be consistent between ``trackml_pretrain_voxdyn``, ``trackml_train`` and ``trackml_inference``, although mixed diagnostics are possible.

.. code-block:: none

    `hyper`   for a fully connected hyperedge ('lasso') over all nodes
    `eom`     for a minimal spanning tree (equations of motion helix trajectory)
    `cricket` for the eom trajectory with double hops included

This directly impacts the Voxel-Dynamics estimator ``C-matrix`` construction and indirectly the neural estimator training, especially the edge predictor part, because it defines the label target.
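To make the three topologies concrete, the following minimal, self-contained sketch (illustrative only; the function and its exact edge conventions are not taken from this repository) builds the 0/1 adjacency of a single toy track whose hits are ordered along the trajectory:

.. code-block:: python

    # Illustrative toy example (not repository code): ground truth adjacency of a
    # single track with `num_hits` hits ordered along its trajectory, for the
    # three --node2node topologies described above.
    import numpy as np

    def toy_adjacency(num_hits: int, mode: str) -> np.ndarray:
        """Return a (num_hits x num_hits) symmetric 0/1 adjacency matrix."""
        A   = np.zeros((num_hits, num_hits), dtype=np.int8)
        idx = np.arange(num_hits)

        if mode == 'hyper':        # fully connected 'lasso' over all hits of the track
            A[:, :] = 1
            np.fill_diagonal(A, 0)
        elif mode == 'eom':        # consecutive hits along the helix trajectory only
            A[idx[:-1], idx[1:]] = 1
        elif mode == 'cricket':    # eom edges plus double hops (hit i -> hit i+2)
            A[idx[:-1], idx[1:]] = 1
            A[idx[:-2], idx[2:]] = 1
        else:
            raise ValueError(f'Unknown --node2node mode: {mode}')

        return np.maximum(A, A.T)  # symmetrize (undirected edges)

    for mode in ('hyper', 'eom', 'cricket'):
        print(mode, int(toy_adjacency(5, mode).sum()) // 2, 'undirected edges')

For a track with 5 hits this gives 10 (``hyper``), 4 (``eom``) and 7 (``cricket``) undirected edges, which illustrates how quickly the hyperedge topology grows with the hit multiplicity per track.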
The ``eom`` mode favors training an edge predictor which is *space-time local*, whereas the ``hyper`` mode yields a fully *space-time non-local* one. In general, only the ``hyper`` mode is fully compatible with the overall neural clustering goal, but one can use the ``eom`` mode, e.g., to train only a high-performing GNN edge predictor. In that case, also set ``--cluster none``, which excludes the clustering transformer from training.

The ground truth definition also impacts the edge ROC/AUC (efficiency) between the ``A_hat`` (estimate) and ``A`` (ground truth) adjacency. However, the final track clustering metrics such as ``efficiency`` or ``purity`` are topology independent in their definition, although the chosen topology naturally impacts the underlying model construction.

Progressive transfer learning
---------------------------------

To reduce overall GPU time and improve convergence, it is beneficial to train the model using a low pile-up scenario and, once converged, use that model as the starting point for a higher pile-up scenario. This can be done progressively in multiple steps, e.g. by setting ``--rfactor_start`` and ``--rfactor_end`` first both to ``0.01``, then both to ``0.1``, and finally both to ``0.2`` (assuming you have enough VRAM). Similarly, one should increase the voxel space dimension ``--ncell`` as a function of the mean pile-up.

Starting the training from existing model weights is done simply by providing the ``--transfer_tag`` and ``--save_tag`` command line parameters when launching the new training. Remember to remove ``--transfer_tag`` afterwards.

Learning rate
----------------------

Start with

.. code-block:: none

    --learning_rate 5e-4

The learning rate can be crucial; e.g. with ``1e-5`` the GNN does not really enter the learning phase. Then, in the final stage, perhaps decrease to

.. code-block:: none

    --learning_rate 1e-4

which can make, e.g., the Transformer training more stable. In general, larger models with more GNN and transformer layers and higher latent dimensions may require smaller learning rates.
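For reference, a linear warm-up followed by cosine decay, i.e. the kind of schedule that ``--scheduler_type "warm-cos"`` refers to, can be written with standard PyTorch primitives as in the sketch below (not the repository implementation; the warm-up and total step counts are illustrative assumptions):

.. code-block:: python

    # Sketch of a warm-up + cosine decay learning rate schedule (assumed values).
    import math
    import torch

    model     = torch.nn.Linear(8, 8)                           # stand-in model
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # base LR as above

    warmup_steps = 1_000      # assumed warm-up length (in optimizer steps)
    total_steps  = 100_000    # assumed total number of optimizer steps

    def warm_cos(step: int) -> float:
        """Multiplicative LR factor: linear ramp-up, then cosine decay towards zero."""
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warm_cos)

    # Inside the training loop, call scheduler.step() after each optimizer.step().

This shows only the shape of the schedule; the actual scheduler used by ``trackml_train.py`` is selected internally via ``--scheduler_type``.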