Vertical Federated Learning RFC #8424
Comments
Thank you for the detailed RFC! I'm sure there's a lot of excitement to come. I tagged the RFC as part of 2.0. Let me know if there's anything I can help with.
Hello @rongou. I have a lot of experience implementing vertical federated XGBoost/LightGBM, but my earlier version was written in Python and Scala. I was trying to implement it on top of the LightGBM source code, also adding some C API entry points for the Python wrapper, but this work is very hard for me because of my limited C coding experience. How can I contribute my experience to you or your group to implement this on top of the XGBoost source?
Hi @peiji1981, thank you for the offer! Do you have a paper or document on your work?
Motivation
XGBoost 1.7.0 introduced initial support for federated learning. However, only horizontal federated learning is supported: training samples are assumed to be split horizontally, i.e. each participant has a subset of the samples, but with all the features and labels. In many real-world applications, data is split vertically: each participant has all the samples, but only a partial list of features, and not all participants have access to the labels. It would be useful to support vertical federated learning in XGBoost.
Goals
Non-Goals
Assumptions
In vertical federated learning, before model training, the participants need to jointly compute the common IDs of their private sets. This is called private set intersection (PSI). For XGBoost, we assume this is already done, and users may rely on some other library/framework for this step.
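For illustration, here is a minimal sketch of the alignment that PSI produces. The function and data names are purely illustrative, and the intersection is computed in the clear; a real deployment would use a cryptographic PSI protocol from a dedicated library.

```python
import pandas as pd

def align_on_common_ids(party_frames):
    """Align each party's data on the sample IDs common to all parties.

    A stand-in for PSI: here the IDs are compared in the clear, whereas
    a real PSI protocol computes the intersection without revealing the
    non-shared IDs to the other parties.
    """
    common = set(party_frames[0].index)
    for df in party_frames[1:]:
        common &= set(df.index)
    common = sorted(common)
    # Every party reorders its rows identically, so row i refers to the
    # same sample everywhere.
    return [df.loc[common] for df in party_frames]

# Party A holds features f0, f1; party B holds f2 and the labels.
a = pd.DataFrame({"f0": [1.0, 2.0, 3.0], "f1": [0.1, 0.2, 0.3]},
                 index=["id1", "id2", "id3"])
b = pd.DataFrame({"f2": [5.0, 6.0], "label": [0, 1]},
                 index=["id2", "id3"])
a_aligned, b_aligned = align_on_common_ids([a, b])
```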
Similar to horizontal federated learning, we make some simplifying assumptions:
Risks
The current XGBoost codebase is fairly complicated and hard to modify. Some code refactoring needs to happen first, before support for vertical federated learning can be added. Care must be taken to not break existing functionality, or make regular training harder.
Design
LightGBM, a gradient boosting library similar to XGBoost, supports “feature parallel” distributed learning.
Conceptually, feature parallelism is similar to vertical federated learning. A possible design is to first enhance XGBoost distributed training to support feature parallelism, and then build vertical federated learning on top of it. This would benefit the wider user community, thus greatly reducing the risks involved in refactoring XGBoost’s code base.
Feature Parallelism
XGBoost has an internal training parameter called `DataSplitMode`, which can be set to `auto`, `col`, and `row`. However, it's currently not exposed to the end user, and can only be set to `row` for distributed training.
In order to support column-based data split for distributed training, we need to expose this mode so that each worker loads only its own subset of feature columns, with one worker holding the labels (along with other training metadata such as `weight` and `qid`, effectively the `MetaInfo` object). The resulting `DMatrix` needs to keep track of which features belong to which worker.
We may also want to consider implementing LightGBM's voting parallel approach (paper) for more efficient communication.
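As a rough sketch of what the user-facing column-split API might look like once `DataSplitMode` is exposed; the `data_split_mode` argument and its spelling are assumptions of this sketch, not an existing public parameter:

```python
import numpy as np
import xgboost as xgb

# Toy data: pretend this worker holds only 2 of the dataset's features.
local_feature_slice = np.random.rand(100, 2)
labels = np.random.randint(0, 2, size=100)  # only on the label-holding worker

# Hypothetical: `data_split_mode="col"` is the proposed new argument
# telling XGBoost that features, not rows, are partitioned across
# workers; as of 1.7 the mode is internal-only.
dtrain = xgb.DMatrix(local_feature_slice, label=labels,
                     data_split_mode="col")
booster = xgb.train({"tree_method": "hist", "objective": "binary:logistic"},
                    dtrain, num_boost_round=10)
```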
Vertical Federated Learning
Assuming feature parallelism is implemented, vertical federated learning is a slight modification: communication goes through the secure federated channel, and only the participant holding the labels computes the gradients, which are then shared with the other participants.
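To make the modification concrete, here is a minimal pure-Python sketch of one boosting round under vertical federated learning. `broadcast` and `allreduce_best` are stand-ins for the real communicator collectives, and the gain formula is a toy, not XGBoost's actual split gain:

```python
import numpy as np

def broadcast(value, root=0):
    """Stand-in for a collective broadcast; in the real system this goes
    through the federated communicator over the secure channel."""
    return value

def allreduce_best(candidate):
    """Stand-in for an allreduce that keeps the candidate split with the
    highest gain across all participants."""
    return candidate

def vertical_round(features, labels, preds):
    # 1. Only the label holder can compute gradients; it broadcasts them
    #    to the other participants.
    grad = preds - labels  # logistic-loss gradient, for example
    grad = broadcast(grad, root=0)

    # 2. Each participant evaluates candidate splits only over the
    #    features it owns, exactly as in feature-parallel training.
    best = None
    for j in range(features.shape[1]):
        for threshold in np.unique(features[:, j]):
            left = features[:, j] < threshold
            # Toy gain, not XGBoost's real split gain formula.
            gain = abs(grad[left].sum()) + abs(grad[~left].sum())
            if best is None or gain > best["gain"]:
                best = {"feature": j, "threshold": threshold, "gain": gain}

    # 3. The globally best split is agreed on collectively; only the
    #    owning participant learns which of its features was chosen.
    return allreduce_best(best)

# Toy usage on one participant's feature slice.
X = np.random.rand(50, 3)
y = np.random.randint(0, 2, 50).astype(float)
p = np.full(50, 0.5)  # current predictions (probabilities)
print(vertical_round(X, y, p))
```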
Federated Inference
In horizontal federated learning, since each participant has all the features and labels, trained models can be shared to run inference/prediction locally. In vertical federated learning, however, this is no longer feasible: all the participants need to be online and work collaboratively to run inference. For batch inference, a federated learning job can be set up (for example, using NVFlare) to produce predictions. For online inference, we would need to set up services at the participating sites to jointly produce predictions, which is out of scope for this design.
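From the user's side, batch federated inference might look like the following sketch, assuming the hypothetical column-split `DMatrix` from above and that `predict` becomes a collective call inside the federated job; both are assumptions of this sketch, not current API:

```python
import numpy as np
import xgboost as xgb

# Hypothetical batch-inference step, run simultaneously on every
# participant inside a federated job (e.g. orchestrated by NVFlare).
# Each participant supplies only its own feature columns; the
# communicator would exchange the per-party partial tree traversals
# behind the scenes.
local_features = np.random.rand(100, 2)  # this participant's columns only
dtest = xgb.DMatrix(local_features, data_split_mode="col")  # proposed API

booster = xgb.Booster(model_file="federated_model.json")  # shared model
preds = booster.predict(dtest)  # collective call; all parties must be online
```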
Alternatives Considered
It may be possible to implement vertical federated learning without first adding support for the column data split mode in distributed training. However, since this would require extensive refactoring of the XGBoost code base without benefiting users outside federated learning, it may be too risky. For very wide datasets (many features relative to a moderate number of rows), column data split may itself be a useful feature for distributed training.