
Vertical Federated Learning RFC #8424

Open
rongou opened this issue Nov 4, 2022 · 4 comments

@rongou
Contributor

rongou commented Nov 4, 2022

Motivation

XGBoost 1.7.0 introduced initial support for federated learning, but only horizontal federated learning is supported: training samples are assumed to be split horizontally, i.e. each participant has a subset of the samples, but with all of the features and labels. In many real-world applications, data is split vertically: each participant has all the samples, but only a subset of the features, and not all participants have access to the labels. It would be useful to support vertical federated learning in XGBoost.

Goals

  • Enhance XGBoost to support vertical federated learning.
  • Support using NVFlare to coordinate the learning process, but keep the design amenable to other federated learning platforms.
  • Efficiency: training speed should be close to that of traditional distributed training.
  • Accuracy: the resulting model should be close to one trained centrally.

Non-Goals

  • Initially we will assume the federated environment is non-adversarial, and will not provide strong privacy guarantees. This will be improved upon in later iterations.
  • We will not support data owners dropping out during the learning process.

Assumptions

In vertical federated learning, before model training, the participants need to jointly compute the intersection of their private ID sets, i.e. find the sample IDs they have in common. This is called private set intersection (PSI). For XGBoost, we assume this step is already done; users may rely on another library/framework for it.
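
To make the assumed starting point concrete, here is a toy sketch of the sample alignment PSI produces. Plain hashing like this is not privacy-preserving, so a real deployment would use a dedicated PSI protocol; the IDs and helper names below are purely illustrative.

```python
import hashlib

def hashed_ids(ids):
    """Map raw record IDs to SHA-256 digests (a toy stand-in for real PSI)."""
    return {hashlib.sha256(i.encode()).hexdigest(): i for i in ids}

party_a = hashed_ids(["u1", "u2", "u3"])  # party A's record IDs
party_b = hashed_ids(["u2", "u3", "u4"])  # party B's record IDs

# The intersection defines the common sample set; each party then orders its
# local feature columns by this shared ID list before building its DMatrix.
common = sorted(set(party_a) & set(party_b))
aligned_ids = [party_a[h] for h in common]
print(aligned_ids)  # ['u2', 'u3'] (in digest order)
```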

Similar to horizontal federated learning, we make some simplifying assumptions:

  • A few trusted partners jointly train a model.
  • A reasonably fast network connection between each participant and a central trusted party.

Risks

The current XGBoost codebase is fairly complicated and hard to modify. Some code refactoring needs to happen first, before support for vertical federated learning can be added. Care must be taken to not break existing functionality, or make regular training harder.

Design

LightGBM, a gradient boosting library similar to XGBoost, supports “feature parallel” distributed learning.

[figure: LightGBM feature-parallel training]

Conceptually, feature parallelism is similar to vertical federated learning. A possible design is to first enhance XGBoost distributed training to support feature parallelism, and then build vertical federated learning on top of it. Since feature parallelism would benefit the wider user community on its own, this staging greatly reduces the risk involved in refactoring XGBoost’s code base.

Feature Parallelism

XGBoost has an internal training parameter called DataSplitMode, which can be set to auto, col, or row. However, it’s currently not exposed to end users, and can only be set to row for distributed training.
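
As a rough illustration of where this could end up, the snippet below imagines the internal parameter surfaced as a DMatrix argument from Python. Both the data_split_mode keyword and its integer encoding are assumptions of this sketch, not a shipped API at the time of this RFC.

```python
import numpy as np
import xgboost as xgb

X_local = np.random.rand(1000, 5)    # this worker's shard of the feature columns
y = np.random.randint(2, size=1000)  # labels (MetaInfo) replicated on every worker

# data_split_mode=1 would select the internal DataSplitMode value col;
# both the keyword and the encoding are hypothetical here.
dtrain = xgb.DMatrix(X_local, label=y, data_split_mode=1)
booster = xgb.train({"tree_method": "hist", "objective": "binary:logistic"}, dtrain)
```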

In order to support column-based data split for distributed training, we need to do the following:

  • When initially loading data, support splitting by column. To keep it simple, we can have all workers keep a copy of the labels (and other things like weight and qid, effectively the MetaInfo object). The resulting DMatrix needs to keep track of which features belong to which worker.
  • When generating the prediction, participants need to work collaboratively: the worker owning the feature used to split a node needs to collect the left and right splits and broadcast the results. A naive implementation may incur too much communication overhead, but there is prior work on encoding partial predictions as bitsets to make the process more efficient (see paper).
  • The worker owning the label should calculate the gradients and broadcast them to other workers.
  • When finding the best split, each worker finds the best local split based on the features it owns, and then performs an allreduce to find the global best split; workers don’t need to access each other’s histograms. The worker owning the feature for the best split then broadcasts the split results to the others (see the sketch after this list).
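
A minimal single-process sketch of this split-finding flow, with the communication steps reduced to comments (a real implementation would go through XGBoost’s collective layer; the gain formula omits the usual 1/2 factor and regularization term γ since only the argmax matters here):

```python
import numpy as np

def best_local_split(X_local, grad, hess, reg_lambda=1.0):
    """Return (gain, feature, threshold) for the best split among local features."""
    best = (-np.inf, None, None)
    G, H = grad.sum(), hess.sum()
    parent = G * G / (H + reg_lambda)
    for f in range(X_local.shape[1]):
        order = np.argsort(X_local[:, f])
        gl, hl = np.cumsum(grad[order]), np.cumsum(hess[order])
        for i in range(len(order) - 1):  # candidate split between rows i and i+1
            gain = (gl[i] ** 2 / (hl[i] + reg_lambda)
                    + (G - gl[i]) ** 2 / (H - hl[i] + reg_lambda) - parent)
            if gain > best[0]:
                best = (gain, f, (X_local[order[i], f] + X_local[order[i + 1], f]) / 2)
    return best

rng = np.random.default_rng(0)
n = 64
grad, hess = rng.normal(size=n), np.ones(n)      # broadcast by the label owner
shards = [rng.random((n, 3)) for _ in range(2)]  # two workers, three features each

# "Allreduce": take the max over (gain, rank) -- no histograms are exchanged.
candidates = [best_local_split(s, grad, hess) for s in shards]
winner = max(range(len(shards)), key=lambda r: candidates[r][0])
gain, feat, thresh = candidates[winner]

# The winning worker broadcasts the resulting row partition (a bitset in practice).
go_left = shards[winner][:, feat] < thresh
print(f"worker {winner}: feature {feat}, gain {gain:.3f}, {go_left.sum()} rows left")
```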

We may also want to consider implementing LightGBM’s voting parallel approach (paper) for more efficient communication.

Vertical Federated Learning

Assuming feature parallelism is implemented, vertical federated learning is a slight modification:

  • When loading data, there is no need to split the columns further, since each worker already holds only its own subset of the features.
  • Labels can no longer be shared between workers; only the label owner computes gradients (see the sketch after this list).
  • Communication needs to switch to the federated communicator.
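
A minimal sketch of the gradient step under this constraint, simulating ranks in one process. The broadcast helper is a stub standing in for the federated communicator (in reality an encrypted channel), and the logistic-loss formulas are only for illustration:

```python
import numpy as np

LABEL_OWNER = 0
_channel = {}  # stand-in buffer for the federated communicator's broadcast

def broadcast(key, payload, rank, root):
    """Stub broadcast: the root rank publishes, every rank reads the result."""
    if rank == root:
        _channel[key] = payload
    return _channel[key]

def gradient_step(rank, preds, labels=None):
    """Only the label owner computes logistic-loss gradients; others receive them."""
    if rank == LABEL_OWNER:
        p = 1.0 / (1.0 + np.exp(-preds))
        g, h = p - labels, p * (1.0 - p)
    else:
        g = h = None  # no labels locally, so wait for the broadcast
    g = broadcast("grad", g, rank, LABEL_OWNER)
    h = broadcast("hess", h, rank, LABEL_OWNER)
    return g, h

# Single-process demo (the owner must run first in this toy setup):
preds = np.zeros(4)
g0, h0 = gradient_step(0, preds, labels=np.array([1.0, 0.0, 1.0, 1.0]))
g1, h1 = gradient_step(1, preds)  # rank 1 holds no labels, gets the broadcast
```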

Federated Inference

In horizontal federated learning, since each participant has all the features and labels, trained models can be shared and used to run inference/prediction locally. In vertical federated learning, however, this is no longer feasible: all participants need to be online and work collaboratively to run inference. For batch inference, a federated learning job can be set up (for example, using NVFlare) to produce predictions. For online inference, services would need to be set up at the participating sites to jointly produce predictions, which is out of scope for this design.
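
To illustrate why all parties must participate, here is a toy single-process sketch of collaborative inference over one tree: each party can evaluate only the split nodes whose feature it owns, so per-node direction bitsets are exchanged and merged before traversal. This is a simplification, not the bitset encoding from the cited paper; the tree layout and ownership map are made up for the example.

```python
import numpy as np

# Tree: node -> (feature, threshold, left_child, right_child); leaves are values.
tree = {0: (0, 0.5, 1, 2), 1: (1, 0.3, 3, 4), 3: -1.0, 4: 0.2, 2: 1.3}
owner = {0: "A", 1: "B"}  # feature 0 belongs to party A, feature 1 to party B

def direction_bits(party, X_party, feat_offset):
    """bits[node][row] = go-left decision for the nodes this party can evaluate."""
    bits = {}
    for node, spec in tree.items():
        if isinstance(spec, tuple):
            feat, thresh, _, _ = spec
            if owner[feat] == party:
                bits[node] = X_party[:, feat - feat_offset] < thresh
    return bits

X_a = np.array([[0.2], [0.9]])  # party A holds feature 0 for both samples
X_b = np.array([[0.1], [0.7]])  # party B holds feature 1 for both samples

# Exchange and merge the bitsets (each node has exactly one owner, so a plain
# union works; in practice this would be an allgather of compact bitsets).
merged = {**direction_bits("A", X_a, 0), **direction_bits("B", X_b, 1)}

def predict(row):
    node = 0
    while isinstance(tree[node], tuple):
        _, _, left, right = tree[node]
        node = left if merged[node][row] else right
    return tree[node]

print([predict(r) for r in range(2)])  # [-1.0, 1.3]; no raw feature values shared
```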

Alternatives Considered

It may be possible to implement vertical federated learning without first adding support for column data split mode in distributed training. However, since this would require extensive refactoring of the XGBoost code base without producing any benefit for users outside federated learning, it is likely too risky. Conversely, for very wide datasets (many features relative to a moderate number of rows), column data split may be a useful feature for regular distributed training in its own right.

@trivialfis
Member

Thank you for the detailed RFC! I'm sure there's a lot of excitement coming. I've tagged the RFC as part of 2.0. Let me know if there's anything I can help with.

@rongou
Contributor Author

rongou commented Nov 9, 2022

Adding a task list to keep track of the progress:

@peiji1981

Hello @rongou, I have a lot of experience implementing vertical federated XGBoost/LightGBM, but my earlier version was written in Python and Scala. I tried to implement it on top of the LightGBM source code, adding some C APIs for the Python wrapper, but that work was very hard for me because of my limited C coding experience. How can I contribute my experience to you or your group to implement this on top of the XGBoost source?

@trivialfis
Member

Hi @peiji1981 , thank you for the offer! Do you have a paper or document on your work?
