Vertical Federated Learning RFC #8424
Comments
Thank you for the detailed RFC! I'm sure there's a lot of excitement to come. I tagged the RFC as part of 2.0. Let me know if there's anything I can help with.
Hello @rongou. I have a lot of experience implementing vertical federated XGBoost/LightGBM, but my earlier version was written in Python and Scala. I was trying to implement it on top of the LightGBM source code, also adding some C API entry points for the Python wrapper, but this work is very hard for me because of my limited C coding experience. How can I contribute my experience to you or your group to implement this on top of the XGBoost source?
Hi @peiji1981, thank you for the offer! Do you have a paper or document on your work?
Motivation
XGBoost 1.7.0 introduced initial support for federated learning. However, only horizontal federated learning is supported: training samples are assumed to be split horizontally, i.e. each participant has a subset of the samples, but with all the features and labels. In many real-world applications, data is split vertically: each participant has all the samples, but only a partial list of features, and not all participants have access to the labels. It would be useful to support vertical federated learning in XGBoost.
Goals
Non-Goals
Assumptions
In vertical federated learning, before model training, the participants need to jointly compute the common IDs of their private sets. This is called private set intersection (PSI). For XGBoost, we assume this is already done, and users may rely on some other library/framework for this step.
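For illustration, here is a minimal sketch of the alignment that PSI produces. The function and data names are purely illustrative, and the intersection is computed in the clear; a real deployment would use a cryptographic PSI protocol from a dedicated library.

```python
import pandas as pd

def align_on_common_ids(party_frames):
    """Align each party's data on the sample IDs common to all parties.

    A stand-in for PSI: here the IDs are compared in the clear, whereas
    a real PSI protocol computes the intersection without revealing the
    non-shared IDs to the other parties.
    """
    common = set(party_frames[0].index)
    for df in party_frames[1:]:
        common &= set(df.index)
    common = sorted(common)
    # Every party reorders its rows identically, so row i refers to the
    # same sample everywhere.
    return [df.loc[common] for df in party_frames]

# Party A holds features f0, f1; party B holds f2 and the labels.
a = pd.DataFrame({"f0": [1.0, 2.0, 3.0], "f1": [0.1, 0.2, 0.3]},
                 index=["id1", "id2", "id3"])
b = pd.DataFrame({"f2": [5.0, 6.0], "label": [0, 1]},
                 index=["id2", "id3"])
a_aligned, b_aligned = align_on_common_ids([a, b])
```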
Similar to horizontal federated learning, we make some simplifying assumptions:
Risks
The current XGBoost codebase is fairly complicated and hard to modify. Some code refactoring needs to happen first, before support for vertical federated learning can be added. Care must be taken to not break existing functionality, or make regular training harder.
Design
LightGBM, a gradient boosting library similar to XGBoost, supports “feature parallel” distributed learning.
Conceptually, feature parallelism is similar to vertical federated learning. A possible design is to first enhance XGBoost distributed training to support feature parallelism, and then build vertical federated learning on top of it. This would benefit the wider user community, thus greatly reducing the risks involved in refactoring XGBoost’s code base.
Feature Parallelism
XGBoost has an internal training parameter called `DataSplitMode`, which can be set to `auto`, `col`, and `row`. However, it's currently not exposed to the end user, and can only be set to `row` for distributed training.
In order to support column-based data split for distributed training, we need to expose this mode so that each worker loads only its own subset of feature columns, with one worker holding the labels (along with other training metadata such as `weight` and `qid`, effectively the `MetaInfo` object). The resulting `DMatrix` needs to keep track of which features belong to which worker.
We may also want to consider implementing LightGBM's voting parallel approach (paper) for more efficient communication.
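As a rough sketch of what the user-facing column-split API might look like once `DataSplitMode` is exposed; the `data_split_mode` argument and its spelling are assumptions of this sketch, not an existing public parameter:

```python
import numpy as np
import xgboost as xgb

# Toy data: pretend this worker holds only 2 of the dataset's features.
local_feature_slice = np.random.rand(100, 2)
labels = np.random.randint(0, 2, size=100)  # only on the label-holding worker

# Hypothetical: `data_split_mode="col"` is the proposed new argument
# telling XGBoost that features, not rows, are partitioned across
# workers; as of 1.7 the mode is internal-only.
dtrain = xgb.DMatrix(local_feature_slice, label=labels,
                     data_split_mode="col")
booster = xgb.train({"tree_method": "hist", "objective": "binary:logistic"},
                    dtrain, num_boost_round=10)
```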
Vertical Federated Learning
Assuming feature parallelism is implemented, vertical federated learning is a slight modification: communication goes through the secure federated channel, and only the participant holding the labels computes the gradients, which are then shared with the other participants.
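To make the modification concrete, here is a minimal pure-Python sketch of one boosting round under vertical federated learning. `broadcast` and `allreduce_best` are stand-ins for the real communicator collectives, and the gain formula is a toy, not XGBoost's actual split gain:

```python
import numpy as np

def broadcast(value, root=0):
    """Stand-in for a collective broadcast; in the real system this goes
    through the federated communicator over the secure channel."""
    return value

def allreduce_best(candidate):
    """Stand-in for an allreduce that keeps the candidate split with the
    highest gain across all participants."""
    return candidate

def vertical_round(features, labels, preds):
    # 1. Only the label holder can compute gradients; it broadcasts them
    #    to the other participants.
    grad = preds - labels  # logistic-loss gradient, for example
    grad = broadcast(grad, root=0)

    # 2. Each participant evaluates candidate splits only over the
    #    features it owns, exactly as in feature-parallel training.
    best = None
    for j in range(features.shape[1]):
        for threshold in np.unique(features[:, j]):
            left = features[:, j] < threshold
            # Toy gain, not XGBoost's real split gain formula.
            gain = abs(grad[left].sum()) + abs(grad[~left].sum())
            if best is None or gain > best["gain"]:
                best = {"feature": j, "threshold": threshold, "gain": gain}

    # 3. The globally best split is agreed on collectively; only the
    #    owning participant learns which of its features was chosen.
    return allreduce_best(best)

# Toy usage on one participant's feature slice.
X = np.random.rand(50, 3)
y = np.random.randint(0, 2, 50).astype(float)
p = np.full(50, 0.5)  # current predictions (probabilities)
print(vertical_round(X, y, p))
```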
Federated Inference
In horizontal federated learning, since each participant has all the features and labels, trained models can be shared to run inference/prediction locally. In vertical federated learning, however, this is no longer feasible: all the participants need to be online and work collaboratively to run inference. For batch inference, a federated learning job can be set up (for example, using NVFlare) to produce predictions. For online inference, we would need to set up services at the participating sites to jointly produce predictions, which is out of scope for this design.
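From the user's side, batch federated inference might look like the following sketch, assuming the hypothetical column-split `DMatrix` from above and that `predict` becomes a collective call inside the federated job; both are assumptions of this sketch, not current API:

```python
import numpy as np
import xgboost as xgb

# Hypothetical batch-inference step, run simultaneously on every
# participant inside a federated job (e.g. orchestrated by NVFlare).
# Each participant supplies only its own feature columns; the
# communicator would exchange the per-party partial tree traversals
# behind the scenes.
local_features = np.random.rand(100, 2)  # this participant's columns only
dtest = xgb.DMatrix(local_features, data_split_mode="col")  # proposed API

booster = xgb.Booster(model_file="federated_model.json")  # shared model
preds = booster.predict(dtest)  # collective call; all parties must be online
```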
Alternatives Considered
It may be possible to implement vertical federated learning without first adding support for the column data split mode in distributed training. However, since this would require extensive refactoring of the XGBoost code base without benefiting users outside federated learning, it may be too risky. For very wide datasets (many features relative to a moderate number of rows), column data split may itself be a useful feature for distributed training.