Bert #158

Merged

yuzhou03 merged 45 commits into FlagOpen:main from ScoThunder:bert

Jul 20, 2023

Contributor

ScoThunder commented Jul 19, 2023

Log: root@p-kunlunxin-r480-005:/data/dufeilei/dev/code/FlagPerf/training/benchmarks/bert/pytorch/log/train_1x8.log
Dataset: root@p-kunlunxin-r480-005:/data/datasets_ckpt/bert/train

Feilei Du and others added 30 commits

May 19, 2023 07:56


          init

f738bac


          add efficientnet

b494ba5


          modify config

acfde41


          modify config

b4e9627


          modify config

c6fbea3


          add efficientnet

fce71f2


          modify config

ef390bc


          add efficientnet

51847e1


          bug fix

3f904db


          add efficientnet

48e835d


          Merge branch 'FlagOpen:main' into efficientnet

8eaa8a5


          add efficientnet

37d78be


          fix code style

98361a5


          fix code style

e6005bf


          fix code style

ae86109


          Revert "fix code style"

fe6a418

This reverts commit ae86109.


          fix code style

6684a5d


          fix code style

746377a


          fix code style

b3d9786


          fix code style

a70db8d


          fix code style

b672228


          Merge branch 'FlagOpen:main' into efficientnet

df3a2b2


          Merge branch 'FlagOpen:main' into main

565a35d


          Merge branch 'FlagOpen:main' into main

b21dde9


          Merge branch 'FlagOpen:main' into main

d9c089b


          bug fix

d3b3e57


          Merge branch 'FlagOpen:main' into main

ffda7ad


          Merge branch 'FlagOpen:main' into main

cb97a53


          Merge branch 'FlagOpen:main' into main

10faf97


          Merge branch 'FlagOpen:main' into main

d679833

Feilei Du and others added 14 commits

June 21, 2023 06:44


          add kunlunxin readme

2fad460


          Merge branch 'FlagOpen:main' into main

9db4cab


          Merge branch 'FlagOpen:main' into main

ba5aa39


          fix bert on gpu

01a1106


          fix bert on gpu

a61cc9a


          fix bert on gpu

c602432


          bug fix

072d81c


          bug fix

59c77e5


          bug fix

44715ca


          fix bert on gpu

e793bb6


          fix bert on gpu

714eaf5


          Merge remote-tracking branch 'upgrade/main' into bert

7242cb5


          fix bert on xpu

b467d58


          fix bert on xpu

1aac59d

ScoThunder commented

training/benchmarks/bert/pytorch/train/evaluator.py

@@ -83,9 +83,6 @@ def evaluate(self, trainer):
                               total_masked += num_masked
                               #torch.cuda.synchronize()
                               dist_pytorch.barrier(config.vendor)
-                              if config.vendor == 'kunlunxin':
-                                  import torch_xmlir.core.xpu_model as xm
-                                  xm.mark_step()

Contributor Author

ScoThunder Jul 19, 2023

Graph mode previously required this xm.mark_step call. We now run in eager mode, so it is no longer needed.

training/benchmarks/bert/pytorch/train/trainer_adapter.py

+                  from apex.parallel import DistributedDataParallel as APEX_DDP
+                  from apex.parallel.distributed import flat_dist_call
+              except ImportError:
+                  print("import apex error")

Contributor Author

ScoThunder Jul 19, 2023

The kunlunxin machines do not have apex installed, so the import raises an error. Wrapped it in try/except.
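For illustration only, a minimal sketch of the optional-import pattern this comment describes. The helper names `optional_import` and `require_apex_ddp` are hypothetical and are not part of this PR. The only names taken from the diff are the two apex imports.

```python
import importlib


def optional_import(module_name, attr_name):
    """Return `attr_name` from `module_name`, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too, e.g. apex missing on a vendor machine.
        return None
    return getattr(module, attr_name, None)


# apex is only present on some machines, so either name may end up as None.
APEX_DDP = optional_import("apex.parallel", "DistributedDataParallel")
flat_dist_call = optional_import("apex.parallel.distributed", "flat_dist_call")


def require_apex_ddp():
    """Fail at the point of use with a clear message, not at import time."""
    if APEX_DDP is None:
        raise RuntimeError("apex is not installed; use the native torch DDP path")
    return APEX_DDP
```

Compared with printing a message in the `except` branch, leaving the name as `None` and checking it where it is used gives a clearer failure if a code path that needs apex is reached on a machine without it.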

training/benchmarks/bert/pytorch/train/trainer_adapter.py

               from torch.cuda.amp import GradScaler
               from torch.nn.parallel import DistributedDataParallel as NativeDDP
               from torch.optim import Optimizer
               import utils
               import config
               #from converter import convert_model
-              from .distributed_fused_lamb import _pipeline_block_reductions_patched, _pipeline_step_patched

Contributor Author

ScoThunder Jul 19, 2023

Same as above.

training/kunlunxin/bert-pytorch/extern/trainer_adapter.py

               from torch.optim import Optimizer
-              from torch_xmlir.optimizer import Lamb
+              from torch_xmlir.optimizer import FusedLAMB

Contributor Author

ScoThunder Jul 19, 2023

Replaced with the fused optimizer.

Contributor

yuzhou03 Jul 19, 2023

Please provide a screenshot of a successful run on the kunlunxin machine.

training/kunlunxin/bert-pytorch/extern/trainer_adapter.py

-                      #                                 e5m2_allgather=config.dwu_e5m2_allgather)
-                      #optimizer.set_global_scale(float(os.getenv("INIT_LOSS_SCALE", 2 ** 20)))
-                  else:
-                      optimizer = Lamb(optimizer_grouped_parameters,

Contributor Author

ScoThunder Jul 19, 2023

Removed unneeded code.

training/kunlunxin/bert-pytorch/extern/trainer_adapter.py

-                  use_ddp = dist.is_initialized()
-                  if use_ddp and config.use_xpu:
-                      from torch_xmlir.distributed import DistributedDataParallel as DDP
-                      model = DDP(model)

Contributor Author

ScoThunder Jul 19, 2023

Switched to torch's native DDP instead of the custom DDP.

training/kunlunxin/bert-pytorch/extern/trainer_adapter.py

-                                          optimizer,
-                                          delay_overflow_check=self.config.
-                                          allreduce_post_accumulation) as scaled_loss:
-                          scaled_loss.backward()

Contributor Author

ScoThunder Jul 19, 2023

Removed redundant code.

training/kunlunxin/bert-pytorch/extern/trainer_adapter.py

                   update_step = step % config.gradient_accumulation_steps == 0
                   if update_step:
                       update_model_params(loss, optimizer, grad_scaler)
-                  else:
-                      xm.mark_step()

Contributor Author

ScoThunder Jul 19, 2023

Eager mode does not need xm.mark_step.
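A small sketch of the gradient-accumulation check that remains after this change, assuming 1-based step numbers. `is_update_step` and `run_steps` are hypothetical names used only for illustration. The modulo test mirrors the `update_step` line in the diff.

```python
def is_update_step(step, gradient_accumulation_steps):
    """True when the accumulated gradients should be applied on this step."""
    return step % gradient_accumulation_steps == 0


def run_steps(num_steps, gradient_accumulation_steps):
    """Return the step numbers on which the optimizer would be stepped."""
    update_steps = []
    for step in range(1, num_steps + 1):
        # The backward pass would run here on every step, accumulating gradients.
        if is_update_step(step, gradient_accumulation_steps):
            # update_model_params(loss, optimizer, grad_scaler) would run here.
            update_steps.append(step)
        # No else branch: per the comment above, eager mode needs no
        # xm.mark_step() on the non-update steps.
    return update_steps


print(run_steps(8, 4))  # prints [4, 8]
```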

training/kunlunxin/bert-pytorch/extern/trainer_adapter.py

-                                  param.grad = None
-                  else:
-                      xm.optimizer_step(optimizer, barrier=True)

Contributor Author

ScoThunder Jul 19, 2023

Removed unneeded code.

yuzhou03 assigned shh2000, upvenly and Ox7c000000

shh2000 approved these changes


          fix bert on xpu

2bc4922

shh2000 approved these changes

yuzhou03 approved these changes

yuzhou03 merged commit ddacd61 into FlagOpen:main
