About "norm_dim_in" in self.to_out #10
@buyeah1109 hey Anthon! this is a good question which i don't really know the answer to. in the end, i went by this sentence and presumed the embedding (model) dimension for all weight matrices. however, there are papers that do allude that your way of doing it may work (look up the post-activation layernorm in the feedforward of the normformer paper, as well as the norm on the aggregated values coming out of attention, but before combining the heads, named sub-ln)
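rough sketch of where those two norms sit, in case it helps (illustrative pytorch, not the exact code from either paper):

```python
import torch
from torch import nn

dim, dim_inner, heads, dim_head, seq = 512, 2048, 8, 64, 128

# normformer-style feedforward: an extra layernorm right after the activation,
# before projecting back down to the model dimension
ff = nn.Sequential(
    nn.Linear(dim, dim_inner),
    nn.GELU(),
    nn.LayerNorm(dim_inner),   # post-activation norm
    nn.Linear(dim_inner, dim)
)

# sub-ln style: normalize the aggregated values coming out of attention,
# per head, before the heads are combined and projected out
values = torch.randn(2, heads, seq, dim_head)      # (batch, heads, seq, dim_head)
normed = nn.LayerNorm(dim_head)(values)            # norm before combining heads
merged = normed.transpose(1, 2).reshape(2, seq, heads * dim_head)
out = nn.Linear(heads * dim_head, dim)(merged)     # to_out projection afterwards
```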
Thanks for your answer! That helps a lot. May I confirm that you are presuming the normalizing dimension is consistent across the entire model? Thus the normalizations in the code are all based on the "dim" dimension. There is some good news: I observed a significant speed-up from nGPT in pretraining on the OpenWebText dataset. To reach the same loss, nGPT requires far fewer tokens or training iterations than the standard GPT2 baseline.
@buyeah1109 wow! you may be the first (although a lot of researchers keep their cards close) to report a positive result! (at least for the speedup) how significant was it, and was it holding all hyperparameters constant?
yes, that's what i'm assuming based on the "all vectors forming the embedding dimension" phrase, but looking at some other papers (like the cosine sim network paper from 2017), it could be done the other way around when projecting out in the end. you can only normalize one dimension of the weight matrix, and i can see both perspectives
@buyeah1109 maybe when i get some time i'll wire up the training script for openwebtext in this repo then... thanks for sharing your replication efforts
@buyeah1109 i could add some option to change up the normalizing dimension for those output projections in the
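roughly what such an option could look like, as a simplified sketch (not the actual NormLinear in this repo):

```python
import torch
from torch import nn
import torch.nn.functional as F

class NormLinear(nn.Module):
    # simplified sketch of a weight-normalized linear with a switchable norm dimension
    def __init__(self, dim_in, dim_out, norm_dim_in = True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim_out, dim_in))
        # weight is stored (dim_out, dim_in) like nn.Linear, so dim=-1 is the input dim
        self.norm_dim = -1 if norm_dim_in else 0

    def forward(self, x):
        w = F.normalize(self.weight, dim = self.norm_dim)
        return F.linear(x, w)

# norm_dim_in = True  -> each output unit's weight vector is unit length (cosine-sim reading)
# norm_dim_in = False -> each column is unit length along the embedding dimension
to_out = NormLinear(2048, 512, norm_dim_in = False)
```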
Pretty good, it's about 3x faster than standard GPT2-124M. For hyperparameters, I just copied the default settings of standard GPT2 and tried different learning rates. I believe there is a lot of room to improve with hyperparameter tuning.
@buyeah1109 so even without a ton of tuning.. that's great news.. 🙏 looks like i'll put some more work into this next month. research isn't done here yet
@buyeah1109 you didn't happen to train with mixed precision, did you? also, was your baseline using rotary embeddings or not?
I trained with mixed precision. My GPU is Ampere, so I think BF16 is supported and used in the mixed-precision training. I used to train with V100s but ended up with NaNs. For the baseline, I tested it both with and without rotary embeddings; RoPE slightly improved the baseline's training loss.
@buyeah1109 wonderful! thank you!
Hey, would you care to share your baseline and nGPT code? I tried to reproduce the results at the 124M scale but only got comparable results between nGPT and GPT.
@alxndrTL hey Alexandre, thanks for chiming in. i'm also seeing only "comparable" results, but then again, i've never been that great of an experimentalist, so i'll reserve judgement for a bit longer
Sure! I used the code from nanoGPT to train the GPT2-124M baseline on the OpenWebText dataset. I used the default nanoGPT training configuration for GPT2 except for the batch size, since I don't have 8x H100s lol. For nGPT, I kept the nanoGPT training script and directly imported the model from this wonderful project, aligning the depth, width and dimension with GPT2. I didn't use QK-norm for nGPT.
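For reference, a rough sketch of that setup at the GPT2-124M shape; the constructor keyword names below are assumptions from memory, so check the actual signature in this repo:

```python
# rough sketch of swapping the nanoGPT model for nGPT at the GPT2-124M scale
from nGPT_pytorch import nGPT  # assumed import path

model = nGPT(
    num_tokens = 50304,   # nanoGPT pads GPT-2's 50257-token vocab up to 50304
    dim = 768,            # n_embd of GPT2-124M
    depth = 12,           # n_layer
    heads = 12,           # n_head (assumed kwarg name)
    dim_head = 64,        # 12 heads x 64 = 768 (assumed kwarg name)
    attn_norm_qk = False  # assumed kwarg for disabling QK-norm, per the comment above
)

# the rest of the nanoGPT training loop (optimizer, lr schedule, bf16 autocast)
# stays as in the default GPT2 config, aside from the batch size
```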
I can also share some data points. GPT2-124M achieves a training loss of around 3.04 after training on 16B tokens. nGPT achieves 2.94 after 16B tokens, and 3.0 after only 8B tokens.
I think your "comparable" results also matter, @alxndrTL. You could compare with my setting and find the differences, which could tell us which part of the nGPT implementation is the most significant one for the training speed-up. It would help a lot!
Thanks for the details @buyeah1109. Yes, I will try again; what you saw gives me hope.
In my experiments (different modalities including mel-spectrograms and other embeddings, no text tokens) i noticed that setting
@inspirit yes, that makes sense, as the magnitude in the continuous data will be lost without the network being given some room to encode it as phase. i also got curious and spent the whole day yesterday trying out a normalized MLP for a small RL task, but while it learns and is stable, it's not really that much better than SOTA. while we are on this topic, i think i'll also bring up the xval paper, which shows that a transformer can make use of magnitude on a token to generalize better for numerical tasks. just to play devil's advocate for this approach
@lucidrains I wonder what you would recommend as an input projection for continuous data, especially if the data dimensionality is 2-3x larger than the transformer dim? I now have 2 ideas: the first I have already verified and it works, setting norm_dim_in=True for the input layer; the second would be using just a normal Linear layer without weight norm and l2norm-ing the output of the layer.
I guess another way is to use a small MLP as the input/output projection; the question is what weight norm options to use to project from data_dim to model_dim
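A quick sketch of those two ideas side by side (illustrative sizes; the norm_dim_in=True behaviour is approximated here by normalizing a raw weight along its input dimension):

```python
import torch
from torch import nn
import torch.nn.functional as F

data_dim, model_dim = 1536, 512          # illustrative: continuous input ~3x wider than the model
x = torch.randn(2, 64, data_dim)         # (batch, seq, data_dim), e.g. mel-spectrogram frames

# idea 1: weight-normalized projection, normalizing along the input (data) dimension
w_in = torch.randn(model_dim, data_dim)
h_a = F.linear(x, F.normalize(w_in, dim = -1))   # unit-length weight rows; activations keep their magnitude

# idea 2: plain linear, no weight norm, then l2-normalize the activations instead
proj = nn.Linear(data_dim, model_dim, bias = False)
h_b = F.normalize(proj(x), dim = -1)             # token embeddings forced onto the unit hypersphere
```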
@inspirit in the RL setting, what worked was just a linear followed by an activation, but if heavily normalized networks like these somehow take off, i'm sure people will be combing the lit for better ways to encode magnitude into phase. in other words, i don't know
Thanks for the great work. I notice that in the Attention and FFN blocks, the output matrix (i.e., self.to_out) is normalized differently: along the first dimension instead of the last dimension (normalizing along the last dimension is the default in your code, and this behavior is controlled by the flag "norm_dim_in").
I am wondering why the normalization is different for the output matrix. I was thinking that the author's goal with weight normalization is to turn the dot-product computation into a cosine similarity. But if we normalize along the first dimension of the output matrix, then we are not calculating the cosine similarity between the intermediate state in the FFN and the weight vectors of the output matrix. Correct me if I am wrong; I would really appreciate it.
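To make the concern concrete, here is a tiny numeric sketch (illustrative shapes only, not the code in this repo):

```python
import torch
import torch.nn.functional as F

dim, dim_inner = 512, 2048
hidden = F.normalize(torch.randn(dim_inner), dim = -1)   # unit-norm FFN intermediate state
to_out = torch.randn(dim, dim_inner)                     # output projection, shape (dim, dim_inner) here

# normalized along the last (input) dimension: every row is a unit vector, so the
# matmul is a true cosine similarity between `hidden` and each weight row
rows_unit = F.normalize(to_out, dim = -1)
print(rows_unit.norm(dim = -1)[:3])       # ~1.0 each
print((rows_unit @ hidden).abs().max())   # bounded by 1.0

# normalized along the first (embedding) dimension instead: rows are no longer
# unit vectors, so the same product is a scaled dot product, not a cosine similarity
cols_unit = F.normalize(to_out, dim = 0)
print(cols_unit.norm(dim = -1)[:3])       # generally not 1.0
print(cols_unit.norm(dim = 0)[:3])        # the columns are the unit vectors here
```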