Our Senior Machine Learning Engineer, Aakash Nain, released a new paper on the flow of information in a pre-trained transformer, in collaboration with Sakana AI. Read it now 👇
Very happy to present our latest paper: Transformer Layers as Painters

Through this paper, we aim to understand the flow of information in a pretrained transformer. We present a series of experiments on both decoder-only and encoder-only frozen transformer models. **Note that we do not perform any kind of fine-tuning on these pretrained models.**

With experiments run on a diverse set of datasets across both model types (decoder-only, e.g. Llama- and Mistral-like models, and encoder-only, e.g. BERT), we answer the following questions:

1. Do layers speak the same language?
2. Are all layers necessary?
3. Are the middle layers all doing the same thing?
4. Does the layer order matter?
5. Can we run the layers in parallel?
6. Does the order matter for some tasks more than others?
7. Does looping help parallelized layers?
8. Which design variants are least harmful?

The links to the full summary and the paper are in the comments section. This was a fun collaboration between Sakana AI and Emergence, and I take immense pride in working with Marc Pickett, Llion Jones, and Qi Sun.

Enjoy reading the paper! 🍻
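To make the kinds of interventions in questions 2, 4, 5, and 7 concrete, here is a minimal, hypothetical sketch (not the paper's code) of what skipping, reordering, and parallelizing frozen transformer layers can look like. The tiny stack of `nn.TransformerEncoderLayer` modules and the specific layer indices are stand-ins for illustration only, not the models or configurations used in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pretrained model: a small stack of frozen transformer layers.
d_model, n_layers = 64, 8
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
     for _ in range(n_layers)]
)
for p in layers.parameters():
    p.requires_grad_(False)  # frozen: no fine-tuning of any kind

x = torch.randn(2, 10, d_model)  # (batch, sequence, hidden)

def run(h, order):
    """Apply the frozen layers sequentially in the given order."""
    for i in order:
        h = layers[i](h)
    return h

baseline = run(x, range(n_layers))           # original order 0..7
skipped  = run(x, [0, 1, 2, 5, 6, 7])        # skip some middle layers
shuffled = run(x, [0, 1, 4, 3, 2, 5, 6, 7])  # permute the middle layers

def run_parallel_middle(h, first, middle, last):
    """Feed the same input to the middle layers and average their outputs."""
    h = run(h, first)
    h = torch.stack([layers[i](h) for i in middle]).mean(dim=0)
    return run(h, last)

parallel = run_parallel_middle(x, [0, 1], [2, 3, 4, 5], [6, 7])
```

The paper's experiments then compare how much task performance degrades under each of these variants relative to the baseline ordering; see the paper itself for the actual setups and results.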