xpu_text_classifier: Custom Text Classification on Intel dGPUs

xpu_text_classifier lets you fine-tune transformer models on custom datasets for multi-class or multi-label classification tasks. Supported models include popular transformer architectures such as BERT, BART, and DistilBERT. The solution uses the Hugging Face Trainer to handle training and leverages Intel Extension for PyTorch to run on Intel discrete GPUs (dGPUs).

Table of Contents

  * Installation
  * Preparing Your Dataset
  * Usage
  * Monitoring GPU Usage
  * Additional Details

Installation

Before you start, ensure you have PyTorch and Intel Extension for PyTorch installed.
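
A quick sanity check that both are in place (a minimal sketch, assuming an XPU build of Intel Extension for PyTorch):

import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

print(torch.__version__)
print(ipex.__version__)
print(torch.xpu.is_available())  # expected to print True when an Intel dGPU is visible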

To install xpu_text_classifier:

  1. Clone the transformers_xpu repository from GitHub:

    git clone https://github.com/rahulunair/transformers_xpu.git
    cd transformers_xpu
  2. Install the package:

    python setup.py install
  3. Install the required dependencies:

    pip install datasets scikit-learn
  4. Optionally, install Weights & Biases to monitor your training process:

    pip install wandb
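
To confirm the fork installed correctly, you can import it and print its version (this assumes the transformers_xpu fork installs under the usual transformers package name; adjust if your checkout differs):

import transformers

print(transformers.__version__)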

Preparing Your Dataset

The dataset should be in a format compatible with Hugging Face's load_dataset function, which supports CSV, JSON, and several other formats. The dataset should have two columns: 'text' and 'label'. For multi-class classification, each label is a single integer; for multi-label classification, each label is a list of integers.

Multi-Class Classification Example:

text             label
This is text 1   0
This is text 2   2
This is text 3   1

Multi-Label Classification Example:

text             label
This is text 1   [0, 1]
This is text 2   [1, 2]
This is text 3   [0, 2]

After preparing your dataset, save it as JSON or CSV files inside a directory. The name of this directory is what you pass as the dataset_name parameter when using the TextClassifier.
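
As an illustration, the multi-class example above could be written out with the standard library (a sketch only; the directory name my_dataset and the file name train.csv are placeholders, and how classifier.py looks up files inside dataset_name should be checked against that script):

import csv
import os

rows = [
    ("This is text 1", 0),
    ("This is text 2", 2),
    ("This is text 3", 1),
]

os.makedirs("my_dataset", exist_ok=True)
with open("my_dataset/train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])  # the two expected columns
    writer.writerows(rows)

The directory name (my_dataset here) is then what you pass as dataset_name below.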

Usage

The script custom_finetune.py in the root directory is your entry point for training a model. By default, it uses the 'distilbert-base-uncased' model and a Gutenberg dataset with 30 labels.

You can either tweak the custom_finetune.py file or create a new Python file with these details:

Import TextClassifier from the classifier module:

import torch
import intel_extension_for_pytorch  # enables Intel XPU support in PyTorch

from classifier import TextClassifier

Instantiate the classifier:

classifier = TextClassifier(
    model_name="distilbert-base-uncased",
    dataset_name="path/to/your/dataset_directory",  # use the name of the directory where you saved your dataset
    num_labels=2,
    task_type="multi_class",
)
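
For a multi-label dataset like the second example above, the instantiation looks similar (a sketch only; the exact task_type string for multi-label tasks, assumed here to be "multi_label" by analogy with "multi_class", should be confirmed in classifier.py):

classifier = TextClassifier(
    model_name="distilbert-base-uncased",
    dataset_name="path/to/your/dataset_directory",
    num_labels=3,                 # three distinct labels: 0, 1, 2
    task_type="multi_label",      # assumption: mirrors the "multi_class" naming above
)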

Start Training:

classifier.train(epochs=10, batch=16, use_bf16=False)

You can specify the model name and number of labels (classes) when creating the classifier, and the number of epochs, batch size, and whether to use BF16 precision when calling train, as shown in the file custom_finetune.py.

To train on a single GPU:

python custom_finetune.py

To train using all available GPUs:

export MASTER_ADDR=127.0.0.1
source /home/orange/pytorch_xpu_orange/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/env/setvars.sh
mpirun -n 4 python custom_finetune.py

Replace 4 with the number of GPUs available in your system.
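
The setvars.sh path above is specific to one machine; to find the equivalent file in your own environment, you can print the package location from Python (a small helper, not part of the repo):

import os
import oneccl_bindings_for_pytorch as occl

# setvars.sh ships under the package's env/ directory
print(os.path.join(os.path.dirname(occl.__file__), "env", "setvars.sh"))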

Monitoring GPU Usage

To monitor the GPU usage:

xpu-smi dump -m5,18  # VRAM utilization
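
If you prefer checking device memory from inside Python, Intel Extension for PyTorch also exposes XPU memory counters analogous to PyTorch's CUDA ones (a quick check; availability of these counters can vary by IPEX version):

import torch
import intel_extension_for_pytorch  # noqa: F401  (enables the torch.xpu namespace)

print(f"{torch.xpu.memory_allocated() / 1024**2:.1f} MiB currently allocated")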

Additional Details

The custom_finetune.py script fetches an e-book from Project Gutenberg and prepares a dataset for the training task. The dataset is stored in the directory specified by dataset_name as a CSV file with two columns: text and label.

Please note that transformer models expect the labels to be integers. If your labels are strings, make sure to encode them into integers before passing them to the TextClassifier:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = ['cat', 'mat', 'bat', 'cat', 'bat']
encoded_labels = le.fit_transform(labels)  # -> [1, 2, 0, 1, 0]; classes are sorted alphabetically (bat=0, cat=1, mat=2)
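
The fitted encoder can also be kept around to map model predictions back to the original string labels:

decoded = le.inverse_transform(encoded_labels)
print(decoded.tolist())  # ['cat', 'mat', 'bat', 'cat', 'bat']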

For more details on the TextClassifier, refer to classifier.py.

Remember to check the script and adjust the parameters (model type, dataset, epochs, batch size, etc.) according to your needs.

Happy fine-tuning!
