Skip to content

A Label Studio plugin with InstanceGM for improving data labels for machine learning with machine learning

License

Notifications You must be signed in to change notification settings

aiml-au/noisylabels

Repository files navigation

NoisyLabels

NoisyLabels is a Label Studio plugin for image classification tasks with two modes of operation:

  • Mode 1: Assisted-Labeling with tolerance for labeling errors,
  • Mode 2: Label-Correction of a noisy dataset (e.g. scraped data).

Prior Work

Based on the paper InstanceGM: Instance-Dependent Noisy Label Learning via Graphical Modelling (IEEE/CVF WACV 2023 Round 1), also on Arxiv.

Badges at the time of acceptance 2022

PWC

PWC

PWC

PWC

PWC

PWC

PWC

PWC

PWC

PWC

PWC

PWC

Installation

We use Docker to run Noisy Labels in a container.

First, you will need a Label Studio installation. Noisy Labels has been tested with version 1.8.2. Login to Label Studio and note your API key (under Account and Settings):

API_key

Open docker-compose.yml and change LABEL_STUDIO_HOST to point at the address of your Label Studio installation; the default value assumes Label Studio is running on the same host as the docker container hosting Noisy Labels. Set LABEL_STUDIO_API_KEY to your API key.

Run docker compose up --build to start the Noisy Labels container.

Open the settings for a Label Studio project:

LS_project_settings

Go to Machine Learning and Add Model:

ML_add_model

Fill in the details:

add_model_modal

Assisted labeling

Set the ML-assisted labeling options in Label Studio and Noisy Labels will now be invoked by Label Studio to

  • train a model when new annotations are submitted
  • pre-fill examples with model predictions so that annotators can just accept them when they are correct.

Correcting noisy labels from existing dataset

Given an existing dataset defined by a simple folder structure like

dataset/
|- cat/
   |- image001.jpg
   |- image002.jpg
   |- ...
|- dog/
   |- image101.jpg
   |- image102.jpg
   |- ...
|- ...

A script is provided to import the dataset into Label Studio with the appropriate format. First, create a virtualenv or conda environment, and install the dependencies:

conda create -p ./env python=3.9
conda activate ./env
pip install -r requirements.txt

To generate an example dataset, there is a script to create one:

python dataset_cats_and_dogs/cats_and_dogs_noisy.py

This will create a dataset of cats and dogs in dataset_cats_and_dogs/intermediate/.

Then run the import script, providing your Label Studio API key:

 LABEL_STUDIO_API_KEY="5d39cf3f3a2fcda4cbde231dc710467b4b43ecc4" python labelstudio_import.py --dataset path/to/dataset/ --title "Label Studio project title"

Once imported, navigate to the project in Label Studio and setup Noisy Labels as above. After the settings are saved, Label Studio should trigger Noisy Labels to train:

ml-backend  | [2024-02-04 23:06:08,794] [INFO] [root::fit::180] Using 10000 training images with 10000 labels
ml-backend  | [2024-02-04 23:06:08,794] [INFO] [root::fit::182] Train model...
Epochs:   0%|          | 0/25 [00:00<?, ?it/s][2024-02-04 23:06:56,956] [INFO] [root::main::460] Warmup Net1
Project Created from SDK: Noisy Dogs and Cats:0.5-instance | Epoch [  0/ 25] Iter[ 79/ 79]       CE-loss: 0.5515[2024-02-04 23:07:02,672] [INFO] [root::main::462] Warmup Net2
Project Created from SDK: Noisy Dogs and Cats:0.5-instance | Epoch [  0/ 25] Iter[ 79/ 79]       CE-loss: 0.6847[2024-02-04 23:07:08,600] [INFO] [root::test::267]
...
Epochs: 100%|██████████| 25/25 [38:17<00:00, 132.87s/it][2024-02-04 23:45:06,148] [INFO] [root::main::470] Train Net1
Project Created from SDK: Noisy Dogs and Cats:0.5-instance | Epoch [ 25/ 25] Iter[125/125]       Labeled loss: 0.70  Unlabeled loss: 0.04[2024-02-04 23:46:04,489] [INFO] [root::main::489] Train Net2
Project Created from SDK: Noisy Dogs and Cats:0.5-instance | Epoch [ 25/ 25] Iter[122/122]       Labeled loss: 0.53  Unlabeled loss: 0.04[2024-02-04 23:47:04,809] [INFO] [root::test::267]
ml-backend  | [2024-02-04 23:47:19,252] [INFO] [root::fit::196] New model version: Project Created from SDK: Noisy Dogs and Cats_2024-02-04_23:06:08
ml-backend  | [2024-02-04 23:47:19,252] [INFO] [root::fit::198] fit() completed successfully.

Corrections are now available in JSON format:

curl -O localhost:9090/corrections.json

Generate a corrected dataset, applying only corrections above a certain score threshold:

python correct_dataset.py --threshold 0.95 corrections.json path/to/original/dataset/ path/to/corrected/dataset/

Licence

NoisyLabels is available for non-commercial internal research use by academic institutions or not-for-profit organisations only, free of charge. Please, see the license for further details. To the extent permitted by applicable law, your use is at your own risk and our liability is limited. Interested in a commercial license? For commercial queries, please email [email protected] with subject line "NoisyLabels Commercial License".

This is an AIML Shop project.

About

A Label Studio plugin with InstanceGM for improving data labels for machine learning with machine learning

Topics

Resources

License

Stars

Watchers

Forks

Languages