Synthetic identity documents dataset


DocXPand tool

Requirements

Generating documents requires poetry (used for the installation step below) and a Chrome web driver, whose path is passed to the generation script via the -w option.

Functionalities

This repository exposes functions to generate documents using templates and generators, contained in docxpand/templates:

  • Templates are SVG files containing information about the appearance of the documents to generate, i.e. their backgrounds, the fields contained in each document, the positions of these fields, etc.
  • Generators are JSON files describing how to generate the content of each field.
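As an illustration, the Python sketch below pairs each template SVG with a generator JSON that shares its name. The actual layout of docxpand/templates may differ, so treat the file-matching convention as an assumption and adjust it to what you find in the repository.

    from pathlib import Path

    # Hypothetical sketch: list template SVGs and look for a generator JSON
    # with the same stem. The real layout of docxpand/templates may differ.
    templates_dir = Path("docxpand/templates")
    for svg in sorted(templates_dir.rglob("*.svg")):
        generator = svg.with_suffix(".json")
        status = "generator found" if generator.exists() else "no generator with the same name"
        print(f"{svg.relative_to(templates_dir)}: {status}")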

This repository allows you to:

  • Generate documents for known templates (id_card_td1_a, id_card_td1_b, id_card_td2_a, id_card_td2_b, pp_td3_a, pp_td3_b, pp_td3_c, rp_card_td1 and rp_card_td2), by filling the templates with randomly generated fake information.
  • Integrate generated documents into scenes, replacing the documents originally present in those scenes.
    • This requires a dataset of background scenes usable for this task, with the coordinates of the original documents to be replaced by generated fake documents.
    • To integrate documents, use the insert_generated_documents_in_scenes.py script. It takes as input the directory containing the generated document images, a JSON dataset containing information about those document images (produced by the generation script above), the directory containing the "scene" (background) images, a JSON dataset containing localization information, and an output directory to store the final images. The background scene images must contain documents that are present in the docxpand/specimens directory. See the SOURCES.md file for more information.
    • All JSON datasets must follow the DocFakerDataset format defined in docxpand/dataset.py (see the inspection sketch after this list).
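The following sketch is one way to inspect such a dataset file by hand. The authoritative schema is the DocFakerDataset class in docxpand/dataset.py; the "documents" key used below is only an assumption for illustration.

    import json

    # Sketch: peek at a DocFakerDataset JSON file. The "documents" key is an
    # assumption; refer to docxpand/dataset.py for the actual schema.
    with open("documents_dataset.json", "r", encoding="utf-8") as handle:
        dataset = json.load(handle)

    print(sorted(dataset.keys()))                 # top-level keys
    documents = dataset.get("documents", {})
    print(f"{len(documents)} document entries")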

Installation

Run:

poetry install

Usage

To generate documents, run:

poetry run python scripts/generate_fake_structured_documents.py -n <number_to_generate> -o <output_directory> -t <template_to_use> -w <path_to_chrome_driver>
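For example, to generate 100 images of the id_card_td1_a template (the output directory and chromedriver path are placeholders, adapt them to your environment):

poetry run python scripts/generate_fake_structured_documents.py -n 100 -o ./generated_documents -t id_card_td1_a -w /usr/local/bin/chromedriver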

To insert documents into target images, run:

poetry run python scripts/insert_generated_documents_in_scenes.py -di <document_images_directory> -dd <documents_dataset> -si <scene_images_directory> -sd <scenes_dataset> -o <output_directory>
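For example, assuming generated documents and a scene dataset prepared as described above (all paths below are placeholders):

poetry run python scripts/insert_generated_documents_in_scenes.py -di ./generated_documents -dd ./generated_documents/dataset.json -si ./scenes/images -sd ./scenes/dataset.json -o ./final_images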

DocXPand-25k dataset

The synthetic ID document images dataset ("DocXPand-25k"), released alongside this tool, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

You can download the dataset from this release. It's split into 12 parts (DocXPand-25k.tar.gz.xx, from 00 to 11). Once you've downloaded all 12 binary files, you can extract the content using the following command: cat DocXPand-25k.tar.gz.* | tar xzvf -. The labels are stored in a JSON format, readable using the DocFakerDataset class. The document images are stored in the images/ folder, which contains one sub-folder per class. The original image fields (identity photos, ghost images, barcodes, datamatrices) integrated in the documents are stored in the fields/ sub-folder.
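As a quick sanity check after extraction, the following sketch counts images per class, assuming the archive unpacks into a DocXPand-25k/ folder laid out as described above (adjust the root path if your extraction differs):

    from collections import Counter
    from pathlib import Path

    # Sketch: count extracted images per document class. The root path below is
    # an assumption; point it at wherever you extracted the archive.
    images_dir = Path("DocXPand-25k/images")
    counts = Counter(path.parent.name for path in images_dir.rglob("*") if path.is_file())
    for class_name, count in sorted(counts.items()):
        print(f"{class_name}: {count}")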