Create dataset loader for Indo_MultiModal_CC_12M #307

SamuelCahyawijaya · 2022-10-02T16:07:11Z

Dataset	id_mm_cc_12m
Description	Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M). Indo_MultiModal_CC_12M is the Indonesian language version.
License	The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

acul3 · 2022-10-04T07:06:38Z

#self-assign

SamuelCahyawijaya added this to Nusantara Dataset Initiative Oct 2, 2022

muhsatrio added the hacktoberfest label Oct 3, 2022

github-actions bot assigned acul3 Oct 4, 2022
