dar or دار, which means house in Arabic, is a simple semi-supervised approach for creating Hugging Face dataset loading scripts. Here is an example of creating a loading script for a simple dataset:
demo.mp4
The main interface can be run using the following command

```
streamlit run app.py
```
Mainly, after entering the dataset name, the user will be prompted to enter the dataset link. The user can enter either one link or multiple links separated by commas. The supported links are the following:
GitHub
The user can enter a link from GitHub without using `raw`, as it will be converted automatically. The user can also choose to enter either a link to the repository in the format `https://github.com/user/repo`, which will download and extract the full directory, or `https://github.com/user/repo/folder`, which will download all the files from that folder as individual links.
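As a rough illustration, the conversion to a `raw` link might look like the following sketch; `to_raw_url` is a hypothetical helper written for this README, not part of dar's code:

```python
def to_raw_url(url: str) -> str:
    """Rewrite a GitHub file link of the form
    https://github.com/user/repo/blob/branch/path
    into https://raw.githubusercontent.com/user/repo/branch/path."""
    return url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")


print(to_raw_url("https://github.com/user/repo/blob/main/data/train.txt"))
# https://raw.githubusercontent.com/user/repo/main/data/train.txt
```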
Google Drive

The user can also enter a link from Google Drive in the form `https://drive.google.com/file/d/id/view`, which will directly download the file and extract it to the local disk. Google Sheets can also be used; they can be provided in the same format `https://docs.google.com/spreadsheets/d/id/view`. You can test with the following Google Sheet example.
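Under the hood, such a shared link has to be rewritten into a direct-download URL. A minimal sketch, assuming the standard `uc?export=download` endpoint (the helper name is made up for illustration):

```python
import re


def drive_download_url(url: str) -> str:
    """Extract the file id from a shared Google Drive link and
    build a direct-download URL from it."""
    file_id = re.search(r"/d/([^/]+)", url).group(1)
    return f"https://drive.google.com/uc?export=download&id={file_id}"


print(drive_download_url("https://drive.google.com/file/d/abc123/view"))
# https://drive.google.com/uc?export=download&id=abc123
```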
Direct links

The user can enter direct links to files, i.e. `https://domain/**.ext` with any extension, and the file will be downloaded. Multiple links can be concatenated using commas: `https://domain/file1.ext,https://domain/file2.ext,...,https://domain/filen.ext`
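Splitting such a comma-separated list is straightforward; the sketch below uses the placeholder URLs from above (so the actual download call is left commented out), and none of the names are dar's internals:

```python
import urllib.request

links = "https://domain/file1.ext,https://domain/file2.ext"
for url in links.split(","):
    filename = url.rsplit("/", 1)[-1]  # e.g. file1.ext
    print(f"would download {url} -> {filename}")
    # urllib.request.urlretrieve(url, filename)  # with a real link
```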
The user can use glob structures to filter out some files from being used in the dataset. For example, when prompted with `Enter an input structure`, the user can enter something like `folder/**.txt`, which will include text files with the extension `.txt` from `folder`. The user will be prompted to enter more glob structures until an empty `Enter` is pressed. Multiple glob structures are used for datasets that have inputs and multiple targets, like machine translation, summarization, speech transcription, etc.
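Glob matching of this kind can be pictured with `fnmatch` (illustrative only; dar's actual matching code may differ):

```python
from fnmatch import fnmatch

files = ["folder/train.txt", "folder/sub/extra.txt", "folder/readme.md"]
pattern = "folder/**.txt"

matched = [f for f in files if fnmatch(f, pattern)]
print(matched)  # ['folder/train.txt', 'folder/sub/extra.txt']
```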
The user will enter the file type when asked for `File Type`. The supported file formats can be one of the following (a reading sketch follows the list):

- `txt`: mainly for reading the file as a whole or separated by lines. To differentiate between these options, the user enters `Set Lines`: if `y`, the file will be separated into multiple lines; if `n`, it will be read as a whole.
- `csv`: used for files with a special separator; for example, `.tsv` and `.txt` can be part of this family if a special separator is used. The program will try to guess the best separator, but the user can also choose the separator using the command `CSV Separator`. The user can enter the separators as `tab`, `,`, `;`, `|`, etc.
- `json`: can be used for dictionary-like files. The user can choose `Set Lines` as well, which decides whether to split the file by new lines or read it as a whole. Some datasets can also have a parent dictionary, for example `{'data':{'col1': [...], 'col2': [...]}}`; to support that, the user can set `Json Key`, which is `data` in this example.
- `xml`: can be used for files that contain tags, for example `html` files. The user will be prompted to enter the column names; for example, given `<s>this is good</s><l>positive</l> .....`, the user can choose `s,l` at the `XML Columns` prompt to use those tags as columns.
- `xlsx`: used for `excel` file formats.
- `wav`: used for audio files like `mp3` and `wav` files. Upon choosing this, the program will automatically create the following features as columns: `{'audio':np.array(...)}`
- `jpg`: used for image files like `jpg` and `png` files. Upon choosing this, the program will automatically create the following features as columns: `{'image':np.array(...)}`
All the files will be processed using `pandas`. The user can modify some contents when prompted to (see the sketch after this list):

- `Skipped rows`: used to skip some lines from the beginning of all the files, mainly to remove metadata that is usually put at the header of files. The user can enter `0`, which indicates that no lines will be skipped.
- `Headers`: used to deal with files that have no column names; the user can set this to `False` and enter the column names in the next step.
- `New Column Names`: used to create different names for the columns, or to add columns if none exist.
- `Label Column Name`: used to choose the column that contains the labels. For example, in sentiment analysis we will have the contents in a column as `positive` or `negative`. The user can put the name of that column to recognize it as the label; `datasets` will convert it to an integer, which can be easily processed by NLP model pipelines.
- `push to hub`: used to upload the dataset to the Hub. The files will be uploaded to the directory `hf/DATASET_NAME`, where `hf` can be specified using the argument `--hf`.
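As a rough illustration of how these prompts map onto `pandas` and `datasets` calls (the file name and column names below are invented for the example; dar's internals may differ):

```python
import pandas as pd
from datasets import Dataset

# Skipped rows = 0, Headers = False, New Column Names = text,sentiment
df = pd.read_csv(
    "reviews.csv",                # hypothetical input file
    skiprows=0,                   # no metadata lines to drop
    header=None,                  # the file itself has no column names
    names=["text", "sentiment"],  # names supplied by the user
)

ds = Dataset.from_pandas(df)

# Label Column Name = sentiment: encode the string labels as integers
ds = ds.class_encode_column("sentiment")

# push to hub: upload as hf/DATASET_NAME, with hf taken from --hf
ds.push_to_hub("hf/DATASET_NAME")
```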