This project enables you to convert the Million Song Dataset into an Elasticsearch index.
Elasticsearch is a distributed, RESTful search and analytics engine that allows powerful text searches. Although MSD is an audio-featured focused dataset, it also contains metadata that one wants to make research with.
You need the Python elasticsearch and tables packages. I suggest you to work in a Python virtual environment, it's a good practice.
Set up your virtualenv:
pip install virtualenv
virtualenv ~/.env/elasticmsd
source ~/.env/elasticmsd/bin/activate
Install dependencies:
git clone https://github.com/deezer/elasticmsd
cd elasticmsd
pip install -r requirements.txt
Install hdf5_getters.py from from tbertinmahieux/MsongDB repository. You must then run a pt2to3 on this file (program shipped with tables package) even if you stay in Python2. hdf5_getters uses an old tables convention:
wget https://raw.githubusercontent.com/tbertinmahieux/MSongsDB/master/PythonSrc/hdf5_getters.py -O hdf5_getters_2.py
pt2to3 hdf5_getters_2.py > hdf5_getters.py
rm hdf5_getters_2.py
Download MSD summary file (~300Mo):
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/msd_summary_file.h5 -O msd_summary_file.h5
Or Download the full MSD (~200Go) from OSCD:
rsync -avzuP publicdata.opensciencedatacloud.org::ark:/31807/osdc-c1c763e4/ /path/to/local_copy
If you need so, you can install a local instance of an Elasticsearch server via docker:
docker run --rm -p 9200:9200 -p 9300:9300 -d --name=local_elasticsearch elasticsearch:2.3
This command will browse the MSD summary file (a big h5 file) to an Elasticsearch index.
Note: If you want to browse the entire dataset and not just the summary, use the -d argument like
-d /path/to/local/msd
python msd_to_es.py \
-H localhost \
-p 9200 \
-i research_msd \
-f \
-m msd_summary_file.h5
Output logs will look like:
2018-03-13 11:01:13,702 Found 1000000 songs in summary file
2018-03-13 11:01:17,037 1000 files read. Bulk ingest.
2018-03-13 11:01:17,037 Last MSD id read: TRMMENV12903CDDA6A
2018-03-13 11:01:22,221 2000 files read. Bulk ingest.
2018-03-13 11:01:22,221 Last MSD id read: TRMWQUX12903CD7496
python msd_to_es.py -h
usage: msd_to_es.py [-h] [-H ESHOST] [-p ESPORT] [-i ESINDEX] [-t ESTYPE]
[-m MSDSUMMARYFILE] [-f]
optional arguments:
-h, --help show this help message and exit
-H ESHOST, --eshost ESHOST
Host of elasticsearch.
-p ESPORT, --esport ESPORT
Port of elasticsearch host.
-i ESINDEX, --esindex ESINDEX
Name of index to store to.
-t ESTYPE, --estype ESTYPE
Type of index to store to.
-m MSDSUMMARYFILE, --msdsummaryfile MSDSUMMARYFILE
MSD summary file (one h5 file for 1M songs)
-d MSDDIRECTORY, --msddirectory MSDDIRECTORY
MSD directory strucutre (one h5 file per song)
-f, --force Force writing in existing ES index.
The Document in Elasticsearch will look like this:
{
"msd_tempo" : 120.299,
"msd_artist_name" : "Darrell Scott",
"msd_artist_mbid" : "98063361-cdd8-4a9e-b95c-1f29bff780d6",
"msd_title" : "Shattered Cross",
"msd_artist_id" : "ARZKPUC1187B99052C",
"msd_year" : 2006,
"msd_duration" : 325.53751,
"msd_mode" : 1,
"msd_artist_location" : "London, KY",
"msd_release" : "Transatlantic Sessions - Series 3: Volume One",
"msd_key" : 9
}