Contributing¶
We encourage contributions to mirdata, especially new dataset loaders. To contribute a new loader, follow the steps indicated below and create a Pull Request (PR) to the GitHub repository. For any doubt or comment about your contribution, you can always submit an issue or open a discussion in the repository.
To reduce friction, we may make commits on top of contributors' PRs. If you do not want us to, please tag your PR with please-do-not-edit.
Installing mirdata for development purposes¶
To install mirdata for development purposes:
First run:
git clone https://github.com/mir-dataset-loaders/mirdata.git
Then, enter the source directory and install the dependencies for updating the documentation and running tests:
pip install .
pip install .[tests]
pip install .[docs]
pip install .[dali]
We recommend installing pyenv to manage your Python versions and install all mirdata requirements. You will want to install the latest versions of Python 3.6 and 3.7. Once pyenv and the Python versions are configured, install pytest. Make sure you install all the pytest plugins needed to automatically test your code. Finally, run:
pytest tests/ --local
All tests should pass!
Writing a new dataset loader¶
The steps to add a new dataset loader to mirdata
are:
Before starting, check if your dataset falls into one of these non-standard cases:
Is the dataset not freely downloadable? If so, see this section
Does the dataset require dependencies not currently in mirdata? If so, see this section
Does the dataset have multiple versions? If so, see this section
1. Create an index¶
mirdata's structure relies on indexes. Indexes are dictionaries that contain information about the structure of the dataset, which is necessary for the loading and validating functionalities of mirdata. In particular, indexes contain information about the files included in the dataset, their location, and checksums. The necessary steps are:
To create an index, first create a script in scripts/, as make_dataset_index.py, which generates an index file.
Then run the script on the canonical version of the dataset and save the index in mirdata/datasets/indexes/ as dataset_index.json.
Here is an example of an index script to use as a guideline:
Example Make Index Script
import argparse
import glob
import json
import os

from mirdata.validate import md5

DATASET_INDEX_PATH = "../mirdata/datasets/indexes/dataset_index.json"


def make_dataset_index(dataset_data_path):
    annotation_dir = os.path.join(dataset_data_path, "annotation")
    annotation_files = glob.glob(os.path.join(annotation_dir, "*.lab"))
    track_ids = sorted([os.path.basename(f).split(".")[0] for f in annotation_files])

    # top-key level metadata
    metadata_checksum = md5(os.path.join(dataset_data_path, "id_mapping.txt"))
    index_metadata = {"metadata": {"id_mapping": ("id_mapping.txt", metadata_checksum)}}

    # top-key level tracks
    index_tracks = {}
    for track_id in track_ids:
        audio_checksum = md5(
            os.path.join(dataset_data_path, "Wavfile/{}.wav".format(track_id))
        )
        annotation_checksum = md5(
            os.path.join(dataset_data_path, "annotation/{}.lab".format(track_id))
        )
        index_tracks[track_id] = {
            "audio": ("Wavfile/{}.wav".format(track_id), audio_checksum),
            "annotation": ("annotation/{}.lab".format(track_id), annotation_checksum),
        }

    # top-key level version
    dataset_index = {"version": None}

    # combine all in dataset index
    dataset_index.update(index_metadata)
    dataset_index.update({"tracks": index_tracks})

    with open(DATASET_INDEX_PATH, "w") as fhandle:
        json.dump(dataset_index, fhandle, indent=2)


def main(args):
    make_dataset_index(args.dataset_data_path)


if __name__ == "__main__":
    PARSER = argparse.ArgumentParser(description="Make dataset index file.")
    PARSER.add_argument(
        "dataset_data_path", type=str, help="Path to dataset data folder."
    )
    main(PARSER.parse_args())
More examples of scripts used to create dataset indexes can be found in the scripts folder.
tracks¶
Most MIR datasets are organized as a collection of tracks and annotations. In such cases, the index should make use of the tracks top-level key. A dictionary should be stored under the tracks top-level key, where the keys are the unique track ids of the dataset.
The values are a dictionary of files associated with a track id, along with their checksums. These files can be for instance audio files
or annotations related to the track id. File paths are relative to the top level directory of a dataset.
Index Examples - Tracks
If the version 1.0 of a given dataset has the structure:
> Example_Dataset/
> audio/
track1.wav
track2.wav
track3.wav
> annotations/
track1.csv
Track2.csv
track3.csv
> metadata/
metadata_file.csv
The top level directory is Example_Dataset and the relative path for track1.wav would be audio/track1.wav. Any unavailable fields are indicated with null. A possible index file for this example would be:
{
    "version": "1.0",
    "tracks": {
        "track1": {
            "audio": [
                "audio/track1.wav",  // the relative path for track1's audio file
                "912ec803b2ce49e4a541068d495ab570"  // track1.wav's md5 checksum
            ],
            "annotation": [
                "annotations/track1.csv",  // the relative path for track1's annotation
                "2cf33591c3b28b382668952e236cccd5"  // track1.csv's md5 checksum
            ]
        },
        "track2": {
            "audio": [
                "audio/track2.wav",
                "65d671ec9787b32cfb7e33188be32ff7"
            ],
            "annotation": [
                "annotations/Track2.csv",
                "e1964798cfe86e914af895f8d0291812"
            ]
        },
        "track3": {
            "audio": [
                "audio/track3.wav",
                "60edeb51dc4041c47c031c4bfb456b76"
            ],
            "annotation": [
                "annotations/track3.csv",
                "06cb006cc7b61de6be6361ff904654b3"
            ]
        }
    },
    "metadata": {
        "metadata_file": [
            "metadata/metadata_file.csv",
            "7a41b280c7b74e2ddac5184708f9525b"
        ]
    }
}
Note
In this example there is a (purposeful) mismatch between the name of the audio file track2.wav
and its corresponding annotation file, Track2.csv
, compared with the other pairs. This mismatch should be included in the index. This type of slight difference in filenames happens often in publicly available datasets, making pairing audio and annotation files more difficult. We use a fixed, version-controlled index to account for this kind of mismatch, rather than relying on string parsing on load.
multitracks¶
Index Examples - Multitracks
If the version 1.0 of a given multitrack dataset has the structure:
> Example_Dataset/
> audio/
multitrack1-voice1.wav
multitrack1-voice2.wav
multitrack1-accompaniment.wav
multitrack1-mix.wav
multitrack2-voice1.wav
multitrack2-voice2.wav
multitrack2-accompaniment.wav
multitrack2-mix.wav
> annotations/
multitrack1-voice-f0.csv
multitrack2-voice-f0.csv
multitrack1-f0.csv
multitrack2-f0.csv
> metadata/
metadata_file.csv
The top level directory is Example_Dataset and the relative path for multitrack1-voice1 would be audio/multitrack1-voice1.wav. Any unavailable fields are indicated with null. A possible index file for this example would be:
{
    "version": "1.0",
    "tracks": {
        "multitrack1-voice": {
            "audio_voice1": ["audio/multitrack1-voice1.wav", checksum],
            "audio_voice2": ["audio/multitrack1-voice2.wav", checksum],
            "voice-f0": ["annotations/multitrack1-voice-f0.csv", checksum]
        },
        "multitrack1-accompaniment": {
            "audio_accompaniment": ["audio/multitrack1-accompaniment.wav", checksum]
        },
        "multitrack2-voice": {...},
        ...
    },
    "multitracks": {
        "multitrack1": {
            "tracks": ["multitrack1-voice", "multitrack1-accompaniment"],
            "audio": ["audio/multitrack1-mix.wav", checksum],
            "f0": ["annotations/multitrack1-f0.csv", checksum]
        },
        "multitrack2": {...}
    },
    "metadata": {
        "metadata_file": [
            "metadata/metadata_file.csv",
            "7a41b280c7b74e2ddac5184708f9525b"
        ]
    }
}
Note that in this example we group audio_voice1 and audio_voice2 in a single Track because the annotation voice-f0 corresponds to their mixture. In contrast, the annotation f0 is extracted from the multitrack mix, so it is stored in the multitracks group. The multitrack multitrack1 has an additional track multitrack1-mix.wav, which may be the master track, the final mix, or a recording of multitrack1 with another microphone.
records¶
Index Examples - Records
Coming soon
2. Create a module¶
Once the index is created you can create the loader. For that, we suggest you use the following template and adjust it for your dataset. To quickstart a new module:
Copy the example below and save it to mirdata/datasets/<your_dataset_name>.py
Find & Replace Example with <your_dataset_name>
Remove any lines beginning with # -- which are there as guidelines.
Example Module
"""Example Dataset Loader

.. admonition:: Dataset Info
    :class: dropdown

    Please include the following information at the top level docstring for the dataset's module `dataset.py`:

    1. Describe annotations included in the dataset
    2. Indicate the size of the dataset (e.g. number of files and duration, hours)
    3. Mention the origin of the dataset (e.g. creator, institution)
    4. Describe the type of music included in the dataset
    5. Indicate any relevant papers related to the dataset
    6. Include a description about how the data can be accessed and the license it uses (if applicable)

"""
import csv
import json
import logging
import os
from typing import BinaryIO, Optional, TextIO, Tuple

import librosa
import numpy as np

# -- import whatever you need here and remove
# -- example imports you won't use
from mirdata import annotations, core, download_utils, io, jams_utils

# -- Add any relevant citations here
BIBTEX = """
@article{article-minimal,
    author = "L[eslie] B. Lamport",
    title = "The Gnats and Gnus Document Preparation System",
    journal = "G-Animal's Journal",
    year = "1986"
},
@article{article-minimal2,
    author = "L[eslie] B. Lamport",
    title = "The Gnats and Gnus Document Preparation System 2",
    journal = "G-Animal's Journal",
    year = "1987"
}
"""

# -- REMOTES is a dictionary containing all files that need to be downloaded.
# -- The keys should be descriptive (e.g. 'annotations', 'audio').
# -- When having data that can be partially downloaded, remember to set up
# -- destination_dir correctly so the files are downloaded into the correct structure.
REMOTES = {
    'remote_data': download_utils.RemoteFileMetadata(
        filename='a_zip_file.zip',
        url='http://website/hosting/the/zipfile.zip',
        checksum='00000000000000000000000000000000',  # -- the md5 checksum
        destination_dir='path/to/unzip'  # -- relative path for where to unzip the data, or None
    ),
}

# -- Include any information that should be printed when downloading.
# -- Remove this variable if you don't need to print anything during download.
DOWNLOAD_INFO = """
Include any information you want to be printed when dataset.download() is called.
These can be instructions for how to download the dataset (e.g. request access on zenodo),
caveats about the download, etc.
"""

# -- Include the dataset's license information
LICENSE_INFO = """
The dataset's license information goes here.
"""
class Track(core.Track):
    """Example track class

    # -- YOU CAN AUTOMATICALLY GENERATE THIS DOCSTRING BY CALLING THE SCRIPT:
    # -- `scripts/print_track_docstring.py my_dataset`
    # -- note that you'll first need to have a test track (see "Adding tests to your dataset" below)

    Args:
        track_id (str): track id of the track

    Attributes:
        track_id (str): track id
        # -- Add any of the dataset specific attributes here

    """
    def __init__(self, track_id, data_home, dataset_name, index, metadata):

        # -- this sets the following attributes:
        # -- * track_id
        # -- * _dataset_name
        # -- * _data_home
        # -- * _track_paths
        # -- * _track_metadata
        super().__init__(
            track_id,
            data_home,
            dataset_name=dataset_name,
            index=index,
            metadata=metadata,
        )

        # -- add any dataset specific attributes here
        self.audio_path = self.get_path("audio")
        self.annotation_path = self.get_path("annotation")

    # -- If the dataset has metadata that needs to be accessed by Tracks,
    # -- such as a table mapping track ids to composers for the full dataset,
    # -- add them as properties like below, instead of in the __init__.
    @property
    def composer(self) -> Optional[str]:
        return self._track_metadata.get("composer")

    # -- `annotation` will behave like an attribute, but it will only be loaded
    # -- and saved when someone accesses it. Useful when loading slightly
    # -- bigger files or for bigger datasets. By default, we make any time
    # -- series data loaded from a file a cached property
    @core.cached_property
    def annotation(self) -> Optional[annotations.EventData]:
        """output type: description of output"""
        return load_annotation(self.annotation_path)

    # -- `audio` will behave like an attribute, but it will only be loaded
    # -- when someone accesses it and it won't be stored. By default, we make
    # -- any memory heavy information (like audio) properties
    @property
    def audio(self) -> Optional[Tuple[np.ndarray, float]]:
        """(np.ndarray, float): DESCRIPTION audio signal, sample rate"""
        return load_audio(self.audio_path)

    # -- we use the to_jams function to convert all the annotations in the JAMS format.
    # -- The converter takes as input all the annotations in the proper format (e.g. beats
    # -- will be fed as beat_data=[(self.beats, None)], see jams_utils), and returns a jams
    # -- object with the annotations.
    def to_jams(self):
        """Jams: the track's data in jams format"""
        return jams_utils.jams_converter(
            audio_path=self.audio_path,
            annotation_data=[(self.annotation, None)],
            metadata=self._metadata,
        )
        # -- see the documentation for `jams_utils.jams_converter` for all fields
# -- if the dataset contains multitracks, you can define a MultiTrack similar to a Track
# -- you can delete the block of code below if the dataset has no multitracks
class MultiTrack(core.MultiTrack):
    """Example multitrack class

    Args:
        mtrack_id (str): multitrack id
        data_home (str): Local path where the dataset is stored.
            If `None`, looks for the data in the default directory, `~/mir_datasets/Example`

    Attributes:
        mtrack_id (str): track id
        tracks (dict): {track_id: Track}
        track_audio_attribute (str): the name of the attribute of Track which
            returns the audio to be mixed
        # -- Add any of the dataset specific attributes here

    """
    def __init__(self, mtrack_id, data_home):
        self.mtrack_id = mtrack_id
        self._data_home = data_home

        # these three attributes below must have exactly these names
        self.track_ids = [...]  # define which track_ids should be part of the multitrack
        self.tracks = {t: Track(t, self._data_home) for t in self.track_ids}
        self.track_audio_property = "audio"  # the property of Track which returns the relevant audio file for mixing

        # -- optionally add any multitrack specific attributes here
        self.mix_path = ...  # this can be called whatever makes sense for the datasets
        self.annotation_path = ...

    # -- multitracks can optionally have mix-level cached properties and properties
    @core.cached_property
    def annotation(self) -> Optional[annotations.EventData]:
        """output type: description of output"""
        return load_annotation(self.annotation_path)

    @property
    def audio(self) -> Optional[Tuple[np.ndarray, float]]:
        """(np.ndarray, float): DESCRIPTION audio signal, sample rate"""
        return load_audio(self.mix_path)

    # -- multitrack classes are themselves Tracks, and also need a to_jams method
    # -- for any mixture-level annotations
    def to_jams(self):
        """Jams: the track's data in jams format"""
        return jams_utils.jams_converter(
            audio_path=self.mix_path,
            annotation_data=[(self.annotation, None)],
            ...
        )
        # -- see the documentation for `jams_utils.jams_converter` for all fields
# -- this decorator allows this function to take a string or an open bytes file as input
# -- and in either case converts it to an open file handle.
# -- It also checks if the file exists
# -- and, if None is passed, None will be returned
@io.coerce_to_bytes_io
def load_audio(fhandle: BinaryIO) -> Tuple[np.ndarray, float]:
    """Load a Example audio file.

    Args:
        fhandle (str or file-like): path or file-like object pointing to an audio file

    Returns:
        * np.ndarray - the audio signal
        * float - The sample rate of the audio file

    """
    # -- for example, the code below. This should be dataset specific!
    # -- By default we load to mono
    # -- change this if it doesn't make sense for your dataset.
    return librosa.load(fhandle, sr=None, mono=True)


# -- Write any necessary loader functions for loading the dataset's data

# -- this decorator allows this function to take a string or an open file as input
# -- and in either case converts it to an open file handle.
# -- It also checks if the file exists
# -- and, if None is passed, None will be returned
@io.coerce_to_string_io
def load_annotation(fhandle: TextIO) -> Optional[annotations.EventData]:
    # -- because of the decorator, the file is already open
    reader = csv.reader(fhandle, delimiter=' ')
    intervals = []
    annotation = []
    for line in reader:
        intervals.append([float(line[0]), float(line[1])])
        annotation.append(line[2])

    annotation_data = annotations.EventData(
        np.array(intervals), np.array(annotation)
    )
    return annotation_data
# -- use this decorator so the docs are complete
@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
    """The Example dataset
    """
    def __init__(self, data_home=None):
        super().__init__(
            data_home,
            name=NAME,
            track_class=Track,
            bibtex=BIBTEX,
            remotes=REMOTES,
            download_info=DOWNLOAD_INFO,
            license_info=LICENSE_INFO,
        )

    # -- Copy any loaders you wrote that should be part of the Dataset class
    # -- use this core.copy_docs decorator to copy the docs from the original
    # -- load_ function
    @core.copy_docs(load_audio)
    def load_audio(self, *args, **kwargs):
        return load_audio(*args, **kwargs)

    @core.copy_docs(load_annotation)
    def load_annotation(self, *args, **kwargs):
        return load_annotation(*args, **kwargs)

    # -- if your dataset has a top-level metadata file, write a loader for it here
    # -- you do not have to include this function if there is no metadata
    @core.cached_property
    def _metadata(self):
        # load metadata however makes sense for your dataset
        metadata_path = os.path.join(self.data_home, 'example_metadata.json')
        with open(metadata_path, 'r') as fhandle:
            metadata = json.load(fhandle)

        return metadata

    # -- if your dataset needs to overwrite the default download logic, do it here.
    # -- this function is usually not necessary unless you need very custom download logic
    def download(
        self, partial_download=None, force_overwrite=False, cleanup=False
    ):
        """Download the dataset

        Args:
            partial_download (list or None):
                A list of keys of remotes to partially download.
                If None, all data is downloaded
            force_overwrite (bool):
                If True, existing files are overwritten by the downloaded files.
            cleanup (bool):
                Whether to delete any zip/tar files after extracting.

        Raises:
            ValueError: if invalid keys are passed to partial_download
            IOError: if a downloaded file's checksum is different from expected

        """
        # see download_utils.downloader for basic usage - if you only need to call downloader
        # once, you do not need this function at all.
        # only write a custom function if you need it!
You may find existing loaders useful as references; for many more examples, see the datasets folder.
3. Add tests¶
To finish your contribution, include tests that check the integrity of your loader. For this, follow these steps:
Make a toy version of the dataset in the tests folder, tests/resources/mir_datasets/my_dataset/, so you can test against little data. For example:
Include all audio and annotation files for one track of the dataset
For each audio/annotation file, reduce the audio length to 1-2 seconds and remove all but a few of the annotations.
If the dataset has a metadata file, reduce the length to a few lines.
Test all of the dataset specific code, e.g. the public attributes of the Track class, the load functions and any other custom functions you wrote. See the tests folder for reference. If your loader has a custom download function, add tests similar to this loader.
Locally run pytest -s tests/test_full_dataset.py --local --dataset my_dataset before submitting your loader to make sure everything is working.
Note
We have written automated tests for all loaders' cite, download, validate, load, and track_ids functions, as well as some basic edge cases of the Track class, so you don't need to write tests for these!
Example Test File
import numpy as np

from mirdata import annotations
from mirdata.datasets import example
from tests.test_utils import run_track_tests


def test_track():
    default_trackid = "some_id"
    data_home = "tests/resources/mir_datasets/dataset"
    dataset = example.Dataset(data_home)
    track = dataset.track(default_trackid)

    expected_attributes = {
        "track_id": "some_id",
        "audio_path": "tests/resources/mir_datasets/example/" + "Wavfile/some_id.wav",
        "song_id": "some_id",
        "annotation_path": "tests/resources/mir_datasets/example/annotation/some_id.pv",
    }

    expected_property_types = {"annotation": annotations.XData}

    assert track._track_paths == {
        "audio": ["Wavfile/some_id.wav", "278ae003cb0d323e99b9a643c0f2eeda"],
        "annotation": ["Annotation/some_id.pv", "0d93a011a9e668fd80673049089bbb14"],
    }

    run_track_tests(track, expected_attributes, expected_property_types)

    # test audio loading functions
    audio, sr = track.audio
    assert sr == 44100
    assert audio.shape == (44100 * 2,)


def test_to_jams():
    default_trackid = "some_id"
    data_home = "tests/resources/mir_datasets/dataset"
    dataset = example.Dataset(data_home)
    track = dataset.track(default_trackid)
    jam = track.to_jams()

    annotations = jam.search(namespace="annotation")[0]["data"]
    assert [annotation.time for annotation in annotations] == [0.027, 0.232]
    assert [annotation.duration for annotation in annotations] == [
        0.20500000000000002,
        0.736,
    ]
    # ... etc


def test_load_annotation():
    # load a file which exists
    annotation_path = "tests/resources/mir_datasets/dataset/Annotation/some_id.pv"
    annotation_data = example.load_annotation(annotation_path)

    # check types
    assert type(annotation_data) == annotations.XData
    assert type(annotation_data.times) is np.ndarray
    # ... etc

    # check values
    assert np.array_equal(annotation_data.times, np.array([0.016, 0.048]))
    # ... etc


def test_metadata():
    data_home = "tests/resources/mir_datasets/dataset"
    dataset = example.Dataset(data_home)
    metadata = dataset._metadata
    assert metadata["some_id"] == "something"
Running your tests locally¶
Before creating a PR, you should run all the tests locally like this:
pytest tests/ --local
The --local flag skips tests that are built to run only on the remote testing environment.
To run one specific test file:
pytest tests/test_ikala.py
Finally, there is one local test you should run, which we can’t easily run in our testing environment.
pytest -s tests/test_full_dataset.py --local --dataset dataset
Where dataset is the name of the module of the dataset you added. The -s flag tells pytest not to skip print statements, which is useful here for seeing the download progress bar when testing the download function.
This tests that your dataset downloads, validates, and loads properly for every track. This test takes a long time for some datasets, but it’s important to ensure the integrity of the library.
We’ve added one extra convenience flag for this test, for getting the tests running when the download is very slow:
pytest -s tests/test_full_dataset.py --local --dataset my_dataset --skip-download
which will skip the downloading step. Note that this is just for convenience during debugging - the tests should eventually all pass without this flag.
Working with big datasets¶
When developing a loader for a large dataset, it is advisable to create an index that is as small as possible, to speed up the implementation of the loader and keep the tests fast.
Working with remote indexes¶
For the end-user there is no difference between the remote and local indexes. However, indexes can get large when there are a lot of tracks in the dataset. In these cases, storing and accessing an index remotely can be convenient. Large indexes can be added to REMOTES, and will be downloaded with the rest of the dataset. For example:
"index": download_utils.RemoteFileMetadata(
filename="acousticbrainz_genre_index.json.zip",
url="https://zenodo.org/record/4298580/files/acousticbrainz_genre_index.json.zip?download=1",
checksum="810f1c003f53cbe58002ba96e6d4d138",
)
Unlike local indexes, remote indexes live in the data_home directory. When creating the Dataset object, specify the custom_index_path to where the index will be downloaded (as a relative path to data_home).
Reducing the testing space usage¶
We are trying to keep the test resources folder size as small as possible, because it can get really heavy as new loaders are added. We kindly ask the contributors to reduce the size of the testing data if possible (e.g. trimming the audio tracks, keeping just two rows for csv files).
4. Submit your loader¶
Before you submit your loader make sure to:
Add your module to docs/source/mirdata.rst following alphabetical order
Add your module to docs/source/table.rst following alphabetical order, as follows:
* - Dataset
- Downloadable?
- Annotation Types
- Tracks
- License
An example of this for the Beatport EDM key
dataset:
* - Beatport EDM key
- - audio: ✅
- annotations: ✅
- - global :ref:`key`
- 1486
- .. image:: https://licensebuttons.net/l/by-sa/3.0/88x31.png
:target: https://creativecommons.org/licenses/by-sa/4.0
(you can check that this was done correctly by clicking on the readthedocs check when you open a PR). You can find license badges images and links here.
Pull Request template¶
When starting your PR please use the new_loader.md template; it will simplify the reviewing process and also help you make a complete PR. You can do that by adding &template=new_loader.md at the end of the url when you are creating the PR:
...mir-dataset-loaders/mirdata/compare?expand=1
will become
...mir-dataset-loaders/mirdata/compare?expand=1&template=new_loader.md
.
Docs¶
Staged docs for every new PR are built, and you can look at them by clicking on the “readthedocs” test in a PR.
To quickly troubleshoot any issues, you can build the docs locally by navigating to the docs folder and running make html (note: you must have sphinx installed). Then open the generated _build/source/index.html file in your web browser to view.
Troubleshooting¶
If github shows a red X
next to your latest commit, it means one of our checks is not passing. This could mean:
running black has failed – this means that your code is not formatted according to black's code style. To fix this, simply run the following from inside the top level folder of the repository:
black --target-version py38 mirdata/ tests/
the test coverage is too low – this means that there are too many new lines of code introduced that are not tested.
the docs build has failed – this means that one of the changes you made to the documentation has caused the build to fail. Check the formatting in your changes and make sure they are consistent.
the tests have failed – this means at least one of the tests is failing. Run the tests locally to make sure they are passing. If they are passing locally but failing in the check, open an issue and we can help debug.
Common non-standard cases¶
Not fully-downloadable datasets¶
Sometimes, parts of music datasets are not freely available due to e.g. copyright restrictions. In these cases, we aim to make sure that the version used in mirdata is the original one, and not a variant.
Before starting a PR, if a dataset is not fully downloadable:
Contact the mirdata team by opening an issue or PR so we can discuss how to proceed with the closed dataset.
Show that the version used to create the checksum is the “canonical” one, either by getting the version from the dataset creator, or by verifying equivalence with several other copies of the dataset.
Datasets needing extra dependencies¶
If a new dataset requires a library that is not included in setup.py, please open an issue. In general, if the new library will be useful for many future datasets, we will add it as a dependency. If it is specific to one dataset, we will add it as an optional dependency.
To add an optional dependency, add the dataset name as a key in extras_require in setup.py, and list any additional dependencies. When importing these optional dependencies in the dataset module, use a try/except clause and log instructions if the user hasn't installed the extra requirements.
For example, if a module called example_dataset requires a module called asdf, it should be imported as follows:
try:
    import asdf
except ImportError:
    logging.error(
        "In order to use example_dataset you must have asdf installed. "
        "Please reinstall mirdata using `pip install 'mirdata[example_dataset]'`"
    )
    raise ImportError
Datasets with multiple versions¶
We do not currently support datasets with multiple versions, however we are actively working on supporting them. For the latest progress, see the open issue.
Documentation¶
This documentation is in rst format. It is built using Sphinx and hosted on readthedocs. The API documentation is built using autodoc, which autogenerates documentation from the code’s docstrings. We use the napoleon plugin for building docs in Google docstring style. See the next section for docstring conventions.
mirdata uses Google’s Docstring formatting style. Here are some common examples.
Note
The small formatting details in these examples are important. Differences in new lines, indentation, and spacing make
a difference in how the documentation is rendered. For example writing Returns:
will render correctly, but Returns
or Returns :
will not.
Functions:
def add_to_list(list_of_numbers, scalar):
    """Add a scalar to every element of a list.
    You can write a continuation of the function description here on the next line.

    You can optionally write more about the function here. If you want to add an example
    of how this function can be used, you can do it like below.

    Example:
        .. code-block:: python

            foo = add_to_list([1, 2, 3], 2)

    Args:
        list_of_numbers (list): A short description that fits on one line.
        scalar (float):
            Description of the second parameter. If there is a lot to say you can
            overflow to a second line.

    Returns:
        list: Description of the return. The type here is not in parentheses

    """
    return [x + scalar for x in list_of_numbers]
Functions with more than one return value:
def multiple_returns():
    """This function has no arguments, but more than one return value. Autodoc with napoleon doesn't handle this well,
    and we use this formatting as a workaround.

    Returns:
        * int - the first return value
        * bool - the second return value

    """
    return 42, True
One-line docstrings
def some_function():
    """
    One line docstrings must be on their own separate line, or autodoc does not build them properly
    """
    ...
Objects
"""Description of the class
overflowing to a second line if it's long

Some more details here

Args:
    foo (str): First argument to the __init__ method
    bar (int): Second argument to the __init__ method

Attributes:
    foobar (str): First track attribute
    barfoo (bool): Second track attribute

Cached Properties:
    foofoo (list): Cached properties are special mirdata attributes
    barbar (None): They are lazy loaded properties.
    barf (bool): Document them with this special header.

"""
Conventions¶
Loading from files¶
We use the following libraries for loading data from files:
Format | library
--- | ---
audio (wav, mp3, …) | librosa
midi | pretty_midi
json | json
csv | csv
jams | jams
If a file format needed for a dataset is not included in this list, please see the extra dependencies section.
Track Attributes¶
Custom track attributes should be global, track-level data.
For some datasets, there is a separate, dataset-level metadata file with track-level metadata, e.g. as a csv. When a single file is needed for more than one track, we recommend writing a _metadata cached property (which returns a dictionary, either keyed by track_id or freeform) in the Dataset class (see the dataset module example code above). When this is specified, it will populate a track's hidden _track_metadata field, which can be accessed from the Track class.
For example, if _metadata
returns a dictionary of the form:
{
'track1': {
'artist': 'A',
'genre': 'Z'
},
'track2': {
'artist': 'B',
'genre': 'Y'
}
}
the _track_metadata for track_id=track2 will be:
{
'artist': 'B',
'genre': 'Y'
}
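This wiring can be sketched in plain Python. The class below is a simplified stand-in, not mirdata's actual Track implementation:

```python
metadata = {
    "track1": {"artist": "A", "genre": "Z"},
    "track2": {"artist": "B", "genre": "Y"},
}


class Track:
    def __init__(self, track_id, metadata):
        self.track_id = track_id
        # mirdata populates this hidden field from the Dataset's _metadata
        self._track_metadata = metadata.get(track_id, {})

    @property
    def artist(self):
        # .get returns None when the field is missing for this track
        return self._track_metadata.get("artist")


print(Track("track2", metadata)._track_metadata)  # {'artist': 'B', 'genre': 'Y'}
```

Exposing fields through properties like `artist` keeps the metadata file format hidden from users of the Track class.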
Load methods vs Track properties¶
Track properties and cached properties should be trivial, and directly call a load_*
method.
There should be no additional logic in a track property/cached property, and instead all logic
should be done in the load method. We separate these because the track properties are only usable
when data is available locally - when data is remote, the load methods are used instead.
Missing Data¶
If a Track has a property, for example a type of annotation, that is present for some tracks and not others, the property should be set to None when it isn’t available.
The index should only contain key-values for files that exist.
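A sketch of this pattern (the class and attribute names here are illustrative, not mirdata's exact API):

```python
from typing import Optional


class Track:
    def __init__(self, track_paths):
        # _track_paths mirrors this track's entry in the index
        self._track_paths = track_paths

    def get_path(self, key):
        # keys absent from the index yield None rather than raising
        if key not in self._track_paths:
            return None
        return self._track_paths[key][0]

    @property
    def lyrics_path(self) -> Optional[str]:
        return self.get_path("lyrics")


track = Track({"audio": ("audio/track1.wav", "912ec803b2ce49e4a541068d495ab570")})
print(track.lyrics_path)  # None: this track has no lyrics file in the index
```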
Custom Decorators¶
cached_property¶
This is used primarily for Track classes.
This decorator causes an Object’s function to behave like
an attribute (aka, like the @property
decorator), but caches
the value in memory after it is first accessed. This is used
for data which is relatively large and loaded from files.
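The caching behavior can be illustrated with a minimal reimplementation; mirdata's actual `core.cached_property` may differ in details:

```python
class cached_property:
    """A non-data descriptor: compute on first access, then cache on the instance."""

    def __init__(self, func):
        self.func = func

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = self.func(obj)
        # storing the value under the same name shadows the descriptor,
        # so later accesses hit the instance dict directly
        obj.__dict__[self.func.__name__] = value
        return value


class Demo:
    loads = 0

    @cached_property
    def annotation(self):
        Demo.loads += 1  # stands in for an expensive file read
        return [0.0, 1.5, 3.0]


d = Demo()
d.annotation
d.annotation
print(Demo.loads)  # 1: the load function ran only once
```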
docstring_inherit¶
This decorator is used for children of the Dataset class, and copies the Attributes from the parent class to the docstring of the child. This gives us clear and complete docs without a lot of copy-paste.
copy_docs¶
This decorator is used mainly for a dataset’s load_
functions, which
are attached to a loader’s Dataset class. The attached function is identical,
and this decorator simply copies the docstring from another function.
coerce_to_bytes_io/coerce_to_string_io¶
These are two decorators used to simplify the loading of various Track members in addition to giving users the ability to use file streams instead of paths in case the data is in a remote location e.g. GCS. The decorators modify the function to:
Return None if None is passed in.
Open a file in read mode ('r' for coerce_to_string_io, 'rb' for coerce_to_bytes_io) if a string path is passed in, and pass the file handle to the decorated function.
Pass the file handle to the decorated function if a file-like object is passed.
This cannot be used if the function to be decorated takes multiple arguments. coerce_to_bytes_io should not be used if trying to load an mp3 with librosa as libsndfile does not support mp3 yet and audioread expects a path.