Contributing

We encourage contributions to mirdata, especially new dataset loaders. To contribute a new loader, follow the steps indicated below and create a Pull Request (PR) to the GitHub repository. If you have any questions or comments about your contribution, you can always open an issue or start a discussion in the repository.

To reduce friction, we may make commits on top of contributors’ PRs. If you do not want us to, please tag your PR with please-do-not-edit.

Installing mirdata for development purposes

To install mirdata for development purposes:

  • First run:

    git clone https://github.com/mir-dataset-loaders/mirdata.git
    
  • Then, from inside the cloned repository, install the package along with the dependencies for running tests and building the documentation:

    pip install .
    pip install ".[tests]"
    pip install ".[docs]"
    pip install ".[dali]"
    pip install ".[haydn_op20]"
    

We recommend installing pyenv to manage your Python versions and mirdata’s requirements. You will want to install the latest supported Python versions (see README.md). Once pyenv and the Python versions are configured, install pytest, along with the necessary pytest plugins for running the test suite (e.g. pytest-cov).

Before running the tests, make sure you have formatted mirdata/ and tests/ with black:

black mirdata/ tests/

Also, make sure the code passes the flake8 and mypy checks specified in the lint-python.yml GitHub Actions workflow:

flake8 mirdata --count --select=E9,F63,F7,F82 --show-source --statistics
python -m mypy mirdata --ignore-missing-imports --allow-subclassing-any

Finally, run:

pytest -vv --cov-report term-missing --cov-report=xml --cov=mirdata tests/ --local

All tests should pass!

Writing a new dataset loader

The steps to add a new dataset loader to mirdata are:

  1. Create an index

  2. Create a module

  3. Add tests

  4. Submit your loader

Before starting, check if your dataset falls into one of these non-standard cases:

  • Is the dataset not freely downloadable? If so, see this section

  • Does the dataset require dependencies not currently in mirdata? If so, see this section

  • Does the dataset have multiple versions? If so, see this section

  • Is the index large (e.g. > 5 MB)? If so, see this section

1. Create an index

mirdata’s structure relies on indexes. Indexes are dictionaries that contain information about the structure of the dataset, which is necessary for mirdata’s loading and validation functionality. In particular, indexes contain information about the files included in the dataset, their location, and their checksums. The necessary steps are:

  1. To create an index, first create a script in scripts/ (e.g. make_dataset_index.py) which generates an index file.

  2. Then run the script on the dataset and save the index in mirdata/datasets/indexes/ as dataset_index_<version>.json, where <version> indicates which version of the dataset was used (e.g. 1.0).

Here is an example of an index to use as a guideline:
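(This is a rough, minimal sketch of a tracks-style index; the track ids, file names, and checksums below are made up.)

{
    "version": "1.0",
    "tracks": {
        "track1": {
            "audio": ["audio/track1.wav", "912ec803b2ce49e4a541068d495ab570"],
            "annotation": ["annotations/track1.csv", "2cf33591c3b28b382668952e236cccd5"]
        },
        "track2": {
            "audio": ["audio/track2.wav", "65e8800b2f8d2bbf4ff84f0b3d2cdb26"],
            "annotation": ["annotations/track2.csv", "1b1a0c63786af5b764a29a672d7dd0b8"]
        }
    }
}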

More examples of scripts used to create dataset indexes can be found in the scripts folder.
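As a rough sketch of what such a script might look like (the dataset layout, file names, and paths are hypothetical), it typically walks the dataset folder, computes an md5 checksum for each file, and dumps the result to JSON:

import hashlib
import json
import os

DATASET_PATH = "/path/to/my_dataset"  # hypothetical local copy of the full dataset


def md5(file_path):
    """Compute the md5 checksum of a file."""
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as fhandle:
        for chunk in iter(lambda: fhandle.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def make_index(dataset_path):
    index = {"version": "1.0", "tracks": {}}
    # assume one wav file per track in audio/ and one csv annotation per track
    for filename in sorted(os.listdir(os.path.join(dataset_path, "audio"))):
        track_id = os.path.splitext(filename)[0]
        audio_path = os.path.join("audio", filename)
        annotation_path = os.path.join("annotations", track_id + ".csv")
        index["tracks"][track_id] = {
            # paths are relative to the top level directory of the dataset
            "audio": [audio_path, md5(os.path.join(dataset_path, audio_path))],
            "annotation": [annotation_path, md5(os.path.join(dataset_path, annotation_path))],
        }
    with open("my_dataset_index_1.0.json", "w") as fhandle:
        json.dump(index, fhandle, indent=2)


if __name__ == "__main__":
    make_index(DATASET_PATH)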

tracks

Most MIR datasets are organized as a collection of tracks and annotations. In such a case, the index should make use of the tracks top-level key. A dictionary should be stored under the tracks top-level key where the keys are the unique track ids of the dataset. The values are dictionaries of the files associated with each track id, along with their checksums. These files can be, for instance, audio files or annotations related to the track id. File paths are relative to the top level directory of a dataset.

multitracks

records

2. Create a module

Once the index is created, you can create the loader. For that, we suggest you use the following template and adjust it for your dataset. To quickstart a new module:

  1. Copy the example below and save it to mirdata/datasets/<your_dataset_name>.py

  2. Find & Replace Example with <your_dataset_name>.

  3. Remove any lines beginning with # --, which are there as guidelines.
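As a rough, heavily abridged sketch of the overall shape of a loader module (the dataset name, URLs, checksums, and file keys are placeholders, and the exact base-class signatures may differ between mirdata versions, so treat an up-to-date module in mirdata/datasets/ as the authoritative reference):

"""Example Dataset Loader"""
import librosa

from mirdata import core, download_utils, io

# -- citation for the dataset, in BibTeX format
BIBTEX = """@article{example2021, title={Example}, author={Someone}, year={2021}}"""

# -- available versions of the dataset; "default" and "test" point at version keys
INDEXES = {
    "default": "1.0",
    "test": "sample",
    "1.0": core.Index(filename="example_index_1.0.json"),
    "sample": core.Index(filename="example_index_sample.json"),
}

# -- where and how the data can be downloaded
REMOTES = {
    "all": download_utils.RemoteFileMetadata(
        filename="example.zip",
        url="https://example.com/example.zip",  # placeholder url
        checksum="00000000000000000000000000000000",  # placeholder md5
    ),
}

LICENSE_INFO = "Creative Commons Attribution 4.0 International"


class Track(core.Track):
    """Example track class

    Args:
        track_id (str): track id of the track

    Attributes:
        audio_path (str): path to the track's audio file

    """

    def __init__(self, track_id, data_home, dataset_name, index, metadata):
        super().__init__(track_id, data_home, dataset_name, index, metadata)
        # -- paths are looked up in the index via their file keys
        self.audio_path = self.get_path("audio")

    @property
    def audio(self):
        """The track's audio

        Returns:
            * np.ndarray - audio signal
            * float - sample rate

        """
        return load_audio(self.audio_path)


@io.coerce_to_bytes_io
def load_audio(fhandle):
    """Load an Example audio file."""
    return librosa.load(fhandle, sr=None, mono=True)


@core.docstring_inherit(core.Dataset)
class Dataset(core.Dataset):
    """The Example dataset"""

    def __init__(self, data_home=None, version="default"):
        super().__init__(
            data_home,
            version,
            name="example",
            track_class=Track,
            bibtex=BIBTEX,
            indexes=INDEXES,
            remotes=REMOTES,
            license_info=LICENSE_INFO,
        )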

You may find these examples useful as references:

For many more examples, see the datasets folder.

3. Add tests

To finish your contribution, include tests that check the integrity of your loader. For this, follow these steps:

  1. Make a toy version of the dataset in the tests folder tests/resources/mir_datasets/my_dataset/, so you can test against a small amount of data. For example:

    • Include all audio and annotation files for one track of the dataset

    • For each audio/annotation file, reduce the audio length to 1-2 seconds and remove all but a few of the annotations.

    • If the dataset has a metadata file, reduce the length to a few lines.

  2. Test all of the dataset-specific code, e.g. the public attributes of the Track class, the load functions, and any other custom functions you wrote (a rough sketch is shown after this list). See the tests folder for reference. If your loader has a custom download function, add tests similar to this loader.

  3. Locally run pytest -s tests/test_full_dataset.py --local --dataset my_dataset before submitting your loader to make sure everything is working. If your dataset has multiple versions, test each (non-default) version by running pytest -s tests/test_full_dataset.py --local --dataset my_dataset --dataset-version my_version.
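As a rough illustration of what such a test might look like (the module name, track id, file names, and attributes are placeholders; see the existing files in tests/datasets/ for the project's actual test conventions):

"""tests/datasets/test_my_dataset.py"""
from mirdata.datasets import my_dataset

TEST_DATA_HOME = "tests/resources/mir_datasets/my_dataset"


def test_track():
    # instantiate the dataset against the toy test data, not the real download
    dataset = my_dataset.Dataset(TEST_DATA_HOME)
    assert "track1" in dataset.track_ids  # placeholder track id

    track = dataset.load_tracks()["track1"]

    # check the public attributes of the Track class
    assert track.track_id == "track1"
    assert track.audio_path.endswith("track1.wav")

    # check that the load functions return sensible values
    audio, sr = track.audio
    assert sr > 0
    assert audio.ndim in (1, 2)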

Note

We have written automated tests for every loader’s cite, download, validate, load, and track_ids functions, as well as some basic edge cases of the Track class, so you don’t need to write tests for these!

Running your tests locally

Before creating a PR, you should run all the tests. But first, make sure you have formatted mirdata/ and tests/ with black:

black mirdata/ tests/

Also, make sure the code passes the flake8 and mypy checks specified in the lint-python.yml GitHub Actions workflow:

flake8 mirdata --count --select=E9,F63,F7,F82 --show-source --statistics
python -m mypy mirdata --ignore-missing-imports --allow-subclassing-any

Finally, run all the tests locally like this:

pytest -vv --cov-report term-missing --cov-report=xml --cov=mirdata --black tests/ --local

The --local flag skips tests that are built to run only on the remote testing environment.

To run one specific test file:

pytest tests/datasets/test_ikala.py

Finally, there is one local test you should run, which we can’t easily run in our testing environment.

pytest -s tests/test_full_dataset.py --local --dataset dataset

Here, dataset is the name of the module of the dataset you added. The -s flag tells pytest not to capture print output, which is useful here for seeing the download progress bar when testing the download function.

This tests that your dataset downloads, validates, and loads properly for every track. This test takes a long time for some datasets, but it’s important to ensure the integrity of the library.

The --skip-download flag can be added to the pytest command to run the tests while skipping the downloading step. Note that this is just for convenience during debugging - the tests should eventually all pass without this flag.

Reducing the testing space usage

We are trying to keep the test resources folder as small as possible, because it can get really heavy as new loaders are added. We kindly ask contributors to reduce the size of the testing data wherever possible (e.g. by trimming the audio tracks and keeping just a couple of rows in csv files).
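For example, a quick way to trim a test audio file down to one second is a short snippet along these lines (the file paths are hypothetical):

import librosa
import soundfile as sf

# load only the first second of the original audio file...
audio, sr = librosa.load("original_dataset/audio/track1.wav", sr=None, duration=1.0)
# ...and write it into the test resources folder
sf.write("tests/resources/mir_datasets/my_dataset/audio/track1.wav", audio, sr)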

4. Submit your loader

Before you submit your loader make sure to:

  1. Add your module to docs/source/mirdata.rst in alphabetical order

  2. Add your module to docs/source/table.rst in alphabetical order as follows:

* - Dataset
  - Downloadable?
  - Annotation Types
  - Tracks
  - License

An example of this for the Beatport EDM key dataset:

* - Beatport EDM key
  - - audio: ✅
    - annotations: ✅
  - - global :ref:`key`
  - 1486
  - .. image:: https://licensebuttons.net/l/by-sa/3.0/88x31.png
       :target: https://creativecommons.org/licenses/by-sa/4.0

(you can check that this was done correctly by clicking on the readthedocs check when you open a PR). You can find license badge images and links here.

Pull Request template

When opening your PR, please use the new_loader.md template; it will simplify the reviewing process and also help you make a complete PR. You can do that by adding &template=new_loader.md at the end of the URL when you are creating the PR:

...mir-dataset-loaders/mirdata/compare?expand=1 will become ...mir-dataset-loaders/mirdata/compare?expand=1&template=new_loader.md.

Docs

Staged docs for every new PR are built, and you can look at them by clicking on the “readthedocs” test in a PR. To quickly troubleshoot any issues, you can build the docs locally by navigating to the docs folder, and running make html (note, you must have sphinx installed). Then open the generated _build/source/index.html file in your web browser to view.

Troubleshooting

If GitHub shows a red X next to your latest commit, it means one of our checks is not passing. This could mean:

  1. running black has failed – this means that your code is not formatted according to black’s code style. To fix this, simply run the following from inside the top level folder of the repository:

     black mirdata/ tests/

  2. your code does not pass the flake8 check:

     flake8 mirdata --count --select=E9,F63,F7,F82 --show-source --statistics

  3. your code does not pass the mypy check:

     python -m mypy mirdata --ignore-missing-imports --allow-subclassing-any

  4. the test coverage is too low – this means that there are too many new lines of code introduced that are not tested.

  5. the docs build has failed – this means that one of the changes you made to the documentation has caused the build to fail. Check the formatting in your changes and make sure they are consistent.

  6. the tests have failed – this means at least one of the tests is failing. Run the tests locally to make sure they are passing. If they are passing locally but failing in the check, open an issue and we can help debug.

Common non-standard cases

Not fully-downloadable datasets

Sometimes, parts of music datasets are not freely available due to e.g. copyright restrictions. In these cases, we aim to make sure that the version used in mirdata is the original one, and not a variant.

If a dataset is not fully downloadable, follow these steps before starting a PR:

  1. Contact the mirdata team by opening an issue or PR so we can discuss how to proceed with the closed dataset.

  2. Show that the version used to create the checksum is the “canonical” one, either by getting the version from the dataset creator, or by verifying equivalence with several other copies of the dataset.

Datasets needing extra dependencies

If a new dataset requires a library that is not included in setup.py, please open an issue. In general, if the new library will be useful for many future datasets, we will add it as a dependency. If it is specific to one dataset, we will add it as an optional dependency.

To add an optional dependency, add the dataset name as a key in extras_require in setup.py, and list any additional dependencies. Additionally, mock the dependencies in docs/conf.py by adding it to the autodoc_mock_imports list.
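For a hypothetical example_dataset loader that needs a package called asdf, the relevant additions would look roughly like this:

# setup.py (excerpt): register the new optional dependency group
extras_require={
    # ... existing extras stay as they are ...
    "example_dataset": ["asdf"],
},

# docs/conf.py (excerpt): mock the optional import so autodoc can build the docs
autodoc_mock_imports = [
    # ... existing mocked modules stay as they are ...
    "asdf",
]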

When importing these optional dependencies in the dataset module, use a try/except clause and log instructions if the user hasn’t installed the extra requirements.

For example, if a module called example_dataset requires a module called asdf, it should be imported as follows:

try:
    import asdf
except ImportError:
    logging.error(
        "In order to use example_dataset you must have asdf installed. "
        "Please reinstall mirdata using `pip install 'mirdata[example_dataset]'`"
    )
    raise ImportError

Datasets with multiple versions

There are some datasets where the loading code is the same, but there are multiple versions of the data (e.g. updated annotations, or an additional set of tracks which follow the same paradigm). In this case, only one loader should be written, and multiple versions can be defined by creating additional indexes. Indexes follow the naming convention <datasetname>_index_<version>.json, thus a dataset with two versions simply has two index files. Different versions are tracked using the INDEXES variable:

INDEXES = {
    "default": "1.0",
    "test": "sample",
    "1.0": core.Index(filename="example_index_1.0.json"),
    "2.0": core.Index(filename="example_index_2.0.json"),
    "sample": core.Index(filename="example_index_sample.json")
}

By default, mirdata loads the version specified as default in INDEXES when running mirdata.initialize('example'), but a specific version can be loaded by running mirdata.initialize('example', version='2.0').
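For example, using the hypothetical example loader from above:

import mirdata

# loads the version registered as "default" in INDEXES (here, "1.0")
example = mirdata.initialize('example')

# loads a specific, non-default version of the data
example_v2 = mirdata.initialize('example', version='2.0')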

Different indexes can refer to different subsets of the same larger dataset, or can reference completely different data. All data needed for all versions should be specified via keys in REMOTES, and by default, mirdata will download everything. If one version only needs a subset of the data in REMOTES, it can be specified using the partial_download argument of core.Index. For example, if REMOTES has the keys ['audio', 'v1-annotations', 'v2-annotations'], the INDEXES dictionary could look like:

INDEXES = {
    "default": "1.0",
    "test": "1.0",
    "1.0": core.Index(filename="example_index_1.0.json", partial_download=['audio', 'v1-annotations']),
    "2.0": core.Index(filename="example_index_2.0.json", partial_download=['audio', 'v2-annotations']),
}

Datasets with large indexes

Large indexes should be stored remotely, rather than checked in to the mirdata repository. mirdata has a Zenodo community where larger indexes can be uploaded as “datasets”.

When defining a remote index in INDEXES, simply also pass the arguments url and checksum to the Index class:

"1.0": core.Index(
    filename="example_index_1.0.json",  # the name of the index file
    url=<url>,  # the download link
    checksum=<checksum>,  # the md5 checksum
)

Remote indexes get downloaded along with the data when calling .download(), and are stored in <data_home>/mirdata_indexes.

Documentation

This documentation is in rst format. It is built using Sphinx and hosted on readthedocs. The API documentation is built using autodoc, which autogenerates documentation from the code’s docstrings. We use the napoleon plugin for building docs in Google docstring style. See the next section for docstring conventions.

mirdata uses Google’s Docstring formatting style. Here are some common examples.

Note

The small formatting details in these examples are important. Differences in new lines, indentation, and spacing make a difference in how the documentation is rendered. For example, writing Returns: will render correctly, but Returns or Returns : will not.

Functions:

def add_to_list(list_of_numbers, scalar):
    """Add a scalar to every element of a list.
    You can write a continuation of the function description here on the next line.

    You can optionally write more about the function here. If you want to add an example
    of how this function can be used, you can do it like below.

    Example:
        .. code-block:: python

        foo = add_to_list([1, 2, 3], 2)

    Args:
        list_of_numbers (list): A short description that fits on one line.
        scalar (float):
            Description of the second parameter. If there is a lot to say you can
            overflow to a second line.

    Returns:
        list: Description of the return. The type here is not in parentheses

    """
    return [x + scalar for x in list_of_numbers]

Functions with more than one return value:

def multiple_returns():
    """This function has no arguments, but more than one return value. Autodoc with napoleon doesn't handle this well,
    and we use this formatting as a workaround.

    Returns:
        * int - the first return value
        * bool - the second return value

    """
    return 42, True

One-line docstrings

def some_function():
    """
    One line docstrings must be on their own separate line, or autodoc does not build them properly
    """
    ...

Objects

"""Description of the class
overflowing to a second line if it's long

Some more details here

Args:
    foo (str): First argument to the __init__ method
    bar (int): Second argument to the __init__ method

Attributes:
    foobar (str): First track attribute
    barfoo (bool): Second track attribute

Cached Properties:
    foofoo (list): Cached properties are special mirdata attributes
    barbar (None): They are lazy loaded properties.
    barf (bool): Document them with this special header.

"""

Conventions

Opening files

mirdata uses the smart_open library under the hood in order to support reading data from remote filesystems. If your loader needs to call the built-in open function, or needs to use os.path.exists, you’ll need to include the line

from smart_open import open

at the top of your dataset module and use open as you normally would. Sometimes dependency libraries accept file paths as input to certain functions and open the files internally; whenever possible, mirdata avoids this and passes in file objects directly.

If you just need os.path.exists, you’ll need to replace it with a try/except:

# original code that uses os.path.exists
file_path = "flululu.txt"
if not os.path.exists(file_path):
    raise FileNotFoundError(f"{file_path} not found, did you run .download?")

with open(file_path, "r") as fhandle:
    ...

# replacement code that is compatible with remote filesystems
try:
    with open(file_path, "r") as fhandle:
        ...
except FileNotFoundError:
    raise FileNotFoundError(f"{file_path} not found, did you run .download?")

Loading from files

We use the following libraries for loading data from files:

Format                 library
audio (wav, mp3, …)    librosa
midi                   pretty_midi
json                   json
csv                    csv
jams                   jams
yaml                   pyyaml
hdf5 / h5              h5py

If a file format needed for a dataset is not included in this list, please see this section

Track Attributes

If the dataset has an official split (e.g. train/test), use the reserved attribute Track.split (or MultiTrack.split), which will enable some dataset-level helper functions like dataset.get_track_splits. If there is no official split, do not use this attribute.
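For instance, assuming a hypothetical loader named example whose tracks define split, the helper can be used roughly like this:

import mirdata

dataset = mirdata.initialize("example")
splits = dataset.get_track_splits()  # e.g. {"train": [...], "test": [...]}
tracks = dataset.load_tracks()
train_tracks = [tracks[track_id] for track_id in splits["train"]]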

Custom track attributes should be global, track-level data. For some datasets, there is a separate, dataset-level metadata file with track-level metadata, e.g. as a csv. When a single file is needed for more than one track, we recommend writing a _metadata cached property (which returns a dictionary, either keyed by track_id or freeform) in the Dataset class (see the dataset module example code above). When this is specified, it will populate a track’s hidden _track_metadata field, which can be accessed from the Track class.

For example, if _metadata returns a dictionary of the form:

{
    'track1': {
        'artist': 'A',
        'genre': 'Z'
    },
    'track2': {
        'artist': 'B',
        'genre': 'Y'
    }
}

the _track_metadata for track_id=track2 will be:

{
    'artist': 'B',
    'genre': 'Y'
}
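A rough sketch of how such a cached property might look in the Dataset class, assuming a hypothetical metadata.csv with track_id, artist, and genre columns:

import csv
import os

from smart_open import open

from mirdata import core


class Dataset(core.Dataset):
    # ... __init__ and the rest of the class as usual ...

    @core.cached_property
    def _metadata(self):
        metadata_path = os.path.join(self.data_home, "metadata.csv")
        metadata = {}
        try:
            with open(metadata_path, "r") as fhandle:
                reader = csv.reader(fhandle)
                next(reader)  # skip the header row
                for row in reader:
                    # each row: track_id, artist, genre
                    metadata[row[0]] = {"artist": row[1], "genre": row[2]}
        except FileNotFoundError:
            raise FileNotFoundError("metadata.csv not found. Did you run .download()?")
        return metadata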

Missing Data

If a Track has a property, for example a type of annotation, that is present for some tracks and not others, the property should be set to None when it isn’t available.

The index should only contain key-values for files that exist.

Custom Decorators

cached_property

This is used primarily for Track classes.

This decorator causes an object’s method to behave like an attribute (aka, like the @property decorator), but caches the value in memory after it is first accessed. This is used for data which is relatively large and loaded from files.

docstring_inherit

This decorator is used for children of the Dataset class, and copies the Attributes from the parent class to the docstring of the child. This gives us clear and complete docs without a lot of copy-paste.

coerce_to_bytes_io/coerce_to_string_io

These are two decorators used to simplify the loading of various Track members, in addition to giving users the ability to use file streams instead of paths in case the data is in a remote location (e.g. GCS). The decorators modify the function to:

  • Return None if None is passed in.

  • Open the file if a string path is passed in ('r' mode for string_io, 'rb' for bytes_io) and pass the file handle to the decorated function.

  • Pass the file handle to the decorated function if a file-like object is passed.

These decorators cannot be used if the function to be decorated takes multiple arguments. coerce_to_bytes_io should not be used if trying to load an mp3 with librosa, as libsndfile does not support mp3 yet and audioread expects a path.
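As a small illustration (the loader function and file format are made up):

from mirdata import io


@io.coerce_to_string_io
def load_artist_tag(fhandle):
    """Load a single-line artist tag from a text annotation file."""
    # fhandle is always a text-mode file object here, regardless of whether
    # the caller passed a path or an already-open file-like object
    return fhandle.readline().strip()


# both calls behave the same way:
artist = load_artist_tag("artist.txt")
with open("artist.txt") as fhandle:
    artist = load_artist_tag(fhandle)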