Version: 0.6.12

The dataset.yaml Specification

Tip: dataset.yaml is optional for the swcli dataset build command.

A Starwhale Dataset can be built from a dataset.yaml file. If dataset.yaml is omitted, the same configuration can be given as command-line parameters to swcli dataset build. In effect, dataset.yaml is a file-based representation of the build command's configuration.

YAML Field Descriptions

| Field | Description | Required | Type | Default |
| --- | --- | --- | --- | --- |
| name | Name of the Starwhale Dataset | Yes | String | |
| handler | Importable address of a class that inherits starwhale.SWDSBinBuildExecutor, starwhale.UserRawBuildExecutor or starwhale.BuildExecutor, or of a function that returns a Generator or iterable object. The format is {module path}:{class name\|function name} | Yes | String | |
| desc | Dataset description | No | String | "" |
| version | dataset.yaml format version. Currently only "1.0" is supported | No | String | 1.0 |
| attr | Dataset build parameters | No | Dict | |
| attr.volume_size | Size of each data file in the swds-bin dataset. Can be a number of bytes, or a number plus a unit such as 64M or 1GB | No | Integer or String | 64MB |
| attr.alignment_size | Data alignment size of each data block in the swds-bin dataset. For example, with alignment_size set to 4k, a 7.9K data block gets 0.1K of padding so that its size is a multiple of alignment_size, which improves page alignment and read efficiency | No | Integer or String | 128 |
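The size and padding arithmetic described for attr.volume_size and attr.alignment_size can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Starwhale's implementation: parse_size and padding_for are hypothetical helper names, and the units are assumed to be binary (1K = 1024 bytes), which the specification does not state.

```python
import re

# Assumed unit multipliers (binary, 1K = 1024 bytes). The specification
# does not say whether units are binary or decimal.
_UNITS = {
    "": 1, "B": 1,
    "K": 1024, "KB": 1024,
    "M": 1024 ** 2, "MB": 1024 ** 2,
    "G": 1024 ** 3, "GB": 1024 ** 3,
}


def parse_size(value):
    """Convert an int, or a string such as '64M' or '1GB', to bytes (hypothetical helper)."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", str(value))
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def padding_for(block_size, alignment_size):
    """Bytes of padding needed to round block_size up to a multiple of alignment_size."""
    return -block_size % alignment_size


alignment = parse_size("4k")            # 4096 bytes
block = 8090                            # roughly 7.9K
print(parse_size("64M"))                # 67108864
print(padding_for(block, alignment))    # 102 (roughly 0.1K)
```

With a 4k alignment, the 8090-byte block is padded by 102 bytes to 8192 bytes, which matches the 7.9K plus 0.1K case in the table.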

Examples

Simplest Example

name: helloworld
handler: dataset:ExampleProcessExecutor

The helloworld dataset is built by the ExampleProcessExecutor class defined in dataset.py, which is located in the same directory as dataset.yaml.
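To make the {module path}:{class name|function name} format concrete, the following sketch shows one way such an address could be split and imported. It is an assumption about the general technique, not Starwhale's own loader; resolve_handler is a hypothetical helper, and a standard-library address stands in for dataset:ExampleProcessExecutor so the snippet runs without a dataset.py.

```python
import importlib


def resolve_handler(address):
    """Resolve '{module path}:{name}' to a Python object (hypothetical helper)."""
    module_path, sep, attr_name = address.partition(":")
    if not sep or not module_path or not attr_name:
        raise ValueError(f"expected '{{module path}}:{{name}}', got {address!r}")
    # The module must be importable, e.g. located on sys.path.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Standard-library stand-in, since dataset.py is not available here.
dumps = resolve_handler("json:dumps")
print(dumps({"label": 1}))  # {"label": 1}
```

Under this reading, dataset:ExampleProcessExecutor names the dataset module and its ExampleProcessExecutor attribute, and mnist.dataset:DatasetProcessExecutor would name a class in the dataset module of the mnist package.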

MNIST Dataset Build Example

name: mnist
handler: mnist.dataset:DatasetProcessExecutor
desc: MNIST data and label test dataset
attr:
  alignment_size: 128
  volume_size: 4M
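As a rough illustration of how the required fields and defaults from the field table combine with a configuration like the MNIST one, the sketch below applies them to a plain dict. The normalize function is hypothetical and the dict stands in for the parsed YAML. The actual build command may validate and fill defaults differently.

```python
# Defaults and required fields taken from the field table in this document.
DEFAULTS = {"desc": "", "version": "1.0"}
ATTR_DEFAULTS = {"volume_size": "64MB", "alignment_size": 128}
REQUIRED = ("name", "handler")


def normalize(config):
    """Check required fields and fill in documented defaults (hypothetical helper)."""
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    result = {**DEFAULTS, **config}
    # Values given under attr override the documented attr defaults.
    result["attr"] = {**ATTR_DEFAULTS, **(config.get("attr") or {})}
    return result


# The MNIST configuration expressed as a dict.
mnist = {
    "name": "mnist",
    "handler": "mnist.dataset:DatasetProcessExecutor",
    "desc": "MNIST data and label test dataset",
    "attr": {"alignment_size": 128, "volume_size": "4M"},
}

full = normalize(mnist)
print(full["version"])              # 1.0
print(full["attr"]["volume_size"])  # 4M
```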

Example with handler as a generator function

dataset.yaml contents:

name: helloworld
handler: dataset:iter_item

dataset.py contents:

def iter_item():
    for i in range(10):
        yield {"img": f"image-{i}".encode(), "label": i}
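Because the handler here is an ordinary Python generator function, it can be called directly to inspect what it yields before running a build. The snippet below repeats iter_item so that it is self-contained. How Starwhale stores each yielded dict is outside the scope of this sketch.

```python
def iter_item():
    # Same generator as in dataset.py: ten records with bytes and int fields.
    for i in range(10):
        yield {"img": f"image-{i}".encode(), "label": i}


rows = list(iter_item())
print(len(rows))   # 10
print(rows[0])     # {'img': b'image-0', 'label': 0}
```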