Starwhale Model Evaluation
Design Overview
Starwhale Evaluation Positioning
The goal of Starwhale Evaluation is to provide end-to-end management for model evaluation, including creating Jobs, distributing Tasks, viewing model evaluation reports, and basic management. Starwhale Evaluation is a specific application of Starwhale Model, Starwhale Dataset, and Starwhale Runtime in the model evaluation scenario, and it is part of the MLOps toolchain built by Starwhale. More applications, such as Starwhale Model Serving and Starwhale Training, will be added in the future.
Core Features
- Visualization: Both swcli and the Web UI provide visualization of model evaluation results and support comparing multiple results. Users can also log custom intermediate results during evaluation (see the sketch after this list).
- Multi-scenario Adaptation: Whether in a notebook, on a desktop, or in a distributed cluster, the same commands, Python scripts, artifacts, and operations can be used for model evaluation, accommodating different compute and data volume requirements.
- Seamless Starwhale Integration: Leverage Starwhale Runtime for the runtime environment, Starwhale Dataset as data input, and run models from Starwhale Model. Configuration is simple whether using swcli, the Python SDK, or the Cloud/Server instance Web UI.
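The sketch below shows how custom logging fits into a minimal predict/evaluate pipeline written with the Python SDK. The model stub, the dataset field names, the structure of the prediction records, and the evaluation.log_summary call are illustrative assumptions; check the Starwhale SDK reference for the exact signatures.

```python
from starwhale import evaluation


def my_model(image):
    # placeholder for a real model; returns a dummy class id
    return 0


@evaluation.predict
def predict(data):
    # called once per dataset row; the return value is recorded by Starwhale
    return my_model(data["image"])  # the "image" field name is illustrative


@evaluation.evaluate(needs=[predict])
def evaluate(predict_results):
    correct = total = 0
    for result in predict_results:
        total += 1
        # "input" and "output" are illustrative names for the recorded fields
        if result["output"] == result["input"]["label"]:
            correct += 1
    # an aggregate metric logged here appears in reports and result comparisons
    evaluation.log_summary({"accuracy": correct / max(total, 1)})
```

Keeping predict and evaluate as separate functions is what allows Starwhale to split them into separate steps, as described in the job-step-task section below.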
Key Elements
- swcli model run: Command line for bulk offline model evaluation.
- swcli model serve: Command line for online model evaluation.
Best Practices
Command Line Grouping
From the perspective of completing an end-to-end Starwhale Evaluation workflow, commands can be grouped as:
- Preparation Stage
  - swcli dataset build or the Starwhale Dataset Python SDK (see the sketch after this list)
  - swcli model build or the Starwhale Model Python SDK
  - swcli runtime build
- Evaluation Stage
  - swcli model run
  - swcli model serve
- Results Stage
  - swcli job info
- Basic Management
  - swcli job list
  - swcli job remove
  - swcli job recover
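For the SDK alternative mentioned in the Preparation Stage, the following is a minimal sketch of building a dataset from Python. The dataset name, the create argument, and the row fields are illustrative assumptions; consult the Starwhale Dataset SDK reference for the exact parameters.

```python
from starwhale import dataset

# build a tiny dataset from Python instead of running `swcli dataset build`
ds = dataset("mnist-mini", create="empty")  # name and create flag are illustrative
for i in range(10):
    ds.append({"index": i, "label": i % 2})  # any dict-like row can be appended
ds.commit()  # a commit produces a new, immutable dataset version
ds.close()
```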
 
Abstraction job-step-task
- job: A model evaluation task is a job, which contains one or more steps.
- step: A step corresponds to a stage in the evaluation process. With the default PipelineHandler, the steps are predict and evaluate. For custom evaluation processes defined with the @handler, @evaluation.predict, and @evaluation.evaluate decorators, the steps are the decorated functions (see the sketch at the end of this section). Steps can depend on each other, forming a DAG. A step contains one or more tasks. Tasks in the same step have the same logic but different inputs; a common approach is to split the dataset into multiple parts, with each part passed to a task. Tasks can run in parallel.
- task: A task is the final running entity. In Cloud/Server instances, a task is a container in a Pod. In Standalone instances, a task is a Python thread.
The job-step-task abstraction is the basis for implementing distributed runs in Starwhale Evaluation.
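To make the abstraction concrete, the sketch below declares a two-step DAG with the @handler decorator mentioned above. The replicas and needs parameters are assumptions about the decorator's options, used here only to show how parallel tasks and step dependencies are expressed; verify the exact names against the Starwhale SDK reference.

```python
from starwhale import handler

# step "score": replicas=3 yields three tasks with identical logic; Starwhale
# gives each task its own share of the input, and the tasks may run in parallel
@handler(replicas=3)  # replicas is an assumed option name
def score():
    print("scoring one slice of the dataset")  # placeholder work

# step "report": needs=[score] adds an edge to the DAG, so this step's single
# task starts only after every score task has finished
@handler(needs=[score])  # needs is an assumed option name
def report():
    print("aggregating per-task results into one report")  # placeholder work
```

On a Standalone instance each of these tasks runs as a Python thread, while on a Cloud/Server instance each one runs as a container in a Pod, as described above.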