Starwhale Model Evaluation
Design Overview
Starwhale Evaluation Positioning
The goal of Starwhale Evaluation is to provide end-to-end management for model evaluation, including creating Jobs, distributing Tasks, viewing model evaluation reports, and basic management. Starwhale Evaluation is a specific application of Starwhale Model, Starwhale Dataset, and Starwhale Runtime to the model evaluation scenario. Starwhale Evaluation is part of the MLOps toolchain built by Starwhale; more applications, such as Starwhale Model Serving and Starwhale Training, will be added in the future.
Core Features
- Visualization: Both swcli and the Web UI provide visualization of model evaluation results and support comparing multiple results. Users can also customize the logging of intermediate results (see the sketch after this list).
- Multi-scenario Adaptation: Whether in a notebook, on a desktop, or in a distributed cluster environment, the same commands, Python scripts, artifacts, and operations can be used for model evaluation, satisfying different computing power and data volume requirements.
- Seamless Starwhale Integration: Use Starwhale Runtime for the runtime environment, Starwhale Dataset as the data input, and run models from Starwhale Model. Configuration is simple whether you use swcli, the Python SDK, or the Cloud/Server instance Web UI.
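The customizable logging mentioned under Visualization goes through the Python SDK's evaluation module. The sketch below is a minimal, hedged example: the evaluation.log and evaluation.log_summary calls and their parameters follow recent Starwhale SDK releases, are meant to be called from inside predict/evaluate code, and may differ between versions; the table name and metric names are illustrative.

```python
from starwhale import evaluation


def log_batch_result(batch_id: int, prediction: int, label: int) -> None:
    # Append one record to a custom results table; rows logged this way can be
    # browsed and compared later via swcli or the Web UI.
    evaluation.log(
        category="inference-details",  # illustrative table name
        id=batch_id,
        metrics={"prediction": prediction, "label": label},
    )


def log_final_metrics(accuracy: float) -> None:
    # Record aggregate metrics for the whole evaluation job.
    evaluation.log_summary({"accuracy": accuracy})
```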
Key Elements
- swcli model run: Command line for bulk offline model evaluation.
- swcli model serve: Command line for online model evaluation.
Best Practices
Command Line Grouping
From the perspective of completing an end-to-end Starwhale Evaluation workflow, the commands can be grouped as follows:
- Preparation Stage
  - swcli dataset build or the Starwhale Dataset Python SDK (see the sketch after this list)
  - swcli model build or the Starwhale Model Python SDK
  - swcli runtime build
- Evaluation Stage
  - swcli model run
  - swcli model serve
- Results Stage
  - swcli job info
- Basic Management
  - swcli job list
  - swcli job remove
  - swcli job recover
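For the Preparation Stage, a dataset can also be created programmatically with the Starwhale Dataset Python SDK instead of swcli dataset build. The sketch below is a minimal, hedged example: the dataset() constructor, the create="empty" argument, and the append/commit/close calls reflect recent SDK releases and may differ by version, and the dataset name and fields are illustrative.

```python
from starwhale import dataset


def build_demo_dataset() -> None:
    # Create a new, empty dataset on the current instance (name is illustrative).
    ds = dataset("demo-eval-data", create="empty")
    for i in range(10):
        # Each appended record becomes one dataset row for evaluation tasks to consume.
        ds.append({"text": f"sample-{i}", "label": i % 2})
    ds.commit()  # produce an immutable dataset version
    ds.close()


if __name__ == "__main__":
    build_demo_dataset()
```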
Abstraction: job-step-task
- job: A model evaluation task is a job, which contains one or more steps.
- step: A step corresponds to a stage in the evaluation process. With the default PipelineHandler, the steps are predict and evaluate. For custom evaluation processes using the @handler, @evaluation.predict, or @evaluation.evaluate decorators, the steps are the decorated functions. Steps can have dependencies on each other, forming a DAG. A step contains one or more tasks. Tasks in the same step have the same logic but different inputs; a common approach is to split the dataset into multiple parts, with each part passed to one task. Tasks can run in parallel.
- task: A task is the final running entity. In Cloud/Server instances, a task is a container in a Pod. In Standalone instances, a task is a Python thread.
The job-step-task abstraction is the basis for implementing distributed runs in Starwhale Evaluation.
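To make the job-step-task abstraction concrete, the hedged sketch below defines a custom evaluation process with the @evaluation.predict and @evaluation.evaluate decorators mentioned above. Each decorated function becomes a step; the replicas and needs parameters (splitting a step into parallel tasks and declaring the DAG dependency) and the use_predict_auto_log flag follow recent Starwhale SDK releases and may differ by version.

```python
from starwhale import evaluation


@evaluation.predict(replicas=2)  # the predict step is split into 2 parallel tasks
def predict(data: dict) -> int:
    # Called once per dataset row (the exact argument depends on the dataset
    # schema); the return value is auto-logged for the evaluate step to consume.
    # The toy "model" below just guesses a label from the text length.
    return len(str(data.get("text", ""))) % 2


@evaluation.evaluate(needs=[predict], use_predict_auto_log=True)
def evaluate(predict_results) -> None:
    # Runs after every predict task has finished and iterates over their
    # auto-logged results. The per-record shape is version-dependent, so this
    # sketch only counts the records and logs a single summary metric.
    total = sum(1 for _ in predict_results)
    evaluation.log_summary({"evaluated_rows": total})
```

Running such a module with swcli model run produces one job with two steps: the predict step fans out into two parallel tasks over the dataset, and the evaluate step starts only after both predict tasks finish.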