Spine Toolbox: A flexible open-source workflow management system with scenario and data management

Workflow management tools do not usually have an interest in the data they pass beyond the need to filter, aggregate and validate. However, in several use cases, the modification of both parameter values and data structures is an important part of the work. Models used to support decision making must account for uncertainty and rely on scenarios containing alternative values. The open-source, Python-based Spine Toolbox addresses these use cases with an integrated platform for data acquisition and data processing, and advanced functionalities for chaining tools and models into complex shareable workflows. The software helps users to import and manage data, define models and scenarios and orchestrate projects. Through its user interface, it facilitates the linking of models with different scopes or spatio-temporal resolutions. This paper presents the Spine Toolbox user interface, the Spine data structure, and the various interfaces for external tools and models.


Motivation and significance
Modelling and simulation are crucial methods for many scientific and engineering endeavours, many of which are highly complex. Models often require various sources of input data, and some of that data may need complex processing before it is model compatible. Many topics are multi-faceted and multi-scale, requiring several models and processing tools to be adequately represented. Finally, uncertainties can be diverse yet critical for robust decision making. Tackling these intricacies in a repeatable and dependable fashion requires not just reliable models but also reliable workflow management that can deal with the versatile processes and data.
There are several existing open source workflow management tools, each with their strengths and weaknesses. Some of the present authors are experienced power and energy system modellers who understand the modelling workflow challenges in their domain. One specific challenge is to manage not just the execution of the tools but also to represent manifold data and scenarios. Most generic workflow management tools do not have explicit means to represent data as part of the workflow. With the resources to design and implement something new, this was a focal point for Spine Toolbox, which is introduced below. At the same time, workflow tools (or 'modelling frameworks') in the energy domain typically fall short in their capabilities to orchestrate complex workflows. In the next paragraphs we highlight these aspects using a diverse sample of workflow tools sourced both from the generic domain and from the energy systems domain. For a more comprehensive review, Atkinson et al. [1] give an overview of workflow management tool development over the past ten years, while Ringkjøb et al. [2] provide an overview of recent energy system modelling tools.
Workflow management tools (or systems/software) can be used to facilitate the design and execution of workflows. Many of them are geared towards business processes. However, there are also some open source tools that focus on research and analysis workflows. Pegasus WMS [3] is an execution framework built with Java and Python that allows the users to define workflows as processing plans built from a selection of components. Pegasus, or other similar frameworks, can be enriched with semantic workflow structures like WINGS, which allow the incorporation of more metadata and guidance for the user, and allow the developers to chain components into ready-made template workflows. However, Pegasus only transfers data 'replicas' from the data 'catalogue'. It does not offer a unified interface to the data or the capability for the user to create scenarios based on alternative parameter values, which was one of the design criteria for our purposes. Similarly, Apache Airflow, which executes directed acyclic graphs (DAG) in pure Python, does not offer data interfaces and does not support passing data between tools in the DAGs, which is relevant for our application. There are also tools with a data analytics focus like the Java-based KNIME and the Scala-based Apache Spark. These also lack the capability to easily build scenarios that can be executed in optimization and simulation models. However, they have powerful data cleaning and processing capabilities and could serve as the first steps in a Spine Toolbox-based modelling chain.
In the energy system domain, many modelling frameworks have their own system of managing data, which is often quite rigid and limited to the immediate inputs and outputs of the tool (e.g. TIMES, OSeMOSYS and Calliope). In some cases, the modelling community has adopted lower level script chaining tools, like Snakemake in the case of PyPSA. This allows for much more freedom, but requires that the user understands the code if the workflow needs to be changed.
The design criteria for Spine Toolbox were largely drawn from the requirements of modelling that tries to support decision making under uncertainty.
As a consequence, Spine Toolbox offers not just workflow management, but also versatile data structures and scenario management. The users can have multiple models and tools using the same database or databases, which allows groups of modellers to perform integrated scenario analysis using a suite of specialised tools. As illustrated in Table 2, the characteristics of Spine Toolbox are similar to those of data analytics workflow management systems such as KNIME [4] or Alteryx [5] but, in comparison with these widely adopted tools, the novelty of Spine resides in a set of features specialised for decision making under uncertainty. Furthermore, Spine Toolbox is written in Python to allow for the easy integration of Python-based tools that are widespread in the research community. On the other hand, Spine Toolbox is a new entrant and lacks many specific data processing capabilities present in the more mature tools. However, Spine Toolbox workflows can incorporate other data processing tools available in the open source community.
Fully supported, ( ) Partially supported, Supported with additional software or a plugin, × Not supported.

Spine Toolbox is an open source Python package to manage data, scenarios and workflows for modelling and simulation. Users can have local workflows, but work as a team through version control and SQL databases.
The major features of Spine Toolbox are the following:
• It implements an application which enables and orchestrates the data acquisition from multiple (and diverse) data sources and provides the mechanisms to validate and associate those data.
• It provides a generic data model, the Spine Data Structure, using a generic Entity-Attribute-Value approach with Classes and Relationships implemented as SQL databases through SQLAlchemy. The abstract classes and the relationships between them enable the formulation of diverse models in an object-oriented manner.
• Entities can hold parameters that can be constants, time series, arrays and multi-dimensional maps. The interface facilitates the viewing and editing of data.
• Scenarios can be built from the alternative parameter values. It provides all the required interfaces for performing calculations on the data.
• There are importer, exporter and data manipulation tools that allow conversion of data between different data structures and formats (e.g. csv, xlsx, gdf, sql). This facilitates a wide range of possible analyses.
• A Python-based API is available for querying the Spine data. A similar interface is available for Julia with additional capabilities for direct model building in Julia/JuMP.
• Tool specifications are used to define what different tools require to be executed.
• Tool specifications can be turned into plug-ins. An example plug-in is the SpineOpt.jl Julia package, which operates on a Spine dataset by generating and simulating the optimisation model.
• Execution has been separated from the interface, which enables parallelization of the workflow and scenarios as well as remote execution using the inbuilt server-client system.
Spine Toolbox can be used to design and execute workflows consisting of multiple tools and models [11]. It supports problem-independent data and scenarios with user interfaces that automatically support different data structures as long as they conform to the entity-attribute-value with classes and relationships data model. The user interface is built using the Model-View-Controller architectural pattern. The high-level structure strives for modularity both inside Spine Toolbox and in the workflows it can execute. For the workflows, an equivalence between a composite model and a Directed Acyclic Graph (DAG) computational workflow is assumed, where each node has four main elements: an input from the previous node, an output to the successor, some internal operations (the workflow step) and access to external data sources. As illustrated in Figure 1, a computational node needs to receive the input from the predecessor node, or an empty value if it is the root (node 0), and during execution it interacts with the external data source in both reading and writing mode. At the end of the execution, the node pushes the output data to the successor node. The composition of various nodes results in a computational workflow equivalent to a DAG.
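The node model described above can be illustrated with a minimal Python sketch. It is not the Spine Toolbox implementation: the step functions, the `store` dictionary standing in for an external data source, and the `run_workflow` helper are all hypothetical names chosen for this illustration, and the ordering relies on the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def load(inputs, store):
    # Root node: receives no predecessor input and reads the external source.
    return list(store["raw"])


def double(inputs, store):
    # Internal operation applied to the predecessor's output.
    return [2 * x for x in inputs["load"]]


def total(inputs, store):
    result = sum(inputs["double"])
    store["result"] = result  # write back to the external data source
    return result


# Hypothetical workflow: node name -> step function.
steps = {"load": load, "double": double, "total": total}
# Node name -> set of predecessor names (the form graphlib expects).
predecessors = {"load": set(), "double": {"load"}, "total": {"double"}}


def run_workflow(steps, predecessors, store):
    """Run each node after its predecessors and pass their outputs forward."""
    outputs = {}
    for name in TopologicalSorter(predecessors).static_order():
        inputs = {p: outputs[p] for p in predecessors[name]}
        outputs[name] = steps[name](inputs, store)
    return outputs


store = {"raw": [1, 2, 3]}
outputs = run_workflow(steps, predecessors, store)
print(outputs["total"])  # 12
```

Each function sees the four elements named in the text: predecessor input (`inputs`), its own operation, the returned output, and read/write access to the external source (`store`).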
The software architecture of Spine follows the workflow control architectural pattern, using a tight integration [12] between the toolbox and the other components (described in the following sections).

Spine Data Structure
The Spine data structure is an entity-attribute-value with classes and relationships data model for a structured yet flexible storage of data. The classes and relationships define the structure between different data elements with a strong resemblance to graphs. Graph-like structures are prevalent in modelling and optimisation, which makes the data structure well suited for the intended purpose. The data structure is further augmented with the capability to contain multiple alternative values for the same parameter (i.e. attribute). Sets of alternatives can then be used to create scenarios. The data structure is an integral part of Spine Toolbox because it enables the users to work with a single dataset serving multiple models, thus enabling an efficient and more reliable way of working.
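How alternatives and scenarios interact can be sketched in a few lines of Python. This is an in-memory illustration only, not the Spine database schema: the `Dataset` class and its methods are hypothetical, and the rule that later alternatives in a scenario override earlier ones is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # (entity class, entity, parameter, alternative) -> value
    values: dict = field(default_factory=dict)
    # scenario -> alternatives, lowest priority first (assumed convention)
    scenarios: dict = field(default_factory=dict)

    def set_value(self, entity_class, entity, parameter, alternative, value):
        self.values[(entity_class, entity, parameter, alternative)] = value

    def resolve(self, scenario, entity_class, entity, parameter):
        """Return the value from the highest-priority alternative defining it."""
        for alternative in reversed(self.scenarios[scenario]):
            key = (entity_class, entity, parameter, alternative)
            if key in self.values:
                return self.values[key]
        raise KeyError(f"{parameter} of {entity} has no value in {scenario}")


data = Dataset()
data.set_value("unit", "plant_a", "capacity", "Base", 100.0)
data.set_value("unit", "plant_a", "efficiency", "Base", 0.4)
data.set_value("unit", "plant_a", "capacity", "high_capacity", 150.0)
data.scenarios["reference"] = ["Base"]
data.scenarios["expansion"] = ["Base", "high_capacity"]

capacity = data.resolve("expansion", "unit", "plant_a", "capacity")
efficiency = data.resolve("expansion", "unit", "plant_a", "efficiency")
print(capacity, efficiency)  # 150.0 0.4
```

The `expansion` scenario only stores the value that differs from `Base`; every other parameter falls through to the base alternative, which is what lets one dataset serve several scenarios.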

Spine DB API
Spine DB API holds the low-level Python functions that can access and modify all the different parts of the Entity-Attribute-Value data structure while maintaining structural integrity. Spine's inbuilt importer, exporter and data manipulators, as well as SpineInterface.jl, all use Spine DB API.
Users can create their own direct connections to the Spine database using the Spine DB API.
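To give a feel for what "maintaining structural integrity" means for an entity-attribute-value store in SQL, the sketch below uses Python's standard `sqlite3` module. It deliberately does not use Spine DB API calls, and the three-table schema is a simplified, hypothetical one invented for this illustration rather than the actual Spine schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript(
    """
    CREATE TABLE entity_class (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE entity (
        id INTEGER PRIMARY KEY,
        class_id INTEGER NOT NULL REFERENCES entity_class(id),
        name TEXT NOT NULL,
        UNIQUE (class_id, name)
    );
    CREATE TABLE parameter_value (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        parameter TEXT NOT NULL,
        alternative TEXT NOT NULL,
        value TEXT NOT NULL,
        PRIMARY KEY (entity_id, parameter, alternative)
    );
    """
)

conn.execute("INSERT INTO entity_class (id, name) VALUES (1, 'unit')")
conn.execute("INSERT INTO entity (id, class_id, name) VALUES (1, 1, 'plant_a')")
conn.execute("INSERT INTO parameter_value VALUES (1, 'capacity', 'Base', '100.0')")

# A value that points at a non-existent entity violates the structure.
try:
    conn.execute("INSERT INTO parameter_value VALUES (99, 'capacity', 'Base', '1.0')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

value_count = conn.execute("SELECT COUNT(*) FROM parameter_value").fetchone()[0]
print(orphan_rejected, value_count)
```

A dedicated API layer can add checks that plain constraints cannot express, which is one reason to route all access through a single library rather than through ad hoc SQL.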

Spine Toolbox user interface
Spine Toolbox provides the Graphical User Interface (GUI) application that enables the definition, management, and execution of energy system models. It gives the user the ability to collect, create, organise, and validate model input data, execute a model with selected data and finally archive and visualise results/output data. Spine Toolbox is designed to support the creation and execution of optimization and simulation models as well as data processing tools. It can also be used for other purposes where the inbuilt data structure proves useful.

Spine DB editor
Part of the GUI of Spine Toolbox is the Spine DB editor. The editor allows users to view and modify data, add alternative values, build scenarios, and view metadata. It has undo functionality, and data changes are submitted through a commit that asks for a commit message. The editor consists of four main components:
Entity tree shows entities organised according to the object and relationship classes and allows filtering in the data views described below.
List view represents data as a list of entities and their parameter values. It is the most straightforward interface for viewing and manipulating both the entities and the parameter values. A version of the list view also shows the available parameter definitions for the classes and facilitates their editing.
Tabular view provides a table which can be used to view and manipulate parameter values of a given entity class. The user controls the data that the two axes contain, and data can also be filtered. The axes can take structural dimensions (classes), index dimensions from the parameter values as well as alternative and scenario dimensions.
Graph view provides a visual presentation of the data structure. The various entities of a Spine dataset are represented as nodes, with edges signifying the relationships between them. The end user can choose classes and entities to be displayed. Parameter data for selected entities can be seen and manipulated in a separate list view below the graph.

Spine Engine
Spine Engine provides the functionality for workflow execution. The objective of this component is to execute a part of the workflow or the whole workflow either locally or using a client-server setup.
During the workflow execution, each of the items in the workflow is executed and its outputs are passed to the successor node(s). The main inputs of the Spine Engine are items, specifications, successors, and execution permits. These inputs are parsed from a project file which describes the workflow to be executed. The project file is parsed before the Spine Engine object is instantiated, allowing for the relevant inputs to be made available to the Spine Engine. The inputs for the Engine are described as:
An Item in the context of a Spine workflow is an object of computation which can be connected with other items to form a workflow. The Spine Toolbox provides the interface where the user can choose the relevant items to construct the desired workflow. Each item available in the Toolbox interface should provide a distinct but complementary functionality to other available items. The functionality of these items includes the execution of program files (with direct support for Python, Julia, and GAMS code, while other tools can be run through shell executables), importing data, exporting data, storing data, executing Jupyter notebooks, and more.
Tool specifications describe item attributes such as input files, output files, executable location, etc. They are instantiated as items in the workflow.
Successors are mappings from an item name to a list of successor item names which describe the dependencies between items in the workflow.
Execution Permits are mappings of an item name to a Boolean value, describing which items in the workflow should be executed. If the Boolean is false then the item will not execute but its resources will be collected. The interface allows the selection of any number of items for execution.
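The successor and execution permit mappings described above can be sketched as plain Python dictionaries. This is a sketch under assumptions, not the Spine Engine code: the item names, the `execution_plan` helper and the action labels are hypothetical, and the ordering uses the standard library's `graphlib` rather than the library the engine relies on.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: item name -> list of successor item names.
successors = {
    "importer": ["input_db"],
    "input_db": ["model"],
    "model": ["output_db"],
    "output_db": [],
}
# Item name -> whether the item should actually be executed.
execution_permits = {
    "importer": False,
    "input_db": True,
    "model": True,
    "output_db": True,
}


def execution_plan(successors, execution_permits):
    """Order items so predecessors come first and label what happens to each."""
    # graphlib expects node -> predecessors, so invert the successor mapping.
    predecessors = {name: set() for name in successors}
    for name, next_items in successors.items():
        for next_item in next_items:
            predecessors.setdefault(next_item, set()).add(name)
    plan = []
    for name in TopologicalSorter(predecessors).static_order():
        permitted = execution_permits.get(name, False)
        # Items without a permit are skipped, but their resources stay available.
        plan.append((name, "execute" if permitted else "collect resources only"))
    return plan


plan = execution_plan(successors, execution_permits)
for name, action in plan:
    print(f"{name}: {action}")
```

Here the importer is not re-run, yet it still appears in the plan so that the items downstream can locate what it provides.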
Based on the aforementioned inputs, the engine will construct a set of objects called solids. Solids are defined by the items belonging to the workflow. From the set of solids, a pipeline is constructed. The use of "solids" and "pipelines" in describing the inner workings of the Spine Engine is derived from the Dagster [13] library, which the engine utilises to efficiently execute a workflow. Once the pipeline is constructed it is available for execution.
Workflow execution is performed in parallel, with items in the workflow being executed concurrently where item dependencies allow. Parallelisation of workflow execution is widely used by established workflow engines such as openDIEL [14] and Chiron [15] to decrease the execution time of a workflow. However, Spine Engine can also parallelise the execution of scenarios that have been built in the Spine DB manager. With the growing availability of multi-core CPUs for desktops and laptops, most user hardware should be capable of taking advantage of this Spine Engine feature.
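The idea of running the same model for several scenarios concurrently can be illustrated with the standard library's `concurrent.futures`. This is a generic sketch, not how Spine Engine schedules work: `run_model`, the scenario tuples and the demand figures are hypothetical, and a thread pool is used only to keep the example short (CPU-bound models would normally need separate processes to benefit from multiple cores).

```python
from concurrent.futures import ThreadPoolExecutor


def run_model(scenario):
    """Hypothetical model run: scale a fixed demand profile and sum it."""
    name, demand_scale = scenario
    base_demand = [10.0, 20.0, 30.0]
    return name, sum(d * demand_scale for d in base_demand)


# Hypothetical scenarios: (name, demand scaling factor).
scenarios = [("base", 1.0), ("high_demand", 1.5), ("low_demand", 0.5)]

# Each scenario is independent of the others, so the runs can overlap.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(run_model, scenarios))

print(results)
```

Because the scenarios share no state, the result for each one is the same whether the runs happen in sequence or in parallel; only the wall-clock time changes.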
The engine also offers two modes of execution, the first of which is performed from within the Spine Toolbox interface. When the engine is executed from the toolbox, the user is provided with a log of item executions along with item animations that give a visual representation of execution progress. The second mode is headless: execution is started from the command line, completely separately from the user interface. When executed in the headless mode, the user is provided with the execution log in the terminal window.
Spine Engine also has client-server capability for remote execution. A Spine Toolbox instance on the client side can send a workflow to a Spine Engine instance on the server side. Messaging is based on ZeroMQ and can include the data required by the workflow. However, when possible, it is better to use a server-based SQL database that the Spine Engine server can access directly given the instructions from the client. This remote execution feature is relatively new and is likely to evolve.

Spine Interface
SpineInterface.jl is a Julia package for interfacing with the Spine Data Structure within a Julia session. It relies on the Spine DB API and, given the URL of a Spine database, creates a series of convenience functions to retrieve the contents of that database in the Julia module or session where it is called. It allows users to rapidly build Julia tools that use Spine databases. It is especially useful for building optimisation models in Julia/JuMP. The SpineOpt.jl energy system model is an example of this.

SpineOpt
SpineOpt.jl is an open-source energy system modelling framework that uses Spine Toolbox data structures directly through SpineInterface.jl. It is a Spine Toolbox plugin and can be efficiently integrated into Spine Toolbox workflows. Through a commodity-agnostic and problem-independent formulation, SpineOpt facilitates the modelling of integrated energy systems and can also incorporate other phenomena that can be represented with conversions and transfers between nodes. The data-driven approach of SpineOpt gives users the flexibility to add parameters and constraints. Ihlemann et al. [16] provide a detailed overview of SpineOpt.

Illustrative Examples
The Spine project performed thirteen case studies to validate, expand and demonstrate the capabilities of Spine Toolbox and SpineOpt. A workflow has been created for each case study, linking all the data and tools necessary for its completion. The workflows are available as Spine projects at https://github.com/orgs/Spine-project/repositories. Note that while the Spine project has tried to use open access data where possible, in some instances the original data was not publicly available and has been omitted or replaced with dummy data. In the following we highlight two case studies.
In the first example (Fig 2, Case study A3), input data is first processed with two Python scripts to perform specific calculations. This is useful when the capabilities of the Spine importer are not sufficient for the required manipulations. The scripts save data as csv files, which are then imported to a Spine data store using the Spine importer, which takes tabular data and converts it into the Spine data store format using the user-made specifications in the importer interface. SpineOpt uses SpineInterface to interact with the input and output databases based on the resource URLs that the workflow passes to SpineOpt. Finally, the 'Convert Results' Python script gets data from both the input and output databases in order to show results that relate the inputs to the outputs.
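The import step in this workflow, turning tabular csv data into entity, parameter and value records according to a user-made specification, can be sketched as follows. The csv content, the `mapping` dictionary and the `import_table` function are hypothetical and much simpler than the Spine importer; they only illustrate the kind of conversion involved.

```python
import csv
import io

# Hypothetical output of a pre-processing script.
csv_text = """unit,capacity,efficiency
plant_a,100,0.40
plant_b,250,0.55
"""

# Hypothetical mapping specification: which column names the entity and
# which columns hold parameter values.
mapping = {
    "entity_class": "unit",
    "entity_column": "unit",
    "parameter_columns": ["capacity", "efficiency"],
    "alternative": "Base",
}


def import_table(text, mapping):
    """Convert wide tabular data into (class, entity, parameter, alternative, value) rows."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        entity = row[mapping["entity_column"]]
        for parameter in mapping["parameter_columns"]:
            records.append(
                (
                    mapping["entity_class"],
                    entity,
                    parameter,
                    mapping["alternative"],
                    float(row[parameter]),
                )
            )
    return records


records = import_table(csv_text, mapping)
for record in records:
    print(record)
```

Keeping the mapping separate from the conversion code mirrors the idea that the same importer can serve differently shaped tables once a specification has been defined for each.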
The second example (Case study A5) shows how the graph view of the database editor displays the structure of the data as a graph (Fig 3). All entities and parameters can be edited.

Impact
Workflow management tools have had an enormous impact on the scientific process and they are also extensively used in businesses and public administration. Supporting decision making through modelling creates specific requirements for the workflow management tool and Spine Toolbox strives to address those. Such decision making is widespread: models are used to support pandemic response, city planning, process design, as well as energy system planning and operation, to name a few examples. The common denominator in these tasks is the need to consider uncertainty: not all factors

Conclusions
Spine Toolbox provides a workflow, data, and scenario management framework that can combine multiple data sources and tools while giving the user a full view of the workflow from sources to outcomes. It allows groups of users to work together on large-scale problems that require data curation as well as multiple tools and models. The toolbox and the data structure features are capable of performing a wide range of data processing and modelling tasks with a specific focus on data consolidation and scenario analysis.

Conflict of Interest
We confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Figure 1: Spine Toolbox workflow equivalence to a DAG and execution queue

Figure 2: Spine Toolbox workflow for an energy system model of a district heating grid. On the left, there are input data processing scripts and Spine importers that serve the main Input database. On the right, the SpineOpt model feeds results to a processing script, which also takes input data into account.

Table 2: Analysis of the offered functionality of commercial/open source packages versus the Spine solution. Legend: