PyQBench: a Python library for benchmarking gate-based quantum computers

We introduce PyQBench, an innovative open-source framework for benchmarking gate-based quantum computers. PyQBench can benchmark NISQ devices by verifying their capability of discriminating between two von Neumann measurements. PyQBench offers a simplified, ready-to-use command line interface (CLI) for running benchmarks using a predefined parametrized Fourier family of measurements. For more advanced scenarios, PyQBench offers a way of employing user-defined measurements instead of the predefined ones.


Motivation and significance
Noisy Intermediate-Scale Quantum (NISQ) [1] devices are storming the market, with a wide selection of devices based on different architectures and accompanying software solutions. Among hardware providers offering public access to their gate-based devices, one could mention Rigetti [2], IBM [3], Oxford Quantum Group [4], IonQ [5] or Xanadu [6]. Other vendors offer devices operating in different paradigms. Notably, one could mention D-Wave [7] and their quantum annealers, or QuEra devices [8] based on neutral atoms. Most vendors provide their own software stack and application programming interface for accessing their devices. To name a few, Rigetti's computers are available through their Forest SDK [9] and the PyQuil library [10], and IBM Q [3] computers can be accessed through Qiskit [11] or the IBM Quantum Experience web interface [12]. Some cloud services, like Amazon Braket [13], offer access to several quantum devices under a unified API. On top of that, several libraries and frameworks can integrate with multiple hardware vendors. Examples of such frameworks include IBM Q's Qiskit or Zapata Computing's Orquestra [14].
It is well known that NISQ devices have their limitations [15]. The question is to what extent those devices can perform meaningful computations. To answer this question, one has to devise a methodology for benchmarking them. For gate-based computers, on which this paper focuses, several approaches already exist. One could mention randomized benchmarking [16,17,18,19,20] or benchmarks based on the quantum volume [21,22,23].
In this paper, we introduce a different approach to benchmarking gate-based devices, one with a simple operational interpretation. In our method, we test how well a given device can guess which of two known von Neumann measurements was performed during an experiment. We implemented our approach in an open-source Python library called PyQBench. The library supports any device available through the Qiskit library, and thus can be used with providers such as IBM Q or Amazon Braket. Along with the library, the PyQBench package contains a command line tool for running the most common benchmarking scenarios.

Existing benchmarking methodologies and software
Unsurprisingly, PyQBench is not the only software package for benchmarking gate-based devices. While we believe that our approach has significant benefits over other benchmarking techniques, for completeness we discuss in this section some of the similar software currently available.
Probably the simplest benchmarking method one could devise is running known algorithms and comparing their outputs with the expected ones. Analyzing the frequency of correct outputs, or the deviation between the actual and expected output distributions, then provides a metric of the performance of a given device. Libraries such as the Munich Quantum Toolkit (MQT) [24,25] or SupermarQ [26,27] contain benchmarks leveraging multiple algorithms, such as Shor's algorithm or Grover's algorithm. Despite being intuitive and easily interpretable, such benchmarks have some problems. Most importantly, they assess the usefulness of a quantum device only for a very particular algorithm, and it might be hard to extrapolate their results to other algorithms and applications. For instance, the inability of a device to consistently find factorizations using Shor's algorithm tells us nothing about its usefulness for variational quantum algorithms.
Another possible approach to benchmarking quantum computers is randomized benchmarking. In this approach, one samples circuits from some predefined set of gates (e.g. from the Clifford group) and tests how much the output distribution obtained from the device running these circuits differs from the ideal one. It is also common to concatenate randomly chosen circuits with their inverses (which should yield the identity circuit) and run those concatenated circuits on the device. Libraries implementing this approach include Qiskit [28] and PyQuil [29].
Another quantity used for benchmarking NISQ devices is the quantum volume. The quantum volume characterizes the capacity of a device for solving computational problems. It takes into account multiple factors, like the number of qubits, connectivity, and measurement errors. The Qiskit library allows one to measure the quantum volume of a device using its qiskit.ignis.verification.quantumvolume module. Other implementations of quantum volume can be found as well, see e.g. [30].

Preliminaries and discrimination scheme approach
In this section we describe how the benchmarking process in PyQBench works. To do so, we first discuss the necessary mathematical preliminaries. Then, we present the general form of the discrimination scheme used in PyQBench, along with practical considerations on how to implement it given the limitations of current NISQ devices.

Mathematical preliminaries
Let us first recall the definition of a von Neumann measurement, which is the only type of measurement used in PyQBench. A von Neumann measurement P is a collection of rank-one projectors {|u_0⟩⟨u_0|, ..., |u_{d−1}⟩⟨u_{d−1}|}, called effects, that sum up to the identity, i.e. ∑_{i=0}^{d−1} |u_i⟩⟨u_i| = 𝟙.
If U is a unitary matrix of size d, one can construct a von Neumann measurement P_U by taking projectors onto its columns. In this case we say that P_U is described by the matrix U.
Typically, NISQ devices can only perform measurements in the computational Z-basis, i.e. U = 𝟙. To implement an arbitrary von Neumann measurement P_U, one first applies U† to the measured system and then follows with a Z-basis measurement. This process, depicted in Fig. 1, can be viewed as performing a change of basis prior to the measurement in the computational basis.
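As a quick numerical illustration (plain NumPy, independent of PyQBench), the outcome probabilities of a measurement in the basis U coincide with the Z-basis probabilities of the rotated state U†|ψ⟩:

```python
import numpy as np

# Example basis: the Hadamard basis (any unitary U would do).
U = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
psi = np.array([0.6, 0.8j])  # an arbitrary normalized single-qubit state

# Born rule stated directly in the basis U: p_i = |<u_i|psi>|^2
p_direct = np.array([abs(U[:, i].conj() @ psi) ** 2 for i in range(2)])

# Change of basis first (apply U^dagger), then measure in the Z basis
p_rotated = np.abs(U.conj().T @ psi) ** 2

assert np.allclose(p_direct, p_rotated)
```

Both computations give the same distribution, which is exactly why the circuit in Fig. 1 implements P_U.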

Discrimination scheme
Benchmarks in PyQBench work by experimentally determining the probability of correct discrimination between two von Neumann measurements by the device under test and comparing the result with the ideal, theoretical predictions.
Without loss of generality, we consider the discrimination task between single-qubit measurements P_𝟙, performed in the computational Z-basis, and an alternative measurement P_U performed in the basis U. Note, however, that the discrimination scheme described below works regardless of the dimensionality of the system, see [31] for details.
In general, the discrimination scheme presented in Fig. 2 requires an auxiliary qubit. First, the joint system is prepared in some state |ψ_0⟩. Then, one of the measurements, either P_U or P_𝟙, is performed on the first part of the system. Based on its outcome i, we choose another POVM P_{V_i} and perform it on the second qubit, obtaining outcome j. Finally, if j = 0, we say that the performed measurement was P_U; otherwise, we say that it was P_𝟙. Naturally, we need to repeat the same procedure multiple times for both measurements to obtain a reliable estimate of the underlying probability distribution. In PyQBench, we assume that the experiment is repeated the same number of times for both P_U and P_𝟙.
Unsurprisingly, both the state |ψ_0⟩ and the final measurements P_{V_i} have to be chosen specifically for a given U to maximize the probability of a correct guess. A detailed description of how these choices are made can be found in [32]; for now, we focus only on how this scheme can be implemented on actual devices, assuming that all the components are known.

Implementation of discrimination scheme on actual NISQ devices
Current NISQ devices are unable to perform conditional measurements, which is the biggest obstacle to implementing our scheme on real hardware. However, we circumvent this problem by slightly adjusting the scheme so that it only uses components available on current devices. For this purpose, we use two possible options: postselection or a direct sum.

Scheme 1. (Postselection)
The first idea uses a postselection scheme. In the original scheme, we measure the first qubit and only then determine which measurement should be performed on the second one. Instead of making this choice, we can run two circuits, one with P_{V_0} and one with P_{V_1}, and measure both qubits. We then discard the results of the circuit for which the label i does not match the measurement label k. The circuit for postselection is depicted in Fig. 3.
Figure 3: A schematic representation of the setup for distinguishing measurements P_U and P_𝟙 using the postselection approach. In the postselection scheme, one runs such circuits for both k = 0, 1 and discards results for cases when there is a mismatch between k and i.
To perform the benchmark, one needs to run multiple copies of the postselection circuit, with both P_U and P_𝟙. Each circuit has to be run in both variants, one with the final measurement P_{V_0} and the second with the final measurement P_{V_1}. The experiments can thus be grouped into classes identified by tuples of the form (Q, k, i, j), where Q ∈ {P_U, P_𝟙} denotes the chosen measurement, k ∈ {0, 1} designates the final measurement used, and i ∈ {0, 1} and j ∈ {0, 1} are the labels of outcomes as presented in Fig. 3. We then discard all the experiments for which i ≠ k. Writing N(Q, k, i, j) for the number of experiments in class (Q, k, i, j), the total number of valid experiments is thus

    N_valid = ∑_{Q,k,j} N(Q, k, k, j).

Finally, we count the valid experiments resulting in successful discrimination. If we have chosen P_U, then we guess correctly iff j = 0. Similarly, for P_𝟙, we guess correctly iff j = 1. If we define

    N_success = ∑_k N(P_U, k, k, 0) + ∑_k N(P_𝟙, k, k, 1),

then the empirical success probability can be computed as p_succ = N_success / N_valid. The value p_succ is the quantity reported to the user as the result of the benchmark.
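The postselection bookkeeping can be sketched in a few lines of plain Python. The class counts below are made up purely for illustration; PyQBench performs this bookkeeping internally:

```python
from collections import Counter

# Hypothetical counts of experiment classes (Q, k, i, j);
# all numbers below are made up purely for illustration.
counts = Counter({
    ("u", 0, 0, 0): 40, ("u", 0, 0, 1): 8,    # valid: i == k
    ("u", 1, 1, 0): 38, ("u", 1, 1, 1): 10,
    ("u", 0, 1, 0): 25, ("u", 1, 0, 1): 29,   # discarded: i != k
    ("id", 0, 0, 0): 9, ("id", 0, 0, 1): 41,
    ("id", 1, 1, 0): 7, ("id", 1, 1, 1): 43,
    ("id", 0, 1, 1): 27, ("id", 1, 0, 0): 21,
})

# keep only experiments where outcome i matches the final-measurement label k
valid = {key: n for key, n in counts.items() if key[1] == key[2]}
n_valid = sum(valid.values())

# correct guess: j == 0 when the measurement was P_U, j == 1 when it was P_1l
n_success = sum(
    n for (q, k, i, j), n in valid.items()
    if (q == "u" and j == 0) or (q == "id" and j == 1)
)
p_succ = n_success / n_valid  # empirical probability reported by the benchmark
```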

Scheme 2. (Direct sum)
The second idea uses the direct sum V_0† ⊕ V_1†. Here, instead of performing a conditional measurement P_{V_k}, where k ∈ {0, 1}, we run the circuits presented in Fig. 4.
One can see why such a circuit is equivalent to the original discrimination scheme. If we rewrite the block-diagonal matrix V_0† ⊕ V_1† as

    V_0† ⊕ V_1† = |0⟩⟨0| ⊗ V_0† + |1⟩⟨1| ⊗ V_1†,

we can see that the direct sum commutes with the measurement on the first qubit. Thanks to this, we can switch the order of operations to obtain the circuit from Fig. 5. Now, depending on the outcome i, one of the summands vanishes, and we end up performing exactly the same operations as in the original scheme.
In this scheme, an experiment can be characterized by a tuple (Q, i, j), where Q ∈ {P_U, P_𝟙} and i, j ∈ {0, 1} are the output labels. Writing N(Q, i, j) for the number of experiments with outcomes (i, j) for measurement Q, the numbers of successful trials for U and 𝟙, respectively, can be written as

    N_U = ∑_i N(P_U, i, 0),    N_𝟙 = ∑_i N(P_𝟙, i, 1).

The probability of correct discrimination between P_U and P_𝟙 is then given by

    p_succ = (N_U + N_𝟙) / N_total,

where N_total is the total number of trials.
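A minimal sketch of this count (with purely illustrative numbers, not real benchmark data):

```python
# Hypothetical counts of direct-sum experiment classes (Q, i, j);
# all numbers below are made up purely for illustration.
counts = {
    ("u", 0, 0): 44, ("u", 0, 1): 6,
    ("u", 1, 0): 41, ("u", 1, 1): 9,
    ("id", 0, 0): 5, ("id", 0, 1): 45,
    ("id", 1, 0): 8, ("id", 1, 1): 42,
}
n_total = sum(counts.values())

# successful trials: j == 0 for P_U and j == 1 for P_1l, regardless of i
n_u = sum(n for (q, i, j), n in counts.items() if q == "u" and j == 0)
n_id = sum(n for (q, i, j), n in counts.items() if q == "id" and j == 1)

p_succ = (n_u + n_id) / n_total
```

Unlike postselection, no trials are discarded here, so every shot contributes to the estimate.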

Importance of choosing the optimal discrimination scheme
In principle, the schemes described in the previous section could be used with any choice of |ψ_0⟩ and final measurements P_{V_i}. However, we argue that it is best to choose those components in such a way that they maximize the probability of correct discrimination. To see why, suppose that some choice of |ψ_0⟩, P_{V_0}, P_{V_1} attains a theoretical discrimination probability of one, i.e. on a perfect quantum computer one would always guess correctly. Then, on real hardware, we might obtain any empirical value in the range [1/2, 1]. On the other hand, if we choose the components of our scheme such that the successful discrimination probability is only 3/5, the range of empirically obtainable probabilities is only [1/2, 3/5]. Hence, in the second case, the discrepancy between theoretical and empirical results will be less pronounced.

Constructing optimal discrimination scheme
To construct the optimal discrimination scheme, one starts by calculating the optimal probability of correct discrimination. Using the celebrated result by Helstrom [33], one finds that the optimal probability of correct discrimination between two quantum measurements P and Q is

    p_opt = 1/2 + (1/4) ‖P − Q‖_⋄,

where ‖·‖_⋄ denotes the diamond norm. The quantum state |ψ_0⟩ maximizing the diamond norm above is called the discriminator, and can be computed e.g. using semidefinite programming (SDP) [32,34]. Furthermore, using the proof of the Holevo-Helstrom theorem, it is possible to construct the corresponding unitaries V_0, V_1 that yield the optimal discrimination strategy. For brevity, we do not describe this procedure here. Instead, we refer the interested reader to [32].

Discrimination scheme for parameterized Fourier family and implementation
So far, we have only discussed how the discrimination is performed, assuming that all the needed components |ψ_0⟩, V_0, and V_1 are known. In this section, we provide a concrete example using the parametrized Fourier family of measurements.
The parametrized Fourier family of measurements is defined as the set {P_{U_φ} : φ ∈ [0, 2π]}, where

    U_φ = H · diag(1, e^{iφ}) · H,

and H is the Hadamard matrix of dimension two. For each element of this set, the discriminator is a Bell state:

    |ψ_0⟩ = (|00⟩ + |11⟩)/√2.

Observe that |ψ_0⟩ does not depend on the angle φ. The unitaries V_0, V_1, however, do depend on φ. Finally, the theoretical probability of correct discrimination between the von Neumann measurements P_{U_φ} and P_𝟙 is given by

    p(φ) = 1/2 + |1 − e^{iφ}|/4 = 1/2 + sin(φ/2)/2,    φ ∈ [0, 2π].

We explore the construction of |ψ_0⟩, V_0 and V_1 for the parametrized Fourier family of measurements in Appendix C.
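These closed forms can be sanity-checked numerically. The sketch below (plain NumPy, not PyQBench API) assumes U_φ = H·diag(1, e^{iφ})·H and p(φ) = 1/2 + |1 − e^{iφ}|/4; both are reconstructions consistent with the example values used later in the paper:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def u_phi(phi):
    """Member of the parametrized Fourier family (assumed form H * phase * H)."""
    return H @ np.diag([1, np.exp(1j * phi)]) @ H

def p_theoretical(phi):
    """Assumed closed form of the optimal discrimination probability."""
    return 0.5 + abs(1 - np.exp(1j * phi)) / 4

# U_phi is unitary for every phi; phi = 0 gives the identity, for which
# the two measurements coincide and p drops to 1/2 (pure guessing).
assert np.allclose(u_phi(1.23).conj().T @ u_phi(1.23), np.eye(2))
assert np.isclose(p_theoretical(0), 0.5)
assert np.isclose(p_theoretical(np.pi), 1.0)  # perfectly distinguishable
```

At φ = π the measurements become perfectly distinguishable, while at φ = 0 they are identical and the best strategy is a coin flip.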

Software description
This section is divided into two parts. In Section 5.1 we describe the functionalities of the PyQBench package. Next, in Section 5.2, we give a general overview of the software architecture.

Software Functionalities
PyQBench can be used in two modes: as a Python library and as a CLI script. When used as a library, PyQBench allows for customization of the discrimination scheme. The user provides a unitary matrix U defining the measurement to be discriminated, the discriminator |ψ_0⟩, and the unitaries V_0 and V_1 describing the final measurements. The PyQBench library then provides the following functionalities.
1. Assembling circuits for both the postselection and direct sum-based discrimination schemes.
2. Executing the whole benchmarking scenario on a specified backend (either real hardware or a software simulator).
3. Interpreting the obtained outputs in terms of discrimination probabilities.
Note that the execution of circuits by PyQBench is optional. Instead, the user might opt for fine-grained control over the execution of the circuits. For instance, suppose the user wants to simulate the discrimination experiment on a noisy simulator. In such a case, they can define the necessary components and assemble the circuits using PyQBench. The circuits can then be altered, e.g. to add noise to particular gates, and run by the user on any Qiskit backend. Finally, PyQBench can be used to interpret the measurements to obtain the discrimination probability.

The PyQBench library also contains a readily available implementation of all the components needed to run discrimination experiments for the parametrized Fourier family of measurements, defined previously in Section 4. However, if one only wishes to use this particular family of measurements in their benchmarks, then using PyQBench as a command line tool might be more straightforward. PyQBench's command line interface allows running the benchmarking process without writing Python code. The CLI is configured by YAML [35] files describing the benchmark to be performed and the backend on which it should be run. Notably, the YAML configuration files are reusable: the same benchmark can be used with different backends and vice versa.
The following section describes important architectural decisions taken when creating PyQBench, and how they affect the end-user experience.

Software Architecture

Overview of the software structure
As already described, PyQBench can be used both as a library and as a CLI. Both functionalities are implemented as part of the qbench Python package. The exposed CLI tool is also named qbench. For brevity, we do not discuss the exact structure of the package here, and instead refer the interested reader to the source code available on GitHub [36] or to the reference manual [37].
PyQBench can be installed from the official Python Package Index (PyPI) by running pip install pyqbench. In a properly configured Python environment, the installation process also makes the qbench command available to the user without any further configuration.

Integration with hardware providers and software simulators
PyQBench is built around the Qiskit [11] ecosystem. Hence, both the CLI tool and the qbench library can use any Qiskit-compatible backend. This includes IBM Q backends (available by default in Qiskit) and Amazon Braket devices and simulators (available through the qiskit-braket-provider package [38,39]).
When using PyQBench as a library, instances of Qiskit backends can be passed to the functions that expect them as parameters. However, in CLI mode, the user has to provide a YAML file describing the backend. An example of such a file can be found in Section 6, and a detailed description of the expected format can be found in PyQBench's documentation.

Command Line Interface
The Command Line Interface (CLI) of PyQBench has a nested structure. The general form of a CLI invocation is shown in listing 1. Currently, PyQBench's CLI supports only one type of benchmark (discrimination of the parametrized Fourier family of measurements), but we decided to structure the CLI hierarchically to allow for future extensions. Thus, the only accepted value of <benchmark-type> is disc-fourier. The qbench disc-fourier command has four subcommands:

• benchmark: run benchmarks. This creates either a result YAML file containing the measurements or an intermediate YAML file for asynchronous experiments.
• status: query the status of experiments submitted for a given benchmark. This command is only valid for asynchronous experiments.
• resolve: query the results of asynchronously submitted experiments and write the result YAML file. The output of this command is almost identical to the result obtained from synchronous experiments.
• tabulate: interpret the results of a benchmark and summarize them in a CSV file.
We present the usage of each of the above commands in Section 6.

Asynchronous vs. synchronous execution
PyQBench's CLI can be used in synchronous and asynchronous modes. The mode of execution is defined in the YAML file describing the backend (see Section 6 for an example of this configuration). We decided to couple the mode of execution to the backend description because some backends cannot work in asynchronous mode.
When running qbench disc-fourier benchmark in asynchronous mode, PyQBench submits all the circuits needed to perform the benchmark and then writes an intermediate YAML file containing metadata of the submitted experiments. In particular, this metadata correlates the submitted job identifiers with particular circuits. The intermediate file can be used to query the status of the submitted jobs or to resolve them, i.e. to wait for their completion and get the measurement outcomes.
In synchronous mode, PyQBench first submits all jobs required to run the benchmark and then immediately waits for their completion. The advantage of this approach is that no separate invocation of the qbench command is needed to download the measurement outcomes. The downside, however, is that if the script is interrupted while the command is running, the intermediate results will be lost. Therefore, we recommend using asynchronous mode whenever possible.

Illustrative examples
In this section, we present two examples demonstrating the usage of PyQBench. In the first example, we show how to implement a discrimination scheme for a user-defined measurement and possible ways of using this scheme with the qbench library. The second example demonstrates the usage of the CLI. We show how to prepare the input files for the benchmark and how to run it using the qbench tool.

Using user-defined measurement with qbench package
In this example, we demonstrate how the qbench package can be used with a user-defined measurement. For this purpose, we use U = H (the Hadamard gate). The detailed calculations that lead to the particular form of the discriminator and final measurements can be found in Appendix B. The discriminator in this example is the Bell state

    |ψ_0⟩ = (|00⟩ + |11⟩)/√2.

To use this benchmarking scheme in PyQBench, we first need to construct circuits that can be executed by actual hardware. To this end, we need to represent each of the unitaries as a sequence of standard gates, keeping in mind that quantum circuits start execution from the |00⟩ state. The circuit taking |00⟩ to the Bell state |ψ_0⟩ comprises a Hadamard gate on the first qubit followed by a CNOT gate on both qubits (see Fig. 6). For V_0 and V_1, observe that V_0 = RY(3π/4), where RY is the rotation gate around the Y axis,

    RY(θ) = [[cos(θ/2), −sin(θ/2)], [sin(θ/2), cos(θ/2)]].

To obtain V_1 we need only swap the columns of V_0, i.e.

    V_1 = [[−sin(3π/8), cos(3π/8)], [cos(3π/8), sin(3π/8)]].
Finally, the optimal probability of correct discrimination is equal to (2 + √2)/4 ≈ 0.854. We will now demonstrate how to implement this theoretical scheme in PyQBench.
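For U = H, the optimal probability (2 + √2)/4 can be cross-checked numerically with plain NumPy (our own verification sketch, independent of the qbench API): we compute the sub-normalized post-measurement states of the ancilla for both measurements and apply the Holevo-Helstrom bound branch by branch.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

# |psi_0> = CNOT (H ⊗ I) |00> is the Bell state prepared by the circuit in Fig. 6
psi0 = CNOT @ np.kron(H, np.eye(2)) @ np.array([1.0, 0.0, 0.0, 0.0])
rho = np.outer(psi0, psi0.conj())

def conditional_states(U):
    """Sub-normalized ancilla states after measuring qubit 0 in the basis U."""
    states = []
    for i in range(2):
        u = U[:, i]
        effect = np.kron(np.outer(u, u.conj()), np.eye(2))
        # partial trace over the measured (first) qubit
        states.append((effect @ rho).reshape(2, 2, 2, 2).trace(axis1=0, axis2=2))
    return states

def trace_norm(a):
    return np.abs(np.linalg.eigvalsh(a)).sum()

# Helstrom bound applied per branch: p = 1/2 + 1/4 * sum_i ||sigma_i - tau_i||_1
sigma = conditional_states(H)         # measurement P_U with U = H
tau = conditional_states(np.eye(2))   # measurement P_1l
p_opt = 0.5 + 0.25 * sum(trace_norm(s - t) for s, t in zip(sigma, tau))
```

The computed p_opt matches the target value 0.8535533905932737 quoted later in the noisy-simulation run.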
For this example we will use the Qiskit Aer simulator [40]. First, we import the necessary functions and classes from PyQBench and Qiskit. We also import numpy for the definition of the np.pi constant and the exponential function.
The exact purpose of the imported functions will be described at the point of their usage.
Listing 2: Imports needed for running the benchmarking example

import numpy as np
from qiskit import QuantumCircuit, Aer
from qbench.schemes.postselection import benchmark_using_postselection
from qbench.schemes.direct_sum import benchmark_using_direct_sum

To implement the discrimination scheme in PyQBench, we need to define all the necessary components as Qiskit instructions. We can do so by constructing a circuit object acting on qubits 0 and 1 and then converting it using the to_instruction() method. We now construct a backend object, which in this case is an instance of the Aer simulator.
Listing 4: Defining a backend

simulator = Aer.get_backend("aer_simulator")

In the simplest scenario, when one does not want to tweak execution details and simply wishes to run the experiment on a given backend, all that is required is to run the benchmark_using_postselection or benchmark_using_direct_sum function, depending on the user's preference. The postselection_result and direct_sum_result variables then contain the empirical probabilities of correct discrimination. We can compare them to the theoretical value and compute the absolute error.

In the example presented above, we used functions that automate the whole process, from the circuit assembly, through running the simulations, to interpreting the results. But what if we want more control over some parts of this process? One possibility would be to add additional parameters to the benchmark_using_* functions, but this approach is not scalable. Moreover, anticipating all possible use cases is impossible. Therefore, we decided on another approach. PyQBench provides functions performing:

1. Assembly of the circuits needed for the experiment, provided the components discussed above.

2. Interpretation of the obtained measurements.
The difference between the two approaches is illustrated by the diagrams in Fig. 7.
For the rest of this example we focus only on the postselection case, as the direct sum case is analogous. We continue by importing two more functions from PyQBench.
Listing 8: Assembling circuits

from qbench.schemes.postselection import (
    assemble_postselection_circuits,
    compute_probabilities_from_postselection_measurements,
)

circuits = assemble_postselection_circuits(
    target=0,
    ancilla=1,
    ...
)

Recall that for the postselection scheme we have two possible choices of the "unknown" measurement and two possible choices of the final measurement, which gives a total of four circuits needed to run the benchmark. The function assemble_postselection_circuits creates all four circuits and places them in a dictionary with the keys "id_v0", "id_v1", "u_v0", "u_v1".
We will now run our circuits using noisy and noiseless simulation. We start by creating a noise model using Qiskit. Once we have our noise model ready, we can execute the circuits with and without noise. To this end, we will use Qiskit's execute function. One caveat is that we have to keep track of which measurements correspond to which circuit. We do so by fixing an ordering on the keys of the circuits dictionary.
Listing 11: Computing probabilities

We can now examine the results. As an example, in one of our runs, we obtained prob_succ_noiseless = 0.8524401115559386 and prob_succ_noisy = 0.5017958400693446. As expected, for noisy simulations the result lies further away from the target value of 0.8535533905932737.
This concludes our example. In the next section, we will show how to use PyQBench's CLI.

Using qbench CLI
Using PyQBench as a library allows one to conduct a two-qubit benchmark with an arbitrary von Neumann measurement. However, as discussed in the previous example, it requires writing some amount of code. For the parametrized Fourier family of measurements, PyQBench offers a simplified way of conducting benchmarks using its Command Line Interface (CLI). The workflow with PyQBench's CLI can be summarized as the following list of steps:

1. Preparing configuration files describing the backend and the experiment scenario.
2. Submitting/running experiments. Depending on the experiment scenario, execution can be synchronous or asynchronous.
3. (optional) Checking the status of the submitted jobs if the execution is asynchronous.
4. Resolving asynchronous jobs into the actual measurement outcomes.
5. Converting the obtained measurement outcomes into tabulated form.

Preparing configuration files
The configuration of the PyQBench CLI is driven by YAML files. The first configuration file describes the experiment scenario to be executed. The second file describes the backend. Typically, this backend will correspond to the physical device to be benchmarked, but for testing purposes one might as well use any other Qiskit-compatible backend, including simulators. Let us first describe the experiment configuration file, which might look as follows.
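A sketch of such an experiment file, reconstructed from the field descriptions below (the exact key names, including the top-level type field, should be checked against PyQBench's documentation):

```yaml
type: discrimination-fourier
qubits:
  - target: 0
    ancilla: 1
  - target: 1
    ancilla: 2
angles:
  start: 0
  stop: 2 * pi
  num_steps: 3
gateset: ibmq
method: postselection
num_shots: 100
```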
• qubits: a list enumerating the pairs of qubits on which the experiment should be run. For the configuration in listing 12, the benchmark runs on two pairs of qubits: the first pair is 0 and 1, and the second one is 1 and 2. We decided to describe a pair using target and ancilla keys, rather than a plain list, to emphasize that the roles of the qubits in the experiment are not symmetric.
• angles: an object describing the range of angles for the Fourier parametrized family. The described range is always uniform; it starts at start, ends at stop, and contains num_steps points, including both start and stop. The start and stop values can be arithmetic expressions using the pi literal. For instance, the range defined in listing 12 contains three points: 0, π and 2π.
• gateset: a string describing the set of gates used in the decomposition of circuits in the experiment. PyQBench contains explicit implementations of circuits for several native gate sets. The possible options are [ibmq, lucy, rigetti], corresponding to decompositions compatible with IBM Q devices, the OQC Lucy device, and Rigetti devices, respectively. Alternatively, one might wish to turn off the decomposition by using the special value generic. However, for this to work, the backend used for the experiment must natively implement all the gates needed for the experiment, as described in Section 4.
• method: a string, either postselection or direct sum, determining which implementation of the conditional measurement is used.
• num_shots: an integer defining how many shots are performed in the experiment for a particular angle, qubit pair, and circuit. Note that if one wishes to compute the total number of shots in the experiment, it is necessary to take into account that the postselection method uses twice as many circuits as the direct sum method.
The second configuration file describes the backend. We decided to decouple the experiment and backend files because this facilitates their reuse. For instance, the same experiment file can be used to run benchmarks on multiple backends, and the same backend description file can be used with multiple experiments.
Different Qiskit backends typically require different data for their initialization. Hence, there are multiple possible formats of the backend configuration files understood by PyQBench. We refer the interested reader to PyQBench's documentation. Below we describe an example YAML file describing the IBM Q backend named Quito. IBM Q backends typically require an access token to IBM Quantum Experience. Since it would be unsafe to store it in plain text, the token has to be configured separately in the IBMQ_TOKEN environment variable.
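A hypothetical sketch of such a backend file (the key names here are illustrative only; the exact schema for each backend type is given in PyQBench's documentation):

```yaml
name: ibmq_quito
asynchronous: false
provider:
  hub: ibm-q
  group: open
  project: main
```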

Remarks on using the asynchronous flag
For backends supporting asynchronous execution, the asynchronous setting can be used to toggle it. For asynchronous execution to work, the following conditions have to be met:

• Jobs returned by the backend have a unique job id.
• Jobs are retrievable from the backend using the backend.retrieve_job method, even from another process (e.g. if the original process running the experiment has finished).
Since PyQBench cannot determine if the job retrieval works for a given backend, it is the user's responsibility to ensure that this is the case before setting asynchronous to true.

Running the experiment and collecting measurements data
After preparing the YAML files defining the experiment and the backend, the benchmark can be launched using the following command line invocation:

qbench disc-fourier benchmark experiment_file.yml backend_file.yml

By default, the output is printed to stdout. Optionally, the --output OUTPUT parameter may be provided to write the output to the OUTPUT file instead.
qbench disc-fourier benchmark experiment_file.yml backend_file.yml --output async_results.yml

The result of running the above command can be twofold:

• If the backend is asynchronous, the output will contain intermediate data including, amongst others, job ids correlated with the circuits they correspond to.
• If the backend is synchronous, the output will contain measurement outcomes (bitstrings) for each of the circuits run.
For a synchronous experiment, part of the output looks similar to the one below. The whole YAML file can be seen in Appendix E. The data includes target, ancilla, phi, and results_per_circuit. The first three pieces of information have already been described. The last entry, results_per_circuit, provides the following additional information:

• name: indicates which measurement was used during the experiment, either the string "u" for P_U or the string "id" for P_𝟙. In this example we consider P_𝟙.
• histogram: a dictionary with the measurement outcomes. The keys represent possible bitstrings, whereas the values are the numbers of their occurrences.
• mitigation_info: for some backends (notably those corresponding to IBM Q devices), backend.properties().qubits contains information that can be used for error mitigation using the MThree method [41,42]. If this information is available, it is stored in the mitigation_info field; otherwise, this field is absent.
• mitigated_histogram: the histogram of measurement outcomes after error mitigation.

(Optional) Getting status of asynchronous jobs
PyQBench also provides a helper command that fetches the statuses of asynchronous jobs. The command is:

qbench disc-fourier status async_results.yml

and it displays a dictionary with a histogram of job statuses.

Resolving asynchronous jobs
For asynchronous experiments, the stored intermediate data has to be resolved into actual measurement outcomes. The following command waits until all jobs are completed and then writes a result file.
qbench disc-fourier resolve async_results.yml resolved.yml The resolved results, stored in resolved.yml, look just as if the experiment had been run synchronously. The final results therefore look the same regardless of the mode in which the benchmark was run, and in both cases the final output file is a suitable input for the command computing the discrimination probabilities.

Computing probabilities
As the last step in the processing workflow, the results file has to be passed to the tabulate command: qbench disc-fourier tabulate results.yml results.csv A sample CSV file is provided below:
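Since the tabulated output is plain CSV, it can be post-processed with standard tooling. The sketch below parses such a file and computes the mean absolute deviation between the empirical and ideal discrimination probabilities. The column names follow Table 2 (underscores in the header are an assumption), and the numeric values are made-up placeholders, not real benchmark results.

```python
import csv
import io
import statistics

# Made-up stand-in for the contents of results.csv produced by
# `qbench disc-fourier tabulate`; header names assumed from Table 2.
sample_csv = """target,ancilla,phi,ideal_prob,disc_prob,mit_disc_prob
0,1,0.00,0.50,0.49,0.50
0,1,1.57,0.85,0.81,0.84
0,1,3.14,1.00,0.93,0.97
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# How far, on average, the measured discrimination probability is from ideal
mad = statistics.mean(
    abs(float(r["disc_prob"]) - float(r["ideal_prob"])) for r in rows
)
```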

Impact
With the surge in availability of quantum computing architectures in recent years, it becomes increasingly difficult to keep track of their relative performance. To make matters worse, various providers give access to different figures of merit for their architectures. Our package allows the user to test various architectures, available through Qiskit and Amazon Braket, using problems with a simple operational interpretation. We provide one example built into the package. Furthermore, we provide a powerful tool for users to extend the range of available problems in a way that suits their needs. Thanks to this extensibility, users are able to test specific aspects of the architecture of interest. For example, if their problem is related to the amount of coherence (the sum of absolute values of the off-diagonal elements) of the states present during computation, they can quickly prepare a custom experiment, launch it on the desired architectures, and gather the results, on the basis of which they can decide which architecture to use.
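As a minimal sketch of the coherence quantity mentioned above (an illustration, not part of PyQBench's API), the sum of absolute values of the off-diagonal elements of a density matrix takes only a few lines of NumPy:

```python
import numpy as np

def l1_coherence(rho):
    """Sum of |rho_ij| over all off-diagonal entries of a density matrix."""
    rho = np.asarray(rho, dtype=complex)
    return float(np.abs(rho).sum() - np.abs(np.diag(rho)).sum())

# The maximally coherent qubit state |+><+| vs. a fully incoherent mixture
plus = np.full((2, 2), 0.5)   # |+><+|
mixed = np.diag([0.5, 0.5])   # I / 2
```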
Finally, we provide the source code of PyQBench on GitHub [36] under an open-source license, allowing users to utilize and extend our package in their specific applications.

Conclusions
In this study, we developed PyQBench, an innovative open-source Python framework for benchmarking gate-based quantum computers.
PyQBench can benchmark NISQ devices by verifying their capability of discriminating between two von Neumann measurements. PyQBench offers a simplified, ready-to-use command line interface (CLI) for running benchmarks using a predefined parameterized Fourier family of measurements. For more advanced scenarios, PyQBench offers a way of employing user-defined measurements instead of predefined ones.

Conflict of Interest
We wish to confirm that there are no known conflicts of interest associated with this publication and that there has been no significant financial support for this work that could have influenced its outcome.
We also need to briefly discuss the distance between quantum operations. From [31, Theorem 1], the distance between the measurements P_U and P_𝟙 can be expressed in terms of the diamond norm as ‖P_U − P_𝟙‖_◊ = min_{E∈DU(d)} ‖Φ_{UE} − Φ_𝟙‖_◊, where DU(d) denotes the set of diagonal unitary matrices of dimension d. To express the distance between unitary channels, we need to introduce the definition of the numerical range [43]. The set W(A) = {⟨ψ|A|ψ⟩ : |ψ⟩ ∈ C^d, ⟨ψ|ψ⟩ = 1} is called the numerical range of a given matrix A ∈ M_d. Detailed properties of the numerical range and its generalizations can be found on the website [44].
Due to the definition of W(A), the distance between the two unitary channels Φ_U and Φ_𝟙 can be written as ‖Φ_U − Φ_𝟙‖_◊ = 2√(1 − ν²), where ν = min_{x∈W(U†)} |x|.
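The quantity ν can be estimated numerically. The sketch below (an illustration, not PyQBench code) samples random unit vectors to approximate min_{x∈W(U)} |x| for a unitary of the Fourier form used later in Appendix C, whose eigenvalues are 1 and e^{iφ}; for such a U the numerical range is the chord joining those eigenvalues, so ν = cos(φ/2) and 2√(1 − ν²) = 2 sin(φ/2).

```python
import numpy as np

def min_modulus_numerical_range(U, samples=50000, seed=1234):
    """Monte Carlo estimate of min |<psi|U|psi>| over unit vectors |psi>,
    i.e. the distance of 0 from the numerical range W(U)."""
    rng = np.random.default_rng(seed)
    d = U.shape[0]
    psi = rng.normal(size=(samples, d)) + 1j * rng.normal(size=(samples, d))
    psi /= np.linalg.norm(psi, axis=1, keepdims=True)
    values = np.einsum("ni,ij,nj->n", psi.conj(), U, psi)
    return float(np.abs(values).min())

phi = np.pi / 2
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
U = H @ np.diag([1.0, np.exp(1j * phi)]) @ H.conj().T  # eigenvalues 1, e^{i phi}

nu = min_modulus_numerical_range(U)   # approx cos(phi / 2)
distance = 2 * np.sqrt(1 - nu**2)     # approx 2 sin(phi / 2)
```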

Appendix B. Discrimination task for Hadamard gate
For the discrimination task between the von Neumann measurements P_U and P_𝟙, where U = H (the Hadamard gate), the key is to calculate the diamond norm ‖P_H − P_𝟙‖_◊ and to determine the discriminator |ψ_0⟩. Using semidefinite programming [34], we obtain the value of this norm. From [45] we obtain the corresponding expression for the distance, where Φ_U is a unitary channel and E_0 is the optimal diagonal unitary. Next, in order to construct the discriminator |ψ_0⟩, we use Lemma 5 and the proof of Theorem 1 in [31]. We show that there exist states ρ_1 and ρ_2 of the required forms, from which we construct the quantum state ρ_0. According to Lemma 5 and the proof of Theorem 1 in [31], the discriminator |ψ_0⟩ is determined by ρ_0. (B.5) This directly implies the optimal probability of correct discrimination. Next, from the Holevo-Helstrom theorem [32], we determine the final measurement. From the Hahn-Jordan decomposition [32], we write the relevant operator as a difference P − Q with P, Q ≥ 0. Let us define the projectors Π_P and Π_Q onto im(P) and im(Q), respectively. Observe that P and Q are block-diagonal; hence Π_P and Π_Q are block-diagonal as well. For the discrimination task between P_H and P_𝟙, the explicit forms of V_0 and V_1 are given as follows (see also mathematics/optimal final measurement discrimination.nb in the source code repository), where H is the Hadamard matrix of dimension two and φ ∈ [0, 2π). In this section we present the theoretical probability of correct discrimination between these measurements. To do that, we use an auxiliary lemma: ‖Φ_{UE} − Φ_𝟙‖_◊ = 2√(1 − ν_E²), where ν_E = min_{x∈W(U†E)} |x|.
The celebrated Hausdorff-Toeplitz theorem [46,47] states that the numerical range W(A) of any matrix A ∈ M_d is a convex set; since U is unitary (hence normal), W(U) is therefore the line segment joining its eigenvalues. For the eigenvalue λ_0 = 1 the corresponding eigenvector is H|0⟩, whereas for λ_1 = e^{iφ} it is H|1⟩. For Hermiticity-preserving maps [32] the diamond norm may be expressed as ‖Φ‖_◊ = max_{|ψ⟩} ‖(Φ ⊗ 𝟙_d)(|ψ⟩⟨ψ|)‖_1, where the maximum is taken over unit vectors |ψ⟩ ∈ C^d ⊗ C^d.

Figure 1 :
Figure 1: Implementation of a von Neumann measurement using a measurement in the computational basis. The upper circuit shows a symbolic representation of a von Neumann measurement P_U. The bottom, equivalent circuit depicts its decomposition into a change of basis followed by a measurement in the Z basis.

Figure 2 :
Figure 2: Theoretical scheme of discrimination between the von Neumann measurements P_U and P_𝟙.

Figure 4 :
Figure 4: A schematic representation of the setup for distinguishing the measurements P_U and P_𝟙 using the direct sum V_0† ⊕ V_1†.

Figure 5 :
Figure 5: Rewritten representation of the setup for distinguishing the measurements P_U and P_𝟙 using the direct sum V_0† ⊕ V_1†.

Figure 7 :
Figure 7: Differences between simplified (top) and user-controlled (bottom) execution of benchmarks in PyQBench. Compared to simplified benchmarking, in user-controlled benchmarks the user has direct access to the circuits being run, and hence can alter them (e.g. by adding noise) and/or choose the parameters used for executing them on the backend.

Appendix C. Optimal probability for the parameterized Fourier family
Let us focus on the single-qubit von Neumann measurements P_𝟙 and P_U. Assume that the unitary matrix U is of the form U = H diag(1, e^{iφ}) H†. (C.1)
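The family (C.1) is easy to instantiate numerically. The sketch below (illustrative, not PyQBench's internal code) builds U for a given φ and checks two sanity properties: U is unitary, and at φ = π the family passes through H Z H† = X, the Pauli-X gate.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate

def fourier_unitary(phi):
    """U = H diag(1, e^{i phi}) H^dagger, i.e. Eq. (C.1)."""
    return H @ np.diag([1.0, np.exp(1j * phi)]) @ H.conj().T

U = fourier_unitary(np.pi / 3)
unitary_ok = np.allclose(U @ U.conj().T, np.eye(2))

# At phi = pi, diag(1, e^{i pi}) = Z, so U = H Z H^dagger = X
X = np.array([[0, 1], [1, 0]])
```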

Table 1 :
Code metadata

Table 2 :
The resulting CSV file contains a table with columns target, ancilla, phi, ideal prob, disc prob and, optionally, mit disc prob. Each row of the table describes the results for a tuple (target, ancilla, phi). The reference optimal value of the discrimination probability is given in the ideal prob column, whereas the obtained empirical discrimination probability can be found in the disc prob column. The mit disc prob column contains the empirical discrimination probability after applying the MThree error mitigation [41,42], if it was applied.