MPSLIB: A C++ class for sequential simulation of multiple-point statistical models

Geostatistical simulation methods allow simulation of spatial structures and patterns based on a choice of statistical model. In the last few decades multiple-point statistics (MPS) has been developed that allows inferring the statistical model from a training image. This allows for a simpler quantification of the statistical model, and simulation of more realistic (Earth) structures. A number of different algorithms for MPS based simulation have been proposed, each associated with a unique set of pros or cons. MPSLIB is a C++ class that provides a framework for implementing most of the currently proposed multiple-point simulation methods based on sequential simulation. A number of the most widely used methods are provided as an example. The single normal equation simulation (SNESIM) method is implemented using both a tree and a list structure. A new generalized ENESIM (GENESIM) algorithm is proposed that can act as (in one extreme) the ENESIM algorithm, and (in another extreme) similar to the direct sampling algorithm. MPSLIB aims to be easy to compile on most platforms (standard C++11 is the only requirement) and is released under the Open Source LGPLv3 License to encourage reuse and further development. c ⃝ 2016 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/ by/4.0/).


Motivation and significance
Geostatistics is a type of statistics with a focus on describing geo-spatial (Earth) structures in a probabilistic framework through a probability distribution.A 'geostatistical model' refer to a selection of a probability distribution to reflect a specific Earth model.Geostatistical models are typically used to characterize subsurface reservoir models used for groundwater, hydrocarbon or heat storage, for both exploration, exploitation, and management.A geostatistical model describes infinitely many single Earth models, consistent with the chosen probability function.The variability of these Earth models reflect the uncertainty related to the choice of statistical model.Such uncertainty quantification is the key advantage of using geostatistical models, as opposed to considering only one, in some sense, optimal model.Geostatistical modeling is a two-step process.First a statistical model must be selected or described, typically through a probability distribution f (m).Once a model has been established, actual realizations from the probability distribution is generated using 'simulation algorithms', which is the topic of this manuscript.
Traditionally, geostatistical simulation algorithms are based on Gaussian statistics, typically based on statistics only between pairs of model parameters, and hence referred to as 2-point statistics.These methods have been, and are still, widely used [1].However, 2-point based statistical models are limited with respect to the spatial structures they can represent.With the introduction of multiple-point statistical (MPS) models, more geologically realistic spatial structures can be quantified [2,3].This has led to the development of a number of simulation algorithms for multiple-point simulation, e.g.[2][3][4][5].For MPS based statistical models, there is usually no closed form analytical expression of f (m).Instead, a 'training image' or a 'sample model' is assumed available from which multiple-point statistical events can be inferred.The goal of MPS methods is to allow sampling from the unknown f (m) such that realizations are consistent with the statistics that can be inferred from the training image.For an overview of MPS based simulation algorithms see [6].
Many of the proposed MPS algorithms are implemented in various forms.However, some of these codes are either not maintained [7,8], not available on multiple platforms [8], not available under an open source license [9][10][11], or does only implement one specific type of MPS algorithm [12].
Here a C++ library, MPSLIB, designed specifically for multiple-point simulation is presented, that is released under the GPLv3 license.MPSLIB implements the core functionalities needed to implement any multiple-point simulation algorithm, based on sequential simulation [13].Note that this does not include methods based on pattern matching [14] and image-synthesis [15].Implementations of the 'extended normal equations simulation' (ENESIM) [2] and the 'single normal equation simulation' (SNESIM) [3,4] algorithms are provided as examples.Further, a new algorithm GENESIM, based on the ENESIM algorithm, is proposed that can act similar to both the ENESIM and the direct sampling algorithms [5].

Software description
MPSLIB is written in C++ [16] using c++11 [17].It consists of a C++ class that provides the framework for applying sequential simulation to sample from multiple-point statistical models.It also contains a number of algorithms implemented using MPSLIB.
At the core of the library is an implementation of sequential simulation, that can be briefly described as follows: Say a probability distribution describes the relation between M model parameters through f (m) = f (m 1 , m 2 , . . ., m M ) that are typically ordered by some nodes on a grid.Then a realization of f (m) can be generated using sequential simulation as follows: , ] from f (m).See more details in e.g.[18].
The major challenge implementing the sequential simulation algorithm is to generate a realization from the conditional distribution at each iteration.If the conditional data are given by m * c , then the conditional distribution can in general be given by If uncertain co-located information about m i is available as f (m i soft ) (often referred to as 'soft' data in geostatistical literature) then it is accounted for by drawing from For MPS based models there will, in general, be no analytical form of f (m i |m * c ) available.Instead, the key idea utilized in most simulation algorithms is to infer information about the conditional distribution, by scanning the training image for the conditional data event, m * c .There are (at least) two different approaches for sampling from the conditional distribution in Eq. (1).In one approach (ENESIM type algorithms), the training image is scanned for matching conditional data events at each iteration of the sequential simulation algorithm, from which the conditional distribution can be estimated [2], or a realization of the conditional distribution directly obtained [5].In another approach, the training image is scanned prior to running the simulation, and the conditional statistics for a number of predefined data templates are stored in memory.The conditional distribution, Eq. ( 1), can then be retrieved from memory during simulation [3,4].Usually, storing conditional statistics in memory requires the use of so-called multiple-grids [3] in order to allow simulation of structures with correlations over longer distances, while at the same time imposing manageable memory and CPU requirements.Using multiple-grids, simulation is performed starting in a coarse grid that is refined until simulation is performed on the finest, and requested, grid size.The use of multiple-grids poses a challenge to conditional simulation (when the value of some model parameters are known before simulation starts).One approach to handle this is to make use of data-relocation.When simulating on coarser grids, conditional data on finer grids are relocated to the closest node in the coarse grid.When conditional simulation has been performed on a coarser grid, re-located data on the coarse grid are removed, and simulation is then performed on a finer grid.For details on multiple-grids and the use of data-relocation see [19,3,20].

Software architecture
MPS is a namespace that contains different classes.The main class is the MPSalgorithm class, which implements the sequential simulation algorithm, methods for reading and writing 3D gridded data, methods for reading known data values (known as hard and soft data), and methods for establishing a data neighborhood, and controlling the simulation path.In addition, MPSAlgorithm allows multiple grids, including handling of conditional data.Specifically, grid re-location of hard data, as proposed in [3], is implemented.
Also, two abstract member functions are defined, but not implemented: MPSAlgorithm::readConfig and MPS-Algorithm::simulate That is, in order to use the MPS class, a new C++ class that inherits the MPSAlgorithm class needs to be defined, that implements (at least) the two member functions.MPSAlgorithm::readConfig should set all the parameters needed to run the simulation, for example by reading from a parameter file.MPSAlgorithm::simulate should implement a method that allows generating a realization from Eq. (1).As discussed above, the main difference between proposed multiple-point simulation algorithms is related to how this function is implemented, and can be divided into two families of algorithms, those that are similar to ENESIM and those that are similar to SNESIM.Therefore, two general classes, MPSEnesim and MPSSnesim, have been implemented to handle operations specific to ENESIM and SNESIM type simulation.
ENESIM.The ENESIM subclass extends the main MPSAlgorithm class with methods functions specifically designed to use in case the training image is scanned at each iteration.This includes methods for scanning the whole training image in order to estimate the conditional distribution, Eq. ( 1), from which a realization can be drawn.It also includes methods for scanning only parts of the training image, and optionally only scan the training image until a maximum number of conditional data events has been found.

SNESIM.
The SNESIM subclass extends the main MPS class with methods and functions specifically designed in case the statistics from the training images is scanned prior to running the simulation algorithm and stored in memory.
The SNESIM and ENESIM subclasses form the base of the proposed C++ framework.A key idea is that it should be possible to implement any MPS based sequential simulation algorithm using the C++ Class.Fig. 1 illustrates the architecture of MPSLIB.

MPS simulation algorithms
To demonstrate the versatility of MPSLIB a number of the most well-known multiple-point simulation algorithms, as well as a new variant, based on sequential simulation has been implemented.mps genesim.The mps genesim algorithm is a generalized implementation of the MPS algorithm named ENESIM proposed by [2].
In the simplest form of the ENESIM algorithm, the whole training image is scanned to establish the conditional distribution in Eq. ( 1) at each iteration.This is relatively easy to implement, and is void of problems related to the use of multiple grids.However, it is also computationally extremely demanding, to the point where it is not practical to use.The reason is due to the fact the full training image is scanned at each iteration, in order to obtain a conditional distribution from which a realization can be drawn.In order to alleviate this we suggest to scan the training image only for a maximum number N max of conditional data events, which means the conditional distribution will only be based on (at the most) N max counts.
MPSAlgorithm::simulate is implemented by scanning the training image for a maximum of N max replicates, from a random starting point.This provides an approximation to the full conditional distribution, from which a realization is generated.
At one extreme, N max = ∞, this algorithm will produce the same results as the classical ENESIM algorithm.
At another extreme, N max = 1, this algorithm will provide essentially the same results as the direct sampling algorithm, with the specific difference that using mps genesim the conditional distribution is in fact estimated (even if based on only one count), from which a realization is generated.Using direct sampling the conditional distribution is never obtained, and instead a realization from Eq. ( 1) is taken directly from the training image as the first matching conditional data event [5].In this case, when the conditional distribution is based on only one observed conditional event, conditioning to uncertain data, as in Eq. ( 2), cannot be done.
For N max > 1 and N max < ∞ mps genesim leads to an algorithm that can approximately account for uncertain data using (2), while at the same avoiding the high CPU requirements associated to scanning the full training image.mps snesim tree.The main contribution of SNESIM is that conditional patterns are scanned from the training image prior to simulation, and stored in a search tree [3].Therefore a member function is implemented that allows scanning a training image for a set of predefined data templates.The corresponding conditional statistics are stored in a binary search tree.MPSAlgorithm::simulate is implemented such that the conditional distribution in Eq. ( 1) is obtained by searching through the binary search tree.Then a realization is generated from Eq. (1).mps snesim list.mps snesim list is very similar to mps snesim snesim.The main difference is that mps snesim list stores conditional statistics using a list rather than a search tree.This leads to less memory requirements, but slower sequential search for the conditional distribution, compared to using a binary search tree [4].
The two SNESIM based algorithms will though not produce exactly the same result as they differ slightly in how a missing data event is handled.

Future development
The algorithms described above implements the core ideas of the SNESIM, ENESIM, and GENESIM algorithms.Many variations of the methods are available, and many types of post-processing have been presented, some of which may be included in MPSLIB in the future [8,21].A natural feature to implement next is to allow simulation of non-stationary fields, by using multiple training images and/or scaling and rotation of on training image, which has not yet been considered [8,22].The use of parallel computing and GPUs will also be a natural development [23,24].

Examples
As an example, based on the training image in Fig. 2(a), the implemented algorithms will be used to generate realizations in a 2D 80 × 50 pixel grid, using the hard and soft data shown in Fig. 2(b).
The implemented algorithms are based on the same core functionality, and therefore the parameter files needed to run the algorithms are very similar.The following text file is part of the parameter file for all simulations algorithms: The parameter file above defines a 2D regular grid (80 by 50 pixels) where each pixel has a size of 1 × 1 and the coordinates of the first pixel are (0, 0).The training image must be given in an ASCII formatted GSLIB file (a type of simplified Geo-EAS formatted file, see [1]) called 'ti.dat', where the first line contains the dimension of the training image.For example, the first line for a 100 × 100 × 10 training image must be '100 100 10'.One can optionally make use of hard data (data known with no uncertainty), or soft data (data that reflect uncertain observations), both in GSLIB format.The number of threads is currently not used, but kept to allow the use of multi threading in the future.Finally the debug mode defines the amount of information written to disk and screen during simulation.

SNESIM
Both mps snesim tree and mps snesim list make use of the same additional information in the parameter file: The first line sets the number of multiple grids used.To disable the use of multiple-grids, "0" should be chosen.The next three lines indicate the search template size, which is here set as 7 × 7 × 1.This means that any conditional data event within a 7 × 7 grid size is scanned, and its associated conditional statistics stored, prior to running the simulation algorithm.The next line sets the maximum number of conditional data within the template to consider.If set to zero all conditional data within the template will be used.
The last line sets the minimum number of replicates needed in order to use an obtained conditional distribution.If the number of conditional data is below this number then one conditional data is discarded, until enough replicates are found.

GENESIM
The mps genesim algorithm needs the following additional parameters defined: Max number of conditional point, N_cond # 49 Maximum number of counts for conditional pdf, N_max # 1 Max number of iterations, N_max_ite # 10000 The first line defines the maximum number of conditional points needed during simulation.This is related to the template size for the SNESIM type algorithms.However, as no multiplegrids are used, conditioning can be effective over ranges spanning the whole simulation grid.The next two lines define properties specifically related to the GENESIM algorithm.Line 2 defines the maximum number of points, N max , used to set up the conditional distribution Eq. ( 1).However, even if not  enough conditional points have been obtained after N max ite = 10000 iterations of scanning for a match (as defined in the last line), scanning is stopped and the conditional distribution used as is.
In case N max = ∞ and N max ite = ∞, then mps genesim will behave just as the ENESIM algorithm.N max = 1 and N max ite = ∞ will lead to an algorithm similar direct sampling, in the sense that if needed, the whole training image will   be scanned for a replicate.The original implementation of direct sampling does not allow for conditioning to 'soft' uncertain data in a manner similar to the ENESIM and SNESIM algorithm.However, the GENESIM algorithm, using a finite N max ite has the potential to decrease the simulation time significantly compared to a traditional ENESIM implementation, while at the same time allowing for conditioning to uncertain data through Eq. ( 2).Fig. 3, shows three independent realizations conditional to only the hard data, generated using mps snesim tree, mps snesim list, and mps snesim genesim with N max = ∞, N max = 10, and N max = 1.Fig. 4, shows the same results as Fig. 3 but in case conditioning to both hard and soft data.Note how soft data is not taken into account, as discussed previously, when using GENESIM in style of direct sampling with N max = 1, Fig. 4(e).On the other hand, using N max = 10, the GENESIM algorithm can take soft data into account, Fig. 4(d).

Impact
MPSLIB has been developed to be useful to both beginners and experts using multiple-point geostatistics.It should be useful for both scientific research and commercial application.
A goal has been to provide a foundation that allows developing, testing, and comparing the many different approaches for utilizing multiple-point simulation, which are all largely built on the same basic idea.
Perhaps the most widely known and used software for geostatistical estimation and simulation is GSLIB [1].GSLIB, written in Fortran, provides both a set of building blocks, and numerous examples of how to use these building blocks to implement useful algorithms, for 2-point statistical modeling.The history of GSLIB has shown that availability of free and open software can be a main driver in the development of a research field, benefiting both basic research and commercial application.The design of MPSLIB is chosen with GSLIB in mind.
MPSLIB should be flexible, easy to extend, easy to compile, and applicable for both scientific and commercial purposes.MPSLIB should alleviate the possibility to test out new ideas, without the need to re-implement every aspect of sequential simulation.Currently no such computer code is available, that uses both a permissive open source license, and is maintained.
The development of the generalized ENESIM algorithm is an example on how the existing code can be extended to produce a new type of multiple-point statistical algorithm with some novel features.We hope this example, and the base code, will encourage other developers to extend the code in a similar manner.
The software was developed in a collaboration between researchers from a public institution and a commercial company.The company will use the code as the back end for a user orientated geological model building experience.The researchers will use the code for scientific research related to multiple-point statistical modeling in the future.

Conclusions
MPSLIB is a C++ library for multiple-point based statistical models based on sequential simulation.It is intended to provide the building blocks that allow implementation of any multiplepoint simulation method based on sequential simulation.As examples, variants of ENESIM and SNESIM type multiplepoint algorithms have been implemented and a new algorithm proposed.MPSLIB relies only on standard C++11 and should be easy to compile, modify and extend, and is released under the LGPLv3 license.

Fig. 2 .Fig. 3 .
Fig. 2. (a) Training image, (b) Simulation grid with 6 hard (circles) and soft data at all grid nodes.Colorbar indicates the local soft probability.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) (a) mps snesim tree.(b) mps snesim list.(c) mps genesim (N max = ∞).