Edinburgh Research Explorer pyPcazip: A PCA-based toolkit for compression and analysis of molecular simulation data

The biomolecular simulation community is currently in need of novel and optimised software tools that can analyse and process, in reasonable timescales, the large generated amounts of molecular simulation data. In light of this, we have developed and present here pyPcazip: a suite of software tools for compression and analysis of molecular dynamics (MD) simulation data. The software is compatible with trajectory file formats generated by most contemporary MD engines such as AMBER, CHARMM, GROMACS and NAMD, and is MPI parallelised to permit the efficient processing of very large datasets. pyPcazip is a Unix based open-source software (BSD licenced) written in Python. c ⃝ 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/ by/4.0/)


Software metadata
Current software version 1. 4

Introduction
Molecular dynamics simulations of biological molecules and complexes can give insights into the relationship between macromolecular structure and dynamics at the atomistic level, and the complex emergent properties of the system at much longer length and timescales.Continuing developments in hardware and software mean that researchers are faced with ever increasing volumes of raw data from the simulations that need to be stored, analysed, and shared.The key raw data are trajectories: snapshots of the time-evolution of the system output at regular intervals, where each snapshot records the three dimensional coordinates of each atom in the simulation at that moment in time.The widening gap between highly scalable molecular simulation codes that enable simulation of multimillion atom systems over microseconds of time, and legacy sequential analysis tools, that were designed to deal with tens to hundreds of thousands of atoms over nanoseconds of time, is exposing a new bottleneck in the process of obtaining scientific insights from the computational experiments.
As one of the efforts to address this gap we have developed pyPcazip, a suite of software tools that can compress molecular simulation data to a small fraction (few percent) of their original size without significant loss of information.According to their interests, users can control the balance between the pyPcazip degree of compression and precision (fraction of the overall variance that is captured from the original molecular simulation data) of the molecular simulation data.Subsequently, the compressed data opens the door to a manifold of analysis methods that produce objective, quantitative and comparative metrics related to convergence and sampling of molecular simulation as well as metrics on the similarity between molecular simulation trajectories.

Problems and background
pyPcazip uses Principal Component Analysis (PCA), a dimensionality reduction technique at the core of its algorithms for compression and analysis of MD trajectory data.
Dimensionality reduction techniques such as PCA are increasingly being applied to the analysis of molecular simulation data [1][2][3][4], as well as other types of data that report on variations in biomolecular conformation such as NMR ensembles [5], collections of crystal structures [6], and Monte Carlo simulations [7].
PCA allows the dominant modes of molecular flexibility to be identified in a rigorous manner, and presented in the form of variations in the values of a small number of collective coordinates, rather than the 3N independent Cartesian coordinates of the individual atoms, so greatly easing interpretation and visualisation.PCA provides the gateway to a range of analysis methods that provide quantitative and comparative metrics related to convergence and sampling, and the similarity between one trajectory and another.
We have previously described how this method can be applied to compression of the data [8], and as a route to enhanced sampling [9] using our Fortran1 and C2 software codes.These software codes which we have developed in previous years, have very limited documentation and a very basic functionality.For these reasons we have undertaken the current code development, re-engineering and substantial functionality enhancement in order to provide the user community with a complete suite of software tools that they can use much more easily, flexibly and compatibly with their needs and that they can cite.
In fact, here and now, with pyPcazip we present a complete new suite of software tools written in Python that includes some redesigned and reengineered algorithms of our Fortran and C codes but a much better engineered software and a much more extended range of functionalities with respect to these formerly used codes of ours.Originalities of pyPcazip (not present in our Fortran and C codes) include but are not limited to: (i) A better handling of memory issues when dealing with very large datasets; (ii) On-the-fly selection of subsets of atoms of interest for the PCA analysis from the available datasets (rather than having the datasets filtered by the user prior to using the software); (iii) Flexible support for the simultaneous analysis of multi-trajectory datasets that vary in their molecular topology and number of atoms; (iv) MPI support for input processing and internal calculations; (v) Compliance with High Performance Computing (HPC) architectures such as ARCHER (UK national supercomputing resource); (vi) Unit testing and an automatic testing suite; (vii) Compliance with a large range of state-of-the-art formats (output formats of MD engines such as AMBER [10], CHARMM [11], GROMACS [12] and NAMD [13]) and analysis tools (such as MDAnalysis [14]) of MD code outputs.

Software architecture
In our software, the compression of an MD trajectory of N atoms using PCA is implemented through a workflow that involves: • Calculating the average structure and least-squares fitting each snapshot to this average structure; • Calculating the Cartesian coordinate covariance matrix of the fitted data; • Diagonalisation of this covariance matrix to obtain a set of N eigenvectors and eigenvalues; • Selecting the top M of these that capture a chosen percentage of the total variance (e.g.90%); • Calculating, for each snapshot in the trajectory, the corresponding projection into the M-dimensional subspace; • Writing the selected eigenvectors, eigenvalues, and projections to the output file.

Software functionalities
As a suite of software tools, pyPcazip is composed of four main components and related functionalities: • pyPcazip itself takes one or many input MD trajectory files and converts them into a highly compressed, HDF5based, 3 .pczformat.The program has options to select subsets of atoms, and/or subsets of snapshots from the trajectory files for analysis.The file-reading capabilities of pyPcazip draw extensively on the MDAnalysis Python toolkit [14].In addition to providing pyPczdump (see below) as a tool to post-process the .pczformat we also provide a customised reader of the output files produced by pyPcazip as part of a module of the software.This module 4 can be found within the source code and could be easily integrated and used in third-party software if needed upon citation of this work.• pyPcaunzip can decompress a .pczfile back into a conventional trajectory file in a range of formats (including dcd, .xtc,.ncdf,.trj).
• pyPczdump extracts information such as eigenvectors, eigenvalues, and projections from a .pczfile.It can also produce multi-model PDB format files to animate eigenvectors.• pyPczcomp permits the quantitative comparison of the data from two congruent .pczfiles.An example might be the dynamics of a protein in the presence and absence of a ligand, or a comparative analysis of the dynamics of a wild-type protein and a mutant.

Implementation details
pyPcazip is a Python software code that provides commandline tools for the compression and analysis of molecular dynamics trajectory data using PCA methods.The software is designed to be flexible, scalable, and compatible with other Python toolkits that are used in the molecular simulation and analysis field such as MDAnalysis [14].
Many stages in the pyPcazip workflow such as the input reading process and the covariance matrix calculation of the PCA analysis are amenable to parallelisation, which has been implemented using MPI.Fig. 1 shows scaling data of pyPcazip on up to 96 cores on ARCHER-UK national supercomputing service.Each of the ARCHER compute nodes contains two 2.7 GHz, 12-core processors for a total of 24 cores per node.Where the number of cores is less than 24, MPI processes are assigned to the one processor first, then the second, and in the other cases multiples of 24 processes have been used, keeping the nodes fully populated.The bend in this figure at 12 cores occurs as the first processor of the first node has become fully occupied.With less than 12 cores used, each core has access to proportionally more cache space and memory bandwidth, so the higher performance is obtained.The dataset for this scalability analysis includes 10,000 snapshots of a Mouse Major Urinary Protein (MUP) [15] trajectory (with 35k atoms including solvent) that is atom-filtered on-the-fly selecting the backbone related atoms only (157 alpha-carbon atoms).
An automatic testing suite has been developed for pyPcazip that is activated by a single command ("pyPcazip -tests").This validates the correct functioning of the installed software.In addition, individual python modules are instrumented with extensive unit tests.

Empirical results
The compression achieved by pyPcazip depends on the nature of the molecular system and, as a "lossy" method, on the chosen quality threshold (percentage of the total variance to be captured).In particular, the variance captured considering a small number of the most important principal components (modes) retains crucial insights for conformational investigations as these modes directly relate to the highest amplitude (and normally slower) motions of molecular systems whereas the less important principal components relate to the high frequency small amplitude atomistic fluctuations.For this reason we would not recommend the use of this suite of tools for the investigation of phenomena such as reaction mechanisms where the retention of sub-angstrom accuracy in e.g. bond lengths is required.
Table 1 illustrates performance for three example datasets, each of 1000 snapshots: (1) a short peptide (alanine-12, unpublished data); (2) a DNA 18mer [16]; and (3) the mouse major urinary protein (MUP) [15].The two metrics are the degree of compression achieved and the compression error expressed as the average RMSD between snapshots in the original file and those in the compressed file.Clearly for many purposes compression to 10%-20% of the original file size is quite possible.The code has been tested on a variety of platforms ranging from laptops to national HPC facilities, and compression/decompression gives results identical to within an RMSD of less than 0.02 Å, even when run in parallel.

Illustrative examples
The pyPczdump and pyPczcomp utilities allow a range of PCA-related metrics extracted or calculated from the compressed trajectory files.Output is in the form of ASCII data files that may be easily rendered using the user's preferred graph plotting packages and molecular visualisation tools.Fig. 2 (rendered using matplotlib) illustrates the sampling projected onto the subspace defined by the first two Principal Components (PC), PC1 and PC2, for simulations of MUP ( [15], plus additional unpublished data), resolving three conformational states, while the time series of PC1 for one trajectory as shown in Fig. 3 (rendered using matplotlib) reveals how it moves between these.Finally, for a different biomolecular system, Fig. 4 illustrates (using Chimera [17]) the animation of the first PC extracted from MD simulations of a DNA tetranucleotide (sequence TGTC) [16].
In the following, an example of the use of pyPczcomp is shown to compare quantitatively the MUP [15] collective mode data files of the respectively free state (apo CA.pcz) and bound state (holo CA.pcz), using the quick option of pyPczcomp so as to skip the time-consuming comparisons of collective modes.This is carried out by calculating the dot product matrix between the eigenvectors identified by the PCA investigation on the apo-protein and those identified by PCA in the ligandbound form of the protein.In particular, from this example we observe that the major similarity (63%) between the two different scenarios concerns the comparison of the first two principal components and there is a subspace overlap of 73.4%.Indeed, we would expect a degree of similarity between the two scenarios (as we would also expect differences due to ligand binding) and by means of pyPczcomp we have shown a straightforward way to quantify such similarity.In addition to the here presented illustrative examples, an introductory tutorial that includes a variety of analysis examples that can be performed through this software, is available at https://github.com/ElsevierSoftwareX/SOFTX-D-15-00082/blob/master/ramonbsc-pypcazip-3d7ab553c8dc/README.mdl.

Impact
The software presented in this paper, pyPcazip, is an easy to use, flexible and extensible package for PCA-based investigations of molecular simulation data generated by most common state-of-the-art simulation packages such as AMBER, CHARMM, GROMACS and NAMD.PCA methods are of growing importance in the Biosimulation field, as the volumes of data that can be produced using modern HPC facilities overwhelm more traditional qualitative and human operatorintensive analysis techniques.The method has been used for some time to provide key insights into the relationship between biomolecular structure, dynamics, and function.Examples of our own work include the analysis of sequence-dependent DNA dynamics [3,[18][19][20], protein-ligand interactions [15] and GPCR dynamics [21,22] but to date there has been no open source software product designed specifically to perform this type of analysis, or compatible with all common simulation packages.
pyPcazip gives insights into structure and behaviour of molecules in addition to enabling highly compressed data storage of simulation trajectory files with insignificant loss of information.Through its analysis components the software provides a variety of methods that produce objective, quantitative and comparative metrics related to convergence and sampling of molecular simulation as well as metrics on the similarity between molecular simulation trajectories.We envisage a large potential user base for pyPcazip given that (i) PCA methods have been in use by the Biomolecular simulation community for about 15 years and that, as discussed above, (ii) the need for such automated and quantitative routes to data analysis is growing rapidly, together with the fact that (iii) the approach is applicable to trajectory data from any type of simulation (nucleic acids, proteins, oligosaccharides etc.).Moreover, pyPcazip has been promoted to the Biomolecular simulation community at several international conferences [23][24][25][26][27] that have contributed to the expansion of its user base.It is easy to install and use on a wide variety of different platforms ranging from personal Unix-based workstations to national HPC resources.
Finally, the source code repository of pyPcazip, together with accompanying documentation, installation instructions for a variety of platforms, testing data and examples, is distributed under the BSD license version 2. In order to increase the visibility and usage of our software package that we present in this work but also for the benefit of the user community, we have made it freely and anonymously available for download at the official Python package register (https://pypi.python.org/pypi/pyPcazip/).More than 1158 downloads in the last month were recorded at the beginning of April 2016.

Conclusions and future directions
pyPcazip gives insights into structure and behaviour of molecules in addition to enabling highly compressed data storage of simulation trajectory files with insignificant loss of information.It provides an easy to use, flexible approach to undertaking PCA-based investigations of MD trajectory data generated by all of the most common current simulation packages.
The modularity of the software enables the integration and planning of future methodologies to complement the insights obtained from the use of the PCA method.Currently, a sparse-PCA approach that has been implemented using the main data structures of the pyPcazip package is being investigated for potential better insights on collective motions in molecular systems.
Although due to significant differences in terms of functionality (that we have mentioned in the "Problems and Background" section) and usability we could not directly compare the suite of software tools we present here with the old Fortran and/or C codes that we cite in the "Problems and Background" section, we would expect the pyPcazip performance in terms of speed to be close to the performance of Fortran and C codes given that most of the heavy algebra calculation sections in pyPcazip have been implemented via SciPy whose time-critical loops are usually implemented in C or Fortran.In addition, pyPcazip has also been designed for use on HPC architectures so that the workload can be spread across different nodes and cores of a cluster making the final results available to the user at a fraction of the time of a serial run.In addition, for the sake of even more improved performance, the replacement of the most computationally expensive routines of pyPcazip with corresponding Fortran code is being validated and an up to 10-fold improvement of performance has been observed during preliminary tests of performance analysis.We plan future software releases to incorporate these enhancements.
Finally, as an open source Python package, pyPcazip is amenable to end-user driven development and integration with the growing number of other Python-based packages in the molecular simulation domain.

Supporting material
Source code, supporting material and users support through the bitbucket ticketing system is publicly available at https://github.com/ElsevierSoftwareX/SOFTX-D-15-00082.The code is accompanied with extensive information on the application of this software, detailed installation instructions for desktop workstations but also HPC architectures and details of the code performance for differing system configurations, all available at the same public access repository.

Fig. 2 .Fig. 3 .
Fig. 2. Conformational sampling in MD simulations of the MUP protein, projected onto the subspace defined by the first two principal components.Two highly populated states can be distinguished, and a third, rare, state identified.

Fig. 4 .
Fig. 4. Visualisation of the first principal component extracted from MD simulations of a DNA tetranucleotide (sequence TGTC).Multiple conformations of the DNA, sampling this mode, have been overlaid to demonstrate the collective motion of atoms in the structure.The colouring of carbon atoms changes from blue to red in sequential frames, while the colouring of other atoms (O, N, P, H) remains constant.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 1
Compression performance.
a See text for details.b Compression as a fraction of the original file size.c Average error between original and compressed snapshots (RMSD in angstroms).