DCMS: A data analytics and management system for molecular simulation
© Kumar et al.; licensee Springer. 2014
Received: 30 September 2014
Accepted: 6 November 2014
Published: 26 November 2014
Molecular Simulation (MS) is a powerful tool for studying physical/chemical features of large systems and has seen applications in many scientific and engineering domains. During the simulation process, the experiments generate data for a very large number of atoms, and the goal is to observe their spatial and temporal relationships for scientific analysis. The sheer data volumes and their intensive interactions impose significant challenges for data access, management, and analysis. To date, existing MS software systems fall short on storage and handling of MS data, mainly because they lack a platform to support applications that involve intensive data access and analytical processing. In this paper, we present the database-centric molecular simulation (DCMS) system our team developed in the past few years. The main idea behind DCMS is to store MS data in a relational database management system (DBMS) to take advantage of the declarative query interface (i.e., SQL), data access methods, query processing, and optimization mechanisms of modern DBMSs. A unique challenge is to handle the analytical queries that are often compute-intensive. For that, we developed novel indexing and query processing strategies (including algorithms running on modern co-processors) as integrated components of the DBMS. As a result, researchers can upload and analyze their data using efficient functions implemented inside the DBMS. Index structures are generated to store analysis results that may be interesting to other users, so that the results are readily available without duplicating the analysis. We have developed a prototype of DCMS based on the PostgreSQL system, and experiments using real MS data and workloads show that DCMS significantly outperforms existing MS software systems. We also used it as a platform to test other data management issues such as security and compression.
Keywords: Scientific database; Molecular simulation; Molecular dynamics; Data compression; Spatiotemporal database
Recent advances in computing and networking technologies have been accompanied by the rise of data-intensive applications that severely challenge existing data management and computing systems. In a narrow sense, data-intensive applications commonly require significant storage space and intensive computing power. The demand for such resources alone, however, is not the only fundamental challenge of dealing with big data. Instead, the complications of big data are mainly driven by the complexity and the variety of the data generated from different domains. For example, online social media is now widely used to collect real-time public feedback related to specific topics or products. Data storage and management systems should support high-throughput data access with millions of tweets generated each second. Meanwhile, the tweets may be generated from different geographic regions, using different languages, and many of them may contain spam messages, typos, malicious links, etc. In addition to the low-level data cleansing, access, and management issues, user privacy and public policies should also be considered (and integrated) in the analytical process for meaningful outcomes.
The above big data complications are also common in many other application domains, such as scientific data analysis. For example, particle simulation is a major computational method in many scientific and engineering fields for studying physical/chemical features of natural systems. In such simulations, a system of interest is treated as a collection of a potentially large number of particles (e.g., atoms, stars) that interact under classical physics rules. In the molecular and structural biology world, such simulations are generally called Molecular Simulations (MS). By providing a model description for biochemical and biophysical processes at a nanoscopic scale, MS is a powerful tool towards fundamental understanding of biological systems.
In this paper, we present our recent research efforts in advancing big data analytics and management systems for scientific simulation domains, which usually generate large datasets with temporal and spatial correlations for analysis. Our research emphasizes the design of a data management system that supports intensive data access, query processing, and optimization mechanisms for MS data. The main objective of our study is to produce high-performance techniques for the MS community to accelerate the discovery process in biological/medical research. In particular, we introduce the design and development of a Database-Centric Molecular Simulation (DCMS) framework that allows scientists to efficiently retrieve, query, analyze, and share MS data.
A unique feature of DCMS is that the system framework is built on top of a relational database management system (RDBMS). This decision is justified by careful analysis of the data processing requirements of the target application: since MS data is spatiotemporal in nature, existing DBMSs provide significant advantages in modeling and application development. In addition, we can leverage the results of decades of research in spatiotemporal databases that are often (at least partially) implemented in RDBMSs. On the other hand, the unique features of the MS data analysis/querying workload call for significant improvements and new functionalities in existing RDBMSs. A salient problem we face is the high computational complexity of processing analytical queries that are not seen in typical databases, demonstrating another dimension of difficulty shared by many of today's big data applications. For MS, there are also data compression and data security issues that require innovative solutions. Therefore, our work in DCMS focuses on meeting those challenges by augmenting the DBMS kernel with novel data structures and query processing/optimization algorithms. As a result, our system achieves significant improvement in data processing efficiency as compared to legacy solutions.
In current MS software, simulation data is typically stored in data files, which are further organized into various levels of directories. Data access is enabled by encoding the descriptions of the file content into the names of files and directories, or by storing more detailed descriptions of the file content in separate metadata files. Under the traditional file-based scheme, data/information sharing within the MS community involves shipping the raw data packed in files along with the required format information and analysis tools. Due to the sheer volume of MS data, such sharing is extremely difficult, if possible at all. Two MS data analysis projects, BioSimGrid and SimDB, store data and perform analysis on the same computer system and allow users to remotely send in queries and get back results. This approach is based on the premises that: (1) analysis of MS data involves projection and/or reduction of data to a smaller volume; (2) users need to exchange the reduced representation of data, rather than the whole raw data. In a similar project, databases are used to store digital movies generated from visualization of MS datasets.
In BioSimGrid and SimDB, relational databases are used to store and manage the metadata. However, both systems store raw MS data as flat files instead of database records. Thus, the database only helps in locating the files of interest by querying the metadata. Further data retrieval and analysis are performed by opening and scanning the files located. Such an approach suffers from the following drawbacks: (1) Difficulties in the development and maintenance of application programs. Specific programs have to be coded for each specific type of query using a general-purpose programming language such as C. This creates high demand for experienced programmers and thus limits the types of queries the system can support. (2) Lack of a systematic scheme for efficient data retrieval and analysis. An operating system views data as a continuous sequence of bytes and only provides simple data access interfaces such as seek (i.e., jumping to a specific position in the file). Without data structures that semantically organize data records, data retrieval is often accomplished by sequentially scanning all relevant files. There is also a lack of efficient algorithms for processing queries that are often analytical in nature: most existing algorithms are brute-force solutions. (3) Other issues such as data security and data compression are not sufficiently addressed.
The MDDB system is close in spirit to DCMS. However, it focuses on data exploration and analysis within the simulation process rather than post-simulation data management. Another project named Dynameomics coincided with the development of DCMS and delivered a database containing data from 11,000 protein simulations. Note that the main objective of the DCMS project is to provide a systematic solution to the problems mentioned above. To that end, most of our work is done within the kernel space of an open-source DBMS. In contrast, Dynameomics uses a commercial DBMS in its current form and attempts to optimize data management tasks at the application layer. We believe the DCMS approach has significant advantages in solving the last two issues mentioned above.
Here we summarize the data management challenges in typical MS applications.
MS data. A typical simulation outputs multiple trajectory files containing a number of snapshots (named frames) of the simulated system. Depending on the software and format, such data may be stored in binary form and undergo simple lossless compression. The main part of the data is very similar to that found in spatio-temporal databases. A typical trajectory file has some global data, which is used to identify the simulation, and a set of frames arranged in a sequential manner. Each frame may contain data entries that are independent of the atom index. The main part of a trajectory frame is a sequential list of atoms with their positions, velocities, perhaps forces, masses, and types. These entries may contain additional quantities such as identifiers that place an atom in a particular residue or molecule. In the file-based approach, the bond structure of residues is stored separately in topology files and the control parameters of a simulation are kept separately in control files. Hence, any sharing of data or analysis requires consistent exchange or availability of three types of files. Further complications in data exchange/use arise from the different naming and storage conventions used by individual researchers.
Table 1. Popular analytical queries in MS
Moment of inertia
Moment of inertia on z axis
Sum of masses
Center of mass
Radius of gyration
Mean square displacement
Density function: histogram of atom counts
Spatial distance histogram (SDH): histogram of all atom-to-atom distances
While analytical queries are the workhorse tools for scientific discovery, many require retrieval of data as the first step. Furthermore, visualization tools also interact with the database retrieving subsets of the data points. By studying the data access patterns of the analytical queries and such tools, we identify the following data access queries that are relevant in DCMS:
Point queries access a single point in 3D space, e.g., find the location and/or other physical measurements of an atom at a specific time frame. Such queries are extremely useful for many visualization tools. A typical scenario is: the visualization tool asks for a sample of n data points within a specific region that reflects the underlying distribution. This is done by issuing queries with randomly generated atom IDs.
Trajectory queries retrieve all data points by fixing the value in one dimension. Two queries in this category are very popular in MS analysis: (1) single-atom trajectory (TRJ) query that retrieves the readings of a specific atom along time, and (2) frame (FRM) query that asks for the readings of all atoms in a specific time frame. These two queries, especially the second, are often issued to retrieve data points for various analytical queries such as the diffusion coefficient, in which we compute the root mean square displacement of all atoms in a frame.
Range (RNG) queries are generalized trajectory queries with range predicates on one or more dimensions. For example, find all atoms in a specific region of the simulated space, or, find all atoms with velocity greater than 50 and smaller than 75. Range queries are the main building blocks of many analytical queries and visualization tasks.
Nearest neighbor (NN) queries ask for the point(s) in a multidimensional space that are closest to a given point. For example, retrieve the 20 closest atoms to a given iron atom. This may help us locate unique structural features, e.g., the part of a protein to which a metal ion is bound.
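The access patterns described above map naturally onto declarative SQL. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the database backend. The table name location and its columns (frame, atom_id, x, y, z) are hypothetical and are not the DCMS schema, and the nearest neighbor query is a plain sort by squared distance rather than an index-assisted search.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table of atom positions (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE location (frame INTEGER, atom_id INTEGER, "
        "x REAL, y REAL, z REAL, PRIMARY KEY (frame, atom_id))"
    )
    rows = []
    for frame in range(3):
        for atom_id in range(4):
            # Atoms sit on the x axis and drift by 0.1 per frame.
            rows.append((frame, atom_id, atom_id + 0.1 * frame, 0.0, 0.0))
    conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def point_query(conn, frame, atom_id):
    """Point query: one atom at one time frame."""
    return conn.execute(
        "SELECT x, y, z FROM location WHERE frame = ? AND atom_id = ?",
        (frame, atom_id)).fetchone()

def trajectory_query(conn, atom_id):
    """TRJ query: readings of one atom along time."""
    return conn.execute(
        "SELECT frame, x, y, z FROM location WHERE atom_id = ? ORDER BY frame",
        (atom_id,)).fetchall()

def frame_query(conn, frame):
    """FRM query: readings of all atoms in one frame."""
    return conn.execute(
        "SELECT atom_id, x, y, z FROM location WHERE frame = ? ORDER BY atom_id",
        (frame,)).fetchall()

def range_query(conn, frame, x_min, x_max):
    """RNG query: atoms whose x coordinate falls in [x_min, x_max]."""
    rows = conn.execute(
        "SELECT atom_id FROM location WHERE frame = ? AND x BETWEEN ? AND ? "
        "ORDER BY atom_id", (frame, x_min, x_max)).fetchall()
    return [r[0] for r in rows]

def nearest_neighbors(conn, frame, px, py, pz, k):
    """NN query: the k atoms closest to (px, py, pz), by full sort."""
    rows = conn.execute(
        "SELECT atom_id FROM location WHERE frame = ? "
        "ORDER BY (x-?)*(x-?) + (y-?)*(y-?) + (z-?)*(z-?), atom_id LIMIT ?",
        (frame, px, px, py, py, pz, pz, k)).fetchall()
    return [r[0] for r in rows]
```

Without a spatial index, the range and nearest neighbor queries in this sketch scan every atom of the frame, which is the kind of cost that dedicated index structures are meant to avoid.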
Data loader. The data loader is responsible for transforming simulation data into the format required for storage in the database system. First, it can read and understand data generated by current simulation programs and stored in popular MS file formats (e.g., GROMACS, PDB). We developed, as a part of the data loader, a unified data transformation module to perform such translations. A user only needs to specify the file format of the data, and data loading will be performed automatically. Second, the raw data will also be sent to a Data Compression module to generate a volume-reduced copy to be stored as compressed files. The rationale of this design is to enable efficient transmission of MS data among different data centers where DCMS is deployed. We will address this issue in Section “Data compression”.
User interfaces Runtime data access in DCMS is provided by a graphical query interface and an SQL-based query programming interface. In DCMS, we envision two types of user programs: primary queries and high-level data analytics. The primary queries correspond to (raw) data retrieval queries that can be directly supported by the DBMS via SQL. Analytical queries are those containing application-specific logic and are directly used by scientists for scientific discovery. The latter builds upon query results of the primary queries. To ease the development of high-level analytics, an important design of DCMS is that the query interfaces are extensible: an analytical query written by a user (called user-programmed analytics) can be integrated into the current DCMS query interface (and become part of the DCMS built-in analytics that can be directly accessed by an SQL statement). By this, the code of analytical queries can be reused by other users to issue the same query or build new analytical queries based on the current ones.
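To illustrate the idea of an extensible query interface, the sketch below registers a user-written analytic as an SQL aggregate so that other users can call it from a plain SQL statement. It uses Python's built-in sqlite3 module and its create_aggregate call as a stand-in; the aggregate name wmean and the table layout in the usage note are hypothetical, and this is not how DCMS itself registers analytics.

```python
import sqlite3

class WeightedMean:
    """SQL aggregate: weighted mean of a value, e.g. one mass-weighted coordinate."""

    def __init__(self):
        self.weighted_sum = 0.0
        self.total_weight = 0.0

    def step(self, value, weight):
        # Called once per row; each atom is touched exactly once.
        self.weighted_sum += value * weight
        self.total_weight += weight

    def finalize(self):
        # Called once at the end of the group; NULL for an empty group.
        if self.total_weight == 0.0:
            return None
        return self.weighted_sum / self.total_weight

def connect_with_analytics():
    """Open an in-memory database with the user-programmed analytic installed."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("wmean", 2, WeightedMean)  # hypothetical name
    return conn
```

Once registered, the analytic is available to any SQL statement on that connection, for example SELECT wmean(x, mass) FROM atom, and further analytics can be composed from it in the same way.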
In addition to the query programming interface, all built-in analytics and primary queries can also be accessed from a graphical query interface, which accepts user query specifications via web forms and translates such specifications into SQL statements. The main purpose of designing a graphical interface is to, again, ease the use of DCMS to an extent that users can perform data analysis without writing programs.
The first group of relations (i.e., Simulation and Parameter) describes time-invariant information of the simulation system. The System Properties table contains time-variant information of the entire system (instead of individual atoms). The Atom Static Info table holds the static features of an atom and forms a star-shaped schema with a series of other tables: Location, Velocity, Force, Connection, Torsion, Angle, and Molecules. The first three form the main body of the database: they represent atom states that change over time during the simulation. These three cannot be combined into one table because MS programs usually do not output data at every step of the simulation, and different intervals (in steps) can be set to output location, velocity, and force. For the same reason, each of the three tables is linked to another table that maps the step number to the frame number. The next three tables hold information that describes atom-to-atom relationships. For example, a row in Connection represents a chemical bond between two atoms. Such relationships are time-variant; therefore, we again need to map their step numbers to frame numbers. The Molecules table is similar except that it holds static relationships among the atoms. Specifically, each row in this table records the membership of one atom as a part of a molecule.
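The reason for keeping location and velocity in separate tables, each with its own step-to-frame mapping, can be seen in a small sketch. All table and column names below are hypothetical simplifications of the schema described above (only two of the time-variant tables are shown), and sqlite3 is used purely as a convenient stand-in.

```python
import sqlite3

# Hypothetical, simplified subset of the schema described in the text.
SCHEMA = """
CREATE TABLE atom_static (atom_id INTEGER PRIMARY KEY, mass REAL, type TEXT);
CREATE TABLE location_frame (step INTEGER PRIMARY KEY, frame INTEGER);
CREATE TABLE velocity_frame (step INTEGER PRIMARY KEY, frame INTEGER);
CREATE TABLE location (step INTEGER, atom_id INTEGER, x REAL, y REAL, z REAL,
                       PRIMARY KEY (step, atom_id));
CREATE TABLE velocity (step INTEGER, atom_id INTEGER, vx REAL, vy REAL, vz REAL,
                       PRIMARY KEY (step, atom_id));
"""

def build_demo(n_steps=40, loc_every=10, vel_every=20, n_atoms=2):
    """Populate the tables with different output intervals per quantity."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for atom_id in range(n_atoms):
        conn.execute("INSERT INTO atom_static VALUES (?, ?, ?)",
                     (atom_id, 12.0, "C"))
    for table, every in (("location", loc_every), ("velocity", vel_every)):
        for frame, step in enumerate(range(0, n_steps, every)):
            # Each quantity numbers its own frames independently.
            conn.execute(f"INSERT INTO {table}_frame VALUES (?, ?)", (step, frame))
            for atom_id in range(n_atoms):
                conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)",
                             (step, atom_id, float(step), 0.0, 0.0))
    return conn

def steps_with_both(conn):
    """Simulation steps at which both location and velocity were written."""
    rows = conn.execute(
        "SELECT l.step FROM location_frame l "
        "JOIN velocity_frame v ON l.step = v.step ORDER BY l.step").fetchall()
    return [r[0] for r in rows]
```

With locations written every 10 steps and velocities every 20, a single combined table would leave half of its velocity columns empty, whereas the separate tables stay dense and can still be aligned through the step number.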
Analytical queries in DCMS. As mentioned earlier, DCMS provides system support for analytical queries that are unique to MS. The most popular DCMS built-in analytics are listed in Table 1, in which we assume an MS system has n atoms, and denote the mass, coordinates, charge, and number of electrons of an atom i as m_i, r_i, q_i, and e_i, respectively. Roughly, such queries can be divided into two categories: (1) One-body functions. In computing this type of function, the readings of each atom in the system are processed a constant number of times, giving rise to O(n) total running time. All queries in Table 1 except the last two fall into this category. Most such queries are defined within a single frame, while the various autocorrelation functions are defined over two different frames; (2) Multi-body functions. The computation of these functions requires interactions of all atom pairs (2-body) or triplets (3-body). Popular examples include the Radial Distribution Function (RDF) and some quantities related to chemical shifts. These functions are often computed as histograms. For example, the RDF is generally derived from a histogram of all atom-to-atom distances (i.e., the Spatial Distance Histogram (SDH)). Straightforward computation of multi-body functions is very expensive and therefore poses great challenges to query processing.
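As a concrete illustration of one-body functions, the sketch below computes the sum of masses, the center of mass, and the radius of gyration for one frame. Since the defining equations of Table 1 are not reproduced here, the code assumes the common mass-weighted definitions; the function names and the (mass, position) input layout are hypothetical.

```python
import math

def total_mass(atoms):
    """Sum of masses; atoms is a list of (mass, (x, y, z)) tuples."""
    return sum(m for m, _ in atoms)

def center_of_mass(atoms):
    """Mass-weighted mean position, computed in one pass per coordinate."""
    big_m = total_mass(atoms)
    return tuple(sum(m * r[k] for m, r in atoms) / big_m for k in range(3))

def radius_of_gyration(atoms):
    """Assumed definition: sqrt(sum_i m_i |r_i - r_com|^2 / sum_i m_i)."""
    com = center_of_mass(atoms)
    big_m = total_mass(atoms)
    weighted = sum(
        m * sum((r[k] - com[k]) ** 2 for k in range(3)) for m, r in atoms
    )
    return math.sqrt(weighted / big_m)
```

Each function visits every atom a constant number of times, so the total running time is linear in the number of atoms, in line with the O(n) bound stated above.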
Query processing and optimization in DCMS
In DCMS, we could rely on a legacy DBMS (e.g., PostgreSQL) to provide query processing mechanisms. However, we believe that explorations on the algorithmic issues in major DBMS modules (e.g., query processing and access methods) will further improve the efficiency of data analysis in DCMS. This is because existing DBMSs, with general-purpose data management as the main purpose, have little consideration of the query types and user access patterns that are unique in MS data analysis.
Data indexing is the most important database technique for improving the efficiency of data retrieval. In DCMS, algorithms for processing primary queries will be exclusively index-based to reduce data access time. To support a rich set of queries, multiple indexes are necessary. However, it is infeasible to maintain an excessive number of indexes due to the extremely high storage cost of MS data. Note that MS databases are most likely read-only; therefore, the maintenance (update) cost of indexes can be ignored. We have designed and tested several novel indexes to handle the various queries in DCMS but finally adopted the following indexes in our implemented system: (1) the B+-tree and a bitmap-based index, which are the default indexes provided by PostgreSQL and provide a certain level of support for some of the MS queries; and (2) a new index named the Time-Parameterized Spatial (TPS) tree to provide a further performance boost. We accordingly modified the query optimizer of the DBMS to generate query execution plans that take advantage of the aforementioned indexes.
For spatial (range, nearest neighbor, or spatial join) queries, the Quadtree, R*-tree, and SS-tree are the most popular indexes. The main challenge in DCMS comes from continuous queries in which time serves as an extra dimension; the above data structures alone are therefore not suitable for MS queries. The design of the TPS tree can be briefly described as building a spatial index for each time frame in the dataset and then combining neighboring trees to save space, taking advantage of the similarities among these trees. We decided to use the Quadtree as the underlying spatial index to build the TPS tree. This is because: (1) the performance of the Quadtree in handling spatial queries is equivalent (sometimes even superior) to that of the R*-tree; (2) the chances of getting an unbalanced tree (which is the main drawback of the Quadtree) are small due to the “spread-out” feature of MS data; and most importantly, (3) Quadtrees can be augmented to build other data structures needed for our high-level analytical query processing (Section “Analytical queries”).
A major challenge in designing the TPS tree is to minimize the storage cost. Our main idea is to share nodes among neighboring Quadtrees corresponding to consecutive time frames, similar to the historical R-tree (HR-tree). The node sharing in an HR-tree depends on the assumption that some objects do not move for a period of time, which is not applicable to MS data where all atoms move at all times. However, we can exploit the existence of slow-moving atoms or those that move together to achieve node sharing. To build a TPS tree, we start by creating a spatial tree for each time frame using bulk loading, and then merge nodes in neighboring trees iteratively.
A straightforward view of analytical query processing consists of two stages: retrieve the raw data and compute the results in a separate program. While the indexing techniques shown in Section “Primary queries” can speed up the first stage, further optimizations are made in DCMS by pushing the computation into index construction. To be more specific, we cache critical statistics of all atoms contained in a node of the TPS tree. Query results can be derived directly from such statistics, saving much of the time otherwise spent visiting the raw data points. In the following text, we sketch our caching-based query processing strategy by visiting two groups of analytical queries.
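The caching idea can be sketched with a small in-memory quadtree whose nodes store aggregate statistics of the atoms beneath them. This is a simplified 2D illustration for a single frame, not the TPS tree itself: the class QuadNode, the function region_stats, the choice of cached fields (count, mass, and mass-weighted coordinate sums), and the leaf capacity are all assumptions made for the example.

```python
class QuadNode:
    """Quadtree node covering the square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size, atoms, capacity=2):
        self.x0, self.y0, self.size = x0, y0, size
        # Cached statistics for every atom (mass, x, y) under this node.
        self.count = len(atoms)
        self.mass = sum(m for m, _, _ in atoms)
        self.mx = sum(m * x for m, x, _ in atoms)
        self.my = sum(m * y for m, _, y in atoms)
        self.atoms = atoms        # raw points are kept only at leaves
        self.children = []
        if len(atoms) > capacity and size > 1e-9:
            half = size / 2.0
            groups = {}
            for atom in atoms:
                _, x, y = atom
                key = (x >= x0 + half, y >= y0 + half)
                groups.setdefault(key, []).append(atom)
            for (right, top), group in groups.items():
                self.children.append(
                    QuadNode(x0 + half * right, y0 + half * top, half,
                             group, capacity))
            self.atoms = []

def region_stats(node, qx0, qy0, qx1, qy1):
    """Return (mass, sum m*x, sum m*y) for atoms in [qx0, qx1) x [qy0, qy1)."""
    nx1, ny1 = node.x0 + node.size, node.y0 + node.size
    if (node.count == 0 or nx1 <= qx0 or node.x0 >= qx1
            or ny1 <= qy0 or node.y0 >= qy1):
        return (0.0, 0.0, 0.0)                 # node is outside the query
    if qx0 <= node.x0 and nx1 <= qx1 and qy0 <= node.y0 and ny1 <= qy1:
        return (node.mass, node.mx, node.my)   # fully covered: use the cache
    if not node.children:                      # partially covered leaf: scan
        inside = [(m, x, y) for m, x, y in node.atoms
                  if qx0 <= x < qx1 and qy0 <= y < qy1]
        return (sum(m for m, _, _ in inside),
                sum(m * x for m, x, _ in inside),
                sum(m * y for m, _, y in inside))
    parts = [region_stats(c, qx0, qy0, qx1, qy1) for c in node.children]
    return tuple(sum(p[i] for p in parts) for i in range(3))
```

The center of mass of a query region is then (sum m*x / mass, sum m*y / mass). Raw atoms are read only in leaves that straddle the query boundary; every fully covered subtree is answered from its cached sums.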
The multi-body functions are all holistic functions and therefore cannot be computed in the same way as one-body functions. Current MS software adopts simple yet naïve algorithms to compute the multi-body functions. For example, the SDH is computed by retrieving the locations of all atoms, computing all pairwise distances, and grouping them into the histogram, which is an O(n^2) approach. For a large simulation system where n is big, this algorithm could be intolerably slow. In DCMS, we invested much effort into algorithmic design related to such queries.
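The brute-force approach described above fits in a few lines. In this sketch the function name and parameters are hypothetical, and distances beyond the last bucket are clamped into it, which is a choice made for the example rather than something stated in the text.

```python
import math
from itertools import combinations

def brute_force_sdh(points, bucket_width, num_buckets):
    """Spatial distance histogram over all n(n-1)/2 atom pairs."""
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        d = math.dist(p, q)                       # Euclidean distance
        # Clamp overflowing distances into the last bucket (assumption).
        b = min(int(d // bucket_width), num_buckets - 1)
        hist[b] += 1
    return hist
```

The loop body runs once per atom pair, so doubling the number of atoms roughly quadruples the running time.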
Our strategy for fast multi-body function computation follows the same path of the caching-based query processing. Specifically, we cache the atom counts in each spatial tree node and compute the multi-body functions based on these counts, avoiding the computation of pairwise interactions. Again, we do this on the TPS tree and call the atom counts of all nodes on one tree level a density map (which is basically the result of a density function query shown in Table 1). The main issue is how to translate the atom counts into a histogram. In the following text, we will sketch a series of algorithms we developed for SDH.
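One way to turn cached atom counts into a histogram can be sketched on a single-level grid. For each pair of cells, the smallest and largest possible point-to-point distances are computed from the cell geometry alone; if both fall into the same bucket, all pairs between the two cells are credited to that bucket using only the counts. Pairs of cells that cannot be resolved this way fall back to point-wise distances here, whereas a hierarchical structure would instead descend to finer cells. All names are hypothetical, and this is a simplified illustration rather than the algorithm implemented in DCMS.

```python
import math
from collections import defaultdict
from itertools import combinations

def bucket_of(distance, width, num_buckets):
    """Histogram bucket of a distance; overflow is clamped to the last bucket."""
    return min(int(distance // width), num_buckets - 1)

def density_map(points, cell):
    """Group points by grid cell; the group sizes are the cached atom counts."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(c // cell) for c in p)].append(p)
    return cells

def cell_distance_bounds(a, b, cell):
    """Smallest and largest possible distance between points of cells a and b."""
    gaps = [abs(i - j) for i, j in zip(a, b)]
    lo = math.sqrt(sum((max(g - 1, 0) * cell) ** 2 for g in gaps))
    hi = math.sqrt(sum(((g + 1) * cell) ** 2 for g in gaps))
    return lo, hi

def density_map_sdh(points, cell, width, num_buckets):
    hist = [0] * num_buckets
    cells = density_map(points, cell)
    for a, group in cells.items():                 # pairs inside one cell
        lo, hi = cell_distance_bounds(a, a, cell)
        if bucket_of(lo, width, num_buckets) == bucket_of(hi, width, num_buckets):
            n = len(group)
            hist[bucket_of(lo, width, num_buckets)] += n * (n - 1) // 2
        else:
            for p, q in combinations(group, 2):
                hist[bucket_of(math.dist(p, q), width, num_buckets)] += 1
    for a, b in combinations(list(cells), 2):      # pairs across two cells
        lo, hi = cell_distance_bounds(a, b, cell)
        if bucket_of(lo, width, num_buckets) == bucket_of(hi, width, num_buckets):
            # Resolved from the counts alone, without touching any point.
            hist[bucket_of(lo, width, num_buckets)] += len(cells[a]) * len(cells[b])
        else:
            for p in cells[a]:
                for q in cells[b]:
                    hist[bucket_of(math.dist(p, q), width, num_buckets)] += 1
    return hist
```

The result is the same histogram a pairwise computation would produce; the saving comes from the cell pairs that are resolved from counts, and it grows as cells become small relative to the bucket width.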
In many cases approximate query results are acceptable to users; we therefore developed approximate algorithms with much better performance to solve the SDH problem. Basically, we modified the node resolving operation to generate a partial SDH faster. Whenever a desired error bound is reached while traversing the tree, the distribution of distances into histogram buckets is approximated using certain heuristics. Such heuristics are constant-time operations with a guaranteed error bound. The total running time of the algorithm depends only on a user-defined error tolerance. A detailed analysis of such algorithms is presented elsewhere. The performance of SDH algorithms can be further improved if certain inherent properties of the simulation system are utilized. It is often observed in MS systems that the atoms are evenly spread out due to the existence of chemical bonds and inter-particle forces. Therefore, it is possible to take advantage of the spatial uniformity of the atoms to speed up the computation of SDH. With the spatial distribution of atoms in nodes, we can derive the distribution of distances between any pair of nodes. The entire distribution (i.e., histogram) can thus be obtained by considering all such pairs. Similarly, the locality of atoms in different frames is also utilized to compute an approximate SDH very efficiently.
Our algorithm performs the same operation on different pairs of regions. This suggests using parallel processing to further improve performance. General Purpose Computing on Graphics Processing Units (GPGPU) is a low-cost, high-performance solution to parallel processing problems. A large number of parallel threads can be created on GPUs, which are executed on multiple cores. Unlike CPUs, the GPU architecture consists of more than one level of memory that can be addressed by the user program: the global memory and the shared memory. The latter is a cache-grade memory with extremely low latency. This makes code optimization on GPUs a very challenging task. We developed and optimized GPU versions of the above SDH algorithms and achieved dramatic speedup of the computation.
View-based query processing
To make the view-based solution work, the main challenge is the design of query optimization algorithms that take views into account. Query optimizers of existing DBMSs are not designed for our purpose: they focus on views that are built over various base tables in the database, often as a result of join operations. In contrast, a view in our system maps a multidimensional data region to a complex aggregate. Such differences require the development of novel techniques to address the following research problems.
The first problem is how to represent and store recorded views. Since any view of interest in DCMS describes a certain type of query (e.g., mass center) over a collection of raw data points in a 3D region, we organize the views in data structures similar to the spatial trees used for indexing the data. We call this the view index, in which a leaf node has the form (BR, t1, t2, TYPE, rprt), where BR (short for Bounding Region, as in R-trees) is a description of the query region of the view, TYPE is a variable encoding the query type and parameters (e.g., resolution, maximum, and minimum of an SDH query), t1 and t2 refer to the starting and ending frames the query covers, and rprt points to the query results of the view. An R*-tree-like structure is designed to organize the view entries based on their BRs. Upon receiving a query, we search the view index using the BR value of the new query as the key and retrieve all views that match the type and overlap with the query in their BRs and temporal coverage. The set of views retrieved forms the basis for query optimization. We found the cost of maintaining and searching the view index to be small due to its tree-based structure and the moderate number of views (as compared to the number of data points).
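The leaf entry and the matching rule described above can be sketched as follows. The field names follow the tuple (BR, t1, t2, TYPE, rprt) from the text, while the class and function names are hypothetical. For brevity the lookup is a linear scan over a list; this deliberately replaces the R*-tree-like structure described above and only illustrates the matching predicate.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Point3 = Tuple[float, float, float]
Box = Tuple[Point3, Point3]      # (minimum corner, maximum corner)

@dataclass
class ViewEntry:
    br: Box            # bounding region of the query that produced the view
    t1: int            # first frame covered
    t2: int            # last frame covered
    query_type: str    # TYPE: query kind plus its parameters
    rprt: Any          # reference to the stored query result

def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes intersect on every axis."""
    return all(a[0][k] <= b[1][k] and b[0][k] <= a[1][k] for k in range(3))

def find_candidate_views(views: List[ViewEntry], br: Box, t1: int, t2: int,
                         query_type: str) -> List[ViewEntry]:
    """Views of the same type that overlap the query in space and in time."""
    return [
        v for v in views
        if v.query_type == query_type
        and boxes_overlap(v.br, br)
        and v.t1 <= t2 and t1 <= v.t2
    ]
```

The returned candidates are only the input to plan selection; deciding which of them to combine with freshly computed regions is the job of the query optimizer.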
Cost evaluation of view-based plans
Given a set of matched views, there could be multiple ways (i.e., execution plans) to execute a relevant query. For instance, to compute Q2 in Figure 6, we could either compute the result in region B∪C and merge it with that of Q2 − (B∪C), or merge the results of region Q2 − C with that of C, or Q2 − B with B. The query optimizer of DCMS should be able to list the different execution plans and choose the one with the lowest expected cost. Obviously, those plans that do not involve any views should also be evaluated for comparison. For this purpose, a cost model for each query type is designed to quantify the time needed to accomplish a plan. Factors that are considered in the model include: area/volume of the relevant regions involved, expected number of data points in these regions, costs to resolve views with overlapping BRs (e.g., costs to compute the query in B∪C, which can be used to solve Q2), and existence of indexes. For queries with a small number of execution plans, the decision on which plan to choose can be made by evaluating the costs of all the plans. We are in the process of designing heuristic algorithms to help make decisions with reasonable response time when facing a large number of execution plans.
where f is the frequency at which the view is utilized for query execution and o shows the extent to which the view’s BR overlaps with other materialized views. In case of enforced view selection, a view with lower score will be discarded before one with a higher score. The intuition behind the above formula is to calculate the benefit to cost ratio of view maintenance: views that are frequently used for query processing (i.e., more benefit) and cover a less crowded region (i.e., lower maintenance/query optimization cost) should be kept. Note that storage cost is not included in the scoring function as we believe it is not the bottlenecking factor in our view-based query processing.
Simulation information is stored on disk frame by frame for further analysis of the system under study. Given the large number of particles in the system, a simulation of a few microseconds can generate terabytes of data. The size of the data poses problems in input/output, storage, and network transfer. Therefore, compression of simulation data is very important. Traditional compression techniques cannot achieve a high compression ratio: data compressed using dictionary-based and statistical methods can still be large. Accessing a small portion of the compressed data requires decompression of the entire data set. In addition, corruption of a few bits can make the entire data set unusable. Techniques that use the spatial uniformity of the data, such as space-filling curves, can produce better compression ratios. However, existing methods do not consider temporal locality for further compression. We group several frames together to form a window, and a window is compressed using our technique explained below.
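The benefit of grouping frames into a window can be illustrated with a generic delta-encoding sketch: the first frame of the window is stored as is, and each later frame is stored as its difference from the previous one, so that small atomic movements become small numbers that a general-purpose compressor handles well. This is a stand-in illustration and not the compression technique developed for DCMS. It assumes coordinates have already been quantized to integers that fit in 32 bits, and the function names and byte layout are hypothetical.

```python
import struct
import zlib

def compress_window(frames):
    """frames: non-empty list of equally long lists of quantized integer coordinates."""
    first = frames[0]
    values = list(first)                       # first frame stored verbatim
    for prev, cur in zip(frames, frames[1:]):
        # Temporal deltas: small when atoms move little between frames.
        values.extend(c - p for p, c in zip(prev, cur))
    header = struct.pack("<II", len(frames), len(first))
    body = struct.pack("<%di" % len(values), *values)
    return header + zlib.compress(body)

def decompress_window(blob):
    """Inverse of compress_window; rebuilds every frame of the window."""
    n_frames, n_values = struct.unpack_from("<II", blob, 0)
    body = zlib.decompress(blob[8:])
    values = struct.unpack("<%di" % (n_frames * n_values), body)
    frames = [list(values[:n_values])]
    for f in range(1, n_frames):
        deltas = values[f * n_values:(f + 1) * n_values]
        frames.append([p + d for p, d in zip(frames[-1], deltas)])
    return frames
```

Because each window is self-contained, reading one frame requires decompressing only its window rather than the whole trajectory, and a corrupted window does not affect the others.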
Discussion and evaluation
We have implemented a prototype of DCMS and tested it with real MS datasets and workloads. We used PostgreSQL (version 8.2.6) as our database engine. We extended the current PostgreSQL codebase significantly by adding new data types, the TPS tree index, and various query processing and optimization algorithms. The TPS tree implementation, along with the addition of two new data types (i.e., 3D point and 3D box), was built on top of the SP-GiST package. The query processing algorithms were programmed in the following three ways: (1) most one-body functions are implemented as stored procedures using PL/pgSQL, a built-in procedural language provided by PostgreSQL; (2) the more complex multi-body functions are programmed as C functions and integrated into the PostgreSQL query interface; and (3) query processing algorithms related to the TPS tree are directly implemented as part of the PostgreSQL kernel. The data transformation module and data compressor were implemented as programs outside the DBMS. In the remainder of this section, we report results of selected experiments run on our prototype. Such experiments were conducted on a Mac Xserve with a quad-core Intel Xeon CPU running the OS X 10.6 operating system (Snow Leopard). The server is connected to a storage array with a 35 TB capacity via dual FibreChannel links.
Efficiency of data retrieval in DCMS
The main goal of this experiment is to show the efficiency of data retrieval in DCMS. Here we show the results of queries against a single MS dataset with 286,000 atoms and 100,000 frames, the total data size of which is about 250 GB. Note that such queries against a single simulation are typical in MS applications, as comparing multiple simulations is less popular. This dataset was generated in our previous work to simulate a hydrated DPPC system in NaCl and KCl solutions. For comparison, we sent the same queries to the data analysis toolkit of GROMACS, a mainstream file-based system for MS simulation and data analysis. GROMACS is reported to have better performance than other popular MS systems and therefore represents the state of the art in MS data analysis. We tested the following types of queries: (1) random point access (RDM); (2) single-atom trajectory retrieval (TRJ); (3) frame retrieval (FRM); (4) range (RNG) queries; and (5) nearest neighbor (NN) queries. These queries, as discussed in Section “Analytical queries in DCMS”, form the basis for most high-level data processing tasks in MS. Upon loading the data, PostgreSQL automatically builds a B+-tree index on the combination of step number and atom ID. Then we built the TPS tree on the Atom Location table for our experiments. For the control experiments using GROMACS, the queries were run against the same dataset organized in 400 files, each holding 250 time frames. To achieve fair comparisons, we also used a grid-based spatial index in GROMACS for the RNG and NN queries.
Table: Query processing time (in seconds) in database-centric and file-based MS analysis. [Table body not recovered; surviving column label: DCMS + TPS]
SDH Computation results
Table: Running time (seconds) of brute-force SDH method on GPU. [Table body not recovered; surviving cell: > 27 days]
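For readers unfamiliar with the brute-force approach named in the caption above, the following is a minimal CPU sketch of a spatial distance histogram (SDH): every pair of points is visited once and its distance is counted in a fixed-width bucket. It is not the GPU implementation evaluated in the paper. The function name, the parameters, and the choice to clamp out-of-range distances into the last bucket are all assumptions made for illustration.

```python
import math

def brute_force_sdh(points, bucket_width, num_buckets):
    """Histogram of all pairwise distances among 3D points.

    Visits each unordered pair (i, j) with i < j exactly once, so the
    work grows quadratically with the number of points. Distances at
    or beyond bucket_width * num_buckets are clamped into the last
    bucket (an assumption of this sketch).
    """
    hist = [0] * num_buckets
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            bucket = min(int(d // bucket_width), num_buckets - 1)
            hist[bucket] += 1
    return hist
```

The quadratic number of pairs is what makes this method expensive for large atom counts. A GPU version would typically distribute the (i, j) pairs across threads, but the counting logic per pair stays the same.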
Data compression results
Other related work
Traditionally, database systems are mainly designed for commercial data and applications. In recent years, the scientific community has also adopted database technology in processing scientific data. However, scientific data are different from commercial data in that: (1) the volume of scientific data can be orders of magnitude larger; (2) data are often multidimensional and continuous; and (3) queries against scientific data are more complex. The above differences bring significant challenges to system design in scientific databases.
In summary, scientific database research falls into the following three types. The first is to build databases on top of out-of-the-box DBMS products, as seen in the following examples: GenBank (http://www.ncbi.nlm.nih.gov/Genbank) provides public access to about 80 million gene sequences; the Sloan Digital Sky Survey enables astronomers to explore millions of objects in the sky; and the PeriScope project explores declarative queries against biological sequence data. The second type focuses on extending the kernel functionalities of DBMSs to meet challenges in scientific data management. This includes work that deals with query languages, data storage, data compression, index design, I/O scheduling, and data provenance. The last type takes a more aggressive path by designing new DBMS architectures and building the DBMS from scratch. Most efforts in this direction took place in the past few years. The SciDB project advocates a new data model (i.e., the multidimensional array model) for scientific domains and has released a prototype that enables parallel processing of data in a highly distributed environment. Our strategy of building DCMS falls into the second category.
Despite the importance of MS as a major research tool in many scientific and engineering fields, there is a lack of systems for effective data management. To that end, we developed a unified Information Integration and Informatics framework named Database-centric Molecular Simulations (DCMS). DCMS is designed to store simulation data collected from various sources, provide standard APIs for efficient data retrieval and analysis, and allow global data access to the research community. This framework is also a portal for registering well-accepted queries that in turn serve as building blocks for more complex high-level analytical programs. Users can develop these high-level programs into applications, for example applications that give experimentalists easy access to their data, visualize data, and provide feedback that can be used to steer MS. A fundamental component of the DCMS system is a relational database, which allows scientists to concentrate on developing high-level analytical tools using declarative query languages while passing the low-level details (e.g., primary query processing, data storage, basic access control) to DCMS. One of the most serious problems in existing MS systems is the low efficiency of data access and query processing. The unique query patterns of MS applications impose interesting challenges and also provide abundant optimization opportunities for DCMS design. To meet such challenges, we augmented an open-source DBMS with novel data structures and algorithms for efficient data retrieval and query processing. We focused on indexing and data organization techniques, query processing algorithms, and optimization strategies. The DCMS system was also used as a platform to evaluate data compression algorithms designed specifically for MS data, which can significantly reduce the size of the data.
Immediate work within DCMS will focus on sharing computations (especially I/O operations) among different queries. Unlike traditional database applications, MS analysis normally centers on a small number of analytical queries (Table 1). Therefore, we can proactively run all relevant analytics while the data is being loaded into DCMS. The advantage of this strategy is that only one I/O stream is needed; we have shown earlier that I/O can be the bottleneck in handling typical MS workloads. This requires us to modify the DBMS kernel to implement a master query processing algorithm that replaces those dealing with individual queries. On the query processing side, the use of other parallel hardware such as multi-core CPUs and FPGAs deserves further effort. Our current design of DCMS focuses on a single-node environment; deploying DCMS on modern data processing platforms in a highly distributed environment (e.g., a computing cloud) is an obvious direction for future exploration.
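The idea of sharing one I/O stream among several analytics can be sketched as follows. This is only an illustration under stated assumptions, not the planned kernel-level master algorithm: the function `load_with_analytics`, the two example analytics, and the frame layout `(step, [(atom_id, x, y, z), ...])` are all hypothetical.

```python
def load_with_analytics(frames, analytics):
    """Read each frame once and feed it to every registered analytic.

    frames: iterable of (step, atoms), atoms being (atom_id, x, y, z) tuples.
    analytics: dict mapping a name to an object with update() and result().
    Returns the number of frames read and the result of each analytic.
    """
    frames_read = 0
    for step, atoms in frames:
        frames_read += 1
        # A real loader would also write the frame to storage here.
        for analytic in analytics.values():
            analytic.update(step, atoms)
    return frames_read, {name: a.result() for name, a in analytics.items()}

class FrameCentroid:
    """Per-frame geometric center of all atoms (an example one-body analytic)."""
    def __init__(self):
        self._out = {}
    def update(self, step, atoms):
        n = len(atoms)
        self._out[step] = tuple(sum(a[i] for a in atoms) / n for i in (1, 2, 3))
    def result(self):
        return self._out

class BoxCount:
    """Per-frame count of atoms inside an axis-aligned box."""
    def __init__(self, lo, hi):
        self._lo, self._hi, self._out = lo, hi, {}
    def update(self, step, atoms):
        self._out[step] = sum(
            all(self._lo[i] <= a[i + 1] <= self._hi[i] for i in range(3))
            for a in atoms
        )
    def result(self):
        return self._out
```

Because `frames` may be a generator that can be consumed only once, the sketch makes the single-pass property explicit: adding another analytic adds computation per frame but no additional read of the data.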
The project described here is supported by a research award (No. R01GM086707) from the US National Institutes of Health (NIH). Part of this work is also supported by two grants (IIS-1117699 and IIS-1253980) from the US National Science Foundation (NSF), and a gift from Nvidia via its CUDA Research Center program. The authors would like to thank the following collaborators for their contributions at various stages of this project: Anthony Casagrande, Shaoping Chen, Jin Huang, Jacob Israel, Dan Lin, Gang Shen, and Yongke Yuan.
- Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St. Pierre S, Twigger S, White O, Rhee SY: Big data: The future of biocuration. Nature 2008, 455:47–50. 10.1038/455047a
- Huberman B: Sociology of science: Big data deserve a bigger audience. Nature 2012, 482:308. 10.1038/482308d
- Centola D: The spread of behavior in an online social network experiment. Science 2010, 329:1194–1197. 10.1126/science.1185231
- Wu X, Zhu X, Wu G-Q, Ding W: Data mining with big data. IEEE Trans Knowl Data Eng 2014, 26(1):97–107. 10.1109/TKDE.2013.109
- Bollen J, Mao H, Zeng X: Twitter mood predicts the stock market. J Comput Sci 2011, 2:1–8. 10.1016/j.jocs.2010.12.007
- Michaud-Agrawal N, Denning E, Woolf T, Beckstein O: MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. J Comput Chem 2011, 32(10):2319–2327. 10.1002/jcc.21787
- Humphrey W, Dalke A, Schulten K: VMD: visual molecular dynamics. J Mol Graph 1996, 14(1):33–38. 10.1016/0263-7855(96)00018-5
- Nutanong S, Carey N, Ahmad Y, Szalay AS, Woolf TB: Adaptive exploration for large-scale protein analysis in the molecular dynamics database. In Proceedings of 25th Intl. Conf. Scientific and Statistical Database Management. SSDBM. ACM, New York, NY, USA; 2013:45–1454.
- Hess B, Kutzner C, van der Spoel D, Lindahl E: GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. J Chem Theory Comput 2008, 4(3):435–447. 10.1021/ct700301q
- Plimpton SJ: Fast parallel algorithms for short range molecular dynamics. J Comput Phys 1995, 117:1–19. 10.1006/jcph.1995.1039
- Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Karplus M: CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem 1985, 4:187–217. 10.1002/jcc.540040211
- Ng MH, Johnston S, Wu B, Murdock SE, Tai K, Fangohr H, Cox SJ, Essex JW, Sansom MSP, Jeffreys P: BioSimGrid: grid-enabled biomolecular simulation data storage and analysis. Future Generation Comput Syst 2006, 22(6):657–664. 10.1016/j.future.2005.10.005
- Feig M, Abdullah M, Johnsson L, Pettitt BM: Large scale distributed data repository: design of a molecular dynamics trajectory database. Future Generation Comput Syst 1999, 16(1):101–110. 10.1016/S0167-739X(99)00039-4
- Finocchiaro G, Wang T, Hoffmann R, Gonzalez A, Wade R: DSMM: a database of simulated molecular motions. Nucleic Acids Res 2003, 31(1):456–457. 10.1093/nar/gkg113
- van der Kamp M, Schaeffer R, Jonsson A, Scouras A, Simms A, Toofanny R, Benson N, Anderson P, Merkley E, Rysavy S, Bromley D, Beck D, Daggett V: Dynameomics: a comprehensive database of protein dynamics. Structure 2010, 18(4):423–435. 10.1016/j.str.2010.01.012
- Frenkel D, Smit B (2002) Understanding molecular simulation: from algorithm to applications. Comput Sci Ser 1. Academic Press.
- Bamdad M, Alavi S, Najafi B, Keshavarzi E: A new expression for radial distribution function and infinite shear modulus of Lennard-Jones fluids. Chem Phys 2006, 325:554–562. 10.1016/j.chemphys.2006.02.001
- Starck JL, Murtagh F: Astronomical image and data analysis. Springer, Berlin, Heidelberg; 2006.
- Wishart DS, Nip AM: Protein chemical shift analysis: a practical guide. Biochem Cell Biol 1998, 76:153–163. 10.1139/o98-038
- Kim YJ, Patel JM (2007) Rethinking choices for multi-dimensional point indexing: making the case for the often ignored quadtree. In: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR), 281–291. [www.cidrdb.org]
- Nascimento M, Silva J (1998) Towards historical R-trees. In: Proceedings of ACM Symposium of Applied Computing (SAC), 235–240.
- Szalay A, Gray J, vandenBerg J (2002) Petabyte scale data mining: dream or reality. Technical Report MSR-TR-2002–84, Microsoft Research.
- Chen S, Tu Y-C, Xia Y: Performance analysis of a dual-tree algorithm for computing spatial distance histograms. VLDB Journal 2011, 20(4):471–494. 10.1007/s00778-010-0205-7
- Grupcev V, Yuan Y, Tu Y-C, Huang J, Chen S, Pandit S, Weng M: Approximate algorithms for computing spatial distance histograms with accuracy guarantees. IEEE Trans Knowl Data Eng 2013, 25(9):1982–1996. 10.1109/TKDE.2012.149
- Kumar A, Grupcev V, Yuan Y, Tu Y-C, Huang J: Computing spatial distance histograms for large scientific datasets on-the-fly. IEEE Trans Knowl Data Eng 2014, 26(10):2410–2424. 10.1109/TKDE.2014.2298015
- Halevy AY: Answering queries using views: A survey. VLDB Journal 2001, 10(4):270–294. 10.1007/s007780100054
- Afrati FN, Li C, Ullman JD: Using views to generate efficient evaluation plans for queries. J Comput Syst Sci 2007, 73(5):703–724. 10.1016/j.jcss.2006.10.019
- Guttman A: R-trees: a dynamic index structure for spatial searching. In Proceedings of International Conference on Management of Data (SIGMOD). ACM Press, Boston, Massachusetts; 1984:47–57.
- Omeltchenko A, Campbell TJ, Kalia RK, Liu X, Nakano A, Vashishta P: Scalable I/O of large-scale molecular dynamics simulations: a data-compression algorithm. Comput Phys Commun 2000, 131:78–85. 10.1016/S0010-4655(00)00083-7
- Kumar A, Zhu X, Tu Y-C, Pandit S: Compression in molecular simulation datasets. In 4th International Conference on Intelligence Science and Big Data Engineering (IScIDE). Springer, Beijing, China; 2013:22–29. 10.1007/978-3-642-42057-3_4
- Aref WG, Ilyas IF: SP-GiST: an extensible database index for supporting space partitioning trees. J Intell Inform Syst 2001, 17(2–3):215–240. 10.1023/A:1012809914301
- Nvidia. [http://www.nvidia.com/object/cuda_home_new.html]
- Szalay AS, Gray J, Thakar A, Kunszt PZ, Malik T, Raddick J, Stoughton C, vandenBerg J: The SDSS Skyserver: Public Access to the Sloan Digital Sky Server Data. In Proceedings of International Conference on Management of Data (SIGMOD). ACM, Madison, Wisconsin; 2002:570–581.
- Patel JM: The Role of Declarative Querying in Bioinformatics. OMICS: J Integr Biol 2003, 7(1):89–91. 10.1089/153623103322006670
- Chiu D, Agrawal G: Enabling Ad Hoc Queries over Low-Level Scientific Data Sets. In SSDBM. Springer, New Orleans, LA, USA; 2009:218–236.
- Arya M, Cody WF, Faloutsos C, Richardson J, Toya A: QBISM: Extending a DBMS to Support 3D Medical Images. In ICDE. IEEE, Houston, Texas, USA; 1994:314–325.
- Ivanova M, Kersten ML, Nes N: Adaptive segmentation for scientific databases. In ICDE. IEEE, Cancún, México; 2008:1412–1414.
- Shahabi C, Jahangiri M, Banaei-Kashani F: Proda: An end-to-end wavelet-based olap system for massive datasets. IEEE Comput 2008, 41(4):69–77. 10.1109/MC.2008.130
- Chakrabarti K, Garofalakis M, Rastogi R, Shim K: Approximate query processing using wavelets. VLDB J 2001, 10(2–3):199–223.
- Csabai I, Trencseni M, Dobos L, Jozsa P, Herczegh G, Purger N, Budavari T, Szalay AS (2007) Spatial indexing of large multidimensional databases. In: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR), 207–218. [www.cidrdb.org]
- Ma X, Winslett M, Norris J, Jiao X: Godiva: Lightweight data management for scientific visualization applications. In ICDE. IEEE Computer Society, Boston, MA, USA; 2004:732–744.
- Chapman A, Jagadish HV, Ramanan P: Efficient provenance storage. In SIGMOD Conference. ACM, Vancouver, BC, Canada; 2008:993–1006.
- Stonebraker M, Becla J, Dewitt D, Lim K-T, Maier D, Ratzesberger O (2009) Requirements for Science Data Bases and SciDB. In: CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research. [www.cidrdb.org]
- Stonebraker M, Bear C, Cetintemel U, Cherniack M, Ge T, Hacham N, Harizopoulos S, Lifter J, Rogers J, Zdonik S (2007) One Size Fits All?- Part 2: Benchmarking Results. In: CIDR 2007, Third Biennial Conference on Innovative Data Systems Research. [www.cidrdb.org]
- Stonebraker M, Madden S, Abadi DJ, Harizopoulos S, Hachem N, Helland P: The End of an Architectural Era (It’s Time for a Complete Rewrite). In Proceedings of the 33rd International Conference on Very Large Data Bases. ACM, University of Vienna, Austria; 2007:1150–1160.
- Sinha RR, Termehchy A, Mitra S, Winslett M (2007) Maitri Demonstration: Managing Large Scale Scientific Data (Demo). In: CIDR 2007, Third Biennial Conference on Innovative Data Systems Research, 219–224, Asilomar, CA, USA. [www.cidrdb.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.