Study on Metabonomic Data Parallel Processing Based on LC/MS

孙海涛; 杨志强; 李葆红; 陈德展

doi:10.7538/zpxb.2015.36.06.0535

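Abstract

Metabonomics is a new branch of life science research that follows genomics and proteomics, and liquid chromatography-mass spectrometry (LC/MS) is a metabolite detection method widely used in metabonomic studies. To quickly process the large volume of raw data that LC/MS instruments generate during detection, this study groups the raw data and assigns the pre-processing of each group to a different computing node, yielding a pre-processing method based on data parallelism. To improve parallel efficiency, a peak pre-identification algorithm is also proposed. Experiments show that when the raw data of 27 mouse serum samples were grouped and processed in parallel on 5 computing nodes, the speedup was 2.87 when the data were partitioned evenly by retention time and 4.55 when they were partitioned evenly by peaks. Tests with larger data volumes and more computing nodes show that the data-parallel method is much faster than serial processing on a single computing node, and that the speedup S_p of the peak-parallel mode approaches the ideal speedup P. The method can quickly and accurately process the massive data generated in metabonomic studies.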

Abstract: Metabonomics is a new field of life science research that follows genomics and proteomics. It explores the relationship between an organism's metabolites and pathological changes. LC/MS is an important analytical technology for determining metabolites and has been widely used in disease diagnosis, pharmaceutical analysis, and other areas of metabonomics. As the technology has spread, large amounts of raw data have accumulated quickly. MZmine is currently one of the leading software environments providing a full analysis pipeline for these data. However, processing takes a long time because the performance of traditional serial computation is running up against physical limits. A method that processes data faster and extracts useful information from these massive data sets in a timely manner is therefore valuable. In this paper, a new pre-processing method based on data parallelism is proposed to increase the speed of data processing. Raw data are grouped and processed in parallel on different computing nodes, each with MZmine installed. Experiments show that the simple time grouping mode is unstable, because the complexity of data processing depends closely on how the data are grouped. A new parallel peak pre-identification method, named the peak grouping mode, is therefore proposed to quickly identify peaks and group the data. The results show that, when processing the raw data of 27 mouse serum samples on 5 nodes, the speedup was 2.87 for the time grouping mode and 4.55 for the peak grouping mode. Tests with more data and more computing nodes indicate that the new parallel method is faster than the serial method and that the speedup of the peak grouping mode tends toward linear. In addition, the peak grouping mode is more efficient and more stable than the time grouping mode in terms of parallel load balancing.
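The difference between the two grouping modes can be illustrated with a small sketch. It is not the authors' implementation and does not call MZmine. It assumes, as a simplification, that the processing cost of a group is proportional to the number of peaks it contains, and it uses made-up retention times in which most peaks elute early. The function names (`split_by_time`, `split_by_peaks`, `estimated_speedup`, `efficiency`) are hypothetical. Speedup is taken in the usual sense, serial time divided by parallel time, and efficiency as speedup divided by the number of nodes.

```python
from typing import List, Sequence


def split_by_time(rt: Sequence[float], nodes: int) -> List[int]:
    """Assign each peak to a node using equal-width retention-time windows."""
    lo, hi = min(rt), max(rt)
    width = (hi - lo) / nodes or 1.0  # guard against a zero-width range
    return [min(int((t - lo) / width), nodes - 1) for t in rt]


def split_by_peaks(rt: Sequence[float], nodes: int) -> List[int]:
    """Assign peaks in retention-time order so every node gets about the same count."""
    order = sorted(range(len(rt)), key=lambda i: rt[i])
    assignment = [0] * len(rt)
    for rank, i in enumerate(order):
        assignment[i] = rank * nodes // len(rt)
    return assignment


def estimated_speedup(assignment: Sequence[int], nodes: int) -> float:
    """Serial work divided by the work of the busiest node.

    Assumption: cost is proportional to peak count, and communication
    overhead is ignored.
    """
    loads = [0] * nodes
    for node in assignment:
        loads[node] += 1
    return len(assignment) / max(loads)


def efficiency(speedup: float, nodes: int) -> float:
    """Parallel efficiency: achieved speedup relative to the ideal speedup."""
    return speedup / nodes


# Hypothetical retention times (minutes): 12 crowded early peaks, 8 sparse late ones.
demo_rt = [0.5 + 0.1 * i for i in range(12)] + [5.0, 9.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0]
NODES = 5

time_speedup = estimated_speedup(split_by_time(demo_rt, NODES), NODES)
peak_speedup = estimated_speedup(split_by_peaks(demo_rt, NODES), NODES)
print(f"time grouping: estimated speedup {time_speedup:.2f}")
print(f"peak grouping: estimated speedup {peak_speedup:.2f}")

# Efficiencies implied by the speedups reported in the abstract (5 nodes).
print(f"reported time grouping efficiency: {efficiency(2.87, NODES):.3f}")
print(f"reported peak grouping efficiency: {efficiency(4.55, NODES):.3f}")
```

In this toy data set the equal-width time windows put most peaks on one node, so that node dominates the run time, whereas equal peak counts spread the load evenly. The reported figures follow the same pattern: on 5 nodes, a speedup of 2.87 corresponds to an efficiency of about 0.57, and 4.55 to about 0.91.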
