Background :
Converting individual gas or liquid chromatography mass spectrometry data files into a data matrix of peaks, samples, and intensities is an essential step in a standardized metabolomics workflow. For GCMS data files, the Automated Mass Spectral Deconvolution and Identification System (AMDIS) offers a free solution for extracting peak lists from raw data files. Data files can be in a native vendor format or in netCDF format. AMDIS can process files in batch mode, so up to 500 data files can be processed easily on a normal desktop. For more information, search YouTube for ‘AMDIS batch’. One of the AMDIS outputs is an .ELU file, a peak list containing information about each chromatographic peak, such as retention time, S/N ratio, quant ion, and deconvoluted spectrum. The next step in the data processing is to use SpectConnect, a freely available software from MIT, to merge the individual ELU files into a data matrix that can be used for statistical analysis. Peaks from different files are merged based on their retention times and deconvoluted mass spectra, if the same peak is observed across all the samples. The SpectConnect web server (http://spectconnect.mit.edu/) allows uploading up to 100 data files, one at a time. However, if you have more than 100 data files, you need to install SpectConnect locally to process those files and generate a matrix.
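To make the merging idea concrete, here is a minimal Python sketch of how two peaks from different files could be compared by retention time and spectral similarity. This is only an illustration and not SpectConnect's actual implementation: the peak layout (a dict with "rt" and "spectrum" keys), the cosine similarity measure, and both threshold values are assumptions chosen for the example.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def peaks_match(peak_a, peak_b, rt_tolerance=0.1, min_similarity=0.8):
    """Return True if two peaks plausibly represent the same compound.

    ASSUMPTION: rt_tolerance (minutes) and min_similarity are illustrative
    values, not the thresholds SpectConnect uses.
    """
    rt_close = abs(peak_a["rt"] - peak_b["rt"]) <= rt_tolerance
    similar = cosine_similarity(peak_a["spectrum"], peak_b["spectrum"]) >= min_similarity
    return rt_close and similar


# Hypothetical peaks from two different samples
peak_1 = {"rt": 10.00, "spectrum": {73: 100, 147: 50}}
peak_2 = {"rt": 10.05, "spectrum": {73: 90, 147: 55}}
peak_3 = {"rt": 12.00, "spectrum": {73: 100, 147: 50}}

print(peaks_match(peak_1, peak_2))  # close in time, similar spectra
print(peaks_match(peak_1, peak_3))  # same spectrum, but eluting far apart
```

Requiring both conditions is what keeps co-eluting but different compounds, and similar compounds eluting at different times, from being merged into one row of the matrix.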
Objective :
To install SpectConnect locally and process more than 500 GCMS data files.
Materials:
Software :
- SpectConnect code: Download from here.
- Gemoda algorithm: Download from here.
- Latest Ubuntu and VirtualBox (VBox), if you are using a PC.
- The scipy and numpy packages for Ubuntu.
Raw Data :
If you don’t have test data, send an email to barupal@gmail.com and I can share some Agilent GCMS data files with you.
Steps :
- Process the individual GCMS files with AMDIS in batch mode, using deconvolution settings optimized for your data.
- Transfer all the ELU files to a directory named ‘main’.
- Inside the ‘main’ folder, make a folder for each experimental class or group and move the related ELU files into it.
- Extract gemoda and spectconnect inside the ‘main’ folder.
- Set up a virtual Ubuntu machine inside VBox.
- Using the VBox folder sharing option, share the ‘main’ folder, including its group sub-directories, and mount it in Ubuntu as a vboxsf file system. Here are the instructions.
- In the Ubuntu terminal, go to the folder where the gemoda code is unzipped and type these commands: ./configure; make; make install. Do the same inside the spectconnect folder.
- Type python spectConnect.py --help to check that spectconnect is installed properly. It should not throw an error if it is installed correctly.
- Run the command (taken from the help output) to process the GCMS data files. The parameters are: basePath = path to the ‘main’ folder, elutionThreshold = medium, similarityThreshold = medium, and -alsoMatrix.
- The output is a ‘results’ folder inside the ‘main’ folder containing the matrix and an MSL library. Use the MSL library in the NIST MS Search program to perform identification, and use the matrix to perform statistical analysis.
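With several hundred ELU files, sorting them into group folders by hand is tedious. The following Python sketch automates that step. It assumes a hypothetical naming convention in which each file is called "<group>_<sample>.ELU"; if your files are named differently, pass your own function that maps a file name to its group.

```python
from pathlib import Path
import shutil


def group_from_filename(name):
    """ASSUMPTION: hypothetical naming scheme '<group>_<sample>.ELU'."""
    return Path(name).stem.split("_", 1)[0]


def organize_elu_files(main_dir, group_of=group_from_filename):
    """Move every .ELU file in main_dir into a sub-folder named after its group.

    Returns a dict mapping each group name to the list of files moved into it.
    """
    main = Path(main_dir)
    moved = {}
    # sorted() takes a snapshot of the listing before any folders are created
    for elu in sorted(main.iterdir()):
        if not elu.is_file() or elu.suffix.lower() != ".elu":
            continue  # skip folders and non-ELU files
        group_dir = main / group_of(elu.name)
        group_dir.mkdir(exist_ok=True)
        shutil.move(str(elu), str(group_dir / elu.name))
        moved.setdefault(group_dir.name, []).append(elu.name)
    return moved
```

For example, calling organize_elu_files("main") on a folder holding control_01.ELU and treated_01.ELU would create ‘control’ and ‘treated’ sub-folders and move each file into the matching one. Try it on a copy of the data first, since the files are moved rather than copied.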
It takes about 3 hours to process 600 GCMS data files and generate the table, which is quite impressive. Although the single-threaded spectconnect code performs well, it would need to be parallelized to process large numbers of big data files from TOF instruments with high scan speeds (20 scans/sec). Single quad instruments acquire around 2 scans per second, so their files are small and data processing has been fast so far. I have not tested this on TOF data yet, but anyone who is interested can download the test data from here (2.1 GB).
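The numbers above can be put into perspective with a little arithmetic. This small Python sketch uses only the figures quoted in this post (3 hours, 600 files, 20 versus 2 scans per second). The function names are made up for illustration, and the scan ratio only describes how many more scans a TOF file holds for the same run length; it is not a measured prediction of processing time.

```python
def seconds_per_file(total_hours, n_files):
    """Average processing time per data file, in seconds."""
    return total_hours * 3600 / n_files


def scan_ratio(fast_scans_per_sec, slow_scans_per_sec):
    """How many times more scans the faster instrument records per second."""
    return fast_scans_per_sec / slow_scans_per_sec


print(seconds_per_file(3, 600))  # 18.0 seconds per file on average
print(scan_ratio(20, 2))         # 10.0 times more scans for TOF than single quad
```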
References :
- Styczynski, Mark P., et al. “Systematic identification of conserved metabolites in GC/MS data for metabolomics and biomarker discovery.” Analytical Chemistry 79.3 (2007): 966-973.
- Barupal, Dinesh K., et al. “Hydrocarbon phenotyping of algal species using pyrolysis-gas chromatography mass spectrometry.” BMC Biotechnology 10.1 (2010): 40.