GitHub Repository for this Script
OrgMassSpecR Homepage

Introduction

Given spectra from a set of samples, this script identifies the mass spectra containing bromine and/or chlorine. It was designed for electron impact (EI) mass spectra generated by a LECO GCxGC-TOF mass spectrometer (unit resolution), although it could be adapted for data generated by other instruments. The script does not identify the specific type of halogenation, such as Br2Cl or Cl5, only that the compound is likely halogenated.

The script was created as part of a non-targeted analysis workflow to identify halogenated organic contaminants in environmental samples. For example, tissue samples were analyzed by GCxGC-TOF and the instrument software was used to deconvolute the data and generate the MSP and CSV files described below. These files contain a large majority of non-halogenated spectra and a minority of halogenated spectra. The files were processed by the script to automatically identify the halogenated spectra. Chemical structures were then assigned by searches against mass spectral libraries and manually reviewed using the instrument’s software.

The core of the identification (Step 5 below) is performed by the isotope ratio filter originally described in Pena-Abaurrea M., et al. Identification of potential novel bioaccumulative and persistent chemicals in sediments from Ontario (Canada) using scripting approaches with GCxGC-TOF MS analysis. Environ. Sci. Technol. 2014; 48: 9591-9599. For our use case, we re-wrote the rules in R, added additional filter rules, and added code to process the input files as a batch.

The inputs are MSP files containing the mass spectra and CSV files containing ancillary information; see below for more details. Both are exported by the LECO ChromaTOF software. The final output of the script is a set of CSV files, one per sample, listing the spectra identified as halogenated.

This R Notebook documents the script and runs a small example of two samples, each containing three spectra. For large data sets, it is recommended to use Identify Halogenated Script.R, which can more conveniently report status information to the console while running.

The analysis runs in steps to make troubleshooting easier and to allow re-running only selected portions of the code. Each step stores its files within a defined directory structure, which is created automatically within the working directory.

rm(list = ls())
library(knitr)
opts_chunk$set(tidy = FALSE, comment = NA)
library(OrgMassSpecR) # The OrgMassSpecR version must be >= 0.5-1.
## Loading required package: grid

Create Directory Structure

If they do not exist, create the following directories within the working directory.

cat("* Creating directory structure as needed...")
* Creating directory structure as needed...
dir.create("Step 1 Input Spectra", showWarnings = FALSE)
dir.create("Step 2 Input Tables", showWarnings = FALSE)
dir.create("Step 3 Split MSP Files", showWarnings = FALSE)
dir.create("Step 4 MSP to CSV", showWarnings = FALSE)
dir.create("Step 5 Halogenation Filter", showWarnings = FALSE)
dir.create("Step 6 Additional Rules", showWarnings = FALSE)
dir.create("Step 7 Final Results", showWarnings = FALSE)
cat("OK\n")
OK

Prepare Spectra

Step 1: Add Input Spectra

For each sample, provide one file in the NIST MSP format containing all of its spectra. The file name should be meaningful (it identifies the sample) and should match the file name of the corresponding CSV file described below. The compound name for each spectrum (the NAME: field in the MSP-formatted spectrum) is not used and does not need to be meaningful. (The workflow assumes spectra do not have assigned compound structures at this point.)

See the folder “Step 1 Input Spectra” for examples. For instructions on how to generate this file within ChromaTOF, see the PDF.
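
For orientation, an individual spectrum entry in an MSP file follows the general pattern sketched below (the values and the NAME: text are illustrative only; the exact layout of the ChromaTOF export may differ slightly, so follow the PDF when generating the file). Spectra are separated by blank lines, which is how the script splits them in Step 3.

NAME: Peak 12
Num Peaks: 4
101 230; 102 15; 103 999; 104 87;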

Step 2: Add Input Peak Tables

For each sample, provide a table in CSV format containing the chromatographic peak number, retention times, and other optional information for each spectrum. The order of the rows in the CSV table must correspond to the order of the spectra in the MSP file, since the two are linked by position. The file name should also match that of the corresponding MSP file.

See the folder “Step 2 Input Tables” for examples. For instructions on how to generate this file within ChromaTOF, see the PDF.
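
A minimal sketch of the expected layout is below. The PeakNumber column name is required (it is used in Steps 5 and 7 to link each row to its spectrum), and its values should number the spectra in the same order as they appear in the MSP file. The retention time column names here are illustrative placeholders; any additional columns are carried through to the output tables.

PeakNumber,RT_1D,RT_2D
1,612.3,1.42
2,655.0,2.10
3,701.8,1.05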

cat("* Checking for presence of Theoretical Distributions.csv...")
* Checking for presence of Theoretical Distributions.csv...
if(file.exists("Theoretical Distributions.csv")) {
  cat("OK\n") } else {
    cat("ERROR\n")
    stop("Working directory must contain the file Theoretical Distributions.csv")
}
OK
# Check that the input spectra and input tables have the same names.
cat("* Checking if input spectra and input tables have the same name...")
* Checking if input spectra and input tables have the same name...
step1Files <- sort(sub(".txt", "", list.files("Step 1 Input Spectra")))
step2Files <- sort(sub(".csv", "", list.files("Step 2 Input Tables")))
if(all(step1Files == step2Files)) {
  cat("OK\n") } else {
  cat("ERROR\n")
    stop("Input spectra (*.txt) in Step 1 folder and input 
       tables (*.csv) in Step 2 folder must have the same names.
       Recommend manually deleting folders for Steps 3-7 before re-running.") 
  }
OK

Step 3: Split MSP Files

Function to split one MSP file containing multiple spectra into multiple files, each containing one spectrum. A sub-directory is created for each sample, and the individual MSP files are named based on their order in the original file.

SplitMSP <- function(originFileName, outputFolderName) {
  
  a <- readLines(originFileName)
  
  # Make factor specifying each spectrum.
  spectrumFactor <- vector(mode = "integer", length = length(a)) 
  spectrumCount <- 1
  for(i in 1:length(a)) {
    if(a[i] != "") {
      spectrumFactor[i] <- spectrumCount
    } else {
      spectrumFactor[i] <- spectrumCount
      spectrumCount <- spectrumCount + 1
    }
  }
  
  spectrumList <- split(a, f = spectrumFactor)
  
  # Make individual files, where the filename specifies the original order in
  # the input file. Blank lines are removed; writeLines adds a single blank
  # line at the end of each file.
  for(j in 1:length(spectrumList)) {
    individualSpectrum <- spectrumList[[j]]
    individualSpectrum <- individualSpectrum[individualSpectrum != ""]
    fullFilename <- paste(outputFolderName, "/Spectrum ", j, ".txt", sep = "")
    writeLines(individualSpectrum, fullFilename)
    # Reporting
    cat('Step 3: Splitting ', fullFilename, '...\n', sep = "")
  }
}

Function call to split the spectra and report progress.

cat("* Beginning Step 3...\n")
* Beginning Step 3...
tmp <- list.files("Step 1 Input Spectra", full.names = TRUE)

for (i in 1:length(tmp)) {
  # Capture MSP file name to use as folder name for split files.
  f1 <- strsplit(tmp[i], split = c("/"), fixed = TRUE)[[1]][2]
  folder <- substr(f1, 1, nchar(f1) - 4) # remove .txt
  dir.create(paste("Step 3 Split MSP Files/", folder, " Spectra", sep = ""))
  SplitMSP(originFileName = tmp[i], 
           outputFolderName = paste("Step 3 Split MSP Files/", folder, " Spectra", sep = ""))
}
Step 3: Splitting Step 3 Split MSP Files/Sample 1 Spectra/Spectrum 1.txt...
Step 3: Splitting Step 3 Split MSP Files/Sample 1 Spectra/Spectrum 2.txt...
Step 3: Splitting Step 3 Split MSP Files/Sample 1 Spectra/Spectrum 3.txt...
Step 3: Splitting Step 3 Split MSP Files/Sample 2 Spectra/Spectrum 1.txt...
Step 3: Splitting Step 3 Split MSP Files/Sample 2 Spectra/Spectrum 2.txt...
Step 3: Splitting Step 3 Split MSP Files/Sample 2 Spectra/Spectrum 3.txt...

Step 4: Convert MSP Files to CSV Files

Function to convert each MSP file into a CSV table. The CSV table is later read into R as a data frame.

rm(list = ls())

ConvertMSP <- function(file) {
  x <- ReadMspFile(file, skip = 2, comment.char = "", remove.placeholders = FALSE)
  # If reporting is needed:
  # cat('Making ', file, '\n', sep = "")
  f1 <- strsplit(file, split = c("/"), fixed = TRUE)[[1]][3] # filename
  f2 <- strsplit(file, split = c("/"), fixed = TRUE)[[1]][2] # sample directory
  fileName <- substr(f1, 1, nchar(f1) - 4) # remove .txt
  cat('Step 4: Converting ', fileName, '...\n', sep = "")
  write.csv(x, file = paste("Step 4 MSP to CSV/", f2, "/", fileName, ".csv", sep = ''), row.names = FALSE)
}

Function call to create the CSV files.

cat("* Beginning Step 4...\n")
* Beginning Step 4...
tmp <- list.dirs("Step 3 Split MSP Files", recursive = FALSE)

for (i in 1:length(tmp)) {
  inputSpectraDirectory <- tmp[i]
  spectra <- dir(inputSpectraDirectory)
  f <- strsplit(tmp[i], split = c("/"), fixed = TRUE)[[1]][2] # sample directory
  dir.create(paste("Step 4 MSP to CSV/", f, sep = ""))
  x <- do.call('rbind', lapply(paste(inputSpectraDirectory, '/', spectra, sep = ''), ConvertMSP))
}
Step 4: Converting Spectrum 1...
Step 4: Converting Spectrum 2...
Step 4: Converting Spectrum 3...
Step 4: Converting Spectrum 1...
Step 4: Converting Spectrum 2...
Step 4: Converting Spectrum 3...

Process Spectra

Step 5: Apply Halogenation Filter

Functions to apply the halogenation filter and write a CSV file listing the spectra identified as halogenated.

# The isotope ratio filter.
ProcessSpectrum <- function(x) {
  
  # Normalize peak intensities to percent of the maximum peak and only examine m/z > 100.
  # The index-based ratios below assume a row at every unit m/z (placeholder
  # peaks are retained when the MSP files are converted in Step 4).
  x$percentIntensity <- with(x, intensity / max(intensity) * 100)
  x$Filter <- FALSE
  x <- x[x$mz > 100, ]
  
  for(i in (nrow(x) - 6):5) {
    
    if(x$percentIntensity[i] > 3) {
      
      # Calculate peak ratios.
      
      peakRatio1 <- x$percentIntensity[i + 1] / x$percentIntensity[i]
      peakRatio2 <- x$percentIntensity[i + 2] / x$percentIntensity[i]
      peakRatio3 <- x$percentIntensity[i + 3] / x$percentIntensity[i]
      peakRatio4 <- x$percentIntensity[i + 4] / x$percentIntensity[i]
      peakRatio5 <- x$percentIntensity[i + 5] / x$percentIntensity[i]
      peakRatio6 <- x$percentIntensity[i + 6] / x$percentIntensity[i]
      peakRatioa <- x$percentIntensity[i - 1] / x$percentIntensity[i]
      peakRatiob <- x$percentIntensity[i - 2] / x$percentIntensity[i]
      peakRatioc <- x$percentIntensity[i - 4] / x$percentIntensity[i]
      
      # Determine if peak ratios pass rules.
      
      if(peakRatio1 < 0.5 & peakRatio1 > 0.003 &   # Rule 1
         peakRatio2 > 0.225 & peakRatio2 <= 1 &    # Rule 2
         peakRatio1 < peakRatio2 &                 # Rule 3
         peakRatio3 < 0.5 & peakRatio3 > 0.001 &   # Rule 4
         peakRatio3 < peakRatio1 &                 # Rule 5
         peakRatio4 < peakRatio2 &                 # Rule 6
         peakRatio5 < peakRatio1 &                 # Rule 7
         peakRatio6 < peakRatio2 &                 # Rule 8
         peakRatioa < 0.5 &                        # Rule 9
         peakRatiob < 1 &                          # Rule 10
         peakRatioc < 1) {                         # Rule 11
        
        x$Filter[i] <- TRUE
        
      }
      
    } # encloses filter
    
  } # encloses for loop that iterates through each m/z
  
  return(any(x$Filter))  # Returns TRUE if compound is halogenated.
  
}
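
# A quick, optional sanity check of the filter (not part of the workflow): a
# synthetic spectrum with a Cl2-like isotope cluster at m/z 120/122/124
# (illustrative intensities) should be flagged as halogenated.
# testSpectrum <- data.frame(mz = 101:130, intensity = 1)
# testSpectrum$intensity[testSpectrum$mz == 120] <- 1000
# testSpectrum$intensity[testSpectrum$mz == 121] <- 50
# testSpectrum$intensity[testSpectrum$mz == 122] <- 640
# testSpectrum$intensity[testSpectrum$mz == 123] <- 30
# testSpectrum$intensity[testSpectrum$mz == 124] <- 100
# ProcessSpectrum(testSpectrum) # expected to return TRUE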

# Prepares filenames and calls ProcessSpectrum

FindHalogens <- function(spectrumFilename) {
  
  cat('Step 5: Processing spectrum', spectrumFilename, '...\n')
  
  x <- read.csv(spectrumFilename, stringsAsFactors = FALSE)
  
  # Shorten the filename (drop the directory and extension) so it matches the
  # SpectrumFilename used later for merging.
  fileName1 <- strsplit(spectrumFilename, split = c("/"), fixed = TRUE)[[1]][3]
  fileName2 <- substr(fileName1, 1, nchar(fileName1) - 4) # remove extension
  
  y <- ProcessSpectrum(x)
  
  return(data.frame(SpectrumFilename = fileName2, Halogenated = y))
  
}

# Calls FindHalogens for all spectra in a sample and merges in the peak table information (retention times, etc.)

ProcessHalogenationSearch <- function(spectraDirectory, inputTableName, sampleName, outputTableName) {
  
  spectra <- dir(spectraDirectory)
  x <- do.call('rbind', lapply(paste(spectraDirectory, '/', spectra, sep = ''), FindHalogens))
  chromaTOF_table <- read.csv(inputTableName, stringsAsFactors = FALSE)
  chromaTOF_table$SpectrumFilename <- paste("Spectrum", chromaTOF_table$PeakNumber)
  x <- merge(x, chromaTOF_table, by = 'SpectrumFilename', all = TRUE)
  x$Sample <- sampleName
  write.csv(x, file = outputTableName, row.names = FALSE)
  
}
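
# When troubleshooting, the filter can also be run on a single spectrum, e.g.
# (assuming the Step 4 output for Sample 1 already exists):
# FindHalogens("Step 4 MSP to CSV/Sample 1 Spectra/Spectrum 1.csv")
# This returns a one-row data frame with the spectrum filename and TRUE/FALSE.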

Function call to apply the halogenation filter and report the progress.

cat("* Beginning Step 5...\n")
* Beginning Step 5...
tmp <- dir("Step 4 MSP to CSV") # "Sample 1 Spectra" "Sample 2 Spectra"
for (i in 1:length(tmp)) {
  sn <- sub(" Spectra", "", x = tmp[i], fixed = TRUE)
  ProcessHalogenationSearch(spectraDirectory = paste("Step 4 MSP to CSV/", tmp[i], sep = ""),
                            inputTableName = paste("Step 2 Input Tables/", sn, ".csv", sep = ""),
                            sampleName = sn,
                            outputTableName = paste("Step 5 Halogenation Filter/Step 5 ", sn, ".csv", sep = ""))
}
Step 5: Processing spectrum Step 4 MSP to CSV/Sample 1 Spectra/Spectrum 1.csv ...
Step 5: Processing spectrum Step 4 MSP to CSV/Sample 1 Spectra/Spectrum 2.csv ...
Step 5: Processing spectrum Step 4 MSP to CSV/Sample 1 Spectra/Spectrum 3.csv ...
Step 5: Processing spectrum Step 4 MSP to CSV/Sample 2 Spectra/Spectrum 1.csv ...
Step 5: Processing spectrum Step 4 MSP to CSV/Sample 2 Spectra/Spectrum 2.csv ...
Step 5: Processing spectrum Step 4 MSP to CSV/Sample 2 Spectra/Spectrum 3.csv ...

Step 6: Set Up Table for Additional Rules

Function to make a table of information for applying the additional rules. Requires the file “Theoretical Distributions.csv” to be in the working directory. To save time, this is only applied to the spectra identified in Step 5. Step 6 determines the matching theoretical isotopic distribution for each halogenated cluster identified in Step 5 and calculates a similarity score.
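
The script expects Theoretical Distributions.csv to contain three columns named intensity, percent, and DistributionLabel (this layout is inferred from how the file is read below), with one row per unit m/z in each labeled distribution and percent = 100 marking the most abundant peak of that distribution. The rows below are an illustrative sketch only, using approximate Cl2 values; refer to the file distributed with the script for the actual distributions.

intensity,percent,DistributionLabel
574,100.0,Cl2
0,0.0,Cl2
367,63.9,Cl2
0,0.0,Cl2
59,10.2,Cl2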

ApplyFilter <- function(x) {

  x$Filter <- FALSE
  x <- x[x$mz > 100, ]

  for(i in (nrow(x) - 6):5) {

    if(x$percentIntensity[i] > 3) {

      # Calculate peak ratios.
      
      peakRatio1 <- x$percentIntensity[i + 1] / x$percentIntensity[i]
      peakRatio2 <- x$percentIntensity[i + 2] / x$percentIntensity[i]
      peakRatio3 <- x$percentIntensity[i + 3] / x$percentIntensity[i]
      peakRatio4 <- x$percentIntensity[i + 4] / x$percentIntensity[i]
      peakRatio5 <- x$percentIntensity[i + 5] / x$percentIntensity[i]
      peakRatio6 <- x$percentIntensity[i + 6] / x$percentIntensity[i]
      peakRatioa <- x$percentIntensity[i - 1] / x$percentIntensity[i]
      peakRatiob <- x$percentIntensity[i - 2] / x$percentIntensity[i]
      peakRatioc <- x$percentIntensity[i - 4] / x$percentIntensity[i]
      
      # Determine if peak ratios pass rules.
      
      if(peakRatio1 < 0.5 & peakRatio1 > 0.003 &   # Rule 1
         peakRatio2 > 0.225 & peakRatio2 <= 1 &    # Rule 2
         peakRatio1 < peakRatio2 &                 # Rule 3
         peakRatio3 < 0.5 & peakRatio3 > 0.001 &   # Rule 4
         peakRatio3 < peakRatio1 &                 # Rule 5
         peakRatio4 < peakRatio2 &                 # Rule 6
         peakRatio5 < peakRatio1 &                 # Rule 7
         peakRatio6 < peakRatio2 &                 # Rule 8
         peakRatioa < 0.5 &                        # Rule 9
         peakRatiob < 1 &                          # Rule 10
         peakRatioc < 1) {                         # Rule 11

        x$Filter[i] <- TRUE

      }

    }

  }

  return(x)

}

# Select (already identified) halogenated distributions. Remove misidentified
# peaks within the same distribution. Select the 5 most abundant distributions.
# Select the m/z window for each distribution.

SelectHalogenatedDistribution <- function(xFilter, x) { 
  
  # Remove misidentified peaks within the same distribution. Sometimes another
  # peak within the same distribution is identified. Only the maximum peak
  # within the distribution should be identified. The following code eliminates
  # any multiple identifications within the same distribution, although not
  # necessarily the maximum one. The maximum peak is identified again later when
  # plotting the distribution.
  
  xFilter$Diff <- c(diff(xFilter$mz), 100) # 100 is placeholder at the end
  xFilter <- xFilter[xFilter$Diff > 10, ] # exclude if within 10 mz units
  
  # Select the 5 most abundant distributions. Then return to the original order.
  if(nrow(xFilter) > 5) {
    
    xFilter <- head(xFilter[rev(order(xFilter$intensity)), ], n = 5) 
    xFilter <- xFilter[order(xFilter$mz), ]
    
  }
  
  # Select window of m/z around the experimental distributions and store in
  # distList.
  
  distList <- vector(mode = 'list')
  for (i in 1:nrow(xFilter)) {
    distList[[i]] <- x[x$mz %in% seq(from = xFilter$mz[i] - 10, to = xFilter$mz[i] + 10, by = 1), ]  
  }
  return(distList)
  
}

# Function to determine the matching theoretical distributions. Inputs are
# distList[[i]], tDistList, and tDistAll.

SimilarityScoreMatch <- function(xDist, tDistList, tDistAll) {
  
  resultStore <- data.frame(NULL)
  
  # Iterate through each theoretical distribution and calculate a similarity
  # score against the experimental spectrum.
  for (j in 1:length(tDistList)) {
    
    tDist <- tDistAll[tDistAll$DistributionLabel == tDistList[j], ] # current theoretical distribution
    
    # Align theoretical distribution to experimental distribution based on the
    # maximum peak (maxvalue). The theoretical distribution m/z window is
    # given the same size as the experimental m/z window.
    tDist$mz[tDist$percent == 100] <- xDist$mz[which.max(xDist$percentIntensity)]
    maxvalue <- xDist$mz[which.max(xDist$percentIntensity)]
    index <- which.max(tDist$mz)
    
    if(index > 1) {
      startValue <- maxvalue - (index - 1)
      endValue <- startValue + nrow(tDist) - 1
      tDist$mz <- seq(from = startValue, to = endValue, by = 1)
    }
    
    # Need special case when the max ion is the first one.
    if(index == 1) {
      startValue <- maxvalue
      endValue <- maxvalue + nrow(tDist) - 1
      tDist$mz <- seq(from = startValue, to = endValue, by = 1)
    }
    
    # Merge experimental and theoretical distributions by the m/z.
    cDist <- merge(xDist, tDist, by = 'mz', all.x = TRUE, all.y = FALSE)
    cDist$percent[is.na(cDist$percent)] <- 0
    cDist$DistributionLabel <- tDistList[j]
    
    # Calculate the similarity score (cosine similarity between the experimental
    # and theoretical intensities), add it to the data frame, and store in resultStore.
    u <- (cDist$intensity.x/max(cDist$intensity.x)) * 100
    v <- cDist$percent
    cDist$SimilarityScore <- as.vector((u %*% v) / (sqrt(sum(u^2)) * sqrt(sum(v^2))))
    resultStore <- rbind(resultStore, cDist)
    
  }
  
  return(resultStore)
  
}

Report <- function(spectrumFilename) {
  cat('Step 6: Making table for', spectrumFilename, '...\n')
  x <- read.csv(spectrumFilename, stringsAsFactors = FALSE)
  x$percentIntensity <- with(x, intensity / max(intensity) * 100)
  
  # Identify halogenated distributions
  x <- ApplyFilter(x = x)
  
  # Isolate halogenated distributions. xFilter contains only the ions identified
  # as halogenated.
  xFilter <- x[x$Filter == TRUE, ]
  
  # Need to skip if no halogenation was detected.
  if(nrow(xFilter) > 0) {
    distList <- SelectHalogenatedDistribution(xFilter, x)
    
    # Identify Distributions -------------------------------------------
    
    # Read in theoretical distributions
    tDistAll <- read.csv('Theoretical Distributions.csv', stringsAsFactors = FALSE)
    tDistAll$mz <- NA
    tDistList <- unique(tDistAll$DistributionLabel)
    bestMatchStore <- data.frame(NULL)
    
    # Iterate through experimental distributions to make an identification of
    # each one. bestMatchStore holds the identification information for output.
    for (i in 1:length(distList)) {
      
      xDist <- distList[[i]] 
      resultStore <- SimilarityScoreMatch(xDist, tDistList, tDistAll)
      
      # Determine the best match by the highest similarity score, and calculate the noise
      # of the experimental distribution.
      bestMatch <- resultStore[resultStore$SimilarityScore == max(resultStore$SimilarityScore), ]
      bestMatch$Noise <- with(bestMatch, round((abs(percent - percentIntensity) / percent) * 100))
      
      # Set up best match data frame.
      bestMatch$ID <- i
      bestMatch$Spectrum <- spectrumFilename
      bestMatchStore <- rbind(bestMatchStore, bestMatch)
      
    }
    
  }
  
  if(nrow(xFilter) > 0) {
    
    names(bestMatchStore) <- c('Expt_mz', 'Expt_Intensity', 'Expt_PercentIntensity', 'HalogenationFilter',
                               'Theory_Intensity', 'Theory_PercentIntensity', 'DistributionLabel',
                               'SimilarityScore', 'Noise', 'ID', 'Spectrum')
    
    bestMatchStore <- bestMatchStore[bestMatchStore$HalogenationFilter == TRUE &
                                       bestMatchStore$Theory_PercentIntensity != 0, ]
  
    return(bestMatchStore)
    
  } else {
    
    return(NA)
    
  }
  
}


ProcessTableReports <- function(spectraDirectory, inputTableName, sampleName, outputTableName) {
  spectra <- read.csv(inputTableName, stringsAsFactors = FALSE)
  spectra <- spectra$SpectrumFilename[spectra$Halogenated == TRUE]
  x <- do.call('rbind', lapply(paste(spectraDirectory, '/', spectra, '.csv', sep = ''), Report))
  x$Sample <- sampleName
  tmp <- strsplit(x$Spectrum, split = "/", fixed = TRUE)
  tmp1 <- sapply(tmp, "[[", 3)
  x$SpectrumFilename <- sub(".csv", "", tmp1, fixed = TRUE)
  write.csv(x, file = outputTableName, row.names = FALSE)
}

Function call to make tables and report progress.

cat("* Beginning Step 6...\n")
* Beginning Step 6...
tmp <- dir("Step 4 MSP to CSV") # "Sample 1 Spectra" "Sample 2 Spectra"
for (i in 1:length(tmp)) {
  sn <- sub(" Spectra", "", x = tmp[i], fixed = TRUE)
  ProcessTableReports(spectraDirectory = paste("Step 4 MSP to CSV/", tmp[i], sep = ""),
                      inputTableName = paste("Step 5 Halogenation Filter/Step 5 ", sn, ".csv", sep = ""),
                      sampleName = sn,
                      outputTableName = paste("Step 6 Additional Rules/Step 6 ", sn, ".csv", sep = ""))
}
Step 6: Making table for Step 4 MSP to CSV/Sample 1 Spectra/Spectrum 2.csv ...
Step 6: Making table for Step 4 MSP to CSV/Sample 1 Spectra/Spectrum 3.csv ...
Step 6: Making table for Step 4 MSP to CSV/Sample 2 Spectra/Spectrum 1.csv ...
Step 6: Making table for Step 4 MSP to CSV/Sample 2 Spectra/Spectrum 3.csv ...

Step 7: Apply Additional Rules for Final Results

Function to subset the table based on the additional rules. Produces one subsetted table with the final set of halogenated spectra and associated chromatographic information. The additional rules are:

  1. The matching theoretical distribution cannot be Cl1. This distribution frequently matches noise, and compounds containing only one Cl are not commonly observed in our samples.
  2. The similarity score of the best theoretical distribution match must be > 0.85.
  3. The spectrum must have more than one halogenated fragment.

SubsetTables <- function(inputFile, chromaTOF_file, sampleName, outputFile) {
  
  # Read the Step 6 table back in.
  x <- read.csv(inputFile, stringsAsFactors = FALSE) 
  
  # Apply the additional filters.
  x <- x[x$Theory_PercentIntensity != 0, ]
  x <- x[x$DistributionLabel != 'Cl1', ]
  x <- x[x$SimilarityScore > 0.85, ]
  
  # Select only the spectra with > 1 halogenated fragment per spectrum.
  y <- setNames(aggregate(Expt_mz ~ Spectrum, data = x, FUN = length), c('Spectrum', 'Count'))
  countFilter <- y$Spectrum[y$Count > 1]
  x <- x[x$Spectrum %in% countFilter, ]
  
  # Merge back ChromaTOF table information.
  
  # Assemble ChromaTOF table.
  chromaTOF_table <- read.csv(chromaTOF_file, stringsAsFactors = FALSE)
  chromaTOF_table$SpectrumFilename <- paste("Spectrum", chromaTOF_table$PeakNumber)
  chromaTOF_table$Sample <- sampleName
  
  # Write table of only spectra passing the additional filters.
  tmp <- x[, c('Sample', 'SpectrumFilename')]
  tmp <- tmp[!duplicated(tmp), ]
  result2 <- merge(tmp, chromaTOF_table, by = c('Sample', 'SpectrumFilename'))
  
  cat('Step 7: Processing additional rules', inputFile, '...\n')
  
  write.csv(result2, file = outputFile, row.names = FALSE)
}

Function call to make the tables of final results.

cat("* Beginning Step 7...\n")
* Beginning Step 7...
tmp <- dir("Step 6 Additional Rules")
for (i in 1:length(tmp)) {
  sn1 <- sub("Step 6 ", "", x = tmp[i], fixed = TRUE)
  sn2 <- sub(".csv", "", x = sn1, fixed = TRUE)
  SubsetTables(inputFile = paste("Step 6 Additional Rules/Step 6 ", sn2, ".csv", sep = ""),
               chromaTOF_file = paste("Step 2 Input Tables/", sn2, ".csv", sep = ""),
               sampleName = sn2,
               outputFile = paste("Step 7 Final Results/Step 7 ", sn2, ".csv", sep = ""))
}
Step 7: Processing additional rules Step 6 Additional Rules/Step 6 Sample 1.csv ...
Step 7: Processing additional rules Step 6 Additional Rules/Step 6 Sample 2.csv ...
cat("* Processing complete.")
* Processing complete.
# To extract code:
# library(knitr)
# purl("IdentifyHalogenationScript.Rmd")