Analytica Chimica Acta (v.490, #1-2)

Editorial CAC 2002 by Lutgarde Buydens; Sarah C. Rutan (1).

NMR-based metabonomic toxicity classification: hierarchical cluster analysis and k-nearest-neighbour approaches by Olaf Beckonert; Mary E. Bollard; Timothy M.D. Ebbels; Hector C. Keun; Henrik Antti; Elaine Holmes; John C. Lindon; Jeremy K. Nicholson (3-15).
The COnsortium for MEtabonomic Toxicology (COMET) project is constructing databases and metabolic models of drug toxicity using ca. 100,000 ¹H NMR spectra of biofluids from animals treated with model toxins. Mathematical models characterising the effects of toxins on endogenous metabolite profiles will enable rapid toxicological screening of potential drug candidates and discovery of novel mechanisms and biomarkers of specific types of toxicity. The metabolic effects and toxicity of 19 model compounds administered to rats in separate studies at toxic (high) and sub-toxic (low) doses were investigated. Urine samples were collected from control and dosed rats at 10 time points over 8 days and were subsequently analysed by 600 MHz ¹H NMR spectroscopy. In order to classify toxicity and to reveal similarities in the response of animals to different toxins, principal component analysis (PCA), hierarchical cluster analysis (HCA) and k-nearest-neighbour (kNN) classification were applied to the data from the high-dose studies to reveal dose- and time-related effects. Both PCA and HCA provided valuable overviews of the data, highlighting characteristic metabolic perturbations in the urine spectra between the four groups: controls (C), liver (L) toxins, kidney (K) toxins and other (O) treatments, and revealed further differences between subgroups of liver toxins. kNN analysis of the multivariate data using both leave-one-out (LOO) cross-validation and training and test-set (50:50) classification successfully predicted all the different toxin classes. The four treatment groups (control, liver, kidney and other) were predicted with success rates of 86, 85, 91 and 88% (training/test). In a study-by-study comparison, 81% of the samples were predicted into the correct toxin study (training/test).
This work illustrates the power and reliability of metabonomic data analysis using ¹H NMR spectroscopy together with chemometric techniques for the exploration and prediction of toxic effects in the rat.
Keywords: Principal component analysis; Partial-least-squares discriminant analysis; Hierarchical cluster analysis; k-Nearest-neighbour; Metabonomics; NMR; Toxicity prediction; Biofluid; Urine;
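The k-nearest-neighbour classification with leave-one-out cross-validation used in the study above can be sketched in a few lines. The toy two-class "metabolite profile" data and the choice k = 3 below are illustrative assumptions, not COMET data:

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = sorted((math.dist(t, x), y) for t, y in zip(train_X, train_y))
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

def loo_accuracy(X, y, k=3):
    """Leave-one-out cross-validation: each sample is predicted from all others."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)

# Two well-separated toy classes
X = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y = ["control", "control", "control", "toxin", "toxin", "toxin"]
print(loo_accuracy(X, y, k=3))  # well-separated classes -> 1.0
```

For a 50:50 train/test split as in the paper, the same `knn_predict` would simply be called with the held-out half as query points.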

Using computer simulations, we investigate the accuracy and precision of results obtained from diode array detector (DAD) and mass spectrometry (MS) data acquired after chromatographic separation. Special attention was given to simulations of multiple injections from a developing enzymatic reaction. These simulations result in three-way LC–DAD–MS kinetic data; LC–DAD and LC–MS data were also evaluated independently in this investigation. The noise characteristics of the MS detector prevent accurate determination of the individual reaction rate constants by the analysis method. Using the data from the DAD in combination with the MS detector results in improved estimation of the rate constants. The results also indicate that the higher resolving power of the MS information compensates for its lower signal-to-noise ratio compared with DAD data.
Keywords: Enzyme kinetics; LC–DAD–MS data; Computer simulations; Multivariate curve resolution–alternating least squares (MCR–ALS);
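The multivariate curve resolution–alternating least squares step named in the keywords can be sketched as below; the two-component bilinear data, Gaussian elution profiles, ramp spectra and noise level are invented for illustration and carry none of the paper's kinetic constraints:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated bilinear data D = C S^T: two components, elution profiles x spectra
t = np.linspace(0, 1, 50)
C_true = np.column_stack([np.exp(-((t - 0.3) / 0.08) ** 2),
                          np.exp(-((t - 0.6) / 0.08) ** 2)])
S_true = np.column_stack([np.linspace(1, 0, 30), np.linspace(0, 1, 30)])
D = C_true @ S_true.T + 1e-4 * rng.standard_normal((50, 30))

# MCR-ALS: alternately solve for spectra S and concentrations C,
# clipping to enforce non-negativity after each least-squares step
C = C_true + 0.1 * rng.standard_normal(C_true.shape)  # rough initial estimate
for _ in range(200):
    S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0, None)
    C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0, None)

residual = np.linalg.norm(D - C @ S.T) / np.linalg.norm(D)
print(f"relative residual: {residual:.4f}")
```

A kinetic hard-modelling variant would replace the unconstrained update of C by a fit of the rate-law profiles, which is how rate constants are extracted in practice.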

Instrumental spectra used for chemometric analysis are often too unwieldy to model, as many of the inputs do not contain important information. Several mathematical methods are used to reduce the number of inputs to only the significant ones. Artificial neural network (ANN) modeling suffers from difficulties in training models with a large number of inputs. However, using a non-random initial connection weight algorithm together with local minima avoidance and escape techniques can overcome these difficulties. Once the ANN model is trained, analysis of its connection weights can easily identify the more relevant inputs. The process of training the ANN model with the reduced input set and selecting the more relevant inputs can be repeated until a quasi-optimal, small set of inputs is identified. Two examples are presented: finding the minimal set of wavelengths in benchmark diesel fuel NIR spectra, and in spectra generated in a recent work modeling an “artificial nose” sensor array. In the last example, 1260 inputs were reduced to optimal sets of <10 inputs. Causal index calculation can analyze the influence of each selected wavelength on the predicted property. Some of the resulting minimal sets are not unique, depending on the ANN architecture used in the training. The accuracy of the resulting ANN models is usually better, and more robust, than that of the original large ANN model.
Keywords: Artificial neural networks; Chemometrics; Input selection; Microhotplate sensor array;

Mutual peak matching in a series of HPLC–DAD mixture analyses by Andrey Bogomolov; Michael McBrien (41-58).
One of the largest challenges in high performance liquid chromatography (HPLC) method development is the necessity of tracking the movement of peaks as separation conditions are changed. Peak increments are then used to build a mathematical model capable of minimizing the number of experiments in an optimization circuit. Method optimization for an unknown mixture is, moreover, complicated by the absence of any a priori information on component properties and retention times when direct signal assignment is not possible. At the same time, achieving maximum separation remains an important factor for successful identification or quantitation. In this case, the optimization may be based on assigning peaks of the same component, chosen from different experiments, to each other. In other words, mutual peak matching between the HPLC runs is required. A new method for mutual peak matching in a series of HPLC with diode array detector (HPLC–DAD) analyses of the same unknown mixture acquired at varying separation conditions has been developed. This approach, called mutual automated peak matching (MAP), does not require any prior knowledge of the mixture composition. Applying abstract factor analysis (AFA) and iterative key set factor analysis (IKSFA) to the augmented data matrix, the algorithm detects the number of mixture components and calculates the retention times of every individual compound in each of the input chromatograms. Every candidate component is then validated by target testing for presence in each HPLC run to provide quantitative criteria for the detection of “missing” peaks and non-analyte components as well as for confirming successful matches. The matching algorithm by itself does not perform full curve resolution. However, its output may serve as a good initial estimate for further modeling.
A common set of UV-Vis spectra of pure components can be obtained, as well as their corresponding concentration profiles in separate runs, by means of alternating least-squares multivariate curve resolution (ALS MCR), resulting in reconstruction of overlapped peaks. The algorithms were programmed in MATLAB® and tested on a number of sets of simulated data. Possible ways to improve the stability of results, reduce calculation time, and minimize operator interaction are discussed. The technique can be used to optimize HPLC analysis of a complex mixture without preliminary identification of its components.
Keywords: Peak matching; Multivariate data analysis; Self-modeling curve resolution; HPLC;

Real-time data analysis is important in many applications. However, many chemometric algorithms have difficulty processing data in real-time. A novel real-time two-dimensional wavelet compression (WC2) algorithm has been developed to compress data as they are acquired from analytical instrumentation. The WC algorithm was enhanced so that data with an arbitrary number of points can be compressed, avoiding truncation or padding to a dyadic number. After compression, the noise level is reduced while useful chemical information is retained. A modified simple-to-use interactive self-modeling mixture analysis (SIMPLISMA) algorithm was applied to the wavelet-compressed data and the model was transformed back to the original representation while leaving the data compressed. The reduced size of the wavelet-compressed data furnished a faster implementation of SIMPLISMA that facilitates real-time acquisition. This real-time WC2-SIMPLISMA algorithm was applied to the rapid identification of explosives by ion mobility spectrometry (IMS). SIMPLISMA-resolved concentration profiles and component spectra were displayed simultaneously while the data were acquired from an ion mobility spectrometer with a LabVIEW virtual instrument (VI).
Keywords: Real-time; Multi-dimensional; Wavelet compression; SIMPLISMA; FIR; Ion mobility spectrometry; Explosives; Chemometrics;
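The arbitrary-length (non-dyadic) wavelet compression idea can be sketched with a one-level Haar transform that simply carries an odd trailing point through unchanged; the averaging/differencing form and the toy peak are illustrative assumptions, not the paper's exact filter bank:

```python
def haar_level(signal):
    """One level of the Haar transform; an odd trailing point is carried
    through unchanged, so no padding to a dyadic length is needed."""
    approx, detail = [], []
    for i in range(0, len(signal) - 1, 2):
        approx.append((signal[i] + signal[i + 1]) / 2)
        detail.append((signal[i] - signal[i + 1]) / 2)
    if len(signal) % 2:
        approx.append(signal[-1])
    return approx, detail

def compress(signal, levels=2):
    """Keep only the coarse approximation; fine details (mostly noise) are dropped."""
    for _ in range(levels):
        signal, _ = haar_level(signal)
    return signal

peak = [0, 0, 1, 4, 9, 4, 1, 0, 0]      # 9 points: not a dyadic length
print(compress(peak, levels=1))          # -> [0.0, 2.5, 6.5, 0.5, 0]
```

Downstream algorithms such as SIMPLISMA can then operate on the shorter approximation vector, which is what yields the real-time speed-up described.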

In this study, an algorithm for growing neural networks is proposed. Starting with an empty network, the algorithm reduces the error of prediction by successively inserting connections and neurons. The type of network element and the location at which to insert it are determined by the maximum reduction of the error of prediction. The algorithm builds non-uniform neural networks without any constraints on size and complexity. The algorithm is additionally implemented in two frameworks, which make very efficient use of a data set limited in size, resulting in a more reproducible variable selection and network topology. The algorithm is applied to a data set of binary mixtures of the refrigerants R22 and R134a, which were measured by a surface plasmon resonance (SPR) device in a time-resolved mode. Compared with common static neural networks, all implementations of the growing neural networks show better generalization abilities, resulting in low relative errors of prediction of 0.75% for R22 and 1.18% for R134a using unknown data.
Keywords: Growing neural networks; Neural network topology; Variable selection; Refrigerants; Time-resolved measurements;

Rare-earth glass reference materials for near-infrared spectrometry: sources of x-axis location variability by David L. Duewer; Steven J. Choquette; Lindsey O’Neal; James J. Filliben (85-98).
The National Institute of Standards and Technology (NIST) recently introduced two optical filter standards for wavelength/wavenumber calibration of near-infrared (NIR) spectrometers. Standard Reference Material®s (SRM®s) 2035 and 2065 were fabricated in lots of ≈100 units each from separate melts of nominally identical rare-earth glass. Since individual filter certification is extremely time-consuming and thus costly, economic production of these SRMs required the ability to batch certify band locations. Given the specification that the combined uncertainty for the location of the bands in a given filter should be ≤0.2 cm⁻¹, rigorous evaluation of material heterogeneity was required to demonstrate the adequacy of batch certification for these materials. Among-filter variation in measured band locations convolves any influence of material heterogeneity with that of environmental, procedural, and instrumental artifacts. While univariate analysis of variance established band-specific heterogeneity upper bounds, it did not provide quantitative descriptions of the other possible sources for the observed measurement variability. Principal components analysis enabled both the identification and isolation of the most important NIR band location variances among the SRM 2065 filters. After correction for these variance sources, the upper bound on the material heterogeneity was determined to be 0.03 cm⁻¹ for all bands. Since this is a small part of the measurement uncertainty, we conclude that batch analysis provides an acceptable certification approach for these and similarly fabricated rare-earth glass reference materials.
Keywords: Material homogeneity; Optical filters; Principal components analysis (PCA); Spectrometer x-axis calibration; Temperature correction;
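The principal components analysis used to isolate band-location variance sources can be sketched via the singular value decomposition; the simulated "filters x band locations" matrix with one shared systematic shift is invented for illustration, not SRM data:

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the mean-centred matrix: returns scores, loadings
    and the fraction of total variance captured by each component."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / np.sum(s ** 2)
    return (U * s)[:, :n_components], Vt[:n_components], var[:n_components]

rng = np.random.default_rng(1)
# 40 simulated filters x 6 band locations: one shared systematic shift + noise
shift = rng.standard_normal(40)
X = np.outer(shift, [1.0, 0.9, 1.1, 1.0, 0.8, 1.2]) + 0.05 * rng.standard_normal((40, 6))
scores, loadings, var = pca(X, 2)
print(f"variance captured by PC1: {var[0]:.3f}")
```

Correcting for an identified variance source amounts to subtracting the corresponding rank-one term (scores times loadings) before re-estimating the residual heterogeneity.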

Analyses of three-way data from equilibrium and kinetic investigations by Raylene Dyson; Marcel Maeder; Yorck-Michael Neuhold; Graeme Puxty (99-108).
In kinetic or equilibrium investigations it is common to measure two-way multiwavelength data, e.g. absorption spectra as a function of time or reagent addition. Often it is advantageous to acquire experimental data at various initial conditions or even on different instruments. A collection of these measurements can be arranged in three-dimensional arrays, which can be analysed as a whole under the assumption of a superimposed function, e.g. a kinetic model, and/or common properties of the subsets, such as molar absorptivity. As we show on selected formation equilibria (Zn2+/phen) and kinetic studies (Cu2+/cyclam) from our own research, an appropriate combination of multivariate data can lead to an improved analysis of the investigated systems.
Keywords: Metal complexation; Three-way data; Kinetics; Equilibria;

Toxicity classification from metabonomic data using a density superposition approach: ‘CLOUDS’ by Tim Ebbels; Hector Keun; Olaf Beckonert; Henrik Antti; Mary Bollard; Elaine Holmes; John Lindon; Jeremy Nicholson (109-122).
Predicting and avoiding the potential toxicity of candidate drugs is of fundamental importance to the pharmaceutical industry. The consortium for metabonomic toxicology (COMET) project aims to construct databases and metabolic models of drug toxicity using ca. 100,000 600 MHz ¹H NMR spectra of biofluids from laboratory rats and mice treated with model toxic compounds. Chemometric methods are being used to characterise the time-related and dose-specific effects of toxins on the endogenous metabolite profiles. Here we present a probabilistic approach to the classification of a large data set of COMET samples using Classification Of Unknowns by Density Superposition (CLOUDS), a novel non-neural implementation of a classification technique developed from probabilistic neural networks. NMR spectra of urine from rats from 19 different treatment groups, collected over 8 days, were processed to produce a data matrix with 2844 samples and 205 spectral variables. The spectra were normalised to account for gross concentration differences in the urine, and regions corresponding to non-endogenous metabolites (0.4% of the data) were treated as missing values. Modelling the data according to organ of effect (control, liver, kidney or other organ), with a 50/50 train/test set split, over 90% of the test samples were classified as belonging to the correct group. In particular, samples from liver and kidney treatments were classified with 77 and 90% success, respectively, with only a 2% misclassification rate between these classes. Further analysis of the data, counting each of the 19 treatment groups as separate classes, resulted in a mean success rate across groups of 74%. Finally, as a severe test, the data were split into 88 classes, each representing a particular toxin at a particular time point.
Fifty-four percent of the spectra from non-control samples were classified correctly, a particularly successful result compared with the null success rate of ∼1% expected from random class assignment. The CLOUDS technique has advantages when modelling complex multi-dimensional distributions, giving a probabilistic rather than absolute class description of the data, and is particularly amenable to the inclusion of prior knowledge such as uncertainties in the data descriptors. This work shows that it is possible to construct viable and informative models of metabonomic data using the CLOUDS methodology, delineating the whole time course of toxicity. These models will be useful in building hybrid expert systems for predicting toxicology, which are the ultimate goal of the COMET project.
Keywords: Probabilistic classification; CLOUDS; Probabilistic neural networks; Metabonomics; Toxicity prediction;
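The density-superposition idea behind CLOUDS (superpose a kernel on every training sample of a class, then normalise the class densities into probabilities) might be sketched as below; the Gaussian kernel, its width and the toy two-class data are assumptions, not the authors' exact formulation:

```python
import math

def class_density(x, samples, width=1.0):
    """Superpose a Gaussian kernel on every training sample of a class;
    the summed density acts as that class's likelihood at x."""
    total = 0.0
    for s in samples:
        d2 = sum((a - b) ** 2 for a, b in zip(x, s))
        total += math.exp(-d2 / (2 * width ** 2))
    return total / len(samples)

def classify(x, classes, width=1.0):
    """Probabilistic assignment: normalised densities over all classes."""
    dens = {name: class_density(x, samples, width) for name, samples in classes.items()}
    z = sum(dens.values())
    return {name: d / z for name, d in dens.items()}

classes = {
    "control": [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    "liver":   [(3.0, 3.0), (3.2, 2.9), (2.8, 3.1)],
}
probs = classify((0.1, 0.2), classes, width=0.5)
print(max(probs, key=probs.get))  # -> control
```

Because the output is a probability per class rather than a hard label, prior knowledge can be folded in simply by weighting the densities before normalisation.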

Factor analytical approaches for evaluating groundwater trace element chemistry data by I.M. Farnham; K.H. Johannesson; A.K. Singh; V.F. Hodge; K.J. Stetzenbach (123-138).
The multivariate statistical techniques principal component analysis (PCA), Q-mode factor analysis (QFA), and correspondence analysis (CA) were applied to a dataset containing trace element concentrations in groundwater samples collected from a number of wells located downgradient from the potential nuclear waste repository at Yucca Mountain, Nevada. PCA results reflect the similarities in the concentrations of trace elements in the water samples resulting from different geochemical processes. QFA results reflect similarities in the trace element compositions, whereas CA reflects similarities in the trace elements that are dominant in the waters relative to all other groundwater samples included in the dataset. These differences are mainly due to the ways in which data are preprocessed by each of the three methods. The highly concentrated, and thus possibly more mature (i.e. older), groundwaters are separated from the more dilute waters using principal component 1 (PC 1). PC 2, as well as dimension 1 of the CA results, describe differences in the trace element chemistry of the groundwaters resulting from the different aquifer materials through which they have flowed. Groundwaters thought to be representative of those flowing through an aquifer composed dominantly of volcanic rocks are characterized by elevated concentrations of Li, Be, Ge, Rb, Cs, and Ba, whereas those associated with an aquifer dominated by carbonate rocks exhibit greater concentrations of Ti, Ni, Sr, Rh, and Bi. PC 3, and to a lesser extent dimension 2 of the CA results, show a strong monotonic relationship with the percentage of As(III) in the groundwater suggesting that these multivariate statistical results reflect, in a qualitative sense, the oxidizing/reducing conditions within the groundwater.
Groundwaters that are relatively more reducing exhibit greater concentrations of Mn, Cs, Co, Ba, Rb, and Be, and those that are more oxidizing are characterized by greater concentrations of V, Cr, Ga, As, W, and U.
Keywords: Correspondence analysis; Principal component analysis; Q-mode factor analysis; Trace element chemistry;

A novel approach for quantification of chemical vapor effluents in stack plumes using infrared hyperspectral imaging is presented and examined. The algorithms use a novel application of the extended mixture model to provide estimates of background clutter in the on-plume pixel. These estimates are then used iteratively to improve the quantification. The final step in the algorithm employs either an extended least-squares (ELS) or generalized least-squares (GLS) procedure. It was found that the GLS weighting procedure generally performed better than ELS, but the two performed similarly when the analyte spectra had relatively narrow features. The algorithms require estimates of the atmospheric radiance and transmission from the target plume to the imaging spectrometer and an estimate of the plume temperature. However, estimates of the background temperature and emissivity are not required, which is a distinct advantage. The algorithm effectively provides a local estimate of the clutter, and an error analysis shows that it can provide superior quantification over approaches that model the background clutter in a more global sense. It was also found that the estimation error depended strongly on the net analyte signal for each analyte, a quantity that is scenario-specific.
Keywords: Chemometrics; Remote sensing; Hyperspectral imaging; Quantification; Generalized least-squares; Extended least-squares;

The aim of this study is to identify relationships between volatile organic components (VOCs) and transient high ozone formation in the Houston area. Ozone is not emitted to the atmosphere directly but is formed by chemical reactions in the atmosphere. In Houston, short-term (1 h) sharp increases in ozone are observed, followed by a rapid decrease back to typical concentrations. Automatic gas chromatographs (GCs) operated at several sites cryogenically collect VOCs over an hour, after which the compounds are flash-evaporated into the GC for analysis. Chromatographic data for more than 65 VOCs are stored in analysis report text files. A program has been developed to read the amount of each component in the measurements, generating a data set that includes the concentration of each VOC for each hourly sample. A subset of the data corresponding to the period of the positive ozone transient is selected, and these data are used in the data mining (DM) process. Based on a chemical mass balance (CMB) analysis, a linear model was established between the subset of the VOC data and the positive ozone transient. Non-negative least squares (NNLS) was used to calculate the regression coefficients of the VOCs that have the most significant positive relationship to the positive ozone transient. The results show that more attention should be paid to several unknown VOCs, which have significant relationships to transient high ozone formation.
Keywords: Data mining; Volatile organic components; Transient high ozone formation;
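The non-negative least squares step can be sketched with a simple projected-gradient solver; this is a minimal stand-in, not the authors' implementation, and the small "source profile" matrix and coefficients are invented:

```python
import numpy as np

def nnls_pg(A, b, iters=5000):
    """Non-negative least squares by projected gradient descent:
    minimise ||Ax - b||^2 subject to x >= 0."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A.T @ A, 2)  # 1 / Lipschitz constant
    for _ in range(iters):
        x = np.clip(x - step * A.T @ (A @ x - b), 0, None)
    return x

# Toy VOC "source profiles" (columns) and an observed precursor mix
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [1.0, 1.0, 0.0]])
b = A @ np.array([2.0, 0.0, 3.0])  # true coefficients, one inactive source
x = nnls_pg(A, b)
print(np.round(x, 3))  # close to [2, 0, 3]
```

The non-negativity constraint is what makes the recovered coefficients interpretable as (never negative) source contributions in the CMB-style model.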

Gy sampling theory in environmental studies by Robert W. Gerlach; John M. Nocerino; Charles A. Ramsey; Brad C. Venner (159-168).
Sampling can be a significant source of error in the measurement process. The characterization and cleanup of hazardous waste sites require data that meet site-specific levels of acceptable quality if scientifically supportable decisions are to be made. In support of this effort, the US Environmental Protection Agency (EPA) is investigating methods that relate sample characteristics to analytical performance. Predicted uncertainty levels allow appropriate study design decisions to be made, facilitating more timely and less expensive evaluations. Gy sampling theory can predict a significant fraction of sampling error when certain conditions are met. We report on several controlled studies of subsampling procedures to evaluate the utility of Gy sampling theory applied to laboratory subsampling practices. Several sample types were studied, and both analyte- and non-analyte-containing particles were shown to play important roles affecting the measured uncertainty. Gy sampling theory was useful in predicting minimum uncertainty levels provided the theoretical assumptions were met. Predicted fundamental errors ranged from 46 to 68% of the total measurement variability. The study results also showed that sectorial splitting outperformed incremental sampling for simple model systems and suggested that sectorial splitters divide each size fraction independently. Under the limited conditions tested in this study, incremental sampling with a spatula produced biased results when sampling particulate matrices with grain sizes of about 1 mm.
Keywords: Particulates; Sampling; Subsampling; Representative; Heterogeneous; Fundamental error; Pierre Gy;
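Gy's fundamental-error prediction reduces to a short formula; the factor values used below (mineralogical, shape, granulometric and liberation factors, and the sample/lot masses) are illustrative defaults, not the study's parameters:

```python
def fundamental_error_rsd(d_cm, mass_sample_g, mass_lot_g, c=2.7, f=0.5, g=0.25, l=1.0):
    """Relative standard deviation of Gy's fundamental sampling error:
    s^2 = c*l*f*g*d^3 * (1/Ms - 1/ML), where d is the nominal top particle
    size (cm), c the mineralogical factor (g/cm^3), f the shape factor,
    g the granulometric factor and l the liberation factor."""
    var = c * l * f * g * d_cm ** 3 * (1 / mass_sample_g - 1 / mass_lot_g)
    return var ** 0.5

# 1 mm particles, a 10 g subsample drawn from a 1 kg lot (illustrative values)
rsd = fundamental_error_rsd(d_cm=0.1, mass_sample_g=10, mass_lot_g=1000)
print(f"{100 * rsd:.1f}% relative standard deviation")
```

The cubic dependence on particle size is why grinding before subsampling is the standard way to drive the fundamental error below a target uncertainty.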

Classical least squares transformations of sensor array pattern vectors into vapor descriptors by Jay W. Grate; Barry M. Wise; Neal B. Gallagher (169-184).
A new method of processing multivariate response data to extract chemical information has been developed. Sensor array response patterns are transformed into a vector containing values for solvation parameter descriptors of the detected vapor’s properties. These results can be obtained by using a method similar to classical least squares (CLS), and equations have been derived for mass- or volume-transducing sensors. Polymer-coated acoustic wave devices are an example of mass-transducing sensors. However, some acoustic wave sensors, such as polymer-coated surface acoustic wave (SAW) devices give responses resulting from both mass-loading and decreases in modulus. The latter effect can be modeled as a volume effect. In this paper, we derive solutions for obtaining descriptor values from arrays of mass-plus-volume-transducing sensors. Simulations were performed to investigate the effectiveness of these solutions and compared with solutions for purely mass-transducing sensor arrays. It is concluded that this new method of processing sensor array data can be applied to SAW sensor arrays even when the modulus changes contribute to the responses. The simulations show that good estimations of vapor descriptors can be obtained by using a closed form estimation approach that is similar to the closed form solution for purely mass-transducing sensor arrays. Estimations can be improved using a nonlinear least squares optimization method. The results also suggest ways to design SAW arrays to obtain the best results, either by minimizing the volume sensitivity or matching the volume sensitivities in the array.
Keywords: Sensor array; Surface acoustic wave (SAW); Chemometric; Vapor descriptors;

A simplified photon time-of-flight (TOF) instrument based on a nanosecond rise-time diode laser at 635 nm was used for the quantification of optical properties of samples. A series of transmittance photon time-of-flight measurements were acquired from absorbing/scattering Intralipid™ samples of known composition (0 < μa < 0.0014 mm⁻¹; 13 < μs < 24 mm⁻¹). Time-of-flight distributions were analyzed using the Haar transform, with selection of the most parsimonious set of wavelets by genetic algorithm optimization. Results showed that the scattering coefficient could be estimated with a coefficient of variation (CV) of 4.4% and r² = 0.95 using wavelets of frequency up to 400 MHz. Absorption coefficients were estimated with a CV of 6.9% and r² = 0.99 using the steady-state intensity of blank- and scatter-corrected data. Furthermore, it was shown that quantification using simplified electronics can estimate scattering to within 7.2% (r² = 0.88) and absorption with an error of 8.3% (r² = 0.99). The above findings suggest that a simplified instrument based on a pulsed laser diode and low-frequency switches could be developed to quantify absorption in highly scattering media.
Keywords: Scattering media; Time-correlated single photon counting; Laser diode; Haar transform; Optical properties;

The selectivity of high performance liquid chromatography (HPLC) separations is increased using a parallel column configuration. In this system, an injected sample is first split between two HPLC columns that provide complementary separations. The effluent from the two columns is recombined prior to detection with a single multiwavelength absorbance detector. Complementary stationary phases are used so that each chemical component produces a detected concentration profile consisting of two peaks. A parallel column configuration, when coupled with multivariate detection, provides increased chemical selectivity relative to a single column configuration with the same multivariate detection. This enhanced selectivity is achieved by doubling the number of peaks in the chromatographic dimension while keeping the run time constant. Unlike traditional single column separation methodology, the parallel column system sacrifices chromatographic resolution while actually increasing the chemical selectivity, thus allowing chemometric data analysis methods to mathematically resolve the multivariate chromatographic data. The parallel column system can be used to reduce analysis times for partially resolved peaks and simplify initial method development as well as provide a more robust methodology if and when subsequent changes in the sample matrix occur (such as when new interferences show up in subsequent samples). Here, a mixture of common aromatic compounds was separated with this system and analyzed using the generalized rank annihilation method (GRAM). Analytes that overlapped significantly on both of the stationary phases used (ZirChrom PBD and CARB) in the traditional single-column format were successfully quantified, with an R.S.D. of typically 2%, when the same stationary phases were used in the parallel column format.
These results indicate that a parallel column system should substantially improve the chemical selectivity and quantitative precision of the analysis relative to a single-column instrument.
Keywords: Parallel column; HPLC; Selectivity; Generalized rank annihilation method; Complementary stationary phases;

Hybrid genetic algorithm–tabu search approach for optimising multilayer optical coatings by J.A. Hageman; R. Wehrens; H.A. van Sprang; L.M.C. Buydens (211-222).
Constructing multilayer optical coatings (MOCs) is a difficult large-scale optimisation problem due to the enormous size of the search space. In the present paper, a new approach for designing MOCs is presented using genetic algorithms (GAs) and tabu search (TS). In this approach, it is not necessary to specify how many layers will be present in a design; only a maximum needs to be defined. As it is generally recognised that the existence of specific repeating blocks is beneficial for a design, a specific GA representation of a design is used which promotes the occurrence of repeating blocks. Solutions found by GAs are improved by a new refinement method built on TS, a global optimisation method loosely based on ideas from artificial intelligence. The improvements are demonstrated by creating a visible-transmitting/infrared-reflecting filter with a wide variety of materials.
Keywords: Optimisation; Genetic algorithms; Tabu search; Multilayer optical coatings;

High-speed gas chromatographic separations with diaphragm valve-based injection and chemometric analysis as a gas chromatographic “sensor” by Janiece L. Hope; Kevin J. Johnson; Marianne A. Cavelti; Bryan J. Prazen; Jay W. Grate; Robert E. Synovec (223-230).
A high-speed gas chromatography system, the gas chromatographic sensor (GCS), is developed and evaluated. The GCS combines fast separations and chemometric analysis to produce an instrument capable of high-speed, high-throughput screening and quantitative analysis of complex chemical mixtures on a time scale similar to that of typical chemical sensors. The GCS was evaluated with 28 test mixtures consisting of 15 compounds from four chemical classes: alkanes, ketones, alkyl benzenes, and alcohols. The chromatograms are on the order of one second in duration, which is considerably faster than the traditional use of gas chromatography. While complete chromatographic separation of each analyte peak is not sought, chemical information is readily extracted through chemometric data analysis, and quantification of the samples is achieved in considerably less time than with conventional gas chromatography. Calibration models to predict the percent volume content of either alkanes or ketones were constructed using partial least squares (PLS) regression on calibration sets consisting of five replicate GCS runs of six different samples. The percent volume content of the alkane and ketone chemical classes was predicted on five replicate runs of the 22 remaining samples, ranging from 0 to 50 or 60% depending on the class. Root mean square errors of prediction were 2–3% relative to the mean percent volume values for either alkane or ketone prediction models, depending on the samples chosen for the calibration set of that model. The alkyl benzenes and alcohols present in the calibration sets or samples were treated as variable background interference. It is anticipated that the GCS will eventually be used to rapidly sample and directly analyze industrial processes or for the high-throughput analysis of batches of samples.
Keywords: Gas chromatography; Chemometrics; Multivariate; Partial least squares; High-speed; Sensor;
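The partial least squares regression underlying the calibration models can be sketched with a basic NIPALS PLS1 implementation; the simulated 30-sample, 8-variable data set and its coefficients are invented for illustration, not GCS data:

```python
import numpy as np

def pls1(X, y, n_comp):
    """PLS1 regression via NIPALS with deflation; X and y must be centred.
    Returns the regression vector b so that yhat = Xnew @ b."""
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)          # weight vector
        t = Xk @ w                          # score vector
        p = Xk.T @ t / (t @ t)              # X loading
        W.append(w); P.append(p); q.append(yk @ t / (t @ t))
        Xk = Xk - np.outer(t, p)            # deflate X
        yk = yk - q[-1] * t                 # deflate y
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 8))
y = X @ np.array([1.0, -2.0, 0, 0, 0.5, 0, 0, 0]) + 0.01 * rng.standard_normal(30)
Xc, yc = X - X.mean(axis=0), y - y.mean()
b = pls1(Xc, yc, n_comp=4)
rmse = np.sqrt(np.mean((yc - Xc @ b) ** 2))
print(f"training RMSE: {rmse:.3f}")
```

Because PLS components are chosen for covariance with y, unmodelled interferents (here, the alkyl benzenes and alcohols) can be left in the data as background variation rather than resolved explicitly.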

Assessment of techniques for DOSY NMR data processing by R. Huo; R. Wehrens; J. van Duynhoven; L.M.C. Buydens (231-251).
Diffusion-ordered spectroscopy (DOSY) NMR is based on a pulse-field gradient spin-echo NMR experiment in which components experience diffusion. Consequently, the signal of each component decays at a different rate as the gradient strength increases, yielding a bilinear NMR data set for a mixture. By calculating the diffusion coefficient for each component, it is possible to obtain a two-dimensional NMR spectrum: one dimension for the conventional chemical shift and the other for the diffusion coefficient. Notably, this two-dimensional NMR experiment allows a non-invasive “chromatography” that yields the pure spectrum of each component, providing a possible alternative to LC-NMR, which is more expensive and time-consuming. Potential applications of DOSY NMR include the identification of components and impurities in complex mixtures, such as body fluids or reaction mixtures, and technical or commercial products, e.g. those comprising polymers or surfactants. Data processing is the most important step in interpreting DOSY NMR. Single channel methods and multivariate methods have been proposed for the data processing, but all of them have difficulties when applied to real-world cases. The big challenge arises when dealing with more complex samples, e.g. components with small differences in diffusion coefficients or severe overlap in the chemical shift dimension. Two single channel methods, SPLMOD and continuous diffusion coefficient (CONTIN), and two multivariate methods, the direct exponential curve resolution algorithm (DECRA) and multivariate curve resolution (MCR), are critically evaluated using simulated and real DOSY data sets. The assessments in this paper indicate that DOSY data processing can be improved by applying iterative principal component analysis (IPCA) followed by MCR-alternating least squares (MCR-ALS).
Keywords: DOSY NMR; Diffusion NMR; SPLMOD; CONTIN; Multivariate curve resolution; Alternating least squares; Factor analysis; DECRA;
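A minimal numeric sketch (not from the paper) of the single-channel view of DOSY data: each spectral channel decays exponentially with the squared gradient strength, and a log-linear least-squares fit recovers the diffusion coefficient. All values are arbitrary illustration units; SPLMOD and CONTIN fit far more general multi-exponential models than this.

```python
import numpy as np

# Each DOSY channel decays as I(g) = I0 * exp(-D * k * g^2) (Stejskal-Tanner
# form; k, collecting the gradient pulse constants, is folded into g^2 here).
rng = np.random.default_rng(0)
g2 = np.linspace(0.0, 1.0, 16)      # squared gradient strengths (arbitrary units)
D_true = 2.5                        # diffusion coefficient (arbitrary units)
signal = np.exp(-D_true * g2) + rng.normal(0, 1e-3, g2.size)

# log-linear least-squares fit: log I = log I0 - D * g^2
slope, intercept = np.polyfit(g2, np.log(signal), 1)
D_est = -slope
print(round(D_est, 2))
```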

Multivariate resolution of NMR labile signals by means of hard- and soft-modelling methods by Joaquim Jaumot; Montserrat Vives; Raimundo Gargallo; Romà Tauler (253-264).
One of the difficulties frequently encountered when studying acid–base equilibria with NMR spectroscopy is the labile behaviour of the measured signal, which hinders the application of bilinear multivariate data analysis methods. In this work, a mathematical transformation is proposed for converting NMR labile signals into inert signals, which makes it possible to apply multivariate data analysis methods based on bilinear data models. The procedure has been applied to the analysis of NMR data corresponding to the acid–base equilibria of the nucleotides dCMP and dGMP. Both hard-modelling (EQUISPEC) and soft-modelling (MCR-ALS) approaches have been applied to the analysis and resolution of the transformed bilinear NMR data matrices.
Keywords: Curve resolution; Labile–inert; NMR; Protonation equilibria; pK a determination; Nucleotides;
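An illustration (not the paper's transformation) of why labile signals break bilinearity: under fast exchange a proton gives one observed chemical shift equal to the mole-fraction-weighted average of the species shifts, rather than separate signals with species-proportional intensities. The pKa and limiting shifts below are hypothetical.

```python
import numpy as np

# Fast-exchange averaging: the observed shift of a labile proton is the
# mole-fraction-weighted average of the shifts of the protonated (HA) and
# deprotonated (A-) forms, so it is not bilinear in concentration x spectrum.
pKa = 6.8                      # hypothetical protonation constant
delta_HA, delta_A = 8.2, 7.6   # hypothetical limiting shifts (ppm)
pH = np.linspace(4, 10, 7)
x_A = 1.0 / (1.0 + 10 ** (pKa - pH))          # mole fraction of A-
delta_obs = x_A * delta_A + (1 - x_A) * delta_HA
print(np.round(delta_obs, 3))
```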

Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling by Hector C. Keun; Timothy M.D. Ebbels; Henrik Antti; Mary E. Bollard; Olaf Beckonert; Elaine Holmes; John C. Lindon; Jeremy K. Nicholson (265-276).
Variable scaling alters the covariance structure of data, affecting the outcome of multivariate analysis and calibration. Here we present a new method, variable stability (VAST) scaling, which weights each variable according to a metric of its stability. The beneficial effect of VAST scaling is demonstrated for a data set of 1 H NMR spectra of urine acquired as part of a metabonomic study into the effects of unilateral nephrectomy in an animal model. The application of VAST scaling improved the class distinction and predictive power of partial least squares discriminant analysis (PLS-DA) models. The effects of other data scaling and pre-processing methods, such as orthogonal signal correction (OSC), were also tested. VAST scaling produced the most robust models in terms of class prediction, outperforming OSC in this respect. As a result, the subtle but consistent metabolic perturbation caused by unilateral nephrectomy could be accurately characterised despite the presence of much greater biological differences caused by normal physiological variation. VAST scaling is thus an interpretable, robust and easily implemented data treatment for enhancing multivariate data analysis.
Keywords: Orthogonal signal correction; Variable scaling; Coefficient of variation; Metabonomics; Metabolomics; Partial least squares discriminant analysis; Variable stability; Data pre-processing; Biofluid NMR;
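A sketch of VAST scaling as described in the abstract, with the stability metric assumed to be the inverse coefficient of variation (mean/std): autoscale each variable, then multiply by its stability so that variables with low relative variation stay influential and unstable ones are down-weighted.

```python
import numpy as np

# VAST sketch: autoscaling followed by weighting with a stability metric,
# here assumed to be the inverse coefficient of variation (mean/std).
def vast_scale(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    autoscaled = (X - mean) / std
    stability = mean / std          # inverse CV; large for stable variables
    return autoscaled * stability

rng = np.random.default_rng(1)
stable = 10.0 + rng.normal(0, 0.1, (20, 1))   # low CV -> kept influential
noisy = 1.0 + rng.normal(0, 0.8, (20, 1))     # high CV -> down-weighted
Xv = vast_scale(np.hstack([stable, noisy]))
print(Xv.std(axis=0, ddof=1))  # post-scaling spread reflects stability
```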

This work examines the factor analysis of matrices in which the proportion of signal to noise differs greatly between columns (variables). Such matrices often occur when measuring elemental concentrations in environmental samples. In the strongest variables, the error level may be a few percent; for the weakest variables, the data may consist almost entirely of noise. This paper demonstrates that the proper scaling of weak variables is critical: if a few weak variables are given too high a weight in the analysis, the errors in the computed factors grow, possibly obscuring the weakest factor(s) through the increased noise level. The mathematical explanation of this phenomenon is explored by means of Givens rotations. It is shown that the customary form of principal component analysis (PCA), based on autoscaling the original data, is generally very ineffective because autoscaling gives weak variables much too high a weight. Practical advice is given for dealing with noisy data in both PCA and positive matrix factorization (PMF).
Keywords: Principal component analysis; Positive matrix factorization; Signal-to-noise; Scaling of variables; Autoscaling; Weak variables; Givens rotations;
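A small simulated demonstration (not from the paper) of the central point: autoscaling inflates pure-noise variables to unit variance, diluting the signal structure, whereas scaling by a known error level keeps weak variables down-weighted. The 0.05 error level is an assumption of the simulation.

```python
import numpy as np

# Autoscaling gives a noise-only variable the same unit variance as a strong
# signal variable; error-based scaling preserves the signal-to-noise ranking.
rng = np.random.default_rng(2)
n = 200
factor = rng.normal(0, 1, n)
strong = np.column_stack([factor * w + rng.normal(0, 0.05, n) for w in (3, 2, 1.5)])
weak = rng.normal(0, 0.05, (n, 3))             # almost entirely noise
X = np.hstack([strong, weak])

def first_pc_share(Xs):
    # fraction of total variance captured by the first principal component
    s = np.linalg.svd(Xs - Xs.mean(axis=0), compute_uv=False)
    return (s[0] ** 2) / np.sum(s ** 2)

auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
err = X / 0.05                                  # scale by the (known) error level
print(first_pc_share(auto), first_pc_share(err))
```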

To date, few efforts have been made to take simultaneous advantage of the local nature of spectral data in both the time and frequency domains within a single regression model. We describe here a novel chemometrics algorithm based on the wavelet transform. We call the algorithm dual-domain regression, as the regression step defines a weighted model in the time domain based on the contributions of parallel, frequency-domain models built from wavelet coefficients reflecting different scales. In principle, any regression method can be used; implementations of the algorithm using partial least-squares (PLS) regression and principal component regression (PCR) are reported here. The performance of the resulting models is generally superior to that of regular PLS or PCR models applied to data restricted to a single domain. Dual-domain PLS and PCR algorithms are applied to near-infrared (NIR) spectral datasets of Cargill corn samples and to sets of spectra collected on batch chemical reactions run in different reactors, illustrating the improved robustness of the modeling.
Keywords: Wavelet; Alternate domain regression; PLS; Calibration; Robust calibration;
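A heavily simplified sketch of the dual-domain idea (ordinary least squares stands in for PLS/PCR, and a single-level Haar transform for the full wavelet decomposition): fit one model per domain, then combine the parallel predictions with weights derived from their training errors. The inverse-RMSE weighting is an assumption for illustration, not the paper's weighting scheme.

```python
import numpy as np

# Parallel models on the raw signals and on each scale of a one-level Haar
# wavelet transform, combined by inverse-error weights.
rng = np.random.default_rng(3)
n, p = 60, 32
X = rng.normal(0, 1, (n, p)).cumsum(axis=1)     # smooth, correlated "spectra"
y = X[:, 8] - 0.5 * X[:, 20] + rng.normal(0, 0.1, n)

approx = (X[:, ::2] + X[:, 1::2]) / np.sqrt(2)  # Haar approximation coefficients
detail = (X[:, ::2] - X[:, 1::2]) / np.sqrt(2)  # Haar detail coefficients

def fit_predict(Z, y):
    b = np.linalg.pinv(Z) @ y                   # OLS stand-in for PLS/PCR
    return Z @ b

preds = np.array([fit_predict(Z, y) for Z in (X, approx, detail)])
rmse = np.sqrt(((preds - y) ** 2).mean(axis=1))
w = (1 / rmse) / np.sum(1 / rmse)               # inverse-error weights
y_hat = w @ preds                               # weighted dual-domain prediction
print(np.sqrt(((y_hat - y) ** 2).mean()))
```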

SpaRef: a clustering algorithm for multispectral images by Thanh N. Tran; Ron Wehrens; Lutgarde M.C. Buydens (303-312).
Multispectral images, such as multispectral chemical images or multispectral satellite images, provide detailed data with information in both the spatial and spectral domains. Many segmentation methods for multispectral images are based on per-pixel classification, which uses only spectral information and ignores spatial information. A clustering algorithm based on both spectral and spatial information should produce better results. In this work, spatial refinement clustering (SpaRef), a new clustering algorithm for multispectral images, is presented. Spatial information is integrated with the partitional and agglomerative clustering processes, and the number of clusters is identified automatically. SpaRef is compared with a set of well-known clustering methods on compact airborne spectrographic imager (CASI) data over an area in the Klompenwaard, The Netherlands. The clusters obtained show improved results. Applying SpaRef to multispectral chemical images would be a straightforward next step.
Keywords: Clustering algorithm; Multispectral image segmentation; Spatial information;
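Not the SpaRef algorithm itself, but a minimal illustration of combining the two information sources it exploits: cluster pixels on their spectra augmented with spatially weighted image coordinates, so that neighbouring pixels tend to group together. The spatial weight is an assumed trade-off parameter.

```python
import numpy as np

# k-means on [spectrum, weighted (row, col)] features of a toy 8x8, 4-band image.
rng = np.random.default_rng(4)
h, w, bands = 8, 8, 4
image = np.zeros((h, w, bands))
image[:, :4] = 1.0                              # left half: a different "material"
image += rng.normal(0, 0.1, image.shape)

rows, cols = np.mgrid[0:h, 0:w]
spatial_weight = 0.05                           # assumed spectral/spatial trade-off
features = np.column_stack([
    image.reshape(-1, bands),
    spatial_weight * rows.ravel(),
    spatial_weight * cols.ravel(),
])

def kmeans(Z, k, iters=20):
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

labels = kmeans(features, 2).reshape(h, w)
print(labels)
```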

Spectral similarity versus structural similarity: infrared spectroscopy by K. Varmuza; M. Karlovits; W. Demuth (313-324).
A new method is described for evaluating spectral similarity searches. The aim of the method is to measure the similarity between the chemical structures of query compounds and the found reference compounds (hits). A high structural similarity is essential if the query is not present in the spectral library. Similarity of chemical structures was measured by the Tanimoto index, calculated from 1365 binary substructure descriptors. The method has been applied to several thousand hitlists from searches in an infrared (IR) spectra database containing 13,484 compounds. Hitlists with the highest structure information were obtained using a similarity measure based on the correlation coefficient computed from mean-centered absorbance units. Frequency distributions of spectral and structural similarities have been investigated, and a threshold for spectral similarity has been derived that in general gives hitlists exhibiting significant chemical structure similarity with the query.
Keywords: Spectral library search; IR spectra; Tanimoto index; Substructure descriptors; Interpretative power;
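The Tanimoto index on binary substructure descriptors is simply the number of substructures common to both molecules divided by the number present in at least one of them. A sketch with hypothetical 8-bit descriptors (the paper uses 1365):

```python
import numpy as np

# Tanimoto index for binary descriptor vectors: |a AND b| / |a OR b|.
def tanimoto(a, b):
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    both = np.sum(a & b)
    either = np.sum(a | b)
    return both / either if either else 1.0

query = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical substructure descriptors
hit   = [1, 0, 1, 0, 0, 1, 1, 0]
print(tanimoto(query, hit))        # 3 shared / 5 present overall = 0.6
```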

The selectivity and robustness of near-infrared (near-IR) calibration models based on short-scan Fourier transform (FT) infrared interferogram data are explored. The calibration methodology used in this work employs bandpass digital filters to reduce the frequency content of the interferogram data, followed by the use of partial least-squares (PLS) regression to build calibration models with the filtered interferogram signals. Near-IR combination-region interferogram data are employed, corresponding to physiological levels of glucose in an aqueous matrix containing variable levels of alanine, sodium ascorbate, sodium lactate, urea, and triacetin. A randomized design procedure is used to minimize correlations between the component concentrations and between the concentrations of glucose and water. Because of the severe spectral overlap of the components, this sample matrix provides an excellent test of the ability of the calibration methodology to extract the glucose signature from the interferogram data. The robustness of the analysis is also studied by applying the calibration models to data collected outside the time span of the data used to compute the models. A calibration model based on 52 samples collected over 4 days and employing two digital filters produces a standard error of calibration (SEC) of 0.36 mM glucose. The corresponding standard errors of prediction (SEP) for data collected on the 5th (18 samples) and 7th (10 samples) day are 0.42 and 0.48 mM, respectively. The interferogram segment used for the analysis contained only 155 points. These results are compatible with those obtained in a conventional analysis of absorbance spectra and confirm the viability of the interferogram-based calibration.
Keywords: Near-infrared; Glucose; Interferogram; Digital filtering; Partial least-squares;
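A sketch of the bandpass filtering step on a synthetic 155-point segment (an FFT zeroing filter stands in for the paper's FIR digital filters; frequencies and the passband are assumed for illustration): the filter isolates the narrow band carrying the analyte signature before modelling.

```python
import numpy as np

# Bandpass-filter a short synthetic "interferogram" segment by zeroing FFT bins.
rng = np.random.default_rng(5)
n = 155                                        # segment length, as in the paper
t = np.arange(n)
signal = (np.cos(2 * np.pi * 0.10 * t)         # in-band "analyte" component
          + np.cos(2 * np.pi * 0.30 * t)       # out-of-band interference
          + rng.normal(0, 0.05, n))            # measurement noise

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(n)
band = (freqs > 0.08) & (freqs < 0.12)         # assumed passband
filtered = np.fft.irfft(np.where(band, spectrum, 0), n)
print(np.round(filtered[:5], 3))
```

The filtered segment retains the 0.10-cycle component while suppressing the interference and most of the noise.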

A data analysis tool known as independent component analysis (ICA) is the main focus of this paper. The theory of ICA is briefly reviewed, and the underlying statistical assumptions and a practical algorithm are described. This paper introduces cross-validation/jack-knifing and significance tests to ICA. Jack-knifing is applied to estimate uncertainties for the ICA loadings, which also serve as a basis for significance tests. These tests are shown to improve ICA performance, indicating how many components are mixed in the observed data and which parts of the extracted sources contain significant information. We address the issue of stability for the ICA model through uncertainty plots. The ICA performance is compared to principal component analysis (PCA) for two selected applications, a simulated experiment and a real-world application.
Keywords: Independent component analysis (ICA); Source separation; Cross validation; Jack-knifing; Uncertainty estimates;
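A sketch of the jack-knife idea only, using PCA loadings (via SVD) as a simple stand-in model since a full ICA implementation is out of scope here: refit the model with one sample left out at a time and take the spread of the refitted loadings as their uncertainty.

```python
import numpy as np

# Leave-one-out jack-knife of first-component loadings on a rank-1 data set.
rng = np.random.default_rng(6)
n, p = 40, 10
scores = rng.normal(0, 1, (n, 1))
loading = np.linspace(1, 0, p)
X = scores @ loading[None, :] + rng.normal(0, 0.05, (n, p))

def first_loading(Z):
    _, _, vt = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    v = vt[0]
    return v if v[0] >= 0 else -v   # fix sign indeterminacy before comparing

fits = np.array([first_loading(np.delete(X, i, axis=0)) for i in range(n)])
uncertainty = fits.std(axis=0, ddof=1)   # spread across leave-one-out fits
print(np.round(uncertainty, 4))
```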

Partial least-squares modeling of continuous nodes in Bayesian networks by Nathaniel A. Woody; Steven D. Brown (355-363).
In Bayesian networks it is necessary to compute relationships between continuous nodes. The standard Bayesian network methodology represents this dependency with a linear regression model whose parameters are estimated by a maximum likelihood (ML) calculation. Partial least-squares (PLS) is proposed as an alternative method for computing the model parameters. This new hybrid method is termed PLS-Bayes, as it uses PLS to calculate regression vectors for a Bayesian network. This alternative approach requires storing the raw data matrix rather than sequentially updating sufficient statistics, but results in a regression model that predicts with higher accuracy, requires less training data, and performs well in large networks.
Keywords: Bayesian networks; Inverse calibration; PLS;
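A sketch of the substitution the paper proposes: computing a PLS1 regression vector (NIPALS form) that could replace the maximum-likelihood least-squares estimate for a continuous node's linear dependency. With all components retained, the PLS vector coincides with the OLS solution; fewer components give the regularised estimates the paper exploits.

```python
import numpy as np

# NIPALS PLS1: extract components, then form the regression vector
# b = W (P'W)^{-1} q relating centred X to centred y.
def pls1(X, y, n_components=1):
    X = X - X.mean(axis=0); y = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y; w /= np.linalg.norm(w)    # weight vector
        t = X @ w; tt = t @ t                  # scores
        p = X.T @ t / tt; qa = y @ t / tt      # X- and y-loadings
        X = X - np.outer(t, p); y = y - qa * t # deflate
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

rng = np.random.default_rng(7)
X = rng.normal(0, 1, (50, 5))
b_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ b_true + rng.normal(0, 0.05, 50)
b = pls1(X, y, n_components=5)   # full rank: recovers the OLS coefficients
print(np.round(b, 2))
```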

The use of proteomic data for compound characterisation and toxicity prediction has recently attracted much interest in the pharmaceutical industry, particularly with the development of new high-throughput proteomic techniques such as surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI-ToF MS). To validate these techniques, comparison with established methods such as clinical chemistry endpoints is required; however, there is currently no statistical method available to assess whether the proteomic data describe the same toxicological information as the clinical chemistry data. In this paper, generalised procrustes analysis (GPA) is applied to obtain a consensus between SELDI-ToF data and clinical chemistry data, both obtained from a study of cholestasis in rats. The significance of the consensus and the dimension of the consensus space are diagnosed by a newly developed randomisation F-test method for GPA [Food Qual. Pref. 13 (2002) 191]. Two kinds of matching were considered, using individual animals or treatment groups as samples in GPA. The results show that the SELDI-ToF data have a significant consensus with the clinical chemistry data, and that the consensus can be visualised in the significant dimensions of the group average space.
Keywords: Proteomics; Protein expression; Generalised procrustes analysis; Consensus; Validation; Randomisation test; Significant factors; F-test; SELDI-ToF; Predictive toxicology;
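A minimal generalised procrustes sketch (rotation-only, no scaling step, and not the paper's randomisation F-test): centre each data set, then iteratively rotate each one onto the current consensus, the mean configuration, using orthogonal Procrustes via SVD. The residual distance to the other set after alignment measures the agreement between the two blocks.

```python
import numpy as np

# Iterative GPA: rotate each configuration onto the mean until they agree.
def gpa(configs, iters=10):
    configs = [Z - Z.mean(axis=0) for Z in configs]   # centre each block
    consensus = np.mean(configs, axis=0)
    for _ in range(iters):
        for i, Z in enumerate(configs):
            u, _, vt = np.linalg.svd(Z.T @ consensus)
            configs[i] = Z @ (u @ vt)                 # best rotation onto consensus
        consensus = np.mean(configs, axis=0)
    return configs, consensus

rng = np.random.default_rng(8)
base = rng.normal(0, 1, (12, 2))                      # shared configuration
theta = np.pi / 3
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
set_a = base + rng.normal(0, 0.02, base.shape)
set_b = base @ rot + rng.normal(0, 0.02, base.shape)  # same shape, rotated

aligned, consensus = gpa([set_a, set_b])
print(np.round(np.linalg.norm(aligned[0] - aligned[1]), 3))
```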

Author Index (379-381).