tecator (Q6033241)

From MaRDI portal
OpenML dataset with id 505

    Statements

    **Author**:
    **Source**: Unknown - Date unknown
    **Please cite**:

    This is the Tecator data set: The task is to predict the fat content of a
    meat sample on the basis of its near infrared absorbance spectrum.

    1. Statement of permission from Tecator (the original data source)

    These data are recorded on a Tecator Infratec Food and Feed Analyzer
    working in the wavelength range 850 - 1050 nm by the Near Infrared
    Transmission (NIT) principle. Each sample contains finely chopped pure
    meat with different moisture, fat and protein contents.

    If results from these data are used in a publication we want you to
    mention the instrument and company name (Tecator) in the publication.
    In addition, please send a preprint of your article to

    Karin Thente, Tecator AB,
    Box 70, S-263 21 Hoganas, Sweden

    The data are available in the public domain with no responsibility from
    the original data source. The data can be redistributed as long as this
    permission note is attached.
    For more information about the instrument, call Perstorp Analytical's
    representative in your area.


    2. Description of the data file

    For each meat sample the data consist of a 100 channel spectrum of
    absorbances and the contents of moisture (water), fat and protein.
    The absorbance is -log10 of the transmittance
    measured by the spectrometer. The three contents, measured in percent,
    are determined by analytic chemistry.

    There are 240 samples which are divided into 5 data sets for the purpose
    of model validation and extrapolation studies. The data sets, further
    described in reference 1, are:

    Data set  Use                     Samples
    C         Training                129
    M         Monitoring              43
    T         Testing                 43
    E1        Extrapolation, Fat      8
    E2        Extrapolation, Protein  17

    The data for all 240 samples appear at the end of this file - 25 lines
    per sample. The data sets appear in the order of the table above.
    The spectra are preprocessed using a principal component analysis on the
    data set C, and the first 22 principal components (scaled to unit
    variance) are included for each sample.
    Thus if you want to use the data for a standard (interpolation) test
    of your algorithm, use samples 1-172 for training and samples 173-215
    for testing (and ignore the last 25 samples), and use the first 13 or so
    principal components to predict the fat content.

    Each line contains the 100 absorbances followed by the 22 principal
    components and finally the contents of moisture, fat and protein.

    Preceding the data lines, the following lines appear:

    real_in=122
    real_out=3
    training_examples=172
    test_examples=43
    extrapolation_examples=25


    3. More details on how to use the data

    The data are made available as a benchmark for regression models. In order
    to compare models, it is practical to use the data set as follows:

    C and M combined are used to tune (estimate, train) the model. (Some
    approaches set aside some training data to control overfitting. These data
    should be a subset of C+M. In (1) the subset M was used for this purpose.)

    T is used to test the model once it has been tuned.
    If each model has an element of randomness (as is the case
    for neural networks) the most reliable measure of performance of a single
    model is obtained by selecting a handful of models on the basis of C+M and
    quoting the average of the performances on T.
    In the presence of randomness it is bad practice to train a lot of models
    on C+M and then select the best of these on the basis of T.

    C, M and T are drawn from the same pool of data, so T is used to test the
    ability of the models to interpolate. The data sets E1 and E2 contain
    more fat and protein respectively and are intended to be used to test the
    ability of the models to extrapolate.


    4. Performance of neural network models

    The performance is measured as Standard Error of Prediction (SEP) which
    is the root mean square of the difference between the true and the predicted
    content.

    For the prediction of fat on the data set T the following results were obtained:

    Reference  SEP   Method (see the papers for details)
    (1)        0.65  10-6-1 network, early stopping
    (2)        0.52  10-3-1 network, Bayesian
    (3)        0.36  13-X-1 network, Bayesian, Automatic Relevance Determination

    A linear model with 10 inputs yields SEP=2.78.

    5. References

    (1) C. Borggaard and H.H. Thodberg,
    "Optimal Minimal Neural Interpretation of Spectra",
    Analytical Chemistry 64 (1992), pp. 545-551.

    (2) H.H. Thodberg, "Ace of Bayes: Application of Neural Networks with Pruning",
    Manuscript 1132, Danish Meat Research Institute (1993),
    available by anonymous ftp in the file:
    pub/neuroprose/thodberg.ace-of-bayes.ps.Z on the Internet node
    archive.cis.ohio-state.edu (128.146.8.52).

    (3) Revised and extended version of (2), in preparation, to be
    submitted to IEEE Trans. Neural Networks (1995),
    available by anonymous ftp in the file:
    pub/neuroprose/thodberg.bayesARD.ps.Z on the Internet node
    archive.cis.ohio-state.edu (128.146.8.52).

    Hans Henrik Thodberg
    Danish Meat Research Institute
    Maglegaardsvej 2, Postboks 57
    DK-4000 Roskilde, Denmark
    Email: thodberg@nn.dmri.dk
    Phone: (+45) 42 36 12 00
    Fax: (+45) 42 36 48 36

    real_in=122
    real_out=3
    training_examples=172
    test_examples=43
    extrapolation_examples=25


    Note: all 240 samples are included in the same order as mentioned above


    Information about the dataset
    CLASSTYPE: numeric
    CLASSINDEX: none specific
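The standard interpolation protocol and the SEP measure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the cited papers. It assumes the data have already been loaded into a 240 x 125 NumPy array whose columns follow the order given in the description (100 absorbances, 22 principal components, then moisture, fat and protein). The function names are hypothetical, the ordinary least-squares fit is only a stand-in model, and the random array is a placeholder for the real data, so the printed SEP has no bearing on the figures quoted above.

```python
import numpy as np

N_ABSORBANCES = 100   # columns 0..99
N_COMPONENTS = 22     # columns 100..121; then moisture, fat, protein


def split_tecator(data):
    """Split rows into train (C+M), test (T) and extrapolation (E1+E2)."""
    if data.shape != (240, 125):
        raise ValueError("expected a (240, 125) array")
    # Samples 1-172, 173-215 and 216-240 in the 1-based numbering of the text.
    return data[:172], data[172:215], data[215:]


def select_inputs(block, n_components=13):
    """Return the first n principal components and the fat column."""
    pcs = block[:, N_ABSORBANCES:N_ABSORBANCES + n_components]
    fat = block[:, N_ABSORBANCES + N_COMPONENTS + 1]  # moisture, FAT, protein
    return pcs, fat


def sep(y_true, y_pred):
    """Standard Error of Prediction: root mean square of (true - predicted)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def fit_linear(x, y):
    """Ordinary least squares with an intercept (stand-in model only)."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, x):
    return np.column_stack([np.ones(len(x)), x]) @ coef


# Placeholder data with the documented shape; replace with the real samples.
rng = np.random.default_rng(0)
data = rng.normal(size=(240, 125))

train, test, extrapolation = split_tecator(data)
x_train, y_train = select_inputs(train)
x_test, y_test = select_inputs(test)

coef = fit_linear(x_train, y_train)
test_sep = sep(y_test, predict_linear(coef, x_test))
print(f"SEP on T (placeholder data): {test_sep:.3f}")
```

Following the guidance in section 3, any model selection or tuning would use only the first 172 rows; the 43 test rows are scored once, after tuning is finished.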
    29 September 2014
    fat
    bc1cfff2d40bc7e47e7b6aa0826f3d5d
    0
    0
    125
    240
    0
    125