spambase (Q6032900)

From MaRDI portal
Revision as of 09:44, 15 April 2024 by Importer (Created a new Item)
OpenML dataset with id 44
Language | Label | Description | Also known as
English | spambase | OpenML dataset with id 44 |

    Statements

    **Author**: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
    **Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/spambase)
    **Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

    SPAM E-mail Database

    The "spam" concept is diverse: advertisements for products/websites, make-money-fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general-purpose spam filter.

    For background on spam:
    Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.

    ### Attribute Information:

    The last column denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0). Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.

    For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

    48 continuous real [0,100] attributes of type word_freq_WORD
    = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

    6 continuous real [0,100] attributes of type char_freq_CHAR
    = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

    1 continuous real [1,...] attribute of type capital_run_length_average
    = average length of uninterrupted sequences of capital letters

    1 continuous integer [1,...] attribute of type capital_run_length_longest
    = length of longest uninterrupted sequence of capital letters

    1 continuous integer [1,...] attribute of type capital_run_length_total
    = sum of length of uninterrupted sequences of capital letters
    = total number of capital letters in the e-mail

    1 nominal {0,1} class attribute of type spam
    = denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0).
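The attribute definitions above can be illustrated with a minimal Python sketch that computes the three feature families for a single message. It is not the original extraction code. The function name `spambase_style_features`, its parameters, the case-insensitive word matching, and the fallback value of 0 for messages without words or capital letters are all assumptions made for illustration; the description does not specify them.

```python
import re


def spambase_style_features(text, words, chars):
    """Compute word_freq_*, char_freq_* and capital_run_length_* for one e-mail.

    Hypothetical helper: `words` and `chars` are the WORD and CHAR values
    to measure; the dataset's own lists are not reproduced here.
    """
    # A "word" is a maximal run of alphanumeric characters.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Assumption: matching ignores case (not stated in the description).
    lowered = [t.lower() for t in tokens]

    features = {}
    for w in words:
        count = lowered.count(w.lower())
        # 100 * (occurrences of WORD) / (total number of words)
        features[f"word_freq_{w}"] = 100.0 * count / len(tokens) if tokens else 0.0
    for c in chars:
        # 100 * (occurrences of CHAR) / (total number of characters)
        features[f"char_freq_{c}"] = 100.0 * text.count(c) / len(text) if text else 0.0

    # Lengths of uninterrupted sequences of capital letters.
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    # Assumption: report 0 when the message contains no capital letters.
    features["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    features["capital_run_length_longest"] = max(runs, default=0)
    features["capital_run_length_total"] = sum(runs)
    return features


example = spambase_style_features(
    "WIN cash NOW! Call george", words=["george", "free"], chars=["!", "$"]
)
print(example["word_freq_george"])            # 20.0 (1 of 5 words)
print(example["char_freq_!"])                 # 4.0 (1 of 25 characters)
print(example["capital_run_length_longest"])  # 3 ("WIN" or "NOW")
print(example["capital_run_length_total"])    # 7 (3 + 3 + 1)
```

In the example message the capital runs are "WIN", "NOW" and the "C" of "Call", so the total equals the number of capital letters in the message, matching the second definition of capital_run_length_total.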
    Mark Hopkins
    Erik Reeber
    George Forman
    Jaap Suermondt
    Hewlett-Packard Labs
    1999-07-01
    6 April 2014
    class
    d9ace01aeac3461e326a8e1b2d53fd84
    1
    2
    58
    4,601
    0
    57

    Identifiers
