KDDCup99 (Q6033833)
From MaRDI portal
OpenML dataset with id 1113
Language | Label | Description | Also known as |
---|---|---|---|
English | KDDCup99 | OpenML dataset with id 1113 | |
Statements
1
0 references
**Author**:
**Source**: Unknown - Date unknown
**Please cite**:

This is a 10% stratified subsample of the data from the 1999 ACM KDD Cup (http://www.sigkdd.org/kddcup/index.php).

Modified by TunedIT (converted to ARFF format)

http://www.sigkdd.org/kddcup/index.php?section=1999&method=info
This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99 The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
The training and test datasets are also available in the UC Irvine KDD archive.

KDD Cup 1999: Tasks

This document is adapted from the paper Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project by Salvatore J. Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip K. Chan.

Intrusion Detector Learning

Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided.
The 1999 KDD intrusion detection contest uses a version of this dataset.

Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.

A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

Attacks fall into four main categories:

* DOS: denial-of-service, e.g. syn flood;
* R2L: unauthorized access from a remote machine, e.g. guessing password;
* U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
* probing: surveillance and other probing, e.g. port scanning.

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and that the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.

Derived Features

Stolfo et al. defined higher-level features that help in distinguishing normal connections from attacks.
There are several categories of derived features.

The "same host" features examine only the connections in the past two seconds that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc.

The similar "same service" features examine only the connections in the past two seconds that have the same service as the current connection.

"Same host" and "same service" features are together called time-based traffic features of the connection records.

Some probing attacks scan the hosts (or ports) using a much larger time interval than two seconds, for example once per minute. Therefore, connection records were also sorted by destination host, and features were constructed using a window of 100 connections to the same host instead of a time window. This yields a set of so-called host-based traffic features.

Unlike most of the DOS and probing attacks, there appear to be no sequential patterns that are frequent in records of R2L and U2R attacks. This is because the DOS and probing attacks involve many connections to some host(s) in a very short period of time, but the R2L and U2R attacks are embedded in the data portions of packets, and normally involve only a single connection.

Useful algorithms for mining the unstructured data portions of packets automatically are an open research question. Stolfo et al. used domain knowledge to add features that look for suspicious behavior in the data portions, such as the number of failed login attempts. These features are called "content" features.

A complete listing of the set of features defined for the connection records is given in the three tables below. The data schema of the contest dataset is available in machine-readable form.

feature name | description | type |
---|---|---|
duration | length (number of seconds) of the connection | continuous |
protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete |
service | network service on the destination, e.g. http, telnet, etc. | discrete |
src_bytes | number of data bytes from source to destination | continuous |
dst_bytes | number of data bytes from destination to source | continuous |
flag | normal or error status of the connection | discrete |
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
wrong_fragment | number of "wrong" fragments | continuous |
urgent | number of urgent packets | continuous |

Table 1: Basic features of individual TCP connections.

feature name | description | type |
---|---|---|
hot | number of "hot" indicators | continuous |
num_failed_logins | number of failed login attempts | continuous |
logged_in | 1 if successfully logged in; 0 otherwise | discrete |
num_compromised | number of "compromised" conditions | continuous |
root_shell | 1 if root shell is obtained; 0 otherwise | discrete |
su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete |
num_root | number of "root" accesses | continuous |
num_file_creations | number of file creation operations | continuous |
num_shells | number of shell prompts | continuous |
num_access_files | number of operations on access control files | continuous |
num_outbound_cmds | number of outbound commands in an ftp session | continuous |
is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete |
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete |

Table 2: Content features within a connection suggested by domain knowledge.

feature name | description | type |
---|---|---|
count | number of connections to the same host as the current connection in the past two seconds | continuous |
 | Note: The following features refer to these same-host connections. | |
serror_rate | % of connections that have "SYN" errors | continuous |
rerror_rate | % of connections that have "REJ" errors | continuous |
same_srv_rate | % of connections to the same service | continuous |
diff_srv_rate | % of connections to different services | continuous |
srv_count | number of connections to the same service as the current connection in the past two seconds | continuous |
 | Note: The following features refer to these same-service connections. | |
srv_serror_rate | % of connections that have "SYN" errors | continuous |
srv_rerror_rate | % of connections that have "REJ" errors | continuous |
srv_diff_host_rate | % of connections to different hosts | continuous |

Table 3: Traffic features computed using a two-second time window.

http://www.sigkdd.org/kddcup
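The "same host" traffic features of Table 3 can be illustrated with a minimal Python sketch. It is not the code used to build the dataset. The `Conn` record, its field names, the choice to count the current connection inside its own window, and the use of the flag value `S0` as the only marker of a "SYN" error are all assumptions made for illustration. Rates are returned as fractions between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    """Hypothetical connection record; field names are illustrative only."""
    time: float      # start time of the connection, in seconds
    dst_host: str    # destination host
    service: str     # network service on the destination, e.g. "http"
    flag: str        # normal or error status of the connection


# Assumption: only this flag value is treated as a "SYN" error here.
SYN_ERROR_FLAGS = {"S0"}


def same_host_features(conns, window=2.0):
    """Compute same-host features for each connection.

    `conns` must be sorted by `time`. For each connection, the window holds
    the connections to the same destination host that started at most
    `window` seconds earlier, including the current connection itself.
    """
    features = []
    for i, cur in enumerate(conns):
        recent = [
            c for c in conns[: i + 1]
            if c.dst_host == cur.dst_host and cur.time - c.time <= window
        ]
        count = len(recent)  # at least 1, because the current connection is included
        syn_errors = sum(c.flag in SYN_ERROR_FLAGS for c in recent)
        same_srv = sum(c.service == cur.service for c in recent)
        features.append({
            "count": count,
            "serror_rate": syn_errors / count,
            "same_srv_rate": same_srv / count,
            "diff_srv_rate": (count - same_srv) / count,
        })
    return features


if __name__ == "__main__":
    demo = [
        Conn(0.0, "h1", "http", "SF"),
        Conn(1.0, "h1", "http", "S0"),
        Conn(1.5, "h1", "telnet", "S0"),
        Conn(4.0, "h1", "http", "SF"),
    ]
    for conn, feats in zip(demo, same_host_features(demo)):
        print(conn.time, feats)
```

Under the same assumptions, the "same service" features would filter on `service` instead of `dst_host`, and the host-based traffic features would replace the time condition with the last 100 connections to the same host.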
7 October 2014
label
4
23
42
494,020
0
0
34
8