kc1-binary

OpenML ID: 1066; MaRDI QID: Q6033790

OpenML dataset with id 1066

No author found.

Full work available at URL: https://api.openml.org/data/v1/download/53949/kc1-binary.arff

Upload date: 6 October 2014



Dataset Characteristics

Number of classes: 2
Number of features: 95 (numeric: 94, symbolic: 1 and in total binary: 1 )
Number of instances: 145
Number of instances with missing values: 0
Number of missing values: 0

Author:
Source: Unknown - Date unknown
Please cite:

This is a PROMISE Software Engineering Repository data set made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering.

If you publish material based on PROMISE data sets, then please follow the acknowledgment guidelines posted on the PROMISE repository web page: http://promise.site.uottawa.ca/SERepository

1. Title: Class-level data for KC1. This version includes a {_TRUE,FALSE} attribute (DL) to indicate defectiveness.

2. Sources
   (a) Creator: A. Gunes Koru
   (b) Date: February 21, 2005
   (c) Contact: gkoru AT umbc DOT edu Phone: +1 (410) 455 8843

3. Donor: A. Gunes Koru

4. Past Usage: This data was used for:

A. Gunes Koru and Hongfang Liu, "An Investigation of the Effect of Module Size on Defect Prediction Using Static Measures", PROMISE - Predictive Models in Software Engineering Workshop, ICSE 2005, May 15th 2005, Saint Louis, Missouri, US.

We used several machine learning algorithms to predict the defective modules in five NASA products, namely, CM1, JM1, KC1, KC2, and PC1. A set of static measures was used as predictor variables. While doing so, we observed that a large portion of the modules were small, as measured by lines of code (LOC). When we experimented on the data subsets created by partitioning according to module size, we obtained higher prediction performance for the subsets that include larger modules. We also performed defect prediction using class-level data for KC1 rather than method-level data. In this case, the use of class-level data resulted in improved prediction performance compared to using method-level data. These findings suggest that quality assurance activities can be guided even better if defect predictions are made by using data that belong to larger modules.

5. Features:

The descriptions of the features are taken from http://mdp.ivv.nasa.gov/mdp_glossary.html

Feature Used as the Response Variable:

==========================

DL: Defect level. _TRUE if the class contains one or more defects, false otherwise.
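For modelling, the nominal DL attribute usually has to be mapped to a binary target. The following is a minimal Python sketch, assuming the two values appear in the data as the strings `_TRUE` and `FALSE`, as declared in the title entry; the function name `encode_dl` and the sample labels are hypothetical.

```python
def encode_dl(value: str) -> int:
    """Map a DL label to 1 (defective) or 0 (not defective).

    Assumes the two nominal values are the strings "_TRUE" and "FALSE";
    anything else is rejected rather than guessed.
    """
    label = value.strip()
    if label == "_TRUE":
        return 1
    if label == "FALSE":
        return 0
    raise ValueError(f"unexpected DL value: {value!r}")


# Hypothetical labels, not taken from the data set.
labels = ["_TRUE", "FALSE", "FALSE", "_TRUE"]
targets = [encode_dl(v) for v in labels]
print(targets)  # [1, 0, 0, 1]
```

Rejecting unknown labels keeps a spelling difference in the file (for example `TRUE` without the underscore) from being silently counted as non-defective.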

Features at Class Level Originally

======================

PERCENT_PUB_DATA: The percentage of data in a class that is public or protected. In general, lower values indicate greater encapsulation. It is a measure of encapsulation.

ACCESS_TO_PUB_DATA: The number of times that a class's public and protected data is accessed. In general, lower values indicate greater encapsulation. It is a measure of encapsulation.

COUPLING_BETWEEN_OBJECTS: The number of distinct non-inheritance-related classes on which a class depends. If a class that is heavily dependent on many classes outside of its hierarchy is introduced into a library, all the classes upon which it depends need to be introduced as well. This may be acceptable, especially if the classes which it references are already part of a class library and are even more fundamental than the specified class.

DEPTH: The level for a class. For instance, if a parent has one child the depth for the child is two. Depth indicates at what level a class is located within its class hierarchy. In general, inheritance increases when depth increases.

LACK_OF_COHESION_OF_METHODS: For each data field in a class, the percentage of the methods in the class using that data field; the percentages are averaged then subtracted from 100%. The locm metric indicates low or high percentage of cohesion. If the percentage is low, the class is cohesive. If it is high, it may indicate that the class could be split into separate classes that will individually have greater cohesion.
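The calculation described for LACK_OF_COHESION_OF_METHODS can be sketched in Python as follows. This follows the wording of the definition only; the function name, the input layout (a mapping from each data field to the methods that use it), and the example class are hypothetical, and the tool that produced the KC1 values may handle edge cases differently.

```python
def lack_of_cohesion(field_usage: dict[str, set[str]], methods: set[str]) -> float:
    """Return the lack-of-cohesion percentage for one class.

    field_usage maps each data field to the set of methods that use it.
    methods is the set of all methods in the class.
    """
    if not field_usage or not methods:
        raise ValueError("need at least one field and one method")
    # Percentage of the class's methods that use each field.
    percentages = [
        100.0 * len(users & methods) / len(methods)
        for users in field_usage.values()
    ]
    # Average the percentages, then subtract from 100%.
    return 100.0 - sum(percentages) / len(percentages)


# Hypothetical class with two fields and four methods.
methods = {"open", "close", "read", "write"}
usage = {
    "handle": {"open", "close", "read", "write"},  # used by 4 of 4 -> 100%
    "buffer": {"read", "write"},                   # used by 2 of 4 -> 50%
}
print(lack_of_cohesion(usage, methods))  # 25.0
```

A class in which every method touches every field scores 0, while a class whose fields are each used by only a few methods scores close to 100.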

NUM_OF_CHILDREN: The number of classes derived from a specified class.

DEP_ON_CHILD: Whether a class is dependent on a descendant.

FAN_IN: This is a count of calls by higher modules.

RESPONSE_FOR_CLASS: A count of methods implemented within a class plus the number of methods accessible to an object class due to inheritance. In general, lower values indicate greater polymorphism.

WEIGHTED_METHODS_PER_CLASS: A count of methods implemented within a class (rather than all methods accessible within the class hierarchy). In general, lower values indicate greater polymorphism.

Features Transformed to Class Level (Originally at Method Level)

====================================================

Transformation was achieved by obtaining min, max, sum, and avg values over all the methods in a class. Therefore, this data set includes four class-level features for each of the following features, which were originally at the method level. For example, LOC_BLANK has minLOC_BLANK, maxLOC_BLANK, sumLOC_BLANK, and avgLOC_BLANK.
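The transformation can be illustrated with a short Python sketch. It is not the original preprocessing code: the function name, the record layout, and the example values are hypothetical, and the `sum` prefix is assumed by analogy with the min, max, and avg prefixes.

```python
from collections import defaultdict


def aggregate_to_class_level(method_rows, metric):
    """Summarise one method-level metric per class as min, max, sum and avg."""
    values_by_class = defaultdict(list)
    for row in method_rows:
        values_by_class[row["class"]].append(row[metric])

    return {
        cls: {
            f"min{metric}": min(values),
            f"max{metric}": max(values),
            f"sum{metric}": sum(values),
            f"avg{metric}": sum(values) / len(values),
        }
        for cls, values in values_by_class.items()
    }


# Hypothetical method-level records; class names and values are invented.
method_rows = [
    {"class": "Parser", "LOC_BLANK": 2},
    {"class": "Parser", "LOC_BLANK": 6},
    {"class": "Parser", "LOC_BLANK": 1},
    {"class": "Lexer", "LOC_BLANK": 4},
]
class_rows = aggregate_to_class_level(method_rows, "LOC_BLANK")
print(class_rows["Parser"])
# {'minLOC_BLANK': 1, 'maxLOC_BLANK': 6, 'sumLOC_BLANK': 9, 'avgLOC_BLANK': 3.0}
```

Applying the same function to every method-level metric yields four class-level columns per metric, which is the layout the paragraph above describes.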

LOC_BLANK: Lines with only white space or no text content.

BRANCH_COUNT: This metric is the number of branches for each module. Branches are defined as those edges that exit from a decision node. The greater the number of branches in a program's modules, the more testing resources are required.

LOC_CODE_AND_COMMENT: Lines that contain both code and comment.

LOC_COMMENTS: The number of lines in a module. This particular metric includes all blank lines, comment lines, and source lines.

CYCLOMATIC_COMPLEXITY: It is a measure of the complexity of a module's decision structure. It is the number of linearly independent paths.

DESIGN_COMPLEXITY: Design complexity is a measure of a module's decision structure as it relates to calls to other modules. This quantifies the testing effort related to integration.

ESSENTIAL_COMPLEXITY: Essential complexity is a measure of the degree to which a module contains unstructured constructs.

LOC_EXECUTABLE: Source lines of code that contain only code and white space.

HALSTEAD_CONTENT: Complexity of a given algorithm independent of the language used to express the algorithm.

HALSTEAD_DIFFICULTY: Level of difficulty in the program.

HALSTEAD_EFFORT: Estimated mental effort required to develop the program.

HALSTEAD_ERROR_EST: Estimated number of errors in the program.

HALSTEAD_LENGTH: This is a Halstead metric that includes the total number of operator occurrences and total number of operand occurrences.

HALSTEAD_LEVEL: Level at which the program can be understood.

HALSTEAD_PROG_TIME: Estimated amount of time to implement the algorithm.

HALSTEAD_VOLUME: This is a Halstead metric that contains the minimum number of bits required for coding the program.

NUM_OPERANDS: The number of operands. Operands are variables and identifiers, constants (numeric literal/string), and function names when used during calls.

NUM_UNIQUE_OPERANDS: The number of unique operands. Operands are variables and identifiers, constants (numeric literal/string), and function names when used during calls.

NUM_UNIQUE_OPERATORS: Number of unique operators.

LOC_TOTAL: Total Lines of Code.
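The Halstead features listed above are conventionally derived from four counts: unique operators, unique operands, total operator occurrences, and total operand occurrences. The Python sketch below uses the commonly cited textbook formulas. Whether the tool that produced the KC1 values used exactly these variants (for example, volume divided by 3000 for the error estimate, or effort divided by 18 for the time in seconds) is an assumption, and the function name, parameter names, and example counts are hypothetical.

```python
import math


def halstead_metrics(unique_operators, unique_operands,
                     total_operators, total_operands):
    """Compute textbook Halstead measures from the four basic counts."""
    length = total_operators + total_operands        # N = N1 + N2
    vocabulary = unique_operators + unique_operands  # n = n1 + n2
    volume = length * math.log2(vocabulary)          # V = N * log2(n)
    difficulty = (unique_operators / 2) * (total_operands / unique_operands)
    effort = difficulty * volume                     # E = D * V
    return {
        "HALSTEAD_LENGTH": length,
        "HALSTEAD_VOLUME": volume,
        "HALSTEAD_DIFFICULTY": difficulty,
        "HALSTEAD_LEVEL": 1 / difficulty,            # L = 1 / D
        "HALSTEAD_EFFORT": effort,
        "HALSTEAD_CONTENT": volume / difficulty,     # I = V / D
        "HALSTEAD_PROG_TIME": effort / 18,           # seconds, Stroud number 18
        "HALSTEAD_ERROR_EST": volume / 3000,         # one common variant
    }


# Hypothetical counts for a small method.
m = halstead_metrics(unique_operators=4, unique_operands=4,
                     total_operators=8, total_operands=8)
print(m["HALSTEAD_VOLUME"], m["HALSTEAD_DIFFICULTY"], m["HALSTEAD_EFFORT"])
# 48.0 4.0 192.0
```

Since the data set stores min, max, sum, and avg of each method-level value per class, these formulas apply to the original method-level counts and would not generally reproduce a class-level column from other class-level columns.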