Click or drag to resize

CategoricalDataSetCategorizeByEntropyMinimization Method (TextReader, Char, IndexCollection, Boolean, Int32, IFormatProvider)

Discretizes numerical data from the stream underlying the specified text reader by defining multiple intervals of the numerical data range. Intervals are identified by minimizing the intra-interval entropy of the specified target data.

Namespace:  Novacta.Analytics
Assembly:  Novacta.Analytics (in Novacta.Analytics.dll) Version: 2.0.0
Syntax
public static Dictionary<int, Categorizer> CategorizeByEntropyMinimization(
	TextReader reader,
	char columnDelimiter,
	IndexCollection numericalColumns,
	bool firstLineContainsVariableNames,
	int targetColumn,
	IFormatProvider provider
)

Parameters

reader
Type: System.IOTextReader
The reader having access to the data stream.
columnDelimiter
Type: SystemChar
The delimiter used to separate columns in data lines.
numericalColumns
Type: Novacta.AnalyticsIndexCollection
The zero-based indexes of the columns from which numerical data are to be extracted.
firstLineContainsVariableNames
Type: SystemBoolean
If set to true signals that the first line contains variable names.
targetColumn
Type: SystemInt32
The zero-based index of the column from which target data are to be extracted.
provider
Type: SystemIFormatProvider
An object that provides formatting information to parse numeric values.

Return Value

Type: DictionaryInt32, Categorizer
A mapping from the set of extracted numerical column indexes to a set of categorizers of the corresponding data.
Exceptions
ExceptionCondition
ArgumentNullExceptionreader is null.
-or-
numericalColumns is null.
-or-
provider is null.
ArgumentOutOfRangeExceptiontargetColumn is negative.
InvalidDataException The stream accessed by reader contains no data rows.
-or-
There is at least a row which contains not enough data for any column specified by targetColumn or numericalColumns. This can happen if there are missing columns, or if strings representing target category labels, are null or consist only of white-space characters, or if strings representing numerical values cannot be converted to an equivalent double-precision floating-point number. In some cases, the InnerException property is set to add further details about the occurred error.
Remarks

Data Extraction

Each line from the stream is interpreted as the information about categorical or numerical variables observed at a given instance. A line is split in tokens, each corresponding to a (zero-based) column, which in turn stores the data of a given variable. Columns are assumed to be separated each other by the character passed as columnDelimiter. Data from a variable are extracted only if the corresponding column index is equal to targetColumn or in the collection numericalColumns.

Intra Interval Entropy Minimization

By default, when encoding a CategoricalDataSet, tokens in a column are interpreted as category labels of the corresponding variable, which are inserted in the dataset as such. This behavior can be overridden by mapping a special Categorizer to a given column. Following Fayyad and Irani, (1993)[1] , this method selects a categorizer by splitting the range of the numerical data into multiple intervals in order to minimize the intra-interval heterogeneity of the given target. A dictionary is returned in which, for each numerical column, the corresponding categorizer is inserted as a value keyed with the index of the given column. A special categorizer can be useful if a given column refers to a numerical variable which must be discretized before its insertion in a categorical dataset.

Examples

In the following example, a stream contains two columns, the first corresponding to a numerical variable, and the second to a categorical one, which is interpreted as the target. A special categorizer, obtained by intra interval entropy minimization, is assigned to the first column to discretize its data, then both columns are encoded in a categorical dataset.

Categorizing numerical data by intra interval entropy minimization and subsequent encoding in a categorical dataset
using System;
using System.Globalization;
using System.IO;

namespace Novacta.Analytics.CodeExamples
{
    public class CategoricalEncodeExample1  
    {
        public void Main()
        {
            // Create a data stream.
            const int numberOfInstances = 27;
            string[] data = new string[numberOfInstances + 1] {
            "NUMERICAL,TARGET",
            "0,A",
            "0,A",
            "0,A",
            "1,B",
            "1,B",
            "1,B",
            "1,B",
            "2,B",
            "2,B",
            "3,C",
            "3,C",
            "3,C",
            "4,B",
            "4,B",
            "4,B",
            "4,C",
            "5,A",
            "5,A",
            "6,A",
            "7,C",
            "7,C",
            "7,C",
            "8,C",
            "8,C",
            "9,C",
            "9,C",
            "9,C" };

            MemoryStream stream = new();
            StreamWriter writer = new(stream);
            for (int i = 0; i < data.Length; i++) {
                writer.WriteLine(data[i].ToCharArray());
                writer.Flush();
            }
            stream.Position = 0;

            // Identify the special categorizer for variable NUMERICAL.
            StreamReader streamReader = new(stream);
            char columnDelimiter = ',';
            IndexCollection numericalColumns = IndexCollection.Range(0, 0);
            bool firstLineContainsColumnHeaders = true;
            int targetColumn = 1;
            IFormatProvider provider = CultureInfo.InvariantCulture;
            var specialCategorizers = CategoricalDataSet.CategorizeByEntropyMinimization(
                streamReader,
                columnDelimiter,
                numericalColumns,
                firstLineContainsColumnHeaders,
                targetColumn,
                provider);

            // Encode the categorical data set using the special categorizer.
            stream.Position = 0;
            IndexCollection extractedColumns = IndexCollection.Range(0, 1);
            CategoricalDataSet dataset = CategoricalDataSet.Encode(
                streamReader,
                columnDelimiter,
                extractedColumns,
                firstLineContainsColumnHeaders,
                specialCategorizers,
                provider);

            // Decode and show the data set.
            Console.WriteLine("Decoded data set:");
            Console.WriteLine();
            var decodedDataSet = dataset.Decode();
            int numberOfVariables = dataset.Data.NumberOfColumns;

            foreach (var variable in dataset.Variables) {
                Console.Write(variable.Name + ",");
            }
            Console.WriteLine();

            for (int i = 0; i < numberOfInstances; i++) {
                for (int j = 0; j < numberOfVariables; j++) {
                    Console.Write(decodedDataSet[i][j] + ",");
                }
                Console.WriteLine();
            }
        }
    }
}

// Executing method Main() produces the following output:
// 
// Decoded data set:
// 
// NUMERICAL,TARGET,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,

Bibliography
[1] Fayyad, U.M. and Irani, K.B., Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, in: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022-1027. San Francisco, CA: Morgan Kaufmann. (1993), http://ijcai.org/Proceedings/93-2/Papers/022.pdf
See Also