CategoricalDataSet.CategorizeByEntropyMinimization(TextReader, Char, IndexCollection, Boolean, Int32, IFormatProvider) Method

Discretizes numerical data from the stream underlying the specified text reader by defining multiple intervals of the numerical data range. Intervals are identified by minimizing the intra-interval entropy of the specified target data.

Definition

Namespace: Novacta.Analytics
Assembly: Novacta.Analytics (in Novacta.Analytics.dll) Version: 2.1.0+428f3840cfab98dda567bb0ed350b302533e273a

C#

public static Dictionary<int, Categorizer> CategorizeByEntropyMinimization(
	TextReader reader,
	char columnDelimiter,
	IndexCollection numericalColumns,
	bool firstLineContainsVariableNames,
	int targetColumn,
	IFormatProvider provider
)

VB

Public Shared Function CategorizeByEntropyMinimization ( 
	reader As TextReader,
	columnDelimiter As Char,
	numericalColumns As IndexCollection,
	firstLineContainsVariableNames As Boolean,
	targetColumn As Integer,
	provider As IFormatProvider
) As Dictionary(Of Integer, Categorizer)

C++

public:
static Dictionary<int, Categorizer^>^ CategorizeByEntropyMinimization(
	TextReader^ reader, 
	wchar_t columnDelimiter, 
	IndexCollection^ numericalColumns, 
	bool firstLineContainsVariableNames, 
	int targetColumn, 
	IFormatProvider^ provider
)

F#

static member CategorizeByEntropyMinimization : 
        reader : TextReader * 
        columnDelimiter : char * 
        numericalColumns : IndexCollection * 
        firstLineContainsVariableNames : bool * 
        targetColumn : int * 
        provider : IFormatProvider -> Dictionary<int, Categorizer>

Parameters

reader TextReader: The reader having access to the data stream.
columnDelimiter Char: The delimiter used to separate columns in data lines.
numericalColumns IndexCollection: The zero-based indexes of the columns from which numerical data are to be extracted.
firstLineContainsVariableNames Boolean: If set to true, signals that the first line contains variable names.
targetColumn Int32: The zero-based index of the column from which target data are to be extracted.
provider IFormatProvider: An object that provides formatting information to parse numeric values.

Return Value

Dictionary<Int32, Categorizer>
A mapping from the set of extracted numerical column indexes to a set of categorizers of the corresponding data.

Remarks

Data Extraction

Each line from the stream is interpreted as the information about categorical or numerical variables observed at a given instance. A line is split into tokens, each corresponding to a (zero-based) column, which in turn stores the data of a given variable. Columns are assumed to be separated from each other by the character passed as columnDelimiter. Data from a variable are extracted only if the corresponding column index is equal to targetColumn or is in the collection numericalColumns.
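The extraction rule described above can be illustrated with a minimal sketch. The class ExtractionSketch and its Extract method are hypothetical helpers written for this illustration only; they are not part of Novacta.Analytics, they take the numerical column indexes as a plain int array rather than an IndexCollection, and they perform none of the validation that the documented method applies.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of Novacta.Analytics.
public static class ExtractionSketch
{
    // Reads delimited lines and keeps only the tokens found in the
    // numerical columns and in the target column.
    public static List<(double[] Values, string Target)> Extract(
        TextReader reader,
        char columnDelimiter,
        int[] numericalColumns,
        bool firstLineContainsVariableNames,
        int targetColumn,
        IFormatProvider provider)
    {
        var rows = new List<(double[] Values, string Target)>();

        // Skip the header line, if any.
        if (firstLineContainsVariableNames)
        {
            reader.ReadLine();
        }

        while (reader.ReadLine() is string line)
        {
            // Each token corresponds to a zero-based column.
            string[] tokens = line.Split(columnDelimiter);

            // Numerical tokens are parsed using the given format provider.
            double[] values = numericalColumns
                .Select(j => double.Parse(tokens[j], provider))
                .ToArray();

            // The target token is kept as a category label.
            rows.Add((values, tokens[targetColumn]));
        }

        return rows;
    }
}
```

For instance, calling Extract on a StringReader over "NUMERICAL,TARGET\n0,A\n1.5,B" with delimiter ',', numerical column 0, a header line, target column 1, and CultureInfo.InvariantCulture would yield two rows, the second holding the value 1.5 and the label B.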

Intra-Interval Entropy Minimization

By default, when encoding a CategoricalDataSet, tokens in a column are interpreted as category labels of the corresponding variable, which are inserted in the dataset as such. This behavior can be overridden by mapping a special Categorizer to a given column. A special categorizer can be useful if a given column refers to a numerical variable which must be discretized before its insertion in a categorical dataset. Following Fayyad and Irani (1993) [1], this method selects a categorizer by splitting the range of the numerical data into multiple intervals in order to minimize the intra-interval heterogeneity of the given target. A dictionary is returned in which, for each numerical column, the corresponding categorizer is inserted as a value keyed with the index of the given column.
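To make the idea of intra-interval entropy more concrete, the following sketch searches for a single cut point that minimizes the size-weighted entropy of the target labels in the two resulting intervals. It is only an illustration of the general principle: the class EntropySplitSketch and its methods are hypothetical and not part of Novacta.Analytics, the sketch performs one binary split with no recursion and no stopping criterion, and the library may apply additional criteria, so its cut points may differ from those found here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of Novacta.Analytics.
public static class EntropySplitSketch
{
    // Shannon entropy, in bits, of a collection of category labels.
    public static double Entropy(IEnumerable<string> labels)
    {
        var list = labels.ToList();
        if (list.Count == 0)
        {
            return 0.0;
        }
        return list
            .GroupBy(label => label)
            .Select(group => (double)group.Count() / list.Count)
            .Sum(p => -p * Math.Log(p, 2.0));
    }

    // Returns the cut point minimizing the weighted entropy of the target
    // labels in the intervals ]-Inf, cut] and ]cut, Inf[, or NaN if all
    // values are equal and no cut point exists.
    public static double BestCutPoint(
        IReadOnlyList<(double Value, string Target)> data)
    {
        var sorted = data.OrderBy(d => d.Value).ToList();
        double bestCut = double.NaN;
        double bestEntropy = double.PositiveInfinity;

        for (int i = 1; i < sorted.Count; i++)
        {
            // Candidate cuts lie midway between adjacent distinct values.
            if (sorted[i].Value == sorted[i - 1].Value)
            {
                continue;
            }
            double cut = (sorted[i - 1].Value + sorted[i].Value) / 2.0;

            // Entropy of the target within each candidate interval.
            double left = Entropy(sorted.Take(i).Select(d => d.Target));
            double right = Entropy(sorted.Skip(i).Select(d => d.Target));

            // Weight each interval by its number of instances.
            double weighted =
                (i * left + (sorted.Count - i) * right) / sorted.Count;

            if (weighted < bestEntropy)
            {
                bestEntropy = weighted;
                bestCut = cut;
            }
        }

        return bestCut;
    }
}
```

For the four observations (1, A), (2, A), (3, B), (4, B), BestCutPoint returns 2.5, since cutting there leaves each interval with a single target category and hence zero entropy.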

Example

In the following example, a stream contains two columns, the first corresponding to a numerical variable, and the second to a categorical one, which is interpreted as the target. A special categorizer, obtained by intra-interval entropy minimization, is assigned to the first column to discretize its data, then both columns are encoded in a categorical dataset.

Categorizing numerical data by intra-interval entropy minimization and subsequent encoding in a categorical dataset

using System;
using System.Globalization;
using System.IO;

namespace Novacta.Analytics.CodeExamples
{
    public class CategoricalEncodeExample1  
    {
        public void Main()
        {
            // Create a data stream.
            const int numberOfInstances = 27;
            string[] data = [
            "NUMERICAL,TARGET",
            "0,A",
            "0,A",
            "0,A",
            "1,B",
            "1,B",
            "1,B",
            "1,B",
            "2,B",
            "2,B",
            "3,C",
            "3,C",
            "3,C",
            "4,B",
            "4,B",
            "4,B",
            "4,C",
            "5,A",
            "5,A",
            "6,A",
            "7,C",
            "7,C",
            "7,C",
            "8,C",
            "8,C",
            "9,C",
            "9,C",
            "9,C" ];

            MemoryStream stream = new();
            StreamWriter writer = new(stream);
            for (int i = 0; i < data.Length; i++) {
                writer.WriteLine(data[i].ToCharArray());
                writer.Flush();
            }
            stream.Position = 0;

            // Identify the special categorizer for variable NUMERICAL.
            StreamReader streamReader = new(stream);
            char columnDelimiter = ',';
            IndexCollection numericalColumns = IndexCollection.Range(0, 0);
            bool firstLineContainsColumnHeaders = true;
            int targetColumn = 1;
            IFormatProvider provider = CultureInfo.InvariantCulture;
            var specialCategorizers = CategoricalDataSet.CategorizeByEntropyMinimization(
                streamReader,
                columnDelimiter,
                numericalColumns,
                firstLineContainsColumnHeaders,
                targetColumn,
                provider);

            // Encode the categorical data set using the special categorizer.
            stream.Position = 0;
            IndexCollection extractedColumns = IndexCollection.Range(0, 1);
            CategoricalDataSet dataset = CategoricalDataSet.Encode(
                streamReader,
                columnDelimiter,
                extractedColumns,
                firstLineContainsColumnHeaders,
                specialCategorizers,
                provider);

            // Decode and show the data set.
            Console.WriteLine("Decoded data set:");
            Console.WriteLine();
            var decodedDataSet = dataset.Decode();
            int numberOfVariables = dataset.Data.NumberOfColumns;

            foreach (var variable in dataset.Variables) {
                Console.Write(variable.Name + ",");
            }
            Console.WriteLine();

            for (int i = 0; i < numberOfInstances; i++) {
                for (int j = 0; j < numberOfVariables; j++) {
                    Console.Write(decodedDataSet[i][j] + ",");
                }
                Console.WriteLine();
            }
        }
    }
}

// Executing method Main() produces the following output:
// 
// Decoded data set:
// 
// NUMERICAL,TARGET,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,

Exceptions

ArgumentNullException	reader is null. -or- numericalColumns is null. -or- provider is null.
ArgumentOutOfRangeException	targetColumn is negative.
InvalidDataException	The stream accessed by reader contains no data rows. -or- At least one row does not contain enough data for the columns specified by targetColumn or numericalColumns. This can happen if columns are missing, if strings representing target category labels are null or consist only of white-space characters, or if strings representing numerical values cannot be converted to an equivalent double-precision floating-point number. In some cases, the InnerException property is set to add further details about the error.

Bibliography

[1] Fayyad, U.M. and Irani, K.B., Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, in: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022-1027. San Francisco, CA: Morgan Kaufmann. (1993), http://ijcai.org/Proceedings/93-2/Papers/022.pdf