CategoricalDataSet.Encode(TextReader, Char, IndexCollection, Boolean, Dictionary<Int32, Categorizer>, IFormatProvider) Method

Encodes categorical or numerical data from the stream underlying the specified text reader, applying specific numerical data categorizers.

Definition

Namespace: Novacta.Analytics
Assembly: Novacta.Analytics (in Novacta.Analytics.dll) Version: 2.1.0+428f3840cfab98dda567bb0ed350b302533e273a

C#

public static CategoricalDataSet Encode(
	TextReader reader,
	char columnDelimiter,
	IndexCollection extractedColumns,
	bool firstLineContainsVariableNames,
	Dictionary<int, Categorizer> specialCategorizers,
	IFormatProvider provider
)

VB

Public Shared Function Encode ( 
	reader As TextReader,
	columnDelimiter As Char,
	extractedColumns As IndexCollection,
	firstLineContainsVariableNames As Boolean,
	specialCategorizers As Dictionary(Of Integer, Categorizer),
	provider As IFormatProvider
) As CategoricalDataSet

C++

public:
static CategoricalDataSet^ Encode(
	TextReader^ reader, 
	wchar_t columnDelimiter, 
	IndexCollection^ extractedColumns, 
	bool firstLineContainsVariableNames, 
	Dictionary<int, Categorizer^>^ specialCategorizers, 
	IFormatProvider^ provider
)

F#

static member Encode : 
        reader : TextReader * 
        columnDelimiter : char * 
        extractedColumns : IndexCollection * 
        firstLineContainsVariableNames : bool * 
        specialCategorizers : Dictionary<int, Categorizer> * 
        provider : IFormatProvider -> CategoricalDataSet

Parameters

reader TextReader: The reader having access to the data stream.
columnDelimiter Char: The delimiter used to separate columns in data lines.
extractedColumns IndexCollection: The zero-based indexes of the columns from which data are to be extracted.
firstLineContainsVariableNames Boolean: If set to true, signals that the first line contains variable names.
specialCategorizers Dictionary<Int32, Categorizer>: A mapping from a subset of extracted column indexes to a set of categorizers, to be executed when extracting data from the corresponding columns.
provider IFormatProvider: An object that provides formatting information to parse numeric values.

Return Value

CategoricalDataSet
The dataset containing information about the streamed data.

Remarks

Data Extraction

Each line from the stream is interpreted as the information about variables observed at a given instance. A line is split into tokens, each corresponding to a (zero-based) column, which in turn stores the data of a given variable. Columns are assumed to be separated from each other by the character passed as columnDelimiter. Data from a variable are extracted only if the corresponding column index is in the collection extractedColumns.
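
The splitting and selection step described above can be pictured with a minimal sketch in plain C#. It is not the library's implementation: ExtractTokens is a hypothetical helper, a plain int[] stands in for IndexCollection, and throwing InvalidDataException for a missing column is only meant to mirror the behavior documented in the Exceptions section.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtractionSketch
{
    // Hypothetical helper (not part of Novacta.Analytics): splits a line
    // into tokens and keeps only those whose zero-based column index
    // is listed in extractedColumns.
    public static string[] ExtractTokens(
        string line, char columnDelimiter, int[] extractedColumns)
    {
        string[] tokens = line.Split(columnDelimiter);
        var extracted = new List<string>();
        foreach (int column in extractedColumns)
        {
            if (column >= tokens.Length)
            {
                // Assumption: a line lacking a requested column is an error.
                throw new InvalidDataException(
                    $"The line has no column {column}.");
            }
            extracted.Add(tokens[column]);
        }
        return extracted.ToArray();
    }

    public static void Main()
    {
        // Columns 0 and 1 are extracted; column 2 is skipped.
        string[] picked = ExtractTokens("Red,-2.2,ignored", ',', new[] { 0, 1 });
        Console.WriteLine(string.Join("|", picked)); // Red|-2.2
    }
}
```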

Special Categorization

By default, tokens in a column are interpreted as category labels of the corresponding variable, which is inserted in the dataset as such. This behavior can be overridden for a given column by inserting, in the dictionary specialCategorizers, a categorizer as a value keyed with the index of the column whose data are to be categorized. A special categorizer can be useful if a given column corresponds to a numerical variable which must be discretized before its insertion in the dataset. For categorizers obtained by entropy minimization, see, for example, CategorizeByEntropyMinimization(TextReader, Char, IndexCollection, Boolean, Int32, IFormatProvider).
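
As an illustration of the kind of discretization a special categorizer performs, the sketch below bins numeric tokens at a hand-picked cut point. The function BinAtCutPoint, the cut point 2.5, and the labels "Low" and "High" are all hypothetical; the function only follows the (token, provider) to label shape used by the local function in the second example on this page, and is exercised here without the library.

```csharp
using System;
using System.Globalization;

public static class CategorizerSketch
{
    // Hypothetical categorizer: maps a numeric token to a category label.
    // The cut point 2.5 is chosen by hand for illustration only.
    public static string BinAtCutPoint(string token, IFormatProvider provider)
    {
        double datum = Convert.ToDouble(token, provider);
        return datum <= 2.5 ? "Low" : "High";
    }

    public static void Main()
    {
        IFormatProvider provider = CultureInfo.InvariantCulture;
        foreach (string token in new[] { "0", "2.5", "7" })
        {
            Console.WriteLine($"{token} -> {BinAtCutPoint(token, provider)}");
        }
        // 0 -> Low
        // 2.5 -> Low
        // 7 -> High
    }
}
```

A function of this shape could presumably be stored in specialCategorizers under the index of the numerical column, in the same way the second example on this page registers its numberCategorizer function.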

Example

In the following example, a stream contains two columns, the first corresponding to a numerical variable, and the second to a categorical one, which is interpreted as the target. A special categorizer, obtained by intra interval entropy minimization, is assigned to the first column to discretize its data, then both columns are encoded in a categorical dataset.

Categorizing numerical data by intra interval entropy minimization and subsequent encoding in a categorical dataset

using System;
using System.Globalization;
using System.IO;

namespace Novacta.Analytics.CodeExamples
{
    public class CategoricalEncodeExample1  
    {
        public void Main()
        {
            // Create a data stream.
            const int numberOfInstances = 27;
            string[] data = [
            "NUMERICAL,TARGET",
            "0,A",
            "0,A",
            "0,A",
            "1,B",
            "1,B",
            "1,B",
            "1,B",
            "2,B",
            "2,B",
            "3,C",
            "3,C",
            "3,C",
            "4,B",
            "4,B",
            "4,B",
            "4,C",
            "5,A",
            "5,A",
            "6,A",
            "7,C",
            "7,C",
            "7,C",
            "8,C",
            "8,C",
            "9,C",
            "9,C",
            "9,C" ];

            MemoryStream stream = new();
            StreamWriter writer = new(stream);
            for (int i = 0; i < data.Length; i++) {
                writer.WriteLine(data[i].ToCharArray());
                writer.Flush();
            }
            stream.Position = 0;

            // Identify the special categorizer for variable NUMERICAL.
            StreamReader streamReader = new(stream);
            char columnDelimiter = ',';
            IndexCollection numericalColumns = IndexCollection.Range(0, 0);
            bool firstLineContainsColumnHeaders = true;
            int targetColumn = 1;
            IFormatProvider provider = CultureInfo.InvariantCulture;
            var specialCategorizers = CategoricalDataSet.CategorizeByEntropyMinimization(
                streamReader,
                columnDelimiter,
                numericalColumns,
                firstLineContainsColumnHeaders,
                targetColumn,
                provider);

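            // Encode the categorical data set using the special categorizer.
            // Rewind the stream and drop any data still buffered by the reader.
            stream.Position = 0;
            streamReader.DiscardBufferedData();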
            IndexCollection extractedColumns = IndexCollection.Range(0, 1);
            CategoricalDataSet dataset = CategoricalDataSet.Encode(
                streamReader,
                columnDelimiter,
                extractedColumns,
                firstLineContainsColumnHeaders,
                specialCategorizers,
                provider);

            // Decode and show the data set.
            Console.WriteLine("Decoded data set:");
            Console.WriteLine();
            var decodedDataSet = dataset.Decode();
            int numberOfVariables = dataset.Data.NumberOfColumns;

            foreach (var variable in dataset.Variables) {
                Console.Write(variable.Name + ",");
            }
            Console.WriteLine();

            for (int i = 0; i < numberOfInstances; i++) {
                for (int j = 0; j < numberOfVariables; j++) {
                    Console.Write(decodedDataSet[i][j] + ",");
                }
                Console.WriteLine();
            }
        }
    }
}

// Executing method Main() produces the following output:
// 
// Decoded data set:
// 
// NUMERICAL,TARGET,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,

In the following example, a data stream is read to encode a categorical dataset. The stream contains two columns, the first corresponding to a categorical variable, and the second to a numerical one. A special categorizer is assigned to the second column to discretize its data.

Encoding a categorical dataset from a stream containing both categorical and numerical data

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

namespace Novacta.Analytics.CodeExamples
{
    public class CategoricalEncodeExample0  
    {
        public void Main()
        {
            // Create a data stream.
            string[] data = [
            "COLOR,NUMBER",
            "Red,  -2.2",
            "Green, 0.0",
            "Red,  -3.3",
            "Black,-1.1",
            "Black, 4.4" ];

            MemoryStream stream = new();
            StreamWriter writer = new(stream);
            for (int i = 0; i < data.Length; i++) {
                writer.WriteLine(data[i].ToCharArray());
                writer.Flush();
            }
            stream.Position = 0;

            // Define a special categorizer for variable NUMBER
            // using a local function.
            static string numberCategorizer(string token, IFormatProvider provider)
            {
                double datum = Convert.ToDouble(token, provider);
                if (datum == 0)
                {
                    return "Zero";
                }
                else if (datum < 0)
                {
                    return "Negative";
                }
                else
                {
                    return "Positive";
                }
            }

            // Attach the special categorizer to variable NUMBER.
            int numberColumnIndex = 1;
            var specialCategorizers = new Dictionary<int, Categorizer>
            {
                { numberColumnIndex, numberCategorizer }
            };

            // Encode the categorical data set.
            StreamReader streamReader = new(stream);
            char columnDelimiter = ',';
            IndexCollection extractedColumns = IndexCollection.Range(0, 1);
            bool firstLineContainsColumnHeaders = true;
            CategoricalDataSet dataset = CategoricalDataSet.Encode(
                streamReader,
                columnDelimiter,
                extractedColumns,
                firstLineContainsColumnHeaders,
                specialCategorizers,
                CultureInfo.InvariantCulture);

            // Decode and show the data set.
            Console.WriteLine("Decoded data set:");
            Console.WriteLine();
            var decodedDataSet = dataset.Decode();
            int numberOfInstances = dataset.Data.NumberOfRows;
            int numberOfVariables = dataset.Data.NumberOfColumns;

            foreach (var variable in dataset.Variables) {
                Console.Write(variable.Name + ",");
            }
            Console.WriteLine();

            for (int i = 0; i < numberOfInstances; i++) {
                for (int j = 0; j < numberOfVariables; j++) {
                    Console.Write(decodedDataSet[i][j] + ",");
                }
                Console.WriteLine();
            }
        }
    }
}

// Executing method Main() produces the following output:
// 
// Decoded data set:
// 
// COLOR,NUMBER,
// Red,Negative,
// Green,Zero,
// Red,Negative,
// Black,Negative,
// Black,Positive,

Exceptions

ArgumentNullException	reader is null. -or- extractedColumns is null. -or- specialCategorizers is null. -or- provider is null.
ArgumentException	specialCategorizers contains null values or keys which are not in the extractedColumns collection.
InvalidDataException	There are no data rows in the stream accessed by reader. -or- At least one row does not contain enough data for the columns specified by extractedColumns. This can happen if columns are missing, or if strings representing variable names or category labels, i.e. tokens extracted from the stream or returned by a special categorizer, are null or consist only of white-space characters. In some cases, the InnerException property is set to provide further details about the error.
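
The ArgumentException condition listed above can be checked before calling Encode. The following is a minimal sketch of such a check, not the library's own validation: IsConsistent is a hypothetical helper, Func<string, IFormatProvider, string> stands in for the Categorizer delegate, and a plain collection of integers stands in for IndexCollection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ArgumentCheckSketch
{
    // Hypothetical pre-check: returns true only if every categorizer is
    // non-null and is keyed by the index of an extracted column.
    public static bool IsConsistent(
        IDictionary<int, Func<string, IFormatProvider, string>> specialCategorizers,
        ICollection<int> extractedColumns)
    {
        return specialCategorizers.All(pair =>
            pair.Value != null && extractedColumns.Contains(pair.Key));
    }

    public static void Main()
    {
        var extractedColumns = new List<int> { 0, 1 };

        // Key 1 is an extracted column: consistent.
        var valid = new Dictionary<int, Func<string, IFormatProvider, string>>
        {
            { 1, (token, provider) => token.Trim() }
        };

        // Key 5 is not an extracted column: inconsistent.
        var invalid = new Dictionary<int, Func<string, IFormatProvider, string>>
        {
            { 5, (token, provider) => token.Trim() }
        };

        Console.WriteLine(IsConsistent(valid, extractedColumns));   // True
        Console.WriteLine(IsConsistent(invalid, extractedColumns)); // False
    }
}
```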