CategoricalDataSet.Encode Method (TextReader, Char, IndexCollection, Boolean, Dictionary<Int32, Categorizer>, IFormatProvider)

Encodes categorical or numerical data from the stream underlying the specified text reader, applying the specified data categorizers.

Namespace:  Novacta.Analytics
Assembly:  Novacta.Analytics (in Novacta.Analytics.dll) Version: 2.0.0
Syntax
public static CategoricalDataSet Encode(
	TextReader reader,
	char columnDelimiter,
	IndexCollection extractedColumns,
	bool firstLineContainsVariableNames,
	Dictionary<int, Categorizer> specialCategorizers,
	IFormatProvider provider
)

Parameters

reader
Type: System.IO.TextReader
The reader having access to the data stream.
columnDelimiter
Type: System.Char
The delimiter used to separate columns in data lines.
extractedColumns
Type: Novacta.Analytics.IndexCollection
The zero-based indexes of the columns from which data are to be extracted.
firstLineContainsVariableNames
Type: System.Boolean
If set to true, signals that the first line contains variable names.
specialCategorizers
Type: System.Collections.Generic.Dictionary<Int32, Categorizer>
A mapping from a subset of extracted column indexes to a set of categorizers, to be executed when extracting data from the corresponding columns.
provider
Type: System.IFormatProvider
An object that provides formatting information to parse numeric values.

Return Value

Type: CategoricalDataSet
The dataset containing information about the streamed data.
Exceptions
ArgumentNullException
Condition: reader is null.
-or-
extractedColumns is null.
-or-
specialCategorizers is null.
-or-
provider is null.
ArgumentException
Condition: specialCategorizers contains null values or keys which are not in the extractedColumns collection.
InvalidDataException
Condition: There are no data rows in the stream accessed by reader.
-or-
At least one row does not contain enough data for the columns specified by extractedColumns. This can happen if there are missing columns, or if strings representing variable names or category labels, i.e. tokens extracted from the stream or returned by a special categorizer, are null or consist only of white-space characters. In some cases, the InnerException property is set to provide further details about the error.
Remarks

Data Extraction

Each line from the stream is interpreted as the information about the variables observed at a given instance. A line is split into tokens, each corresponding to a (zero-based) column, which in turn stores the data of a given variable. Columns are assumed to be separated from each other by the character passed as columnDelimiter. Data from a variable are extracted only if the corresponding column index is in the collection extractedColumns.
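
The extraction step described above can be sketched with plain .NET types. This is only an illustration of the splitting and column selection logic, not the library's implementation; the class DataExtractionSketch and the method ExtractTokens are hypothetical names, and an ordinary list of integers stands in for IndexCollection.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DataExtractionSketch
{
    // Splits each line into tokens and keeps only the tokens whose
    // zero-based column index is listed in extractedColumns.
    public static List<string[]> ExtractTokens(
        TextReader reader,
        char columnDelimiter,
        IReadOnlyList<int> extractedColumns)
    {
        var rows = new List<string[]>();

        while (reader.ReadLine() is string line)
        {
            string[] tokens = line.Split(columnDelimiter);
            var selected = new string[extractedColumns.Count];

            for (int j = 0; j < extractedColumns.Count; j++)
            {
                int column = extractedColumns[j];
                if (column >= tokens.Length)
                {
                    // Assumption: a missing column is reported in the same
                    // way as the documented InvalidDataException condition.
                    throw new InvalidDataException(
                        $"Line {rows.Count} has no column {column}.");
                }
                selected[j] = tokens[column];
            }

            rows.Add(selected);
        }

        return rows;
    }
}
```

For instance, reading the lines "COLOR,SIZE,NUMBER" and "Red,S,-2.2" with ',' as the delimiter and columns 0 and 2 selected yields the token pairs COLOR, NUMBER and Red, -2.2; the SIZE column is skipped.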

Special Categorization

By default, tokens in a column are interpreted as category labels of the corresponding variable, which is inserted in the dataset as such. This behavior can be overridden by mapping a special categorizer to a given column: insert the categorizer in the dictionary specialCategorizers as a value keyed with the index of the column whose data are to be categorized. A special categorizer can be useful if a given column corresponds to a numerical variable which must be discretized before its insertion in the dataset. For categorizers obtained by entropy minimization, see, for example, CategorizeByEntropyMinimization(TextReader, Char, IndexCollection, Boolean, Int32, IFormatProvider).
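
The dispatch between default and special categorization can be sketched as follows. To keep the sketch self-contained, it declares a stand-in delegate, TokenCategorizer, with the same shape as the categorizers used in the examples of this topic (a token and a format provider in, a label out). The names SpecialCategorizationSketch, CreateCutPointCategorizer and Categorize, as well as the labels "Low" and "High", are assumptions made for illustration; the sketch does not claim to reproduce how the library handles tokens internally.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SpecialCategorizationSketch
{
    // Stand-in delegate (an assumption of this sketch): maps a token
    // to a category label, given a format provider for parsing.
    public delegate string TokenCategorizer(string token, IFormatProvider provider);

    // Builds a categorizer that discretizes a numerical token into one
    // of two labels, depending on which side of cutPoint the value lies.
    public static TokenCategorizer CreateCutPointCategorizer(double cutPoint)
    {
        return (token, provider) =>
        {
            double datum = Convert.ToDouble(token, provider);
            return datum <= cutPoint ? "Low" : "High";
        };
    }

    // Uses the token itself as the label unless a special categorizer
    // is keyed with the index of the column the token comes from.
    public static string Categorize(
        int columnIndex,
        string token,
        IReadOnlyDictionary<int, TokenCategorizer> specialCategorizers,
        IFormatProvider provider)
    {
        return specialCategorizers.TryGetValue(columnIndex, out var categorizer)
            ? categorizer(token, provider)
            : token;
    }
}
```

With a cut point categorizer keyed with column index 1, a token from column 1 is replaced by "Low" or "High", while a token from column 0 is returned unchanged as its own label.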

Examples

In the following example, a stream contains two columns, the first corresponding to a numerical variable, and the second to a categorical one, which is interpreted as the target. A special categorizer, obtained by intra interval entropy minimization, is assigned to the first column to discretize its data, then both columns are encoded in a categorical dataset.

Categorizing numerical data by intra interval entropy minimization and subsequent encoding in a categorical dataset
using System;
using System.Globalization;
using System.IO;

namespace Novacta.Analytics.CodeExamples
{
    public class CategoricalEncodeExample1  
    {
        public void Main()
        {
            // Create a data stream.
            const int numberOfInstances = 27;
            string[] data = new string[numberOfInstances + 1] {
            "NUMERICAL,TARGET",
            "0,A",
            "0,A",
            "0,A",
            "1,B",
            "1,B",
            "1,B",
            "1,B",
            "2,B",
            "2,B",
            "3,C",
            "3,C",
            "3,C",
            "4,B",
            "4,B",
            "4,B",
            "4,C",
            "5,A",
            "5,A",
            "6,A",
            "7,C",
            "7,C",
            "7,C",
            "8,C",
            "8,C",
            "9,C",
            "9,C",
            "9,C" };

            // Write the lines to an in-memory stream, then rewind it for reading.
            MemoryStream stream = new();
            StreamWriter writer = new(stream);
            for (int i = 0; i < data.Length; i++) {
                writer.WriteLine(data[i]);
            }
            writer.Flush();
            stream.Position = 0;

            // Identify the special categorizer for variable NUMERICAL.
            StreamReader streamReader = new(stream);
            char columnDelimiter = ',';
            IndexCollection numericalColumns = IndexCollection.Range(0, 0);
            bool firstLineContainsColumnHeaders = true;
            int targetColumn = 1;
            IFormatProvider provider = CultureInfo.InvariantCulture;
            var specialCategorizers = CategoricalDataSet.CategorizeByEntropyMinimization(
                streamReader,
                columnDelimiter,
                numericalColumns,
                firstLineContainsColumnHeaders,
                targetColumn,
                provider);

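            // Encode the categorical data set using the special categorizer.
            // The stream is rewound and the reader's buffer is discarded so
            // that the same reader can scan the data a second time.
            stream.Position = 0;
            streamReader.DiscardBufferedData();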
            IndexCollection extractedColumns = IndexCollection.Range(0, 1);
            CategoricalDataSet dataset = CategoricalDataSet.Encode(
                streamReader,
                columnDelimiter,
                extractedColumns,
                firstLineContainsColumnHeaders,
                specialCategorizers,
                provider);

            // Decode and show the data set.
            Console.WriteLine("Decoded data set:");
            Console.WriteLine();
            var decodedDataSet = dataset.Decode();
            int numberOfVariables = dataset.Data.NumberOfColumns;

            foreach (var variable in dataset.Variables) {
                Console.Write(variable.Name + ",");
            }
            Console.WriteLine();

            for (int i = 0; i < numberOfInstances; i++) {
                for (int j = 0; j < numberOfVariables; j++) {
                    Console.Write(decodedDataSet[i][j] + ",");
                }
                Console.WriteLine();
            }
        }
    }
}

// Executing method Main() produces the following output:
// 
// Decoded data set:
// 
// NUMERICAL,TARGET,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],A,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]-Inf, 2.5],B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,B,
// ]2.5, Inf[,C,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,A,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,
// ]2.5, Inf[,C,

In the following example, a data stream is read to encode a categorical dataset. The stream contains two columns, the first corresponding to a categorical variable, and the second to a numerical one. A special categorizer is assigned to the second column to discretize its data.

Encoding a categorical dataset from a stream containing both categorical and numerical data
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

namespace Novacta.Analytics.CodeExamples
{
    public class CategoricalEncodeExample0  
    {
        public void Main()
        {
            // Create a data stream.
            string[] data = new string[6] {
            "COLOR,NUMBER",
            "Red,  -2.2",
            "Green, 0.0",
            "Red,  -3.3",
            "Black,-1.1",
            "Black, 4.4" };

            // Write the lines to an in-memory stream, then rewind it for reading.
            MemoryStream stream = new();
            StreamWriter writer = new(stream);
            for (int i = 0; i < data.Length; i++) {
                writer.WriteLine(data[i]);
            }
            writer.Flush();
            stream.Position = 0;

            // Define a special categorizer for variable NUMBER
            // using a local function.
            static string numberCategorizer(string token, IFormatProvider provider)
            {
                double datum = Convert.ToDouble(token, provider);
                if (datum == 0)
                {
                    return "Zero";
                }
                else if (datum < 0)
                {
                    return "Negative";
                }
                else
                {
                    return "Positive";
                }
            }

            // Attach the special categorizer to variable NUMBER.
            int numberColumnIndex = 1;
            var specialCategorizers = new Dictionary<int, Categorizer>
            {
                { numberColumnIndex, numberCategorizer }
            };

            // Encode the categorical data set.
            StreamReader streamReader = new(stream);
            char columnDelimiter = ',';
            IndexCollection extractedColumns = IndexCollection.Range(0, 1);
            bool firstLineContainsColumnHeaders = true;
            CategoricalDataSet dataset = CategoricalDataSet.Encode(
                streamReader,
                columnDelimiter,
                extractedColumns,
                firstLineContainsColumnHeaders,
                specialCategorizers,
                CultureInfo.InvariantCulture);

            // Decode and show the data set.
            Console.WriteLine("Decoded data set:");
            Console.WriteLine();
            var decodedDataSet = dataset.Decode();
            int numberOfInstances = dataset.Data.NumberOfRows;
            int numberOfVariables = dataset.Data.NumberOfColumns;

            foreach (var variable in dataset.Variables) {
                Console.Write(variable.Name + ",");
            }
            Console.WriteLine();

            for (int i = 0; i < numberOfInstances; i++) {
                for (int j = 0; j < numberOfVariables; j++) {
                    Console.Write(decodedDataSet[i][j] + ",");
                }
                Console.WriteLine();
            }
        }
    }
}

// Executing method Main() produces the following output:
// 
// Decoded data set:
// 
// COLOR,NUMBER,
// Red,Negative,
// Green,Zero,
// Red,Negative,
// Black,Negative,
// Black,Positive,

See Also