DoubleMatrixEncode(String, Char, IndexCollection, Boolean, DictionaryInt32, Codifier, IFormatProvider) Method

Encodes categorical or numerical data from the given file into a matrix, applying specific categorical data codifiers.

Definition

Namespace: Novacta.Analytics
Assembly: Novacta.Analytics (in Novacta.Analytics.dll) Version: 2.1.0+428f3840cfab98dda567bb0ed350b302533e273a
C#
public static DoubleMatrix Encode(
	string path,
	char columnDelimiter,
	IndexCollection extractedColumns,
	bool firstLineContainsVariableNames,
	Dictionary<int, Codifier> specialCodifiers,
	IFormatProvider provider
)

Parameters

path  String
The data file to be opened for reading.
columnDelimiter  Char
The delimiter used to separate columns in data lines.
extractedColumns  IndexCollection
The zero-based indexes of the columns from which data are to be extracted.
firstLineContainsVariableNames  Boolean
If set to true signals that the first line contains variable names.
specialCodifiers  DictionaryInt32, Codifier
A mapping from a subset of extracted column indexes to a set of codifiers, to be executed when extracting data from the corresponding columns.
provider  IFormatProvider
An object that provides formatting information.

Return Value

DoubleMatrix
The matrix containing information about the streamed data.

Remarks

Data Extraction

Each line from the stream is interpreted as the information about variables observed at a given instance. A line is split in tokens, each corresponding to a (zero-based) column, which in turn stores the data of a given variable. Columns are assumed to be separated each other by the character passed as columnDelimiter. Data from a variable are extracted only if the corresponding column index is in the collection extractedColumns.

Special Codification

By default, tokens in a column are interpreted as numerical data, which are inserted in the matrix using ToDouble(String, IFormatProvider). This behavior can be overridden by mapping a special codifier to a given column by inserting, in the dictionary specialCodifiers, the codifier as a value keyed with the index of the column whose data are to be transformed. A special codifier can be useful if a given column corresponds to a categorical variable whose labels must be represented via numerical codes in the returned matrix.

Example

In the following example, a stream is read to encode data into a matrix. The stream contains two columns, the first corresponding to a variable representing time instants, and the second to a numerical one. A special codifier is assigned to the first column to define codes for time representations.

Encoding data into a matrix from a stream using a special codifier
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

namespace Novacta.Analytics.CodeExamples
{
    public class MatrixEncodeExample0  
    {
        public void Main()
        {
            // Create a data stream.
            string[] data = [
            "TIME,NUMBER",
            "20200410 09:42:00.000 +00:00, -2.2",
            "20210511 11:51:00.010 +00:00,  0.0",
            "20220612 15:11:31.200 +00:00, -3.3",
            "20230713 17:32:10.749 +00:00, -1.1",
            "20240814 09:42:00.150 +00:00,  4.4" ];

            MemoryStream stream = new();
            StreamWriter writer = new(stream);
            for (int i = 0; i < data.Length; i++)
            {
                writer.WriteLine(data[i].ToCharArray());
                writer.Flush();
            }
            stream.Position = 0;

            // Define a special codifier for variable TIME
            // using a local function.
            static double timeCodifier(string token, IFormatProvider provider)
            {
                double datum = DateTimeOffset.ParseExact(
                    input: token,
                    format: "yyyyMMdd HH:mm:ss.fff zzz",
                    formatProvider: provider).ToUnixTimeMilliseconds();

                return datum;
            }

            // Attach the special codifier to variable TIME.
            int numberColumnIndex = 0;
            var specialCodifiers = new Dictionary<int, Codifier>
            {
                { numberColumnIndex, timeCodifier }
            };

            // Encode the matrix.
            StreamReader streamReader = new(stream);
            char columnDelimiter = ',';
            IndexCollection extractedColumns = IndexCollection.Range(0, 1);
            bool firstLineContainsColumnHeaders = true;
            DoubleMatrix matrix = DoubleMatrix.Encode(
                streamReader,
                columnDelimiter,
                extractedColumns,
                firstLineContainsColumnHeaders,
                specialCodifiers,
                CultureInfo.InvariantCulture);

            // Show the matrix.
            Console.WriteLine("Encoded matrix:");
            Console.WriteLine();
            Console.Write(matrix);
            Console.WriteLine();

            // Decode variable TIME.
            Console.WriteLine("Decoded variable TIME:");
            Console.WriteLine();
            var time = matrix[":", 0];
            for (int i = 0; i < time.Count; i++)
            {
                Console.WriteLine(
                    "Time {0}: {1}",
                    i,
                    DateTimeOffset
                        .FromUnixTimeMilliseconds(Convert.ToInt64(time[i]))
                        .ToString("yyyyMMdd HH:mm:ss.fff zzz"));
            }
        }
    }
}

// Executing method Main() produces the following output:
// 
// Encoded matrix:
// 
// [TIME]           [NUMBER]         
// 1.58651172e+12   -2.2             
// 1.62073386e+12   0                
// 1.65504669e+12   -3.3             
// 1.68926953e+12   -1.1             
// 1.72362852e+12   4.4              
// 
// 
// Decoded variable TIME:
// 
// Time 0: 20200410 09:42:00.000 +00:00
// Time 1: 20210511 11:51:00.010 +00:00
// Time 2: 20220612 15:11:31.200 +00:00
// Time 3: 20230713 17:32:10.749 +00:00
// Time 4: 20240814 09:42:00.150 +00:00

Exceptions

ArgumentNullExceptionpath is null.
-or-
extractedColumns is null.
-or-
specialCodifiers is null.
-or-
provider is null.
ArgumentExceptionspecialCodifiers contains null values or keys which are not in the extractedColumns collection.
-or-
path is an empty string ("").
FileNotFoundException The file cannot be found.
DirectoryNotFoundException The specified path is invalid, such as being on an unmapped drive.
IOExceptionpath includes an incorrect or invalid syntax for file name, directory name, or volume label.
InvalidDataException The stream to file having the specified path contains no data rows.
-or-
There is at least a row which contains not enough data for any column specified by extractedColumns. This can happen if there are missing columns, or if tokens extracted from the stream are null or consist only of white-space characters.

See Also