Description
System Information (please complete the following information):
- OS & Version: Windows 11
- ML.NET Version: ML.NET v4.0.2 & Auto ML.NET v0.22.2
- .NET Version: .NET 10
Describe the bug
When a transform such as DropColumns or CustomMapping is passed as the preFeaturizer, its output is not taken into account as the new schema. This causes schema validation errors without the transforms even being applied. For instance, the following code works when it is applied to the data outside of the experiment, but when the same estimator is passed as the preFeaturizer, it fails saying it cannot find the column. If a custom mapper is used, the reverse problem happens: the output schema is not seen and the column's type is rejected.
```csharp
var transformedData = ctx.Transforms.DropColumns([HybridClassifierInputModel.imageSource]).Fit(fullData).Transform(fullData);
```
This also results in errors after training, because the transform has to be applied manually before the predictor.
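A minimal sketch of the manual workaround described above, under stated assumptions: `Row` is a hypothetical stand-in for `HybridClassifierInputModel`, and a simple `CopyColumns` transformer stands in for the trained AutoML model (in a real run this would be `result.BestRun.Model`). It only illustrates fitting the drop outside the experiment and chaining it in front of the model at prediction time; it does not address the preFeaturizer behavior itself.

```csharp
using System;
using Microsoft.ML;

// Hypothetical input type standing in for HybridClassifierInputModel.
public class Row
{
    public string Text { get; set; } = "";
    public string ImageSource { get; set; } = "";
    public string Target { get; set; } = "";
}

public static class Program
{
    public static void Main()
    {
        var ctx = new MLContext(seed: 0);
        IDataView fullData = ctx.Data.LoadFromEnumerable(new[]
        {
            new Row { Text = "a", ImageSource = "a.png", Target = "x" },
            new Row { Text = "b", ImageSource = "b.png", Target = "y" },
        });

        // Fit the drop outside the experiment and keep the fitted transformer.
        ITransformer dropTransformer = ctx.Transforms
            .DropColumns(nameof(Row.ImageSource))
            .Fit(fullData);
        IDataView transformedData = dropTransformer.Transform(fullData);

        // The dropped column should no longer appear in the schema.
        bool dropped = transformedData.Schema.GetColumnOrNull(nameof(Row.ImageSource)) == null;
        Console.WriteLine($"ImageSource dropped: {dropped}");

        // Stand-in for the trained model (assumption: result.BestRun.Model in a real run).
        ITransformer trainedModel = ctx.Transforms
            .CopyColumns("TextCopy", nameof(Row.Text))
            .Fit(transformedData);

        // Chain the drop in front of the model so raw rows can be scored directly.
        ITransformer fullModel = dropTransformer.Append(trainedModel);
        IDataView scored = fullModel.Transform(fullData);
        Console.WriteLine($"Columns after chaining: {scored.Schema.Count}");
    }
}
```

Keeping the fitted `dropTransformer` around is the important part: the experiment only ever sees `transformedData`, so the resulting model expects the reduced schema and needs the same transform applied first.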
To Reproduce
Steps to reproduce the behavior:
- Create an IDataView with a schema
- Set up an experiment using MulticlassExperimentSettings
- Try any preFeaturizer that drops a column or otherwise changes the schema significantly. DropColumns and SelectColumns are perfect examples
Expected behavior
I would expect the transformer in preFeaturizer to be applied before the validation. This would allow for column drops, and for working with data that must be massaged or that is used in multiple ways.
Screenshots, Code, Sample Projects
MulticlassExperimentSettings textModelSettings = new MulticlassExperimentSettings()
{
    OptimizingMetric = OptimizingMetric,
    MaxExperimentTimeInSeconds = maxTrainTimeInSeconds,
    //CacheBeforeTrainer = CacheBeforeTrainer.On,
    CacheDirectoryName = Environment.CurrentDirectory, // Skip the disk and store in-memory
};
// Working alternative: apply the transform up front instead of passing it as preFeaturizer.
//var transformedData = ctx.Transforms.DropColumns([HybridClassifierInputModel.imageSource]).Fit(fullData).Transform(fullData);
MulticlassClassificationExperiment experiment = ctx.Auto().CreateMulticlassClassificationExperiment(textModelSettings);
// With preFeaturizer in use, the untransformed data is split and passed to the experiment.
TrainTestData trainValidationData = ctx.Data.TrainTestSplit(ctx.Data.ShuffleRows(fullData), testFraction: 0.2);
ExperimentResult<MulticlassClassificationMetrics> result = experiment.Execute(
    trainData: trainValidationData.TrainSet,
    //preFeaturizer: ctx.Transforms.CustomMapping<HybridClassifierInputModel, TextClassifierInputModel>(HybridToTextCustomAction.CustomAction, nameof(HybridToTextCustomAction), outputSchemaDefinition: SchemaDefinition.Create(typeof(TextClassifierInputModel))),
    preFeaturizer: ctx.Transforms.DropColumns([HybridClassifierInputModel.imageSource]),
    validationData: trainValidationData.TestSet,
    labelColumnName: HybridClassifierInputModel.target,
    progressHandler: new TextCPUMlClassifierProgressHandler<IHybridMlClassifierService>(Logger));
Additional context
I would have liked to use the same data to train more than one model, but each model needs some small changes to the data. I have a custom IDataView that attaches to a DbLite. It would be good if the preFeaturizer worked, so that the data could be streamed.