**These should all be implemented with #1031**
=========================================
This issue tracks implementation of the ML features: https://spark.apache.org/docs/latest/ml-features
Bucketizer has been implemented in #378, but more features still need to be implemented.
- Feature Extractors
  - TF-IDF
  - Word2Vec (Implement ML Features: Word2Vec #491)
  - CountVectorizer (Implement ML/CountVectorizer and ML/CountVectorizerModel #608)
  - FeatureHasher (FeatureHasher #652)
- Feature Transformers
  - Tokenizer (base class for Feature as lots of methods are shared between the objects (more methods to be added in later pr's) #574)
  - StopWordsRemover (Add stop words removers #726 thanks @SARAVANA1501)
  - n-gram (in-progress Add NGram #734)
  - Binarizer (in-progress Add Binarizer #744)
  - PCA (in-progress)
  - PolynomialExpansion
  - Discrete Cosine Transform (DCT)
  - StringIndexer (in-progress)
  - IndexToString
  - OneHotEncoderEstimator
  - VectorIndexer
  - Normalizer
  - StandardScaler
  - MinMaxScaler
  - MaxAbsScaler
  - Bucketizer
  - ElementwiseProduct
  - SQLTransformer (Implement ML Features #381. SQLTransformer class and testcase #781 @ramanathanv)
  - VectorAssembler
  - VectorSizeHint
  - QuantileDiscretizer
  - Imputer
- Feature Selectors
  - VectorSlicer
  - RFormula
  - ChiSqSelector
- Locality Sensitive Hashing
  - LSH Operations
    - Feature Transformation
    - Approximate Similarity Join
    - Approximate Nearest Neighbour Search
  - LSH Algorithms
    - Bucketed Random Projection for Euclidean Distance
    - MinHash for Jaccard Distance
If anyone else is going to implement one of these, it's probably best to leave a comment here, and I'll keep the list up to date.