Algorithms

A suite of algorithms available as operators within our Pipeline tool.

BASE:
MLlib: RDD-Based
CATEGORY:
Classification/Regression
SUBCATEGORY:
Linear
PYSPARK NAME:
SVMWithSGD
INPUT:
LabeledPoint, weights, intercept, iterations

BASE:
MLlib: RDD-Based
CATEGORY:
Classification/Regression
SUBCATEGORY:
Linear
PYSPARK NAME:
LogisticRegressionWithLBFGS / LogisticRegressionWithSGD
INPUT:
LabeledPoint, weights, intercept, iterations

BASE:
MLlib: RDD-Based
CATEGORY:
Classification/Regression
SUBCATEGORY:
Linear
PYSPARK NAME:
LinearRegressionWithSGD
INPUT:
LabeledPoint, initialWeights, regParam, regType

BASE:
MLlib: RDD-Based
CATEGORY:
Classification/Regression
SUBCATEGORY:
Linear
PYSPARK NAME:
RidgeRegressionWithSGD
INPUT:
LabeledPoint, initialWeights, regParam, regType

BASE:
MLlib: RDD-Based
CATEGORY:
Classification/Regression
SUBCATEGORY:
Linear
PYSPARK NAME:
LassoWithSGD
INPUT:
LabeledPoint, initialWeights, regParam, regType

BASE:
MLlib: RDD-Based
CATEGORY:
Classification/Regression
SUBCATEGORY:
Linear
PYSPARK NAME:
IsotonicRegressionModel
INPUT:
boundaries(LabeledPoint), predictions, isotonic

BASE:
MLlib: RDD-Based
CATEGORY:
Classification/Regression
SUBCATEGORY:
Linear
PYSPARK NAME:
StreamingLinearRegressionWithSGD
INPUT:
LabeledPoint, stepSize, numIterations, miniBatchFraction, convergenceTol*

BASE:
MLlib: RDD-Based
CATEGORY:
Collaborative filtering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
ALS
INPUT:
Ratings, rank, nonnegative

BASE:
MLlib: RDD-Based
CATEGORY:
Clustering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
KMeans
INPUT:
RDD, k, maxIterations, epsilon

BASE:
MLlib: RDD-Based
CATEGORY:
Clustering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
GaussianMixture
INPUT:
RDD, k, convergenceTol

BASE:
MLlib: RDD-Based
CATEGORY:
Clustering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
PowerIterationClustering
INPUT:
RDD, k

BASE:
MLlib: RDD-Based
CATEGORY:
Clustering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
LDA
INPUT:
RDD, k, docConcentration, topicConcentration, checkpointInterval, optimizer

BASE:
MLlib: RDD-Based
CATEGORY:
Clustering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
NaiveBayesModel
INPUT:
LabeledPoint, pi, theta

BASE:
MLlib: RDD-Based
CATEGORY:
Hierarchical clustering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
BisectingKMeans
INPUT:
RDD, k, minDivisibleClusterSize

BASE:
MLlib: RDD-Based
CATEGORY:
Dimensionality reduction
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
RowMatrix.computeSVD
INPUT:
k, computeU, rCond

BASE:
MLlib: RDD-Based
CATEGORY:
Clustering
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
StreamingKMeans
INPUT:
LabeledPoint, k, decayFactor, timeUnit

BASE:
MLlib: RDD-Based
CATEGORY:
Dimensionality reduction
SUBCATEGORY:
Nonlinear
PYSPARK NAME:
RowMatrix.computePrincipalComponents
INPUT:
k

BASE:
MLlib: RDD-Based
CATEGORY:
Frequent Pattern Mining
SUBCATEGORY:
Data mining
PYSPARK NAME:
FPGrowth
INPUT:
RDD, minSupport, numPartitions

BASE:
MLlib: RDD-Based
CATEGORY:
Frequent Pattern Mining
SUBCATEGORY:
Data mining
PYSPARK NAME:
PrefixSpan
INPUT:
RDD, minSupport, maxPatternLength, maxLocalProjDBSize

BASE:
MLlib: RDD-Based
CATEGORY:
Classification / Regression
SUBCATEGORY:
Tree
PYSPARK NAME:
DecisionTreeModel
INPUT:
LabeledPoint, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins

BASE:
MLlib: RDD-Based
CATEGORY:
Classification / Regression
SUBCATEGORY:
Tree
PYSPARK NAME:
RandomForestModel
INPUT:
LabeledPoint, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins

BASE:
MLlib: RDD-Based
CATEGORY:
Classification / Regression
SUBCATEGORY:
Tree
PYSPARK NAME:
GradientBoostedTreesModel
INPUT:
LabeledPoint, categoricalFeaturesInfo, loss, numIterations, learningRate, maxDepth, maxBins

Usage

Base
Category
Subcategory
PySpark name
Input

SVM

MLlib: RDD-Based
Classification / Regression
Linear
SVMWithSGD
LabeledPoint, weights, intercept, iterations

Logistic Regression

MLlib: RDD-Based
Classification / Regression
Linear
LogisticRegressionWithLBFGS / LogisticRegressionWithSGD
LabeledPoint, weights, intercept, iterations

Linear Regression

MLlib: RDD-Based
Classification / Regression
Linear
LinearRegressionWithSGD
LabeledPoint, initialWeights, regParam, regType

Ridge Regression

MLlib: RDD-Based
Classification / Regression
Linear
RidgeRegressionWithSGD
LabeledPoint, initialWeights, regParam, regType

Lasso Regression

MLlib: RDD-Based
Classification / Regression
Linear
LassoWithSGD
LabeledPoint, initialWeights, regParam, regType

Isotonic Regression

MLlib: RDD-Based
Classification / Regression
Linear
IsotonicRegressionModel
boundaries(LabeledPoint), predictions, isotonic

Linear regression on streaming data

MLlib: RDD-Based
Classification / Regression
Linear
StreamingLinearRegressionWithSGD
LabeledPoint, stepSize, numIterations, miniBatchFraction, convergenceTol*

Alternating least squares (ALS)

MLlib: RDD-Based
Collaborative filtering
Nonlinear
ALS
Ratings, rank, nonnegative

KMeans

MLlib: RDD-Based
Clustering
Nonlinear
KMeans
RDD, k, maxIterations, epsilon

Gaussian Mixture

MLlib: RDD-Based
Clustering
Nonlinear
GaussianMixture
RDD, k, convergenceTol

Power iteration clustering (PIC)

MLlib: RDD-Based
Clustering
Nonlinear
PowerIterationClustering
RDD, k

Latent Dirichlet allocation (LDA)

MLlib: RDD-Based
Clustering
Nonlinear
LDA
RDD, k, docConcentration, topicConcentration, checkpointInterval, optimizer

Naive Bayes

MLlib: RDD-Based
Classification / Regression
Nonlinear
NaiveBayesModel
LabeledPoint, pi, theta

Bisecting K-means

MLlib: RDD-Based
Hierarchical clustering
Nonlinear
BisectingKMeans
RDD, k, minDivisibleClusterSize

Streaming k-means

MLlib: RDD-Based
Clustering
Nonlinear
StreamingKMeans
LabeledPoint, k, decayFactor, timeUnit

Singular value decomposition (SVD)

MLlib: RDD-Based
Dimensionality reduction
Nonlinear
RowMatrix.computeSVD
k, computeU, rCond

Principal component analysis (PCA)

MLlib: RDD-Based
Dimensionality reduction
Nonlinear
RowMatrix.computePrincipalComponents
k

FP-growth

MLlib: RDD-Based
Frequent Pattern Mining
Data mining
FPGrowth
RDD, minSupport, numPartitions

Prefix span

MLlib: RDD-Based
Frequent Pattern Mining
Data mining
PrefixSpan
RDD, minSupport, maxPatternLength, maxLocalProjDBSize

Decision tree

MLlib: RDD-Based
Classification / Regression
Tree
DecisionTreeModel
LabeledPoint, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins

Random forest

MLlib: RDD-Based
Classification / Regression
Tree
RandomForestModel
LabeledPoint, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins

Gradient boost

MLlib: RDD-Based
Classification / Regression
Tree
GradientBoostedTreesModel
LabeledPoint, categoricalFeaturesInfo, loss, numIterations, learningRate, maxDepth, maxBins

*RDD-based data is not the same as DataFrames; for high-performance computing, RDDs are recommended within Spark.
RDDs are what the MLlib library uses. In Python, import the algorithms from the pyspark.mllib package, for example: from pyspark.mllib.classification import SVMWithSGD

MLlib (pyspark.mllib): RDD-based   /   ML (pyspark.ml): DataFrame-based
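
As an illustration of the import note above, here is a minimal sketch of training one of the listed RDD-based algorithms (SVMWithSGD) from Python. The helper names parse_line, train_svm and main, the "label,f1 f2" text format, and the sample values are assumptions made for this example and are not part of the Pipeline tool. The Spark imports sit inside the functions only so that the parsing helper can be used without a Spark installation.

```python
def parse_line(line):
    """Parse a hypothetical 'label,f1 f2 ...' line into (label, [features])."""
    label, features = line.split(",")
    return float(label), [float(x) for x in features.split()]


def train_svm(sc, lines, iterations=100):
    """Train an RDD-based SVM on text lines; `sc` is an existing SparkContext."""
    # RDD-based MLlib classes live under pyspark.mllib.
    from pyspark.mllib.classification import SVMWithSGD
    from pyspark.mllib.regression import LabeledPoint

    # Build an RDD of LabeledPoint, the input type listed for SVMWithSGD.
    points = (
        sc.parallelize(lines)
        .map(parse_line)
        .map(lambda lf: LabeledPoint(lf[0], lf[1]))
    )
    return SVMWithSGD.train(points, iterations=iterations)


def main():
    """Example entry point; needs a working Spark installation."""
    from pyspark import SparkContext

    sc = SparkContext(appName="svm-sketch")
    try:
        # Made-up sample data: two points per class.
        lines = ["1,2.0 1.5", "0,-1.0 -0.5", "1,1.8 2.2", "0,-1.5 -0.2"]
        model = train_svm(sc, lines)
        print(model.predict([2.0, 2.0]))
    finally:
        sc.stop()
```

Calling main() in an environment where pyspark is installed trains the model on the four sample points and prints a predicted label. The other RDD-based entries in the table follow the same pattern: import the class from its pyspark.mllib submodule and pass it an RDD of the listed input type.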