The clusteror.core Module¶
This module contains the Clusteror class, which encapsulates the raw data to discover clusters from and the cleaned data for a clusteror to run on.
The clustering model encompasses two parts:
- Neural network: Pre-training (often encountered in a Deep Learning context) is implemented so that the neural network maps the input data of higher dimension to a one dimensional representation. Ideally this mapping is one-to-one. A Denoising Autoencoder (DA) or Stacked Denoising Autoencoder (SDA) is implemented for this purpose.
- One dimensional clustering model: A separate model segments the samples against the one dimensional representation. Two models are available in this class definition:
  - K-Means
  - Valley model
The pivotal idea here is that, given the neural network is a good one-to-one mapper, a clustering model on the one dimensional representation is equivalent to a clustering model on the original high dimensional data.
Note
The valley model is explained in detail in module clusteror.utils.
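To make the pivotal idea concrete, here is a small self-contained sketch in plain NumPy. The fixed tanh projection is a hypothetical stand-in for the trained autoencoder: when the one dimensional mapping keeps groups separated, a trivial 1-D tagger recovers the clusters of the high dimensional data.

```python
import numpy as np

rng = np.random.RandomState(0)
# Two well-separated 2-D groups standing in for high dimensional data.
group_a = rng.normal(loc=[-2.0, -2.0], scale=0.1, size=(50, 2))
group_b = rng.normal(loc=[2.0, 2.0], scale=0.1, size=(100, 2))
data = np.vstack([group_a, group_b])

# Stand-in for the trained neural network: tanh of a linear projection,
# so the one dimensional image lies in [-1, 1].
one_dim = np.tanh(data @ np.array([0.5, 0.5]))

# A trivial one dimensional "tagger": split at zero.
clusters = (one_dim > 0).astype(int)
```

Clustering `one_dim` here yields the same grouping as clustering `data` directly, which is the equivalence the module relies on.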
-
class clusteror.core.Clusteror(raw_data)[source]¶
Bases: object
The Clusteror class can train neural networks (DA or SDA), train taggers, and load saved models from files.
Parameters: raw_data (Pandas DataFrame) – Dataframe read from the data source. It can be the original dataset without any preprocessing, or one with a certain level of manipulation for future analysis.
-
_raw_data¶
Pandas DataFrame – Stores the original dataset. It is the dataset on which later post-clustering performance analysis will be based.
-
_cleaned_data¶
Pandas DataFrame – Preprocessed data. It does not necessarily have the same number of columns as _raw_data, as a categorical column can derive multiple columns. Because the tanh function is used as the activation function for symmetry considerations, all columns should have values in the range [-1, 1]; otherwise an OutRangeError will be raised.
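Because of this range requirement, data cleaning typically ends with a scaling step. The helper below is hypothetical (not part of clusteror) and sketches min-max scaling of every numeric column into [-1, 1]:

```python
import pandas as pd

def scale_to_unit_interval(df):
    # Hypothetical helper: min-max scale each column into [-1, 1],
    # the range _cleaned_data must satisfy before training.
    scaled = df.copy()
    for col in scaled.columns:
        lo, hi = scaled[col].min(), scaled[col].max()
        if hi == lo:
            scaled[col] = 0.0  # constant column: map to the centre
        else:
            scaled[col] = 2 * (scaled[col] - lo) / (hi - lo) - 1
    return scaled

cleaned = scale_to_unit_interval(
    pd.DataFrame({"age": [18, 35, 70], "income": [20e3, 55e3, 90e3]})
)
```

After this step every value lies in [-1, 1], so no OutRangeError should be triggered.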
-
_network¶
str – da for DA; sda for SDA. Facilitates calling functions with one or the other algorithm.
-
_da_dim_reducer¶
Theano function – Keeps the Theano function obtained from the trained DA model. Reduces the dimension of the cleaned data down to one.
-
_sda_dim_reducer¶
Theano function – Keeps the Theano function obtained from the trained SDA model. Reduces the dimension of the cleaned data down to one.
-
_one_dim_data¶
Numpy Array – The dimension-reduced one dimensional data.
-
_valley¶
Python function – Trained valley model that tags samples by their one dimensional representation.
-
_kmeans¶
Scikit-Learn K-Means model – Trained K-Means model that tags samples by their one dimensional representation.
-
_tagger¶
str – Keeps a record of which tagger is implemented.
-
_field_importance¶
List – Keeps the list of coefficients that influence the clustering emphasis.
-
add_cluster()[source]¶
Tags each sample according to its reduced one dimensional value. Adds an extra column ‘cluster’ to raw_data, suggesting a zero-based cluster ID.
-
cleaned_data¶
Pandas DataFrame – For assigning the cleaned dataframe to _cleaned_data.
-
da_dim_reducer¶
Theano function – Function that reduces the dataset dimension. Attribute _network is given da to designate the autoencoder method as DA.
-
field_importance¶
List – Significance given to fields when the neural network is trained. Fields with a larger number will be given more attention.
Note
The importance is only meaningful relative to other fields. If no values are specified, all fields are treated equally.
Parameters: field_importance (List or Dict, default None (List of Ones)) –
- If a list is designated, all fields should be assigned an importance; viz, the length of the list should equal the number of features training the neural network.
- It can also be given as a dict. In that case, fields can be selectively given a value. The dict key is the field name and the value is the importance. Fields not included are initialised with the default value one. A warning is issued when a key is not on the list of field names, mostly because of a typo.
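The dict-to-list expansion described above can be sketched as follows; `expand_field_importance` is a hypothetical name, not part of the public API:

```python
import warnings

def expand_field_importance(field_importance, field_names):
    # Hypothetical helper mirroring the behaviour described above:
    # a dict is expanded to a full list, missing fields default to one,
    # and unknown keys trigger a warning (likely a typo).
    if field_importance is None:
        return [1] * len(field_names)
    if isinstance(field_importance, list):
        assert len(field_importance) == len(field_names)
        return field_importance
    for key in field_importance:
        if key not in field_names:
            warnings.warn("Unknown field name: {}".format(key))
    return [field_importance.get(name, 1) for name in field_names]

# Only income is emphasised; the other fields keep the default of one.
importance = expand_field_importance({"income": 3}, ["age", "income", "gender"])
```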
-
classmethod from_csv(filepath, **kwargs)[source]¶
Class method for directly reading a CSV file.
Parameters:
- filepath (str) – Path to the CSV file.
- **kwargs (keyword arguments) – Other keyword arguments passed to pandas.read_csv.
-
kmeans¶
Scikit-Learn K-Means model – Trained on the dimension-reduced one dimensional data; segregates subjects into concentrations of existence within a subset of [-1, 1] with the K-Means algorithm. _tagger is given kmeans to facilitate follow-up usages.
-
load_dim_reducer(filepath='dim_reducer.pk')[source]¶
Loads a saved dimension reducer. The network type needs to be named first.
Parameters: filepath (str) – Path to the file storing the saved dimension reducer.
-
load_kmeans(filepath)[source]¶
Loads a saved K-Means tagger from a file.
Parameters: filepath (str) – File path to the file saving the K-Means tagger.
-
load_valley(filepath)[source]¶
Loads a saved valley tagger from a file and creates the valley function from the saved parameters.
Parameters: filepath (str) – File path to the file saving the valley tagger.
-
one_dim_data¶
Numpy Array – Stores the one dimensional output of the neural network.
-
raw_data¶
Pandas DataFrame – For assigning new values to _raw_data.
-
reduce_to_one_dim()[source]¶
Reduces the dimension of the input dataset to one before the tagging in the next step. The input of the Theano function is the cleaned data and the output is one dimensional data stored in _one_dim_data.
-
save_dim_reducer(filepath='dim_reducer.pk', include_network=False)[source]¶
Saves the dimension reducer from the neural network training.
Parameters:
- filepath (str) – Filename to store the dimension reducer.
- include_network (boolean) – If true, prefixes the filepath with the network type.
-
save_kmeans(filepath, include_taggername=False)[source]¶
Saves the K-Means model to the named file path. Can add a prefix to indicate this saves a K-Means model.
Parameters:
- filepath (str) – File path for saving the model.
- include_taggername (boolean, default False) – Include the kmean_ prefix in the filename if true.
-
save_valley(filepath, include_taggername=False)[source]¶
Saves the valley tagger.
Parameters:
- filepath (str) – File path to save the tagger.
- include_taggername (boolean, default False) – Include the valley_ prefix in the filename if true.
-
sda_dim_reducer¶
Theano function – Function that reduces the dataset dimension. Attribute _network is given sda to designate the autoencoder method as SDA.
-
tagger¶
str – Names the tagger if necessary to do so, which facilitates, e.g., prefixing the filepath.
-
train_da_dim_reducer(field_importance=None, batch_size=50, corruption_level=0.3, learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)[source]¶
Trains a DA neural network.
Parameters:
- field_importance (List or Dict, default None (List of Ones)) – If a list is designated, all fields should be assigned an importance; viz, the length of the list should equal the number of features training the neural network. It can also be given as a dict; in that case, fields can be selectively given a value. The dict key is the field name and the value is the importance. Fields not included are initialised with the default value one. A warning is issued when a key is not on the list of field names, mostly because of a typo.
- batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
- corruption_level (float, between 0 and 1) – Dropout rate in reading the input, a typical practice in deep learning to avoid overfitting.
- learning_rate (float) – Step size for the gradient descent algorithm.
- min_epochs (int) – The minimum number of training epochs to run. It can be exceeded depending on the setup of patience and ad-hoc training progress.
- patience (int) – True number of training epochs to run if larger than min_epochs. Note it is potentially increased during training if the cost improves beyond the expectation from the current cost.
- patience_increase (int) – Coefficient used to increase patience against the epochs that have been run.
- improvement_threshold (float, between 0 and 1) – Minimum improvement considered a substantial improvement, i.e. the ratio of the new cost over the existing lowest cost should be lower than this value.
- verbose (boolean, default False) – Prints out training progress at each epoch if true.
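The interplay of min_epochs, patience, patience_increase, and improvement_threshold can be sketched with a toy loop over a precomputed cost sequence. `run_with_patience` is a hypothetical illustration of the bookkeeping, not the actual training code:

```python
def run_with_patience(costs, min_epochs=2, patience=4,
                      patience_increase=2, improvement_threshold=0.98):
    # Toy version of the early-stopping bookkeeping: a cost below
    # improvement_threshold * best_cost counts as a substantial
    # improvement and extends patience in proportion to the epochs run.
    best_cost = float("inf")
    for epoch, cost in enumerate(costs):
        if cost < best_cost * improvement_threshold:
            patience = max(patience, epoch * patience_increase)
        best_cost = min(best_cost, cost)
        if epoch >= max(min_epochs, patience):
            break
    return epoch, best_cost

# A stagnant cost sequence stops after max(min_epochs, patience) epochs.
epoch, best = run_with_patience([10, 10, 10, 10, 10, 10, 10, 10])
```

A steadily improving cost sequence keeps extending patience and therefore keeps training past min_epochs.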
-
train_kmeans(n_clusters=10, **kwargs)[source]¶
Trains a K-Means model on top of the one dimensional data derived from the dimension reducers.
Parameters:
- n_clusters (int) – The number of clusters required to start a K-Means learning.
- **kwargs (keyword arguments) – Any other keyword arguments passed on to the Scikit-Learn K-Means model.
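The class delegates to Scikit-Learn, but K-Means on one dimensional data is simple enough to sketch with NumPy alone (plain Lloyd's iterations; `one_dim_kmeans` is a hypothetical name):

```python
import numpy as np

def one_dim_kmeans(one_dim_data, n_clusters=2, n_iter=20, seed=0):
    # Plain Lloyd's iterations on 1-D data: assign points to the
    # nearest centroid, then move each centroid to its members' mean.
    rng = np.random.RandomState(seed)
    centroids = rng.choice(one_dim_data, size=n_clusters, replace=False)
    for _ in range(n_iter):
        labels = np.abs(one_dim_data[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = one_dim_data[labels == k].mean()
    return labels, centroids

# Two tight concentrations inside [-1, 1] are separated cleanly.
data = np.array([-0.9, -0.85, -0.8, 0.8, 0.85, 0.9])
labels, centroids = one_dim_kmeans(data, n_clusters=2)
```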
-
train_sda_dim_reducer(field_importance=None, batch_size=50, hidden_layers_sizes=[20], corruption_levels=[0.3], learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)[source]¶
Trains an SDA neural network.
Parameters:
- field_importance (List or Dict, default None (List of Ones)) – If a list is designated, all fields should be assigned an importance; viz, the length of the list should equal the number of features training the neural network. It can also be given as a dict; in that case, fields can be selectively given a value. The dict key is the field name and the value is the importance. Fields not included are initialised with the default value one. A warning is issued when a key is not on the list of field names, mostly because of a typo.
- batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
- hidden_layers_sizes (List of ints) – Number of neurons in the hidden layers (all but the input layer).
- corruption_levels (List of floats, between 0 and 1) – Dropout rates in reading the input, a typical practice in deep learning to avoid overfitting.
- learning_rate (float) – Step size for the gradient descent algorithm.
- min_epochs (int) – The minimum number of training epochs to run. It can be exceeded depending on the setup of patience and ad-hoc training progress.
- patience (int) – True number of training epochs to run if larger than min_epochs. Note it is potentially increased during training if the cost improves beyond the expectation from the current cost.
- patience_increase (int) – Coefficient used to increase patience against the epochs that have been run.
- improvement_threshold (float, between 0 and 1) – Minimum improvement considered a substantial improvement, i.e. the ratio of the new cost over the existing lowest cost should be lower than this value.
- verbose (boolean, default False) – Prints out training progress at each epoch if true.
-
train_valley(bins=100, contrast=0.3)[source]¶
Trains the ability to cut the universe of samples into clusters based on how the dimension-reduced dataset assembles in a histogram. Unlike K-Means, there is no need to preset the number of clusters.
Parameters:
- bins (int) – Number of bins to aggregate the one dimensional data.
- contrast (float, between 0 and 1) – Threshold used to define local minima and local maxima. A detailed explanation is in utils.find_local_extremes.
Note
When getting only one cluster, check the distribution of one_dim_data. Likely the data points flock too close to each other. Try increasing bins first. If that does not work, try different neural networks with more or fewer layers and more or fewer neurons.
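A stripped-down illustration of the valley idea: histogram the one dimensional data and cut at the widest empty run of bins between the modes. The real model applies the contrast threshold via utils.find_local_extremes; this sketch simplifies the valley to the widest empty gap, as an assumption.

```python
import numpy as np

def valley_cut(one_dim_data, bins=20):
    counts, edges = np.histogram(one_dim_data, bins=bins, range=(-1, 1))
    # Locate the widest run of empty bins: the "valley" between modes.
    # (If no bin is empty, the cut degenerates to -1 and one cluster.)
    best_start, best_len = 0, 0
    start, length = 0, 0
    for i, c in enumerate(counts):
        if c == 0:
            if length == 0:
                start = i
            length += 1
            if length > best_len:
                best_start, best_len = start, length
        else:
            length = 0
    cut = edges[best_start] + best_len * (edges[1] - edges[0]) / 2.0
    return (one_dim_data > cut).astype(int), cut

# Bimodal 1-D data: the cut lands in the gap between the two modes.
rng = np.random.RandomState(1)
one_dim = np.concatenate([rng.uniform(-0.9, -0.5, 200),
                          rng.uniform(0.5, 0.9, 300)])
clusters, cut = valley_cut(one_dim)
```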
-
valley¶
Python function – Trained on the dimension-reduced one dimensional data; segregates subjects into concentrations of existence within a subset of [-1, 1] by locating the “valley” in the distribution landscape. _tagger is given valley to facilitate follow-up usages.