The clusteror.core Module

This module contains the Clusteror class, which encapsulates the raw data from which clusters are to be discovered, along with the cleaned data a clusteror runs on.

The clustering model encompasses two parts:

  1. Neural network: Pre-training (often encountered in the Deep Learning context) is implemented so that the neural network maps the higher-dimensional input data to a one-dimensional representation. Ideally this mapping is one-to-one. A Denoising Autoencoder (DA) or Stacked Denoising Autoencoder (SDA) is implemented for this purpose.

  2. One-dimensional clustering model: A separate model segments the samples by their one-dimensional representation. Two models are available in this class definition:

    • K-Means
    • Valley model

The pivotal idea here is that, provided the neural network is a good one-to-one mapper, the separate clustering model on the one-dimensional representation is equivalent to a clustering model on the original high-dimensional data.

Note

The valley model is explained in detail in module clusteror.utils.
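Putting the two parts together, a typical end-to-end run might look like the following sketch. The CSV filename and the preprocess helper are illustrative, not part of the library; the cleaning step must produce columns scaled into [-1, 1].

```python
from clusteror.core import Clusteror

# Read the raw dataset straight from CSV (illustrative filename)
clusteror = Clusteror.from_csv('data.csv')
# Assign preprocessed data; all columns must lie in [-1, 1]
# (`preprocess` is a hypothetical user-supplied cleaning function)
clusteror.cleaned_data = preprocess(clusteror.raw_data)
# Train a DA that maps the cleaned data down to one dimension
clusteror.train_da_dim_reducer(min_epochs=200, verbose=True)
# Run the trained reducer; the result lands in _one_dim_data
clusteror.reduce_to_one_dim()
# Tag samples with the valley model and write a 'cluster' column
clusteror.train_valley(bins=100, contrast=0.3)
clusteror.add_cluster()
```

The same flow works with train_sda_dim_reducer and train_kmeans in place of the DA and valley steps.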

class clusteror.core.Clusteror(raw_data)[source]

Bases: object

The Clusteror class can train a DA or SDA neural network, train taggers, or load saved models from files.

Parameters:raw_data (Pandas DataFrame) – Dataframe read from the data source. It can be the original dataset without any preprocessing, or one with a certain level of manipulation for later analysis.
_raw_data

Pandas DataFrame – Stores the original dataset. It is the dataset on which later post-clustering performance analysis will be based.

_cleaned_data

Pandas DataFrame – Preprocessed data. It does not necessarily have the same number of columns as _raw_data, since a categorical column can derive multiple columns. Because the tanh function is used as the activation function, for symmetry considerations all columns should have values in the range [-1, 1]; otherwise an OutRangeError will be raised.
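Since tanh saturates at ±1, each numeric column must be rescaled before being assigned to cleaned_data. A minimal min-max rescaling sketch in plain Python (the helper name is illustrative, not part of the API):

```python
def scale_to_unit_interval(values):
    """Linearly rescale a sequence of numbers into [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]
```

For example, scale_to_unit_interval([10, 20, 30]) yields [-1.0, 0.0, 1.0], which is safely inside the range the autoencoder expects.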

_network

str – da for DA; sda for SDA. Facilitates functions called with one or the other algorithm.

_da_dim_reducer

Theano function – Keeps the Theano function obtained from the trained DA model. Reduces the dimension of the cleaned data down to one.

_sda_dim_reducer

Theano function – Keeps the Theano function obtained from the trained SDA model. Reduces the dimension of the cleaned data down to one.

_one_dim_data

Numpy Array – The dimension-reduced, one-dimensional data.

_valley

Python function – Trained valley model tagging samples by their one-dimensional representation.

_kmeans

Scikit-Learn K-Means model – Trained K-Means model tagging samples with their one dimensional representation.

_tagger

str – Records which tagger is implemented.

_field_importance

List – Keeps the list of coefficients that weight the clustering emphasis across fields.

add_cluster()[source]

Tags each sample according to its reduced one-dimensional value. Adds an extra column ‘cluster’ to raw_data, giving a zero-based cluster ID.

cleaned_data

Pandas DataFrame – For assigning a cleaned dataframe to _cleaned_data.

da_dim_reducer

Theano function – Function that reduces the dataset dimension. Attribute _network is set to da to designate the autoencoder method as DA.

field_importance

List – Significance given to fields when the neural network is trained. Fields with a larger number will be given more attention.

Note

The importance is only meaningful relatively between fields. If no values are specified, all fields are treated equally.

Parameters:field_importance (List or Dict, default None (List of Ones)) –
  • If a list is designated, every field should be assigned an importance, i.e. the length of the list should equal the number of features training the neural network.
  • It can also be given as a dict. In that case, fields can be selectively given a value. A dict key is a field name and its value is the importance. Fields not included are initialised with the default value one. A warning will be issued when a key is not in the list of field names, most likely because of a typo.
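The dict form described above can be thought of as expanding into a full per-field list. A hypothetical sketch of that expansion (not the library's actual code):

```python
import warnings

def expand_importance(field_names, field_importance=None):
    """Expand a per-field importance spec into a full list.

    Fields absent from a dict default to one; unknown keys
    trigger a warning, as they are most likely typos.
    """
    if field_importance is None:
        return [1] * len(field_names)
    if isinstance(field_importance, dict):
        for key in field_importance:
            if key not in field_names:
                warnings.warn('Unknown field name: {}'.format(key))
        return [field_importance.get(name, 1) for name in field_names]
    # Already a list: its length must match the number of fields
    assert len(field_importance) == len(field_names)
    return list(field_importance)
```

For instance, expand_importance(['a', 'b', 'c'], {'b': 5}) gives [1, 5, 1]: field b is emphasised five-fold relative to the others.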

classmethod from_csv(filepath, **kwargs)[source]

Class method for directly reading CSV file.

Parameters:
  • filepath (str) – Path to the CSV file
  • **kwargs (keyword arguments) – Other keyword arguments passed to pandas.read_csv
kmeans

Scikit-Learn K-Means model – Trained on the dimension-reduced one-dimensional data; segregates subjects into concentrations within subsets of [-1, 1] with the K-Means algorithm. _tagger is set to kmeans to facilitate follow-up usage.

load_dim_reducer(filepath='dim_reducer.pk')[source]

Loads a saved dimension reducer. The network type needs to be named first.

Parameters:filepath (str) – Path to the file storing the dimension reducer.
load_kmeans(filepath)[source]

Loads a saved K-Means tagger from a file.

Parameters:filepath (str) – File path to the file saving the K-Means tagger.
load_valley(filepath)[source]

Loads a saved valley tagger from a file and creates the valley function from the saved parameters.

Parameters:filepath (str) – File path to the file saving the valley tagger.
one_dim_data

Numpy Array – Stores the one-dimensional output of the neural network.

raw_data

Pandas DataFrame – For assigning new values to _raw_data.

reduce_to_one_dim()[source]

Reduces the dimension of the input dataset to one before the tagging in the next step.

The input of the Theano function is the cleaned data; the output is one-dimensional data stored in _one_dim_data.

save_dim_reducer(filepath='dim_reducer.pk', include_network=False)[source]

Saves the dimension reducer obtained from the neural network training.

Parameters:
  • filepath (str) – Filename to store the dimension reducer.
  • include_network (boolean) – If true, prefixes the filepath with the network type.
save_kmeans(filepath, include_taggername=False)[source]

Saves the K-Means model to the named file path. A prefix can be added to indicate that the file stores a K-Means model.

Parameters:
  • filepath (str) – File path for saving the model.
  • include_taggername (boolean, default False) – Include the kmean_ prefix in filename if true.
save_valley(filepath, include_taggername=False)[source]

Saves valley tagger.

Parameters:
  • filepath (str) – File path to save the tagger.
  • include_taggername (boolean, default False) – Include the valley_ prefix in filename if true.
sda_dim_reducer

Theano function – Function that reduces the dataset dimension. Attribute _network is set to sda to designate the autoencoder method as SDA.

tagger

str – Names the tagger when necessary, which facilitates, e.g., prefixing the filepath.

train_da_dim_reducer(field_importance=None, batch_size=50, corruption_level=0.3, learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)[source]

Trains a DA neural network.

Parameters:
  • field_importance (List or Dict, default None (List of Ones)) –
    • If a list is designated, every field should be assigned an importance, i.e. the length of the list should equal the number of features training the neural network.
    • It can also be given as a dict. In that case, fields can be selectively given a value. A dict key is a field name and its value is the importance. Fields not included are initialised with the default value one. A warning will be issued when a key is not in the list of field names, most likely because of a typo.
  • batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
  • corruption_level (float, between 0 and 1) – Dropout rate in reading input, a typical practice in deep learning to avoid overfitting.
  • learning_rate (float) – Propagating step size for the gradient descent algorithm.
  • min_epochs (int) – The minimum number of training epochs to run. It can be exceeded depending on the setup of patience and ad-hoc training progress.
  • patience (int) – True number of training epochs to run if larger than min_epochs. Note it is potentially increased during training when the cost improves beyond the expectation set by the current best cost.
  • patience_increase (int) – Coefficient used to increase patience against epochs that have been run.
  • improvement_threshold (float, between 0 and 1) – Minimum improvement considered substantial, i.e. the ratio of the new cost to the existing lowest cost is below this value.
  • verbose (boolean, default False) – Prints out training progress at each epoch if true.
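The interplay of min_epochs, patience, patience_increase and improvement_threshold follows the usual early-stopping pattern from the Theano deep-learning tutorials. A simplified, standalone sketch of that loop (the cost sequence is illustrative; the library's actual loop may differ in detail):

```python
def run_epochs(costs, min_epochs=200, patience=60, patience_increase=2,
               improvement_threshold=0.98):
    """Return the number of epochs actually run for a given cost sequence."""
    best_cost = float('inf')
    for epoch, cost in enumerate(costs, start=1):
        if cost < best_cost * improvement_threshold:
            # Substantial improvement: allow proportionally more epochs
            patience = max(patience, epoch * patience_increase)
        if cost < best_cost:
            best_cost = cost
        if epoch >= max(min_epochs, patience):
            break
    return epoch
```

With no improvement the loop stops at max(min_epochs, patience); while the cost keeps dropping by more than the threshold, patience grows with the epoch count and training continues.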
train_kmeans(n_clusters=10, **kwargs)[source]

Trains K-Means model on top of the one dimensional data derived from dimension reducers.

Parameters:
  • n_clusters (int) – The number of clusters required to start a K-Means learning.
  • **kwargs (keyword arguments) – Any other keyword arguments passed on to Scikit-Learn K-Means model.
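Under the hood this amounts to fitting Scikit-Learn's KMeans on the one-dimensional column. A standalone sketch with made-up data points in [-1, 1]:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative one-dimensional data: two well-separated concentrations
one_dim_data = np.array([-0.9, -0.85, -0.8, 0.8, 0.85, 0.9])

# KMeans expects a 2-D array, hence the reshape to a single column
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(one_dim_data.reshape(-1, 1))
labels = kmeans.labels_  # zero-based cluster IDs, one per sample
```

The two concentrations around -0.85 and 0.85 come out as two distinct labels, which is exactly what add_cluster later writes into the ‘cluster’ column.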
train_sda_dim_reducer(field_importance=None, batch_size=50, hidden_layers_sizes=[20], corruption_levels=[0.3], learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)[source]

Trains a SDA neural network.

Parameters:
  • field_importance (List or Dict, default None (List of Ones)) –
    • If a list is designated, all fields should be assigned an

    importance, viz, the length of the list should be equal to the length of the features training the neural network.

    • It can also be given in a dict. In such a case, the fields can

    be selectively given a value. Dict key is for field name and value is for the importance. Fields not included will be initiated with the default value one. A warning will be issued when a key is not on the list of field names, mostly because of a typo.

  • batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
  • hidden_layers_sizes (List of ints) – Number of neurons in the hidden layers (all but the input layer).
  • corruption_levels (List of floats, between 0 and 1) – Dropout rate in reading input, typical pratice in deep learning to avoid overfitting.
  • learning_rate (float) – Propagating step size for gredient descent algorithm.
  • min_epochs (int) – The mininum number of training epoch to run. It can be exceeded depending on the setup of patience and ad-hoc training progress.
  • patience (int) – True number of training epochs to run if larger than min_epochs. Note it is potentially increased during the training if the cost is better than the expectation from current cost.
  • patience_increase (int) – Coefficient used to increase patience against epochs that have been run.
  • improvement_threshold (float, between 0 and 1) – Minimum improvement considered as substantial improvement, i.e. new cost over existing lowest cost lower than this value.
  • verbose (boolean, default False) – Prints out training at each epoch if true.
train_valley(bins=100, contrast=0.3)[source]

Trains the ability to cut the universe of samples into clusters based on how the dimension-reduced dataset assembles in a histogram. Unlike K-Means, there is no need to preset the number of clusters.

Parameters:
  • bins (int) – Number of bins to aggregate the one dimensional data.
  • contrast (float, between 0 and 1) – Threshold used to define local minima and local maxima. A detailed explanation is in utils.find_local_extremes.

Note

When getting only one cluster, check the distribution of one_dim_data. Most likely the data points flock too close to each other. Try increasing bins first. If that does not work, try different neural networks with more or fewer layers and more or fewer neurons.

valley

Python function – Trained on the dimension-reduced one-dimensional data; segregates subjects into concentrations within subsets of [-1, 1] by locating the “valleys” in the distribution landscape. _tagger is set to valley to facilitate follow-up usage.

exception clusteror.core.OutRangeError[source]

Bases: Exception

Exception thrown when the cleaned data goes beyond the range [-1, 1].