clusteror package

clusteror.core module

This module contains the Clusteror class, which encapsulates the raw data to discover clusters from and the cleaned data for a clusteror to run on.

The clustering model encompasses two parts:

  1. Neural network: Pre-training (often encountered in the Deep Learning context) is implemented so that the neural network maps input data of a higher dimension to a one-dimensional representation. Ideally this mapping is one-to-one. A Denoising Autoencoder (DA) or Stacked Denoising Autoencoder (SDA) is implemented for this purpose.

  2. One dimensional clustering model: A separate model segments the samples on their one-dimensional representation. Two models are available in this class definition:

    • K-Means
    • Valley model

The pivotal idea here is that, given the neural network is a good one-to-one mapper, a clustering model on the one-dimensional representation is equivalent to a clustering model on the original high-dimensional data.

Note

The valley model is explained in detail in module clusteror.utils.
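
A minimal end-to-end sketch of the two-part workflow. The file name and the min-max scaling step are illustrative; the only hard requirement is that cleaned values lie in [-1, 1], and all columns of the CSV are assumed numeric here:

    from clusteror.core import Clusteror

    clusteror = Clusteror.from_csv('data.csv')   # 'data.csv' is hypothetical
    # Scale every column into [-1, 1], as required by the tanh activation.
    raw = clusteror.raw_data
    clusteror.cleaned_data = (raw - raw.min()) / (raw.max() - raw.min()) * 2 - 1
    clusteror.train_da_dim_reducer(verbose=True)    # part 1: neural network
    clusteror.reduce_to_one_dim()                   # map data to one dimension
    clusteror.train_valley(bins=100, contrast=0.3)  # part 2: 1-D clustering
    clusteror.add_cluster()   # adds a zero-based 'cluster' column to raw_data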

class clusteror.core.Clusteror(raw_data)[source]

Bases: object

The Clusteror class can train a DA or SDA neural network, train taggers, or load saved models from files.

Parameters:raw_data (Pandas DataFrame) – Dataframe read from the data source. It can be the original dataset without any preprocessing or with a certain level of manipulation for future analysis.
_raw_data

Pandas DataFrame – Stores the original dataset. It is the dataset that later post-clustering performance analysis will be based on.

_cleaned_data

Pandas DataFrame – Preprocessed data. It does not necessarily have the same number of columns as _raw_data, as a categorical column can derive multiple columns. Because the tanh function is used as the activation function (for symmetry considerations), all columns must have values in the range [-1, 1]; otherwise an OutRangeError is raised.

_network

str – da for DA; sda for SDA. Facilitates calling functions with one or the other algorithm.

_da_dim_reducer

Theano function – Keeps the Theano function obtained from the trained DA model. Reduces the dimension of the cleaned data down to one.

_sda_dim_reducer

Theano function – Keeps the Theano function obtained from the trained SDA model. Reduces the dimension of the cleaned data down to one.

_one_dim_data

Numpy Array – The dimension-reduced one-dimensional data.

_valley

Python function – Trained valley model that tags samples with their one-dimensional representation.

_kmeans

Scikit-Learn K-Means model – Trained K-Means model that tags samples with their one-dimensional representation.

_tagger

str – Keeps a record of which tagger is implemented.

_field_importance

List – Keeps the list of coefficients that influence the clustering emphasis.

add_cluster()[source]

Tags each sample according to its reduced one-dimensional value. Adds an extra column ‘cluster’ to raw_data, giving a zero-based cluster ID.

cleaned_data

Pandas DataFrame – For assigning the cleaned dataframe to _cleaned_data.

da_dim_reducer

Theano function – Function that reduces the dataset dimension. Attribute _network is set to da to designate the autoencoder method as DA.

field_importance

List – Significance given to fields when the neural network is trained. Fields with a larger number are given more attention.

Note

The importance is only meaningful relatively between fields. If no values are specified, all fields are treated equally.

Parameters:field_importance (List or Dict, default None (List of Ones)) –
  • If a list is designated, all fields should be assigned an importance, viz, the length of the list should equal the number of features training the neural network.
  • It can also be given as a dict. In such a case, the fields can be selectively given a value. The dict key is the field name and the value is the importance. Fields not included are initiated with the default value one. A warning is issued when a key is not on the list of field names, mostly because of a typo.
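
For instance, either form below emphasises one field five-fold (the field names are illustrative; with the dict form, unlisted fields default to one):

    clusteror.field_importance = [1, 1, 5, 1]     # one weight per feature
    clusteror.field_importance = {'spending': 5}  # selective, by field name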

classmethod from_csv(filepath, **kwargs)[source]

Class method for directly reading a CSV file.

Parameters:
  • filepath (str) – Path to the CSV file.
  • **kwargs (keyword arguments) – Other keyword arguments passed to pandas.read_csv
kmeans

Scikit-Learn K-Means model – Trained on the dimension-reduced one-dimensional data; segregates subjects into concentrations of existence within a subset of [-1, 1] with the K-Means algorithm. _tagger is set to kmeans to facilitate follow-up usage.

load_dim_reducer(filepath='dim_reducer.pk')[source]

Loads a saved dimension reducer. The network type needs to be named first.

Parameters:filepath (str) – Path to the file storing the dimension reducer.
load_kmeans(filepath)[source]

Loads a saved K-Means tagger from a file.

Parameters:filepath (str) – File path to the file saving the K-Means tagger.
load_valley(filepath)[source]

Loads a saved valley tagger from a file. Creates the valley function from the saved parameters.

Parameters:filepath (str) – File path to the file saving the valley tagger.
one_dim_data

Numpy Array – Stores the one-dimensional output of the neural network.

raw_data

Pandas DataFrame – For assigning new values to _raw_data.

reduce_to_one_dim()[source]

Reduces the dimension of the input dataset to one before the tagging in the next step.

The input of the Theano function is the cleaned data, and the output is one-dimensional data stored in _one_dim_data.

save_dim_reducer(filepath='dim_reducer.pk', include_network=False)[source]

Saves the dimension reducer obtained from the neural network training.

Parameters:
  • filepath (str) – Filename to store the dimension reducer.
  • include_network (boolean) – If true, prefixes the filepath with the network type.
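
A round-trip sketch. With include_network=True, the saved file is assumed to be prefixed as da_dim_reducer.pk, and the private _network attribute is set directly since no public setter is documented:

    clusteror.save_dim_reducer(filepath='dim_reducer.pk', include_network=True)
    # Later, in a fresh session:
    fresh = Clusteror.from_csv('data.csv')
    fresh._network = 'da'  # name the network type before loading
    fresh.load_dim_reducer(filepath='da_dim_reducer.pk')
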
save_kmeans(filepath, include_taggername=False)[source]

Saves the K-Means model to the named file path. Can add a prefix to indicate the file saves a K-Means model.

Parameters:
  • filepath (str) – File path for saving the model.
  • include_taggername (boolean, default False) – Include the kmean_ prefix in filename if true.
save_valley(filepath, include_taggername=False)[source]

Saves the valley tagger.

Parameters:
  • filepath (str) – File path to save the tagger.
  • include_taggername (boolean, default False) – Include the valley_ prefix in filename if true.
sda_dim_reducer

Theano function – Function that reduces the dataset dimension. Attribute _network is set to sda to designate the autoencoder method as SDA.

tagger

str – Names the tagger if necessary, which facilitates, e.g., prefixing the filepath.

train_da_dim_reducer(field_importance=None, batch_size=50, corruption_level=0.3, learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)[source]

Trains a DA neural network.

Parameters:
  • field_importance (List or Dict, default None (List of Ones)) –
    • If a list is designated, all fields should be assigned an importance, viz, the length of the list should equal the number of features training the neural network.
    • It can also be given as a dict. In such a case, the fields can be selectively given a value. The dict key is the field name and the value is the importance. Fields not included are initiated with the default value one. A warning is issued when a key is not on the list of field names, mostly because of a typo.

  • batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
  • corruption_level (float, between 0 and 1) – Dropout rate in reading input, a typical practice in deep learning to avoid overfitting.
  • learning_rate (float) – Propagating step size for the gradient descent algorithm.
  • min_epochs (int) – The minimum number of training epochs to run. It can be exceeded depending on the setup of patience and ad-hoc training progress.
  • patience (int) – True number of training epochs to run if larger than min_epochs. Note it is potentially increased during training if the cost improves beyond the expectation from the current cost.
  • patience_increase (int) – Coefficient used to increase patience against epochs that have already been run.
  • improvement_threshold (float, between 0 and 1) – Minimum improvement considered substantial, i.e. the ratio of the new cost to the existing lowest cost is below this value.
  • verbose (boolean, default False) – Prints out training progress at each epoch if true.
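
A typical call, spelling out the documented defaults except verbose:

    clusteror.train_da_dim_reducer(
        batch_size=50,
        corruption_level=0.3,
        learning_rate=0.002,
        min_epochs=200,
        verbose=True,  # print the cost at each epoch
    )
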
train_kmeans(n_clusters=10, **kwargs)[source]

Trains a K-Means model on top of the one-dimensional data derived from the dimension reducers.

Parameters:
  • n_clusters (int) – The number of clusters required to start a K-Means learning.
  • **kwargs (keyword arguments) – Any other keyword arguments passed on to Scikit-Learn K-Means model.
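
For example (random_state is a standard Scikit-Learn K-Means keyword, passed through **kwargs):

    clusteror.reduce_to_one_dim()
    clusteror.train_kmeans(n_clusters=5, random_state=0)
    clusteror.add_cluster()  # tag samples with their K-Means cluster IDs
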
train_sda_dim_reducer(field_importance=None, batch_size=50, hidden_layers_sizes=[20], corruption_levels=[0.3], learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)[source]

Trains an SDA neural network.

Parameters:
  • field_importance (List or Dict, default None (List of Ones)) –
    • If a list is designated, all fields should be assigned an importance, viz, the length of the list should equal the number of features training the neural network.
    • It can also be given as a dict. In such a case, the fields can be selectively given a value. The dict key is the field name and the value is the importance. Fields not included are initiated with the default value one. A warning is issued when a key is not on the list of field names, mostly because of a typo.

  • batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
  • hidden_layers_sizes (List of ints) – Number of neurons in each of the hidden layers (all layers but the input layer).
  • corruption_levels (List of floats, between 0 and 1) – Dropout rates in reading input, a typical practice in deep learning to avoid overfitting.
  • learning_rate (float) – Propagating step size for the gradient descent algorithm.
  • min_epochs (int) – The minimum number of training epochs to run. It can be exceeded depending on the setup of patience and ad-hoc training progress.
  • patience (int) – True number of training epochs to run if larger than min_epochs. Note it is potentially increased during training if the cost improves beyond the expectation from the current cost.
  • patience_increase (int) – Coefficient used to increase patience against epochs that have already been run.
  • improvement_threshold (float, between 0 and 1) – Minimum improvement considered substantial, i.e. the ratio of the new cost to the existing lowest cost is below this value.
  • verbose (boolean, default False) – Prints out training progress at each epoch if true.
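
For example, a two-layer SDA with per-layer corruption levels:

    clusteror.train_sda_dim_reducer(
        hidden_layers_sizes=[20, 5],   # two hidden layers: 20 then 5 neurons
        corruption_levels=[0.3, 0.3],  # one dropout rate per hidden layer
        learning_rate=0.002,
        verbose=True,
    )
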
train_valley(bins=100, contrast=0.3)[source]

Trains the ability to cut the universe of samples into clusters based on how the dimension-reduced dataset assembles in a histogram. Unlike K-Means, there is no need to preset the number of clusters.

Parameters:
  • bins (int) – Number of bins to aggregate the one dimensional data.
  • contrast (float, between 0 and 1) – Threshold used to define local minima and local maxima. A detailed explanation is given in utils.find_local_extremes.

Note

When getting only one cluster, check the distribution of one_dim_data. Likely the data points flock too close to one another. Try increasing bins first. If that does not work, try different neural networks with more or fewer layers and more or fewer neurons.
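
For example, raising bins sharpens the histogram before resorting to a new network architecture:

    clusteror.train_valley(bins=200, contrast=0.3)
    clusteror.add_cluster()  # tag samples with their valley cluster IDs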

valley

Python function – Trained on the dimension-reduced one-dimensional data; segregates subjects into concentrations of existence within a subset of [-1, 1] by locating the “valleys” in the distribution landscape. _tagger is set to valley to facilitate follow-up usage.

exception clusteror.core.OutRangeError[source]

Bases: Exception

Exception thrown when cleaned data goes beyond the range [-1, 1].
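
Exactly which call performs the range check is an implementation detail; the sketch below simply guards the training call and rescales on failure:

    from clusteror.core import OutRangeError

    try:
        clusteror.train_da_dim_reducer()
    except OutRangeError:
        # Min-max rescale the cleaned data into [-1, 1] and retry.
        df = clusteror.cleaned_data
        clusteror.cleaned_data = (df - df.min()) / (df.max() - df.min()) * 2 - 1
        clusteror.train_da_dim_reducer()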

clusteror.nn module

This module comprises classes for neural networks.

class clusteror.nn.SdA(n_ins, hidden_layers_sizes, np_rs=None, theano_rs=None, field_importance=None, input_data=None)[source]

Bases: object

Stacked Denoising Autoencoder (SDA) class.

An SdA model is obtained by stacking several dAs. The hidden layer of the dA at layer i becomes the input of the dA at layer i+1. The first-layer dA gets as its input the input of the SdA, and the hidden layer of the last dA represents the output. Note that after pretraining, the SdA is dealt with as a normal MLP; the dAs are only used to initialize the weights.

Parameters:
  • n_ins (int) – Input dimension.
  • hidden_layers_sizes (list of int) – Each int is assigned to a hidden layer; the same number of hidden layers is created.
  • np_rs (Numpy function) – Numpy random state.
  • theano_rs (Theano function) – Theano random generator that gives symbolic random values.
  • field_importance (list or Numpy array) – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.
  • input_data (Theano symbolic variable) – Variable for input data.
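
A minimal construction sketch for a 10-dimensional input reduced to one dimension through a 20-neuron hidden layer (the sizes are illustrative):

    import numpy as np
    import theano.tensor as T
    from clusteror.nn import SdA

    x = T.matrix('x')
    sda = SdA(
        n_ins=10,
        hidden_layers_sizes=[20, 1],
        np_rs=np.random.RandomState(123),
        input_data=x,
    )
    one_dim = sda.get_final_hidden_layer(x)  # symbolic graph of the output
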
theano_rs

Theano function – Theano random generator that gives symbolic random values.

field_importance

list or Numpy array – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.

W

Theano shared variable – Weight matrix. Dimension (n_visible, n_hidden).

W_prime

Theano shared variable – Transposed weight matrix. Dimension (n_hidden, n_visible).

bhid

Theano shared variable – Bias on output side. Dimension n_hidden.

bvis

Theano shared variable – Bias on input side. Dimension n_visible.

x

Theano symbolic variable – Used as input to build graph.

params

list – List that packs the neural network parameters.

dA_layers

list – List that keeps dA instances.

n_layers

int – Number of hidden layers, len(dA_layers).

get_final_hidden_layer(input_data)[source]

Computes the values of the last hidden layer.

Parameters:input_data (Theano symbolic variable) – Data input to neural network.
Returns:A graph with output as the hidden layer values.
Return type:Theano graph
get_first_reconstructed_input(hidden)[source]

Computes the reconstructed input given the values of the last hidden layer.

Parameters:hidden (Theano symbolic variable) – Data input to neural network at the hidden layer side.
Returns:A graph with output as the reconstructed data at the visible side.
Return type:Theano graph
pretraining_functions(train_set, batch_size)[source]

Builds the functions that compute the cost and the updates for one training step of each dA layer.

Parameters:
  • train_set (Theano shared variable) – The complete training dataset.
  • batch_size (int) – Number of rows for each mini-batch.
Returns:

Theano functions that run one training step on each dA layer.

Return type:

List

class clusteror.nn.dA(n_visible, n_hidden, np_rs=None, theano_rs=None, field_importance=None, initial_W=None, initial_bvis=None, initial_bhid=None, input_data=None)[source]

Bases: object

Denoising Autoencoder (DA) class.

Parameters:
  • n_visible (int) – Input dimension.
  • n_hidden (int) – Output dimension.
  • np_rs (Numpy function) – Numpy random state.
  • theano_rs (Theano function) – Theano random generator that gives symbolic random values.
  • field_importance (list or Numpy array) – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.
  • initial_W (Numpy matrix) – Initial weight matrix. Dimension (n_visible, n_hidden).
  • initial_bvis (Numpy array) – Initial bias on input side. Dimension n_visible.
  • initial_bhid (Numpy array) – Initial bias on output side. Dimension n_hidden.
  • input_data (Theano symbolic variable) – Variable for input data.
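
A minimal construction sketch mapping 10 visible units to a single hidden unit (the sizes are illustrative):

    import numpy as np
    import theano.tensor as T
    from clusteror.nn import dA

    x = T.matrix('x')
    da = dA(n_visible=10, n_hidden=1, np_rs=np.random.RandomState(123),
            input_data=x)
    hidden = da.get_hidden_values(x)              # symbolic hidden activations
    rebuilt = da.get_reconstructed_input(hidden)  # symbolic reconstruction
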
theano_rs

Theano function – Theano random generator that gives symbolic random values.

field_importance

list or Numpy array – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.

W

Theano shared variable – Weight matrix. Dimension (n_visible, n_hidden).

W_prime

Theano shared variable – Transposed weight matrix. Dimension (n_hidden, n_visible).

bhid

Theano shared variable – Bias on output side. Dimension n_hidden.

bvis

Theano shared variable – Bias on input side. Dimension n_visible.

x

Theano symbolic variable – Used as input to build graph.

params

list – List that packs the neural network parameters.

get_corrupted_input(input_data, corruption_level)[source]

Corrupts the input by multiplying the input with an array of zeros and ones generated by binomial trials.

Parameters:
  • input_data (Theano symbolic variable) – Data input to neural network.
  • corruption_level (float or Theano symbolic variable) – Probability to corrupt a bit in the input data. Between 0 and 1.
Returns:

A graph with output as the corrupted input.

Return type:

Theano graph

get_cost_updates(corruption_level, learning_rate)[source]

This function computes the cost and the updates for one training step of the dA.

Parameters:
  • corruption_level (float or Theano symbolic variable) – Probability to corrupt a bit in the input data. Between 0 and 1.
  • learning_rate (float or Theano symbolic variable) – Step size for Gradient Descent algorithm.
Returns:

  • cost (Theano graph) – A graph with output as the cost.
  • updates (List of tuples) – Instructions of how to update parameters. Used in training stage to update parameters.
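
Continuing the construction sketch above, the (cost, updates) pair compiles into a full-batch training step; train_set here is a synthetic Theano shared variable and mini-batching is omitted:

    import numpy as np
    import theano

    cost, updates = da.get_cost_updates(corruption_level=0.3,
                                        learning_rate=0.002)
    train_set = theano.shared(
        np.random.uniform(-1, 1, size=(500, 10)).astype(theano.config.floatX)
    )
    train_step = theano.function([], cost, updates=updates,
                                 givens={x: train_set})
    for epoch in range(200):
        epoch_cost = train_step()  # one gradient-descent step per call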

get_hidden_values(input_data)[source]

Computes the values of the hidden layer.

Parameters:input_data (Theano symbolic variable) – Data input to neural network.
Returns:A graph with output as the hidden layer values.
Return type:Theano graph
get_reconstructed_input(hidden)[source]

Computes the reconstructed input given the values of the hidden layer.

Parameters:hidden (Theano symbolic variable) – Data input to neural network at the hidden layer side.
Returns:A graph with output as the reconstructed data at the visible side.
Return type:Theano graph

clusteror.plot module

Plotting tools relevant for illustrating and comparing clustering results can be found in this module.

clusteror.plot.group_occurance_plot(one_dim_data, cat_label, labels, group_label, colors=None, figsize=(10, 6), bbox_to_anchor=(1.01, 1), loc=2, grid=True, show=True, filepath=None, **kwargs)[source]

Plots the distribution of one-dimensional ordinal or categorical data in a bar chart. This tool is useful to check the clustering impact in this one-dimensional sub-space.

Parameters:
  • one_dim_data (list, Pandas Series, Numpy Array, or any iterable) – A sequence of data. Each element is for an instance.
  • cat_label (str) – Field name to be used for the one-dimensional data.
  • labels (list, Pandas Series, Numpy Array, or any iterable) – The segment label for each sample in one_dim_data.
  • group_label (str) – Field name to be used for the cluster ID.
  • colors (list, default None) – Colours for each category existing in this one-dimensional data. The default colour scheme is used if not supplied.
  • figsize (tuple) – Figure size (width, height).
  • bbox_to_anchor (tuple) – Instruction for placing the legend box relative to the axes. For details refer to the Matplotlib documentation.
  • loc (int) – The corner of the legend box to anchor. For details refer to the Matplotlib documentation.
  • grid (boolean, default True) – Show grid.
  • show (boolean, default True) – Shows the figure in a pop-up window if true; saves to file if false.
  • filepath (str) – File name for saving the plot. Must be assigned a valid filepath if show is False.
  • **kwargs (keyword arguments) – Other keyword arguments passed on to matplotlib.pyplot.scatter.

Note

Instances in the same cluster do not necessarily assemble together in all one-dimensional sub-spaces. There can possibly be no clustering capability for certain features. Additionally, certain features play a secondary role in clustering, having less importance in field_importance in the clusteror module.

clusteror.plot.hist_plot_one_dim_group_data(one_dim_data, labels, bins=11, colors=None, figsize=(10, 6), xlabel='Dimension Reduced Data', ylabel='Occurance', bbox_to_anchor=(1.01, 1), loc=2, grid=True, show=True, filepath=None, **kwargs)[source]

Plots the distribution of one-dimensional numerical data in a histogram. This tool is useful to check the clustering impact in this one-dimensional sub-space.

Parameters:
  • one_dim_data (list, Pandas Series, Numpy Array, or any iterable) – A sequence of data. Each element is for an instance.
  • labels (list, Pandas Series, Numpy Array, or any iterable) – The segment label for each sample in one_dim_data.
  • bins (int or iterable) – If an integer, bins - 1 bins are created; if an iterable, it lists the bin delimiters.
  • colors (list, default None) – Colours for each group. Equally distanced colours on the colour map are used if not supplied.
  • figsize (tuple) – Figure size (width, height).
  • xlabel (str) – Plot xlabel.
  • ylabel (str) – Plot ylabel.
  • bbox_to_anchor (tuple) – Instruction for placing the legend box relative to the axes. For details refer to the Matplotlib documentation.
  • loc (int) – The corner of the legend box to anchor. For details refer to the Matplotlib documentation.
  • grid (boolean, default True) – Show grid.
  • show (boolean, default True) – Shows the figure in a pop-up window if true; saves to file if false.
  • filepath (str) – File name for saving the plot. Must be assigned a valid filepath if show is False.
  • **kwargs (keyword arguments) – Other keyword arguments passed on to matplotlib.pyplot.scatter.

Note

Instances in the same cluster do not necessarily assemble together in all one-dimensional sub-spaces. There can possibly be no clustering capability for certain features. Additionally, certain features play a secondary role in clustering, having less importance in field_importance in the clusteror module.
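
A usage sketch continuing the Clusteror workflow from clusteror.core (the output file name is illustrative; raw_data carries the ‘cluster’ column after add_cluster):

    from clusteror.plot import hist_plot_one_dim_group_data

    hist_plot_one_dim_group_data(
        clusteror.one_dim_data,
        clusteror.raw_data['cluster'],
        bins=50,
        show=False,                      # save instead of popping up a window
        filepath='one_dim_clusters.png',
    )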

clusteror.plot.scatter_plot_two_dim_group_data(two_dim_data, labels, markers=None, colors=None, figsize=(10, 6), xlim=None, ylim=None, alpha=0.8, bbox_to_anchor=(1.01, 1), loc=2, grid=True, show=True, filepath=None, **kwargs)[source]

Plots the distribution of two-dimensional data against clustering groups in a scatter plot.

A point represents an instance in the dataset. Points in the same cluster are painted with the same colour.

This tool is useful to check the clustering impact in this two-dimensional sub-space.

Parameters:
  • two_dim_data (Pandas DataFrame) – A dataframe with two columns. The first column goes to the x-axis, and the second column goes to the y-axis.
  • labels (list, Pandas Series, Numpy Array, or any iterable) – The segment label for each sample in two_dim_data.
  • markers (list) – Marker names for each group.
  • bbox_to_anchor (tuple) – Instruction for placing the legend box relative to the axes. For details refer to the Matplotlib documentation.
  • colors (list, default None) – Colours for each group. Equally distanced colours on the colour map are used if not supplied.
  • figsize (tuple) – Figure size (width, height).
  • xlim (tuple) – X-axis limits.
  • ylim (tuple) – Y-axis limits.
  • alpha (float, between 0 and 1) – Marker transparency, from 0 (transparent) to 1 (opaque).
  • loc (int) – The corner of the legend box to anchor. For details refer to the Matplotlib documentation.
  • grid (boolean, default True) – Show grid.
  • show (boolean, default True) – Shows the figure in a pop-up window if true; saves to file if false.
  • filepath (str) – File name for saving the plot. Must be assigned a valid filepath if show is False.
  • **kwargs (keyword arguments) – Other keyword arguments passed on to matplotlib.pyplot.scatter.

Note

Instances in the same cluster do not necessarily assemble together in all two-dimensional sub-spaces. There can possibly be no clustering capability for certain features. Additionally, certain features play a secondary role in clustering, having less importance in field_importance in the clusteror module.

clusteror.settings module

clusteror.utils module

This module works as a transient store of useful functions. New standalone functions are first placed here; as they grow in number they can be consolidated into an independent class, module, or even a new package.

clusteror.utils.find_local_extremes(series, contrast)[source]

Finds local minima and maxima according to contrast. In theory, they can be determined by the first derivative and second derivative. A result derived this way is of no value in dealing with very noisy, zig-zag data, as too many local extremes would be found at every turn-around. The method presented here compares the point currently looked at with the opposite potential extreme, which is updated while scanning through the data sequence. For instance, if a potential maximum is 10, then a data point with a value smaller than 10 / (1 + contrast) is written down as a local minimum.

Parameters:
  • series (Pandas Series) – One-dimensional data to find local extremes in.
  • contrast (float) – A value between 0 and 1 as a threshold between minimum and maximum.
Returns:

  • local_min_inds (list) – List of indices for local minima.
  • local_mins (list) – List of minimum values.
  • local_max_inds (list) – List of indices for local maxima.
  • local_maxs (list) – List of maximum values.
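
A usage sketch; the histogram bin counts of the dimension-reduced data (here an assumed array one_dim_data) form the series scanned for extremes:

    import numpy as np
    import pandas as pd
    from clusteror.utils import find_local_extremes

    counts, _ = np.histogram(one_dim_data, bins=100)
    local_min_inds, local_mins, local_max_inds, local_maxs = \
        find_local_extremes(pd.Series(counts), contrast=0.3)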