clusteror package
clusteror.core module
This module contains the Clusteror class, which encapsulates the raw data from which clusters are discovered and the cleaned data a clusteror runs on.
The clustering model encompasses two parts:
Neural network: Pre-training (often encountered in the Deep Learning context) is implemented so that the neural network maps input data of a higher dimension to a one-dimensional representation. Ideally this mapping is one-to-one. A Denoising Autoencoder (DA) or Stacked Denoising Autoencoder (SDA) is implemented for this purpose.
One-dimensional clustering model: A separate model segments the samples according to their one-dimensional representation. Two models are available in this class definition:
- K-Means
- Valley model
The pivotal idea is that, provided the neural network is a good one-to-one mapper, a separate clustering model on the one-dimensional representation is equivalent to a clustering model on the original high-dimensional data.
Note
The valley model is explained in detail in module clusteror.utils.
class clusteror.core.Clusteror(raw_data)
Bases: object
The Clusteror class can train a DA or SDA neural network, train taggers, and load saved models from files.
Parameters: raw_data (Pandas DataFrame) – Dataframe read from the data source. It can be the original dataset without any preprocessing, or one with a certain level of manipulation for future analysis.
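Example (a minimal end-to-end sketch based on the methods documented below; the CSV path and the min-max scaling step are illustrative assumptions, not part of the package):

    import pandas as pd
    from clusteror.core import Clusteror

    raw = pd.read_csv('data.csv')   # hypothetical input file
    clusteror = Clusteror(raw)

    # Cleaned data must sit in [-1, 1] because tanh activations are used;
    # min-max scaling to that range is one illustrative choice.
    clusteror.cleaned_data = raw.apply(
        lambda col: 2 * (col - col.min()) / (col.max() - col.min()) - 1
    )

    clusteror.train_da_dim_reducer(verbose=False)   # pre-train the DA
    clusteror.reduce_to_one_dim()                   # cleaned data -> 1-D
    clusteror.train_valley(bins=100, contrast=0.3)  # fit the valley tagger
    clusteror.add_cluster()                         # adds a 'cluster' column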
_raw_data (Pandas DataFrame) – Stores the original dataset. It is the dataset on which later post-clustering performance analysis will be based.
_cleaned_data (Pandas DataFrame) – Preprocessed data. It does not necessarily have the same number of columns as _raw_data, as a categorical column can derive multiple columns. Because the tanh function is used as the activation function, for symmetry all columns should have values in the range [-1, 1]; otherwise an OutRangeError will be raised.
_network (str) – 'da' for DA; 'sda' for SDA. Facilitates calling functions with one or the other algorithm.
_da_dim_reducer (Theano function) – Keeps the Theano function obtained from the trained DA model. Reduces the dimension of the cleaned data down to one.
_sda_dim_reducer (Theano function) – Keeps the Theano function obtained from the trained SDA model. Reduces the dimension of the cleaned data down to one.
_one_dim_data (Numpy Array) – The dimension-reduced, one-dimensional data.
_valley (Python function) – Trained valley model that tags samples by their one-dimensional representation.
_kmeans (Scikit-Learn K-Means model) – Trained K-Means model that tags samples by their one-dimensional representation.
_tagger (str) – Keeps a record of which tagger is implemented.
_field_importance (List) – Keeps the list of coefficients that influence the clustering emphasis.
add_cluster()
Tags each sample according to its reduced one-dimensional value. Adds an extra column 'cluster' to raw_data, giving a zero-based cluster ID.
cleaned_data (Pandas DataFrame) – For assigning the cleaned dataframe to _cleaned_data.
da_dim_reducer (Theano function) – Function that reduces the dataset dimension. Attribute _network is given 'da' to designate the autoencoder method as DA.
field_importance (List) – Significance given to fields when the neural network is trained. Fields with a larger number are given more attention.
Note
The importance is only meaningful relative to other fields. If no values are specified, all fields are treated equally. A sketch of the two accepted forms follows this list.
Parameters: field_importance (List or Dict, default None (List of Ones)) –
- If a list is designated, all fields should be assigned an importance; viz, the length of the list should equal the number of features training the neural network.
- It can also be given as a dict. In that case, fields can be selectively given a value; the dict key is the field name and the value is the importance. Fields not included are initiated with the default value of one. A warning is issued when a key is not on the list of field names, mostly because of a typo.
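Example (the column names 'income' and 'age' are hypothetical):

    clusteror.field_importance = [1, 1, 2, 1]             # one weight per feature column
    clusteror.field_importance = {'income': 2, 'age': 3}  # unlisted fields default to 1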
classmethod from_csv(filepath, **kwargs)
Class method for directly reading a CSV file.
Parameters:
- filepath (str) – Path to the CSV file.
- **kwargs (keyword arguments) – Other keyword arguments passed to pandas.read_csv.
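Example (the file path and the extra pandas.read_csv keyword are hypothetical):

    clusteror = Clusteror.from_csv('data.csv', sep=',')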
kmeans (Scikit-Learn K-Means model) – Trained on the dimension-reduced one-dimensional data; segregates subjects by their concentration within subsets of [-1, 1] using the K-Means algorithm. _tagger is given 'kmeans' to facilitate follow-up usages.
load_dim_reducer(filepath='dim_reducer.pk')
Loads a saved dimension reducer. The network type needs to be named first.
Parameters: filepath (str) – Path to the file storing the dimension reducer.
load_kmeans(filepath)
Loads a saved K-Means tagger from a file.
Parameters: filepath (str) – File path to the file saving the K-Means tagger.
load_valley(filepath)
Loads a saved valley tagger from a file. Creates the valley function from the saved parameters.
Parameters: filepath (str) – File path to the file saving the valley tagger.
one_dim_data (Numpy Array) – Stores the one-dimensional output of the neural network.
raw_data (Pandas DataFrame) – For assigning new values to _raw_data.
reduce_to_one_dim()
Reduces the dimension of the input dataset to one, before tagging in the next step. The input of the Theano function is the cleaned data and the output is one-dimensional data stored in _one_dim_data.
save_dim_reducer(filepath='dim_reducer.pk', include_network=False)
Saves the dimension reducer obtained from the neural network training.
Parameters:
- filepath (str) – Filename to store the dimension reducer.
- include_network (boolean) – If true, prefix the filepath with the network type.
save_kmeans(filepath, include_taggername=False)
Saves the K-Means model to the named file path. Can add a prefix to indicate the file saves a K-Means model.
Parameters:
- filepath (str) – File path for saving the model.
- include_taggername (boolean, default False) – Include the kmean_ prefix in the filename if true.
save_valley(filepath, include_taggername=False)
Saves the valley tagger.
Parameters:
- filepath (str) – File path to save the tagger.
- include_taggername (boolean, default False) – Include the valley_ prefix in the filename if true.
sda_dim_reducer (Theano function) – Function that reduces the dataset dimension. Attribute _network is given 'sda' to designate the autoencoder method as SDA.
tagger (str) – Names the tagger when necessary; this facilitates, e.g., prefixing the filepath.
train_da_dim_reducer(field_importance=None, batch_size=50, corruption_level=0.3, learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)
Trains a DA neural network.
Parameters:
- field_importance (List or Dict, default None (List of Ones)) – If a list is designated, all fields should be assigned an importance; viz, the length of the list should equal the number of features training the neural network. It can also be given as a dict, in which case fields can be selectively given a value; the dict key is the field name and the value is the importance. Fields not included are initiated with the default value of one. A warning is issued when a key is not on the list of field names, mostly because of a typo.
- batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
- corruption_level (float, between 0 and 1) – Dropout rate when reading input, a typical practice in deep learning to avoid overfitting.
- learning_rate (float) – Propagation step size for the gradient descent algorithm.
- min_epochs (int) – The minimum number of training epochs to run. It can be exceeded depending on the patience setup and ad-hoc training progress.
- patience (int) – The true number of training epochs to run if larger than min_epochs. Note it is potentially increased during training if the cost improves beyond the expectation from the current cost.
- patience_increase (int) – Coefficient used to increase patience against epochs that have already run.
- improvement_threshold (float, between 0 and 1) – Minimum improvement considered substantial, i.e. the ratio of the new cost over the existing lowest cost falling below this value.
- verbose (boolean, default False) – Prints training information at each epoch if true.
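Example (a sketch using the documented defaults, followed by the dimension-reduction step; adjust values to your data):

    clusteror.train_da_dim_reducer(
        field_importance=None,     # equal weight to all fields
        batch_size=50,
        corruption_level=0.3,
        learning_rate=0.002,
        min_epochs=200,
        patience=60,
        verbose=True,              # print cost at each epoch
    )
    clusteror.reduce_to_one_dim()  # populates _one_dim_data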
train_kmeans(n_clusters=10, **kwargs)
Trains a K-Means model on top of the one-dimensional data derived from the dimension reducers.
Parameters:
- n_clusters (int) – The number of clusters required to start a K-Means learning.
- **kwargs (keyword arguments) – Any other keyword arguments passed on to the Scikit-Learn K-Means model.
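Example (random_state is an ordinary Scikit-Learn keyword passed through **kwargs):

    clusteror.train_kmeans(n_clusters=5, random_state=0)
    clusteror.add_cluster()   # writes zero-based cluster IDs to raw_data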
train_sda_dim_reducer(field_importance=None, batch_size=50, hidden_layers_sizes=[20], corruption_levels=[0.3], learning_rate=0.002, min_epochs=200, patience=60, patience_increase=2, improvement_threshold=0.98, verbose=False)
Trains an SDA neural network.
Parameters:
- field_importance (List or Dict, default None (List of Ones)) – If a list is designated, all fields should be assigned an importance; viz, the length of the list should equal the number of features training the neural network. It can also be given as a dict, in which case fields can be selectively given a value; the dict key is the field name and the value is the importance. Fields not included are initiated with the default value of one. A warning is issued when a key is not on the list of field names, mostly because of a typo.
- batch_size (int) – Size of each training batch. Necessary to derive the number of batches.
- hidden_layers_sizes (List of ints) – Number of neurons in the hidden layers (all but the input layer).
- corruption_levels (List of floats, between 0 and 1) – Dropout rates when reading input, a typical practice in deep learning to avoid overfitting.
- learning_rate (float) – Propagation step size for the gradient descent algorithm.
- min_epochs (int) – The minimum number of training epochs to run. It can be exceeded depending on the patience setup and ad-hoc training progress.
- patience (int) – The true number of training epochs to run if larger than min_epochs. Note it is potentially increased during training if the cost improves beyond the expectation from the current cost.
- patience_increase (int) – Coefficient used to increase patience against epochs that have already run.
- improvement_threshold (float, between 0 and 1) – Minimum improvement considered substantial, i.e. the ratio of the new cost over the existing lowest cost falling below this value.
- verbose (boolean, default False) – Prints training information at each epoch if true.
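Example (the two-layer sizes and per-layer corruption levels are illustrative; each hidden layer needs its own corruption level):

    clusteror.train_sda_dim_reducer(
        hidden_layers_sizes=[20, 10],   # two stacked dA layers
        corruption_levels=[0.3, 0.2],   # one dropout rate per layer
        min_epochs=200,
        verbose=True,
    )
    clusteror.reduce_to_one_dim()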
train_valley(bins=100, contrast=0.3)
Trains the ability to cut the universe of samples into clusters based on how the dimension-reduced dataset assembles in a histogram. Unlike K-Means, there is no need to preset the number of clusters.
Parameters:
- bins (int) – Number of bins for aggregating the one-dimensional data.
- contrast (float, between 0 and 1) – Threshold used to define local minima and local maxima. A detailed explanation is in utils.find_local_extremes.
Note
When getting only one cluster, check the distribution of one_dim_data. Most likely the data points flock too close to each other. Try increasing bins first. If that does not work, try different neural networks with more or fewer layers and more or fewer neurons.
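Example (a sketch using the documented defaults; unlike train_kmeans, no cluster count is supplied):

    clusteror.train_valley(bins=100, contrast=0.3)
    clusteror.add_cluster()   # cluster IDs follow the valleys in the 1-D histogram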
valley (Python function) – Trained on the dimension-reduced one-dimensional data; segregates subjects by their concentration within subsets of [-1, 1] by locating the "valleys" in the distribution landscape. _tagger is given 'valley' to facilitate follow-up usages.
clusteror.nn module
This module comprises classes for neural networks.
class clusteror.nn.SdA(n_ins, hidden_layers_sizes, np_rs=None, theano_rs=None, field_importance=None, input_data=None)
Bases: object
Stacked Denoising Autoencoder (SDA) class.
An SdA model is obtained by stacking several dAs. The hidden layer of the dA at layer i becomes the input of the dA at layer i+1. The first-layer dA takes the input of the SdA as its input, and the hidden layer of the last dA represents the output. Note that after pretraining, the SdA is dealt with as a normal MLP; the dAs are only used to initialize the weights.
Parameters:
- n_ins (int) – Input dimension.
- hidden_layers_sizes (list of int) – Each int is assigned to one hidden layer; the same number of hidden layers is created.
- np_rs (Numpy function) – Numpy random state.
- theano_rs (Theano function) – Theano random generator that gives symbolic random values.
- field_importance (list or Numpy array) – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.
- input_data (Theano symbolic variable) – Variable for input data.
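Normally an SdA is constructed for you by Clusteror.train_sda_dim_reducer; a direct-construction sketch, with illustrative sizes, looks like this:

    import numpy as np
    from clusteror.nn import SdA

    sda = SdA(
        n_ins=12,                      # number of input features
        hidden_layers_sizes=[20, 10],  # one dA per entry
        np_rs=np.random.RandomState(0),
    )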
theano_rs (Theano function) – Theano random generator that gives symbolic random values.
field_importance (list or Numpy array) – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.
W (Theano shared variable) – Weight matrix. Dimension (n_visible, n_hidden).
W_prime (Theano shared variable) – Transposed weight matrix. Dimension (n_hidden, n_visible).
bhid (Theano shared variable) – Bias on the output side. Dimension n_hidden.
bvis (Theano shared variable) – Bias on the input side. Dimension n_visible.
x (Theano symbolic variable) – Used as input to build the graph.
params (list) – List packing the neural network parameters.
dA_layers (list) – List that keeps the dA instances.
n_layers (int) – Number of hidden layers, len(dA_layers).

Computes the values of the last hidden layer.
Parameters: input_data (Theano symbolic variable) – Data input to the neural network.
Returns: A graph with output as the hidden-layer values.
Return type: Theano graph
get_first_reconstructed_input(hidden)
Computes the reconstructed input given the values of the last hidden layer.
Parameters: hidden (Theano symbolic variable) – Data input to the neural network at the hidden-layer side.
Returns: A graph with output as the reconstructed data at the visible side.
Return type: Theano graph
pretraining_functions(train_set, batch_size)
Computes the cost and the updates for one training step of each dA.
Parameters:
- train_set (Theano shared variable) – The complete training dataset.
- batch_size (int) – Number of rows in each mini-batch.
Returns: Theano functions that run one training step on each dA layer.
Return type: List
class clusteror.nn.dA(n_visible, n_hidden, np_rs=None, theano_rs=None, field_importance=None, initial_W=None, initial_bvis=None, initial_bhid=None, input_data=None)
Bases: object
Denoising Autoencoder (DA) class.
Parameters:
- n_visible (int) – Input dimension.
- n_hidden (int) – Output dimension.
- np_rs (Numpy function) – Numpy random state.
- theano_rs (Theano function) – Theano random generator that gives symbolic random values.
- field_importance (list or Numpy array) – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.
- initial_W (Numpy matrix) – Initial weight matrix. Dimension (n_visible, n_hidden).
- initial_bvis (Numpy array) – Initial bias on the input side. Dimension n_visible.
- initial_bhid (Numpy array) – Initial bias on the output side. Dimension n_hidden.
- input_data (Theano symbolic variable) – Variable for input data.
theano_rs (Theano function) – Theano random generator that gives symbolic random values.
field_importance (list or Numpy array) – Weights put on each field when calculating the cost. If not given, all fields are given an equal weight of one.
W (Theano shared variable) – Weight matrix. Dimension (n_visible, n_hidden).
W_prime (Theano shared variable) – Transposed weight matrix. Dimension (n_hidden, n_visible).
bhid (Theano shared variable) – Bias on the output side. Dimension n_hidden.
bvis (Theano shared variable) – Bias on the input side. Dimension n_visible.
x (Theano symbolic variable) – Used as input to build the graph.
params (list) – List packing the neural network parameters.
get_corrupted_input(input_data, corruption_level)
Corrupts the input by multiplying it with an array of zeros and ones generated by binomial trials.
Parameters:
- input_data (Theano symbolic variable) – Data input to the neural network.
- corruption_level (float or Theano symbolic variable) – Probability of corrupting a bit in the input data. Between 0 and 1.
Returns: A graph with output as the corrupted input.
Return type: Theano graph
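A NumPy sketch of the corruption rule described above (not the package's Theano implementation): each input entry survives a binomial trial with probability 1 - corruption_level.

    import numpy as np

    def corrupt(x, corruption_level, rng=np.random.default_rng(0)):
        # Zero out each entry with probability corruption_level.
        mask = rng.binomial(n=1, p=1 - corruption_level, size=x.shape)
        return x * mask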
get_cost_updates(corruption_level, learning_rate)
Computes the cost and the updates for one training step of the dA.
Parameters:
- corruption_level (float or Theano symbolic variable) – Probability of corrupting a bit in the input data. Between 0 and 1.
- learning_rate (float or Theano symbolic variable) – Step size for the Gradient Descent algorithm.
Returns:
- cost (Theano graph) – A graph with output as the cost.
- updates (List of tuples) – Instructions for how to update the parameters. Used in the training stage to update the parameters.

Computes the values of the hidden layer.
Parameters: input_data (Theano symbolic variable) – Data input to the neural network.
Returns: A graph with output as the hidden-layer values.
Return type: Theano graph
get_reconstructed_input(hidden)
Computes the reconstructed input given the values of the hidden layer.
Parameters: hidden (Theano symbolic variable) – Data input to the neural network at the hidden-layer side.
Returns: A graph with output as the reconstructed data at the visible side.
Return type: Theano graph
clusteror.plot module
This module contains plotting tools for illustrating and comparing clustering results.
clusteror.plot.group_occurance_plot(one_dim_data, cat_label, labels, group_label, colors=None, figsize=(10, 6), bbox_to_anchor=(1.01, 1), loc=2, grid=True, show=True, filepath=None, **kwargs)
Plots the distribution of one-dimensional ordinal or categorical data in a bar chart. This tool is useful for checking the clustering impact in this one-dimensional sub-space.
Parameters:
- one_dim_data (list, Pandas Series, Numpy Array, or any iterable) – A sequence of data. Each element is for one instance.
- cat_label (str) – Field name to be used for the one-dimensional data.
- labels (list, Pandas Series, Numpy Array, or any iterable) – The segment label for each sample in one_dim_data.
- group_label (str) – Field name to be used for the cluster ID.
- colors (list, default None) – Colours for each category existing in this one-dimensional data. The default colour scheme is used if not supplied.
- figsize (tuple) – Figure size (width, height).
- bbox_to_anchor (tuple) – Instruction for placing the legend box relative to the axes. For details refer to the Matplotlib documentation.
- loc (int) – The corner of the legend box to anchor. For details refer to the Matplotlib documentation.
- grid (boolean, default True) – Show grid.
- show (boolean, default True) – Show the figure in a pop-up window if true; save it to file if false.
- filepath (str) – File name for saving the plot. Must be assigned a valid filepath if show is False.
- **kwargs (keyword arguments) – Other keyword arguments passed on to matplotlib.pyplot.scatter.
Note
Instances in the same cluster do not necessarily assemble together in every one-dimensional sub-space. Certain features may show no clustering capability at all. Additionally, certain features play a secondary role in clustering, as they carry less importance in field_importance in the clusteror module.
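Example (the 'gender' and 'cluster' columns are hypothetical; 'cluster' is as produced by add_cluster):

    from clusteror.plot import group_occurance_plot

    group_occurance_plot(
        clusteror.raw_data['gender'],          # categorical data, one value per instance
        cat_label='gender',
        labels=clusteror.raw_data['cluster'],  # cluster IDs
        group_label='cluster',
    )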
clusteror.plot.hist_plot_one_dim_group_data(one_dim_data, labels, bins=11, colors=None, figsize=(10, 6), xlabel='Dimension Reduced Data', ylabel='Occurance', bbox_to_anchor=(1.01, 1), loc=2, grid=True, show=True, filepath=None, **kwargs)
Plots the distribution of one-dimensional numerical data in a histogram. This tool is useful for checking the clustering impact in this one-dimensional sub-space.
Parameters:
- one_dim_data (list, Pandas Series, Numpy Array, or any iterable) – A sequence of data. Each element is for one instance.
- labels (list, Pandas Series, Numpy Array, or any iterable) – The segment label for each sample in one_dim_data.
- bins (int or iterable) – If an integer, bins - 1 bins are created; otherwise a list of the bin delimiters.
- colors (list, default None) – Colours for each group. Equally spaced colours on the colour map are used if not supplied.
- figsize (tuple) – Figure size (width, height).
- xlabel (str) – Plot xlabel.
- ylabel (str) – Plot ylabel.
- bbox_to_anchor (tuple) – Instruction for placing the legend box relative to the axes. For details refer to the Matplotlib documentation.
- loc (int) – The corner of the legend box to anchor. For details refer to the Matplotlib documentation.
- grid (boolean, default True) – Show grid.
- show (boolean, default True) – Show the figure in a pop-up window if true; save it to file if false.
- filepath (str) – File name for saving the plot. Must be assigned a valid filepath if show is False.
- **kwargs (keyword arguments) – Other keyword arguments passed on to matplotlib.pyplot.scatter.
Note
Instances in the same cluster do not necessarily assemble together in every one-dimensional sub-space. Certain features may show no clustering capability at all. Additionally, certain features play a secondary role in clustering, as they carry less importance in field_importance in the clusteror module.
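Example (assumes add_cluster has populated the 'cluster' column):

    from clusteror.plot import hist_plot_one_dim_group_data

    hist_plot_one_dim_group_data(
        clusteror.one_dim_data,                # 1-D neural-network output
        labels=clusteror.raw_data['cluster'],
        bins=50,
    )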
clusteror.plot.scatter_plot_two_dim_group_data(two_dim_data, labels, markers=None, colors=None, figsize=(10, 6), xlim=None, ylim=None, alpha=0.8, bbox_to_anchor=(1.01, 1), loc=2, grid=True, show=True, filepath=None, **kwargs)
Plots the distribution of two-dimensional data against clustering groups in a scatter plot.
A point represents an instance in the dataset. Points in the same cluster are painted in the same colour.
This tool is useful for checking the clustering impact in this two-dimensional sub-space.
Parameters:
- two_dim_data (Pandas DataFrame) – A dataframe with two columns. The first column goes to the x-axis, and the second column goes to the y-axis.
- labels (list, Pandas Series, Numpy Array, or any iterable) – The segment label for each sample in two_dim_data.
- markers (list) – Marker names for each group.
- colors (list, default None) – Colours for each group. Equally spaced colours on the colour map are used if not supplied.
- figsize (tuple) – Figure size (width, height).
- xlim (tuple) – X-axis limits.
- ylim (tuple) – Y-axis limits.
- alpha (float, between 0 and 1) – Marker transparency, from 0 (transparent) to 1 (opaque).
- bbox_to_anchor (tuple) – Instruction for placing the legend box relative to the axes. For details refer to the Matplotlib documentation.
- loc (int) – The corner of the legend box to anchor. For details refer to the Matplotlib documentation.
- grid (boolean, default True) – Show grid.
- show (boolean, default True) – Show the figure in a pop-up window if true; save it to file if false.
- filepath (str) – File name for saving the plot. Must be assigned a valid filepath if show is False.
- **kwargs (keyword arguments) – Other keyword arguments passed on to matplotlib.pyplot.scatter.
Note
Instances in the same cluster do not necessarily assemble together in every two-dimensional sub-space. Certain features may show no clustering capability at all. Additionally, certain features play a secondary role in clustering, as they carry less importance in field_importance in the clusteror module.
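Example (the 'age' and 'income' columns are hypothetical):

    from clusteror.plot import scatter_plot_two_dim_group_data

    scatter_plot_two_dim_group_data(
        clusteror.raw_data[['age', 'income']],  # x-axis, y-axis
        labels=clusteror.raw_data['cluster'],
        alpha=0.6,
    )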
clusteror.settings module
clusteror.utils module
This module works as a transient store of useful functions. New standalone functions are first placed here; as they grow in number, they can be consolidated into an independent class, a module, or even a new package.
clusteror.utils.find_local_extremes(series, contrast)
Finds local minima and maxima according to contrast. In theory they could be determined from the first and second derivatives, but results derived that way are of no value for very noisy, zig-zag data, as too many local extremes would be found at every turn-around. The method presented here compares the point currently looked at with the opposite potential extreme, which is updated while scanning through the data sequence. For instance, if a potential maximum is 10, then a data point with a value smaller than 10 / (1 + contrast) is written down as a local minimum.
Parameters:
- series (Pandas Series) – One-dimensional data to find local extremes in.
- contrast (float) – A value between 0 and 1 used as a threshold between minima and maxima.
Returns:
- local_min_inds (list) – List of indices of local minima.
- local_mins (list) – List of local minimum values.
- local_max_inds (list) – List of indices of local maxima.
- local_maxs (list) – List of local maximum values.
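A small worked sketch of the contrast rule: with contrast=0.5 and a running potential maximum of 10, any point below 10 / 1.5 ≈ 6.7 is recorded as a local minimum. The unpacking assumes the four documented return values come back in the order listed above.

    import pandas as pd
    from clusteror.utils import find_local_extremes

    series = pd.Series([1, 3, 10, 7, 2, 6, 9, 4])
    local_min_inds, local_mins, local_max_inds, local_maxs = find_local_extremes(
        series, contrast=0.5
    )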