cluster.method.hierarchical

class cluster.method.hierarchical.HierarchicalClustering(data, distance_function, linkage=None, num_processes=1, progress_callback=None)

Bases: cluster.method.base.BaseClusterMethod

Implementation of the hierarchical clustering method as explained in a tutorial by matteucc.

Object prerequisites:

  • Items must be sortable (See issue #11)
  • Items must be hashable.

Example:

>>> from cluster import HierarchicalClustering
>>> # or: from cluster import *
>>> cl = HierarchicalClustering([123, 334, 345, 242, 234, 1, 3],
...                             lambda x, y: float(abs(x - y)))
>>> cl.getlevel(90)
[[345, 334], [234, 242], [123], [3, 1]]

Note that all of the returned clusters are more than 90 apart, where 90 is the threshold passed to getlevel().

See BaseClusterMethod for more details.

Parameters:
  • data – The collection of items to be clustered.
  • distance_function – A function that takes two elements of data and returns the distance between them. The distance must not be negative.
  • linkage – The method used to determine the distance between two clusters. See set_linkage_method() for possible values.
  • num_processes – The number of worker processes. To run genmatrix() in parallel using multiprocessing, specify num_processes > 1; that many workers are spun up and the work is split evenly among them.
  • progress_callback – A function to be called on each iteration to publish the progress. The function is called with two integer arguments which represent the total number of elements in the cluster, and the remaining elements to be clustered.
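As an illustration of the distance_function contract described above, here is a minimal sketch for two-dimensional points stored as tuples (tuples are both hashable and sortable, which matches the object prerequisites). The names euclidean and non_negative are hypothetical helpers written for this example; they are not part of the library.

```python
import math


def euclidean(p, q):
    """Distance between two 2D points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def non_negative(distance):
    """Wrap a distance function and reject negative results early.

    Hypothetical helper: it only enforces the documented rule that a
    distance must not be negative.
    """
    def wrapper(x, y):
        d = distance(x, y)
        if d < 0:
            raise ValueError("distance must not be negative: %r" % (d,))
        return d
    return wrapper


safe_distance = non_negative(euclidean)
example = safe_distance((0, 0), (3, 4))
print(example)  # 5.0
```

A function such as safe_distance could then be passed as the distance_function argument.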

cluster(matrix=None, level=None, sequence=None)

Perform hierarchical clustering.

Parameters:
  • matrix – The 2D list currently being processed. It contains the distance between each pair of items.
  • level – The current level of clustering.
  • sequence – The sequence number of the clustering.
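To make the matrix parameter more concrete, the following sketch builds a plain 2D list of pairwise distances. It is only an assumption about the general shape of such a matrix; build_distance_matrix is a hypothetical name, and the library's own genmatrix() may lay out or compute the data differently.

```python
def build_distance_matrix(items, distance):
    """Return a 2D list where entry [i][j] is distance(items[i], items[j])."""
    return [[distance(a, b) for b in items] for a in items]


items = [1, 3, 10]
matrix = build_distance_matrix(items, lambda x, y: float(abs(x - y)))

for row in matrix:
    print(row)
# [0.0, 2.0, 9.0]
# [2.0, 0.0, 7.0]
# [9.0, 7.0, 0.0]
```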

display()

Prints a simple dendrogram-like representation of the full cluster to the console.

getlevel(threshold)

Returns all clusters with a maximum distance of threshold between each other.

Parameters: threshold – The maximum distance between clusters.

See getlevel()
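The effect of a threshold can be sketched without the library. The code below is a naive stand-in, not the library's implementation: it repeatedly merges the two closest clusters (using single linkage, chosen here as an assumption) and stops once the closest pair is more than threshold apart. The names single_linkage and clusters_at_threshold are hypothetical.

```python
from itertools import combinations


def single_linkage(a, b, distance):
    """Smallest pairwise distance between members of clusters a and b."""
    return min(distance(x, y) for x in a for y in b)


def clusters_at_threshold(items, distance, threshold):
    """Merge the closest clusters until all are more than threshold apart."""
    # Start with every item in its own cluster.
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        (i, j), gap = min(
            (((i, j), single_linkage(clusters[i], clusters[j], distance))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if gap > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


data = [123, 334, 345, 242, 234, 1, 3]
result = clusters_at_threshold(data, lambda x, y: float(abs(x - y)), 90)
groups = sorted(sorted(c) for c in result)
print(groups)  # [[1, 3], [123], [234, 242], [334, 345]]
```

With this data the sketch yields the same four groups as the getlevel(90) example in the class description, although the ordering inside the library's output differs.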

publish_progress(total, current)

If a progress function was supplied, this will call that function with the total number of elements, and the remaining number of elements.

Parameters:
  • total – The total number of elements.
  • current – The remaining number of elements.
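A progress function only needs to accept the two integers described above. The sketch below records progress in a list rather than printing it; make_progress_logger is a hypothetical helper, and the loop at the end merely simulates the calls that the clustering object would make.

```python
def make_progress_logger(log):
    """Return a callback that appends (done, total, percent) tuples to log."""
    def report(total, remaining):
        done = total - remaining
        # Guard against an empty data set to avoid dividing by zero.
        percent = 100.0 * done / total if total else 100.0
        log.append((done, total, round(percent, 1)))
    return report


events = []
callback = make_progress_logger(events)

# Stand-in for the library: simulate four progress notifications.
total = 4
for remaining in range(total, 0, -1):
    callback(total, remaining)

print(events[-1])  # (3, 4, 75.0)
```

The callback object could be passed as progress_callback when constructing the clustering object.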

set_linkage_method(method)

Sets the method to determine the distance between two clusters.

Parameters: method – The method to use. It can be one of 'single', 'complete', 'average' or 'uclus', or a callable. The callable should take two collections as parameters and return the distance between them.
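A custom linkage callable can be sketched as follows. The factory make_linkage is a hypothetical helper that binds a pairwise distance function and returns a callable taking two collections, as described above. The three variants use the standard textbook definitions of single, complete and average linkage; they are not taken from the library's source, and whether the library passes any further arguments to the callable is not stated here.

```python
def make_linkage(distance, reducer):
    """Build a two-argument linkage callable from a pairwise distance.

    reducer collapses the list of all pairwise distances into one value.
    """
    def linkage(a, b):
        return reducer([distance(x, y) for x in a for y in b])
    return linkage


def mean(values):
    return sum(values) / len(values)


def distance(x, y):
    return float(abs(x - y))


single = make_linkage(distance, min)      # closest pair of members
complete = make_linkage(distance, max)    # farthest pair of members
average = make_linkage(distance, mean)    # mean over all pairs

a, b = [1, 3], [10, 14]
print(single(a, b), complete(a, b), average(a, b))  # 7.0 13.0 10.0
```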