Clustering analysis

Unsupervised or supervised clustering analysis has been an integral part of transcriptomics, playing an important role in the identification of sub-groups.

Breaking: Some well-used and comprehensive clustering tools are provided in this section.


7.2.1 Nbclust

Introduction: The R package provides 30 indices which determine the number of clusters in a data set and it offers also the best clustering scheme from different results to the user. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determining the most appropriate number of clusters for the data set of interest.

Installation: To install this package, start R (version “4.2”) and enter:

install.packages(Nbclust)

Application: The Nbclust vignette can be found at https://cran.r-project.org/web/packages/NbClust/NbClust.pdf.


7.2.2 ClustVis

Introduction: This web tool allows users to upload their own data and easily create Principal Component Analysis (PCA) plots and heatmaps. Data can be uploaded as a file or by copy-pasteing it to the text box. Data format is shown under “Help” tab.

Data Source Statistics
Type Number
species 9
Data sources 22
Network 7

Installation:To use the Docker image, you need to have Docker installed. Then use the following code:

sudo docker pull taunometsalu/clustvis
mkdir ~/customClustvis/
cd ~/customClustvis/
wget https://github.com/taunometsalu/ClustVis/archive/master.zip
unzip master.zip
chmod -R go+rx ~/customClustvis/
sudo docker run -d \
	--name customClustvis \
	-p <myPort>:3838 \
    -v ~/customClustvis/ClustVis-master/:/srv/shiny-server/:ro \
    taunometsalu/clustvis

To start using ClustVis R package, you can look at the examples in the vignette that comes with the package from github:

install.packages("BiocManager")
BiocManager::install("pcaMethods")
library(devtools)
install_github("taunometsalu/pheatmap")
install_github("taunometsalu/clustvis/Rpackage", build_vignettes = TRUE)
vignette("vignette", "clustvis")

Application: The ClustVis tutorial can be found at https://biit.cs.ut.ee/clustvis/

Web server: ClustVis. ClustVis


7.2.3 Medusa

Introduction: Medusa3 is a java standalone application for visualization and clustering analysis of biological networks in 2D. It is highly interactive and it supports weighted and multi-edged graphs where each edge between two bioentities can represent a different biological concept. Comparing to previous versions, it is currently enriched with a variety of layout and clustering methods for more intuitive visualizations. It is easy to integrate with web application since it is offered as an applet.

Data Source Statistics
Type Number
species 9
Data sources 22
Network 7

Installation:The latest version 3.0 can be downloaded directly from here. https://docs.google.com/uc?id=0B-FSkXfyY2AFZTUxYWIxMTgtMjJiYy00NTMyLTlmMzUtZDdiZDFlZGRhYjUw&export=download&hl=en

Application: The Medusa tutorial can be found at https://sites.google.com/site/medusa3visualization/tutorial

Web server: Medusa. Medusa


7.2.4 wcd

Introduction: wcd is an open source tool for clustering expressed sequence tags. It implements matching using either d2 or edit distance, which provides sensitive matching. Heuristics speed up the clustering process. wcd parallelises well under MPI and also provides Pthreads clustering. wcd has a very compact memory representation which means that large files can be clustered on a single processor (and it can run quite happily without MPI or Pthreads headers or libraries being available).

Application: The wcd tutorial can be found at https://code.google.com/archive/about.

Web server: wcd. wcd


7.2.5 TimeClust

Introduction: TimeClust is an user-friendly software package to cluster genes according to their temporal expression profiles. It can be conveniently used to analyze data obtained from DNA microarray time-course experiments. It implements two original algorithms expressed designed for clustering short time series together with hierarchical clustering and self-organizing maps. TimeClust executable files for Windows and LINUX platforms can be downloaded free of charge for non-profit institutions from the following web site http://aimed11.unipv.it/TimeClust.

Installation: The TimeClust software can be downloaded freely for non-profit institutions from http://aimed11.unipv.it/TimeClust. Executable files for Windows and LINUX platforms are available. The program requires the user to install the MATLAB runtime libraries, that are distributed together with the TimeClust files. The installation steps are here detailed.

How to install TimeClust:

  1. Download the distribution file for the desired platform.
  2. Extract all files from the zip file on a temporary directory.
  3. Install the MATLAB libraries trough the MCRInstaller (if the same version of the MATLAB library files are not already installed) in the following way:
  • for Windows machines
  1. run the MCRInstaller.exe. The setup wizard starts.

  2. add the following directory to your system path:

  • for Unix machines
  1. copy the MCRInstaller.zip in the installation directory and unzip the MCRInstaller.zip file.

  2. update your dynamic library path as follows:

Linux x86 32bit

 export LD_LIBRARY_PATH="<mcr_root>/<ver>/bin/glnx86:<mcr_root>/<ver>/runtime/glnx86:<mcr_root>/<ver>/sys/os/glnx86:<mcr_root>/<ver>/sys/java/jre/glnx86/jre1.5.0/lib/i386/native_threads:<mcr_root>/<ver>/sys/java/jre/glnx86/jre1.5.0/lib/i386/client:<mcr_root>/<ver>/sys/java/jre/glnx86/jre1.5.0/lib/i386:"
export XAPPLRESDIR="<mcr_root>/<ver>/X11/app-defaults"

Linux x86-64 64bit

export LD_LIBRARY_PATH="<mcr_root>/<ver>/runtime/glnxa64:<mcr_root>/<ver>/sys/os/glnxa64:<mcr_root>/<ver>/sys/java/jre/glnxa64/jre1.4.2/lib/amd64/native_threads:<mcr_root>/<ver>/sys/java/jre/glnxa64/jre1.4.2/lib/amd64/client:<mcr_root>/<ver>/sys/java/jre/glnxa64/jre1.4.2/lib/amd64:"
export XAPPLRESDIR="<mcr_root>/<ver>/X11/app-defaults"

Note. If you have already installed MATLAB on your machine there is a limitation regarding directories on your path: this tool requires that mcr_root directory is first on the path, MATLAB requires that the matlabroot directory is first on the path.

  1. Copy the TimeClust executable file and the TimeClust.CTF file in your application root directory.

  2. Run for the first time the TimeClust executable file and close the application. In this step the TimeClust_mcr directory and its subdirectories are created.

  3. Open the config.txt file in the TimeClust_mcr/TimeClust1.3 directory to configure your application. In particular, set:

  • at the line 2 of this file, the desired web site to open from the TimeClust tool (default site is www.geneontology.org)

  • at the line 4, the command line to start the preferred editor to show text files (this step is not necessary for Windows platforms, the default text editor will be used).

  • at line 6, the command line to start the web browser to use (this step is not necessary for Windows platforms, the default web browser will be used).

  1. Run again the TimeClust executable file and try with the small data set distributed together the software tool if all works as reported in the user’s guide.

Application: The TimeClust vignette can be found at http://aimed11.unipv.it/TimeClust/example.pdf.


7.2.6 hiplot

Introduction: we newly developed a pool of biomedical visualization and modeling functions (240+), and proposed a scalable and emerging one-stop web service, Hiplot, with versatile task plugins, such as basic statistics, regression, clustering, genomics, transcriptomics, time series, meta-analysis, and risk models. It provides more simple web and command-line interfaces compared with other similar platforms, and is supported and maintained by the openbiox open source community.

Data Source Statistics
Type Number
species 9
Data sources 22
Network 7

Application: The hiplot tutorial can be found at https://hiplot-academic.com/docs/ Web server: hiplot. hiplot


7.2.7 cola

Introduction: The cola package provides a general framework for subgroup classification by consensus partitioning. It has the following features: 1. It modularizes the consensus partitioning processes that various methods can be easily integrated. 2. It provides rich visualizations for interpreting the results. 3. It allows running multiple methods at the same time and provides functionalities to straightforward compare results. 4. It provides a new method to extract features which are more efficient to separate subgroups. 5. It automatically generates detailed reports for the complete analysis. 6. It allows applying consensus partitioning in a hierarchical manner.

Installation:To install this package, start R (version “4.2”) and enter:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("cola")

Application: The cola vignette can be found at https://bioconductor.org/packages/release/bioc/vignettes/cola/inst/doc/cola.html