Digital Biology Laboratory
EXCAVATOR Version 1.0:

General Guidance

The EXCAVATOR Java interface handles the I/O and calls the shell command. For the PC version, double-click the program to launch it. For the Unix/Linux version, run the "excavator" file in the installation directory. To call it directly from a Unix/Linux shell, link the "excavator" file into a directory on your Unix command path, e.g., /usr/local/bin. You may also define an alias for it in your .cshrc or .login file, e.g.,

	alias excavator-java /home/john/software/excavator-sun/excavator
You can specify your parameters in the various pull-down menus. The selected parameters are remembered, so you may want to reset them by choosing "File/Reset parameters..." before analyzing a different data set. All ASCII files can be viewed under "File/View file...".

Inputs

The gene expression profiles are stored in an input file. Each line in the input file represents the expression profile of one gene, except lines starting with "#" or "REMARK", which are ignored as comments. You can load your data file from "File/Load dataset...". The parameters for the input can be specified under "Input/Parameter setting for input...". EXCAVATOR can read the input data in three formats (specified in "Format"):

  • Entries separated by tabs with annotation (default, see an example):
    GeneId	annotations	data-1	data-2	...
    
    "GeneId" is a string that contains no spaces or tabs. The annotation can be any string that contains no tabs (spaces are fine).

  • Entries separated by tabs without annotation (see an example):
    GeneId	data-1	data-2	...
    

  • Entries separated by space without annotation (see an example):
    GeneId data-1 data-2 ...
    

If the data have not yet been log-transformed, select "Yes" in "Log".
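The three input formats can be illustrated with a small reader. This is only a sketch, not EXCAVATOR's own parser: the function name, the format labels ("tab_annot", "tab", "space"), the skipping of blank lines, and the use of a base-2 logarithm are all assumptions made for illustration.

```python
import math

def parse_profiles(lines, fmt="tab_annot", take_log=False):
    """Parse expression profiles into (gene_id, annotation, values) tuples.

    fmt is a hypothetical label for the three documented formats:
      "tab_annot": GeneId<TAB>annotation<TAB>data-1<TAB>data-2 ...
      "tab":       GeneId<TAB>data-1<TAB>data-2 ...
      "space":     GeneId data-1 data-2 ...
    Missing points (two consecutive tabs) are returned as None.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Comment lines start with "#" or "REMARK"; skipping blank
        # lines as well is an assumption of this sketch.
        if not line.strip() or line.startswith("#") or line.startswith("REMARK"):
            continue
        fields = line.split() if fmt == "space" else line.split("\t")
        gene_id = fields[0]
        if fmt == "tab_annot":
            annotation, data = fields[1], fields[2:]
        else:
            annotation, data = "", fields[1:]
        values = []
        for item in data:
            if item.strip() == "":
                values.append(None)  # empty field between two tabs
            else:
                x = float(item)
                # The log base is an assumption; the manual does not state it.
                values.append(math.log2(x) if take_log else x)
        records.append((gene_id, annotation, values))
    return records
```

For example, `parse_profiles(open("input.data"), fmt="tab")` would read a tab-separated file without annotations.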

Sometimes, the gene expression data may not be complete for all data points in every gene. For the first two formats of the data, one can indicate a missing data point by two consecutive tabs (see an example). You can use "Missing data handling" to specify how you want to replace the missing data:

  • Use 1 as the ratio of the gene expression level.

  • Use the average over other genes at the same column of the data series.

  • Use the average over all the other known data points of the same gene.

  • Use the average over two neighboring known data points of the same gene.
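The four replacement strategies can be sketched as follows. The method names are hypothetical, and the treatment of "neighboring" points (the nearest known value on each side, falling back to a single side at the ends of the series) is an assumption, not a statement of how EXCAVATOR implements it.

```python
def fill_missing(matrix, method="one"):
    """Return a copy of matrix (rows = genes, None = missing) with gaps filled.

    method (hypothetical names):
      "one"       - use 1 as the expression ratio
      "column"    - average over the other genes in the same column
      "row"       - average over the other known points of the same gene
      "neighbors" - average of the nearest known points on either side
    A ZeroDivisionError is raised if no known value is available.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            if method == "one":
                filled[i][j] = 1.0
            elif method == "column":
                known = [r[j] for k, r in enumerate(matrix)
                         if k != i and r[j] is not None]
                filled[i][j] = mean(known)
            elif method == "row":
                filled[i][j] = mean([v for v in row if v is not None])
            elif method == "neighbors":
                left = next((row[k] for k in range(j - 1, -1, -1)
                             if row[k] is not None), None)
                right = next((row[k] for k in range(j + 1, len(row))
                              if row[k] is not None), None)
                filled[i][j] = mean([v for v in (left, right) if v is not None])
            else:
                raise ValueError("unknown method: " + method)
    return filled
```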

After you select your options in "Input/Parameter setting for input...", click "OK". If you do not click "OK", or you click "Cancel", EXCAVATOR will use the default parameters. If you want the default settings, you do not need to open the "Input/Parameter setting for input..." panel at all. The same applies to all other panels in EXCAVATOR.

You may also remove some less regulated genes by using the "-remove" flag followed by two values,

	excavator -remove val1 val2 input.data
If the minimum value among all the data points of a gene is larger than val1 and the maximum value among all the data points of the gene is less than val2, the gene will be removed. The remaining data will be saved in "Filter.data" (see an example).
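The removal rule above amounts to dropping every gene whose values all lie strictly between val1 and val2. A minimal sketch, assuming each record is a (gene_id, values) pair with no missing points (the function name is hypothetical):

```python
def remove_flat_genes(records, val1, val2):
    """Keep only genes that leave the open interval (val1, val2) at least once."""
    kept = []
    for gene_id, values in records:
        # A gene is removed when min > val1 and max < val2.
        if min(values) > val1 and max(values) < val2:
            continue
        kept.append((gene_id, values))
    return kept
```

With val1 = 0.5 and val2 = 2.0, a gene with values (0.8, 1.1, 1.5) is removed, while a gene with values (0.8, 1.1, 2.4) is kept.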

Similarity Measure

The similarity measure determines how the distance between gene expression profiles is calculated. EXCAVATOR uses the "-dist n" flag to select the similarity measure:

  • "-dist 1" or no "-dist" flag (default): (1 - correlation coefficient).

  • "-dist 11": (1 - square of correlation coefficient).

  • "-dist 12": (1 - absolute value of correlation coefficient).

  • "-dist 2": Euclidean distance.

  • "-dist 21": square of Euclidean distance.

  • "-dist 3": sine square of the angle between two vectors.
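The six options can be written out as formulas in code. This sketch assumes that "correlation coefficient" means the Pearson correlation coefficient, which the manual does not state explicitly; the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def distance(x, y, dist=1):
    """Distance between two profiles; dist follows the "-dist n" numbering."""
    if dist in (1, 11, 12):
        r = pearson(x, y)
        if dist == 1:
            return 1 - r          # 1 - correlation coefficient
        if dist == 11:
            return 1 - r * r      # 1 - square of correlation coefficient
        return 1 - abs(r)         # 1 - absolute value of correlation coefficient
    if dist in (2, 21):
        sq = sum((a - b) ** 2 for a, b in zip(x, y))
        return sq if dist == 21 else math.sqrt(sq)
    if dist == 3:
        # sin^2(theta) = 1 - cos^2(theta) for the angle between x and y
        dot = sum(a * b for a, b in zip(x, y))
        norms = sum(a * a for a in x) * sum(b * b for b in y)
        return 1 - dot * dot / norms
    raise ValueError("unknown -dist option: %r" % dist)
```

Note that options 11 and 12 treat perfectly anti-correlated profiles as identical (distance 0), whereas option 1 places them at the maximum distance of 2.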

Clustering Methods

EXCAVATOR offers the following clustering methods, which can be selected under "Method/Clustering...":

  • Hierarchical clustering (default) based on the objective function for the selected similarity measure (to optimize the sum of the distance between a gene and the center of its cluster hierarchically).

  • Non-hierarchical clustering based on the objective function for the selected similarity measure using an iterative approach (to optimize the sum of the distance between a gene and the center of its cluster iteratively; the clustering result may not reach the global optimal solution for the objective function).

  • Non-hierarchical clustering to optimize the sum of the distance between a gene and its best representative gene in the cluster. The globally optimal solution is guaranteed, but this method takes much longer than the others.

  • Hierarchical clustering by simply cutting the longest edges of the minimum spanning tree. It is the fastest method, but the result may not be the desired one.
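The last and simplest method, cutting the longest edges of the minimum spanning tree, can be sketched as below. This is a generic illustration (Prim's algorithm followed by a union-find pass), not EXCAVATOR's own code; the function names and the full distance matrix input are assumptions.

```python
def mst_edges(dist):
    """Prim's algorithm on a full symmetric distance matrix.

    Returns the n-1 tree edges as (distance, u, v) tuples.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [(float("inf"), -1)] * n   # (cheapest link, tree vertex) per vertex
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v][0]:
                best[v] = (dist[u][v], u)
    return edges

def cut_longest_edges(dist, n_clusters):
    """Cluster by removing the (n_clusters - 1) longest MST edges."""
    n = len(dist)
    edges = sorted(mst_edges(dist))
    kept = edges[: max(0, len(edges) - (n_clusters - 1))]
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for _, u, v in kept:
        parent[find(u)] = find(v)
    labels = {}
    # Number the clusters in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Because only edge lengths are considered, a single long edge can split off one outlying gene as its own cluster, which is one reason the result may differ from what the other methods produce.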

Number of Clusters

The method for determining the number of clusters can be selected in "Method/Clustering...". You can either specify the number of clusters or let EXCAVATOR determine it automatically. To specify the number of clusters, use the "-ncluster n" flag, where "n" is the number of clusters, e.g.,

	excavator -ncluster 3 input.data
EXCAVATOR can determine the number of clusters by calculating the objective functions for different numbers of clusters, up to "MaxNCluster" clusters, where "MaxNCluster" can be specified by the "-maxncluster MaxNCluster" flag, e.g.,
	excavator -maxncluster 100 input.data
Without either the "-ncluster" or the "-maxncluster" flag (i.e., by default), EXCAVATOR will automatically set the value of "MaxNCluster" (up to 1/3 of the total number of genes, depending on the clustering method). If you would like to see the objective functions for different numbers of clusters first, you can run
	excavator -profile input.data
which will generate "quality.data" and "diff.data" (see the following).

  • "-cutoff CutoffDistance -mne MinNumElement": cuts all the edges longer than "CutoffDistance" and removes all the small clusters with fewer than "MinNumElement" elements.
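The effect of "-cutoff" together with "-mne" can be sketched on a list of minimum spanning tree edges. The edge representation as (distance, u, v) tuples with integer vertex indices, and the function name, are assumptions for illustration only.

```python
def cutoff_clusters(n, edges, cutoff, min_elements):
    """Drop edges longer than cutoff, then drop clusters that are too small.

    n            - number of genes (vertices 0 .. n-1)
    edges        - spanning tree edges as (distance, u, v)
    cutoff       - edges with distance > cutoff are cut
    min_elements - clusters with fewer elements than this are removed
    Returns the surviving clusters as lists of vertex indices.
    """
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for d, u, v in edges:
        if d <= cutoff:               # keep only the short edges
            parent[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= min_elements]
```

Genes in the removed small clusters do not appear in the returned list at all, which mirrors the idea of filtering them out of the data set.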

    Constraints

    EXCAVATOR allows a user to add constraints so that certain specified genes will stay in the same cluster. The flag "-constraint ConstraintFile" will enforce the constraints. The format of the "ConstraintFile" is that genes in the same line (separated by spaces or tabs) are forced into the same cluster (e.g., see list.cons). For example, by using the file CEFH.cons,

    	excavator -input 2 -cons CEFH.cons -ncluster 4 CEFH.dat
    
    will force the genes "eTYE7" and "cRPT3" into the same cluster.
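The constraint file format (one group of gene IDs per line, separated by spaces or tabs) can be read and checked against a clustering result with a few lines of code. This is a sketch only: both function names are hypothetical, and the clustering is assumed to be available as a dictionary from gene ID to cluster label.

```python
def read_constraints(lines):
    """Return one list of gene IDs per constraint line."""
    groups = []
    for line in lines:
        genes = line.split()          # splits on spaces and tabs
        if len(genes) >= 2:           # a single gene constrains nothing
            groups.append(genes)
    return groups

def violated_constraints(groups, cluster_of):
    """Return the constraint groups whose genes span more than one cluster.

    cluster_of maps a gene ID to its cluster label (assumed input format).
    """
    return [g for g in groups if len({cluster_of[x] for x in g}) > 1]
```

An empty result from `violated_constraints` means that every group of genes listed in the file ended up in a single cluster.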

    Another related option in conjunction with the "-cutoff CutoffDistance" flag is to cut long edges with distance longer than "CutoffDistance" and then select the clusters which contain the genes specified in "ConstraintFile". In this case, a "-cn" flag is needed, e.g.,

    	excavator -input 2 -cons CEFH.cons -cutoff 0.1 -cn CEFH.dat
    
    In this case, only 26 out of the original 68 genes in CEFH.dat will be left and they contain the genes "eTYE7" and "cRPT3" specified in "CEFH.cons". Instead of using the "-cutoff CutoffDistance" flag, you can choose the number of genes to be left by using the "-ns NumLeft" flag, e.g.,
    	excavator -input 2 -cons CEFH.cons -cn -ns 20 CEFH.dat
    
    will leave 20 out of the original 68 genes in CEFH.dat, and they also contain the genes "eTYE7" and "cRPT3" specified in "CEFH.cons". Sometimes the number of genes left may differ a little from what is specified ("NumLeft") due to the constraints.

    Output Files

    EXCAVATOR may produce the following files, depending on the flags used in the command line:

    • "cluster.out": the major clustering result file (see an example). The comments in the file are self-explanatory. Each gene expression profile line gives the "GeneId", the annotation, and the data points (without the log transformation applied).

    • "cluster.tree": the binary tree for the clustering result (see an example). The comments in the file are self-explanatory. Each line after "#Tree links" shows four two-dimensional coordinates. Linking these coordinates will produce the binary tree.

    • "MST.data": the minimum spanning tree, where each line shows the distance and the vertices associated with the distance (see an example).

    • "quality.data": the optimized assessment function (the second column) versus the number of clusters (the first column) (see an example).

    • "diff.data": the transition profile (the second column) as a function of the number of clusters (the first column) (see an example). The peak of this function often indicates the "natural" number of clusters.

    • "dist-distri.data": the distribution for the length of the minimum spanning tree (see an example). This file can be generated using the "-filter" flag, e.g.,
      excavator -filter stanford.dat
      

    • "Filter.data": the genes left after the filtering by the "-remove", "-cutoff" or "-cn" flag (see an example).

    • "FLAG": an empty file indicating that the calculation finished properly.
    Please note that EXCAVATOR will overwrite existing files with the same names as specified above. Hence, if you want to save the results of your previous run, you need to rename the related files.

    Comparing Clustering Results

    You can compare two clustering results (in the cluster.out format) on the same data set using the "-comp" flag:

    	excavator -comp cluster.out cluster.out
    
    It will give a value between 0 (most different clustering results) and 1 (identical clustering results).
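The manual does not say which measure "-comp" computes. As an illustration of a score with the same range, the sketch below uses the Rand index, a common choice that is swapped in here as an assumption and may differ from what EXCAVATOR actually reports. It takes two lists of cluster labels for the same genes in the same order.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of gene pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two genes
    together, or both put them apart. The result is 1.0 for identical
    clusterings and 0.0 when they disagree on every pair.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

The score depends only on which genes share a cluster, so renumbering the clusters in either result does not change it.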
Department of Computer Science, College of Engineering, University of Missouri-Columbia