General Guidance
The EXCAVATOR Java interface handles
the I/O and calls the shell command. For the PC version, you can
double-click to launch the program. For the Unix/Linux version, you may
run the "excavator" file in the installation directory. To call it
directly from a Unix/Linux shell, you may link the "excavator"
file into a directory on your Unix command path, e.g., /usr/local/bin.
You may also make an alias to it in your .cshrc or .login file,
e.g.,
alias excavator-java /home/john/software/excavator-sun/excavator
You can specify your parameters in the various pull-down menus. The
selected parameters are remembered, so you may want to reset
the options by choosing "File/Reset parameters..." before analyzing
a different data set. All the ASCII files can be viewed under
"File/View file...".
Inputs
The gene expression profiles can be saved in an input file. Each
line in the input file represents the expression profile for one
gene, except lines starting with "#" or "REMARK", which will be
ignored as comments. You can load your data file from "File/Load
dataset...". The parameters for the input can be specified under
"Input/Parameter setting for input...". EXCAVATOR can read the input
data in three formats (specified in "Format"):
If the data have not been log-transformed, select "Yes" in
"Log".
Sometimes, the gene expression data may not be complete for all
data points in every gene. For the first two formats of the data,
one can indicate a missing data point by two consecutive tabs
(see an example). You can use "Missing data handling" to specify
how you want to replace the missing data:
- Use 1 as the ratio of the gene expression level.
- Use the average over other genes at the same column of the
data series.
- Use the average over all the other known data points of the
same gene.
- Use the average over two neighboring known data points of
the same gene.
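The four replacement strategies above can be sketched as follows. This is a minimal illustration and not EXCAVATOR's own code: the function name, the strategy names, and the use of None for a missing data point are all assumptions made for the example, and the sketch does not cover how the replacement interacts with the "Log" option.

```python
from statistics import mean


def fill_missing(rows, strategy):
    """Return a copy of rows with None entries replaced.

    rows: one list of values per gene; None marks a missing data point.
    strategy: "one", "column", "gene", or "neighbors" (hypothetical names
    for the four options listed in the text).
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            if strategy == "one":
                # Use 1 as the expression ratio.
                filled[i][j] = 1.0
                continue
            if strategy == "column":
                # Average over the other genes in the same column.
                known = [r[j] for k, r in enumerate(rows)
                         if k != i and r[j] is not None]
            elif strategy == "gene":
                # Average over all other known points of the same gene.
                known = [v for v in row if v is not None]
            elif strategy == "neighbors":
                # Average over the nearest known point on each side.
                left = next((row[k] for k in range(j - 1, -1, -1)
                             if row[k] is not None), None)
                right = next((row[k] for k in range(j + 1, len(row))
                              if row[k] is not None), None)
                known = [v for v in (left, right) if v is not None]
            else:
                raise ValueError("unknown strategy: " + strategy)
            # mean() raises StatisticsError if no known value exists.
            filled[i][j] = mean(known)
    return filled
```

In this sketch, a missing point at the start or end of a profile has only one known neighbor, so the "neighbors" strategy falls back to that single value; how EXCAVATOR treats that case is not stated here.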
After you select your options in "Input/Parameter
setting for input...", click "OK". If you do not click "OK" or
you click "Cancel", EXCAVATOR will use the default parameters.
If you want to use the default settings, you do not need to open the
"Input/Parameter setting for input..." panel. The same applies
to all other panels in EXCAVATOR.
You may also remove some less regulated genes by using the "-remove"
flag followed by two values,
excavator -remove val1 val2 input.data
If the minimum value among all the data points of a gene is larger
than val1 and the maximum value among all the data points of the
gene is less than val2, this gene will be removed. The remaining data
will be saved in "Filter.data" (see an example).
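The removal rule can be written out as a short sketch. The function names and the dictionary of gene name to profile are assumptions for illustration; only the comparison itself is taken from the description above.

```python
def keep_gene(profile, val1, val2):
    """Return False when every value lies strictly between val1 and val2."""
    return not (min(profile) > val1 and max(profile) < val2)


def filter_genes(profiles, val1, val2):
    """Keep only the genes that the rule does not remove.

    profiles: dict mapping a gene name to its list of expression values.
    """
    return {name: values for name, values in profiles.items()
            if keep_gene(values, val1, val2)}
```

With val1 = 0.5 and val2 = 2.0, a profile such as 0.9, 1.0, 1.1 would be removed, while a profile that reaches 0.2 or 3.0 at any point would be kept.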
Similarity Measure
The similarity measure determines how the distance between gene
expression profiles is calculated. EXCAVATOR uses the "-dist n"
flag to select the similarity measure:
- "-dist 1" or no "-dist" flag (default): (1 - correlation coefficient).
- "-dist 11": (1 - square of correlation coefficient).
- "-dist 12": (1 - absolute value of correlation coefficient).
- "-dist 2": Euclidean distance.
- "-dist 21": square of Euclidean distance.
- "-dist 3": sine square of the angle between two vectors.
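For reference, the six measures can be sketched as one function keyed by the "-dist" value. This is an illustration only: it assumes that "correlation coefficient" means the Pearson correlation, which the text does not state, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def distance(x, y, dist=1):
    """Distance between two profiles for a given "-dist" value."""
    if dist in (1, 11, 12):
        r = pearson(x, y)
        if dist == 1:
            return 1 - r
        if dist == 11:
            return 1 - r * r
        return 1 - abs(r)
    if dist in (2, 21):
        squared = sum((a - b) ** 2 for a, b in zip(x, y))
        return squared if dist == 21 else math.sqrt(squared)
    if dist == 3:
        # sin^2 of the angle equals 1 - cos^2 of the angle.
        dot = sum(a * b for a, b in zip(x, y))
        norm_x2 = sum(a * a for a in x)
        norm_y2 = sum(b * b for b in y)
        return 1 - dot * dot / (norm_x2 * norm_y2)
    raise ValueError("unknown -dist value: %r" % dist)
```

Under these definitions, "-dist 1" treats anti-correlated profiles as far apart, whereas "-dist 11" and "-dist 12" treat them as close, which matters when inversely regulated genes should fall in the same cluster.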
Clustering Methods
EXCAVATOR offers the following clustering methods,
which can be specified at "Method/Clustering...":
- Hierarchical clustering (default) based on the objective function
for the selected similarity measure (to optimize the sum of
the distance between a gene and the center of its cluster hierarchically).
- Non-hierarchical clustering based on the objective function
for the selected similarity measure using an iterative approach
(to optimize the sum of the distance between a gene and the
center of its cluster iteratively; the clustering result may
not reach the global optimal solution for the objective function).
- Non-hierarchical clustering to optimize the sum of the distance
between a gene and its best representative gene in the cluster.
The globally optimal solution is guaranteed, but it takes much
longer than the other methods.
- Hierarchical clustering by simply cutting the longest edges on
the minimum spanning tree. It is the fastest method, but the
result may not be the desired one.
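The last method in the list, cutting the longest edges of the minimum spanning tree, is simple enough to sketch. The code below is a generic illustration that assumes Prim's algorithm over all pairwise distances and a caller-supplied distance function; it is not EXCAVATOR's implementation, and all names are hypothetical.

```python
import math


def mst_edges(points, dist):
    """Prim's algorithm on the complete graph; returns (weight, i, j)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known link into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def cut_longest_edges(points, dist, n_clusters):
    """Cluster by removing the n_clusters - 1 longest MST edges."""
    edges = sorted(mst_edges(points, dist))
    kept = edges[:len(edges) - (n_clusters - 1)]
    root = list(range(len(points)))

    def find(i):
        # Follow links to the representative, compressing the path.
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    labels = {}
    # Number the clusters in order of first appearance.
    return [labels.setdefault(find(i), len(labels))
            for i in range(len(points))]
```

For example, `cut_longest_edges(profiles, math.dist, 3)` would split a list of profiles into three clusters using the Euclidean distance. A single outlying gene ends up as its own cluster under this scheme, which is one way the result "may not be the desired one".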
Number of Clusters
The method to determine the number of clusters can be selected
in "Method/Clustering...".
A user can either specify the number of clusters or let EXCAVATOR
determine it automatically. To specify the number of clusters,
use "-ncluster n" flag, where "n" is the number of clusters, e.g.,
excavator -ncluster 3 input.data
EXCAVATOR can determine the number of clusters by calculating
the objective function for different numbers of clusters up to
"MaxNCluster" clusters, where "MaxNCluster" can be specified by the
"-maxncluster MaxNCluster" flag, e.g.,
excavator -maxncluster 100 input.data
Without either the "-ncluster" or the "-maxncluster" flag (i.e., by default),
EXCAVATOR will automatically set the value of "MaxNCluster" (up
to 1/3 of the total number of genes, depending on the clustering
method). If you would like to see the objective functions for different
numbers of clusters first, you can run
excavator -profile input.data
which will generate "quality.data" and "diff.data" (see the following).
- "-cutoff CutoffDistance -mne MinNumElement": cut all the
edges longer than "CutoffDistance" and remove all the small
clusters with fewer than "MinNumElement" elements.
Constraints
EXCAVATOR allows a user to add constraints so that certain
specified genes will stay in the same cluster. The flag "-constraint
ConstraintFile" will enforce the constraints. The format of
the "ConstraintFile" is that genes in the same line (separated
by spaces or tabs) are forced into the same cluster (e.g.,
see list.cons). For example, by using the file CEFH.cons,
excavator -input 2 -cons CEFH.cons -ncluster 4 CEFH.dat
will force the genes "eTYE7" and "cRPT3" into the same cluster.
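The constraint file format described above (one group of genes per line, separated by spaces or tabs) can be read with a few lines of code. This is a sketch for checking a result against a constraint file, not part of EXCAVATOR; the function names and the gene-to-cluster dictionary are assumptions.

```python
def parse_constraints(lines):
    """Turn constraint-file lines into a list of gene groups.

    Each non-empty line yields one group; genes are split on any run
    of spaces or tabs.
    """
    groups = []
    for line in lines:
        genes = line.split()
        if genes:
            groups.append(genes)
    return groups


def violated(groups, assignment):
    """Return the groups whose genes are spread over several clusters.

    assignment: dict mapping a gene name to its cluster label.
    """
    return [group for group in groups
            if len({assignment[gene] for gene in group}) > 1]
```

A file can be passed directly, e.g. `parse_constraints(open("CEFH.cons"))`, and an empty list from `violated` means every constraint group ended up in a single cluster.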
Another related option in conjunction with the "-cutoff CutoffDistance"
flag is to cut long edges with distance longer than "CutoffDistance"
and then select the clusters which contain the genes specified
in "ConstraintFile". In this case, a "-cn" flag is needed,
e.g.,
excavator -input 2 -cons CEFH.cons -cutoff 0.1 -cn CEFH.dat
In this case, only 26 out of the original 68 genes in CEFH.dat
will be left and they contain the genes "eTYE7" and "cRPT3"
specified in "CEFH.cons". Instead of using the "-cutoff CutoffDistance"
flag, you can also choose the number of genes to be left instead
of "CutoffDistance" by using the "-ns NumLeft" flag, e.g.,
excavator -input 2 -cons CEFH.cons -cn -ns 20 CEFH.dat
will leave 20 out of the original 68 genes in CEFH.dat, and
they also contain the genes "eTYE7" and "cRPT3" specified in
"CEFH.cons". Sometimes the number of genes left may differ a
little from what is specified ("NumLeft") due to the constraints.
Output Files
EXCAVATOR may produce the following files, depending on the
flags used in the command line:
Please note that EXCAVATOR will overwrite existing files with
the same names as specified above. Hence, if you want to save
the results of your previous run, you need to rename the related
files.
Comparing Clustering Results
You can compare two clustering results (in the cluster.out format) on the same data set using the "-comp"
flag:
excavator -comp cluster.out cluster.out
It will give a value between 0 (most different clustering results)
and 1 (identical clustering results).
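The text does not say which agreement measure "-comp" computes. As one common example of a score that is 1 for identical clusterings and lies between 0 and 1, the Rand index is sketched below; treating it as EXCAVATOR's measure would be an assumption, and the dictionary input is a stand-in for the cluster.out format, which is not described here.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of gene pairs on which two clusterings agree.

    a, b: dicts mapping each gene name to a cluster label. A pair
    counts as agreeing when it is together in both clusterings or
    separated in both.
    """
    if set(a) != set(b):
        raise ValueError("clusterings must cover the same genes")
    pairs = list(combinations(sorted(a), 2))
    if not pairs:
        return 1.0
    agree = sum((a[g] == a[h]) == (b[g] == b[h]) for g, h in pairs)
    return agree / len(pairs)
```

The cluster labels themselves do not need to match between the two results; only the grouping of genes matters.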