DIRECTION
Discriminative IntegRative whole Epigenome Classification Toolkit at single nucleotide resolutION
Manual
Modes
Mode | Use |
---|---|
training | This mode outputs a trained model (SVM or RF) for a given feature set, by sampling balanced sets from given input data. |
testing | This mode performs testing to provide prediction metrics. It outputs the precision/recall metric on a testing data using user-specified previously trained model with feature set. |
testing_whole_genome | This mode performs testing to provide predictions. It outputs the actual predictions of the model for a given input dataset, using user-specified previously trained model with feature set. It can be used to perform imputation, prediction limited to a certain set of genomic regions, or whole genome prediction. |
crossvalidation | This mode samples balanced sets from the training data, partitions them into k sets for cross validation purposes, and performs testing for each partition, by training on the remaining partition, for a given feature set. It outputs precision/recall metrics across different test partitions, along with the associated model and feature set. |
beamsearch | This mode takes as input an initial feature set, and training data. It performs beam search to identify the best feature set (a subset of the initial feature set) according to user specified prediction metric, and outputs the best feature set, trained model along with the metric. It also outputs all evaluated feature sets with respective metrics. It has the cross-validation mode embedded inside it. |
create_bed_to_binary | This mode is for database management. It takes as input multiple feature sets in bed format, and merges them together into 1 matrix and saves it in binary file (.mat binary MATLAB format) format. This ensures a consistent, compact input matrix which enhances overall computational performance. |
append_bed_to_binary | This mode is also for database management. It takes a single feature in .bed format as an input, appends it to a larger matrix in a user-specified .mat file and eventually updates the .mat binary file. |
Usage
From the Matlab terminal:>> toplevel_script <config_file_path>For each mode, we provide a sample configuration file, to help the user understand how to run DIRECTION in this mode.
>> toplevel_script training_mode.cfg >> toplevel_script testing_mode.cfg >> toplevel_script testing_whole_genome_mode.cfg >> toplevel_script crossvalidation_mode.cfg >> toplevel_script beamsearch_mode.cfg >> toplevel_script create_bed_to_binary_mode.cfg >> toplevel_script append_bed_to_binary_mode.cfgAdditional files required to run the tool in these modes are also provided, with the exception of bedgraph files required to generate the binary data matrix. An example binary data matrix is instead provided directly for convenience.
Input and output files for each mode
Configuration file | Featureset file | Constraint files | Model file | Feature value files | Data matrix file | Prediction file(s) | Metrics file | Candidate featureset list file | Log file | |
---|---|---|---|---|---|---|---|---|---|---|
Format | ||||||||||
Encoding | ASCII | ASCII | ASCII | Binary | ASCII | Binary | ASCII | ASCII | ASCII | ASCII | Format | TSV (space / tab separated) | TSV (space / tab separated) | TSV (space / tab separated) | Matlab MAT file (binary data) | bedgraph (4 column, tab separated) | Matlab MAT file (binary data) | bedgraph (4/5 column, tab separated) | text file | text file | text file | Input path | <config_file_path> |
<path_to_feature_list> |
<path_to_pos_constraints> , <path_to_neg_constraints> to select eg.s for 2 classes |
<path_to_model> |
DATAPATH/chr*.bed |
./internal/binarydata/<dataset_name>.mat |
Output path | OUTPATH/<model_type>/*.mat |
./internal/binarydata/<dataset_name>.mat |
OUTPATH/chr*.bed or OUTPATH/predictions.txt (crossvalidation mode) |
OUTPATH/*_metrics.txt |
OUTPATH/out.txt |
./internal/logfile.txt |
File extension | .cfg (recommended) | .ftr (recommended) | .cst (recommended) | .mat (required) | .bed (recommended) | .mat (required) | .bed | .txt | .txt | .txt |
Mode requirements | ||||||||||
crossvalidation |
input | input | input | output | input | output | output | output | ||
training |
input | input | input | output | input | output | ||||
testing |
input | input | input | output | output | |||||
testing_whole_genome |
input | input | input | output | output | |||||
beamsearch |
input | input | input | output | output | |||||
create_bed_to_binary |
input | input | output | output | ||||||
append_bed_to_binary |
input | input | input & output (updated) | output |
*
represents a 'NIX environment wildcard match<config_file_path>
is specified as a command line parameter<model_type>
is "svm" or "rf", depending on which model is used <mode>
is specified in mode
field of configuration file<dataset_name>
is specified in dataset_name
field of configuration file<experiment_name>
is specified in experiment_name
field of configuration file<path_to_feature_list>
is specified in path_to_feature_list
field of configuration file<path_to_pos_constraints>
is specified in path_to_pos_constraints
field of configuration file<path_to_neg_constraints_path>
is specified in path_to_neg_constraints
field of configuration file<path_to_model>
is specified in path_to_model
field of configuration file<path_to_data>
is specified in path_to_data
field of configuration file<path_to_additional_feature>
is specified in path_to_additional_feature
field of configuration file<feature_name>
is specified in <feature_name>
field of configuration fileOUTPATH =
./output/<mode>/<experiment_name>
DATAPATH =
<path_to_data>/*
for add_bed_to_binary
mode, i.e. all subfolders under <path_to_data>/*
DATAPATH =
<path_to_additional_feature>
for append_bed_to_binary
modeFile formats
bedgraph files (feature value files and prediction files)
Notes: One file is created per chromosome, and named using the chromosome id (eg. chr12.bed) by convention.Structure : Each line is tab separated, with columns as follows :
Column 1: chromosome name
Column 2: starting position of genomic region [ cytosine location ] (1-based coordinate system)
Column 3: ending position of genomic region - 1 [cytosine location + 1 ] (1-based coordinate system)
Column 4: value associated with genomic region [ feature value for feature value bedgraph files, predicted state for prediction bedgraph files ]
For predictions.txt file, an additional fifth column is present. Headers are also present in this file (unlike normal bedgraph files)
Example:
Excerpt from chr1.bed
chr1 10697 10698 high chr1 10760 10761 high chr1 15526 15527 high chr1 16141 16142 high chr1 16244 16245 high chr1 23974 23975 high chr1 28908 28909 high chr1 28908 28909 high
Featureset files
Notes: Tab or space separated file containing one line per feature contained in./internal/<dataset_name>.mat
Structure : Each line is tab separated, with columns as follows :
Column 1: 0 or 1 [depending on whether (1) or not (0) the feature will be used for training the model]
Column 2: name of the feature, as present in
./internal/<dataset_name>.mat
Example:
Excerpt from small_list.ftr
0 bisulfite_coverage 0 bisulfite_level 1 bp_to_cgi 0 cg_sat_100 0 cg_sat_200 0 cg_sat_400 0 cg_sat_50 0 cpg_to_cgi 1 DNase 0 g_sat_100 0 H3K4me2 1 H3K4me3 0 H3K79me1 0 H3K9me3 0 H4K8ac 0 H4K91ac 1 histone_states 1 repeats
Constraint files
Notes: Tab or space separated file containing one line per constraint for selecting each class, hence two files are required for binary classification.Structure : Each line is tab separated, with columns as follows :
Column 1: name of the feature, as present in
./internal/<dataset_name>.mat
Column 2: binary operator > >= == ~= < <= supported
Column 3: corresponding value of feature
Example:
low_methylation.cfg
bisulfite_level < 0.49999 bisulfite_coverage >= 20high_methylation.cfg
bisulfite_level >= 0.5 bisulfite_coverage >= 20
Configuration files
Notes: Tab or space separated file containing one line per parameter for running the tool. Any particular order of parameters is not required.Structure : Each line is tab separated, with columns as follows :
Column 1: Parameter name
Column 2: Parameter value
Example:
crossvalidation_mode.cfg
mode crossvalidation experiment_name toy_methylation dataset_name toy pos_sample_set_size 5000 neg_sample_set_size 5000 crossvalidation_fold 5 iterations_for_svm_convergence 10e7 kkt_violation_fraction 0.05 tries_for_svm_convergence 3 path_to_feature_list ./feature_list/feature_list.txt path_to_pos_constraints ./constraints/high_methylation.txt path_to_neg_constraints ./constraints/low_methylation.txt split_feature crossvalidation
MAT files
Notes: Matlab's internal binary format for storing variables in binary files. Forward and reverse compatible among versions of Matlab. These files are used to store the data matrix, and trained SVM/RF models. It is important to note that while creating the data matrix, 4 feature names are reserved.bisulfite_coverage
and bisulfite_level
are required feature names (for the feature vectors with BS-seq coverage and CCRs respectively) that is required by the tool for methylation status prediction, and tab_seq_coverage
and tab_seq_level
are required feature names (for the feature vectors with TAB-seq coverage and CCRs respectively) that is required by the tool for 5hmC status prediction.Structure : Not human readable.
Metric files
Notes: For cross-validation, DIRECTION outputs two files : crossvalidation_batch_metrics.txt and crossvalidation_overall_metrics.txt .Structure : Tab separated, with one line per metric. Batch metrics file reports metrics per crossvalidation fold. Overall metrics file reports metrics by pooling all crossvalidation folds.
Example:
crossvalidation_batch_metrics.txt
TP 974 972 948 969 974 TN 786 795 812 804 788 FP 26 28 52 31 26 FN 214 205 188 196 212 Recall 0.81 0.82 0.83 0.83 0.82 FPR 0.03 0.03 0.06 0.03 0.03 Accuracy 0.88 0.88 0.88 0.88 0.88 Specificity 0.96 0.96 0.93 0.96 0.96 Precision 0.97 0.97 0.94 0.96 0.97 NPV 0.78 0.79 0.81 0.80 0.78 FDR 0.02 0.02 0.05 0.03 0.02 Fscore 0.89 0.89 0.88 0.89 0.89crossvalidation_overall_metrics.txt
TP 4837 TN 3985 FP 163 FN 1015 Recall 0.82 FPR 0.03 Accuracy 0.88 Specificity 0.96 Precision 0.96 NPV 0.79 FDR 0.03 Fscore 0.89
Candidate featureset list files
Notes: For beam search, DIRECTION outputs a ranked list of candidate featuresets, sorted according to the classification metric of our choice .Structure : Tab separated, with two lines per featureset. The first line lists feature names for candidate featuresets as present in
./internal/<dataset_name>.mat
.(feature names in the data matrix files are without quotes). The second line lists three metrics: Fscore, precision, recall, and order of evaluation of featureset.
Example:
out.txt
"DNase" "H3K4me3" "histone_states" "repeats" 0.8808 0.9460 0.8240 4 "DNase" "H3K4me3" "bp_to_cgi" "histone_states" "repeats" 0.8803 0.9670 0.8079 1 "DNase" "bp_to_cgi" "histone_states" "repeats" 0.8801 0.9580 0.8139 3 "DNase" "H3K4me3" "bp_to_cgi" "histone_states" 0.8730 0.9490 0.8083 6 "DNase" "H3K4me3" "bp_to_cgi" "repeats" 0.8660 0.9660 0.7847 5
Log files
Notes: For all files a log file is written in./internal/logfile.txt
, for diagnostic purposes. Structure : One line per timestamp.
Example:
logfile.txt
2017-4-14 11:53:15 Run started 2017-4-14 11:53:15 beamsearch mode started 2017-4-14 11:53:15 Feature data loading started 2017-4-14 11:53:15 Filtering started 2017-4-14 11:53:15 Evaluation 1 started 2017-4-14 11:53:15 Cross validation fold 1 started 2017-4-14 11:53:18 Cross validation fold 2 started 2017-4-14 11:53:21 Cross validation fold 3 started 2017-4-14 11:53:23 Cross validation fold 4 started 2017-4-14 11:53:26 Cross validation fold 5 started 2017-4-14 11:53:29 Evaluation 2 started 2017-4-14 11:53:29 Cross validation fold 1 started 2017-4-14 11:53:32 Cross validation fold 2 started 2017-4-14 11:53:34 Cross validation fold 3 started 2017-4-14 11:53:37 Cross validation fold 4 started 2017-4-14 11:53:39 Cross validation fold 5 started 2017-4-14 11:53:41 Evaluation 3 started 2017-4-14 11:53:41 Cross validation fold 1 started 2017-4-14 11:53:44 Cross validation fold 2 started 2017-4-14 11:53:44 Cross validation fold 3 started 2017-4-14 11:53:46 Cross validation fold 4 started 2017-4-14 11:53:47 Cross validation fold 5 started 2017-4-14 11:53:52 Sorting f-scores and writing best feature sets to file 2017-4-14 11:53:52 beamsearch mode ended
Configuration file parameters
Parameter name | Parameter value | Relevant modes | Required/optional (Default values) | Notes |
---|---|---|---|---|
mode |
crossvalidation | beamsearch | training | testing | testing_whole_genome | create_bed_to_binary | append_bed_to_binary |
All modes | Required | For configuring DIRECTION to run in particular mode, one of these modes have to be specified |
experiment_name |
one word unique string id for the experiment being run | All modes except create_bed_to_binary and append_bed_to_binary | Required | All output files will be stored under ./output/<mode>/<experiment_name>/ |
dataset_name |
one word string dataset_name |
All modes | Required | Needs to correspond to the binary data matrix file being used in the experiment at ./internal/binarydata/<dataset_name>.mat |
path_to_feature_list |
absolute or relative path to featureset file, including filename | training, crossvalidation, beamsearch | Required | Modes where a trained model is provided as input do not require this input. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred. |
path_to_pos_constraints |
absolute or relative path to constraints file for positive examples, including filename | training, crossvalidation, beamsearch | Required | Feature names in file need to be commensurate with feature names in the binary data matrix file. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred. |
path_to_neg_constraints |
absolute or relative path to constraints file for positive examples, including filename | training, crossvalidation, beamsearch | Required | Feature names in file need to be commensurate with feature names in the binary data matrix file. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred. |
path_to_data |
absolute or relative path to data folder, and without a trailing "/" | create_bed_to_binary | Required | A single ungapped string is required. Subfolders under the path each correspond to one feature, and subfolder names will correspond to created feature names. Inside each subfolder, one bedgraph file per chromosome, with one line per cytosine, is expected. |
path_to_additional_feature |
absolute or relative path to data folder, and without a trailing "/" | append_bed_to_binary | Required | A single ungapped string is required. Inside the folder, one bedgraph file per chromosome, with one line per cytosine, is expected. The feature name is not generated from the folder name, but from the feature_name>/code> parameter in the configuration file. |
feature_name |
string corresponding to the name of the appended feature in the data matrix | append_bed_to_binary | Required | A single ungapped string is required. The feature name will correspond to the feature data inside path_to_additional_feature . The feature name cannot match existing feature names in the binary data matrix. |
pos_sample_set_size |
positive integer representing no. of positive examples to be used for training | training, crossvalidation, beamsearch | Optional (default: 5000) | Takes significantly longer to converge above 10000 |
neg_sample_set_size |
positive integer representing no. of negative examples to be used for training | training, crossvalidation, beamsearch | Optional (default: 5000) | Takes significantly longer to converge above 10000 |
iterations_for_svm_convergence |
positive integer representing maximum no. of iterations for SVM training convergence | training, crossvalidation, beamsearch (SVM only) | Optional (default: 10^7) | Typically set within 5 orders of magnitude between 10^5 and 10^9 |
kkt_violation_fraction |
positive fraction between 0 and 1 specifying the number of variables | training, crossvalidation, beamsearch (SVM only) | Optional (default: 0.05) | Refer to (Karush, 1939) or (Kuhn and Tucker,1951) for details on formulating nonlinear programming. Presently Sequential Minimal Optimization is used in DIRECTION. |
tries_for_model_convergence |
Non-negative integer | training, crossvalidation, beamsearch | Optional (default: 3) | From trial and error, a value > 3 for SVM training does not yield improved results. For RF training, since the TreeBagger native function terminates deterministically, this value is ignored currently. |
crossvalidationfold |
positive integer specifying k in k-fold cross validation | crossvalidation, beamsearch | Optional (default: 5) | Each fold should have at least 2000 data points for a robust estimation of test set error rate per fold. |
beam_width |
positive integer corresponding to the beam width of the search | beamsearch | Optional (default: 3) | For beam width = n, where n is the number of features in the initial feature set, search becomes A* search. For beam width = 1, search becomes greedy hill climbing. |
priority_metric |
fscore | precision | recall |
beamsearch | Required | Additional metrics to be added in the future |
max_evaluations |
positive integer capping the number of feature sets evaluated in the search | beamsearch | Required | 1500 allows for a practical, prioritized search of feature subset space landscape generated from ~ 15 features |
num_of_top_evaluations |
positive integer corresponding to the number of feature sets to report in out.txt based on featuresets with the highest classification metrics | beamsearch | Required | Should be <= max_evaluations . |
path_to_model |
absolute or relative path to constraints file for positive examples, including filename | testing, testing_whole_genome | Required | A single ungapped string is required. The feature names in the feature vector associated with the model must match the feature names in the binary data matrix. |
testing_size |
positive integer corresponding to the number of data points used by DIRECTION to perform prediction | testing | Required | Maximizing testing_size to the dataset size converts testing mode to testing_whole_genome_mode. |
batch_size |
positive integer corresponding to the number of data points per batch of testing by DIRECTION | testing, testing_whole_genome | Required (currently set up to 5000) | To be deprecated in the future. |
split_feature |
crossvalidation |
crossvalidation | Required (currently always set to the string "crossvalidation") | For future use. |