pradipta ray


DIRECTION

Discriminative IntegRative whole Epigenome Classification Toolkit at single nucleotide resolutION

Manual

Modes

Mode Use
training This mode outputs a trained model (SVM or RF) for a given feature set, by sampling balanced sets from given input data.
testing This mode performs testing to provide prediction metrics. It outputs the precision/recall metric on a testing data using user-specified previously trained model with feature set.
testing_whole_genome This mode performs testing to provide predictions. It outputs the actual predictions of the model for a given input dataset, using user-specified previously trained model with feature set. It can be used to perform imputation, prediction limited to a certain set of genomic regions, or whole genome prediction.
crossvalidation This mode samples balanced sets from the training data, partitions them into k sets for cross validation purposes, and performs testing for each partition, by training on the remaining partition, for a given feature set. It outputs precision/recall metrics across different test partitions, along with the associated model and feature set.
beamsearch This mode takes as input an initial feature set, and training data. It performs beam search to identify the best feature set (a subset of the initial feature set) according to user specified prediction metric, and outputs the best feature set, trained model along with the metric. It also outputs all evaluated feature sets with respective metrics. It has the cross-validation mode embedded inside it.
create_bed_to_binary This mode is for database management. It takes as input multiple feature sets in bed format, and merges them together into 1 matrix and saves it in binary file (.mat binary MATLAB format) format. This ensures a consistent, compact input matrix which enhances overall computational performance.
append_bed_to_binary This mode is also for database management. It takes a single feature in .bed format as an input, appends it to a larger matrix in a user-specified .mat file and eventually updates the .mat binary file.

Usage

From the Matlab terminal:
>> toplevel_script <config_file_path> 
For each mode, we provide a sample configuration file, to help the user understand how to run DIRECTION in this mode.
>> toplevel_script training_mode.cfg
>> toplevel_script testing_mode.cfg
>> toplevel_script testing_whole_genome_mode.cfg
>> toplevel_script crossvalidation_mode.cfg
>> toplevel_script beamsearch_mode.cfg
>> toplevel_script create_bed_to_binary_mode.cfg
>> toplevel_script append_bed_to_binary_mode.cfg
Additional files required to run the tool in these modes are also provided, with the exception of bedgraph files required to generate the binary data matrix. An example binary data matrix is instead provided directly for convenience.

Input and output files for each mode

Configuration file Featureset file Constraint files Model file Feature value files Data matrix file Prediction file(s) Metrics file Candidate featureset list file Log file
Format
Encoding ASCII ASCII ASCII Binary ASCII Binary ASCII ASCII ASCII ASCII
Format TSV (space / tab separated) TSV (space / tab separated) TSV (space / tab separated) Matlab MAT file (binary data) bedgraph (4 column, tab separated) Matlab MAT file (binary data) bedgraph (4/5 column, tab separated) text file text file text file
Input path <config_file_path> <path_to_feature_list> <path_to_pos_constraints> , <path_to_neg_constraints> to select eg.s for 2 classes <path_to_model> DATAPATH/chr*.bed ./internal/binarydata/<dataset_name>.mat
Output path OUTPATH/<model_type>/*.mat ./internal/binarydata/<dataset_name>.mat OUTPATH/chr*.bed or OUTPATH/predictions.txt (crossvalidation mode) OUTPATH/*_metrics.txt OUTPATH/out.txt ./internal/logfile.txt
File extension .cfg (recommended) .ftr (recommended) .cst (recommended) .mat (required) .bed (recommended) .mat (required) .bed .txt .txt .txt
Mode requirements
crossvalidation input input input output input output output output
training input input input output input output
testing input input input output output
testing_whole_genome input input input output output
beamsearch input input input output output
create_bed_to_binary input input output output
append_bed_to_binary input input input & output (updated) output
* represents a 'NIX environment wildcard match
<config_file_path> is specified as a command line parameter
<model_type> is "svm" or "rf", depending on which model is used

<mode> is specified in mode field of configuration file
<dataset_name> is specified in dataset_name field of configuration file
<experiment_name> is specified in experiment_name field of configuration file
<path_to_feature_list> is specified in path_to_feature_list field of configuration file
<path_to_pos_constraints> is specified in path_to_pos_constraints field of configuration file
<path_to_neg_constraints_path> is specified in path_to_neg_constraints field of configuration file
<path_to_model> is specified in path_to_model field of configuration file
<path_to_data> is specified in path_to_data field of configuration file
<path_to_additional_feature> is specified in path_to_additional_feature field of configuration file
<feature_name> is specified in <feature_name> field of configuration file

OUTPATH = ./output/<mode>/<experiment_name>

DATAPATH = <path_to_data>/* for add_bed_to_binary mode, i.e. all subfolders under <path_to_data>/*
DATAPATH = <path_to_additional_feature> for append_bed_to_binary mode

File formats

bedgraph files (feature value files and prediction files)
Notes: One file is created per chromosome, and named using the chromosome id (eg. chr12.bed) by convention.

Structure : Each line is tab separated, with columns as follows :
Column 1: chromosome name
Column 2: starting position of genomic region [ cytosine location ] (1-based coordinate system)
Column 3: ending position of genomic region - 1 [cytosine location + 1 ] (1-based coordinate system)
Column 4: value associated with genomic region [ feature value for feature value bedgraph files, predicted state for prediction bedgraph files ]

For predictions.txt file, an additional fifth column is present. Headers are also present in this file (unlike normal bedgraph files)

Example:
Excerpt from chr1.bed
chr1	10697	10698	high
chr1	10760	10761	high
chr1	15526	15527	high
chr1	16141	16142	high
chr1	16244	16245	high
chr1	23974	23975	high
chr1	28908	28909	high
chr1	28908	28909	high
Featureset files
Notes: Tab or space separated file containing one line per feature contained in ./internal/<dataset_name>.mat

Structure : Each line is tab separated, with columns as follows :
Column 1: 0 or 1 [depending on whether (1) or not (0) the feature will be used for training the model]
Column 2: name of the feature, as present in ./internal/<dataset_name>.mat

Example:
Excerpt from small_list.ftr
0	bisulfite_coverage
0	bisulfite_level
1	bp_to_cgi
0	cg_sat_100
0	cg_sat_200
0	cg_sat_400
0	cg_sat_50
0	cpg_to_cgi
1	DNase
0	g_sat_100
0	H3K4me2
1	H3K4me3
0	H3K79me1
0	H3K9me3
0	H4K8ac
0	H4K91ac
1	histone_states
1	repeats
Constraint files
Notes: Tab or space separated file containing one line per constraint for selecting each class, hence two files are required for binary classification.

Structure : Each line is tab separated, with columns as follows :
Column 1: name of the feature, as present in ./internal/<dataset_name>.mat
Column 2: binary operator > >= == ~= < <= supported
Column 3: corresponding value of feature

Example:
low_methylation.cfg
bisulfite_level	<	0.49999
bisulfite_coverage	>=	20
high_methylation.cfg
bisulfite_level	>=	0.5
bisulfite_coverage	>=	20
Configuration files
Notes: Tab or space separated file containing one line per parameter for running the tool. Any particular order of parameters is not required.

Structure : Each line is tab separated, with columns as follows :
Column 1: Parameter name
Column 2: Parameter value

Example:
crossvalidation_mode.cfg
mode	crossvalidation
experiment_name	toy_methylation
dataset_name	toy
pos_sample_set_size	5000
neg_sample_set_size	5000
crossvalidation_fold	5
iterations_for_svm_convergence	10e7
kkt_violation_fraction	0.05
tries_for_svm_convergence	3
path_to_feature_list	./feature_list/feature_list.txt
path_to_pos_constraints	./constraints/high_methylation.txt
path_to_neg_constraints	./constraints/low_methylation.txt
split_feature	crossvalidation
MAT files
Notes: Matlab's internal binary format for storing variables in binary files. Forward and reverse compatible among versions of Matlab. These files are used to store the data matrix, and trained SVM/RF models. It is important to note that while creating the data matrix, 4 feature names are reserved. bisulfite_coverage and bisulfite_level are required feature names (for the feature vectors with BS-seq coverage and CCRs respectively) that is required by the tool for methylation status prediction, and tab_seq_coverage and tab_seq_level are required feature names (for the feature vectors with TAB-seq coverage and CCRs respectively) that is required by the tool for 5hmC status prediction.

Structure : Not human readable.
Metric files
Notes: For cross-validation, DIRECTION outputs two files : crossvalidation_batch_metrics.txt and crossvalidation_overall_metrics.txt .

Structure : Tab separated, with one line per metric. Batch metrics file reports metrics per crossvalidation fold. Overall metrics file reports metrics by pooling all crossvalidation folds.

Example:
crossvalidation_batch_metrics.txt
TP	974	972	948	969	974
TN	786	795	812	804	788
FP	26	28	52	31	26
FN	214	205	188	196	212
Recall	0.81	0.82	0.83	0.83	0.82
FPR	0.03	0.03	0.06	0.03	0.03
Accuracy	0.88	0.88	0.88	0.88	0.88
Specificity	0.96	0.96	0.93	0.96	0.96
Precision	0.97	0.97	0.94	0.96	0.97
NPV	0.78	0.79	0.81	0.80	0.78
FDR	0.02	0.02	0.05	0.03	0.02
Fscore	0.89	0.89	0.88	0.89	0.89
crossvalidation_overall_metrics.txt
TP	4837
TN	3985
FP	163
FN	1015
Recall	0.82
FPR	0.03
Accuracy	0.88
Specificity	0.96
Precision	0.96
NPV	0.79
FDR	0.03
Fscore	0.89
Candidate featureset list files
Notes: For beam search, DIRECTION outputs a ranked list of candidate featuresets, sorted according to the classification metric of our choice .

Structure : Tab separated, with two lines per featureset. The first line lists feature names for candidate featuresets as present in ./internal/<dataset_name>.mat.
(feature names in the data matrix files are without quotes). The second line lists three metrics: Fscore, precision, recall, and order of evaluation of featureset.
Example:
out.txt
"DNase"	"H3K4me3"	"histone_states"	"repeats"
0.8808	0.9460	0.8240	4
"DNase"	"H3K4me3"	"bp_to_cgi"	"histone_states"	"repeats"
0.8803	0.9670	0.8079	1
"DNase"	"bp_to_cgi"	"histone_states"	"repeats"
0.8801	0.9580	0.8139	3
"DNase"	"H3K4me3"	"bp_to_cgi"	"histone_states"
0.8730	0.9490	0.8083	6
"DNase"	"H3K4me3"	"bp_to_cgi"	"repeats"
0.8660	0.9660	0.7847	5
Log files
Notes: For all files a log file is written in ./internal/logfile.txt, for diagnostic purposes.

Structure : One line per timestamp.

Example:
logfile.txt
2017-4-14 11:53:15 Run started
2017-4-14 11:53:15 beamsearch mode started
2017-4-14 11:53:15 Feature data loading started
2017-4-14 11:53:15 Filtering started
2017-4-14 11:53:15 Evaluation 1 started
2017-4-14 11:53:15 Cross validation fold 1 started
2017-4-14 11:53:18 Cross validation fold 2 started
2017-4-14 11:53:21 Cross validation fold 3 started
2017-4-14 11:53:23 Cross validation fold 4 started
2017-4-14 11:53:26 Cross validation fold 5 started
2017-4-14 11:53:29 Evaluation 2 started
2017-4-14 11:53:29 Cross validation fold 1 started
2017-4-14 11:53:32 Cross validation fold 2 started
2017-4-14 11:53:34 Cross validation fold 3 started
2017-4-14 11:53:37 Cross validation fold 4 started
2017-4-14 11:53:39 Cross validation fold 5 started
2017-4-14 11:53:41 Evaluation 3 started
2017-4-14 11:53:41 Cross validation fold 1 started
2017-4-14 11:53:44 Cross validation fold 2 started
2017-4-14 11:53:44 Cross validation fold 3 started
2017-4-14 11:53:46 Cross validation fold 4 started
2017-4-14 11:53:47 Cross validation fold 5 started
2017-4-14 11:53:52 Sorting f-scores and writing best feature sets to file
2017-4-14 11:53:52 beamsearch mode ended

Configuration file parameters

Parameter name Parameter value Relevant modes Required/optional (Default values) Notes
mode crossvalidation | beamsearch | training | testing | testing_whole_genome | create_bed_to_binary | append_bed_to_binary All modes Required For configuring DIRECTION to run in particular mode, one of these modes have to be specified
experiment_name one word unique string id for the experiment being run All modes except create_bed_to_binary and append_bed_to_binary Required All output files will be stored under ./output/<mode>/<experiment_name>/
dataset_name one word string dataset_name All modes Required Needs to correspond to the binary data matrix file being used in the experiment at ./internal/binarydata/<dataset_name>.mat
path_to_feature_list absolute or relative path to featureset file, including filename training, crossvalidation, beamsearch Required Modes where a trained model is provided as input do not require this input. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred.
path_to_pos_constraints absolute or relative path to constraints file for positive examples, including filename training, crossvalidation, beamsearch Required Feature names in file need to be commensurate with feature names in the binary data matrix file. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred.
path_to_neg_constraints absolute or relative path to constraints file for positive examples, including filename training, crossvalidation, beamsearch Required Feature names in file need to be commensurate with feature names in the binary data matrix file. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred.
path_to_data absolute or relative path to data folder, and without a trailing "/" create_bed_to_binary Required A single ungapped string is required. Subfolders under the path each correspond to one feature, and subfolder names will correspond to created feature names. Inside each subfolder, one bedgraph file per chromosome, with one line per cytosine, is expected.
path_to_additional_feature absolute or relative path to data folder, and without a trailing "/" append_bed_to_binary Required A single ungapped string is required. Inside the folder, one bedgraph file per chromosome, with one line per cytosine, is expected. The feature name is not generated from the folder name, but from the feature_name>/code> parameter in the configuration file.
feature_name string corresponding to the name of the appended feature in the data matrix append_bed_to_binary Required A single ungapped string is required. The feature name will correspond to the feature data inside path_to_additional_feature. The feature name cannot match existing feature names in the binary data matrix.
pos_sample_set_size positive integer representing no. of positive examples to be used for training training, crossvalidation, beamsearch Optional (default: 5000) Takes significantly longer to converge above 10000
neg_sample_set_size positive integer representing no. of negative examples to be used for training training, crossvalidation, beamsearch Optional (default: 5000) Takes significantly longer to converge above 10000
iterations_for_svm_convergence positive integer representing maximum no. of iterations for SVM training convergence training, crossvalidation, beamsearch (SVM only) Optional (default: 10^7) Typically set within 5 orders of magnitude between 10^5 and 10^9
kkt_violation_fraction positive fraction between 0 and 1 specifying the number of variables training, crossvalidation, beamsearch (SVM only) Optional (default: 0.05) Refer to (Karush, 1939) or (Kuhn and Tucker,1951) for details on formulating nonlinear programming. Presently Sequential Minimal Optimization is used in DIRECTION.
tries_for_model_convergence Non-negative integer training, crossvalidation, beamsearch Optional (default: 3) From trial and error, a value > 3 for SVM training does not yield improved results. For RF training, since the TreeBagger native function terminates deterministically, this value is ignored currently.
crossvalidationfold positive integer specifying k in k-fold cross validation crossvalidation, beamsearch Optional (default: 5) Each fold should have at least 2000 data points for a robust estimation of test set error rate per fold.
beam_width positive integer corresponding to the beam width of the search beamsearch Optional (default: 3) For beam width = n, where n is the number of features in the initial feature set, search becomes A* search. For beam width = 1, search becomes greedy hill climbing.
priority_metric fscore | precision | recall beamsearch Required Additional metrics to be added in the future
max_evaluations positive integer capping the number of feature sets evaluated in the search beamsearch Required 1500 allows for a practical, prioritized search of feature subset space landscape generated from ~ 15 features
num_of_top_evaluations positive integer corresponding to the number of feature sets to report in out.txt based on featuresets with the highest classification metrics beamsearch Required Should be <= max_evaluations.
path_to_model absolute or relative path to constraints file for positive examples, including filename testing, testing_whole_genome Required A single ungapped string is required. The feature names in the feature vector associated with the model must match the feature names in the binary data matrix.
testing_size positive integer corresponding to the number of data points used by DIRECTION to perform prediction testing Required Maximizing testing_size to the dataset size converts testing mode to testing_whole_genome_mode.
batch_size positive integer corresponding to the number of data points per batch of testing by DIRECTION testing, testing_whole_genome Required (currently set up to 5000) To be deprecated in the future.
split_feature crossvalidation crossvalidation Required (currently always set to the string "crossvalidation") For future use.

Licensing

Code is distributed under CC-BY-SA 3.0 license ( human readable URL, legalese URL ) . For commercial uses, please email prad...@utdallas.edu .

Contact

For additional help / feedback, contact prad...@utdallas.edu .