Pradipta Ray

DIRECTION

Discriminative IntegRative whole Epigenome Classification Toolkit at single nucleotide resolutION

Manual

Modes

Mode	Use
training	This mode outputs a trained model (SVM or RF) for a given feature set, by sampling balanced sets from given input data.
testing	This mode performs testing to provide prediction metrics. It outputs the precision/recall metric on a testing data using user-specified previously trained model with feature set.
testing_whole_genome	This mode performs testing to provide predictions. It outputs the actual predictions of the model for a given input dataset, using user-specified previously trained model with feature set. It can be used to perform imputation, prediction limited to a certain set of genomic regions, or whole genome prediction.
crossvalidation	This mode samples balanced sets from the training data, partitions them into k sets for cross validation purposes, and performs testing for each partition, by training on the remaining partition, for a given feature set. It outputs precision/recall metrics across different test partitions, along with the associated model and feature set.
beamsearch	This mode takes as input an initial feature set, and training data. It performs beam search to identify the best feature set (a subset of the initial feature set) according to user specified prediction metric, and outputs the best feature set, trained model along with the metric. It also outputs all evaluated feature sets with respective metrics. It has the cross-validation mode embedded inside it.
create_bed_to_binary	This mode is for database management. It takes as input multiple feature sets in bed format, and merges them together into 1 matrix and saves it in binary file (.mat binary MATLAB format) format. This ensures a consistent, compact input matrix which enhances overall computational performance.
append_bed_to_binary	This mode is also for database management. It takes a single feature in .bed format as an input, appends it to a larger matrix in a user-specified .mat file and eventually updates the .mat binary file.

Usage

From the Matlab terminal:

>> toplevel_script <config_file_path>

For each mode, we provide a sample configuration file, to help the user understand how to run DIRECTION in this mode.

>> toplevel_script training_mode.cfg
>> toplevel_script testing_mode.cfg
>> toplevel_script testing_whole_genome_mode.cfg
>> toplevel_script crossvalidation_mode.cfg
>> toplevel_script beamsearch_mode.cfg
>> toplevel_script create_bed_to_binary_mode.cfg
>> toplevel_script append_bed_to_binary_mode.cfg

Additional files required to run the tool in these modes are also provided, with the exception of bedgraph files required to generate the binary data matrix. An example binary data matrix is instead provided directly for convenience.

Input and output files for each mode

	Configuration file	Featureset file	Constraint files	Model file	Feature value files	Data matrix file	Prediction file(s)	Metrics file	Candidate featureset list file	Log file
Format
Encoding	ASCII	ASCII	ASCII	Binary	ASCII	Binary	ASCII	ASCII	ASCII	ASCII
Format	TSV (space / tab separated)	TSV (space / tab separated)	TSV (space / tab separated)	Matlab MAT file (binary data)	bedgraph (4 column, tab separated)	Matlab MAT file (binary data)	bedgraph (4/5 column, tab separated)	text file	text file	text file
Input path	`<config_file_path>`	`<path_to_feature_list>`	`<path_to_pos_constraints>` , `<path_to_neg_constraints>` to select eg.s for 2 classes	`<path_to_model>`	`DATAPATH/chr*.bed`	`./internal/binarydata/<dataset_name>.mat`
Output path				`OUTPATH/<model_type>/*.mat`		`./internal/binarydata/<dataset_name>.mat`	`OUTPATH/chr*.bed` or `OUTPATH/predictions.txt` (crossvalidation mode)	`OUTPATH/*_metrics.txt`	`OUTPATH/out.txt`	`./internal/logfile.txt`
File extension	.cfg (recommended)	.ftr (recommended)	.cst (recommended)	.mat (required)	.bed (recommended)	.mat (required)	.bed	.txt	.txt	.txt
Mode requirements
`crossvalidation`	input	input	input	output		input	output	output		output
`training`	input	input	input	output		input				output
`testing`	input			input		input	output			output
`testing_whole_genome`	input			input		input	output			output
`beamsearch`	input		input			input			output	output
`create_bed_to_binary`	input				input	output				output
`append_bed_to_binary`	input				input	input & output (updated)				output

* represents a 'NIX environment wildcard match
<config_file_path> is specified as a command line parameter
<model_type> is "svm" or "rf", depending on which model is used

<mode> is specified in mode field of configuration file
<dataset_name> is specified in dataset_name field of configuration file
<experiment_name> is specified in experiment_name field of configuration file
<path_to_feature_list> is specified in path_to_feature_list field of configuration file
<path_to_pos_constraints> is specified in path_to_pos_constraints field of configuration file
<path_to_neg_constraints_path> is specified in path_to_neg_constraints field of configuration file
<path_to_model> is specified in path_to_model field of configuration file
<path_to_data> is specified in path_to_data field of configuration file
<path_to_additional_feature> is specified in path_to_additional_feature field of configuration file
<feature_name> is specified in <feature_name> field of configuration file

OUTPATH = ./output/<mode>/<experiment_name>

DATAPATH = <path_to_data>/* for add_bed_to_binary mode, i.e. all subfolders under <path_to_data>/*
DATAPATH = <path_to_additional_feature> for append_bed_to_binary mode

File formats

bedgraph files (feature value files and prediction files)

Notes: One file is created per chromosome, and named using the chromosome id (eg. chr12.bed) by convention.

Structure : Each line is tab separated, with columns as follows :
Column 1: chromosome name
Column 2: starting position of genomic region [ cytosine location ] (1-based coordinate system)
Column 3: ending position of genomic region - 1 [cytosine location + 1 ] (1-based coordinate system)
Column 4: value associated with genomic region [ feature value for feature value bedgraph files, predicted state for prediction bedgraph files ]

For predictions.txt file, an additional fifth column is present. Headers are also present in this file (unlike normal bedgraph files)

Example:
Excerpt from chr1.bed

chr1	10697	10698	high
chr1	10760	10761	high
chr1	15526	15527	high
chr1	16141	16142	high
chr1	16244	16245	high
chr1	23974	23975	high
chr1	28908	28909	high
chr1	28908	28909	high

Featureset files

Notes: Tab or space separated file containing one line per feature contained in ./internal/<dataset_name>.mat

Structure : Each line is tab separated, with columns as follows :
Column 1: 0 or 1 [depending on whether (1) or not (0) the feature will be used for training the model]
Column 2: name of the feature, as present in ./internal/<dataset_name>.mat

Example:
Excerpt from small_list.ftr

0	bisulfite_coverage
0	bisulfite_level
1	bp_to_cgi
0	cg_sat_100
0	cg_sat_200
0	cg_sat_400
0	cg_sat_50
0	cpg_to_cgi
1	DNase
0	g_sat_100
0	H3K4me2
1	H3K4me3
0	H3K79me1
0	H3K9me3
0	H4K8ac
0	H4K91ac
1	histone_states
1	repeats

Constraint files

Notes: Tab or space separated file containing one line per constraint for selecting each class, hence two files are required for binary classification.

Structure : Each line is tab separated, with columns as follows :
Column 1: name of the feature, as present in ./internal/<dataset_name>.mat
Column 2: binary operator > >= == ~= < <= supported
Column 3: corresponding value of feature

Example:
low_methylation.cfg

bisulfite_level	<	0.49999
bisulfite_coverage	>=	20

high_methylation.cfg

bisulfite_level	>=	0.5
bisulfite_coverage	>=	20

Configuration files

Notes: Tab or space separated file containing one line per parameter for running the tool. Any particular order of parameters is not required.

Structure : Each line is tab separated, with columns as follows :
Column 1: Parameter name
Column 2: Parameter value

Example:
crossvalidation_mode.cfg

mode	crossvalidation
experiment_name	toy_methylation
dataset_name	toy
pos_sample_set_size	5000
neg_sample_set_size	5000
crossvalidation_fold	5
iterations_for_svm_convergence	10e7
kkt_violation_fraction	0.05
tries_for_svm_convergence	3
path_to_feature_list	./feature_list/feature_list.txt
path_to_pos_constraints	./constraints/high_methylation.txt
path_to_neg_constraints	./constraints/low_methylation.txt
split_feature	crossvalidation

MAT files

Notes: Matlab's internal binary format for storing variables in binary files. Forward and reverse compatible among versions of Matlab. These files are used to store the data matrix, and trained SVM/RF models. It is important to note that while creating the data matrix, 4 feature names are reserved. bisulfite_coverage and bisulfite_level are required feature names (for the feature vectors with BS-seq coverage and CCRs respectively) that is required by the tool for methylation status prediction, and tab_seq_coverage and tab_seq_level are required feature names (for the feature vectors with TAB-seq coverage and CCRs respectively) that is required by the tool for 5hmC status prediction.

Structure : Not human readable.

Metric files

Notes: For cross-validation, DIRECTION outputs two files : crossvalidation_batch_metrics.txt and crossvalidation_overall_metrics.txt .

Structure : Tab separated, with one line per metric. Batch metrics file reports metrics per crossvalidation fold. Overall metrics file reports metrics by pooling all crossvalidation folds.

Example:
crossvalidation_batch_metrics.txt

TP	974	972	948	969	974
TN	786	795	812	804	788
FP	26	28	52	31	26
FN	214	205	188	196	212
Recall	0.81	0.82	0.83	0.83	0.82
FPR	0.03	0.03	0.06	0.03	0.03
Accuracy	0.88	0.88	0.88	0.88	0.88
Specificity	0.96	0.96	0.93	0.96	0.96
Precision	0.97	0.97	0.94	0.96	0.97
NPV	0.78	0.79	0.81	0.80	0.78
FDR	0.02	0.02	0.05	0.03	0.02
Fscore	0.89	0.89	0.88	0.89	0.89

crossvalidation_overall_metrics.txt

TP	4837
TN	3985
FP	163
FN	1015
Recall	0.82
FPR	0.03
Accuracy	0.88
Specificity	0.96
Precision	0.96
NPV	0.79
FDR	0.03
Fscore	0.89

Candidate featureset list files

Notes: For beam search, DIRECTION outputs a ranked list of candidate featuresets, sorted according to the classification metric of our choice .

Structure : Tab separated, with two lines per featureset. The first line lists feature names for candidate featuresets as present in ./internal/<dataset_name>.mat.
(feature names in the data matrix files are without quotes). The second line lists three metrics: Fscore, precision, recall, and order of evaluation of featureset.
Example:
out.txt

"DNase"	"H3K4me3"	"histone_states"	"repeats"
0.8808	0.9460	0.8240	4
"DNase"	"H3K4me3"	"bp_to_cgi"	"histone_states"	"repeats"
0.8803	0.9670	0.8079	1
"DNase"	"bp_to_cgi"	"histone_states"	"repeats"
0.8801	0.9580	0.8139	3
"DNase"	"H3K4me3"	"bp_to_cgi"	"histone_states"
0.8730	0.9490	0.8083	6
"DNase"	"H3K4me3"	"bp_to_cgi"	"repeats"
0.8660	0.9660	0.7847	5

Log files

Notes: For all files a log file is written in ./internal/logfile.txt, for diagnostic purposes.

Structure : One line per timestamp.

Example:
logfile.txt

2017-4-14 11:53:15 Run started
2017-4-14 11:53:15 beamsearch mode started
2017-4-14 11:53:15 Feature data loading started
2017-4-14 11:53:15 Filtering started
2017-4-14 11:53:15 Evaluation 1 started
2017-4-14 11:53:15 Cross validation fold 1 started
2017-4-14 11:53:18 Cross validation fold 2 started
2017-4-14 11:53:21 Cross validation fold 3 started
2017-4-14 11:53:23 Cross validation fold 4 started
2017-4-14 11:53:26 Cross validation fold 5 started
2017-4-14 11:53:29 Evaluation 2 started
2017-4-14 11:53:29 Cross validation fold 1 started
2017-4-14 11:53:32 Cross validation fold 2 started
2017-4-14 11:53:34 Cross validation fold 3 started
2017-4-14 11:53:37 Cross validation fold 4 started
2017-4-14 11:53:39 Cross validation fold 5 started
2017-4-14 11:53:41 Evaluation 3 started
2017-4-14 11:53:41 Cross validation fold 1 started
2017-4-14 11:53:44 Cross validation fold 2 started
2017-4-14 11:53:44 Cross validation fold 3 started
2017-4-14 11:53:46 Cross validation fold 4 started
2017-4-14 11:53:47 Cross validation fold 5 started
2017-4-14 11:53:52 Sorting f-scores and writing best feature sets to file
2017-4-14 11:53:52 beamsearch mode ended

Configuration file parameters

Parameter name	Parameter value	Relevant modes	Required/optional (Default values)	Notes
`mode`	`crossvalidation \| beamsearch \| training \| testing \| testing_whole_genome \| create_bed_to_binary \| append_bed_to_binary`	All modes	Required	For configuring DIRECTION to run in particular mode, one of these modes have to be specified
`experiment_name`	one word unique string id for the experiment being run	All modes except create_bed_to_binary and append_bed_to_binary	Required	All output files will be stored under `./output/<mode>/<experiment_name>/`
`dataset_name`	one word string `dataset_name`	All modes	Required	Needs to correspond to the binary data matrix file being used in the experiment at `./internal/binarydata/<dataset_name>.mat`
`path_to_feature_list`	absolute or relative path to featureset file, including filename	training, crossvalidation, beamsearch	Required	Modes where a trained model is provided as input do not require this input. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred.
`path_to_pos_constraints`	absolute or relative path to constraints file for positive examples, including filename	training, crossvalidation, beamsearch	Required	Feature names in file need to be commensurate with feature names in the binary data matrix file. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred.
`path_to_neg_constraints`	absolute or relative path to constraints file for positive examples, including filename	training, crossvalidation, beamsearch	Required	Feature names in file need to be commensurate with feature names in the binary data matrix file. A single ungapped string is required. Constructs like "./", or "../" are OS dependent, and may not be portable. Absolute paths are preferred.
`path_to_data`	absolute or relative path to data folder, and without a trailing "/"	create_bed_to_binary	Required	A single ungapped string is required. Subfolders under the path each correspond to one feature, and subfolder names will correspond to created feature names. Inside each subfolder, one bedgraph file per chromosome, with one line per cytosine, is expected.
`path_to_additional_feature`	absolute or relative path to data folder, and without a trailing "/"	append_bed_to_binary	Required	A single ungapped string is required. Inside the folder, one bedgraph file per chromosome, with one line per cytosine, is expected. The feature name is not generated from the folder name, but from the `feature_name>/code> parameter in the configuration file.`
`feature_name`	string corresponding to the name of the appended feature in the data matrix	append_bed_to_binary	Required	A single ungapped string is required. The feature name will correspond to the feature data inside `path_to_additional_feature`. The feature name cannot match existing feature names in the binary data matrix.
`pos_sample_set_size`	positive integer representing no. of positive examples to be used for training	training, crossvalidation, beamsearch	Optional (default: 5000)	Takes significantly longer to converge above 10000
`neg_sample_set_size`	positive integer representing no. of negative examples to be used for training	training, crossvalidation, beamsearch	Optional (default: 5000)	Takes significantly longer to converge above 10000
`iterations_for_svm_convergence`	positive integer representing maximum no. of iterations for SVM training convergence	training, crossvalidation, beamsearch (SVM only)	Optional (default: 10^7)	Typically set within 5 orders of magnitude between 10^5 and 10^9
`kkt_violation_fraction`	positive fraction between 0 and 1 specifying the number of variables	training, crossvalidation, beamsearch (SVM only)	Optional (default: 0.05)	Refer to (Karush, 1939) or (Kuhn and Tucker,1951) for details on formulating nonlinear programming. Presently Sequential Minimal Optimization is used in DIRECTION.
`tries_for_model_convergence`	Non-negative integer	training, crossvalidation, beamsearch	Optional (default: 3)	From trial and error, a value > 3 for SVM training does not yield improved results. For RF training, since the TreeBagger native function terminates deterministically, this value is ignored currently.
`crossvalidationfold`	positive integer specifying k in k-fold cross validation	crossvalidation, beamsearch	Optional (default: 5)	Each fold should have at least 2000 data points for a robust estimation of test set error rate per fold.
`beam_width`	positive integer corresponding to the beam width of the search	beamsearch	Optional (default: 3)	For beam width = n, where n is the number of features in the initial feature set, search becomes A* search. For beam width = 1, search becomes greedy hill climbing.
`priority_metric`	`fscore \| precision \| recall`	beamsearch	Required	Additional metrics to be added in the future
`max_evaluations`	positive integer capping the number of feature sets evaluated in the search	beamsearch	Required	1500 allows for a practical, prioritized search of feature subset space landscape generated from ~ 15 features
`num_of_top_evaluations`	positive integer corresponding to the number of feature sets to report in out.txt based on featuresets with the highest classification metrics	beamsearch	Required	Should be <= `max_evaluations`.
`path_to_model`	absolute or relative path to constraints file for positive examples, including filename	testing, testing_whole_genome	Required	A single ungapped string is required. The feature names in the feature vector associated with the model must match the feature names in the binary data matrix.
`testing_size`	positive integer corresponding to the number of data points used by DIRECTION to perform prediction	testing	Required	Maximizing `testing_size` to the dataset size converts testing mode to testing_whole_genome_mode.
`batch_size`	positive integer corresponding to the number of data points per batch of testing by DIRECTION	testing, testing_whole_genome	Required (currently set up to 5000)	To be deprecated in the future.
`split_feature`	`crossvalidation`	crossvalidation	Required (currently always set to the string "crossvalidation")	For future use.

Licensing

Code is distributed under CC-BY-SA 3.0 license ( human readable URL, legalese URL ) . For commercial uses, please email prad...@utdallas.edu .

Contact

For additional help / feedback, contact prad...@utdallas.edu .