Start ./gedevo with no parameters for a short description of the command line
parameters.

GEDEVO v1.0 supplementary README [May 2015]

Newer versions of this text may be found at:
http://cytogedevo.compbio.sdu.dk/

For questions, contact malek ät tugraz döt at.


============
FILE FORMATS
============

*** Edgelist file ***

Text file with protein interactions. Each line represents a directed
interaction. Two entries in opposite directions (A B and B A) are together
considered one undirected edge. It looks like this:

    protein1 protein2
    protein2 protein3
    protein2 protein1
    [...]

Anything after the first 2 words on a line is ignored.
There may be an optional column header (use either --edgelist to load a file
without a header, or --edgelisth to load a file with a header by skipping the
first line).

*** SIF file ***

Text file with protein interactions. Common format for import/export with
Cytoscape. Only one network can be stored in a .sif file. It looks like this:

    protein1 interaction protein2
    protein2 interaction protein3
    protein42 interaction protein55
    [...]

An interaction can be one of the following:

    DirectedEdge    directed    d
    UndirectedEdge  undirected  u

The interaction type is not case sensitive.
If the interaction type is none of the above, undirected is assumed.

*** Pairlist file ***

Like the edgelist file, but with one additional column to hold an edge score.
Used as data matrix files to import additional scores.

    protein1 protein2 score
    [...]


=======================
INTERNAL CONFIG OPTIONS
=======================

These are the available parameters for --config opt=value (or -c opt=value).
If in doubt, leave these at the defaults, or grep the source code for the
variable of interest.
The recognized options are listed with their default values.
Some of these have a direct equivalent command line parameter, for example:
--pop N = --config maxAgents=N

*** Final weights ***

* weightGED = 0
* weightGraphlets = 0
* weightNodeDist = 0
* weightPairsum = 1

These weights scale the relative influence of each scoring function when
evaluating the final score of an individual; the resulting value determines
its survival chance. If a weight is zero, the corresponding function is not
used for the final score.
weightNodeDist is the simple node degree distance, which is equal to the
2-graphlet signature distance. It should not be used.
weightPairsum is the sum of all pair scores (see below).

*** Pairwise weights ***

* pairWeightGED = 1
* pairWeightGraphlets = 0.5
* pairWeightNodeDist = 0

These are similar to the final weights, but they are applied to every pair;
the resulting value measures the difference of one specific node pair and
determines the probability of that pair being picked up and re-assigned by the
EA during offspring generation. For example, pairs with a low score are likely
to be kept intact, whereas pairs with a high score are likely to be broken and
re-assigned.
It is important that the pair score and the final score use sane multipliers,
that is, offspring generation should go in a direction that will be considered
good in the evaluation phase. In other words, both scores must correlate;
otherwise the EA will work against itself, give no useful results, and take
very long to converge.
pairWeightNodeDist is the same as weightNodeDist above, but per pair. Do not
use it.

** Default pair score **

* pairNullValue = 1   (parameter: --density)

This is used for scores where no better value is available, i.e. when mapping
a valid node against nil.
This value controls how "forcefully" the two graphs are aligned, in other
words, how high the penalty for mapping a node to nil is. A value of 1 is the
worst score possible, so the algorithm will try to prevent node <-> nil pairs
at all costs.
Lower values will cause exceptionally bad pairs to be considered worse than
mapping against nil, so hard-to-align nodes will preferentially be paired with
nil and thus counted as inserted or deleted.
In short: a value of 1 produces an alignment that is as compact as possible,
thereby minimizing the GED; lower values may produce more biologically
meaningful alignments; and very low values align only nodes that fit really
well, causing the alignment to fall apart into unconnected subgraphs.

*** Data preprocessing ***

* forceUndirectedEdges = 0   (parameter: --undirected / -u)

If enabled, any directed edges will be transformed into undirected edges.
This is useful when the input network data are known to be undirected, but
only edges in one direction are present.
Additionally, if only undirected edges exist, a slightly faster implementation
of the GED score calculation is chosen automatically.

* matchSameNames = 1   (parameter: --prematch)

This allows pre-matching proteins that, based on their name, are known to be
equal. Pre-matched pairs will never be separated, even if the other data are
incomplete and the resulting score would otherwise cause these pairs to be
broken quickly. (Names are case sensitive!)

*** GED penalties ***

* ged_eAdd = 1     [edge added]
* ged_eRm = 1      [edge removed]
* ged_eSub = 0     [edge substituted / stays as-is]
* ged_eFlip = 0.8  [directed edge flipped]
* ged_eD2U = 0.2   [directed edge changed to undirected]
* ged_eU2D = 0.2   [undirected edge changed to directed]
* ged_nAdd = 0     [node added]
* ged_nRm = 0      [node removed]

The raw GED for a mapping is calculated from the number of
removed/added/substituted/flipped edges required to transform one graph into
the other graph, given this mapping. The raw counts are multiplied by these
penalties and summed up, producing the actual GED score.
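To make the weighted sum above concrete, here is a minimal Python sketch. It is
not taken from the GEDEVO source: the function name ged_score and the shape of
the counts dictionary are assumptions made for illustration, and only the
penalty names and default values come from the list above.

```python
# Default penalties as listed in this README (keys drop the "ged_" prefix).
GED_PENALTIES = {
    "eAdd": 1.0,   # edge added
    "eRm": 1.0,    # edge removed
    "eSub": 0.0,   # edge substituted / stays as-is
    "eFlip": 0.8,  # directed edge flipped
    "eD2U": 0.2,   # directed edge changed to undirected
    "eU2D": 0.2,   # undirected edge changed to directed
    "nAdd": 0.0,   # node added
    "nRm": 0.0,    # node removed
}


def ged_score(counts, penalties=GED_PENALTIES):
    """Hypothetical helper: multiply each raw count by its penalty and sum."""
    return sum(penalties[op] * n for op, n in counts.items())


# Made-up raw counts for one mapping, purely for illustration.
counts = {"eAdd": 3, "eRm": 2, "eSub": 10, "eFlip": 1,
          "eD2U": 0, "eU2D": 1, "nAdd": 1, "nRm": 0}
print(ged_score(counts))
```

With the default penalties, these made-up counts give 3*1 + 2*1 + 1*0.8 + 1*0.2
= 6; the ten substitutions and the added node contribute nothing.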
A substitution is desirable and should not contribute to the GED score; the
same is true for node insertions and deletions, because these already cause
insertion/deletion of all edges associated with that node.

*** Greedy settings ***

(Do not use this, it is really slow and not necessary!)

* evo_greedyInitOnInit = 0   (parameter: --greedy-init)

If enabled, the first iteration will perform a greedy initialization step.

* evo_greedyInitPerRound = 0   ### EXPERIMENTAL AND USELESS DO NOT USE ###

Because greedy initialization is a costly operation, this is a special setting
that defines the fraction of individuals to mutate using greedy. The value can
be in [0 .. 1], where 0 means greedy is not used and 1 means that all
individuals use greedy.
However, individuals initialized with this method are not mutated otherwise,
therefore a value of 1 is entirely useless. A value of 0.0075 has been observed
to improve convergence for a limited number of early iterations; values above
0.1 are too slow to have any benefit in practice.

* evo_greedyInitScoreLimit = 0.4

This setting defines the upper score limit that greedy considers worth keeping.
If a pair of nodes has a score worse than this value, it is ignored as a
potential match. If there are no usable matches satisfying the threshold, the
node in question is mapped to nil.

*** Population settings ***

* maxAgents = 400   (parameter: --pop)

This controls how many new individuals are created in each offspring generation
phase.
The number of new individuals in each iteration has the greatest impact on
convergence and execution speed, and some impact on memory utilization. Each
individual has to be created and later evaluated, and these two steps take up
most of the algorithm's runtime.
A value of 100 - 1000, depending on the network size, has been found to work
well (1000 for smaller networks up to 500 nodes, 200 or more for large graphs
up to 10000 nodes).

* basicHealth = 100
* maxHealthDrop = 100

The first value is the initial health of each individual, which is then
decreased by up to the second value. The worst individual will be reduced by
the full amount, better ones by linearly interpolated amounts.
This setting is only useful to prolong or shorten survival times, and has no
direct influence on convergence, only on the total number of individuals
existing at any given time.

*** Miscellaneous ***

* varGroup = 0

If this is 0, G1 is aligned to G2 (the nodes of G2 are the reference mapping);
if it is 1, G2 is aligned to G1 (the nodes of G1 are the reference mapping).
This setting has a small impact on execution speed if the number of nodes in
the two graphs differs greatly; some optimizations can be performed faster if
the reference mapping contains the nodes of the larger graph.
(It is never necessary to touch this setting, unless you have a DIST file that
requires a certain network order, but you want to explicitly specify which
network is permuted and which one stays the static reference. Implementation
details...)

* numThreads = 0   (parameter: --threads)

Number of worker threads to use. If 0, the program will detect the number of
available CPU cores automatically and create one thread per core. Set to 1 to
run in single-threaded mode.
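
*** Example: reading an edgelist file ***

As a closing illustration of the edgelist rules described under FILE FORMATS,
the following Python sketch shows one way to interpret such a file. It is not
GEDEVO's own loader: the function name read_edgelist and the has_header
argument are hypothetical, and the handling of short lines and self-loops is an
assumption, since this README does not specify them.

```python
def read_edgelist(lines, has_header=False):
    """Split edgelist lines into directed and undirected edge sets.

    Only the first two words of each line are used. A pair that appears
    in both directions is reported once as an undirected edge.
    """
    seen = set()
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # same idea as --edgelisth: skip the header line
        words = line.split()
        if len(words) < 2:
            continue  # assumption: blank or one-word lines are skipped
        seen.add((words[0], words[1]))

    # An edge stays directed if its reverse never appears.
    directed = {(a, b) for (a, b) in seen if (b, a) not in seen}
    # Otherwise both directions collapse into one sorted pair.
    # Assumption: a self-loop (a, a) ends up in the undirected set.
    undirected = {tuple(sorted((a, b))) for (a, b) in seen if (b, a) in seen}
    return directed, undirected


example = ["protein1 protein2", "protein2 protein3", "protein2 protein1"]
directed, undirected = read_edgelist(example)
print(directed, undirected)
```

For the three example lines from the edgelist section, protein1/protein2 becomes
one undirected edge and protein2 -> protein3 remains directed.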