Unarchive T3 into a directory and edit the Makefile. Set the
SRILM variable to point to your srilm install directory, and ensure
SRILM_LIB is set appropriately for your architecture. Also
set the include and library locations of boost, if you've installed it in
a non-standard location. You can also change the compilers, but I can't
guarantee that the code will compile except with gcc/g++. I've compiled
successfully with g++ 3.4.6, 4.0.1 and 4.1.2 on 32-bit and 64-bit platforms
(Intel, AMD and PPC). To build a 64-bit version, you'll have to add the
appropriate flag to CXXFLAGS and CFLAGS in the Makefile files in the source
directory, and set MACHINE_TYPE in the Makefile. Then run
make, which will build four binaries. These perform training,
classification and grammar extraction (in two different ways). Note that there
are a number of python scripts in the
scripts directory. To run
these you'll need a moderately modern version of Python (I use 2.5, but I
think 2.3 would be sufficient).
The demo data is in the sample directory. There's both a training set and a testing set of ten tree-pairs each. These are stored in Penn Treebank format with one tree per line and source and target trees on alternate lines. The trees are predicted parses from Bikel's implementation of the Collins parser. To illustrate, the final pair of lines in demo.training is:
(S (NP (DT this)) (VP (VBZ has) (VP (VBN involved) (NP (VBG thinning) (NNS trees)) (, ,) (S (VP (VP (VBG opening) (PRT (RP up)) (NP (NP (NNS glades)) (PP (IN around) (NP (JJ mature) (NNS scots) (NNS pines))))) (CC and) (VP (VBG planting) (NP (NN juniper) (CC and) (NNS blackberries))))))) (. .))
(S (S (NP (DT this)) (VP (VBD involved) (S (VP (VBG thinning) (NP (NNS trees)))))) (, ,) (NP (NN opening)) (VP (VBZ glades) (PP (IN around) (NP (NNPS scots) (NNS pines) (CC and) (NN planting) (NN juniper) (CC and) (NNS blackberries)))) (. .))
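For scripting purposes this format is trivial to read. Here's a minimal Python sketch (my own illustration, not part of the T3 codebase) that pairs up the alternating source and target trees:

import sys

def read_tree_pairs(path):
    """Return a list of (source_tree, target_tree) strings from a file
    with one PTB-style tree per line, source and target alternating."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    assert len(lines) % 2 == 0, "expected an even number of trees"
    # source trees sit on even lines, target trees on odd lines
    return list(zip(lines[0::2], lines[1::2]))

if __name__ == "__main__":
    # e.g. python read_pairs.py sample/demo.training
    for source, target in read_tree_pairs(sys.argv[1]):
        print(source)
        print(target)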
In addition the file
demo.align contains the word alignments between
the string yield of the source and target trees in the training set. These are
pairs of word indices (source-target) separated by spaces, counting from zero.
The alignments were produced from a larger training set using the Berkeley
aligner; GIZA++ or any other aligner could equally have been used.
For the last training pair, shown above, we have:
0-0 2-1 3-2 4-3 5-4 6-5 8-6 9-7 11-8 12-9 13-10 14-11 15-12 16-13 17-14 18-15
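These lines are straightforward to parse; a small sketch (again my own, not T3 code):

def parse_alignment(line):
    """Parse a Pharaoh-format alignment line, e.g. '0-0 2-1 3-2', into
    a list of (source_index, target_index) pairs, counting from zero."""
    return [tuple(int(i) for i in pair.split("-"))
            for pair in line.split()]

# parse_alignment("0-0 2-1")  ->  [(0, 0), (2, 1)]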
To launch the demo, run the script
runme.sh, which performs grammar
extraction, feature instantiation, model training and decoding. This produces
a number of output files. The
rules.* files are the output of the grammar
extraction process. The
grammar and features files are the result of
feature extraction on each of the rules. The
model* files encode
the model settings and parameters after training, and the final output file
contains the model's predictions on the testing set. Comparing the
predictions against the source trees in
demo.training.source, we can see that the model has learnt
to perform various modifications. Given the tiny training set, it's
unsurprising that the output is none too stellar.
Grammar extraction is performed by the harvest2 tool, which extracts the maximally general (smallest) pairs of tree fragments which obey the word alignment. There are various options controlling how the word alignment is interpreted and the types and sizes of rules to emit. The full set of options is as follows:
$ harvest2 --help
Allowed options:
  --help                                produce help message
  -a [ --align ] arg                    the alignments file in Pharaoh format
  -p [ --parse ] arg                    the parse tree file - each pair of
                                        lines contain source, target pairs
  -s [ --source ] arg                   source parse tree file name
  -t [ --target ] arg                   target parse tree file name
  --raw                                 emit raw rules before generalisation
  -i [ --identity ]                     specify only parts of the rule which
                                        differ in source-target yield
  -u [ --uniform ]                      constrain rules to matching consequent
                                        categories
  --tight arg (=0)                      the minimum number of non-null edges
                                        required [0,4]
  --null                                with tight=0, allow entirely null
                                        alignment blocks
  --delete                              allow for source subtrees to be deleted
  -r [ --specification-limit ] arg (=0) the maximum extent of specification of
                                        aligned phrases
  -v [ --max-variables ] arg (=99)      the maximum number of variable nodes
                                        in a rule
  -z [ --max-size ] arg (=99)           the maximum size of a harvested rule
                                        tree (lhs or rhs tree nodes)
  -e [ --epsilon ]                      extract epsilon rules; i.e. rules
                                        which modify only one of the source
                                        and target trees
  --spmt                                extract rules in a similar manner to
                                        Marcu et al.'s SPMT
  --max-phrase arg (=7)                 maximum phrase size in SPMT heuristic
                                        (just the lexical part)
Running this on the demo training data produces the following output:
$ ../harvest2 -a demo.align -p demo.training --delete --epsilon | head
(S (NP #0) (VP #1) (. #2))    (S (NP #0) (VP #1) (. #2))
(TOP (S #0))    (TOP (S #0))
(NP (NNS #0))    (NP (NNS #0))
(NNS conservationists)    (NNS conservationists)
(VP (VBP #0) (VP #1))    (VP (VBP #0) (VP #1))
(VBP are)    (VBP are)
(VP (VBN #0) (PP #1))    (VP (VBN #0) (PP #1))
(VBN concerned)    (VBN concerned)
(PP (IN #0) (NP #1))    (PP (IN #0) (NP #1))
(IN about)    (IN about)
Each line consists of a source and target tree fragment with a tab separator, and the #x items denote a variable and its alignment (a.k.a. frontier non-terminal). The options
--delete --epsilon mean that unaligned source fragments become deletion rules, e.g.,
(IN in)    NULL
(NP (DT #0) (JJ #1) (NN #2))    NULL
(DT the)    NULL
(JJ scottish)    NULL
(NN population)    NULL
instead of the following:
(PP (IN in) (NP #0))    (PP #0)
(NP (NP (DT the) (JJ scottish) (NN population)) (PP #0))    (PP #0)
When there are variables in the source that do not appear in the target, this licenses the deletion of a subtree. The training algorithm has a flag which controls whether subtrees can simply be deleted by such a rule, or whether they also require epsilon rules (those with NULL target fragments) to cover the deleted subtree.
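If you need to post-process these rule files, the format is easy to handle. A sketch (my own, with the NULL convention as described above):

import re

VARIABLE = re.compile(r"#(\d+)")

def parse_rule(line):
    """Split a harvest2 output line into its tab-separated source and
    target fragments plus the variable indices in each; a fragment may
    be the literal string 'NULL' for epsilon rules."""
    source, target = line.rstrip("\n").split("\t")
    return (source, VARIABLE.findall(source),
            target, VARIABLE.findall(target))

# parse_rule("(PP (IN in) (NP #0))\t(PP #0)")
# -> ('(PP (IN in) (NP #0))', ['0'], '(PP #0)', ['0'])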
The option -r N allows up to N adjacent rules to be combined into
a single rule. The number of rules is exponential in N, so small values are
advised (≤ 3).
Another useful option is
--spmt which uses a
heuristic for rule extraction modelled after the technique in
Marcu et al., EMNLP 2006. This heuristic uses a sliding window over the source
string. For each window it finds the translated string in the target using
the alignment, and then extracts the smallest pair of tree fragments which
cover the two strings. This results in extremely lexicalised transduction
rules. This can be seen in the output:
$ ../harvest2 -a demo.align -p demo.training --delete --epsilon --spmt | head -3
(NP (NNS conservationists))    (NP (NNS conservationists))
(S (NP (NNS conservationists)) (VP (VBP are) (VP #0)) (. #1))    (S (NP (NNS conservationists)) (VP (VBP are) (VP #0)) (. #1))
(S (NP (NNS conservationists)) (VP (VBP are) (VP (VBN concerned) (PP #0))) (. #1))    (S (NP (NNS conservationists)) (VP (VBP are) (VP (VBN concerned) (PP #0))) (. #1))
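To make the sliding-window idea concrete, here's a rough sketch of the span computation (my own illustration of the heuristic, not T3's implementation; a full version would also check that the target span aligns back inside the source window):

def target_span(alignment, i, j):
    """Minimal target span [lo, hi) covering all words aligned to the
    source window [i, j), or None if the window is entirely unaligned."""
    targets = [t for (s, t) in alignment if i <= s < j]
    if not targets:
        return None
    return min(targets), max(targets) + 1

def spmt_windows(source_length, max_phrase=7):
    """All source windows of up to max_phrase words, mirroring the
    --max-phrase option (the lexical part only)."""
    for i in range(source_length):
        for j in range(i + 1, min(i + max_phrase, source_length) + 1):
            yield i, j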
Alternatively, for tree-to-string data the binary
harvest_ghkm implements Galley et al.'s 2004 grammar extraction heuristic. It supports
the following options, which are self-explanatory:
$ ../harvest_ghkm
Allowed options:
  --help               produce help message
  -a [ --align ] arg   the alignments file in Pharaoh format
  -s [ --source ] arg  the source tree file, one per line in PTB format
  -t [ --target ] arg  the target string file, one per line, tokenised
  -e [ --epsilon ]     allow epsilon (deletion) rules
  -i [ --interior ]    output interior alignment between terminals in each rule
In addition to the grammar extracted as above, in our experiments we've added
rules to clone or delete parts of the source tree. These rules are extracted
from the training and testing source trees using the python scripts in the
scripts directory. See
sample/runme.sh for an example of their usage.
Feature extraction over the extracted rules is performed by scripts/rule_features.py, which is used as follows:
$ python scripts/rule_features.py --help
Options:
  -h, --help       show this help message and exit
  --input=INPUT    input file of grammar rules; can specify multiple inputs
  --pcfg=PCFG      expansion frequencies for CFG rewrites; used to implement
                   PCFG feature
  --output=OUTPUT  output file for the resultant grammar
  --fout=FOUT      output file for the feature set
  --verbose        display to stdout each rule and its features
  --type --root --identity --unlex --rcount --wcount --dvars --yields
  --length --epsmin
As input files, we use the output from grammar extraction along with the copy and delete rules. The output is the grammar, consisting of the rules each with a sparse vector of features, and a feature file, which provides a human-readable definition of each feature.
Here's a snippet from the grammar for the demo data:
(S (NP #0) (VP #1) (. .))    (S (NP #0) (VP #1) (. .))    5155 1 4450 1 5162 1 5484 1 5485 1 5486 1 7 1 5487 1 5488 1 5489 1 11 1 12 1 58 1 59 1 5475 1 127 1 4131 1 61 1
Here the feature vector is represented as a series of index, value pairs. The features are defined as:
5155 ('root: lhs', 'S')
4450 ('root: rhs', 'S')
5162 ('root: lhs,rhs', ('S', 'S'))
5484 ('identity: lhs', '(S (NP) (VP) (. .))')
5485 ('identity: rhs', '(S (NP) (VP) (. .))')
5486 ('identity: lhs,rhs', '(S (NP #0) (VP #1) (. .))', '(S (NP #0) (VP #1) (. .))')
7 identity: lhs and rhs equal
5487 ('unlex: lhs', '(S (NP) (VP) (. .))')
5488 ('unlex: rhs', '(S (NP) (VP) (. .))')
5489 ('unlex: lhs,rhs', '(S (NP #0) (VP #1) (. .))', '(S (NP #0) (VP #1) (. .))')
11 wcount: target
12 wcount: source
58 ('yield: lhs,rhs', ('.',), ('.',))
59 ('yield: tok in lhs and rhs', '.')
5475 ('yield: preterms lhs,rhs', ('NP', 'VP', '.'), ('NP', 'VP', '.'))
127 ('yield: preterm in lhs and rhs', 'NP')
4131 ('yield: preterm in lhs and rhs', 'VP')
61 ('yield: preterm in lhs and rhs', '.')
These definitions can be found by running with the
--verbose flag, or else by inspecting the output features file (the
--fout option). The script supports a number of feature types, enabled using the flags
--type --root --identity --unlex --rcount --wcount --dvars --yields --length --epsmin. Please see the Python code and the journal article for further details on how these work. The model also supports a PCFG probability feature (actually the log-probability of the target). This requires PCFG expansion counts to be provided as input using the
--pcfg flag. The file should contain lines of the form:
I ADJP JJ JJR 13 17787
P VBZ yells 1 26436
where
I denotes an internal node and
P a pre-terminal, and the following strings are the parent and child nodes. The two numbers are the frequency of the rewrite and of the parent category, respectively.
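Based on that description, a short sketch of how such a file could be loaded and turned into log-probabilities (the relative-frequency estimate and the function below are my own illustration, not part of rule_features.py):

import math

def load_pcfg(path):
    """Map each rewrite to log(count(rewrite) / count(parent))."""
    logprob = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            kind, parent = fields[0], fields[1]   # 'I' or 'P'
            children = tuple(fields[2:-2])        # child categories or word
            rewrite_count, parent_count = int(fields[-2]), int(fields[-1])
            logprob[(kind, parent, children)] = (
                math.log(rewrite_count) - math.log(parent_count))
    return logprob

# "I ADJP JJ JJR 13 17787" yields
# logprob[('I', 'ADJP', ('JJ', 'JJR'))] == log(13 / 17787)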
Training is performed with transducer_learn; running it with no arguments gives the command-line help, pruned here to just the core options:
usage: transducer_learn [options] example_file model_file
Arguments:
  example_file -> file with training data
  model_file   -> file to store learned decision rule in
Learning options:
  -c float     -> C: trade-off between training error and margin (default 0.01)
  -o [1,2]     -> Slack rescaling method to use for loss.
                  1: slack rescaling
                  2: margin rescaling (default 1)
  -l [0..]     -> Loss function to use. 0: zero/one loss (default 0)
  --g string      grammar file name
  --m string      SRILM file name
  --o int         LM order
  --v string      source-conditioned features file name
The grammar must be supplied, while the LM file (and order) is optional. You can also supply a file specifying source-conditioned features. This is a fairly complicated format and can really slow down inference, so I wouldn't suggest using it unless absolutely necessary. The
-l option controls the choice of loss function, which must be one of:
 1: unordered token errors
 2: unordered token errors with length penalty
 3: unordered token precision
 4: (3) x brevity penalty
 5: (3) x two-sided brevity penalty
 6: unordered ngram errors (uniformly weighted)
 7: unordered ngram precision (smoothed geometric mean)
 8: (7) x brevity penalty (i.e., BLEU)
 9: unordered CFG production errors
10: (9) with length penalty
11: unordered CFG production precision
12: clipped unordered token errors
13: clipped unordered token precision
14: clipped unordered token recall
15: clipped unordered token F1
16: unordered edit distance
17: asymmetric hamming distance, sim. to (2)
The last (
-l 17) is a pretty good default for most tasks.
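To give a flavour of these losses, here's a sketch of an unordered token error count with a length penalty, in the spirit of loss 2; the exact definitions live in the C++ source, so treat this only as an illustration:

from collections import Counter

def unordered_token_errors(predicted, reference, penalty=1.0):
    """Count tokens missing from or spurious in the prediction,
    ignoring word order, plus a penalty on the length difference.
    An illustrative reading of loss 2, not the exact definition."""
    pred, ref = Counter(predicted), Counter(reference)
    missing = sum((ref - pred).values())
    spurious = sum((pred - ref).values())
    return missing + spurious + penalty * abs(len(predicted) - len(reference))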
Further options control the style of tree transduction. For instance, the model can perform tree-to-tree or tree-to-string transduction. Additionally, it can perform exact or loose rule matching; in loose matching the rule internals are ignored and instead the frontier nodes are matched to any complete sequence of descendant nodes in the target tree. The relevant options are:
--j [0,1]   index rule sources with a 0) forest [def] 1) trie
--k [0,1]   index rule targets with a 0) forest [def] 1) trie
--q [0..2]  select gold derivation based on mapping source to 0) target tree,
            1) target string, 2) a surrogate tree; default = 0
The model requires a feature representation for the gold-standard of each
training instance. Because we model derivations, not the resultant tree or
string, we need a gold-standard derivation. But this is a latent variable,
so we adopt a few different heuristics to find a derivation.
These are controlled using
--q, above, and the following:
--s [1..5]  heuristic for selecting gold derivation:
            1: maximum number of rules (default)
            2: minimum number of rules
            3: at random
            4: maximum score (may want to use --w)
            5: minimum score (ditto)
--f [0,1]   recompute gold (1) every iteration; only useful with
            --s [3,4,5]; default no (0)
The surrogate option
--q 2 is experimental; it allows derivations which do not exactly yield the reference tree/string, instead choosing a surrogate with low loss. This search is approximate, and is prone to search errors for some loss functions.
Decoding is performed with transducer_classify. This takes the testing instances (tree pairs, or just input trees) and the model, and writes the predictions to an output file:
usage: svm_struct_classify [options] example_file model_file output_file
Most of the same options as for training can be supplied; these override the values used in training. (Many of them will have no effect on decoding.)