The Tree Transducer Toolkit (T3) is a tree-transduction model using a synchronous tree-substitution grammar (STSG). This models tree transformations as a series of tree-rewrite steps. The model is trained using a discriminative large margin method, allowing a rich feature representation and configurable loss function. It's written in C++ and is known to run on Linux and Mac OS X, on both intel and PPC architectures. (Not updated since 2009, so may not play well with modern compilers and libraries.)


You can download the source code here. Note that this is provided free of charge for any academic or non-commerical purpose. The full licence terms are given in the LICENSE.txt. The software includes Thorsten Joachim's SVMStruct, which has its own licencing terms. The corpus used in the COLING 2008 paper is also available for download.

Building the source

Firstly, you'll need to make sure you have both SRILM (version ≥ 1.5.2) and boost (version ≥ 1.33.1) installed. Boost can be installed via a package managcodeent tool on most platforms, but SRILM will have to be built by hand. Note that SRILM can be built for 32 or 64 bit architectures (or others) by changing MACHINE_TYPE in the Makefile.

Unarchive T3 into a directory and edit the Makefile. Set the SRILM variable to point to your srilm install directory, and ensure that SRILM_LIB is set appropriately for your architecture. Also set the include and library location of boost, if you've installed it in a non-standard location. You can also change the compilers, but I can't guarantee that the code will compile except with gcc/g++. I've compiled successfully with g++ 3.4.6, 4.0.1 and 4.1.2 for 32-bit and 64-bit platforms (intel, amd and PPC). To build a 64-bit version, you'll have to add the flag -m64 to CXXFLAGS, CFLAGS and LDFLAGS in the Makefile files in the source directory and in svm_light and svm_struct.

Now run make, which will build four binaries. These perform training, classification and grammar extraction (in two different ways). Note that there are a number of python scripts in the scripts directory. To run these you'll need a moderately modern version of Python (I use 2.5, but I think 2.3 would be sufficient).

Running the demo

There are a number of demonstration files in the sample directory. There's both a training set and a testing set of ten tree-pairs each. These are stored in penn-treebank format with one tree per line and source and target trees on alternate lines. The trees are predicted parses from Bikel's implcodeentation of the Collins parser. To illustrate, the final pair of lines in are:
(S (NP (DT this)) (VP (VBZ has) (VP (VBN involved) (NP (VBG thinning) (NNS trees)) (, ,) (S (VP (VP (VBG opening) (PRT (RP up)) (NP (NP (NNS glades)) (PP (IN around) (NP (JJ mature) (NNS scots) (NNS pines))))) (CC and) (VP (VBG planting) (NP (NN juniper) (CC and) (NNS blackberries))))))) (. .))
(S (S (NP (DT this)) (VP (VBD involved) (S (VP (VBG thinning) (NP (NNS trees)))))) (, ,) (NP (NN opening)) (VP (VBZ glades) (PP (IN around) (NP (NNPS scots) (NNS pines) (CC and) (NN planting) (NN juniper) (CC and) (NNS blackberries)))) (. .))

In addition the file demo.align contains the word alignments between the string yield of the source and target trees in the training set. These are pairs of word indices (source-target) separated by spaces, counting from zero. The alignments were produced from a larger training set using the Berkeley aligner. Equivalently, GIZA++, or any other aligner, could be used. For the last training pair, shown above, we have:
0-0 2-1 3-2 4-3 5-4 6-5 8-6 9-7 11-8 12-9 13-10 14-11 15-12 16-13 17-14 18-15

To launch the demo run the script which peforms grammar extraction, feature instantiation, model training and decoding. This produces the output files rules.harvest, rules.copy, rules.delete, rules.epsilon, grammar, features, model, model.grammar, model.svm, predict. The rules.* files are the output of the grammar extraction process. The grammar, features files are the result of feature extraction on each of the rules. The model* files encode the model settings and parameters after training, and predict contains the model's predictions on the testing set.

Comparing predict to, we can see that the model has learnt to peform various modifications. Given the tiny training set, it's unsuprising that the output is none too stellar.

Usage of the various sub-components

  1. Grammar extraction: harvest2

    The first step is to take a parsed aligned corpus and extract a set of STSG rules which describe the tree-transformations. This is done using the harvest2 tool, which extracts the maximally general (smallest) pairs of tree fragments which obey the word-alignment. There are various options on how to interpret the word alignment and the types and sizes of rules to emit. The full set of options are as follows:
    $ harvest2 --help
    Allowed options:
      --help                                produce help message
      -a [ --align ] arg                    the alignments file in Pharaoh format
      -p [ --parse ] arg                    the parse tree file - each pair of 
                                            lines contain source, target pairs
      -s [ --source ] arg                   source parse tree file name
      -t [ --target ] arg                   target parse tree file name
      --raw                                 emit raw rules before generalisation
      -i [ --identity ]                     specify only parts of the rule which 
                                            differ in source-target yield
      -u [ --uniform ]                      constrain rules to matching consequent 
      --tight arg (=0)                      the minimum number of non-null edges 
                                            required [0,4]
      --null                                with tight=0, allow entirely null 
                                            alignment blocks
      --delete                              allow for source subtrees to be deleted
      -r [ --specification-limit ] arg (=0) the maximum extent of specification of 
                                            aligned phrases
      -v [ --max-variables ] arg (=99)      the maximum number of variable nodes in
                                            a rule
      -z [ --max-size ] arg (=99)           the maximum size of a harvested rule 
                                            tree (lhs or rhs tree nodes)
      -e [ --epsilon ]                      extract epsilon rules; i.e. rules which
                                            modify only one of the source and 
                                            target trees.
      --spmt                                extract rules in a similar manner to 
                                            Marcu et al.'s SPMT
      --max-phrase arg (=7)                 maximum phrase size in SPMT heuristic 
                                            (just the lexical part)

    Running this on the demo training data produces the following output:

    $ ../harvest2 -a demo.align -p --delete --epsilon | head
    (S (NP #0) (VP #1) (. #2))  (S (NP #0) (VP #1) (. #2))
    (TOP (S #0))    (TOP (S #0))
    (NP (NNS #0))   (NP (NNS #0))
    (NNS conservationists)  (NNS conservationists)
    (VP (VBP #0) (VP #1))   (VP (VBP #0) (VP #1))
    (VBP are)   (VBP are)
    (VP (VBN #0) (PP #1))   (VP (VBN #0) (PP #1))
    (VBN concerned) (VBN concerned)
    (PP (IN #0) (NP #1))    (PP (IN #0) (NP #1))
    (IN about)  (IN about)
    Each line consists of a source and target tree fragment with a tab separator, and the #x items denote a variable and its alignment (a.k.a. frontier non-terminal). The options --delete --epsilon means that source fragments that are unaligned become deletion rules, e.g.,
    (IN in) NULL
    (NP (DT #0) (JJ #1) (NN #2))    NULL
    (DT the)    NULL
    (JJ scottish)   NULL
    (NN population) NULL
    instead of the following
    (PP (IN in) (NP #0))    (PP #0)
    (NP (NP (DT the) (JJ scottish) (NN population)) (PP #0))        (PP #0)
    When there are variables in the source that do not appear in the target, this licences the deletion of a subtree. The training algorithm has a flag which controls whether subtrees can simply be deleted by such a rule, or if they also require epsilon rules (those with NULL target fragments) to cover the deleted subtree.

    The option -r N allows up to N adjacent rules to be combined into a single rule. The number of rules is exponential in N, so small values are advised (≤ 3).

    Another useful option is --spmt which uses a heuristic for rule extraction modelled after the technique in Marcu et al., EMNLP 2006. This heuristic uses a sliding window over the source string. For each string it finds the translation string in the target using the alignment, and then extracts the smallest pair of tree fragments which cover the two strings. This results in extremely lexicalised transuction rules. This can be seen in the output,

    $ ../harvest2 -a demo.align -p --delete --epsilon --spmt | head -3
    (NP (NNS conservationists)) (NP (NNS conservationists))
    (S (NP (NNS conservationists)) (VP (VBP are) (VP #0)) (. #1))   (S (NP (NNS conservationists)) (VP (VBP are) (VP #0)) (. #1))
    (S (NP (NNS conservationists)) (VP (VBP are) (VP (VBN concerned) (PP #0))) (.  #1))    (S (NP (NNS conservationists)) (VP (VBP are) (VP (VBN concerned) (PP #0))) (. #1))

    Alternatively, for tree-to-string data the binary harvest_ghkm implements the Galley et al.'s 2004 grammar extraction heuristic. This supports the following options, which are self-explanatory:

    Allowed options:
      --help                produce help message
      -a [ --align ] arg    the alignments file in Pharaoh format
      -s [ --source ] arg   the source tree file, one per line in PTB format
      -t [ --target ] arg   the target string file, one per line, tokenised
      -e [ --epsilon ]      allow epsilon (deletion) rules
      -i [ --interior ]     output interior alignment between terminals in each rule

    In addition to the grammar extracted as above, in our experiments we've added rules to clone or delete parts of the source tree. These rules are extracted from the training and testing source trees using the python scripts in the scripts directory. See sample/ for An example of their usage.

  2. Feature extraction:

    Now that we've obtained a set of grammar rules, we need to create a vector of features for each rule. This is done by scripts/, which is used as follows:
    $ python scripts/ --help
      -h, --help       show this help message and exit
      --input=INPUT    input file of grammar rules; can specify multiple inputs
      --pcfg=PCFG      expansion frequencies for CFG rewrites; used to implement
                       PCFG feature
      --output=OUTPUT  output file for the resultant grammar
      --fout=FOUT      output file for the feature set
      --verbose        display to stdout each rule and its features
      --type  --root  --identity  --unlex  --rcount  --wcount
      --dvars  --yields  --length  --epsmin         
    As input files, we use the output from grammar extraction and the copy and delete rules. The output is the grammar consisting of rules and a sparse vector of features for each rule and a feature file, which provides a human-readable defintion of each feature.

    Here's a snippet from the grammar for the demo data:

    (S (NP #0) (VP #1) (. .))       (S (NP #0) (VP #1) (. .))       5155 1 4450 1 5162 1 5484 1 5485 1 5486 1 7 1 5487 1 5488 1 5489 1 11 1 12 1 58 1 59 1 5475 1 127 1 4131 1 61 1
    Where the feature vector is represented as many index, value pairs. The features are defined as:
    5155	('root: lhs', 'S')      
    4450	('root: rhs', 'S')      
    5162	('root: lhs,rhs', ('S', 'S'))   
    5484	('identity: lhs', '(S (NP) (VP) (. .))')        
    5485	('identity: rhs', '(S (NP) (VP) (. .))')        
    5486	('identity: lhs,rhs', (S (NP #0) (VP #1) (. .)), (S (NP #0) (VP #1) (.  .))) 
    7	identity: lhs and rhs equal     
    5487	('unlex: lhs', '(S (NP) (VP) (. .))')   
    5488	('unlex: rhs', '(S (NP) (VP) (. .))')   
    5489	('unlex: lhs,rhs', '(S (NP #0) (VP #1) (. .))', '(S (NP #0) (VP #1) (. .))') 
    11	wcount: target  
    12	wcount: source  
    58	('yield: lhs,rhs', ('.',), ('.',))      
    59	('yield: tok in lhs and rhs', '.')      
    5475	('yield: preterms lhs,rhs', ('NP', 'VP', '.'), ('NP', 'VP', '.')) 
    127	('yield: preterm in lhs and rhs', 'NP')         
    4131	('yield: preterm in lhs and rhs', 'VP')         
    61	('yield: preterm in lhs and rhs', '.')  
    which can be found by running with the --verbose flag or else inspecting the output features file (--fout option). The script supports a number of types of features using the flags --type --root --identity --unlex --rcount --wcount --dvars --yields --length --epsmin. Please see the python code and journal article for further details on how these work. The model also supports a PCFG probability features (actually the log-probability of the target). This requires PCFG expansion codes to be provided as input using the --pcfg flag. The file should contain lines of the form:
    I ADJP JJ JJR 13 17787
    P VBZ yells 1 26436
    where I denotes an internal node and P a pre-terminal, and the following strings are the parent and child nodes. The two numbers are the counts/frequency of the rewrite and the parent category, respectively.
  3. Training: transducer_learn

    Training takes as input a grammar, a language model and a training set of tree pairs. Running transducer_learn with no arguments gives the command line help, pruned here to just the core options:
       usage: transducer_learn [options] example_file model_file
             example_file-> file with training data
             model_file  -> file to store learned decision rule in
    Learning options:
             -c float    -> C: trade-off between training error
                            and margin (default 0.01)
             -o [1,2]    -> Slack rescaling method to use for loss.
                            1: slack rescaling
                            2: margin rescaling
                            (default 1)
             -l [0..]    -> Loss function to use.
                            0: zero/one loss
                            (default 0)
             --g string  grammar file name
             --m string  SRILM file name
             --o int     LM order
             --v string  source-conditioned features file name
    The grammar must be supplied, while the LM file (and order) is optional. You can also supply a file specifying source-conditioned features. This is a fairly complicated format and can really slow down inference, so I wouldn't suggest using unless absolutely necessary. The -l options controls choice the loss function, which must be one of:
             1: unordered token errors
             2: unordered token errors with length penalty
             3: unordered token precision
             4: (3) x brevity penalty
             5: (3) x two-sided brevity penalty
             6: unordered ngram errors (uniformly weighted)
             7: unordered ngram precision (smoothed geometric mean)
             8: (7) x brevity penalty (i.e., BLEU)
             9: unordered CFG production errors
             10: (9) with length penalty
             11: unordered CFG production precision
             12: clipped unordered token errors
             13: clipped unordered token precision
             14: clipped unordered token recall
             15: clipped unordered token F1
             16: unordered edit distance
             17: asymmetric hamming distance, sim. to (2)
    The last (-l 17) is a pretty good default for most tasks.

    Futher options control the style of tree-transduction. For instance, the model can perform tree-to-tree or tree-to-string transduction. Additionally, it can perform exact or loose rule matching. In loose matching the rule internals are ignored; instead the frontier nodes are matched to any complete sequence of descendent nodes in the target tree.

             --j [0,1]   index rule sources with a 0) forest [def] 1) trie
             --k [0,1]   index rule targets with a 0) forest [def] 1) trie
             --q [0..2]  select gold derivation based on mapping source to
                         0) target tree, 1) target string, 2) a surrogate tree;
                         default = 0

    The model requires a feature representation for the gold-standard of each training instance. Because we model derivations, not the resultant tree or string, we need a gold-standard derivation. But this is a latent variable, so consequently, we adopt a few different heuristic to find a derivation. These are controlled using --q, above, and the following:

             --s [1..5]  heuristic for selecting gold derivation:
                           1: maximum number of rules (default)
                           2: minimum number of rules
                           3: at random
                           4: maximum score (may want to use --w)
                           5: minimum score (ditto)
             --f [0,1]   recompute gold (1) every iteration
                         only useful with --s [3,4,5]; default no (0)
    The surrogate option --q 2 is experimental; this allows derivations which do not exactly yield the reference tree/string, but instead chooses a surrogate with low loss. This search is approximate, and is prone to search errors for some loss functions.
  4. Decoding: transducer_classify

    To get predictions on test data, use transducer_classify. This takes the testing instances (tree pairs, or just input trees), the model and writes the predictions to an output file:
       usage: svm_struct_classify [options] example_file model_file output_file
    Most of the same options as training can be supplied, which override the value used in training. (Many of these will have no effect on decoding.)


The manually created abstractive sentence compression data can be downloaded. This data set is introduced in the COLING 2008 paper. Updated March 2011 to correct sentence segmentation problem. See also James Clarke's page for access to extractive sentence compression data. There are a number of sentences in common between this corpus and our new abstractive corpus.


The following papers describe the model in detail:

  • An abstractive approach to sentence compression
    Trevor Cohn and Mirella Lapata. In ACM Transactions on Intelligent Systems and Technology, 3:4, 2013.

  • Sentence Compression Beyond Word Deletion
    Trevor Cohn and Mirella Lapata. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp 137--144, Manchester, August.

  • Sentence Compression as Tree Transduction
    Trevor Cohn and Mirella Lapata. Journal of Artificial Intelligence Research, 2009.

  • Large Margin Synchronous Generation and its Application to Sentence Compression
    Trevor Cohn and Mirella Lapata. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp 73--82, Prague, June.