Resources

Graphing. I have assembled a collection of examples of jgraph scripts and other information about jgraph.

Random numbers. The code in this zip file generates random numbers according to an arbitrary distribution, using uniformly distributed random numbers as the source. It is handy for simulations.
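
The details are in the code itself, but the underlying idea is inverse transform sampling: apply the inverse of the target distribution's cumulative distribution function to a uniform variate. As a rough sketch (not the code from the zip file), the following generates exponentially distributed values this way:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Uniform double strictly inside (0, 1). */
    static double uniform01(void)
    {
        return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    }

    /* Exponential variate with rate lambda, by inverting the CDF. */
    static double exponential(double lambda)
    {
        return -log(uniform01()) / lambda;
    }

    int main(void)
    {
        srand(42);
        for (int i = 0; i < 10; i++)
            printf("%f\n", exponential(2.0));
        return 0;
    }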

Another handy utility is unsort, which randomises the order of lines in the input.
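
unsort itself is the tool to use, but the idea is simple enough to sketch: read every line, shuffle them with Fisher-Yates, and write them back out. A minimal illustrative version in C (not the distributed code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        char buf[4096];
        char **lines = NULL;
        size_t n = 0, cap = 0;

        /* Read every line of standard input into memory. */
        while (fgets(buf, sizeof buf, stdin) != NULL) {
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                lines = realloc(lines, cap * sizeof *lines);
            }
            lines[n++] = strdup(buf);
        }

        /* Fisher-Yates shuffle, then write the lines back out. */
        srand((unsigned)time(NULL));
        for (size_t i = n; i > 1; i--) {
            size_t j = (size_t)rand() % i;
            char *tmp = lines[i - 1];
            lines[i - 1] = lines[j];
            lines[j] = tmp;
        }
        for (size_t i = 0; i < n; i++)
            fputs(lines[i], stdout);
        return 0;
    }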

Efficient data structures. Several of my papers rely on code for high-efficiency hashing and tree structures. Some of the code for these structures was written to support a paper on efficient hashing (full details in my list of publications). The full set of code is in this zip file. Code for burst tries is available on request, but Ranjan Sinha should be approached first.
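
As a hint of what such code involves, here is a generic string hash function of the multiply-and-add kind; it is only an illustration, and is not necessarily the function used in the paper or in the zip file:

    #include <stdio.h>

    /* Map a string to a slot in a table of the given size. */
    static unsigned int hash(const char *s, unsigned int tablesize)
    {
        unsigned int h = 0;
        for (; *s != '\0'; s++)
            h = h * 31 + (unsigned char)*s;
        return h % tablesize;
    }

    int main(void)
    {
        printf("%u\n", hash("example", 1009));  /* slot in a 1009-slot table */
        return 0;
    }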

String sets. Several of the large sets of strings used in our experiments are available from Ranjan Sinha. Some of this data is derived from the TREC web data collections.

String sorting. A simple implementation of ternary quick sort for sorting an array of strings is available. Faster string sorting routines are available on our string sorting page.
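
For a flavour of the method, here is a minimal sketch of ternary (multikey) quicksort in the style of Bentley and Sedgewick: partition on one character position at a time, then recurse on the less-than, equal, and greater-than groups. The distributed implementation may differ in detail.

    #include <stdio.h>

    static void swapstr(char **a, int i, int j)
    {
        char *t = a[i]; a[i] = a[j]; a[j] = t;
    }

    /* Sort a[0..n-1], comparing from character position depth onward. */
    static void mkqsort(char **a, int n, int depth)
    {
        if (n <= 1)
            return;
        int pivot = (unsigned char)a[n / 2][depth];
        int lt = 0, gt = n, i = 0;
        while (i < gt) {                         /* three-way partition */
            int c = (unsigned char)a[i][depth];
            if (c < pivot)
                swapstr(a, lt++, i++);
            else if (c > pivot)
                swapstr(a, i, --gt);
            else
                i++;
        }
        mkqsort(a, lt, depth);                   /* strings < pivot char */
        if (pivot != 0)
            mkqsort(a + lt, gt - lt, depth + 1); /* strings = pivot char */
        mkqsort(a + gt, n - gt, depth);          /* strings > pivot char */
    }

    int main(void)
    {
        char *words[] = { "banana", "apple", "band", "ape", "bandana" };
        int n = sizeof words / sizeof words[0];
        mkqsort(words, n, 0);
        for (int i = 0; i < n; i++)
            puts(words[i]);
        return 0;
    }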

Integer coding. Many of my papers make use of integer coding techniques. Source for some of these techniques, including Elias and Golomb coding, is in this zip file.
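
As an illustration of one of these techniques, the sketch below prints the Elias gamma code of a positive integer n: floor(log2 n) zeros, followed by n in binary. It writes printable 0/1 characters rather than packed bits, so it is not the distributed code.

    #include <stdio.h>

    static void gamma_encode(unsigned int n)    /* requires n >= 1 */
    {
        int bits = 0;
        for (unsigned int v = n; v > 1; v >>= 1)
            bits++;                             /* bits = floor(log2 n) */
        for (int i = 0; i < bits; i++)
            putchar('0');                       /* unary length prefix */
        for (int i = bits; i >= 0; i--)         /* n in binary, msb first */
            putchar((n >> i) & 1 ? '1' : '0');
        putchar('\n');
    }

    int main(void)
    {
        gamma_encode(1);    /* prints 1       */
        gamma_encode(5);    /* prints 00101   */
        gamma_encode(9);    /* prints 0001001 */
        return 0;
    }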

Approximate string matching. Code for searching databases of strings, such as names, is in the vrank suite, which uses string-based techniques such as edit distance; code for phonetic methods is in the ipa suite. Some collections of test data are also available.
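
The classic string-based measure is edit distance, computed with a simple dynamic program; a minimal sketch (not the vrank code, which has its own variants and optimisations) is:

    #include <stdio.h>
    #include <string.h>

    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    /* Levenshtein distance between s and t, row by row. */
    static int edit_distance(const char *s, const char *t)
    {
        int n = (int)strlen(s), m = (int)strlen(t);
        int prev[256], curr[256];               /* assumes strings < 256 chars */

        for (int j = 0; j <= m; j++)
            prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                curr[j] = min3(prev[j] + 1,         /* deletion     */
                               curr[j - 1] + 1,     /* insertion    */
                               prev[j - 1] + cost); /* substitution */
            }
            memcpy(prev, curr, (m + 1) * sizeof(int));
        }
        return prev[m];
    }

    int main(void)
    {
        printf("%d\n", edit_distance("kitten", "sitting"));   /* prints 3 */
        return 0;
    }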

Synthetic text databases. The finnegan suite can be used to generate artificial text databases that are useful for retrieval efficiency experiments; the suite includes the quangle code for generating queries.

Stopping and text processing. The routine rmstop is a simple utility for removing stop words from text; the source includes some stop lists. The awk script double is a simple utility for detecting repeated words in text files.
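
The distributed double is an awk script; purely as an illustration of the task, a C equivalent that reports immediately repeated words might look like this:

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    int main(void)
    {
        char word[256], prev[256] = "";
        int c, len = 0, line = 1;

        while ((c = getchar()) != EOF) {
            if (isalpha(c)) {
                if (len < (int)sizeof word - 1)
                    word[len++] = (char)tolower(c);
            } else {
                if (len > 0) {                  /* end of a word */
                    word[len] = '\0';
                    if (strcmp(word, prev) == 0)
                        printf("line %d: repeated word \"%s\"\n", line, word);
                    strcpy(prev, word);
                    len = 0;
                }
                if (c == '\n')
                    line++;
            }
        }
        return 0;
    }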

The C program getstat reads a file of text (with one word per line) and counts how often each distinct word occurs. Test it on a book; the output should look like this.
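
getstat itself should be used for real experiments, but the task is easy to sketch: read one word per line, then count how often each distinct word occurs, here by sorting the words and counting runs of equal entries. The real program's method and output format may differ.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmpstr(const void *a, const void *b)
    {
        return strcmp(*(char *const *)a, *(char *const *)b);
    }

    int main(void)
    {
        char buf[1024];
        char **words = NULL;
        size_t n = 0, cap = 0;

        /* Read one word per line from standard input. */
        while (fgets(buf, sizeof buf, stdin) != NULL) {
            buf[strcspn(buf, "\n")] = '\0';
            if (buf[0] == '\0')
                continue;
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                words = realloc(words, cap * sizeof *words);
            }
            words[n++] = strdup(buf);
        }

        qsort(words, n, sizeof *words, cmpstr);

        /* Each run of equal words gives one count. */
        for (size_t i = 0; i < n; ) {
            size_t j = i;
            while (j < n && strcmp(words[j], words[i]) == 0)
                j++;
            printf("%zu %s\n", j - i, words[i]);
            i = j;
        }
        return 0;
    }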

Statistics. The programs anova (a shell script), t-test (a shell script), and wilcoxon (in C) are for testing the significance of hypotheses. All three operate on paired columns of numbers, and should be used in conjunction with statistical tables.
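
As an indication of what such a test involves, the sketch below computes the paired t statistic for two columns of numbers: it reads one pair per line, takes the differences, and reports t and the degrees of freedom to look up in a table. It is illustrative only and is not one of the distributed scripts.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a, b, sum = 0.0, sumsq = 0.0;
        long n = 0;

        while (scanf("%lf %lf", &a, &b) == 2) { /* one pair per line */
            double d = a - b;
            sum += d;
            sumsq += d * d;
            n++;
        }
        if (n < 2) {
            fprintf(stderr, "need at least two pairs\n");
            return 1;
        }
        double mean = sum / n;
        double var = (sumsq - n * mean * mean) / (n - 1); /* sample variance */
        double t = mean / sqrt(var / n);
        printf("t = %f with %ld degrees of freedom\n", t, n - 1);
        return 0;
    }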
