Prospective research students/interns, please read this before contacting
me.
Major Projects
Present
- Whopping Volta GPU Cluster (ARC LIEF Project, 2020)
- Fairness in Natural Language Processing (ARC Discovery
Project, 2020—2022)
- ARC Centre in Cognitive Computing for Medical Technologies (ARC
ITRP, 2018–2023)
- Biochemical text mining for advancing chemical and pharmaceutical
knowledge (ARC Linkage Project, 2018—2021)
Past
- Making Computers Understand Common Language about Place (ARC
Discovery Project, 2017–2019)
- User-Adaptive Search and Evaluation for Complex Information-Seeking
Tasks (ARC Linkage Project, 2016–2019)
- Personalised Topic Modelling and Sentiment Analysis for Enhanced
Information Discovery over Document Streams (ARC Linkage Project,
2013–2017)
- VetCompass: Big Data and Real-time Surveillance for Veterinary
Science (ARC LIEF Project, 2016)
- Information access through web-scale question-answer pair
finding, ranking and matching (ARC Future Fellowship,
2013—2016)
- Talking about Place — Tapping Human Knowledge to Enrich
National Spatial Data Sets (ARC Linkage Project,
2011–2014)
- Principles, Practice, and Pragmatics of Measurement in Experimental
Computer Science (ARC Discovery Project, 2011–2014)
- NICTA
biomedical text mining (with Verspoor, Cavedon, Zobel, Moffat et al.)
- OLE (ARC Discovery Project, with Bird: Online Linguistic Exploration: Deeper,
Faster, Broader Language Documentation, 2009–2011)
- Kubadji (ARC Discovery Project, with
Zukerman, Sonenberg, Balbo and Bird: Personalised Content Delivery
for Assisted Navigation of Information Rich, Physical Environments such
as a Museum, 2007–2010)
- Web-scale Language Identification: All Languages Great and Small
(Google Research Award, 2008–2009)
- Multilingual Unsupervised Parse Selection (Microsoft Research Asia Research
Award, 2009–2010)
- Web User Forum Text Analysis (Microsoft Research Asia Research
Award, 2008–2009)
- Information Delivery from Segmented Textual Data Streams (ARC
Discovery Project, 2006–2008)
- Scalable Language Understanding for Japanese (joint research project
with NTT Communication Science Labs., 2006–2008)
- Interactive Information Discovery and Delivery (NICTA project, with
Cavedon, Stokes, Bird, Moffat, et al., 2005–2007)
- An Intelligent Search Infrastructure for Language Resources on the
Web (ARC e-Research Special Research Initiative, with Bird and Hughes, 2006)
- Feature-rich Word Sense Disambiguation and Unknown Word
Bootstrapping (joint research project with NTT Communication Science
Labs., 2004–2006)
Publications
See my publications page for a
reasonably up-to-date list of my papers (with links to most papers). My Google
Scholar profile and Semantic
Scholar profile are also a reasonably accurate snapshot of my
publication output.
Talks
Slides from recent(ish) talks:
- Language and the Shifting Sands of
Domain, Space and Time, Invited Talk at the Fifth Workshop on NLP
for Similar Languages, Varieties and Dialects, Santa Fe, USA.
- Democratic Regression,
Stanford University NLP Golden Hour talk, Stanford, USA.
- Robust, Unbiased Natural Language
Processing, Keynote Talk at the Third Workshop on
Representation Learning for NLP, Melbourne, Australia.
- Language Identification in the
Wild, Guest Talk at the First Workshop on Multi-Language
Processing in a Globalising World, Dublin, Ireland.
- Learning to Label
Documents, Invited Talk at the ICML 2017 Workshop on Learning to
Generate Natural Language, Sydney, Australia.
- VetCompass: Clinical Natural Language
Processing for Animal Health, Invited Talk at the 2016 Clinical
Natural Language Processing Workshop, Osaka, Japan.
- Multiword Expressions at the
Grammar–Lexicon Interface, Invited Talk at the COLING 2016
Workshop: Grammar and lexicon: interactions and interfaces, Osaka,
Japan.
- Multiword Expressions: From Theory to
Practicum, Invited Talk at the ICGL 2015 Workshop: Modern Greek
MWE 2015, Berlin, Germany.
- Text Mining of Social Media: Going
beyond the Text and Only the Text, Invited Talk at the ACL 2015
Workshop on Noisy User-generated Text (W-NUT), Beijing, China.
- Composed, Distributed Reflections on
Semantics and Statistical Machine Translation, Invited talk at the
Eighth Workshop on Syntax, Semantics and Structure in Statistical
Translation (SSST-8 2014).
- Semantic Analysis of Social
Media, Keynote talk at the Third Joint Conference on Lexical and
Computational Semantics (*SEM 2014).
Resources
Online systems
- Twitter user
geolocator: predict the location of a Twitter user based on the text
of their recent posts, and their user profile [developed in collaboration
with Bo Han and Paul Cook]
- FOKS: an intelligent dictionary
interface for Japanese, intended to help learners of Japanese look up
unknown words without having to dust off their kanji dictionary
[developed in collaboration with Lars Yencken and Slaven Bilac]
- Kanji Tester: an adaptive
Japanese learning environment, specifically targeted at those swotting
for the JLPT 3 and 4 exams [developed in collaboration with Lars Yencken]
- SimSearch: a visual kanji
search interface, based on similarity with known kanji [developed in
collaboration with Lars Yencken]
Software
- pigeo: an
automatic geotagging tool, based on text- and graph-based methods
[developed in collaboration with Afshin Rahimi and Trevor Cohn]
- LexSemTM: code for
training topic models to estimate word sense distributions at scale
[developed in collaboration with Andrew Bennett, Jey Han Lau, Francis Bond
and Diana McCarthy]
- polyglot: a language
identification toolkit for multilingual documents [developed in
collaboration with Marco Lui and Jey Han Lau]
- Toolkit for
evaluating topic coherence and topic model quality: toolkit for
evaluating the semantic coherence of individual topics and overall topic
models, as described in our EACL 2014 paper [developed in
collaboration with Jey Han Lau and Dave Newman]
- Twitter user
geolocator: trained models and full code to replicate the Twitter user
geolocation experiments published in our ACL 2013 demo paper [developed in
collaboration with Bo Han and Paul Cook]
- HDP-based word sense
induction system: toolkit for inducing word senses based on a
Hierarchical Dirichlet Process (HDP) [developed in collaboration with Jey
Han Lau and Paul Cook]
- On-line Topic Modeller:
implementation of an on-line topic modeller for trend analysis [developed
in collaboration with Jey Han Lau]
- langid.py: fast, accurate standalone
language identification toolkit; also versions of the pre-trained
identifier in C and JavaScript [developed in
collaboration with Marco Lui]
- SiteScraper:
automatically scrapes data from websites based on a handful of sample
URLs and strings of interest [developed in collaboration with Richard
Penman and David Martinez]
- Hydrat:
Python library for text categorisation/language identification
[developed in collaboration with Marco Lui]
- Malay
tokeniser/lemmatiser: lex/Perl tools for tokenising and lemmatising
Malay text
Datasets
- Pre-trained doc2vec models
for English (described in Lau and Baldwin, 2016)
- LexSemTM
(largely) all-vocabulary trained topic models for English (described
in Bennett et al., to appear)
- CQADupStack
dataset of duplicate questions from StackExchange (described in Hoogeveen et al., 2015)
- Financial agreement named entity
dataset (described in Salinas et al., 2015)
- City label set for user
geolocation (described in Han et al.,
2014; with thanks to Mark Dredze)
- W-NUT 2015
Shared Task on Lexical Normalisation for English Tweets (described in
Baldwin et al., 2015)
- Locative Expressions in Social
Media Text dataset (described in Liu et al., 2014)
- Novel sense dataset
(described in Cook et
al., 2014)
- Twitter and Web lexical sample
sense annotations (described in Gella et al.,
2014)
- Twituser language
identification dataset (described in Lui and Baldwin, 2014)
- Topics
annotated for observed coherence (described in Lau et al., 2014)
- Multilingual language
identification dataset (described in Lui et
al., 2014)
- Lexical normalisation
dictionary (described in Han et al., 2012)
- Japanese
SemCor (described in Bond et al., 2012)
- Multi-domain language
identification dataset (from Lui and Baldwin, 2011)
- Topic
label dataset (described in Lau et al., 2011)
- Lexical normalisation dataset
(described in Han and
Baldwin, 2011, incorporating corrections from Jacob Eisenstein); (old
version: v1.1)
- Multilingual
language identification dataset (as used in the ALTA-2010
Shared Task, and described in Baldwin
and Lui, 2010)
- Web
user forum thread and post structure dataset (described in Kim et al., 2010 and
Wang
et al., 2010)
- Topic
coherence topics and human judgements (described in Newman et al., 2012)
- Language
identification dataset (described in Baldwin and Lui, 2010)
- Case
and punctuation restoration dataset (described in Baldwin and Joseph, 2009)
- Satire
document collection (described in Burfoot and
Baldwin, 2009)
- Tagalog
predicate-argument parsing dataset (described in Mistica and
Baldwin, 2009)
- Pooled kanji
similarity dataset (described in Yencken and Baldwin,
2008)
- Noun-noun
compound semantic relations (described in Kim and
Baldwin, 2008)
- Compound
nominalisation interpretation (described in Nicholson
and Baldwin, 2008)
- Deep
lexical acquisition of English verb-particle constructions (described
in Baldwin,
2008)
- Parsing and WSD dataset (described in Agirre et
al., 2008) — email me for access details
- Kanji
similarity dataset (described in Yencken
and Baldwin, 2006)
- Japanese
grapheme-phoneme alignment data (described in Baldwin and
Tanaka, 1999)
Miscellaneous
Teaching
Present
- COMP10001 Foundations of Computing (Semester 1, 2021; co-lectured with
Nic Geard and Marion Zalk)
Past
- COMP30027 Machine Learning (Semester 1, 2017 and 2019; co-lectured with Karin
Verspoor, Afshin Rahimi, and Jeremy Nicholson)
- Lecture series on Social
Media and Text Analytics presented as part of the International Summer School on
Web Science and Technology (2016)
- Series of guest lectures on Text Analysis of Social Media
presented as part of Language Technology II (Saarland University, August,
2014)
- COMP10001 Foundations of Computing (Semester 1, 2012—2020; co-lectured with
Andrew Turpin, Egemen Tanin, Nic Geard, and Marion Zalk)
- COMP90051 Statistical and Evolutionary Learning (Semester 2, 2011;
co-lectured with Michael Kirley)
- COMP30018 Knowledge Technologies (2010—2012)
- INFO10001 Informatics 1 (2008—2011)
- Empirical
Approaches to Multilingual Lexical Acquisition (Saarland University,
Winter, 2008; taught as part of the Erasmus Mundus LCT Masters program)
- 433-352 Data on the Web (2006—2009)
- 433-484/684 Machine Learning (2006–2008)
- ESSLLI 2006 course on
Data-Driven Methods for Acquiring Linguistic Information (with Aline
Villavicencio, Anna Korhonen and Valia Kordoni)
- Lexical semantics course convener and co-lecturer for ACL/HCSNet Advanced
Program in Natural Language Processing (2006)
- 433-253 Algorithms and Data Structures (2006—2007; co-lectured
with Linda Stern)
- 433-395 Advanced Topic in Computer Science (2005)
- 433-680 Machine Learning (2005)
- An Introduction to
Computational Word Learning (Stanford University, Fall Quarter,
2003)
Staff
Present
- Victor Fedyashov (Research Fellow 2019—)
- Meladel Mistica (Research Fellow 2019—)
- Aili Shen (Research Fellow 2020—)
- Shivashankar Subramanian (Research Fellow 2020—)
- Simon Šuster (Research Fellow 2020—)
Past
- Afshin Rahimi (Research Fellow 2018—2019)
- Bahar Salehi (Research Fellow 2017—2019)
- Julian Brooke (McKenzie Postdoctoral Fellow 2015—2017)
- Andrew Bennett (Research Associate 2016—2017)
- Huizhi Liang (Research Fellow 2014—2016)
- Joel Nothman (Research Fellow 2015—2016)
- Angelos Molfetas (Research Fellow 2014)
- Yvette Graham (Research Fellow 2012—2014)
- Paul Cook (McKenzie Postdoctoral Fellow 2011—2014)
- Jey Han Lau (Research Fellow 2013)
- Rebecca Dridan (Research Fellow working on OLE 2009—2011)
- Gintarė Grigonytė (Visiting Research Fellow 2011)
- Su Nam Kim
(Research Fellow working on LangID and ILIAD 2009—2010)
- Patrick Ye (Research Fellow working on Kubadji 2009—2010)
- David Martinez (Research Fellow working on ILIAD 2007—2009)
- Marco Lui (Research Assistant working on ILIAD and LangID 2009—2010)
- Richard Penman (Research Assistant working on ILIAD 2008—2009)
- Shlomo Berkovsky (Research Fellow 2007—2008)
- Kapil Gupta (Research Fellow 2009)
Students
Present
- Shraey Bhatia (PhD student; co-supervised with Jey Han Lau)
- Biaoyan Fang (PhD student; co-supervised with Karin Verspoor)
- Brian Hur (PhD student; co-supervised with James Gilkerson, Laura
Hardefeldt, and Karin Verspoor)
- Anirudh Joshi (PhD student; co-supervised with Richard Sinnott and
Cecile Paris)
- Fajri Koto (PhD student; co-supervised with Jey Han Lau)
- Haonan Li (PhD student; co-supervised with Martin Tomko and Maria Vasardani)
- Nitika Mathur (PhD student; co-supervised with Trevor Cohn)
- Yulia Otmakhova (PhD student; co-supervised with Karin Verspoor
and Jey Han Lau)
- Takashi Wada (PhD student; co-supervised with Jey Han Lau)
- Siyang Wang (MSc(CS) student; co-supervised with Simon Šuster)
- Yuxia Wang (PhD student; co-supervised with Karin Verspoor)
Past
- Chenbang Huang (MSc(CS) student; co-supervised with Aili Shen)
- Qian Sun (MSc(CS) student; co-supervised with Aili Shen)
- Wayan Oger Vihikan (MIT student; co-supervised with Meladel Mistica)
- Fan Ye (MSc(CS) student; co-supervised with Simon Šuster)
- Shuanglong You (MSc(CS) student; co-supervised with Victor Fedyashov)
- Aili Shen (PhD student; co-supervised with Jianzhong Qi and
Bahar Salehi)
- Shivashankar Subramanian (PhD student; co-supervised with Trevor
Cohn)
- Saumya Pandey (MSc(CS) student; co-supervised with Lea Frermann)
- Haowen Tang (MSc(CS) student)
- Gaurav Arora (MSc(CS) student; co-supervised with Afshin Rahimi)
- Yitong Li (PhD student; co-supervised with Trevor Cohn)
- Fei Liu (PhD student; co-supervised with Trevor Cohn)
- Adel Foda (PhD student; co-supervised with Jey Han Lau)
- Tatsuya Aoki (visiting PhD student from Tokyo Institute of Technology)
- Leo Bouillet (MSc(CS) student)
- Jun Wang (MSc(CS) student; co-supervised with Graeme Gange)
- Karen Qu (MIT student; co-supervised with Afshin Rahimi)
- Ekaterina Vylomova (PhD student; co-supervised with Trevor Cohn)
- Navnita Nandakumar (MSc(CS) student; co-supervised with Bahar Salehi)
- Qianji Di (MIT student; co-supervised with Ekaterina Vylomova)
- Jinxiang Wang (MSc(CS) student)
- Doris Hoogeveen (PhD student; co-supervised with Karin Verspoor)
- Afshin Rahimi (PhD student; co-supervised with Trevor Cohn)
- Ned Letcher (PhD student; co-supervised with Emily Bender)
- Jingyuan Zhang (MIT student)
- Jim Breen (PhD
student; co-supervised with Francis Bond)
- Steven Xu (MSc(CS) student; co-supervised with Jey Han Lau)
- Katharine Cheng (MSc(CS) student; co-supervised with Karin Verspoor)
- Richard Fothergill (PhD student — currently working at
Rome2rio)
- Viet Nguyen (MSc(CS) student; co-supervised with Julian Brooke)
- Shraey Bhatia (MSc(CS) student; co-supervised with Jey Han Lau
— currently a PhD student at The University of Melbourne)
- King Chan (completed PGDip studies in 2017; co-supervised with Julian Brooke)
- Ionut-Teodor Sorodoc (visiting Masters student in 2016; co-supervised with
Jey Han Lau)
- Bahar Salehi (completed PhD 2016; co-supervised with Paul Cook
— currently working at Go1)
- Andrew Bennett (completed MSc(CS) 2016; co-supervised with Jey Han Lau,
Francis Bond, Diana McCarthy and Paul Cook — currently a PhD student
at Cornell University)
- Liang Han (completed MIT 2016)
- Michael Niemann
(completed PhD 2015; co-supervised with Henry Linger — currently
working at Monash University)
- Julio Salinas (completed MIT 2015; co-supervised with Karin Verspoor)
- Marco Lui (completed PhD 2015 — currently working at
Rome2rio)
- Li Wang (completed PhD 2015;
co-supervised with Su Nam Kim — currently working at Dropbox)
- Nitika Mathur (completed MSc(CS) 2014; co-supervised with Yvette Graham)
- Bo
Han (completed PhD 2014; co-supervised with Paul Cook)
- Xiwei Wang (completed MSc(CS) 2014; co-supervised with Yvette
Graham — currently working at Alibaba)
- Andrew Chester (completed MSc(CS) 2014; co-supervised with Tony
Wirth)
- Jared Willett (completed MSc(CS) 2012; co-supervised with David Martinez
and Angus Webb)
- Siming Wang (completed PGDip 2013; co-supervised with Alistair
Moffat)
- Meladel Mistica (completed PhD 2013; external supervisor —
currently working at The University of Melbourne)
- Spandana Gella (completed MSc(CS) 2013; co-supervised with Paul
Cook — currently working at Amazon)
- Jey Han Lau (completed PhD 2013; co-supervised with Dave Newman —
currently working at The University of Melbourne)
- Clint Burford (completed PhD 2013; co-supervised with Steven Bird
— currently working at Apple)
- Willy Yap (completed PhD 2013; co-supervised with Tara McIntosh —
currently working at Sportsbet)
- Luke Parkinson (MSc(CS); co-supervised with Paul Cook)
- Matěj Korvas (completed MSc(CS) 2012)
- Igor Tytyk (completed MSc(CS) 2012 —
currently working at Grammarly)
- Andrew MacKinlay (completed PhD 2012 —
currently working at Culture Amp)
- Karl Grieser (completed PhD 2012 —
currently working at Redbubble)
- Ned Letcher (completed BSc(Hons) 2010)
- Lars Yencken
(completed PhD 2010)
- Marco Lui (completed BCS(Hons) 2009 —
currently working at Rome2rio)
- Ben White (completed MIT 2009)
- Li Wang (completed MIT 2009)
- Patrick Ye (completed PhD 2009 —
currently working at Amazon)
- Lejoe Kuriakose (completed MEDC 2008)
- Paul Joseph (completed MSSE 2008)
- Su Nam Kim (completed PhD 2008)
- Michael Yang (completed BCS(Hons) 2007)
- Sumukh Ghodke (completed MSSE 2007)
- Phil Blunsom (completed PhD 2007 —
currently working at Oxford University/Google DeepMind)
- Edward Ivanovic (completed MPhil 2007)
- Aidan Furlan (completed BCS(Hons) 2006)
- Karl Grieser (completed BSc(Hons) 2006)
- Rebecca Dridan (completed MPhil 2006)
- Jeremy Nicholson (completed BCS(Hons) 2005 —
currently working at UniMelb)
Interested in pursuing natural language processing research at The
University of Melbourne? Contact me directly, making sure to include a CV
and description of your research interests.
UniMelb Administration
Present
- CIS Education Committee (2008—)
Past
- Associate Dean (Research Training) (2017—2020)
- Higher Degree Research Committee (2018—2020)
- Chair of the East Asia Regional Study Group (2016—2017)
- Member of Arts Faculty Board (2015)
- Deputy Head of Department (2011—2012)
- Teaching Committee chair (2010—2012)
- School of Engineering Education Committee (2010—2012)
- CIS Faculty of Science liaison (Science APC/PGPC: 2005—2011)
- CSSE Research Committee (2008—2011)
- CSSE Postgraduate Coursework Programmes Committee chair
(2007–2009)
- School of Engineering IT Advisory Group (2009)
- Honours/PGDip Coordinator (2005–2007)
- Publications Liaison (2005–2007)
- Member of Science Tools working group, Faculty of Science (2007)
Professional Activities
Present
- Vice President, Association for Computational Linguistics (2021)
- Programme co-chair of The
7th Workshop on Noisy User-generated Text (W-NUT)
- Permanent Member of the International Committee on Computational Linguistics (2014—)
- Advisory Board for ACL
SIGLEX (Special Interest Group on the Lexicon) (2014—)
- Advisory Board for ACL
SIGDAT (Special Interest Group for linguistic data and corpus-based approaches to NLP) (2014—)
- Editorial board of Transactions of the Association for Computational
Linguistics (2015—)
Past (highlights)
Random Miscellanea
In the media:
- TOPBOTS:
GPT-3 & Beyond: 10 NLP Research Papers You Should Read (17/11/2020)
- The Australian: Engineering & Computer Science Australia’s Research Field Leaders (23/9/2020)
- And
the Winner Is ... ACL 2020 Announces Best Paper Awards, Slator (9/7/2020)
- This
AI Poet Mastered Rhythm, Rhyme, and Natural Language to Write Like
Shakespeare, IEEE Spectrum (30/4/2020)
- The Australian: Engineering & Computer Science Australia’s Research Field Leaders (10/6/2019)
- Australian
Financial Review: IBM to build $10 million AI centre with Melbourne Uni (10/6/2019)
- New
Scientist News, The
Times, Daily Mail, Digital Trends, NVIDIA, InfoSurHoy, la Repubblica, BBC Radio 4: Deep-speare — A joint neural model of
poetic language, meter and rhyme (7/2018)
- ABC
News: Can we Replace Red Symons with a Robot? (3/10/2017)
- Crikey:
How can You Tell if a Tweet is Credible? (6/3/2017)
- Farrago:
The Revolution Will Be Computerised (29/8/2016)
- Tech
Exec: The Fourth Revolution: Artificial Intelligence
(29/1/2016)
- MIT
Technology Review: King – Man + Woman = Queen: The Marvelous
Mathematics of Computational Linguistics (17/9/2015)
- NCI
News: Real-time Twitter Mining (30/9/2014)
- The
Age: The Rise of Artificial Intelligence (23/1/2014)
- Oregonian:
Tweet Talk (24/6/2011)
- Sydney
Morning Herald: Big Brains Coming back to Melbourne (6/12/2005)
- UniNews:
Reversing the Brain Drain (14/11/2005)
In a moment of weakness, I signed up for LinkedIn.
For the trivia lovers, here is my (almost certainly outdated) full CV.