ILS Colloquium

Agenda

15 September 2016
15:30 - 17:00
Drift 21, 1.05

Gertjan van Noord

Large-scale Automated Syntactic Analysis of Dutch

Rijksuniversiteit Groningen

In this presentation, we will describe some aspects of the Alpino parser and several recent attempts, some of them in vain, to improve it.

The Alpino parser is a system that assigns syntactic structures to Dutch sentences fully automatically. It is a hybrid system in which a hand-written grammar and a large dictionary are combined with a statistical disambiguation component. The grammar and dictionary contain detailed linguistic rules that are used to derive syntactic structures for a given sentence. In most cases, these rules allow multiple candidate structures for a sentence, because of (unintended) ambiguity. For instance, each of the following sentences can be analysed in several ways:

(1) We luisterden naar de berichten over de oorlog in Irak
(2) We luisterden naar de berichten over de oorlog in de auto
(3) Mannen die vrouwen haten

To identify the intended analysis of a sentence in such cases, Alpino includes a statistical disambiguation component that resolves about 80% of such disambiguation decisions. This component is trained on a set of 10 thousand manually verified syntactic analyses (the Alpino treebank).
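
The selection step can be pictured as scoring each candidate analysis and keeping the best one. The following is a minimal sketch under that assumption, not Alpino's actual model: the feature names, the weights, and the linear scoring function are all hypothetical, chosen only to illustrate the attachment choice for "in de auto" in (2).

```python
# Hypothetical sketch: rank candidate analyses by a weighted feature sum.
# Feature names and weights are invented for illustration; they are not
# taken from Alpino.

def score(features, weights):
    """Sum weight * count over the features of one candidate analysis."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())

def best_analysis(candidates, weights):
    """Return the label of the highest-scoring candidate."""
    return max(candidates, key=lambda label: score(candidates[label], weights))

# Two candidate attachments for "in de auto" in example (2).
candidates = {
    "attach_to_noun_oorlog": {"pp_modifies_noun": 1, "noun_pp:oorlog+in+auto": 1},
    "attach_to_verb_luisteren": {"pp_modifies_verb": 1, "verb_pp:luisteren+in+auto": 1},
}

# In a trained system such weights would be estimated from a treebank;
# here they are made up.
weights = {
    "pp_modifies_noun": 0.3,
    "pp_modifies_verb": 0.1,
    "noun_pp:oorlog+in+auto": -1.2,
    "verb_pp:luisteren+in+auto": 0.8,
}

choice = best_analysis(candidates, weights)
print(choice)
```

With these made-up weights the verb attachment wins, because the lexical feature pairing "oorlog" with "in de auto" is penalised.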

The disambiguation component furthermore uses co-occurrence information about heads and their dependents, extracted from much larger corpora, to improve disambiguation accuracy (van Noord 2010). This co-occurrence information tells the parser, for instance, which nouns typically occur as the direct object or the subject of a particular verb. This helps to disambiguate sentences such as:

(4) Die topfilm heeft u natuurlijk gezien
(5) Die getuige heeft u natuurlijk gezien
(6) De wijn die Elvis gedronken zou hebben als hij wijn had gedronken
(7) De paus heeft duizend daklozen te eten gehad
(8) De paus heeft twee biefstukken te eten gehad
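
One common way to turn such co-occurrence counts into a preference is pointwise mutual information (PMI) between a head and its dependent. The sketch below assumes that formulation for illustration only; the counts are invented toy numbers, not data from any corpus, and the function name `pmi` is hypothetical.

```python
import math
from collections import Counter

# Invented toy counts of (verb, relation, noun) triples; not real corpus data.
triples = Counter({
    ("zien", "obj", "topfilm"): 40,
    ("zien", "subj", "topfilm"): 1,
    ("zien", "obj", "getuige"): 10,
    ("zien", "subj", "getuige"): 30,
    ("drinken", "obj", "wijn"): 50,
    ("drinken", "subj", "wijn"): 1,
})

total = sum(triples.values())
head_counts = Counter()  # counts per (verb, relation) slot
dep_counts = Counter()   # counts per noun
for (verb, rel, noun), n in triples.items():
    head_counts[(verb, rel)] += n
    dep_counts[noun] += n

def pmi(verb, rel, noun):
    """Log of P(head, dependent) / (P(head) * P(dependent))."""
    joint = triples[(verb, rel, noun)] / total
    if joint == 0:
        return float("-inf")  # unseen triple: no evidence for this pairing
    p_head = head_counts[(verb, rel)] / total
    p_dep = dep_counts[noun] / total
    return math.log(joint / (p_head * p_dep))

# Higher PMI means the noun is a more typical filler of that slot.
for noun in ("topfilm", "getuige"):
    print(noun,
          "obj:", round(pmi("zien", "obj", noun), 2),
          "subj:", round(pmi("zien", "subj", noun), 2))
```

With these toy counts, "topfilm" scores higher as the object of "zien" and "getuige" scores higher as its subject, which mirrors the contrast between (4) and (5).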

If time permits, we will describe a recent experiment that adds “word embedding” features to the disambiguation component, building on ideas in Mikolov et al. (2013). We compare word embedding features with the original disambiguation approach, and we also report on the combination of the two techniques.

 

Gertjan van Noord. Self-trained Bilexical Preferences to Improve Disambiguation Accuracy. In: Harry Bunt, Paola Merlo and Joakim Nivre (editors), Trends in Parsing Technology: Dependency Parsing, Domain Adaptation, and Deep Parsing. Springer Verlag. pp. 183-200. 2010.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546. 2013.