blob: 2672dd3e4d7e3c174cd0f738b0c386f534e53d1d [file] [log] [blame]
\documentclass[a0,landscape]{a0poster}
%\documentclass[a0,draft]{a0poster}
% With the word preview in the square bracket, your poster will be a4
% size, handy for previewing or submit to x4u queue for A0.
% with preview removed, you get a file the right size for the HP
% large-format printer. This is a bit bigger than A0, the width is
% exactly one imperial yard.
% This is so we can have multiple columns of text side-by-side
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{multicol}
\usepackage{helvet}
\usepackage{sectsty}
%\allsectionsfont{\usefont{OT1}{phv}{bc}{n}\selectfont}
\allsectionsfont{\sffamily} \subsubsectionfont{\sffamily\large}
\columnsep=100pt
% This is the thickness of the black line between the columns of text
\columnseprule=3pt
% this package gives you coloured text and various other simple
% graphics hacks.
% Details in /usr/local/teTeX/texmf/doc/generic/pstricks/*
\usepackage{pstricks}
%\psset{unit=1cm}
\usepackage{times}
\usepackage{url}
% Define names for some colours
\newcmykcolor{Inblue}{1.00 0.37 0.00 0.00}
\newcmykcolor{Inred}{0.00 1.00 0.63 0.00}
\newrgbcolor{Inmaroon}{0.4 0.0 0.4}
\newrgbcolor{darkblue}{0.0 0.0 0.5}
% Colour used for figure captions. Change to suit your own preference
\newrgbcolor{captcolor}{0.0 .5 0.0}
\begin{document}
% This is black magic to put a title at the top of the page.
% Thanks to Mark ``the magician'' Filipiak, although I have re-done a
% lot of the code here -- hopefully it is more robust.
% Make a box 0.55* width of poster for title and names
% If your title is long, replace 0.55 with something bigger and
% make the box for the address smaller by the same amount.
\begin{minipage}[b]{0.75\linewidth}
\veryHuge \bf
\textsf{Rapid rule-based machine translation between Dutch and Afrikaans}
\\[1cm]
\huge \bf Pim Otte, Francis M. Tyers,\\
\huge \rm Mendelcollege, Universitat d'Alacant
\end{minipage}
% Make box 0.35 X width of poster for address
\begin{minipage}[b]{0.25\linewidth}
\Large
Contact: [email protected], [email protected] \\
\end{minipage}
%% Make box 0.05*width of poster for EU Shield graphic
%\begin{minipage}[b]{0.05\linewidth}
%\includegraphics[width=10cm]{logo_ofis_liv.pdf}
%%\includegraphics{uitlogo}
%\end{minipage}
%\vspace{0.3cm}
%\hrule
%vspace{0.3cm}
%\begin{minipage}[b]{\linewidth}
%\hrule
%{\small \url{http://www.ofis-bzh.org/bzh/ressources_linguistiques/index-troerofis.php}}
%\end{minipage}
%
% This is how many columns your poster will be broken into.
\begin{multicols}{4}
% This is the width your figures will be scaled to. Change this if you
% change the number of columns.
\newlength{\figwidth}
\setlength{\figwidth}{20cm}
% Set to half of figwidth. Used for putting two figs side by side
\newlength{\fighalfwidth}
\setlength{\fighalfwidth}{10cm}
% Can't use the figure environment within multicolumns. Set up our own
% counter for figures.
\newcounter{figscount}
\section{Introduction}
\noindent
{\bf Dutch} is a West-Germanic language spoken by nearly 23 million people, mostly from the
Netherlands and Flanders. {\bf Afrikaans} is spoken by at least 5 million people, mainly in
South-Africa, but also in Namibia. Afrikaans also belongs to the West-Germanic language family and originates from the language
spoken by the the Dutch colonists of the Cape Colony. In 1925 Afrikaans replaced Dutch as
an official language in South-Africa, to be the joint official language together with English.
On this poster we will describe the development of {\small {\tt apertium-af-nl}}, a bi-directional Afrikaans
and Dutch machine-translation system
\begin{center}
\begin{minipage}[b]{26cm}
\includegraphics[width=260mm]{mapdutchworld.png}
\end{minipage}\\
\textbf{Figure 1:} Where Afrikaans or Dutch is spoken as official language.
\vspace{0.3cm}
\end{center}
%\vspace{0.5cm}
\section{Method}
\noindent
The system is based on {\bf Apertium} (\url{http://www.apertium.org/}), a free/open-source rule-based
machine translation platform. To create the language pair we used an existing resource and created
several new ones. \\
\begin{center}
\begin{minipage}[b]{26cm}
\includegraphics[width=260mm]{apertium2.pdf}
\end{minipage}\\
\textbf{Figure 2:} Modules of the Apertium translation system
\vspace{0.3cm}
\end{center}
\vspace{0.5cm}
\subsection{Existing resources}
\noindent
We reused the morphological transducer for Afrikaans, created during a currently dormant English-Afrikaans
Machine Translation project.\\
\subsection{Resources created}
\subsubsection{Dutch morphological transducer}
A new Dutch morphological transducer was created, because existing ones were unsuitable for several reasons:
\begin{itemize}
\item Non-free licence
\item Not bidirectional (only analysis, not generation)
\item Tagset different from Afrikaans transducer
\end{itemize}
%\vspace{0.5cm}
\noindent
The open categories (nouns, verbs, adjectives, adverbs) for the Dutch
morphological analyser were extracted semi-automatically from
Wiktionary, (\url{http://www.wiktionary.org}), which has entries like in {\bf Figure 3}. The resulting analyser entries were all hand checked.
Closed categories were added by hand based on the grammar of Dutch of Shetter and Ham.
\vspace{0.5cm}
\begin{center}
\begin{minipage}[b]{26cm}
\includegraphics[width=260mm]{hoofdstad.png}
\end{minipage}\\
\textbf{Figure 3:} Example of Wiktionary entry: hoofdstad
\label{wikt1}
\vspace{0.3cm}
\end{center}
\vspace{0.5cm}
\subsubsection{Bilingual dictionary}
\noindent
The bilingual dictionary was developed by adding exact matches, extracting proper names
from Wikipedia, adding cognates, words which often have a small spelling difference and adding
some entries by hand, such as closed categories and frequently missing words.
\subsection{Transfer rules}
\subsubsection{Afrikaans to Dutch}
An example of an Afrikaans to Dutch transfer rule, which handles the negation scope marker, can be seen in
{\bf Figure 4}
\begin{center}
\begin{minipage}[b]{25cm}
\begin{scriptsize}
\begin{verbatim}
<rule comment="REGLA: nie">
<pattern>
<pattern-item n="nie"/>
</pattern>
<action>
<choose>
<when>
<test>
<equal>
<var n="seen_neg"/>
<lit v="true"/>
</equal>
</test>
</when>
<otherwise>
<out>
<chunk name="nie">
<tags>
<tag><lit-tag v="ADV"/></tag>
</tags>
<lu>
<clip pos="1" side="tl" part="lemh"/>
<clip pos="1" side="tl" part="a_adv"/>
<clip pos="1" side="tl" part="lemq"/>
</lu>
</chunk>
</out>
</otherwise>
</choose>
<let>
<var n="seen_neg"/>
<lit v="true"/>
</let>
</action>
</rule>
\end{verbatim}
\end{scriptsize}
\end{minipage}\\
~\\
\textbf{Figure 4:} Transfer rule for handling negation.
\vspace{0.5cm}
\end{center}
\subsubsection{Compound words}
In both Afrikaans and Dutch words can combine very productively into compounds.
{\small {\tt apertium-af-nl}} can handle compounds that consist of one or more nouns,
such as the one in {\bf Figure 5}. \\
\begin{center}
\begin{minipage}[b]{26cm}
\begin{small}
\begin{verbatim}
lugmag
^lug<n><sg><cmp>+mag<n><sg>$
^lucht<n><mf><sg><cmp>$^macht<n><mf><sg>$
luchtmacht
\end{verbatim}
\end{small}
\end{minipage}\\
\end{center}
~\\
\textbf{Figure 5:} Example of compound analysis and translation: lugmag (air force)
\subsubsection{Separable verbs}
\noindent
The system can handle separable verbs, which exist both in Afrikaans and Dutch, as
long as there are no other phrases between the separated parts.\\
\section{Evaluation}
\noindent
We evaluated the system in several ways: Naïve coverage, compound analysis, qualitative, quantitative and comparative
\subsection{Coverage}
\noindent
Naïve coverage was calculated over Wikipedia corpora.\\
\begin{minipage}[b]{25cm}
\begin{center}
\begin{tabular}{|l|r|r|}
\hline
{\bf Corpus} & {\bf Tokens} & {\bf Coverage}\\
\hline
{\tt af} Wikipedia & 2,926,943 & 82.1\% $\pm$ 0.8 \\
\hline
{\tt nl} Wikipedia & 18,569,183 & 80.5\% $\pm$ 0.7 \\
\hline
\end{tabular}
\end{center}
\textbf{Table 1:} Na\"ive vocabulary coverage for the two morphological analysers.
\end{minipage}\\
\subsection{Compound words}
\noindent
To test the accuracy of the compound word analysis, we marked words as correctly segmented and/or
correctly translated, this test was performed in the Afrikaans$\rightarrow$Dutch direction. \\
\begin{minipage}[b]{25cm}
\begin{center}
\begin{tabular}{|l|r|r|}
\hline
{\bf Corpus} & {\bf Corr. Seg.} & {\bf Corr. Trans.}\\
\hline
top-1,000 & 914 & 776 \\
\hline
random-1,000 & 957 & 801 \\
\hline
\end{tabular}
\end{center}
\textbf{Table 2:} Compound word accuracy in analysis and translation.
\end{minipage}\\
\subsection{Quantitative}
\noindent
The translation quality was measured using word error rate (WER). For this we used sets
of 100 sentences from Wikipedia. The set C1 could not contain any unknown words, set C2 could. \\
\begin{minipage}[b]{25cm}
\begin{center}
\begin{tabular}{|l|l|r|r|}
\hline
{\bf Dir.} & {\bf System} & {\bf C1} & {\bf C2} \\
\hline
{af-nl} & {\small Apertium} & 16.625 $\pm$ 1.465 & 23.405 $\pm$ 1.235 \\
& {\small Google} & {\bf 9.485 $\pm$ 1.115} & {\bf 10.575 $\pm$ 1.795} \\
\hline
{nl-af} & {\small Apertium } & {\bf 15.435 $\pm$ 1.885} & {\bf 21.72 $\pm$ 1.06} \\
& {\small Google } & 21.81 $\pm$ 1.72& 25.71 $\pm$ 1.22 \\
\hline
\end{tabular}
\label{table:quan}
\end{center}
\textbf{Table 3:} Accuracy for the test corpora for the two systems as measured by Word Error Rate
with 95\% confidence interval.
\end{minipage}\\
\subsection{Qualitative}
To discover which areas of the system could use the most improvement we reviewed the translation errors
in the Afrikaans$\rightarrow$Dutch direction and categorised them as in {\bf Table 4}. \\
\begin{minipage}[b]{25cm}
\begin{center}
\begin{tabular}{|l|c|r|r|}
\hline
{\bf Error type} & {\bf Count} & {\bf \% of total} \\
\hline
Syntactic transfer & 235 & 42.4 \\
\hline
~~~- Verb concordance & 99 & 17.9 \\
~~~- Auxiliary verbs & 13 & 2.3 \\
~~~- Relative pronoun& 11 & 2.0 \\
~~~- Capitalisation & 10 & 1.8 \\
~~~- Chunking error & 9 & 1.6 \\
~~~- Other & 93 & 16.8 \\
\hline
Unknown word & 147 & 26.5 \\
Disambiguation & 106 & 19.1 \\
Morphology & 28 & 5.1 \\
Polysemy & 23 & 4.2 \\
Multiword & 6 & 1.1 \\
Compounding & 6 & 1.1 \\
Separable verb & 3 & 0.5 \\
\hline
Total & 554 & 100 \\
\hline
\end{tabular}
\end{center}
\textbf{Table 4:} Contribution to total error by type. Syntactic transfer errors are split into
further categories.
\end{minipage}\\
\subsection{Comparative}
Finally, we compared {\small {\tt apertium-af-nl}} with Google translate, using WER. The results
are in {\bf Table 3}. {\bf Figure 6} is an example of an error, caused by wrong disambiguation. \\
\begin{itemize}
\item[] Hier volg 'n lys van hoofstede.
\item[] Hier {\em volgen} een lijst van hoofdsteden.
\item[] Hier {\em volgt} een lijst van hoofdsteden.
\item[] `Here follows a list of capital cities.'
\end{itemize}
\textbf{Figure 6:} Example of disambiguation error.
\section{Future work}
We have presented a bi-directional rule-based machine translation system between Dutch and Afrikaans.
The system gives promising results and offers improvement in the Dutch$\rightarrow$Afrikaans direction
over another public system, but does not offer improvement in quality in the other direction.
The three biggest issues in the system are:
\begin{itemize}
\item Lack of dictionary coverage -- some common words are missing, such as kalender, rewolosie, binneland, silwer
\item Poor morphological disambiguation -- work is needed to be able to, e.g. distinguish better between present and infinitive in Afrikaans
\item Insufficient syntactic transfer -- separable constructions need more support, amongst which separable verbs and auxiliary verbs with participles.
\end{itemize}
\section*{Acknowledgements}
Development of this system was partially supported by the Google Code-in,
a contest to introduce pre-university students to contributing to open-source
software. This work has also received the support of the Spanish Ministry of Science and Innovation through project TIN2009-14009-C02-01.
\includegraphics{apertium.png}
\end{multicols}
\end{document}