Multiconcord: the Lingua Multilingual Parallel Concordancer for Windows

Introduction

This page describes the work undertaken at the University of Birmingham under Lingua project no. 93-09/1245/F-VB (Co-ordinator: Francine Roussel, Université de Nancy II) to develop a Windows-based parallel concordancer for classroom use (Programmer: David Woolls, with support at Birmingham from Philip King and Tim Johns).

System Requirements

PC with a 486 processor or faster and Windows 3.x or Windows 95.

Background

The theoretical background to the project is to be found in the work of Gale and Church in the 1980s on text alignment. Let us say that we have a text in English and a (skilled) translation of that text in French, and that we are interested in how the translator has handled the translation, in context, of a particular word. Using normal concordancing techniques, the program is able to identify all instances of the word in the Search Language (here, English), and is also able to identify the paragraphs and sentences in which those instances occur: say, sentence 5 in paragraph 2, sentence 4 in paragraph 3, and so on. The task for the computer is now to identify the equivalent sentences in the Target Language (here, French).

For our approach to this task, the two texts must have been aligned in advance at the level of the paragraph, so that paragraph 5 in one language is equivalent to paragraph 5 in the other language. It is difficult to employ this approach at sentence level, since a skilled translator may well translate one sentence by two, or two by one, three by two, and so on. This is the central problem of text alignment. Most solutions to that problem, including the work at Birmingham, rest on the following assumptions:

a. The usual pattern of translation is for one sentence to be translated by one sentence.
b. Another general feature of the usual pattern is for short sentences to be translated by short sentences, and for longer sentences to be translated by longer sentences.
c. Assumptions a and b, operating together, give a match of patterning of short and long sentences between the original text and the translation that is consistent enough for places where it is disturbed to be clearly detectable, and for the program to test a range of hypotheses to account for the disturbance and thereby attempt to re-establish the match.

What is distinctive about the work at Birmingham is that the alignment at sentence level is made 'on the fly' when a concordance is requested, and that, while most other work in this area has sought to elaborate the methods proposed by Gale and Church in order to achieve greater accuracy, the Birmingham approach has been to simplify those methods.
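The length-based idea behind the assumptions above can be illustrated with a minimal Python sketch. It is not the Birmingham program's method, nor Gale and Church's probabilistic model: the function name `align`, the cost function (relative difference in character length) and the penalty values for one-to-two and two-to-one groupings are all assumptions chosen for illustration. It aligns the sentences of one already-paired paragraph.

```python
# Allowed groupings (source sentences, target sentences) with a penalty each.
# The penalty values are illustrative assumptions, not taken from the project.
MOVES = {(1, 1): 0.0, (1, 2): 0.5, (2, 1): 0.5}


def align(src, tgt):
    """Align the sentences of one paragraph pair using sentence lengths only.

    Returns a list of (source indices, target indices) groups.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # best[i][j] = (cost, move used) for aligning src[:i] with tgt[:j]
    best = [[(inf, None)] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            cost_here = best[i][j][0]
            if cost_here == inf:
                continue
            for (di, dj), penalty in MOVES.items():
                if i + di > n or j + dj > m:
                    continue
                a = sum(len(s) for s in src[i:i + di])
                b = sum(len(s) for s in tgt[j:j + dj])
                # Short should match short, long should match long.
                cost = cost_here + penalty + abs(a - b) / max(a + b, 1)
                if cost < best[i + di][j + dj][0]:
                    best[i + di][j + dj] = (cost, (di, dj))
    if best[n][m][0] == inf:
        raise ValueError("no alignment possible with the allowed groupings")
    # Walk back from the end to recover the chosen groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = best[i][j][1]
        groups.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    groups.reverse()
    return groups


english = [
    "Short one.",
    "This is a much longer sentence that the translator split in two parts for clarity.",
]
french = [
    "Courte.",
    "Ceci est une phrase beaucoup plus longue.",
    "Le traducteur l'a coupée en deux.",
]
print(align(english, french))  # [([0], [0]), ([1], [1, 2])]
```

Because the paragraphs are assumed to be paired in advance, the search only ever runs over a handful of sentences, which is what makes alignment at the moment of the query plausible.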

The other distinctive feature of the Lingua project is that its primary focus is practical: our primary aim has not been to invent new methods of text alignment (though that has been an incidental spin-off), but to develop a working program and a methodology for teachers and students to exploit the program in language-learning. This work is based on the following assumptions:

The learning methods developed at Birmingham and elsewhere on the basis of monolingual concordance output would be equally applicable, and could be enriched, in the context of multilingual concordance output.
The opportunity to study "good practice" would constitute a considerable reinforcement in the teaching of translation.
The program could form the basis for a reassessment of the place of translation in general foreign language teaching - for example, in giving opportunities for "applied contrastive analysis" and in weaning students from the myth of one-to-one correspondence between first and second language.

These assumptions are currently being put to the test in trialling of the program.

Search Facilities

Languages

The user must specify a search language and a target language.

For example, the user may tell the computer to search the English files and to identify the corresponding sentences in the French files.

The 10 languages included in the main project, and their identifying file extensions, are:

In addition, work has been done on Russian, Czech, Polish, Lithuanian, Venda, Zulu and other alphabetic languages. For details of versions of the program able to handle these, contact David Woolls.

File Selection

The program automatically identifies, and prints a list of, the files that are available in both the Search and the Target languages.

Up to 10 files can be chosen from the list offered.
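How a program might find the files available in both languages can be sketched as follows. This is only an illustration: the function name `files_in_both` and the extensions `en`, `fr` and `da` are hypothetical, since the project's actual extensions are not listed here.

```python
from pathlib import PurePath


def files_in_both(filenames, search_ext, target_ext):
    """Return the base names that exist with both language extensions."""
    def stems(ext):
        wanted = "." + ext.lower()
        return {
            PurePath(name).stem.lower()
            for name in filenames
            if PurePath(name).suffix.lower() == wanted
        }
    return sorted(stems(search_ext) & stems(target_ext))


# Hypothetical file names: same base name, extension shows the language.
names = ["alice.en", "alice.fr", "alice.da", "report.en", "treaty.fr"]
print(files_in_both(names, "en", "fr"))  # ['alice']
```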

Search Options

Main Search

The following examples show the options available when looking for words or phrases that you are interested in:

Search option	Example
Single-word item	planet
Multiple-word item	carbon dioxide
Final wild card	ortho*
Initial wild card	*ible
Medial wild card	un*ly

Some points to notice:

Searches are not case-sensitive: thus both french and FRENCH identify all examples of French.
Problems of diacritics when specifying search items in languages other than English can be overcome by:
- Using country-specific keyboards
- Using a keyboard manager (which can be supplied)
- Using the ANSI codes from the numeric keypad (not recommended for regular use)
Greek needs a keyboard manager and a specific font to be loaded into the Windows system. The program uses TimelTGR, which is found on most multilingual word-processing systems. Note, however, that the language-awareness of Windows 95 removes this problem.
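The wild-card options and the case-insensitive matching described above can be approximated with regular expressions. The sketch below is an assumption about how such a search could work, not the program's own code; `compile_search` is a hypothetical name, and treating a search item as a whole word (so that planet does not match planets) is a design choice made here.

```python
import re


def compile_search(item):
    """Turn a search item with optional * wild cards into a regex."""
    parts = []
    for word in item.split():
        # Escape the literal pieces; * stands for any run of word characters.
        pieces = [re.escape(piece) for piece in word.split("*")]
        parts.append(r"\w*".join(pieces))
    # \b keeps matches on word boundaries; \s+ joins multiple-word items.
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


print(compile_search("ortho*").findall("Orthodox views on orthography"))
# ['Orthodox', 'orthography']
print(compile_search("*ible").findall("A possible but incredible tale"))
# ['possible', 'incredible']
print(compile_search("un*ly").findall("unusually and unlikely, not until"))
# ['unusually', 'unlikely']
```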

Context Search

Further control over the output is offered by the facility to specify the presence of a word or phrase in the context of the search item, context being defined as:

From one to six words to the left of the search word.
From one to six words to the right of the search word.
From one to six words to the left, or one to six words to the right of the search word.
Being in the same sentence as the search word.
Being in the same paragraph as the search word.
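The context definitions above amount to a window test around each hit. A minimal sketch follows, assuming single-word search and context items and a simple tokenisation on word characters; the function name `in_context` and its parameters are hypothetical. Passing `span=None` treats the whole text passed in (a sentence or a paragraph) as the context.

```python
import re


def in_context(text, search, context, side="either", span=6):
    """True if `context` occurs near `search` in `text`.

    side: "left", "right" or "either"; span: 1 to 6 words, or None to
    accept the context word anywhere in the text passed in.
    """
    tokens = re.findall(r"\w+", text.lower())
    search, context = search.lower(), context.lower()
    if span is None:
        return search in tokens and context in tokens
    if not 1 <= span <= 6:
        raise ValueError("span must be between 1 and 6")
    for i, token in enumerate(tokens):
        if token != search:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        if side in ("left", "either") and context in left:
            return True
        if side in ("right", "either") and context in right:
            return True
    return False


sentence = "The level of carbon dioxide on the planet is rising quickly"
print(in_context(sentence, "planet", "carbon", side="left", span=6))  # True
print(in_context(sentence, "planet", "carbon", side="left", span=3))  # False
```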

Editing & Sorting Output

Citations may be:

inspected, together with the paragraphs in which they occur;
deleted from the list, and deletions may be restored;
classified according to up to 4 user-defined categories;
sorted by left context, right context, search item (where more than one search item is specified), or user-defined category.
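The sorting options can be pictured with a small sketch. The `Citation` record and `sort_citations` function are hypothetical, and sorting the left context outwards from the word nearest the search item is an assumption borrowed from common concordancing practice rather than a documented feature of the program.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    left: str           # words before the search item
    item: str           # the search item as found
    right: str          # words after the search item
    category: str = ""  # optional user-defined category


def sort_citations(citations, by):
    """Sort by "left", "right", "item" or "category"."""
    keys = {
        # Left context is compared starting from the nearest word.
        "left": lambda c: [w.lower() for w in reversed(c.left.split())],
        "right": lambda c: [w.lower() for w in c.right.split()],
        "item": lambda c: c.item.lower(),
        "category": lambda c: c.category,
    }
    return sorted(citations, key=keys[by])


cits = [
    Citation("on the red", "planet", "is cold"),
    Citation("a distant", "planet", "and its moons"),
    Citation("our blue", "planet", "has oceans"),
]
for c in sort_citations(cits, "left"):
    print(c.left, "|", c.item, "|", c.right)
```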

Results of the search may be shown on screen, saved to disc, or sent to a printer as follows:

Search Language only.
Target Language only.
Search Language interleaved with Target Language.
Search Language and Target Language in parallel columns.
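The interleaved and parallel-column layouts can be sketched in a few lines. The function names, the column width and the gap between columns are all assumptions for illustration; the program's real output format is not specified here.

```python
from itertools import zip_longest
from textwrap import wrap


def interleaved(pairs):
    """Each source sentence followed by its target, blank line between pairs."""
    lines = []
    for source, target in pairs:
        lines += [source, target, ""]
    return "\n".join(lines).rstrip()


def parallel_columns(pairs, width=30, gap=4):
    """Source on the left, target on the right, both wrapped to `width`."""
    rows = []
    for source, target in pairs:
        left, right = wrap(source, width), wrap(target, width)
        for left_line, right_line in zip_longest(left, right, fillvalue=""):
            row = left_line.ljust(width) + " " * gap + right_line
            rows.append(row.rstrip())
        rows.append("")
    return "\n".join(rows).rstrip()


pairs = [("The cat sat.", "Le chat s'est assis.")]
print(interleaved(pairs))
print(parallel_columns(pairs, width=20, gap=2))
```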

Tests

The Lingua Multilingual Concordancer for Windows is deliberately a hybrid, in that it combines a parallel concordancer with facilities for the semi-automatic generation by gapping of 'guided translation' materials for testing or teaching. Three principles for gapping are offered:

Search word only gapped.
Interval between words (1-7), with three types of gap:
- Full deletion
- First letter shown
- Half word shown (C-test)
Length of word:
- All words with less than a certain number of letters (tends to identify function words).
- All words with more than a certain number of letters (tends to identify content words).
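The gapping principles listed above can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: the function names are hypothetical, gaps are shown as underscores, and for the half-word gap the kept half is rounded down.

```python
import re


def gap_word(word, style):
    """Replace part of a word with underscores."""
    if style == "full":
        keep = 0
    elif style == "first":
        keep = 1
    elif style == "half":
        keep = len(word) // 2  # C-test style: keep the first half
    else:
        raise ValueError("style must be 'full', 'first' or 'half'")
    return word[:keep] + "_" * (len(word) - keep)


def gap_search_word(text, word, style="full"):
    """Gap only the occurrences of the search word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, lambda m: gap_word(m.group(), style), text,
                  flags=re.IGNORECASE)


def gap_by_interval(text, interval, style="full"):
    """Gap every `interval`-th word (interval between 1 and 7)."""
    if not 1 <= interval <= 7:
        raise ValueError("interval must be between 1 and 7")
    count = 0

    def replace(match):
        nonlocal count
        count += 1
        word = match.group()
        return gap_word(word, style) if count % interval == 0 else word

    return re.sub(r"\w+", replace, text)


def gap_by_length(text, limit, shorter=True, style="full"):
    """Gap words with fewer (or, if shorter=False, more) than `limit` letters."""
    def replace(match):
        word = match.group()
        hit = len(word) < limit if shorter else len(word) > limit
        return gap_word(word, style) if hit else word

    return re.sub(r"\w+", replace, text)


text = "The white rabbit ran across the field"
print(gap_by_interval(text, 2, "half"))  # The wh___ rabbit r__ across t__ field
print(gap_by_length(text, 4))            # ___ white rabbit ___ across ___ field
```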

On-screen tests offer a limited degree of interactivity in that the user can click on Answers to check his/her guesses - a second click then restores the gaps.

The program saves parallel-column files in a format that can be read by Word for Windows.

Own Data

In developing the Windows version of the Multilingual Concordancer, we have kept in mind the goal that users should be able to add their own pairs of texts to the corpus, using simple and easily-learned mark-up conventions based on SGML (Standard Generalized Markup Language).

To mark up your own data you need to:

Find a pair of texts
Make sure you have the same number of paragraphs in both texts
Mark the top of the text with <body>
Mark the end of the text with </body>
Mark the start of each paragraph with <p>
Mark the start of each sentence with <s>
Give each text the same name with an extension to show the language.

Below is an example of marked-up Danish text:

<body>

<s>Alice i Eventyrland og Bag spejlet

<s>LEWIS CARROLL

<s>KAPITEL I

<s>Alice dumper ned til kaninen

<s>Hvor det dog kedede Alice at sidde sammen med søsteren dernede ved søen - uden at have noget at tage sig til! <s>Et par gange havde hun kigget i den bog, søsteren var i færd med at læse, men der var ingen billeder i den og heller ingen samtaler... <s>"Og hvad fornøjelse har man af en bog uden billeder, og hvor personerne ikke snakker med hinanden?" tænkte Alice.

<s>Man kunne jo give sig til at binde kranse, men var det så morsomt, at det var umagen værd at rejse sig op og plukke bellis? <s>Mens Alice således sad og tænkte frem og tilbage (så godt hun kunne på grund af varmen, der gjorde hende sløv og søvnig) - opdagede hun pludselig en hvid kanin med røde øjne. <s>Den løb lige forbi hende...

<s>Det var ikke særlig mærkværdigt, og Alice syntes heller ikke, det var særlig overraskende, at kaninen sagde til sig selv: "Ih, du glade verden! <s>Jeg kommer for sent!" <s>- Da hun tænkte på det bagefter, syntes hun ganske vist, at hun burde være blevet forbavset, men lige i øjeblikket betragtede hun det som noget meget naturligt, at kaninen kunne tale. <s>- Men da den også tog et ur op af vestelommen, kiggede på det og skyndte sig af sted, sprang Alice op, for nu slog det hende, at hun aldrig nogen sinde havde set en kanin, der både havde vest på og et ur, den kunne tage op af lommen. <s>Alice blev vældig nysgerrig og løb bag efter kaninen, tværs over marken. <s>Hun nåede lige netop at se den smutte ned i et stort hul, der var under hækken.

<s>I næste øjeblik fulgte Alice efter den uden at tænke på, hvordan i alverden hun skulle komme ud igen. </body>
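A marked-up pair can be checked before it is added to the corpus. The sketch below reads a text into paragraphs and sentences and tests that two texts have the same number of paragraphs. The function names are hypothetical, and the paragraph tag is assumed to be `<p>`, which the example above does not show; it is passed as a parameter so that another tag can be substituted.

```python
import re


def parse_marked_up(text, para_tag="<p>"):
    """Return a list of paragraphs, each a list of sentences."""
    match = re.search(r"<body>(.*?)</body>", text, re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("text must be enclosed in <body> ... </body>")
    paragraphs = []
    for chunk in match.group(1).split(para_tag):
        # Each <s> starts a sentence; collapse line breaks and extra spaces.
        sentences = [" ".join(s.split()) for s in chunk.split("<s>")]
        sentences = [s for s in sentences if s]
        if sentences:
            paragraphs.append(sentences)
    return paragraphs


def same_paragraph_count(text_a, text_b, para_tag="<p>"):
    """Check the requirement that both texts have equally many paragraphs."""
    return (len(parse_marked_up(text_a, para_tag))
            == len(parse_marked_up(text_b, para_tag)))


danish = "<body><p><s>LEWIS CARROLL <p><s>KAPITEL I <s>Jeg kommer for sent!</body>"
english = "<body><p><s>LEWIS CARROLL <p><s>CHAPTER I <s>I shall be late!</body>"
print(parse_marked_up(danish))
print(same_paragraph_count(danish, english))  # True
```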

Distribution

The program is available from CFL Software Development, price £40 to educational establishments.

Downloadable parallel texts without restrictions on distribution are available without extra charge from the Parallel Texts Library.

Last updated 24 Jan 2002.
