User guide

gramirez-prompsit edited this page Sep 9, 2021 · 11 revisions

Welcome to Corset

Corset is a web-based data selection portal that helps you get relevant data from massive amounts of parallel data. If you don't need the whole corpus but just a suitable subset (indeed, a cor(pus sub)set), this is what Corset will do for you, and it is also where the tool's name comes from.

Here are some highlights of what you will find in Corset:

Millions of parallel sentences to explore

  • Dive into parallel corpora performing searches at the speed of light.
  • Search in either source or target sides of corpora.
  • Keep track of your preferred searches and their details.

Tailored corpora (corsets) from big corpora

  • Get smaller and custom corpora that fit your sample text.
  • Set up the details of your corset (name, topic, languages, size) and launch your search over millions of parallel sentences.

Monitor and download corsets

  • See the status of your corsets, preview them, download them, remove or share them!
  • Take a look at shared corsets to see if they are already tailored to your needs.

Getting started

This is the homescreen, where you can find information about the Corset project:

By clicking "Log in", you get into Corset. A Google account is needed for this.

Into Corset

This is what you see when you log into Corset:

Once logged in, you will land in Corset's Explore corpora area. From the tabs at the top of this page, you can navigate to any other place in Corset:

  • Explore corpora (you are already here!)
  • Create corpora
  • Get corpora
  • Admin panel (only if you are an administrator)

Let's take a look at the sidebar on the left:

From top to bottom, in this sidebar you have:

  • 1: Your logged user area. Click on the cogwheel icon to see your user information (your name and email), and manage your saved searches. Click on the "power down" icon to log out from Corset.

  • 2: Your last corsets. These are the corsets (remember, corpora subsets) generated after uploading a sample corpus, that is, a text that you upload in order to get similar sentences from a broader corpus (we'll get back to this later!). In this area, you can see if your corsets are already built ("Ready") or still being processed ("Working"). By clicking on any of your "Ready" corsets, you'll be taken to its Preview page, where you can search in an excerpt of the corset or download it.

  • 3: Your History. This area displays your latest searches, but only those performed in the big corpora: searches performed in corsets are not stored. If you want to remove a search from your History, just hover over the entry and you'll see an X-shaped icon. Click on it, and the search will be gone from your History.

  • 4: A list of Top Corsets. The most popular corsets are displayed in this area, so you can easily find and discover them. Who knows, maybe another user has already built a corset that perfectly matches your needs! Just click on it, and you'll see a preview of it. And if you like it, you can download it right away.

Explore corpora

This is where you can search for words or phrases in the vast corpora available in Corset, such as Paracrawl. The Explore Corpora interface looks like this:

  • 1: First of all, enter your search term(s) in the text box.
  • 2: Select the language pair.
  • 3: Choose the Corpus in which you'd like to search for the term(s).
  • 4: Click on "Source" or "Target" to indicate which side of the corpus the search should be performed on.
  • 5: Finally, click on the magnifying glass icon to get your results.
  • 6: There they are, the results for your search! We display how many results were found in the selected Corpus, and we also put a highlight on your search term, so it's super easy for you to find it in the sentences.

So, for example, the image above depicts a search for the term "photographic equipment" in the source language of the English-Spanish corpus "Paracrawl EN-ES", which returned 166 results.

Create corpora

Corset is not only designed for searching terms or words in corpora: you can also upload your own sample text and get sentences that are similar to the ones you uploaded. Let's see how to do so:

  • 1: First, select the sample text that you want to base your corset on. We call this kind of corpus a sample or query corpus. It can be monolingual (in either the source or the target language you are aiming at) or bilingual (either source-target or target-source), and stored in TXT, TSV (tab-separated values) or TMX (translation memory) format. In a TXT or TSV file, each line must contain a single sentence (if the file is monolingual) or, for a bilingual TSV, two sentences (source and target) separated by a tab. Please note that if you upload more than 10,000 sentences, only the first 10,000 will be taken into account. This corpus will be removed from our systems as soon as we are done processing it, so it won't be shared with anyone (not even with you!).

  • 2: Set a name for your custom corset.

  • 3: Select the topic that best suits the sample text that you uploaded. It can be technical, legal, financial...

  • 4: Select the size of the corset: small (10,000 sentences), medium (100,000) or large (1 million).

  • 5: Select the source language...

  • 6: ... and the target language.

  • 7: Choose an available corpus for the languages you set.

  • 8: Select a format for your corset: TSV for a tab-separated file, or TMX for a translation memory.

Then, click "Upload and generate", and the search for similar sentences to make up your corset will be launched!

In the example depicted in the image above, we upload a corpus called "climbing.txt" in order to generate a small-sized corset called "Mountain sports and climbing", an English-Spanish one based on the Paracrawl EN-ES corpus. Since there is no topic for "sports", we labelled it with the "Other" topic. The generated corset will be in TSV format.
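Before uploading, it can help to check that a sample file follows the layout described in step 1. The following is a minimal Python sketch and is not part of Corset: the function name `prepare_sample` and the constant `MAX_SENTENCES` are hypothetical. It only checks that each non-empty line holds one sentence (monolingual) or two tab-separated sentences (bilingual), and it keeps at most the first 10,000 lines, mirroring the limit mentioned above.

```python
from typing import Iterable, List

MAX_SENTENCES = 10_000  # Corset only uses the first 10,000 sentences of a sample


def prepare_sample(lines: Iterable[str], bilingual: bool) -> List[str]:
    """Return cleaned sample lines, ready to be written to a TXT or TSV file.

    Hypothetical helper: blank lines are skipped, and a line whose number of
    tab-separated fields does not match the expected layout raises ValueError.
    """
    expected_fields = 2 if bilingual else 1
    cleaned: List[str] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != expected_fields or not all(fields):
            raise ValueError(f"line {number}: expected {expected_fields} field(s)")
        cleaned.append("\t".join(fields))
        if len(cleaned) == MAX_SENTENCES:
            break  # anything beyond the limit would be ignored anyway
    return cleaned
```

For the example above, one could pass the lines of "climbing.txt" with `bilingual=False` and write the returned lines back to a file before uploading it.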

Get corpora

In the "Get corpora" page you can see two tables: "Your corsets" and "Shared corsets".

In "Your corsets", you can see all your corset requests, with their status (Working, Ready...), their metadata, etc. You can download, preview, share/hide or remove each corset. When a corset is shared, other users can see it and download it. If it's not shared, you are the only one who can view it. You can share and unshare a corset at any time, as many times as you want.

In the table below, "Shared corsets", you can see the same information for corsets generated and shared by other users. You can preview them, search them and, of course, download them.
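Once a corset has been downloaded in TSV format, it can be loaded with a few lines of Python. This is only a sketch under an assumption that this guide does not confirm: that the first two tab-separated columns of each line are the source and target sentences. The function name `read_corset_tsv` is hypothetical.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_corset_tsv(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from an open TSV file.

    Assumption: the first two tab-separated columns are the source and target
    sentences; extra columns, if any, are ignored, and short rows are skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:
            yield row[0], row[1]


# Small in-memory example standing in for a downloaded file
sample = io.StringIO("Hello\tHola\nGood morning\tBuenos días\n")
pairs = list(read_corset_tsv(sample))
```

With a real download, the in-memory sample would be replaced by a file opened with `open(path, encoding="utf-8", newline="")`.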

Admin panel

In case you are an admin of Corset (congratulations!) you will be able to see the following section:

On this page, you can see the status of the machine Corset is running on: for example, CPU load, memory usage and disk usage.

You can also see a list of jobs (running or ready), and a form to add new base corpora. When adding new base corpora, remember that this form ONLY registers the corpus in the Corset system. You need to add the collection core to Solr by hand (see the Technical Documentation for further information). Make sure the "Solr collection" matches exactly the name the collection has in your Solr.

Using corsets

Now, what to do with your corsets? Just a couple of ideas:

  1. Get a translation memory: if you need to translate a file and do not have a translation memory, you can use Corset to generate a small one (the "small" size already amounts to 10k sentences). Once you have your corset, you can upload it to your preferred CAT tool, provided that it accepts files of 10k sentences in either TMX or TSV format.

  2. Enhance an MT system: if you need up to 1M parallel sentences that are "similar" to a sample corpus you may have at hand, use Corset to generate them and add them to your training, validation or test corpora. Remember that Corset can generate very small files of 10k sentences (maybe useful as test sets?), medium-sized (100k) or up to 1M parallel sentences.

In our own trial, we trained an English-to-Italian system with the ECB corpus and TED talks (up to 500k sentences mixing both), using a subset of ECB as validation and test sets. Then we replaced the TED corpus with sentences from a corset generated from an ECB sample file, keeping the same validation and test sets and all the same settings as in the previous experiment. The result? We improved on the ECB TED system by about 1 absolute BLEU point, going from 53.62 to 54.61. Any visible results? Yes: some vocabulary improvements, fewer omissions and better output in general:

Source: Employment, conduct, fraud prevention and transparency.

  • Output 1 (ECB TED): L'ocupazzione, conduzione, prevenzione delle frodi e transparenza.
  • Output 2 (ECB ECB-based corset): Occupazione, condotta, prevenzione delle frodi e transparenza.

Source: Please note, however, that a visit date cannot be guaranteed, even if very early notification is given.

  • Output 1 (ECB TED): La nota, tuttavia, che la data di visita non può essere garantita, anche se una notifica molto presto.
  • Output 2 (ECB ECB-based corset): Si prega di notare, tuttavia, che una data di visita non può essere garantita, anche se la notifica è data molto presto.

We look forward to hearing about your own experience soon!
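If you want to follow the second idea and carve training, validation and test sets out of a single corset, a reproducible shuffle-and-slice is usually enough. The sketch below is an illustration and not part of Corset; the function name `split_corset` and its parameters are hypothetical, and it assumes the corset has already been loaded as a list of (source, target) pairs.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_corset(pairs: Sequence[Pair], valid_size: int, test_size: int,
                 seed: int = 0) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle pairs reproducibly and return (train, valid, test) lists."""
    if valid_size < 0 or test_size < 0 or valid_size + test_size > len(pairs):
        raise ValueError("held-out sizes must be non-negative and fit in the corset")
    shuffled = list(pairs)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split each run
    valid = shuffled[:valid_size]
    test = shuffled[valid_size:valid_size + test_size]
    train = shuffled[valid_size + test_size:]
    return train, valid, test
```

Note that in the experiment described above, the validation and test sets came from ECB itself rather than from the corset; splitting the corset is just one possible setup.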

Glossary

  • Base Corpus

Large corpus that is available to all users to be searched or queried. Currently, in corset.paracrawl.eu we provide Paracrawl EN-ES, EN-DE, EN-FR, EN-IT, EN-NL and EN-PT.

  • Query Corpus/Sample Corpus

Small corpora uploaded by users. We extract information from them, in order to select related sentences from base corpora. Don't worry: we'll remove them from our systems immediately after processing them, so they won't be shared with anyone (not even with you!)

  • Custom Corpus (aka "corset")

Corsets are the custom corpora we build for you, based on a query corpus you upload, with sentences from a base corpus. You can preview them, search in them or download them. You can choose to keep them just for you, or mark them as "shared" so anyone in Corset can discover them. Of course, you can also remove them, but make sure you want to: you won't be able to recover a deleted corset!

Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.