
Mar 15

Segmentation rules for text with references

If you translate text with numbered references, you have probably run into incorrect segmentation when the reference number is placed directly after the sentence-final period, like this:

This is an example sentence.² With a second sentence added.

In most cases, CAT tools will import the two sentences above as a single segment, which then has to be split manually to get correct TM leverage. Fortunately, there is a simple solution to this problem.
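
To illustrate why this happens, here is a minimal Python sketch. It assumes a simplified sentence-break rule (period, whitespace, uppercase letter); this is a stand-in chosen for illustration, not the actual default rule of memoQ, Studio, or any other CAT tool. The names `NAIVE_BREAK` and `naive_segments` are hypothetical.

```python
import re

# Assumption: a simplified stand-in for a default rule, not the real rule
# shipped with any CAT tool. It breaks at whitespace that directly follows
# a period and directly precedes an uppercase ASCII letter.
NAIVE_BREAK = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def naive_segments(text):
    """Split text wherever the simplified rule finds a sentence boundary."""
    return NAIVE_BREAK.split(text)


plain = "This is an example sentence. With a second sentence added."
with_ref = "This is an example sentence.2 With a second sentence added."

print(naive_segments(plain))     # two segments
print(naive_segments(with_ref))  # one segment: the "2" hides the boundary
```

With the reference number sitting between the period and the space, the simplified rule no longer sees "period followed by whitespace", so the text stays in one piece.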

A frustrated friend recently posted about this problem on Twitter. When I asked for details, she sent me a sample file with lots and lots of references and asked whether it was possible to create segmentation rules that would segment such text correctly. It turned out to be quite a simple matter, and since the problem may be widespread, I thought I should make the rule public.

The rule properly segments text in the following cases:

  • word.1 Next sentence.
  • word.1,2 Next sentence
  • word.1-3 Next sentence
  • word.”1 Next sentence (for two quotation mark variants: ” and ")
  • word.)1 Next sentence (for “)” and “]” )

Below you’ll find instructions on how to add this segmentation rule in memoQ and Trados Studio.

I. memoQ

In memoQ, segmentation rules are separate resources. There is always a default rule set for a given source language, but you can create your own rule sets and use them only in particular projects. The procedure below describes how to copy the default rules for English, add the new rule, and make the modified rule set the default for new projects in memoQ 2014 R2.

  1. Open memoQ.
  2. Click the Resource console icon.
  3. Click the Segmentation rules category.
  4. From the Language drop-down, select English (or an English variant, or any other source language you need this rule for).
  5. Click the Create new command in the lower part of the window.
  6. Enter a name for the new rule set, e.g. Default with references. The new rule set will contain a copy of the default settings.
  7. Select the new rule set and click Edit.
  8. Click Advanced view in the lower left corner of the dialog (if your memoQ is older than 2014 R2, simply skip this step).
  9. In the editing field below the Rules pane, paste the following regular expression:
     \p{Ll}\.[\)\]”"]?\d+([-–,]\d+)?#!#[\s]+\p{Lu}
  10. Click Add.
  11. Click OK, then Close to close the Resource console.
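
If you want to try the expression outside memoQ first, the sketch below emulates it in Python. It rests on two assumptions: that `#!#` marks the break position between a "before" and an "after" part, and that `[a-z]` and `[A-Z]` are acceptable ASCII-only replacements for `\p{Ll}` and `\p{Lu}`, which Python's built-in `re` module does not support. The helper names `compile_rule` and `segment` are made up for this example and are not part of memoQ.

```python
import re

# ASCII-only approximation of the rule above: [a-z] and [A-Z] stand in for
# \p{Ll} and \p{Lu} (an assumption that only holds for unaccented text).
MEMOQ_STYLE_RULE = r'[a-z]\.[\)\]”"]?\d+([-–,]\d+)?#!#[\s]+[A-Z]'


def compile_rule(rule):
    """Turn 'before#!#after' into a regex that matches up to the break."""
    before, after = rule.split("#!#")
    # The "after" part becomes a lookahead so it is checked but not consumed.
    return re.compile(f"(?:{before})(?={after})")


def segment(text, rule=MEMOQ_STYLE_RULE):
    """Cut text at the end of every match of the 'before' part."""
    pattern = compile_rule(rule)
    segments, start = [], 0
    for match in pattern.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


samples = [
    "A word.1 Next sentence.",
    "A word.1,2 Next sentence.",
    "A word.1-3 Next sentence.",
    "A word.”1 Next sentence.",
    "A word.)1 Next sentence.",
]
for sample in samples:
    print(segment(sample))
```

Each sample is cut right after the reference number, while something like "Version 2.5 is out." stays in one piece, because the rule requires a lowercase letter before the period.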

Segmentation rules in memoQ

To make the modified rule default for new projects:

  1. Click the Options icon (in older memoQ versions: Tools > Options) to display the Options dialog.
  2. Under Default resources, in the Segmentation rules category, use the Language drop-down to select English (or whichever language you’ve edited).
  3. Select the check box next to the newly created segmentation rule set (e.g. Default with references).
  4. Click OK.

New projects will now use these segmentation rules for the selected source language.

II. Trados Studio

In Studio, segmentation rules are stored within translation memories, so you need to edit every TM you want to add this rule to. (The editing procedure is copied from the SDL Knowledge Base.)

  1. Open SDL Trados Studio.
  2. Go to Translation Memory View.
  3. Click on File > Open > Open Translation Memory to add the translation memory you wish to change to the list of translation memories.
  4. Right-click on the listed translation memory and select Settings from the menu; this opens the Translation Memory Settings dialog box.
  5. Click on Language Resources.
  6. On the right-hand side highlight Segmentation Rules and press Edit.
  7. In the Segmentation Rules dialog box, click Add to open the Add Segmentation Rule dialog box.
  8. Enter a description, for example Literature references, in the Description field.
  9. Click Advanced View.
  10. In the Before break field, replace the existing regular expression with the following:
    \p{Ll}\.[\)\]”"]?\d+([-–,]\d+)?
  11. In the After break field, replace the existing regular expression with the following:
    \s
  12. Click OK several times to confirm the changes until you are back in the Translation Memory View.
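
The two fields can be tried out in Python as well. The sketch below assumes that Studio treats Before break and After break as ordinary regular expressions matched on either side of the break position, and it again substitutes `[a-z]` for `\p{Ll}`, because Python's `re` module has no Unicode category classes. The functions `break_positions` and `split_at` are hypothetical helpers, not Studio functionality.

```python
import re

# ASCII-only approximation: [a-z] replaces \p{Ll}, which Python's "re" lacks.
BEFORE_BREAK = r'[a-z]\.[\)\]”"]?\d+([-–,]\d+)?'
AFTER_BREAK = r'\s'


def break_positions(text, before=BEFORE_BREAK, after=AFTER_BREAK):
    """Return the offsets where 'before' ends and 'after' begins."""
    pattern = re.compile(f"(?:{before})(?={after})")
    return [match.end() for match in pattern.finditer(text)]


def split_at(text, positions):
    """Cut text at the given offsets and trim surrounding whitespace."""
    bounds = [0, *positions, len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


reference = "This was shown earlier.1,2 Later work agrees."
lowercase_next = "See fig.3 for details."

print(split_at(reference, break_positions(reference)))
print(split_at(lowercase_next, break_positions(lowercase_next)))
```

Under these assumptions the second sample is also split after "fig.3", because the After break pattern only asks for whitespace, whereas the memoQ expression additionally requires an uppercase letter. If that turns out to be a problem in your files, a stricter After break pattern may be worth testing.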

Segmentation rules in Studio

Please note that while editing the rule in Studio I encountered a peculiar bug: files were segmented correctly only on the second try. If you add the rule to your TM and a file is not segmented correctly, delete the file and import it again. It should work the second time.

 

If these rules do not cover your needs, e.g. you need rules for special cases, please contact me; I may be able to help.

13 comments

  1. Kaori Myatt

    Great tip!

  2. Rajanikanth

    How can I split segments at spaces in Arabic?

    1. Wasaty

      Send me a sample of the text you want to segment: just a few lines as they are now, and the same lines manually segmented the way you want. I will try to give you a solution.

      1. Rajanikanth

        Thanks for the response,

        Tiền và các khoản tương đương tiền

        We need a segmentation rule for sentences like this one, i.e., it should split after every two words
        ex:
        Tiền và
        các khoản
        tương đương
        tiền

        As shown in the example,

        please give us a segmentation rule

        Rajanikanth

        1. Wasaty

          This rule will segment text every two “words” regardless of any punctuation marks (it’s for memoQ):

          [\p{L}\p{P}\p{N}]+\s[\p{L}\p{P}\p{N}]+#!#\s

          If you need something more fancy, contact me by email, we can talk.

          Marek

          1. Rajanikanth

            Thank You so much for the code provided, it is working fine

  3. Rajanikanth

    Hi,

    The segmentation rule above is working fine, but we are facing an issue while splitting.

    This is the actual sentence – TỔNG CÔNG TY KHÍ VIỆT NAM – and for this it works fine, but

    if we have a sentence like –

    1. TỔNG CÔNG TY KHÍ VIỆT NAM – it is splitting like below

    1.

    TỔNG

    CÔNG TY

    KHÍ VIỆT

    NAM

    but I want to split the segment like this.

    1.

    TỔNG CÔNG

    TY KHÍ

    VIỆT NAM

    Please help me with this type of segmentation rule.

    1. Wasaty

      I’m sorry, Rajanikanth, but you’ve reached the quota of free work I’m willing to offer. As I’ve written previously, if you need more specific rules, contact me via email and I’ll be happy to provide segmentation rules for a modest fee.

      1. Rajanikanth

        your mail id please

  4. Rajanikanth

    Please delete all of my comments from Blog

  5. Ciaran

    Hi,

    I have been looking for a way to segment Word files in German with footnotes in memoQ so that the program does not segment before footnote tags located at the end of sentences (i.e. outside the final punctuation mark). With the default segmentation rules for German, these tags appear at the beginning of the following segment, even though they “belong” to the previous segment. In files with lots of footnotes, I find this very confusing and have even begun to manually resegment texts so that the footnote tags are where they belong.

    As I understand your post, this is what your segmentation rule is supposed to achieve. However, it is not working for me: I added the rule following your instructions above, but it makes no difference to how texts are segmented (i.e. memoQ still splits segments before a footnote tag at the end of a sentence).

    If you have any suggestions about what I might be doing wrong, or if you could modify your rule to cover my situation, I would be very grateful.

    1. Wasaty

      Hi,

      The rules I’ve put in the post won’t work for tags, only for actual numbers. I’m not even sure it’s possible to account for tags in segmentation. I can test it, but I would need an example file.

  6. Ciaran

    Thanks for this clarification. That does not sound too hopeful. The memoQ support people were not able to help me with this problem either. I don’t know how to send a sample file through this website. But any Word file with “live” footnotes or endnotes, where the footnote numbers appear at the end of the sentence outside the final punctuation (i.e. the full stop), will do. This is the usual position of end-of-sentence footnotes in English and German. In French, by contrast, the footnotes are placed inside the final punctuation, so this problem does not arise with French Word files.
