{"id":829,"date":"2015-03-15T12:18:59","date_gmt":"2015-03-15T11:18:59","guid":{"rendered":"http:\/\/wasaty.pl\/blog\/?p=829"},"modified":"2015-03-15T12:19:15","modified_gmt":"2015-03-15T11:19:15","slug":"segmentation-rules-for-references","status":"publish","type":"post","link":"http:\/\/wasaty.pl\/blog\/2015\/03\/15\/segmentation-rules-for-references\/","title":{"rendered":"Segmentation rules for text with references"},"content":{"rendered":"<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-full wp-image-833\" src=\"http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation.png\" alt=\"Segmentation\" width=\"80\" height=\"80\" \/>If you translate text with numbered references, you&#8217;ve probably encountered a problem with incorrect segmentation when the reference number is placed right after the sentence-final period, like this:<\/p>\n<pre>This is an example sentence.\u00b2\u00a0With a second sentence added.<\/pre>\n<p>In most cases the above sentences will be imported into CAT tools as a single segment, which then requires manual splitting to achieve correct TM leverage. Fortunately, there is a simple solution to this problem.<\/p>\n<p><!--more--><\/p>\n<p>A frustrated friend recently posted about this problem on Twitter, and when I asked for details, she sent me a sample file with lots and lots of references, asking whether it was possible to create segmentation rules that would segment such text correctly. 
It turned out to be quite a simple matter, and since the problem may be widespread, I thought I should make the rule public.<\/p>\n<p>The rule properly segments text in the following cases:<\/p>\n<ul>\n<li>word.<sup>1<\/sup> Next sentence.<\/li>\n<li>word.<sup>1,2<\/sup> Next\u00a0sentence.<\/li>\n<li>word.<sup>1-3<\/sup> Next sentence.<\/li>\n<li>word.&#8221;<sup>1<\/sup> Next sentence. (for two quotation mark variants:\u00a0\u201d and &quot;)<\/li>\n<li>word.)<sup>1<\/sup> Next sentence. (for &#8220;)&#8221; and &#8220;]&#8221;)<\/li>\n<\/ul>\n<p>Below you&#8217;ll find instructions on how to add the segmentation rule in memoQ and Trados Studio.<\/p>\n<h2>I. memoQ<\/h2>\n<p>In memoQ, segmentation rules are separate resources. There are always default segmentation rules for a given source language, but you can create your own rule sets and use them only in particular projects. The procedure below describes how to copy the default rules for English, add the new rule, and make the resulting rule set the default for new projects in memoQ 2014 R2.<\/p>\n<ol>\n<li>Open memoQ.<\/li>\n<li>Click the\u00a0<strong>Resource console<\/strong> icon.<\/li>\n<li>Click the\u00a0<strong>Segmentation rules<\/strong> category.<\/li>\n<li>From the\u00a0<strong>Language<\/strong> drop-down, select\u00a0<strong>English<\/strong> (or an English variant, or any other source language you need this rule for).<\/li>\n<li>Click the\u00a0<strong>Create new<\/strong> command in the lower part of the window.<\/li>\n<li>Enter a name for the new rule set, e.g.\u00a0<em>Default with references<\/em>. 
The new rule set will contain a copy of the default settings.<\/li>\n<li>Select the new rule set and click\u00a0<strong>Edit<\/strong>.<\/li>\n<li>Click\u00a0<strong>Advanced view<\/strong> in the lower left corner of the dialog (if your memoQ is older than 2014 R2, simply skip this step).<\/li>\n<li>In the editing field below the\u00a0<strong>Rules<\/strong> pane, paste the following regular expression:\n<pre>\\p{Ll}\\.[\\)\\]\u201d\"]?\\d+([-\u2013,]\\d+)?#!#[\\s]+\\p{Lu}<\/pre>\n<\/li>\n<li>Click\u00a0<strong>Add<\/strong>.<\/li>\n<li>Click\u00a0<strong>OK<\/strong>, then\u00a0<strong>Close<\/strong> to close the resource console.<\/li>\n<\/ol>\n<p><a href=\"http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_memoQ.png\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-831\" src=\"http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_memoQ-300x171.png\" alt=\"Segmentation rules in memoQ\" width=\"300\" height=\"171\" srcset=\"http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_memoQ-300x171.png 300w, http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_memoQ-1024x585.png 1024w, http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_memoQ-900x514.png 900w, http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_memoQ.png 1112w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>To make the modified rule set the default for new projects:<\/p>\n<ol>\n<li>Click the\u00a0<strong>Options<\/strong> icon (for older memoQ versions:\u00a0<strong>Tools &gt; Options<\/strong>) to display the\u00a0<strong>Options<\/strong> dialog.<\/li>\n<li>In the\u00a0<strong>Default resources<\/strong>\u00a0field, under the\u00a0<strong>Segmentation rules\u00a0<\/strong>category, use the\u00a0<strong>Language<\/strong> drop-down to select\u00a0<strong>English<\/strong> (or any other language you&#8217;ve edited).<\/li>\n<li>Check the option field next to the newly created segmentation rule set 
(e.g.\u00a0<em>Default with references<\/em>).<\/li>\n<li>Click\u00a0<strong>OK<\/strong>.<\/li>\n<\/ol>\n<p>New projects will use these segmentation rules for the selected source language.<\/p>\n<h2>II. Trados Studio<\/h2>\n<p>In Studio, segmentation rules are stored within translation memories, so you need to edit every TM you wish to add this rule to.\u00a0(The editing procedure is copied from the SDL Knowledge Base.)<\/p>\n<ol>\n<li>Open SDL Trados Studio.<\/li>\n<li>Go to <strong>Translation Memory View<\/strong>.<\/li>\n<li>Click <strong>File<\/strong> &gt; <strong>Open<\/strong> &gt; <strong>Open Translation Memory<\/strong> to add the translation memory you wish to change to the list of translation memories.<\/li>\n<li>Right-click the listed translation memory and select <strong>Settings<\/strong> from the menu; this opens the <strong>Translation Memory Settings<\/strong> dialog box.<\/li>\n<li>Click <strong>Language Resources<\/strong>.<\/li>\n<li>On the right-hand side, highlight <strong>Segmentation Rules<\/strong> and press <strong>Edit<\/strong>.<\/li>\n<li>In the <strong>Segmentation Rules<\/strong> dialog box, click <strong>Add<\/strong> to open the <strong>Add Segmentation Rule<\/strong> dialog box.<\/li>\n<li>Enter a description, for example <em>Literature references<\/em>,\u00a0in the <strong>Description<\/strong> field.<\/li>\n<li>Click <strong>Advanced View<\/strong>.<\/li>\n<li>In the <strong>Before break<\/strong> field, replace the existing regular expression with the following:\n<pre>\\p{Ll}\\.[\\)\\]\u201d\"]?\\d+([-\u2013,]\\d+)?<\/pre>\n<\/li>\n<li>In the <strong>After break<\/strong> field, replace the existing regular expression with the following:\n<pre><code>\\s<\/code><\/pre>\n<\/li>\n<li>Click <strong>OK<\/strong> several times to confirm the changes until you are back in the <strong>Translation Memory View<\/strong>.<\/li>\n<\/ol>\n<p><a href=\"http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_Studio.png\"><img decoding=\"async\" 
loading=\"lazy\" class=\"aligncenter size-medium wp-image-832\" src=\"http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_Studio-300x169.png\" alt=\"Segmentation rules in Studio\" width=\"300\" height=\"169\" srcset=\"http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_Studio-300x169.png 300w, http:\/\/wasaty.pl\/blog\/wp-content\/uploads\/2015\/03\/Segmentation_Studio.png 815w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Please note that while editing the rule in Studio\u00a0I encountered a peculiar bug: files were segmented correctly only on the second try. If you add the rule to your TM and a file is still not segmented correctly, delete the file and import it again. It should work on the second try.<\/p>\n<p>If these rules are not enough for you, e.g. you need rules for some special cases, please contact me; I may be able to help.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you translate text with numbered references, you&#8217;ve probably encountered a problem with incorrect segmentation when the reference number is placed right after the sentence-final period, like this: This is an example sentence.\u00b2\u00a0With a second sentence added. 
In most cases the above sentences will be imported to CAT tools as a single &hellip; <\/p>\n<p><a class=\"more-link btn\" href=\"http:\/\/wasaty.pl\/blog\/2015\/03\/15\/segmentation-rules-for-references\/\">Continue reading<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3,6],"tags":[20,45,46,31,50],"_links":{"self":[{"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/posts\/829"}],"collection":[{"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/comments?post=829"}],"version-history":[{"count":4,"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/posts\/829\/revisions"}],"predecessor-version":[{"id":836,"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/posts\/829\/revisions\/836"}],"wp:attachment":[{"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/media?parent=829"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/categories?post=829"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/wasaty.pl\/blog\/wp-json\/wp\/v2\/tags?post=829"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}