Segmentation of numbers

M --
 

Hello all,

I have a problem with the segmentation of numbers. I have something like this:

24<segment 01>
Hour National Crisis Line<segment 02>


When in fact, it should be:
24 Hour National Crisis Line<segment 01>

There is no line break after the number 24, and all hidden tags (if any) have been cleaned with "Document Cleaner" in TransTools.

I haven't found a way to make it work using the Segmentation Setup within OmegaT. Any ideas? I am not a regex expert.

It does work well if I change "Hour" to "hour". Then I have the correct segmentation: 24 hour National Crisis Line<segment 01>
However, I would prefer not to touch the original file to make changes like this.

I don't have this problem when numbers are in the middle of a sentence, only at the beginning.

My OS is Windows 10 and I work with OmegaT 5.2.

Thanks a lot!

Miguel

Jean-Christophe Helary
 

Hello Miguel,

The easiest way to deal with such rogue segments is to go to your original file, make sure that the thing between "24" and "Hour National Crisis Line" is really a space (erase the current space and add a new "standard" one) and reload.

If that still doesn't work you can start to worry about rogue segmentation rules :)

Jean-Christophe Helary
-----------------------------------------------
http://mac4translators.blogspot.com @brandelune


Dmitri Gabinski
 

1. Show the file. 
2. Show your segmentation.conf. 

Samuel Murray
 

On 22/04/2020 23:52, M -- via groups.io wrote:

> I have a problem with the segmentation of numbers. I have something like this:
> 24<segment 01>
> Hour National Crisis Line<segment 02>

I tested this on a plain text file, and I confirm that this happens.

My guess is that the default segmentation rules assume that a number at the start of a line, followed by a capital letter, is meant to be a line number or a heading number. I tried to google for the meaning of the four general rules in OmegaT, but I was unable to find a sufficiently comprehensive guide... and the link to the Java documentation in the OmegaT user manual is dead.

What you can do, is add a new rule above all other rules.

- In OmegaT, go to Options > Segmentations.
- In the top part of the dialog, click Add. This will add a new set of rules, usually called "New Language and Country" and "LN-CO". Rename this to e.g. "Fixes" and ".*".
- Select the Fixes rule set and then click Move Up until the rule set is at the very top of the list.
- Then, while Fixes is selected, in the bottom part of the dialog, click "Add". This will add a new segmentation rule. Edit that rule as follows:

Break/Exception: unticked
Pattern Before: [0-9]
Pattern After: \s

This rule means that (apart from certain exceptions) no segment will ever break between a number and a space. I'm not sure if this will affect numbers followed by tabs.
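A rough way to picture what the new rule does is the sketch below. It is not OmegaT's code: it assumes that, at each position, the first rule in the list whose "before" and "after" patterns both match decides whether to break, and it approximates the Java patterns in Python (`re.ASCII` keeps `\s` and `\d` ASCII-only, and `[A-Z]` stands in for `\p{Lu}`, which Python's `re` does not support). The names `FIX_RULE`, `DEFAULT_RULE` and `break_positions` are made up for the illustration.

```python
import re

# Each rule is (is_break, before_pattern, after_pattern).
# FIX_RULE is the exception rule described above; DEFAULT_RULE approximates
# the default "number at the start of a line" break rule.
FIX_RULE = (False, r"[0-9]", r"\s")
DEFAULT_RULE = (True, r"^\s*\d+[\d.)\]]+", r"\s+[A-Z]")


def break_positions(text, rules):
    """Return the offsets at which the text would be split.

    Assumption: at each offset, the first rule whose before-pattern matches
    the text ending there and whose after-pattern matches the text starting
    there decides the outcome (break or exception).
    """
    positions = []
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            ends_here = re.search(rf"(?:{before})\Z", text[:pos], re.ASCII)
            starts_here = re.match(after, text[pos:], re.ASCII)
            if ends_here and starts_here:
                if is_break:
                    positions.append(pos)
                break  # the first matching rule wins
    return positions


text = "24 Hour National Crisis Line"
print(break_positions(text, [DEFAULT_RULE]))            # default rule only
print(break_positions(text, [FIX_RULE, DEFAULT_RULE]))  # exception rule on top
```

With only the default rule, the sketch reports a break right after "24"; with the exception rule placed first, it reports none.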

Samuel

Jean-Christophe Helary
 

On Apr 23, 2020, at 16:26, Samuel Murray <@ugcheleuce> wrote:

> My guess is that the default segmentation rules assume that a number at the start of a line, followed by a capital letter, is meant to be a line number or a heading number.

That's correct. I seem to remember having added that a long time ago...

before: ^\s*\p{Nd}+[\p{Nd}\.\)\]]+

^\s*               start of the string, followed by zero or more whitespace characters
\p{Nd}+            followed by one or more "digit zero through nine in any script except ideographic scripts"
[\p{Nd}\.\)\]]+    followed by one or more of "digit zero through nine in any script except ideographic scripts" or a literal "." or a literal ")" or a literal "]"

after: \s+\p{Lu}

\s+       one or more whitespace characters
\p{Lu}    followed by an uppercase letter that has a lowercase variant
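To see why "24 Hour" is split while "24 hour" and numbers in the middle of a sentence are not, here is a small sketch that applies an approximation of the two patterns above. It is only an illustration: Python's `re` has no `\p{...}` classes, so `\d` stands in for `\p{Nd}` and `[A-Z]` for `\p{Lu}`, and `re.ASCII` keeps `\s` ASCII-only. The function name `splits_after_leading_number` is made up.

```python
import re

# Approximation of the default rule quoted above (see the assumptions in the lead-in).
BEFORE = re.compile(r"^\s*\d+[\d.)\]]+", re.ASCII)
AFTER = re.compile(r"\s+[A-Z]", re.ASCII)


def splits_after_leading_number(line):
    """True if the rule would break right after a number at the start of the line."""
    before = BEFORE.match(line)
    # The after-pattern must match exactly where the before-pattern ended.
    return bool(before and AFTER.match(line, before.end()))


for sample in (
    "24 Hour National Crisis Line",   # number at line start + capital letter
    "24 hour National Crisis Line",   # lowercase letter after the number
    "Call the 24 Hour line",          # number in the middle of the sentence
    "1. Introduction",                # the heading numbers the rule was meant for
):
    print(sample, "->", splits_after_leading_number(sample))
```

Only the first and last samples trigger the rule in this sketch: the `^` anchor rules out numbers in the middle of a sentence, and the uppercase class rules out "24 hour". As written, the before pattern also needs at least two characters, so a lone single digit would not trigger it.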




> I tried to google for the meaning of the four general rules in OmegaT, but I was unable to find a sufficiently comprehensive guide... and the link to the Java documentation in the OmegaT user manual is dead.

Oh. We need to fix this.

My earlier suggestion was WRONG, sorry. There needs to be a NON-BREAKING SPACE between the number and the following word to fix the segmentation for that segment only.
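The reason a non-breaking space helps can be sketched as follows, assuming that `\s` in the "after" pattern only matches ASCII whitespace (the default behaviour of Java regular expressions, which is assumed here to apply to OmegaT's rules). In Python that behaviour is reproduced with `re.ASCII`; `[A-Z]` stands in for `\p{Lu}`, and `after_matches` is a made-up helper name.

```python
import re

# "after" pattern of the default rule, with ASCII-only \s (assumption, see lead-in).
AFTER = re.compile(r"\s+[A-Z]", re.ASCII)

NBSP = "\u00a0"  # NO-BREAK SPACE
plain = "24 Hour National Crisis Line"
fixed = "24" + NBSP + "Hour National Crisis Line"


def after_matches(text, pos):
    """True if the after-pattern matches the text starting at offset pos."""
    return AFTER.match(text, pos) is not None


print(after_matches(plain, 2))  # ordinary space after "24"
print(after_matches(fixed, 2))  # non-breaking space after "24"
```

With an ordinary space the after-pattern matches and the rule can fire; with U+00A0 it does not match, so that one segment stays whole.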


Jean-Christophe Helary

M --
 

Thank you very much, Jean-Christophe and Samuel, for your time. I will try to follow Samuel's instructions. Hopefully, this issue will be resolved in future versions.

Take care

Miguel


M --
 

Hi Samuel,

Please see the attached image. I did what you suggested but it doesn't seem to work unless I did something wrong.

It should be:

791 Chambers Road

453 E.Colfax Avenue

Instead, numbers followed by capitalized words break the segment in two.

This file is totally clean; it doesn't have any strange characters between numbers and capitalized words.

Let me know what you think.
Thanks
Miguel


Inline image




Samuel Murray
 

On 23/04/2020 23:03, M -- via groups.io wrote:

> Please see the attached image. I did what you suggested but it doesn't seem to work unless I did something wrong.

You have to change the "Language Pattern" to ".*". A fullstop followed by an asterisk.

Right now, the language pattern is "LN-CO", which means that OmegaT will only follow the rule if your project's source language is "LN-CO", which it isn't. :-) The characters ".*" tell OmegaT to use the rule for all languages and all file types.
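The difference between the two language patterns can be illustrated with a short sketch. It assumes that the language pattern is a regular expression that has to match the whole source language code; the function name `rule_set_applies` and the language codes "en-US" and "es-MX" are only examples.

```python
import re


def rule_set_applies(language_pattern, source_language):
    """Assumption: a rule set is used when its pattern matches the whole language code."""
    return re.fullmatch(language_pattern, source_language) is not None


print(rule_set_applies("LN-CO", "en-US"))  # placeholder pattern left unchanged
print(rule_set_applies(".*", "en-US"))     # ".*" matches any language code
print(rule_set_applies(".*", "es-MX"))
```

The placeholder "LN-CO" only matches the literal code "LN-CO", so the rule set is skipped for a real project language, whereas ".*" matches any code.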

Samuel

M --
 


Hi Samuel,

Please see the attached image. I did what you suggested but it doesn't seem to work unless I did something wrong.


Samuel Murray
 

On 27/04/2020 22:13, M -- via groups.io wrote:

> Please see the attached image. I did what you suggested but it doesn't seem to work unless I did something wrong.

1. According to your screenshots, your language pattern is still "LN-CO", which is incorrect.

2. According to your screenshots, the rule you added is for "[0-9]\.", which will match a number followed by a fullstop. Your sample text's numbers do not end on a fullstop.

So, in the rule set, change the language pattern from this:

LN-CO

to this:

.*

and in the rule itself, change this:

[0-9]\.

to this:

[0-9]

and then reload the project (F5).
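The effect of the stray `\.` can be checked with a small sketch. It assumes that an exception rule applies at a position when its "before" pattern matches the text ending there and its "after" pattern matches the text starting there; `exception_applies` is a made-up helper, and `re.ASCII` is used so that `\s` behaves like an ASCII-only whitespace class.

```python
import re


def exception_applies(before, after, text, pos):
    """True if the before-pattern matches text ending at pos and the
    after-pattern matches text starting at pos (assumed rule semantics)."""
    ends_here = re.search(rf"(?:{before})\Z", text[:pos], re.ASCII)
    starts_here = re.match(after, text[pos:], re.ASCII)
    return bool(ends_here and starts_here)


text = "791 Chambers Road"
pos = 3  # the position between "791" and " Chambers"

print(exception_applies(r"[0-9]\.", r"\s", text, pos))  # rule from the screenshot
print(exception_applies(r"[0-9]", r"\s", text, pos))    # corrected rule
```

In this sketch the first pattern does not apply, because "791" is not followed by a fullstop, while the corrected pattern does, so the break after the number is suppressed.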

Samuel

M --
 

Wonderful. It works great. Thank you very much Samuel!
