Re: Segmentation of numbers

Jean-Christophe Helary

On Apr 23, 2020, at 16:26, Samuel Murray <@ugcheleuce> wrote:

On 22/04/2020 23:52, M -- via wrote:

I have a problem with the segmentation of numbers. I have something like this:
24<segment 01>
Hour National Crisis Line<segment 02>
I tested this on a plain text file, and I confirm that this happens.

My guess is that the default segmentation rules assume that a number at the start of a line, followed by a capital letter, is meant to be a line number or a heading number.
That's correct. I seem to remember having added that a long time ago...

before: ^\s*\p{Nd}+[\p{Nd}\.\)\]]+

^\s* start of the string, followed by zero or more whitespace characters
\p{Nd}+ followed by one or more "digit zero through nine in any script except ideographic scripts"
[\p{Nd}\.\)\]]+ followed by one or more of "digit zero through nine in any script except ideographic scripts" or literal "." or literal ")" or literal "]"

after: \s+\p{Lu}

\s+ one or more whitespace characters
\p{Lu} followed by an uppercase letter that has a lowercase variant
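
To see why "24 Hour National Crisis Line" gets caught, here is a minimal sketch in plain Java that applies the two patterns the way the rule reads: "before" must match from the start of the text, and "after" must match right where "before" ends. The class and method names are hypothetical and this is not OmegaT's actual code; it also assumes the patterns are compiled with Java's default flags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of OmegaT.
class SegmentationRuleDemo {
    // The two halves of the rule, written as Java string literals.
    static final Pattern BEFORE = Pattern.compile("^\\s*\\p{Nd}+[\\p{Nd}\\.\\)\\]]+");
    static final Pattern AFTER = Pattern.compile("\\s+\\p{Lu}");

    // Returns the offset at which the rule applies, or -1 if it does not.
    static int rulePosition(String text) {
        Matcher before = BEFORE.matcher(text);
        if (!before.lookingAt()) {
            return -1; // "before" does not match at the start of the text
        }
        int pos = before.end();
        // "after" has to match immediately at the end of the "before" match.
        Matcher after = AFTER.matcher(text).region(pos, text.length());
        return after.lookingAt() ? pos : -1;
    }

    public static void main(String[] args) {
        System.out.println(rulePosition("24 Hour National Crisis Line")); // 2
        System.out.println(rulePosition("3. Introduction"));              // 2
        System.out.println(rulePosition("24 hour crisis line"));          // -1
    }
}
```

Note that "24" matches "before" only because the first digit satisfies \p{Nd}+ and the second one satisfies the bracket expression, so a two-digit number followed by a capitalized word looks just like a heading number to this rule.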

I tried to google for the meaning of the four general rules in OmegaT, but I was unable to find a sufficiently comprehensive guide... and the link to the Java documentation in the OmegaT user manual is dead.
Oh. We need to fix this.

My suggestion was WRONG, sorry. There needs to be a NON-BREAKING SPACE (U+00A0) between the number and the word that follows it to fix the segmentation for that segment only.
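
As a rough illustration of why that works: with Java's default flags, \s only covers the ASCII whitespace characters, so the "after" pattern no longer matches once the ordinary space is replaced with U+00A0. The sketch below assumes the rule is compiled without the UNICODE_CHARACTER_CLASS flag; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of OmegaT.
class NbspDemo {
    static final Pattern AFTER = Pattern.compile("\\s+\\p{Lu}");

    // True if the "after" pattern matches starting exactly at offset pos.
    static boolean afterMatchesAt(String text, int pos) {
        return AFTER.matcher(text).region(pos, text.length()).lookingAt();
    }

    public static void main(String[] args) {
        // Ordinary space after "24": the pattern matches.
        System.out.println(afterMatchesAt("24 Hour National Crisis Line", 2));
        // Non-breaking space (U+00A0) after "24": the pattern does not match.
        System.out.println(afterMatchesAt("24\u00A0Hour National Crisis Line", 2));
    }
}
```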

Jean-Christophe Helary
----------------------------------------------- @brandelune
