Re: Segmentation of numbers


M --
 

Hi Samuel,

Please see the attached image. I did what you suggested, but it doesn't seem to work, unless I did something wrong.

It should be:

791 Chambers Road

453 E.Colfax Avenue

Instead, a number followed by a capitalized word breaks the segment in two.

This file is totally clean; it doesn't have any strange characters between numbers and capitalized words.

Let me know what you think.
Thanks
Miguel


Inline image



On Thursday, April 23, 2020, 9:52:12 AM MDT, M -- via groups.io <testaferro7@...> wrote:


Thank you very much Jean-Christophe and Samuel for your time. I will try to follow Samuel's instructions. Hopefully, this issue will be resolved in future versions.

Take care

Miguel

On Thursday, April 23, 2020, 2:02:56 AM MDT, Jean-Christophe Helary <jean.christophe.helary@...> wrote:




> On Apr 23, 2020, at 16:26, Samuel Murray <samuelmurray@...> wrote:
>
> On 22/04/2020 23:52, M -- via groups.io wrote:
>
>> I have a problem with the segmentation of numbers. I have something like this:
>> *24<segment 01>*
>> *Hour National Crisis Line<segment 02>
>
> I tested this on a plain text file, and I confirm that this happens.
>
> My guess is that the default segmentation rules assume that a number at the start of a line, followed by a capital letter, is meant to be a line number or a heading number.

That's correct. I seem to remember having added that a long time ago...

before: ^\s*\p{Nd}+[\p{Nd}\.\)\]]+

^\s*               beginning of the string, followed by zero or more white space
\p{Nd}+            followed by one or more "digit zero through nine in any script except ideographic scripts"
[\p{Nd}\.\)\]]+    followed by one or more of "digit zero through nine in any script except ideographic scripts" or literal "." or literal ")" or literal "]"

after: \s+\p{Lu}

\s+                one or more white space
\p{Lu}             followed by an uppercase letter that has a lowercase variant
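
To check the rule outside OmegaT, here is a minimal Java sketch. It is not OmegaT's own code: the class name NumberBreakRule and the method breakIndex are made up for illustration, and it assumes a break is placed where the "before" pattern ends and the "after" pattern starts right at that position. Only the two patterns are taken from the rule above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OmegaT.
class NumberBreakRule {
    // Patterns copied from the rule quoted above (Java regex syntax).
    static final Pattern BEFORE = Pattern.compile("^\\s*\\p{Nd}+[\\p{Nd}\\.\\)\\]]+");
    static final Pattern AFTER = Pattern.compile("\\s+\\p{Lu}");

    // Returns the index where the rule would split the text, or -1 if it does not apply.
    static int breakIndex(String text) {
        Matcher before = BEFORE.matcher(text);
        if (!before.find()) {
            return -1; // no leading number
        }
        int end = before.end();
        // The "after" pattern must match starting exactly where "before" ended.
        Matcher after = AFTER.matcher(text).region(end, text.length());
        return after.lookingAt() ? end : -1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "24 Hour National Crisis Line",
            "791 Chambers Road",
            "453 E.Colfax Avenue",
            "24 hour crisis line" // lowercase after the number: rule does not apply
        };
        for (String s : samples) {
            int i = breakIndex(s);
            System.out.println(i < 0
                ? "[" + s + "]"
                : "[" + s.substring(0, i) + "] [" + s.substring(i) + "]");
        }
    }
}
```

Under these assumptions the first three samples are split right after the number, which is the behaviour Miguel reports, while the lowercase sample stays in one piece.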





>  I tried to google for the meaning of the four general rules in OmegaT, but I was unable to find a sufficiently comprehensive guide... and the link to the Java documentation in the OmegaT user manual is dead.


Oh. We need to fix this.

My suggestion was WRONG, sorry. There needs to be a NON-BREAKING SPACE between the number and the capitalized word to fix the segmentation for that segment only.
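
A quick way to see why the non-breaking space helps, as a sketch only: in Java, \s compiled with default flags does not match U+00A0 (NO-BREAK SPACE), so the "after" pattern \s+\p{Lu} no longer matches once the regular space is replaced. The class name NbspCheck and the method afterMatches are made up for illustration, and this assumes the rule is compiled without Pattern.UNICODE_CHARACTER_CLASS.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OmegaT.
class NbspCheck {
    // "After" pattern copied from the rule quoted earlier in this thread,
    // compiled with default flags (an assumption about how the rule is used).
    static final Pattern AFTER = Pattern.compile("\\s+\\p{Lu}");

    // True if the text following the number would trigger the rule.
    static boolean afterMatches(String rest) {
        return AFTER.matcher(rest).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(afterMatches(" Hour"));      // regular space: true
        System.out.println(afterMatches("\u00A0Hour")); // NO-BREAK SPACE: false
    }
}
```

Since the fix lives in the source text and not in the rule, it only affects the segments where the space is actually replaced.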


Jean-Christophe Helary
-----------------------------------------------
http://mac4translators.blogspot.com @brandelune

