Table 12 The (Java) regular expressions used for the character feature in the GM task

From: Automating curation using a natural language processing pipeline

Description                              Regexp
Capitals, lower case, hyphen then digit  [A-Z]+[a-z]*-[0-9]
Capitals followed by digit               [A-Z]{2,}[0-9]+
Single capital                           [A-Z]
Single Greek character                   \p{InGreek}
Letters followed by digits               [A-Za-z]+[0-9]+
Lower case, hyphen then capitals         [a-z]+-[A-Z]+
Single digit                             [0-9]
Two digits                               [0-9][0-9]
Four digits                              [0-9][0-9][0-9][0-9]
Two capitals                             [A-Z][A-Z]
Three capitals                           [A-Z][A-Z][A-Z]
Four capitals                            [A-Z]{4}
Five or more capitals                    [A-Z]{5,}
Digit then hyphen                        [0-9]+-
All lower case                           [a-z]+
All digits                               [0-9]+
Nucleotide                               [AGCT]{3,}
Capital, lower case then digit           [A-Z][a-z]{2,}[0-9]
Lower case, capitals then any            [a-z][A-Z][A-Z].*
Greek letter name                        (match any Greek letter name)
Roman digit                              [IVXLC]+
Capital, lower, capital and any          [A-Z][a-z][A-Z].*
Contains digit                           .*[0-9].*
Contains capital                         .*[A-Z].*
Contains hyphen                          .*-.*
Contains period                          .*\..*
Contains punctuation                     .*\p{Punct}.*
All digits                               [0-9]+
All capitals                             [A-Z]+
Is a personal title                      (Mr|Mrs|Miss|Dr|Ms)
Looks like an acronym                    ([A-Za-z]\.)+

  1. GM, gene mention.
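
As an illustration of how patterns like these might be turned into token-level features, the following is a minimal Java sketch, not the authors' implementation. It assumes each pattern must match the whole token (which the explicit `.*` in the "Contains ..." rows suggests, but the table does not state). The class name `CharacterFeatures`, the method `featuresFor`, and the feature labels are hypothetical names chosen for this example, and only a subset of the rows is included.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class CharacterFeatures {
    // A subset of the table's rows. The keys are hypothetical feature labels;
    // the patterns are copied from the table (backslashes doubled for Java strings).
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("CAPS_THEN_DIGIT", Pattern.compile("[A-Z]{2,}[0-9]+"));
        PATTERNS.put("SINGLE_GREEK", Pattern.compile("\\p{InGreek}"));
        PATTERNS.put("LETTERS_THEN_DIGITS", Pattern.compile("[A-Za-z]+[0-9]+"));
        PATTERNS.put("NUCLEOTIDE", Pattern.compile("[AGCT]{3,}"));
        PATTERNS.put("CONTAINS_DIGIT", Pattern.compile(".*[0-9].*"));
        PATTERNS.put("CONTAINS_PERIOD", Pattern.compile(".*\\..*"));
        PATTERNS.put("CONTAINS_PUNCT", Pattern.compile(".*\\p{Punct}.*"));
        PATTERNS.put("ALL_CAPS", Pattern.compile("[A-Z]+"));
        PATTERNS.put("ACRONYM", Pattern.compile("([A-Za-z]\\.)+"));
    }

    // Returns the labels of all patterns that match the entire token
    // (assumption: whole-token matching via Matcher.matches()).
    static List<String> featuresFor(String token) {
        List<String> fired = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(token).matches()) {
                fired.add(entry.getKey());
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // "\u03b1" is the Greek small letter alpha.
        String[] tokens = {"BRCA1", "p53", "GATTACA", "\u03b1", "U.S."};
        for (String token : tokens) {
            System.out.println(token + " -> " + featuresFor(token));
        }
    }
}
```

Because several patterns overlap, a single token can fire more than one feature under this scheme: for example, "GATTACA" matches both the nucleotide and the all-capitals patterns.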