Epidemiology & Technology

Match string for patterns

Regular Expressions (RegEx) can be used in Stata's string functions when working with string data.

Finding whether a String matches a Pattern

Use the ustrregexm function. The name is short for “Unicode String Regular Expression Match” = UStrRegExM.

count if ustrregexm(guardian_rel, "[Ss]\/[O0o]")

The function above returns 1 (true) if any of the following sequences is present ANYWHERE in the record: S/O, s/o, s/O, s/0 etc. It returns 0 otherwise, so count reports the number of matching records.

Let us break down the RegEx pattern. It has three distinct sections: [Ss]\/[O0o]

  • [Ss] = will match ‘S’ or ‘s’. Note that all the valid characters are placed inside square brackets ‘[ ]’.
  • \/ = will match only a slash ‘/’. Note the back-slash ‘\’ placed before the slash ‘/’. This is called escaping: a character that has a special meaning in RegEx must be preceded by a back-slash ‘\’ to be matched literally. The examples further down show that a plain ‘/’ also matches here, so this escape is a precaution and does no harm.
  • [O0o] = will match ‘O’, ‘0’ or ‘o’. Again, all the valid characters are placed inside square brackets ‘[ ]’.
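To see the pattern at work on a variable, here is a minimal sketch. The four guardian_rel values are made up for illustration and are not from any real dataset; the variable name son_of is also hypothetical.

```stata
* Hypothetical toy data: four made-up values for guardian_rel
clear
input str20 guardian_rel
"S/O Ram Kumar"
"s/0 Mohan Lal"
"D/O Ram Kumar"
"W/O Mohan Lal"
end

* flag rows where S or s, then /, then O, 0 or o appear in sequence
generate byte son_of = ustrregexm(guardian_rel, "[Ss]\/[O0o]")
list guardian_rel son_of
count if son_of   // 2 (the first two rows)
```

Storing the result in a 0/1 variable first makes it easy to list the flagged rows and check them by eye before counting or recoding.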

Checking whether a text sequence is present anywhere inside a string

di ustrregexm("patient is S/O mr Ram", "are") // 0
di ustrregexm("patient is S/O mr Ram", "is" ) // 1
di ustrregexm("patient is S/O mr Ram", "Is") // 0, note the case

Using Brackets to account for capital and small case

// using square brackets to list all possible allowed values at that position
di ustrregexm("patient is S/O mr Ram", "[Ii]s") // 1, note that we have grouped I,i inside[]
di ustrregexm("patient Is S/O mr Ram", "[Ii]s") // 1, note that we have grouped I,i inside[]
di ustrregexm("patient IS S/O mr Ram", "[Ii]s")  // 0

di ustrregexm("patient IS S/O mr Ram", "[Ii][Ss]") // 1, two groups  
di ustrregexm("patient Is S/O mr Ram", "[Ii][Ss]") // 1, two groups  
di ustrregexm("patient iS S/O mr Ram", "[Ii][Ss]") // 1, two groups  
di ustrregexm("patient is S/O mr Ram", "[Ii][Ss]") // 1, two groups
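Listing both cases in brackets gets tedious for longer words. As an alternative sketch, Stata documents an optional third argument to ustrregexm (noc); when it is non-zero the match is case-insensitive. Check help ustrregexm on your Stata version before relying on it.

```stata
* default: matching is case-sensitive
di ustrregexm("patient IS S/O mr Ram", "is")       // 0
* third argument set to 1: case is ignored
di ustrregexm("patient IS S/O mr Ram", "is", 1)    // 1
di ustrregexm("patient iS S/O mr Ram", "is", 1)    // 1
```

The bracket form remains useful when only some positions may vary in case, for example "^[Pp]atient".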

Checking whether a text sequence is present in the beginning of the string: using ^

// using hat symbol ^  just before the regular expression
di ustrregexm("patient IS S/O mr Ram", "^[Ii][Ss]") // 0  
di ustrregexm("patient IS S/O mr Ram", "^p") // 1
di ustrregexm("patient IS S/O mr Ram", "^P") // 0
di ustrregexm("patient IS S/O mr Ram", "^pat") // 1
di ustrregexm("Patient IS S/O mr Ram", "^pat") // 0
di ustrregexm("Patient IS S/O mr Ram", "^[Pp]") // 1
di ustrregexm("Patient IS S/O mr Ram", "^[Pp]Atient") // 0
di ustrregexm("Patient IS S/O mr Ram", "^[Pp][Aa][Tt][Ii]ent") // 1

Searching for Characters like * % $ - / \ ! etc: Using the backslash ‘\’ Escape character

// using backslash to escape special characters
di ustrregexm("patient is S/O mr Ram", "[Ss]/[Oo]"  ) // 1
di ustrregexm("patient is S\O mr Ram", "[Ss]\[Oo]"  ) // 0 <= the single '\' escapes the '[' after it, so the pattern looks for a literal '['
di ustrregexm("patient is S\O mr Ram", "[Ss]\\[Oo]" ) // 1, the '\' being searched for must itself be escaped with a '\'
di ustrregexm("patient is S/O mr Ram", "[Ss]/[Oo]"  ) // 1, the unescaped '/' seemingly works
di ustrregexm("patient is S/O mr Ram", "[Ss]\/[Oo]" ) // 1, but always escape a special character you need to search for
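The effect of escaping is easiest to see with characters that really are special, such as the period and the asterisk. A small sketch, with test strings made up for illustration:

```stata
* an unescaped period means "any single character"
di ustrregexm("weight 3x5 kg", "3.5")    // 1
* escaped, it only matches a real period
di ustrregexm("weight 3x5 kg", "3\.5")   // 0
di ustrregexm("weight 3.5 kg", "3\.5")   // 1
* an escaped asterisk matches a literal asterisk
di ustrregexm("dose 5*2 mg", "5\*2")     // 1
di ustrregexm("dose 52 mg", "5\*2")      // 0
```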

One Character, then anything, then another character: use Period ‘.’

// If you do not care what comes between two characters
di ustrregexm("patient is S/O mr Ram", "[Ss]\/[Oo]" ) // 1
di ustrregexm("patient is S-O mr Ram", "[Ss]\/[Oo]" ) // 0
di ustrregexm("patient is S\O mr Ram", "[Ss]\/[Oo]" ) // 0
di ustrregexm("patient is S\O mr Ram", "[Ss]\[/-\][Oo]" ) // 0 - does not work: \[ and \] are read as literal brackets
di ustrregexm("patient is S\O mr Ram", "[Ss][\/\-\\][Oo]" ) // 1 - need to escape each character inside the brackets
di ustrregexm("patient is S\O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period . , which means any one character goes here

di ustrregexm("patient is S-O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period 
di ustrregexm("patient is S/O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period 
di ustrregexm("patient is S*O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period 
di ustrregexm("patient is SO mr Ram", "[Ss].[Oo]" ) // 0 - nothing sits between S and O, so the match fails

Check if Anything or Nothing is in a sequence: using *

di ustrregexm("patient is SO mr Ram", "[Ss].[Oo]" ) // 0 - nothing sits between S and O, so the match fails
// careful: inside square brackets a period is a literal '.', not "any character", so the next three patterns look for real periods
di ustrregexm("patient is SO mr Ram", "[Ss][.]+[Oo]" ) // 0, + means one or more of the previous item, here a literal '.', and there is none
di ustrregexm("patient is SO mr Ram", "[Ss][.]*[Oo]" ) // 1, * means zero or more of the previous item, and zero periods is allowed
di ustrregexm("patient is S/O mr Ram", "[Ss][.]*[Oo]" ) // 0, '/' is not a period; for "anything or nothing" the period must stay outside the brackets
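The results above follow from one rule: a period inside square brackets is a literal period, while a period outside brackets stands for any character. The sketch below contrasts the two forms; the ? quantifier (zero or one of the previous item) is an addition not used elsewhere in this post.

```stata
* [.] matches only a real period
di ustrregexm("patient is S.O mr Ram", "[Ss][.][Oo]")  // 1
di ustrregexm("patient is S/O mr Ram", "[Ss][.][Oo]")  // 0
* .* outside brackets: zero or more of any character
di ustrregexm("patient is S/O mr Ram", "[Ss].*[Oo]")   // 1
di ustrregexm("patient is SO mr Ram",  "[Ss].*[Oo]")   // 1
* .? outside brackets: at most one character of any kind
di ustrregexm("patient is SO mr Ram",  "[Ss].?[Oo]")   // 1
di ustrregexm("patient is S/O mr Ram", "[Ss].?[Oo]")   // 1
```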

Mixing it up using parentheses ‘ ( ) ‘

di ustrregexm("patient is S-O mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is S\O mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is SoO mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is S.O mr Ram", "([Ss])(.)*([Oo])" ) // 1

Now we have three sub-expressions within the RegEx: ([Ss]) (.)* ([Oo])

  • ([Ss]) will match S or s in the first place
  • (.)* will match any run of characters in the second place. The period ‘ . ’ stands for any single character, and the * placed after the parentheses means zero or more of the previous expression. Together they match anything, or nothing at all
  • ([Oo]) will match O or o after that: directly after the S when nothing sits in between, or further along otherwise
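Two practical notes, sketched below under stated assumptions. First, (.)* is very permissive: any s followed somewhere later by an o will match. Second, parentheses also capture what they matched, and ustrregexs(n) returns the n-th captured sub-expression of the last successful ustrregexm. The data values and the variable name sep are made up for illustration, and the tighter pattern (anchored with ^, exactly one character between S and O) is one possible choice, not the only one.

```stata
* (.)* allows anything in between, so unrelated text matches too
di ustrregexm("patient has no guardian", "([Ss])(.)*([Oo])")  // 1

* Hypothetical toy data
clear
input str20 guardian_rel
"S/O Ram Kumar"
"s-o Mohan Lal"
"W/O Gita Devi"
end

* tighter pattern; sub-expression 2 is the separator character
generate sep = ustrregexs(2) if ustrregexm(guardian_rel, "^([Ss])(.)([Oo])")
list guardian_rel sep
```

Rows that do not match are left with an empty sep, which makes them easy to inspect separately.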

Reference:

https://www.stata.com/support/faqs/data-management/regular-expressions/
