Match string for patterns

One can use Regular Expressions (RegEx) in Stata functions when working with string data.

Finding whether a String matches a Pattern

One can use the ustrregexm function. You can remember it as an abbreviation for “Unicode String Regular Expression Match” = UStrRegExM

count if ustrregexm(guardian_rel, "[Ss]\/[O0o]" )

The above function returns 1 (TRUE) if any of the following sequences is present ANYWHERE in the record: S/O, s/o, s/O, s/0 and so on. The count command then reports how many records match.

Let us break down the RegEx pattern. It has three distinct sections: [Ss]\/[O0o]

  • [Ss] = will match for ‘S’ or ‘s’. Please note that we have placed all the valid characters inside square brackets ‘[ ]’.
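  • \/ = will match the slash ‘/’ only. Note that we have placed a backslash ‘\’ before the slash ‘/’. This is called escaping. If you need to search for a special character (one that has a meaning of its own in RegEx, such as . * + [ ( or \), it needs to be escaped with a backslash ‘\’. The slash ‘/’ itself has no special meaning in the RegEx engine used by the Unicode functions, so escaping it here is optional but harmless.
  • [O0o] = will match ‘O’, ‘0’ or ‘o’. Please note again that we have placed all the valid characters inside square brackets ‘[ ]’.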
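
As a rough sketch of how this pattern might be applied to a whole variable, the snippet below builds a small made-up dataset and flags the matching records. The sample values and the variable name is_son are assumptions for illustration; only guardian_rel and the pattern come from the example above.

```stata
// Hypothetical sample data; only guardian_rel and the pattern come from the text
clear
input str20 guardian_rel
"S/O Ram"
"s/0 Shyam"
"D/O Sita"
"W/O Mohan"
end

// is_son (an illustrative name) is 1 where the pattern is found, 0 otherwise
generate byte is_son = ustrregexm(guardian_rel, "[Ss]\/[O0o]")
list guardian_rel is_son
count if is_son == 1
```

Storing the result in a variable, rather than only counting, makes it easy to list the matched records and check them by eye before relying on the pattern.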

Checking whether a text sequence is present anywhere inside a string

di ustrregexm("patient is S/O mr Ram", "are") // 0
di ustrregexm("patient is S/O mr Ram", "is" ) // 1
di ustrregexm("patient is S/O mr Ram", "Is") // 0, note the case

Using Brackets to account for capital and small case

// using square brackets to list all possible allowed values at that position
di ustrregexm("patient is S/O mr Ram", "[Ii]s") // 1, note that we have grouped I,i inside []
di ustrregexm("patient Is S/O mr Ram", "[Ii]s") // 1, note that we have grouped I,i inside []
di ustrregexm("patient IS S/O mr Ram", "[Ii]s") // 0
di ustrregexm("patient IS S/O mr Ram", "[Ii][Ss]") // 1, two groups
di ustrregexm("patient Is S/O mr Ram", "[Ii][Ss]") // 1, two groups
di ustrregexm("patient iS S/O mr Ram", "[Ii][Ss]") // 1, two groups
di ustrregexm("patient is S/O mr Ram", "[Ii][Ss]") // 1, two groups
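
When the only goal is to ignore case, listing both cases for every letter gets long. To the best of my knowledge, ustrregexm also accepts an optional third argument that switches to case-insensitive matching when it is not 0. The sketch below assumes that behaviour, so please confirm it with help ustrregexm on your Stata version.

```stata
// Third argument (assumed here: any non-zero value means "ignore case")
di ustrregexm("patient IS S/O mr Ram", "is")    // 0, case-sensitive by default
di ustrregexm("patient IS S/O mr Ram", "is", 1) // 1, case is ignored
di ustrregexm("patient iS S/O mr Ram", "is", 1) // 1, case is ignored
```

Square brackets remain the better choice when only some positions may vary in case.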

Checking whether a text sequence is present at the beginning of the string: using ^

// using hat symbol ^ just before the regular expression
di ustrregexm("patient IS S/O mr Ram", "^[Ii][Ss]") // 0
di ustrregexm("patient IS S/O mr Ram", "^p") // 1
di ustrregexm("patient IS S/O mr Ram", "^P") // 0
di ustrregexm("patient IS S/O mr Ram", "^pat") // 1
di ustrregexm("Patient IS S/O mr Ram", "^pat") // 0
di ustrregexm("Patient IS S/O mr Ram", "^[Pp]") // 1
di ustrregexm("Patient IS S/O mr Ram", "^[Pp]Atient") // 0
di ustrregexm("Patient IS S/O mr Ram", "^[Pp][Aa][Tt][Ii]ent") // 1

Searching for Characters like * % $ - / \ ! etc: using backslash ‘\’ Escape character

// using backslash to escape special characters
di ustrregexm("patient is S/O mr Ram", "[Ss]/[Oo]" ) // 1
di ustrregexm("patient is S\O mr Ram", "[Ss]\[Oo]" ) // 0 <= What happened here ! The '\' escaped the '[' after it, so no [Oo] set was opened
di ustrregexm("patient is S\O mr Ram", "[Ss]\\[Oo]" ) // 1 We had to escape the '\' to be searched for with another '\'
di ustrregexm("patient is S/O mr Ram", "[Ss]/[Oo]" ) // 1 works, because '/' is not a special character
di ustrregexm("patient is S/O mr Ram", "[Ss]\/[Oo]" ) // 1 Always escape if there is a special character you need to search for
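
The slash and backslash are not the only characters that need care. As an illustration, the sketch below searches for a literal plus sign and literal parentheses, both of which have a meaning of their own in RegEx. The sample strings are made up for this example.

```stata
// Hypothetical strings; '+' and '( )' are RegEx instructions unless escaped
di ustrregexm("blood group A+", "A\+")  // 1, escaped '+' matches a real plus sign
di ustrregexm("blood group AB", "A\+")  // 0, there is no plus sign in the text
di ustrregexm("blood group AB", "A+")   // 1, unescaped '+' means one or more 'A'
di ustrregexm("Hb (g/dl)", "\(g/dl\)")  // 1, escaped parentheses match real ones
```

The third line shows why a missing escape is easy to overlook: the pattern is still valid, it just answers a different question.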

One Character, then anything, then another character: use Period ‘.’

// If you do not care what comes between two characters
di ustrregexm("patient is S/O mr Ram", "[Ss]\/[Oo]" ) // 1
di ustrregexm("patient is S-O mr Ram", "[Ss]\/[Oo]" ) // 0
di ustrregexm("patient is S\O mr Ram", "[Ss]\/[Oo]" ) // 0
di ustrregexm("patient is S\O mr Ram", "[Ss]\[/-\][Oo]" ) // 0 - does not work, the square brackets themselves must not be escaped
di ustrregexm("patient is S\O mr Ram", "[Ss][\/\-\\][Oo]" ) // 1 - need to escape each character inside the brackets
di ustrregexm("patient is S\O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period . , means any one character goes here
di ustrregexm("patient is S-O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period
di ustrregexm("patient is S/O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period
di ustrregexm("patient is S*O mr Ram", "[Ss].[Oo]" ) // 1 - Or just put a period
di ustrregexm("patient is SO mr Ram", "[Ss].[Oo]" ) // 0 - There was nothing there so match failed, Oh No

Check if Anything or Nothing is in a sequence: using *

di ustrregexm("patient is SO mr Ram", "[Ss].[Oo]" ) // 0 - There was nothing there so it failed, Oh No
di ustrregexm("patient is SO mr Ram", "[Ss][.]+[Oo]" ) // 0, + means match one or more of the previous item, but inside [ ] the period is a literal '.', not "anything"
di ustrregexm("patient is SO mr Ram", "[Ss][.]*[Oo]" ) // 1, * means match zero or more of the previous item, and here zero periods sit between S and O
di ustrregexm("patient is S/O mr Ram", "[Ss][.]*[Oo]" ) // 0 --- Aah this is frustrating: [.] matches only a literal period, so the '/' is rejected
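
The difference comes from where the period sits. A minimal sketch, reusing the same sample strings, compares a period outside square brackets (any character) with a period inside them (a literal period only):

```stata
// Period outside [ ] is "any character"; inside [ ] it is just a period
di ustrregexm("patient is S/O mr Ram", "[Ss].*[Oo]")   // 1, '.' accepts the '/'
di ustrregexm("patient is SO mr Ram", "[Ss].*[Oo]")    // 1, '*' also allows nothing in between
di ustrregexm("patient is S/O mr Ram", "[Ss][.]*[Oo]") // 0, '[.]' accepts only a real period
di ustrregexm("patient is S.O mr Ram", "[Ss][.]*[Oo]") // 1, a real period sits in between
```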

Mixing it up using parentheses ‘ ( ) ‘

di ustrregexm("patient is S-O mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is S\O mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is SoO mr Ram", "([Ss])(.)*([Oo])" ) // 1
di ustrregexm("patient is S.O mr Ram", "([Ss])(.)*([Oo])" ) // 1

Now we have three sub-expressions within the RegEx ([Ss]) (.)* ([Oo])

  • ([Ss]) will match S or s at the first place
  • (.)* will match whatever comes next. The period ‘.’ is outside square brackets here, so it means any character. After the parentheses we have placed a *, which means match zero or more of the previous expression. Essentially, anything (one or more characters) or nothing gets matched. The parentheses only group the pieces; it is the unbracketed period that makes the difference.
  • ([Oo]) will match O or o wherever it follows: immediately after the S, or any number of characters later
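
Because (.)* accepts any amount of text, this pattern can match more than intended. The sketch below shows one such case and a stricter variant that uses ? (zero or one of the previous expression), which is not used elsewhere in this post. It also uses ustrregexs(n), which, as I understand it, returns the text matched by sub-expression n of the most recent successful ustrregexm call. The sample strings are made up.

```stata
// (.)* allows any amount of text, so S and O need not be neighbours
di ustrregexm("sister of Ram", "([Ss])(.)*([Oo])")         // 1, "sister o" satisfies the pattern
// (.)? allows at most one character between S and O
di ustrregexm("sister of Ram", "([Ss])(.)?([Oo])")         // 0
di ustrregexm("patient is SO mr Ram", "([Ss])(.)?([Oo])")  // 1
di ustrregexm("patient is S/O mr Ram", "([Ss])(.)?([Oo])") // 1
// show what the second sub-expression captured in the match just above
di ustrregexs(2)
```

Which variant is right depends on how messy the data is. It is worth listing a sample of matched records before trusting either one.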

Reference:

https://www.stata.com/support/faqs/data-management/regular-expressions/