While working on SwarmDoc I came across an article by Jeff Atwood at Coding Horror about using regular expressions. I have always been kind of standoffish toward regular expressions, mainly because they are not quite standard once you get past the super common bits like . and *, and even those have differences between implementations. But it has been a while since I tried my hand, and once I was turned on to http://regexpal.com/ (which is a really cool site) I decided to take another swing at regex.

Basically, all the regex in SwarmDoc does is parse out links that use the wiki style like [[thispage]], single links, and a few others (right now it is a small subset, but eventually I want to parse all the tags that MediaWiki defines).
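To make the [[thispage]] case concrete, here is a rough sketch in browser-style JavaScript. It is not the expression SwarmDoc actually uses; the function name findWikiLinks and the sample text are made up for illustration, and the pattern assumes a link target never contains square brackets.

```javascript
// Hypothetical helper: collect the targets of [[wiki style]] links.
function findWikiLinks(text) {
  // \[\[ and \]\] are literal brackets; ([^\[\]]+) captures the page name.
  const pattern = /\[\[([^\[\]]+)\]\]/g;
  const links = [];
  let match;
  // With the g flag, exec() resumes after the previous match on each call.
  while ((match = pattern.exec(text)) !== null) {
    links.push(match[1]);
  }
  return links;
}

const sample = "See [[thispage]] and also [[Another Page]].";
console.log(findWikiLinks(sample)); // [ 'thispage', 'Another Page' ]
```

The loop relies on exec() returning null once there are no more matches, which is what ends it.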

So, I went to town on regexpal and ended up with a few expressions that seem to work. The only issue is that regexpal seems to be based on the common JavaScript browser implementation, and it seems that this is not what Adobe AIR uses, since AIR uses the ActionScript RegExp object.

It turns out there are a few minor differences between ActionScript and Mozilla that make things function a bit differently.

ActionScript / Flash Regex documentation

Firefox / Mozilla Regexp documentation

Creating Regular Expressions

new RegExp(pattern [,flags]);
/pattern/flags

Flags can include:

  • g – global
  • i – ignore case
  • m – multiline, treat ^ and $ as the start and end of each line, not just of the input string.
  • y – (mozilla) – sticky
  • s – (flash) – dotall – the . matches the newline character
  • x – (flash) – extended – lets you add spaces to the expression to make it more legible.

The y flag is Mozilla only, and the s and x flags are Flash only, so an expression that relies on any of them will not carry over to the other.
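Here is a small sketch of the shared flags (g, i, m) and of the two ways to create an expression, written in JavaScript. The sample string is made up, and I have only reasoned about this for the JavaScript side; the ActionScript RegExp should behave the same for these three flags, but treat that as an assumption.

```javascript
const text = "Cat\ncat\nDog";

// The literal and constructor forms build the same expression.
const literal = /cat/gi;
const constructed = new RegExp("cat", "gi");

// No flags: only the first, case-sensitive match is returned.
const first = text.match(/cat/);
console.log(first[0], first.index); // cat 4

// g + i: every match, ignoring case.
console.log(text.match(literal));     // [ 'Cat', 'cat' ]
console.log(text.match(constructed)); // [ 'Cat', 'cat' ]

// m: ^ and $ also match at the start and end of each line.
const perLine = text.match(/^\w+$/gm);
console.log(perLine); // [ 'Cat', 'cat', 'Dog' ]

// Without m, ^ and $ only match at the ends of the whole input.
const wholeInput = text.match(/^\w+$/g);
console.log(wholeInput); // null
```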

Special Characters

  • ^ – start of line, or the negate character in a character class: [^a-d] means NOT in a through d
  • $ – end of line
  • [ and ] – define a character class, which matches any one of the characters inside
  • ( and ) – define a sequence of characters, (ab|cd) matches ab or cd.
  • . – matches any character except newlines (FLASH only: unless the s flag is set)
  • \ – escapes any character on this list, so \. is treated as ‘.’ and not as any character.
  • + – one or more of previous item
  • ? – zero or one of previous item
  • * – zero or more of previous item
  • | – or, for use inside ( and )
  • – – (flash) the dash is only considered a special character inside the brackets that define a character class
  • ] – (flash) the ] is only defined as a special character (and in turn must be escaped) when inside [ and ] – Interestingly, the ‘[‘ is not considered a special character when inside brackets.
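A few of the special characters above in action, as a quick JavaScript sketch. The test strings and variable names are invented for the example, and the results are for the JavaScript engine only.

```javascript
// An unescaped . matches any character; \. matches only a literal dot.
const dotAny = /a.c/.test("abc");      // true
const dotEscaped = /a\.c/.test("abc"); // false
const dotLiteral = /a\.c/.test("a.c"); // true

// [^a-d] is a negated class: strip everything that is not a through d.
const onlyAtoD = "a1b2x".replace(/[^a-d]/g, ""); // "ab"

// Inside a class the dash makes a range unless it is escaped.
const rangeHasC = /[a-d]/.test("c");     // true
const rangeHasDash = /[a-d]/.test("-");  // false
const escapedDash = /[a\-d]/.test("-");  // true

// ? makes the previous item optional; | picks between alternatives.
const optionalU = /^colou?r$/.test("color"); // true
const either = /^(ab|cd)$/.test("cd");       // true

console.log(dotAny, dotEscaped, dotLiteral, onlyAtoD);
console.log(rangeHasC, rangeHasDash, escapedDash, optionalU, either);
```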

Sequences

These are the predefined character classes and escapes, such as any whitespace or any digit. Most are shorthand for a longer character class; \d, for example, is the same as [0-9].

  • \b – matches at the position between a word and non-word character
  • \B – matches at any position that is not a word boundary, for example between two word characters
  • \cX – (mozilla) – matches a control char, ie \cM = control-m
  • \d – matches a decimal digit
  • \D – matches any character other than a decimal digit
  • \f – matches a form feed character
  • \n – matches a newline character
  • \r – matches a return char
  • \s – matches any whitespace character (space, tab, newline or return)
  • \S – any non white space char
  • \t – matches the tab
  • \unnnn – matches a specific unicode character, for example \u263a matches the smiley
  • \v – matches a vertical tab
  • \w – matches a word character (A-Z, a-z, 0-9, _ ) – does not match non-English characters
  • \W – matches anything other than a word character
  • \xnn – matches the character with the specified two-digit hex code
  • \n – (mozilla) – where n is a positive integer, a back reference that matches the same text the nth paren group matched
  • \0 – (mozilla) – matches a NUL character
  • [\b] – (mozilla) – matches a backspace character (U+0008), which a string literal can write as ‘\b’
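To see some of these sequences working, here is a short JavaScript sketch. The strings and variable names are made up, and the items marked (mozilla) above are only exercised here in JavaScript.

```javascript
// \d+ pulls out runs of digits.
const numbers = "room 101, floor 7".match(/\d+/g); // ["101", "7"]

// \s splits on any single whitespace character.
const words = "a b\tc\nd".split(/\s/); // ["a", "b", "c", "d"]

// \b only matches at a word boundary, \B only where there is none.
const wholeWord = "cat concat".match(/\bcat\b/g); // ["cat"]
const insideWord = /\Bcat/.exec("cat concat");    // found at index 7

// \1 is a back reference: the same text that paren group 1 matched.
const doubled = /(\w)\1/.test("hello");   // true, because of "ll"
const notDoubled = /(\w)\1/.test("world"); // false

// \unnnn and \xnn match a character by its code.
const smiley = /\u263a/.test("\u263a"); // true
const capitalA = /\x41/.test("A");      // true

// \w is ASCII only, so an accented e is not a word character.
const accented = /\w/.test("\u00e9"); // false

// [\b] matches a backspace, which a string literal writes as "\b".
const backspace = /[\b]/.test("\b"); // true

console.log(numbers, words, wholeWord, insideWord.index);
console.log(doubled, notDoubled, smiley, capitalA, accented, backspace);
```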

Braces

  • {n} – matches previous element N times
  • {n,} – matches previous element N or more times
  • {n,m} – matches previous element between n and m times.
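A quick JavaScript sketch of the three brace forms, with made-up inputs. The ^ and $ anchors are there so the count applies to the whole string rather than to some part of it.

```javascript
// {n}: exactly three digits.
const exactlyThree = /^\d{3}$/.test("123"); // true
const tooMany = /^\d{3}$/.test("1234");     // false

// {n,}: two or more digits.
const atLeastTwo = /^\d{2,}$/.test("12345"); // true
const tooFew = /^\d{2,}$/.test("1");         // false

// {n,m}: greedy, so it takes three a's first, then the remaining two.
const runs = "aaaaa".match(/a{2,3}/g); // ["aaa", "aa"]

console.log(exactlyThree, tooMany, atLeastTwo, tooFew, runs);
```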

Parens

  • (x) – matches x and remembers it as a paren group that can be recalled later. For example, (\d{3})-(\d{3})-(\d{4}) would give 3 groups for a phone number
  • (?:x) – non-capturing, matches x but does not remember it, so it gets no entry of its own in the result.
  • x(?=y) – matches x only if it is followed by y, but y is not included in the result
  • x(?!y) – matches x only if it is NOT followed by y

Using parentheses to group a set of characters or a string helps to break the match apart into pieces.
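Here is how the paren forms play out in JavaScript, as a sketch. The phone number and the other strings are invented. Note that the lookahead form is written with (?=y).

```javascript
// Capturing groups: one array entry per paren group, after the full match.
const phone = /(\d{3})-(\d{3})-(\d{4})/.exec("call 555-867-5309 now");
console.log(phone[0]); // 555-867-5309
console.log(phone[1], phone[2], phone[3]); // 555 867 5309

// (?:...) groups without capturing, so only (c) gets an entry.
const nonCapturing = /(?:ab)+(c)/.exec("ababc");
console.log(nonCapturing.length, nonCapturing[1]); // 2 c

// x(?=y): match x only when y follows; y is not part of the match.
const followed = /foo(?=bar)/.exec("foobar");
console.log(followed[0]); // foo

// x(?!y): match x only when y does NOT follow.
const blocked = /foo(?!bar)/.test("foobar"); // false
const allowed = /foo(?!bar)/.test("foobaz"); // true
console.log(blocked, allowed);
```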

Return Format

The return formats are slightly different in Mozilla and ActionScript. RegExp.exec(string) returns an array along with some other information. The extra information does not seem to be standard across the two, only the concept of returning an array.

The Mozilla developer site shows that exec returns the following:

RegExp.exec returns [‘last full match’, ‘paren group 1’, ‘paren group 2’]

It appears that the actionscript method returns the same.
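A small JavaScript sketch of what comes back from exec, using a made-up pattern and input. The index and input properties shown are the "other information" on the JavaScript side; I have not checked here which extra properties ActionScript attaches.

```javascript
const pattern = /(\w+)@(\w+)\.com/;
const result = pattern.exec("mail bob@example.com today");

console.log(result[0]); // bob@example.com  (the full match)
console.log(result[1]); // bob              (paren group 1)
console.log(result[2]); // example          (paren group 2)

// Extra information attached to the array in JavaScript.
console.log(result.index); // 5, where the match starts
console.log(result.input); // the string that was searched

// When nothing matches, exec returns null rather than an empty array.
const noMatch = pattern.exec("no address here");
console.log(noMatch); // null
```

Because a failed match gives null, it is worth checking the result before indexing into it.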
