Trie-regexes
Given this set of words:
- aliceblue
- antiquewhite
- aqua
- aquamarine
- azure
- beige
- bisque
- black
- blanchedalmond
- blue
- blueviolet
- brown
- burlywood
- cadetblue
- chartreuse
- chocolate
- coral
- cornflowerblue
- cornsilk
- crimson
- cyan
- darkblue
- darkcyan
- darkgoldenrod
- darkgray
- darkgrey
- darkgreen
- darkkhaki
- darkmagenta
- darkolivegreen
- darkorange
- darkorchid
- darkred
- darksalmon
- darkseagreen
- darkslateblue
- darkslategray
- darkslategrey
- darkturquoise
- darkviolet
- deeppink
- deepskyblue
- dimgray
- dimgrey
- dodgerblue
- firebrick
- floralwhite
- forestgreen
- fuchsia
- gainsboro
- ghostwhite
- gold
- goldenrod
- gray
- grey
- green
- greenyellow
- honeydew
- hotpink
- indianred
- indigo
- ivory
- khaki
- lavender
- lavenderblush
- lawngreen
- lemonchiffon
- lightblue
- lightcoral
- lightcyan
- lightgoldenrodyellow
- lightgray
- lightgrey
- lightgreen
- lightpink
- lightsalmon
- lightseagreen
- lightskyblue
- lightslateblue
- lightslategray
- lightslategrey
- lightsteelblue
- lightyellow
- lime
- limegreen
- linen
- magenta
- maroon
- mediumaquamarine
- mediumblue
- mediumorchid
- mediumpurple
- mediumseagreen
- mediumslateblue
- mediumspringgreen
- mediumturquoise
- mediumvioletred
- midnightblue
- mintcream
- mistyrose
- moccasin
- navajowhite
- navy
- oldlace
- olive
- olivedrab
- orange
- orangered
- orchid
- palegoldenrod
- palegreen
- paleturquoise
- palevioletred
- papayawhip
- peachpuff
- peru
- pink
- plum
- powderblue
- purple
- rebeccapurple
- red
- rosybrown
- royalblue
- saddlebrown
- salmon
- sandybrown
- seagreen
- seashell
- sienna
- silver
- skyblue
- slateblue
- slategray
- slategrey
- snow
- springgreen
- steelblue
- tan
- teal
- thistle
- tomato
- transparent
- turquoise
- violet
- wheat
- white
- whitesmoke
- yellow
- yellowgreen
If one wanted to create a regex pattern to match any of those words in a piece of text, I'd wager most people would opt for a pattern that looked something like this:
/ ( aliceblue| antiquewhite| aqua| aquamarine| azure| beige| bisque| black| blanchedalmond| blue| blueviolet| brown| burlywood| cadetblue| chartreuse| chocolate| coral| cornflowerblue| cornsilk| crimson| cyan| darkblue| darkcyan| darkgoldenrod| darkgray| darkgrey| darkgreen| darkkhaki| darkmagenta| darkolivegreen| darkorange| darkorchid| darkred| darksalmon| darkseagreen| darkslateblue| darkslategray| darkslategrey| darkturquoise| darkviolet| deeppink| deepskyblue| dimgray| dimgrey| dodgerblue| firebrick| floralwhite| forestgreen| fuchsia| gainsboro| ghostwhite| gold| goldenrod| gray| grey| green| greenyellow| honeydew| hotpink| indianred| indigo| ivory| khaki| lavender| lavenderblush| lawngreen| lemonchiffon| lightblue| lightcoral| lightcyan| lightgoldenrodyellow| lightgray| lightgrey| lightgreen| lightpink| lightsalmon| lightseagreen| lightskyblue| lightslateblue| lightslategray| lightslategrey| lightsteelblue| lightyellow| lime| limegreen| linen| magenta| maroon| mediumaquamarine| mediumblue| mediumorchid| mediumpurple| mediumseagreen| mediumslateblue| mediumspringgreen| mediumturquoise| mediumvioletred| midnightblue| mintcream| mistyrose| moccasin| navajowhite| navy| oldlace| olive| olivedrab| orange| orangered| orchid| palegoldenrod| palegreen| paleturquoise| palevioletred| papayawhip| peachpuff| peru| pink| plum| powderblue| purple| rebeccapurple| red| rosybrown| royalblue| saddlebrown| salmon| sandybrown| seagreen| seashell| sienna| silver| skyblue| slateblue| slategray| slategrey| snow| springgreen| steelblue| tan| teal| thistle| tomato| transparent| turquoise| violet| wheat| white| whitesmoke| yellow| yellowgreen) / i
That'll work, and is quite simple and fast to create even by hand. But what you gain in ease-of-use(/creation), you lose in performance and length. Given a large enough set of words, this naïve pattern will end up biting you in the ass, mark my words. Another option is to organize the pattern into a Trie-structure aka. prefix tree, like so:
/ ^(
a( liceblue| ntiquewhite| qua( marine) ? | zure) |
b( eige| isque| l( a( ck| nchedalmond) | ue( violet) ? ) | rown| urlywood) |
c( adetblue| h( artreuse| ocolate) | or( al| n( flowerblue| silk) ) | rimson| yan) |
d( ark( blue| cyan| g( oldenrod| r( ay| e( en| y) ) ) | khaki| magenta| o( livegreen| r( ange| chid) ) | red| s( almon| eagreen| late( blue| gr[ ae] y) ) | turquoise| violet) | eep( pink| skyblue) | imgr[ ae] y| odgerblue) |
f( irebrick| loralwhite| orestgreen| uchsia) |
g( ainsboro| hostwhite| old( enrod) ? | r( ay| e( en( yellow) ? | y) ) ) |
ho( neydew| tpink) |
i( n( dianred| digo) | vory) |
khaki|
l( a( vender( blush) ? | wngreen) | emonchiffon| i( ght( blue| coral| cyan| g( oldenrodyellow| r( ay| e( en| y) ) ) | pink| s( almon| eagreen| kyblue| lategr[ ae] y| teelblue) | yellow) | me( green) ? | nen) ) |
m( a( genta| roon) | edium( aquamarine| blue| orchid| purple| s( eagreen| lateblue| pringgreen) | turquoise| violetred) | i( dnightblue| ntcream| styrose) | occasin) |
nav( ajowhite| y) |
o( l( dlace| ive( drab) ? ) | r( ange( red) ? | chid) ) |
p( a( le( goldenrod| green| turquoise| violetred) | payawhip) | e( achpuff| ru) | ink| lum| owderblue| urple) |
r( e( beccapurple| d) | o( sybrown| yalblue) ) |
s( a( ddlebrown| lmon| ndybrown) | ea( green| shell) | i( enna| lver) | kyblue| late( blue| gr[ ae] y) | now| pringgreen| teelblue) |
t( an| eal| histle| omato| urquoise) |
violet|
wh( eat| ite( smoke) ? ) |
yellow( green) ?
) $/ i
Formatted with whitespace for readability
Basically, instead of listing each word in its entirety as a flat list in the pattern, you create capture groups of the words based on their alphabetical prefixes. Now, granted, I doubt many people would opt to create these sorts of patterns by hand. So the trie-based patterns definitely cause a bit more friction for the developer during the pattern creation. But apparently these are a shit-ton faster to execute, and usually end up a lot shorter. With the color names I used here as an example, the naïve pattern is 1501 characters long, while the triegex ended up being just 1185 characters. I'm not mathematician, but I do believe that's a solid third less wasted space. Ain't that dank.