Unicode Regular Expressions
Modern regular expression engines have been rapidly adding new features for matching and parsing Unicode strings, providing powerful new tools to add to your toolkit.
This talk will be useful to programmers of all levels who want to learn about pattern matching using character properties and other Unicode features that are new to many regex engines. The behavior of existing regex metacharacters has also been evolving to conform to Unicode standards, and it’s important to understand the differences.
The topics will include:
- new modifiers, character classes, and special escape sequences
- differences in regex engines
- code points and grapheme clusters
- matching boundaries
- case folding and normalization
- and lots of character properties!
A basic knowledge of regular expressions is required.
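As a small taste of two of these topics, here is a minimal sketch in Python using only the standard library. It is not taken from the talk; the helper names `matches_digits` and `caseless_key` are hypothetical and exist only for illustration. It shows how an existing metacharacter (`\d`) behaves differently with and without Unicode semantics, and how case folding and normalization combine for caseless comparison.

```python
import re
import unicodedata


def matches_digits(text: str, ascii_only: bool = False) -> bool:
    """Return True if text consists entirely of characters matched by \\d.

    For str patterns, Python's re module treats \\d as any Unicode decimal
    digit by default; the re.ASCII flag restricts it to 0-9.
    """
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d+", text, flags) is not None


def caseless_key(text: str) -> str:
    """Hypothetical helper: build a key for canonical caseless comparison.

    Decompose (NFD), case fold, then decompose again, since case folding
    can produce text that is no longer normalized.
    """
    folded = unicodedata.normalize("NFD", text).casefold()
    return unicodedata.normalize("NFD", folded)


arabic_indic = "\u0663\u0664\u0665"  # ARABIC-INDIC DIGITS THREE, FOUR, FIVE

print(matches_digits(arabic_indic))                   # Unicode-aware \d
print(matches_digits(arabic_indic, ascii_only=True))  # ASCII-only \d

composed = "\u00e9"     # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "E\u0301"  # two code points: E + COMBINING ACUTE ACCENT
print(len(composed), len(decomposed))
print(caseless_key(composed) == caseless_key(decomposed))
```

Both strings in the last comparison render as a single user-perceived character (a grapheme cluster) even though they contain different numbers of code points. Matching whole grapheme clusters or character properties such as `\p{...}` is not supported by Python's built-in `re` module and would need a different engine.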
I’ve presented and facilitated workshops at multiple events including YAPC::NA, Pittsburgh Perl Workshop, DC–Baltimore Perl Workshop, Boston Skillshare, and local user groups. Topics have ranged from development and testing to DIY and foraging for edible wild plants. Unicode Regular Expressions has been presented at a Shutterstock Tech Talk in January and will be presented at DCBPW 2013 in April.
Nova Patch is a software engineer on the International Search team at Shutterstock, specializing in internationalization, localization, and multilingual information retrieval, with a focus on developing a search and discovery experience that supports the world’s languages, writing systems, and cultures. They are an open source developer, contributor to the Unicode CLDR, and member of the Unicode Consortium.
- Title: Unicode Best Practices
- Track: Cooking
- Room: B301
- Time: 3:45 – 4:30pm
Developing applications to handle the natural languages and written scripts of the world (or even a small handful of them) is an impressively large task. Fortunately, Unicode provides tools to do just that. It’s more than just a character set; it’s a collection of standards for working with the world’s textual data. The problem is that Unicode itself is complex!
- Speakers: Nova Patch