REGEXP_REPLACE finds every substring in a string that matches a regular expression and swaps it for replacement text. It is the go-to tool for cleaning messy data right inside a query: strip junk out of a phone number, collapse whitespace, or reformat names without exporting to application code.
Signature and a basic replacement
In PostgreSQL the function reads REGEXP_REPLACE(source, pattern, replacement [, flags]). By default it replaces only the first match, so you almost always want the 'g' (global) flag.
SELECT
REGEXP_REPLACE('+1 (555) 123-45-67', '[^0-9]', '', 'g') AS digits,
REGEXP_REPLACE('hello world', '\s+', ' ', 'g') AS one_space;
Key points:
- Without
'g' only the first match goes: REGEXP_REPLACE('a-b-c', '-', '+') yields a+b-c.
- The class
[^0-9] means "any character that is not a digit" - the classic way to keep digits only.
- An empty replacement string
'' simply deletes the matches.
The g, i, and multiline flags
Flags are passed as a string in the fourth argument and they combine: 'gi' means global plus case-insensitive at once.
SELECT
REGEXP_REPLACE(email, 'GMAIL', 'gmail', 'gi') AS norm,
REGEXP_REPLACE(name, '\s+', ' ', 'g') AS clean_name
FROM users;
g - replace all occurrences, not just the first.
i - case-insensitive, so Gmail, GMAIL, and gmail all match.
n (or m) - multiline mode, where ^ and $ start anchoring to embedded line breaks.
Back-references in the replacement
Parentheses (...) in the pattern capture fragments into groups, and the replacement string refers to them as \1, \2, and so on. This lets you reorder and reformat parts of a string.
SELECT
REGEXP_REPLACE(name, '^(\w+)\s+(\w+)$', '\2, \1') AS last_first
FROM employees;
Another example - normalize a phone into one format by pulling out groups of digits:
SELECT
REGEXP_REPLACE('5551234567', '(\d{3})(\d{3})(\d{4})', '(\1) \2-\3') AS pretty;
The back-reference \1 points at the text captured by the first group, so the order of the parentheses matters.
POSIX versus PCRE: the big gotcha
PostgreSQL uses the POSIX ARE dialect, while MySQL and many languages use PCRE. They look alike, but the details bite.
- In POSIX (PostgreSQL) use
[[:alnum:]] for alphanumerics, or the shorthands \w, \d, \s. Perl-style \b (word boundary) exists, but edge behavior differs.
- Lazy quantifiers
*? and +? are supported in PostgreSQL, but named groups and lookbehind are not.
- The most common mistake: forgetting
'g' and wondering why only one occurrence changed.
SELECT REGEXP_REPLACE('a b c', '\s+', '_');
SELECT REGEXP_REPLACE('a b c', '\s+', '_', 'g');
MySQL and ClickHouse differences
The function is not universal and behaves differently across engines.
- MySQL 8+ has
REGEXP_REPLACE(source, pattern, replacement [, pos, occurrence, match_type]). There is no flags string like in Postgres: global replacement is the default (when occurrence = 0), while case and multiline come from match_type such as 'i' or 'm'. Back-references are written as $1, not \1.
SELECT REGEXP_REPLACE(phone, '[^0-9]', '') AS digits FROM users;
SELECT REGEXP_REPLACE(name, '^(\\w+) (\\w+)$', '$2, $1') AS last_first FROM employees;
- ClickHouse splits this into
replaceRegexpOne (first match) and replaceRegexpAll (all matches), and group references are written as \1.
SELECT replaceRegexpAll(phone, '[^0-9]', '') AS digits FROM users;
Remember: the essence is the same - a pattern, a replacement, and a global flag. Only the function name, the flag syntax, and the back-reference style (\1 versus $1) change.
REGEXP_REPLACEfinds every substring in a string that matches a regular expression and swaps it for replacement text. It is the go-to tool for cleaning messy data right inside a query: strip junk out of a phone number, collapse whitespace, or reformat names without exporting to application code.Signature and a basic replacement
In PostgreSQL the function reads
REGEXP_REPLACE(source, pattern, replacement [, flags]). By default it replaces only the first match, so you almost always want the'g'(global) flag.SELECT REGEXP_REPLACE('+1 (555) 123-45-67', '[^0-9]', '', 'g') AS digits, REGEXP_REPLACE('hello world', '\s+', ' ', 'g') AS one_space; -- 15551234567 | hello worldKey points:
'g'only the first match goes:REGEXP_REPLACE('a-b-c', '-', '+')yieldsa+b-c.[^0-9]means "any character that is not a digit" - the classic way to keep digits only.''simply deletes the matches.The g, i, and multiline flags
Flags are passed as a string in the fourth argument and they combine:
'gi'means global plus case-insensitive at once.SELECT REGEXP_REPLACE(email, 'GMAIL', 'gmail', 'gi') AS norm, REGEXP_REPLACE(name, '\s+', ' ', 'g') AS clean_name FROM users;g- replace all occurrences, not just the first.i- case-insensitive, soGmail,GMAIL, andgmailall match.n(orm) - multiline mode, where^and$start anchoring to embedded line breaks.Back-references in the replacement
Parentheses
(...)in the pattern capture fragments into groups, and the replacement string refers to them as\1,\2, and so on. This lets you reorder and reformat parts of a string.SELECT REGEXP_REPLACE(name, '^(\w+)\s+(\w+)$', '\2, \1') AS last_first FROM employees; -- 'Ada Lovelace' -> 'Lovelace, Ada'Another example - normalize a phone into one format by pulling out groups of digits:
SELECT REGEXP_REPLACE('5551234567', '(\d{3})(\d{3})(\d{4})', '(\1) \2-\3') AS pretty; -- (555) 123-4567The back-reference
\1points at the text captured by the first group, so the order of the parentheses matters.POSIX versus PCRE: the big gotcha
PostgreSQL uses the POSIX ARE dialect, while MySQL and many languages use PCRE. They look alike, but the details bite.
[[:alnum:]]for alphanumerics, or the shorthands\w,\d,\s. Perl-style\b(word boundary) exists, but edge behavior differs.*?and+?are supported in PostgreSQL, but named groups and lookbehind are not.'g'and wondering why only one occurrence changed.-- GOTCHA: without 'g' only the first space collapses SELECT REGEXP_REPLACE('a b c', '\s+', '_'); -- a_b c SELECT REGEXP_REPLACE('a b c', '\s+', '_', 'g'); -- a_b_cMySQL and ClickHouse differences
The function is not universal and behaves differently across engines.
REGEXP_REPLACE(source, pattern, replacement [, pos, occurrence, match_type]). There is no flags string like in Postgres: global replacement is the default (whenoccurrence = 0), while case and multiline come frommatch_typesuch as'i'or'm'. Back-references are written as$1, not\1.SELECT REGEXP_REPLACE(phone, '[^0-9]', '') AS digits FROM users; SELECT REGEXP_REPLACE(name, '^(\\w+) (\\w+)$', '$2, $1') AS last_first FROM employees;replaceRegexpOne(first match) andreplaceRegexpAll(all matches), and group references are written as\1.SELECT replaceRegexpAll(phone, '[^0-9]', '') AS digits FROM users;Remember: the essence is the same - a pattern, a replacement, and a global flag. Only the function name, the flag syntax, and the back-reference style (
\1versus$1) change.