SQL REGEXP_REPLACE: Pattern-Based String Cleaning with the g and i Flags

REGEXP_REPLACE finds every substring in a string that matches a regular expression and swaps it for replacement text. It is the go-to tool for cleaning messy data right inside a query: strip junk out of a phone number, collapse whitespace, or reformat names without exporting to application code.

Signature and a basic replacement

In PostgreSQL the function reads REGEXP_REPLACE(source, pattern, replacement [, flags]). By default it replaces only the first match, so you almost always want the 'g' (global) flag.

SELECT
  REGEXP_REPLACE('+1 (555) 123-45-67', '[^0-9]', '', 'g') AS digits,
  REGEXP_REPLACE('hello   world', '\s+', ' ', 'g')        AS one_space;
-- 15551234567 | hello world

Key points:

Without 'g' only the first match goes: REGEXP_REPLACE('a-b-c', '-', '+') yields a+b-c.
The class [^0-9] means "any character that is not a digit" - the classic way to keep digits only.
An empty replacement string '' simply deletes the matches.

The g, i, and multiline flags

Flags are passed as a string in the fourth argument and they combine: 'gi' means global plus case-insensitive at once.

SELECT
  REGEXP_REPLACE(email, 'GMAIL', 'gmail', 'gi') AS norm,
  REGEXP_REPLACE(name, '\s+', ' ', 'g')         AS clean_name
FROM users;

g - replace all occurrences, not just the first.
i - case-insensitive, so Gmail, GMAIL, and gmail all match.
n (or m) - multiline mode, where ^ and $ start anchoring to embedded line breaks.

Back-references in the replacement

Parentheses (...) in the pattern capture fragments into groups, and the replacement string refers to them as \1, \2, and so on. This lets you reorder and reformat parts of a string.

SELECT
  REGEXP_REPLACE(name, '^(\w+)\s+(\w+)$', '\2, \1') AS last_first
FROM employees;
-- 'Ada Lovelace' -> 'Lovelace, Ada'

Another example - normalize a phone into one format by pulling out groups of digits:

SELECT
  REGEXP_REPLACE('5551234567', '(\d{3})(\d{3})(\d{4})', '(\1) \2-\3') AS pretty;
-- (555) 123-4567

The back-reference \1 points at the text captured by the first group, so the order of the parentheses matters.

POSIX versus PCRE: the big gotcha

PostgreSQL uses the POSIX ARE dialect, while MySQL and many languages use PCRE. They look alike, but the details bite.

In POSIX (PostgreSQL) use [[:alnum:]] for alphanumerics, or the shorthands \w, \d, \s. Perl-style \b (word boundary) exists, but edge behavior differs.
Lazy quantifiers *? and +? are supported in PostgreSQL, but named groups and lookbehind are not.
The most common mistake: forgetting 'g' and wondering why only one occurrence changed.

-- GOTCHA: without 'g' only the first space collapses
SELECT REGEXP_REPLACE('a  b  c', '\s+', '_');     -- a_b  c
SELECT REGEXP_REPLACE('a  b  c', '\s+', '_', 'g'); -- a_b_c

MySQL and ClickHouse differences

The function is not universal and behaves differently across engines.

MySQL 8+ has REGEXP_REPLACE(source, pattern, replacement [, pos, occurrence, match_type]). There is no flags string like in Postgres: global replacement is the default (when occurrence = 0), while case and multiline come from match_type such as 'i' or 'm'. Back-references are written as $1, not \1.

SELECT REGEXP_REPLACE(phone, '[^0-9]', '') AS digits FROM users;
SELECT REGEXP_REPLACE(name, '^(\\w+) (\\w+)$', '$2, $1') AS last_first FROM employees;

ClickHouse splits this into replaceRegexpOne (first match) and replaceRegexpAll (all matches), and group references are written as \1.

SELECT replaceRegexpAll(phone, '[^0-9]', '') AS digits FROM users;

Remember: the essence is the same - a pattern, a replacement, and a global flag. Only the function name, the flag syntax, and the back-reference style (\1 versus $1) change.