SQL Server 2025: REGEXP_SUBSTR() and REGEXP_REPLACE() (Part 7)

Introduction

SQL Server 2025 continues expanding its native regular expression capabilities with two highly practical functions:

REGEXP_SUBSTR() – extract a substring that matches a pattern
REGEXP_REPLACE() – replace substrings that match a pattern

These functions eliminate years of workarounds involving CHARINDEX, SUBSTRING, PATINDEX, XML methods, and CLR helpers. With SQL Server 2025, you can extract and transform text using concise, standards-aligned RegEx logic right inside the engine.

This article provides a full deep dive into how these functions work, how to use them correctly, and how they apply to real enterprise workloads such as data cleansing, auditing, ETL, and text normalization.


REGEXP_SUBSTR(): Extracting Substrings

Syntax
REGEXP_SUBSTR(expression, pattern [, position [, occurrence [, flags]]])

Parameters:

expression – source text
pattern – regular expression
position – optional, starting character (default 1)
occurrence – which match to return (default 1)
flags – RegEx flags such as i (case-insensitive)


Basic Example

Extract the first sequence of digits from a text value.

SELECT REGEXP_SUBSTR('Order#4529 processed', '\d+') AS Extracted;

Output:

4529

Extracting Email Domains

Professional, real-world use case.

SELECT REGEXP_SUBSTR(Email, '@(.+)$') AS Domain
FROM EmployeeDirectory;

For an email john.smith@company.com, the result is:

@company.com

Combined with SUBSTRING or an additional pattern, you can extract just company.com.


Extracting Parts of an Address

Assume standardized formats:

SELECT REGEXP_SUBSTR(Address, '\b[0-9]{5}\b') AS ZipCode
FROM CustomerRecords;

Extracting Specific Occurrence

Extract the second numeric sequence:

SELECT REGEXP_SUBSTR('A12B34C56', '\d+', 1, 2) AS SecondNumber;

Output:

34

Case-Insensitive Substring Extraction

SELECT REGEXP_SUBSTR('DepartmentSales', 'sales', 1, 1, 'i');

REGEXP_REPLACE(): Transforming Text

Syntax
REGEXP_REPLACE(expression, pattern, replacement [, flags])

Typical Uses:

• Data cleansing
• Standardizing formats
• Removing invalid characters
• Masking sensitive data
• Normalizing user input
• Converting patterns into safer or standard values


Replacing Non-Alphanumeric Characters

Enterprise input validation:

SELECT REGEXP_REPLACE(Username, '[^A-Za-z0-9]', '') AS Cleaned
FROM UserAccounts;

Removes punctuation, spaces, and symbols.


Masking Sensitive Data

Mask everything except the last 4 digits.

SELECT REGEXP_REPLACE('123456789', '\d(?=\d{4})', '*') AS Masked;

Output:

Normalize Phone Numbers

Convert formats such as:

(714) 555-1234
714-555-1234
714 555 1234
714.555.1234

into a consistent form (digits only).

SELECT REGEXP_REPLACE(Phone, '\D', '') AS NormalizedPhone
FROM Contacts;

Advanced Examples

Extract First Name and Last Name

Format: LastName, FirstName

SELECT 
    REGEXP_SUBSTR(FullName, '^[^,]+') AS LastName,
    REGEXP_SUBSTR(FullName, '(?<=,\s).*') AS FirstName
FROM Employees;

Extract Version Numbers

SELECT REGEXP_SUBSTR('SQL2025-Version15.3.9', '\d+(\.\d+)*') AS Version;

Output:

15.3.9

Clean HTML Tags

SELECT REGEXP_REPLACE('<p>Hello World</p>', '<[^>]+>', '') AS CleanText;

Output:

Hello World

Performance Notes

REGEXP_SUBSTR and REGEXP_REPLACE are scalar functions. Their performance characteristics:

• Not SARGable
• Work best with VARCHAR(n) instead of NVARCHAR(MAX)
• Optimizer caches patterns for repeated use
• Can be CPU-heavy on large text columns
• Great for cleansing during ETL or one-time cleanup batches
• Avoid using on every row of a large OLTP workload

If performance becomes a concern:

• Pre-filter with LIKE
• Apply on staging tables
• Normalize data upstream when possible


Workshop: Hands-On RegEx Extraction and Replacement

Step 1. Create sample data

CREATE TABLE StringProcessing2025 (
    Id INT IDENTITY,
    InputValue VARCHAR(300)
);

INSERT INTO StringProcessing2025 (InputValue)
VALUES
('Invoice # 4529 – Approved'),
('Client: Smith, John'),
('Phone: (714) 555-1234'),
('Order2025-Version15.4'),
('Serial: AB-2025-XYZ');

Step 2. Extract first numeric value

SELECT 
    Id,
    REGEXP_SUBSTR(InputValue, '\d+') AS ExtractedNumber
FROM StringProcessing2025;

Step 3. Extract second occurrence

SELECT 
    Id,
    REGEXP_SUBSTR(InputValue, '\d+', 1, 2)
FROM StringProcessing2025;

Step 4. Replace non-digits

SELECT 
    Id,
    REGEXP_REPLACE(InputValue, '\D', '') AS DigitsOnly
FROM StringProcessing2025;

Step 5. Standardize name format

Convert "Smith, John""John Smith":

SELECT 
    REGEXP_SUBSTR(InputValue, '(?<=,\s).+') + ' ' +
    REGEXP_SUBSTR(InputValue, '^[^,]+') AS StandardizedName
FROM StringProcessing2025
WHERE InputValue LIKE 'Client:%';

Step 6. Remove labels or prefixes

SELECT 
    REGEXP_REPLACE(InputValue, '^[A-Za-z]+:\s*', '') AS Cleaned
FROM StringProcessing2025;

Final Thoughts

REGEXP_SUBSTR and REGEXP_REPLACE significantly upgrade SQL Server’s text processing capabilities. They allow DBAs and developers to handle validation, extraction, normalization, and transformation inside the database using concise and readable code. These features reduce the need for staging ETL code, CLR functions, and external scripting languages.

Tomorrow’s post, Day 8, focuses on REGEXP_INSTR() and REGEXP_COUNT(), which provide position-based searching and match-counting capabilities.


References

• REGEXP_SUBSTR Documentation
https://learn.microsoft.com/sql/t-sql/functions/regexp-substr-transact-sql

• REGEXP_REPLACE Documentation
https://learn.microsoft.com/sql/t-sql/functions/regexp-replace-transact-sql

• Regular Expression Reference
https://learn.microsoft.com/sql/t-sql/functions/regular-expression-reference

• SQL Server 2025 New Features
https://learn.microsoft.com/sql/sql-server/what-s-new-in-sql-server-2025


Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLYARD

Subscribe now to keep reading and get access to the full archive.

Continue reading