RegEx Primer
Finding Patterns in Logs (or Other Files)
What is Regex?
Regular Expressions (regex) are sequences of characters that form search patterns. They are used for matching, searching, and manipulating text, making them incredibly useful for analyzing data, detecting patterns, and automating tasks. In cybersecurity, regex can help identify sensitive information, extract useful data from logs, and detect anomalies.
Different Regex Formats
Regex patterns come in several different formats, each suited to specific use cases:
- Basic Regular Expressions (BRE): Simple, portable expressions that match literal text or basic patterns. Use when you need simplicity without advanced matching requirements.
- Extended Regular Expressions (ERE): Adds flexibility with operators like `+`, `?`, and `{}`, useful for moderately complex patterns.
- Perl-Compatible Regular Expressions (PCRE): Highly versatile, supporting lookaheads, lookbehinds, and more. Ideal for complex patterns and advanced searches.
- POSIX Regular Expressions: Found in POSIX tools (like `awk`), with specific character classes like `[[:alnum:]]`. Choose for cross-platform consistency. Note that BRE and ERE are themselves both defined by the POSIX standard.
Basic Concepts of Regex
Literal Characters
Match exactly what you type (e.g., `abc` matches "abc").
Metacharacters
Special characters with unique functions:
- `.`: Matches any character except a newline.
- `^`: Anchors the match to the start of a line.
- `$`: Anchors the match to the end of a line.
- `\`: Escapes a metacharacter to treat it as a literal.
Character Classes
Define a set of characters:
- `[0-9]` or `\d`: Matches any digit. `\d` is a PCRE shorthand; in BRE and ERE use `[0-9]` or `[[:digit:]]` instead.
- `[a-zA-Z]`: Matches any ASCII letter (uppercase or lowercase).
Quantifiers
Define how many times an element must appear:
- `*`: Matches 0 or more times.
- `+`: Matches 1 or more times.
- `?`: Matches 0 or 1 time.
- `{n,m}`: Matches between `n` (minimum) and `m` (maximum) times.
Grouping and Capturing
Parentheses `()` group patterns and capture matched text.
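The concepts above can be tried out on a few lines of text. The following is a minimal sketch, assuming a shell with `grep -E` and `sed -E` available; the sample lines and variable names are invented for illustration, and `sed` is used here only to show a captured group being reused.

```bash
# Hypothetical sample lines, for illustration only.
sample='error 404 at /index.html
ERROR 500 at /login
warning: disk at 91%
user=alice id=7'

# ^ anchors to the line start; (a|b) groups an alternation;
# [0-9]{3} requires exactly three digits.
status_lines=$(printf '%s\n' "$sample" | grep -E '^(error|ERROR) [0-9]{3}')

# + means "one or more"; -o prints only the matched text.
percentages=$(printf '%s\n' "$sample" | grep -Eo '[0-9]+%')

# \. matches a literal dot; $ anchors to the line end.
html_lines=$(printf '%s\n' "$sample" | grep -E '\.html$')

# Parentheses capture text; \1 and \2 reuse the captures in the replacement.
swapped=$(printf '%s\n' "$sample" | sed -nE 's/^user=([a-z]+) id=([0-9]+)$/\2:\1/p')

printf '%s\n' "$status_lines" "$percentages" "$html_lines" "$swapped"
```

Each variable holds the result of one concept, so a pattern can be changed and rerun in isolation.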
Why Use Regex in Cybersecurity?
- Log Analysis: Quickly search and filter through logs to find specific events, IP addresses, error codes, or patterns.
- Data Extraction: Extract sensitive information like credit card numbers, email addresses, or phone numbers.
- Intrusion Detection: Identify patterns indicative of malicious activity, like SQL injection attempts, XSS payloads, or anomalous user behavior.
- Data Sanitization: Validate and sanitize inputs to prevent injection attacks.
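Choosing a Regex Format
Basic Regular Expressions (BRE)
When to Use:
Use BRE when working with simple patterns and in cases where compatibility with various systems is a factor.
Example:
grep -Gil 'secret\|confidential\|sensitive' /path/to/file.txt
This command uses BRE to search for "secret," "confidential," or "sensitive" in the file. The `-G` flag selects BRE, which is also grep's default; `-i` ignores case and `-l` prints only the names of matching files. In BRE, `(`, `)`, and `|` are literal characters, so alternation is written `\|`, which is a GNU extension rather than standard POSIX.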
Extended Regular Expressions (ERE)
When to Use:
Choose ERE for moderately complex patterns where you need features like `+` for "one or more," `?` for "zero or one," or `{}` for specifying occurrences.
Example:
grep -Eil '(secret|confidential|sensitive)' /path/to/file.txt
The `-E` flag enables ERE, allowing the use of `|` for alternation (like `(secret|confidential|sensitive)`), which can simplify pattern creation for moderate complexity.
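To see why the `-E` flag matters for alternation, the same pattern can be run in both modes. This is a small sketch with made-up sample text; the variable names are illustrative only.

```bash
# Hypothetical sample text, for illustration only.
sample='top secret memo
literal (secret|public) text
nothing to see'

# ERE: ( ) group and | alternates, so any line containing
# "secret" or "public" is counted.
ere_count=$(printf '%s\n' "$sample" | grep -Ec '(secret|public)')

# BRE (grep's default): ( ) and | are ordinary characters, so only the
# literal text "(secret|public)" is counted.
bre_count=$(printf '%s\n' "$sample" | grep -c '(secret|public)')

# Portable way to get alternation without ERE: one -e per alternative.
multi_count=$(printf '%s\n' "$sample" | grep -c -e 'secret' -e 'public')

echo "ERE: $ere_count  BRE: $bre_count  multiple -e: $multi_count"
```

The ERE run and the multiple `-e` run count the same lines, while the BRE run only counts the line that contains the parentheses and pipe literally.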
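Perl-Compatible Regular Expressions (PCRE)
When to Use:
PCRE is best suited for complex patterns, especially when needing advanced features like lookaheads, lookbehinds, or shorthand character classes like `\d` and `\w`.
Example:
grep -Pail '(?i)(secret|confidential|sensitive)' /path/to/file.txt
The `-P` flag enables PCRE, supporting inline case-insensitive matching with `(?i)`, along with other complex syntax. Here `(?i)` duplicates the `-i` flag, so either one alone is enough; `-a` treats binary files as text, and `-l` prints only the names of matching files. `-P` is a GNU grep option and may be unavailable in other grep implementations.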
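As a rough illustration of the lookaround features mentioned above, the sketch below uses a negative lookahead and a lookbehind. It assumes a GNU grep built with PCRE support and checks for that first; the sample lines and variable names are invented.

```bash
# Hypothetical sample lines, for illustration only.
sample='password=hunter2
password=REDACTED
token=abc123'

# grep -P may be missing from some grep builds, so probe for it first.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    # (?!REDACTED) is a negative lookahead: match "password=" only when
    # it is NOT immediately followed by "REDACTED".
    leaked=$(printf '%s\n' "$sample" | grep -P 'password=(?!REDACTED)')

    # (?<==) is a lookbehind requiring a preceding "=", and \w+ is
    # shorthand for one or more word characters.
    values=$(printf '%s\n' "$sample" | grep -oP '(?<==)\w+')

    printf '%s\n' "$leaked" "$values"
else
    echo "grep -P is not available here" >&2
fi
```

Neither lookaround consumes text, which is why `-o` prints only the value after the `=` sign.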
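POSIX Regular Expressions
When to Use:
Use POSIX-compliant tools like `awk` for cross-platform consistency with specific character classes (`[[:alnum:]]`, `[[:space:]]`). The class names `[:alnum:]` and `[:space:]` are only valid inside a bracket expression, hence the doubled brackets.
Example:
awk '/secret|confidential|sensitive/' /path/to/file.txt
This uses POSIX regex with `awk`, which applies ERE syntax and prints every matching line, making it compatible across Unix-like systems for simpler matches.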
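The POSIX character classes can be sketched in `awk` as follows. The sample data and variable names are assumptions made for the example, and very old `awk` implementations may not recognize these classes.

```bash
# Hypothetical sample lines; the third one contains a tab character.
sample=$(printf 'user42 logged in\n!!! ###\nadmin\troot\n')

# Lines that start with a run of letters or digits.
alnum_start=$(printf '%s\n' "$sample" | awk '/^[[:alnum:]]+/')

# Lines made up entirely of punctuation and whitespace.
# Several class names can share one bracket expression.
punct_only=$(printf '%s\n' "$sample" | awk '/^[[:punct:][:space:]]+$/')

# Count the lines that contain at least one digit.
digit_lines=$(printf '%s\n' "$sample" | awk '/[[:digit:]]/ {n++} END {print n+0}')

printf '%s\n' "$alnum_start" "$punct_only" "$digit_lines"
```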
grep Location Scope
It is worth mentioning that when you use `grep` with a `*` in place of the `/path/to/file` parameter, as shown below, the shell expands `*` to the non-hidden entries of the current directory, so grep searches the content of those files and does NOT include hidden files. It is not recursive and it does not search the names of the files for the pattern.
```bash
grep 'pattern' *
```
To search the content of subdirectories as well, use `-r` or `-R`.
```bash
grep -r 'pattern' *
```
- `-r`: Searches recursively through files and directories.
- `-R`: Works like `-r` but also follows symbolic links.
- `*`: As a location, does not include hidden files.
- `.`: As a location, includes the current working directory (`pwd`) as a whole, including hidden directories, but still requires `-r` or `-R` to search recursively.
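The difference between these locations is easy to check in a throwaway directory. The sketch below assumes `mktemp -d` is available and uses made-up file names; it compares `*`, `-r *`, and `-r .` against the same three files.

```bash
# Build a temporary directory tree (all names here are hypothetical).
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
echo 'pattern here' > "$workdir/visible.txt"
echo 'pattern here' > "$workdir/.hidden.txt"
echo 'pattern here' > "$workdir/sub/nested.txt"
cd "$workdir" || exit 1

# * is expanded by the shell: .hidden.txt is skipped, and without -r the
# directory "sub" is not descended into (-s hides the directory notice).
star_only=$(grep -ls 'pattern' * | sort)

# -r with * descends into sub/ but still misses the top-level hidden file.
star_recursive=$(grep -rl 'pattern' * | sort)

# -r with . starts from the directory itself, so hidden files are included.
dot_recursive=$(grep -rl 'pattern' . | sort)

printf '%s\n' '--- *' "$star_only" '--- -r *' "$star_recursive" '--- -r .' "$dot_recursive"

# Clean up the temporary tree.
cd - >/dev/null && rm -rf "$workdir"
```

Only the last form reports all three files, which is usually what a log or secrets sweep needs.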
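Common Regex Patterns for Cybersecurity
1. Credit Card Numbers
To detect credit card numbers, we can use a regex pattern that matches typical formats.
\b(?:\d[ -]*?){13,16}\b
Usage Example:
grep -P '\b(?:\d[ -]*?){13,16}\b' /path/to/logfile.log
This pattern needs PCRE (`-P`) because the non-capturing group `(?:...)`, the `\d` shorthand, and the lazy quantifier `*?` are not part of ERE; an ERE version has to use `[0-9]` and an ordinary group instead. ERE is more compatible across systems, while PCRE offers more versatility for complex needs. The pattern matches any run of 13 to 16 digits, so expect false positives such as long order or tracking numbers.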
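The credit card pattern above relies on PCRE-only syntax (`(?:...)`, `\d`, `*?`). One possible ERE rewrite is sketched below, assuming GNU grep for the `\b` word boundary; the log lines are invented, and `4111 1111 1111 1111` is a commonly used test number rather than a real card.

```bash
# Hypothetical log lines, for illustration only.
sample='paid with 4111 1111 1111 1111 today
order id 12345
card 4111-1111-1111-1111 declined'

# ERE rewrite: [0-9] replaces \d, a plain group replaces (?:...), and the
# greedy * replaces *?. \b is a GNU extension in ERE.
candidates=$(printf '%s\n' "$sample" | grep -Eo '\b([0-9][ -]*){13,16}\b')

# Strip spaces and hyphens so each candidate is digits only. The greedy
# match may include a trailing separator, which this step also removes.
normalized=$(printf '%s\n' "$candidates" | tr -d ' -')

printf '%s\n' "$normalized"
```

Normalizing the matches makes it easier to deduplicate them or pass them to a later checksum step.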
2. Email Addresses
A standard pattern for email addresses can be effective for capturing emails in logs.
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Usage Example:
grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /path/to/logfile.log
Use Extended (ERE) for straightforward matching or PCRE if you need advanced email validation.
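In practice the matches are often more useful than the matching lines. A minimal sketch, using invented log lines and the `example.com`/`example.org` placeholder domains, that extracts each distinct address once:

```bash
# Hypothetical log lines, for illustration only.
sample='login ok for alice@example.com from 10.0.0.5
bounce: bob.smith+alerts@mail.example.org rejected
login ok for alice@example.com from 10.0.0.9
no address on this line'

# -o prints only the matched address; sort -u removes duplicates.
emails=$(printf '%s\n' "$sample" \
    | grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' \
    | sort -u)

printf '%s\n' "$emails"
```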
3. IPv4 Addresses
Matching only valid IPv4 addresses requires a pattern that restricts each octet to the range 0-255.
\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b
Usage Example:
grep -Eo '\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b' /path/to/logfile.log
Extended (ERE) is often enough, but PCRE provides flexibility with shorthand for digits (`\d`) and complex patterns if needed. The `-o` flag prints only the matched addresses rather than whole lines.
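A common follow-up is counting how often each address appears. The sketch below reuses the IPv4 pattern on invented log lines and assumes GNU grep for `\b`; `ipv4`, `counts`, and `top_ip` are illustrative names.

```bash
# Hypothetical access-log lines; 999.1.2.3 is deliberately invalid.
sample='10.0.0.5 GET /login
192.168.1.20 GET /
10.0.0.5 POST /login
999.1.2.3 bogus'

# Same octet-restricted pattern as above, stored once for readability.
ipv4='\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b'

# Extract the addresses, then count occurrences, most frequent first.
counts=$(printf '%s\n' "$sample" | grep -Eo "$ipv4" | sort | uniq -c | sort -rn)

# The first row of the sorted counts is the most frequent address.
top_ip=$(printf '%s\n' "$counts" | awk 'NR == 1 {print $2}')

printf '%s\n' "$counts"
echo "most frequent: $top_ip"
```

The invalid `999.1.2.3` entry is skipped because no octet alternative accepts `999`.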
Practical Tips for Using Regex
- Testing and Debugging: Use online regex testers like regex101 to test patterns with sample data.
- Performance Considerations: Complex patterns can be slow, especially with PCRE, whose backtracking engine can degrade badly on nested or ambiguous quantifiers. Use ERE or BRE if advanced features aren’t required.
- Choose the Right Format: Use BRE for simplicity, ERE for moderate complexity, and PCRE for advanced needs.
- Sanitization and Validation: Always validate input lengths and formats, and anchor validation patterns with `^` and `$` so they must match the whole input rather than a substring, to avoid matching unintended data.