Text processing

Regular expressions

Regular expressions are text patterns representing a set of strings, used for string search, extraction or replacement.

Regular expressions are strings consisting of regular characters (e.g. letters or digits) and meta-characters. Meta-characters have a special meaning and make it possible to describe wider sets of strings. The meta-characters are

^ $ . + ? * { [ () | \

There are several specifications of regular expressions; the most common are

POSIX regular expressions and
Perl regular expressions.

This tutorial describes POSIX regular expressions, which are commonly used by tools on Unix-like systems.

When a meta-character should be interpreted as a regular character (ignoring its special meaning), it has to be preceded by a backslash (\).

fel\.cvut\.cz        # match exact string "fel.cvut.cz"

The most general meta-character is the dot (.), which represents any single character. The regular expression

a.c

represents all strings of length 3 starting with “a” and ending with “c”, e.g. “aac”, “abc”, “a3c” or “a_c”.
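The difference between an escaped and an unescaped dot can be checked with grep. The following is a minimal sketch; the sample strings are made up for illustration, and the -x option (match the whole line) is used so that "a.c" is tested against complete strings.

```shell
# Made-up sample input, one candidate string per line.
input='fel.cvut.cz
felXcvutYcz
abc
a_c
ac'

# Unescaped dots match any character, so both host-like lines are counted.
unescaped=$(printf '%s\n' "$input" | grep -c 'fel.cvut.cz')

# Escaped dots match only a literal ".", so only the first line is counted.
escaped=$(printf '%s\n' "$input" | grep -c 'fel\.cvut\.cz')

# "a.c" requires exactly one character between "a" and "c".
three=$(printf '%s\n' "$input" | grep -x 'a.c')

echo "unescaped: $unescaped, escaped: $escaped"
echo "$three"
```

With this input, "ac" is not matched by a.c because the dot always stands for exactly one character.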

The . meta-character is too general for most applications, and more restricted character sets are needed. Brackets may be used to specify a set of characters. The following expression represents the strings “ade”, “bde” and “cde”.

[abc]de

A bracket expression represents exactly one character from the set. Ranges, written with the “-” sign, may be used to specify wider sets. Several ranges may be combined in a single pair of brackets.

[a-z]         # any lowercase letter of the English alphabet
[0-9]         # any digit
[A-Za-z0-9]   # any alphanumeric character (except national characters)

When a “-” character should be included in the set, it must be placed where it cannot be interpreted as a range sign (e.g. as the last character of the set). A “^” character placed first after the opening bracket inverts the meaning of the set.

[^A-Za-z]    # match any character except English alphabet letters
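Bracket expressions can be tried out with grep. This is a small sketch on made-up sample words; LC_ALL=C is set as an assumption so that the range a-z means exactly the 26 ASCII lowercase letters regardless of the system locale.

```shell
# Made-up sample words, one per line.
words='ade
bde
cde
dde
a-z
x9'

# Whole-line match (-x): one character from the set, followed by "de".
set_match=$(printf '%s\n' "$words" | LC_ALL=C grep -x '[abc]de')

# Inverted set: lines containing at least one character that is not a-z.
non_lower=$(printf '%s\n' "$words" | LC_ALL=C grep '[^a-z]')

# A "-" placed last in the set is a literal dash, not a range sign.
dash_or_digit=$(printf '%s\n' "$words" | LC_ALL=C grep -c '[0-9-]')

echo "$set_match"
echo "$non_lower"
echo "$dash_or_digit"
```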

There are predefined character classes for commonly used sets, written as [[:classname:]], e.g.

[[:blank:]] space and tab
[[:space:]] white-space characters; in the ‘C’ locale: space, tab, new line, vertical tab, form feed, carriage return
[[:alpha:]] all alphabetic characters (including national characters)
[[:alnum:]] all alphanumeric characters
[[:digit:]] all digits (same as [0-9])
[[:lower:]] all lower-case letters

Quantifiers express repetition. They are postfix operators that apply to the preceding character:

a*       match character a repeated zero or more times
a\+      match character a repeated one or more times
a\?      match a single character a or the empty string
a\{3\}   match character a repeated exactly 3 times
a\{3,5\} match character a repeated 3 to 5 times

Note the backslashes preceding the +, ? and { } quantifiers. The reason is backward compatibility with older regular expressions, in which these quantifiers were not defined, so \ is used to enable the special meaning of the characters. Expressions using these quantifiers therefore contain many backslashes, which makes them harder to read and write. This led to the definition of so-called extended regular expressions, which treat the newly added quantifiers (and some other characters) as meta-characters without the preceding backslash. Many tools that use regular expressions allow switching between basic and extended regular expressions. Strictly speaking, \+ and \? in basic regular expressions are GNU extensions; POSIX basic regular expressions define only * and \{ \}.
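The difference between the two syntaxes can be observed with grep, which uses basic regular expressions by default and extended ones with -E. The sketch below assumes GNU grep behaviour for the unescaped + in a basic expression (it is treated as a literal plus sign); the sample string is made up.

```shell
sample='aaa'

# Basic syntax: an unescaped "+" is a literal plus sign, so "aaa" does not match.
# grep exits with status 1 when nothing matches, hence the "|| true".
bre_literal=$(printf '%s\n' "$sample" | grep -c 'a+' || true)

# Extended syntax: "+" is a quantifier without any backslash.
ere_plus=$(printf '%s\n' "$sample" | grep -E -c 'a+')

# The interval quantifier in both syntaxes.
bre_interval=$(printf '%s\n' "$sample" | grep -c '^a\{2,3\}$')
ere_interval=$(printf '%s\n' "$sample" | grep -E -c '^a{2,3}$')

echo "$bre_literal $ere_plus $bre_interval $ere_interval"
```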

By default, a quantifier applies to the single preceding character. It can be applied to a sub-expression by using parentheses:

[0-9]\+\(,[0-9]\+\)*      match arbitrary-length list of comma-separated numbers (no spaces allowed in list)

Parentheses also have to be preceded by a backslash in basic regular expressions. The same expression may be written as an extended regular expression as

[0-9]+(,[0-9]+)*
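As an illustration, the extended form of this expression can be wrapped in a small validation function. The function name is_number_list is hypothetical and the candidate strings are made up; -x makes grep require that the whole line matches.

```shell
# Hypothetical helper: succeeds only if the whole argument is a
# comma-separated list of numbers without spaces.
is_number_list() {
    # -E: extended syntax, -x: whole line must match, -q: status only, no output
    printf '%s\n' "$1" | grep -E -x -q '[0-9]+(,[0-9]+)*'
}

for candidate in '1,22,333' '42' '1, 2' '1,,2'; do
    if is_number_list "$candidate"; then
        echo "valid:   $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```

The quantifier * after the closing parenthesis repeats the whole group ",[0-9]+", which is why "1,,2" is rejected: every comma must be followed by at least one digit.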

Parentheses may have an additional meaning, depending on the tool used. Search-and-replace tools allow the parts of the text matched by parenthesized sub-expressions to be inserted into the replacement text. In the replacement string, a backslash followed by a number refers to the sub-expression with that index. Some search tools also allow extracting the parts of the matching string that correspond to the parentheses.
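A short sketch of such a replacement with sed, one of the search-and-replace tools that supports this notation. The input string "name=value" is made up; the -E option for extended syntax is assumed to be available (it is supported by GNU and BSD sed).

```shell
# Basic syntax: groups are written \( \), and \1, \2 refer to them in the replacement.
swapped_bre=$(printf '%s\n' 'name=value' | sed 's/\([a-z]*\)=\([a-z]*\)/\2=\1/')

# The same substitution in extended syntax (sed -E).
swapped_ere=$(printf '%s\n' 'name=value' | sed -E 's/([a-z]+)=([a-z]+)/\2=\1/')

echo "$swapped_bre"
echo "$swapped_ere"
```

Both commands swap the text on the two sides of the "=" sign.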

Alternatives may be specified using the | (pipe) character. It has to be preceded by a backslash in basic regular expressions:

color\|colour
colo\(r\|ur\)

Both expressions match the words “color” and “colour”. Note that the pipe separates the whole expressions on its left and right. Parentheses may be used to restrict the scope of the alternatives.
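The effect of the parentheses on the scope of the alternatives can be seen by counting matching lines with grep -E. The word list is made up for this sketch.

```shell
words='color
colour
colonel
our'

# Without parentheses the alternatives are "colo" and "ur",
# so every line containing either of them is counted.
loose=$(printf '%s\n' "$words" | grep -E -c 'colo|ur')

# With parentheses the alternatives are only "r" and "ur" after "colo".
strict=$(printf '%s\n' "$words" | grep -E -c 'colo(r|ur)')

echo "loose: $loose, strict: $strict"
```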

Anchors may be used to specify the beginning or end of a word or line. These meta-characters do not represent specific characters, but mark the position of a word or line boundary. The line anchors ^ and $ are defined in POSIX regular expressions; the word anchors \< and \> are a widely supported extension (e.g. in GNU tools):

\<    beginning of word
\>    ending of word
^     beginning of line
$     end of line

Examples:

\<word\>    matches "word" but not "words"
\<a[a-z]*   matches all lowercase words beginning with letter "a"
^From:      matches lines beginning with string "From:"
^[a-zA-Z0-9_]*$ matches only lines consisting entirely of alphanumeric characters and underscores. No other characters are allowed before or after the matching text.
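The anchors can be tried out with grep on made-up input. This sketch assumes a grep that supports the word anchors \< and \> (GNU grep does); the last pattern uses the * quantifier so that whole lines of any length are accepted.

```shell
text='word
words
sword
a word here'

# Word anchors: "word" must stand alone, not be part of a longer word.
whole_word=$(printf '%s\n' "$text" | grep -c '\<word\>')

# Line anchor: only lines that start with "From:".
from_lines=$(printf '%s\n' 'From: alice' 'Reply From: bob' | grep '^From:')

# Both line anchors: the whole line may contain only letters, digits and "_".
identifiers=$(printf '%s\n' 'foo_1' 'foo bar' 'x-y' | grep -c '^[a-zA-Z0-9_]*$')

echo "$whole_word"
echo "$from_lines"
echo "$identifiers"
```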

The specification of POSIX regular expressions is available in the manual pages:

man 7 regex

Regex tutorial by Mira Bursa

Standard text processing tools

bash

The bash scripting language allows testing whether a string contains a substring matching a regular expression. The syntax is

if [[ "$var" =~ a[0-9] ]] ; then ...

if [[ "$a" =~ ^[[:digit:]]+$ ]] ; then
	echo "${a} is a number"
fi

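Note the double brackets around the condition (in contrast to the single brackets used with the test command). The pattern is interpreted as an extended regular expression. Also note that the regular expression is not quoted: since bash version 3.2, quoted parts of the pattern are matched literally rather than as a regular expression.

After the test is performed, the array variable BASH_REMATCH holds the result: ${BASH_REMATCH[0]} contains the part of the tested string matching the whole expression, and the following elements contain the parts matching the parenthesized sub-expressions.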
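A minimal sketch of extracting parts of a string with BASH_REMATCH. It requires bash (not a plain POSIX sh); the sample line and the variable names are made up. Storing the pattern in a variable and expanding it unquoted is a common way to avoid quoting problems.

```shell
#!/usr/bin/env bash
line='From: alice@example.org'

# The pattern is kept in a variable and expanded WITHOUT quotes,
# so bash treats it as a regular expression.
re='^From: ([^@]+)@(.+)$'

if [[ "$line" =~ $re ]]; then
    whole=${BASH_REMATCH[0]}    # text matched by the whole expression
    user=${BASH_REMATCH[1]}     # first parenthesized sub-expression
    domain=${BASH_REMATCH[2]}   # second parenthesized sub-expression
    echo "user=$user domain=$domain"
fi
```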

grep - print lines matching a pattern

The grep tool searches the input text for lines containing the specified pattern and prints the matching lines to the output. Non-matching lines are discarded. The pattern is a regular expression. Grep processes text from a file supplied as a parameter, or from the standard input when no file is specified.

Example: (find all lines containing a valid #include directive in a C header file)

grep '^[[:space:]]*#include[[:space:]]*[<"][^>"]*[>"]' header_file.h

Example: (find all files whose name is 3 characters or longer)

ls | grep '^....*$'

Grep supports both basic regular expressions (the default) and extended regular expressions (when the -E option is used). Some other useful options are:

-i (ignore case when searching for pattern matches)
-v (invert match - print only lines not matching the pattern)
-c (print count of matching lines, instead of normal output)
-o (print only matching parts of line)
-n (prepend line number before the matching line)

It is possible to search multiple files for a single pattern with a single grep call.
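The listed options can be compared on a small made-up log excerpt. This is only a sketch; the -o option is not part of POSIX, but it is available in GNU and BSD grep.

```shell
log='INFO start
error: disk full
Error: retry 3
INFO done'

# -i -c: count lines starting with "error", ignoring case.
error_count=$(printf '%s\n' "$log" | grep -i -c '^error')

# -v: print the lines that do NOT match.
others=$(printf '%s\n' "$log" | grep -i -v '^error')

# -n: prefix each matching line with its line number.
numbered=$(printf '%s\n' "$log" | grep -n 'disk')

# -o: print only the matching part of the line.
digits=$(printf '%s\n' "$log" | grep -o '[0-9][0-9]*')

echo "$error_count"
echo "$others"
echo "$numbered"
echo "$digits"
```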

awk - pattern scanning and processing language

AWK is an interpreted programming language designed for text processing. A program in the AWK language consists of pattern-action pairs, written as

condition { action }

where condition is typically an expression and action is a series of commands. Each input line is tested against the condition, and the corresponding action is applied to the lines that match. Either the condition or the action may be omitted, in which case the default condition (matching all lines) or the default action (printing the current line) is used. Every input line is split into columns by a specified separator, which makes processing convenient. The text of the columns is accessible through the AWK built-in variables $1, $2, $3, etc., corresponding to columns 1, 2, 3, … respectively.

Example printing the owners of files in the current directory:

ls -l | awk '{ print $3 }'

Variable $0 contains the whole input line.

AWK language provides some other built-in variables:

NR (number of records) - number of the line currently being processed
NF (number of fields) - number of columns in the current line
FS (field separator) - character used for separating columns
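A brief sketch of these variables in use, on two made-up input lines. $NF is the field whose index is NF, i.e. the last column of the current line.

```shell
# Print the line number, the number of columns and the last column of each line.
summary=$(printf 'one two three\nfour five\n' |
    awk '{ print NR ": " NF " fields, last=" $NF }')

echo "$summary"
```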

AWK conditions may have the form of regular expressions:

/pattern/ { action }     # all lines matching the pattern

comparison expressions:

NR == 10 { action }    # this selects the 10th line

compound expressions:

( $3 == "string" && NR>10 ) || NR == 1 { action }
! /pattern/ { action }     # all lines not matching the pattern

Additionally, AWK provides the special conditions BEGIN and END, which allow actions to be executed before the first input line is processed or after the last line is processed.

END { print NR }    # print number of input lines
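The different kinds of conditions can be combined in short programs. The following sketch uses made-up two-column data (a name and a number); the variable name sum is arbitrary.

```shell
data='alice 10
bob 25
carol 7'

# No condition: the action runs for every line; END prints the accumulated sum.
total=$(printf '%s\n' "$data" | awk '{ sum += $2 } END { print sum }')

# Comparison as a condition: only lines whose second column exceeds 9.
big=$(printf '%s\n' "$data" | awk '$2 > 9 { print NR ": " $1 }')

# BEGIN runs before any input is read; NR > 1 skips the first line.
listing=$(printf '%s\n' "$data" | awk 'BEGIN { print "name" } NR > 1 { print $1 }')

echo "$total"
echo "$big"
echo "$listing"
```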

print is the most commonly used command in actions. Alternatively, printf may be used for formatted output (its syntax is similar to the C printf function).

Several built-in functions may be used in both conditions and actions. There are numerical functions (e.g. sqrt, sin, exp) and string functions, e.g.:

length(str)    # length of string
index(string, substring)    # index of substring in a string
substr(string, position [, length])    # extract substring from a string
match(string, regex)    # position where regex matches in string (0 if there is no match)
tolower(string)    # convert string to lower case
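The string functions can be tried on a single made-up input line. Note that positions in AWK strings are counted from 1.

```shell
result=$(echo 'Hello World' | awk '{
    print length($0)           # number of characters in the line
    print index($0, "World")   # position of the substring
    print substr($0, 1, 5)     # five characters starting at position 1
    print match($0, /o+/)      # position of the first match of the regex
    print tolower($0)          # lower-case copy of the line
}')

echo "$result"
```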

AWK supports string and numeric variables and associative arrays. It also allows defining user functions.
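A small sketch combining an associative array with a user-defined function to count word occurrences. The names bump and seen are hypothetical and the input is made up; the output is piped through sort because the order of a for-in loop over an AWK array is unspecified.

```shell
counts=$(printf 'a b a\nc a b\n' | awk '
# User function: increment the counter stored under the given word.
function bump(word) { seen[word]++ }

# For every line, count each of its fields.
{ for (i = 1; i <= NF; i++) bump($i) }

# After the last line, print every word with its count.
END { for (w in seen) print w, seen[w] }' | sort)

echo "$counts"
```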

A different field separator (the default is a sequence of white-space characters) may be specified with the -F option:

awk -F':' '{ print $1 " " $3 }' /etc/passwd

An AWK program may be written in a separate file or supplied as an argument of the awk command:

awk -f script.awk input_file.txt
awk '/pattern/ { print $0 }' input_file.txt

AWK reads input data from the file supplied as a parameter, or from the standard input when no file is specified.

awk '/pattern/ { print $0 }' input_file.txt
cat input_file.txt | awk '/pattern/ { print $0 }'

GNU awk User's Guide

sed - stream editor

sed is a powerful text transformation tool that applies a program written in its own command language to lines of input text. The input text is read from a file supplied as a parameter, or from the standard input.

Basic sed commands include

a   append text
i   insert text
q   quit the sed script
r   append text from a file
c   replace input lines with text
d   delete lines
h   store lines to a buffer (hold space)
g   copy lines from a buffer

s/regexp/replacement/    attempt to replace substrings of input matching the regexp with the replacement string
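Two of these commands, s and d, are sketched below on made-up three-line input. The g flag after the substitution is an addition not described above: without it only the first match on each line is replaced, with it all matches are.

```shell
text='one
two
three'

# s without a flag: only the first "e" on each line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/e/E/')

# s with the g flag: every "e" on each line is replaced.
all=$(printf '%s\n' "$text" | sed 's/e/E/g')

# d with a line-number address: delete the second line.
deleted=$(printf '%s\n' "$text" | sed '2d')

echo "$first_only"
echo "$all"
echo "$deleted"
```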

Most commands may be preceded by an address that determines the lines to which the command is applied. Commands without an address are applied to all input lines. The address may have the form of

number - number of line
number,number - range of lines between the specified line numbers (inclusive)
$ - last line of the input
/regexp/ - line matching the regexp

An exclamation mark (!) may be inserted between the address and the command to specify that the command should be executed only for lines not matching the address.
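The address forms can be illustrated on made-up input. Besides the d command, this sketch uses the -n option (suppress automatic printing) and the p command (print the line), which are standard sed features not listed above.

```shell
text='# comment
alpha
beta
# another
gamma'

# Regexp address: delete the lines starting with "#".
no_comments=$(printf '%s\n' "$text" | sed '/^#/d')

# Negated address: delete every line that does NOT start with "#".
only_comments=$(printf '%s\n' "$text" | sed '/^#/!d')

# Range address: print only lines 2 to 3.
range=$(printf '%s\n' "$text" | sed -n '2,3p')

# "$" address: print only the last line.
last=$(printf '%s\n' "$text" | sed -n '$p')

echo "$no_comments"
echo "$only_comments"
echo "$range"
echo "$last"
```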

It is possible to write a sed script in a separate file, or to supply it as an argument of the sed command:

sed -f sed_script input_data.txt
sed -e '1d' input_data.txt