SPAWK — SQL Powered AWK

Introduction

Based on two-way pipes, known as coprocesses in gawk's jargon, SPAWK forms a tiny library of AWK functions that make possible for AWK to communicate with MySQL/MariaDB databases. Using SPAWK, one can submit SQL queries to MySQL/MariaDB database servers from within AWK scripts and process the result rows in the same AWK script. DDL/DML SQL queries may be submitted as well. Let's take a look at what a SPAWK script may look like:

@include "spawk.awk"

BEGIN {
	spawk_sesami("chat", "xxx", "chat")
}

{
	count_relations($1)
}

function count_relations(user,		count) {
	spawk_query("SELECT `relationship` FROM `relation` WHERE `user` LIKE '" user "'")

	while (spawk_fetchrow(row, 1))
	count[row[0]]++

	print user, count["FRIEND"] + 0, count["BLOCKED"] + 0
}

The above SPAWK script includes the SPAWK library at first. In the BEGIN section, the database credentials are being set. For every input line, the first field is considered to be a user login name for which the program will select the related users and count the selected relations based on the relation kind. Then, the user login name along with the friends and blocked counts for this user will be printed.

Notice!
This is an introductory, light version of SPAWK. While you can download and run this version of SPAWK in your local system, it's strongly recommended not to do so. Just read this page as you may read any article in your favourite Linux magazine. If you find SPAWK interesting, visit the official SPAWK site at spawk.opasopa.net and download the real SPAWK following the instructions there.

A similar SPAWK script can be used to print friends and blocked counts for ALL registered users:

@include "spawk.awk"

BEGIN {
	spawk_sesami("chat", "xxx", "chat")

	spawk_query("SELECT `login` FROM `user`")

	while (spawk_fetchrow(user, 1))
	count_relations(user[0])
}

function count_relations(user,		count) {
	spawk_query("SELECT `relationship` FROM `relation` WHERE `user` = '" user "'")

	while (spawk_fetchrow(row, 1))
	count[row[0]]++

	print user, count["FRIEND"] + 0, count["BLOCKED"] + 0
}

The chat application

Assume a chat application based on a relational database holding user accounts and relations between users. Every user may create relations of type "FRIEND" or "BLOCKED" with other registered users of the application. A simple database schema for the chat database would be:

-- Application "chat" is about web chatting between application users.
-- Behind "chat" application, lies the "chat" database were user data and
-- user relationships are stored.

-- (Re)create the "chat" database.

DROP DATABASE IF EXISTS `chat`
;

CREATE DATABASE `chat`
DEFAULT CHARSET = utf8
DEFAULT COLLATE = utf8_general_ci
;

-- Select "chat" database as the default database.

USE `chat`
;

-- Table "user" is the most significant table of the "chat" database.
-- Each row represents a registered user of the "chat" application.

CREATE TABLE `user` (
	`login`		VARCHAR(64) NOT NULL COMMENT 'Login name',
	`registration`	TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Registration date',
	`name`		VARCHAR(128) NOT NULL COMMENT 'Full name',
	`email`		VARCHAR(128) NULL DEFAULT NULL COMMENT 'e-mail',
	-- 		user passwords are stored in SHA1
	`password`	CHARACTER(40) COLLATE utf8_bin NOT NULL COMMENT 'Password',

	PRIMARY KEY (
		`login`
	) USING BTREE,

	INDEX (
		`name`
	) USING BTREE,

	UNIQUE INDEX (
		`email`
	) USING BTREE
)

ENGINE = InnoDB
COMMENT = 'Person table'
;

-- Table "relation" holds the relationships between users.
-- There are three kinds of relationships between any two users:
-- The users are unrelated, friends or blocked.

CREATE TABLE `relation` (
	`user`		VARCHAR(64) NOT NULL COMMENT 'Login name',
	`related`	VARCHAR(64) NOT NULL COMMENT 'Related login name',
	`relationship`	ENUM (
		'FRIEND',
		'BLOCKED'
	) NOT NULL COMMENT 'Kind of relationship',
	`when`		TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Date of relation',

	PRIMARY KEY (
		`user`,
		`related`
	) USING BTREE,

	INDEX (
		`related`
	) USING BTREE
)

ENGINE = InnoDB
COMMENT = 'Relations table'
;

ALTER TABLE `relation` ADD FOREIGN KEY (
	`user`
) REFERENCES `user` (
	`login`
) ON UPDATE CASCADE ON DELETE CASCADE
;

ALTER TABLE `relation` ADD FOREIGN KEY (
	`related`
) REFERENCES `user` (
	`login`
) ON UPDATE CASCADE ON DELETE CASCADE
;

-- Create user for generic DQL/DML access to "chat" database

-- Warning!!!
-- Option "IF EXISTS" produces an error prior to MySQL version 5.7
-- In that case, drop the user manually and delete DROP command
-- from this file. Then run the (modified) SQL script again to
-- create the "chat" database and the user "chat".

DROP USER IF EXISTS 'chat'@'localhost'
;

CREATE USER 'chat'@'localhost' IDENTIFIED BY 'xxx'
;

GRANT SELECT, INSERT, UPDATE, DELETE ON `chat`.* TO 'chat'@'localhost'
;

Download

Creating the database

Assuming that the above database schema is stored in schema.sql file, run the following command as the root user of the database to (re)create the database:

Warning!!!
The following command will drop any already existing chat database in your system; all relevant data will be erased! If there exists a database user named chat, this user will also be dropped!

mysql -u root -p <schema.sql

Populating the database

After creating the chat database and the chat generic database user granted with DQL/DML database access, we want to populate the database with test data in order to develop programs and run various test suits for the chat application. We'll use SPAWK and PPL AWK libraries for populating the database with random data:

gawk -v ucount=1000 -v maxrel=10 -f populate.awk

After running for a couple of seconds, your chat database will be populated with 1000 user rows, and about 50000 relation rows. There exists a possibility of unique constraints violations on user insert operations; in such a case your database may contain less than 1000 user rows. We can avoid duplications writing some extra SPAWK code, but it's not worthwhile to do so.

It's time to take a look at the populate.awk SPAWK script:

@include "spawk.awk"
@include "ppl.awk"

BEGIN {
	srand(PROCINFO["pid"])
	spawk_sesami("chat", "xxx", "chat")

	# "ucount" is the number of users to be inserted.
	# "ucount" may be defined in the command line via -v option

	ucount += 0

	# After inserting users in the database, every user
	# will obtain a random number of relations with other
	# users. The number of relations to be inserted for
	# each user is random but less than "maxrel" which
	# can be defined in the command line via -v oprion.

	maxrel += 0

	relation_populate(user_populate())
}

function user_populate(			i, user) {
	nuser = 0
	while (ucount-- > 0) {
		user["login"] = ppl_login()
		user["registration"] = ppl_timestamp()
		user["name"] = ppl_name()
		user["email"] = ppl_email()
		user["password"] = ppl_string(6, 20, ppl_lower)

		if (spawk_submit("INSERT INTO `user` (`login`, `registration`, " \
			"`name`, `email`, `password`) VALUES ('" user["login"] "', " \
			"FROM_UNIXTIME(" user["registration"] "), '" user["name"] "', " \
			"'" user["email"] "', SHA1('" user["password"] "')) " \
			"ON DUPLICATE KEY UPDATE `name` = VALUES(`name`)") != 1)
			continue	# duplicates return 2, not 1

		ulist[++nuser] = user["login"]
	}

	return nuser
}

function relation_populate(nuser,		i, nrel, query) {
	for (i = 1; i <= nuser; i++) {
		nrel = ppl_integer(1, maxrel)

		query = "INSERT INTO `relation` (`user`, `related`, `relationship`) VALUES "

		while (nrel-- > 1)
		query = query relation_row(ulist[i]) ","

		query = query relation_row(ulist[i]) \
		" ON DUPLICATE KEY UPDATE `relationship` = VALUES(`relationship`)"
		spawk_submit(query)
	}
}

function relation_row(user,			related) {
	while((related = ulist[ppl_integer(1, nuser)]) == user)
	;

	return "('" user "', '" ulist[ppl_integer(1, nuser)] "', '" \
		(ppl_integer(0, 1) ? "FRIEND" : "BLOCKED") "')"
}

Download

Printing relations

After populating the chat database, it's time to make productive use of SPAWK. Our first program will print the counts of relations declared for specified users.

@include "spawk.awk"

BEGIN {
	spawk_sesami("chat", "xxx", "chat")
}

{
	delete count
	spawk_query("SELECT `relationship` FROM `relation` WHERE `related` LIKE '" $0 "'")

	while (spawk_fetchrow(relation, 1))
	count[relation[0]]++

	print $0, count["FRIEND"] + 0, count["BLOCKED"] + 0
}

Download

Run the above script and type in some user login names from your keyboard. For each name you'll get the results, that is the user login name, the count of users declared the specified user as "FRIEND", and the count of users declared the specified user as "BLOCKED":

gawk -f relpr.awk
asdasd
asdasd 9 12
sdfhww
sdfhww 3 8
gfuuzsy
gfuuzsy 19 5
...
sdfuirui
sdfuirui 2 6

Maybe the login names seem biazarre, but don't forget that this is a randomly populated database!

Anyway, we'll improve our SPAWK program adding an option to print all users instead of selected users given as input. To accomplish that, we can use an AWK variable which will be used as a flag: if this variable has a non zero value, then all users will be printed, else (default) the users login names will be read as input line by line. Choosing all as a name for this flag, our script becomes:

@include "spawk.awk"

BEGIN {
	spawk_sesami("chat", "xxx", "chat")

	if (all)
	exit(print_all_users())
}

{
	spawk_query("SELECT `login` FROM `user` WHERE `login` LIKE '" $0 "'")

	while (spawk_fetchrow(user, 1))
	print_user(user[0])
}

function print_user(user,		relation, count) {
	spawk_query("SELECT `relationship` FROM `relation` WHERE `related` = '" user "'")

	while (spawk_fetchrow(relation, 1))
	count[relation[0]]++

	print user, count["FRIEND"] + 0, count["BLOCKED"] + 0
}

function print_all_users(		user) {
	spawk_query("SELECT `login` FROM `user`")

	while (spawk_fetchrow(user, 1))
	print_user(user[0])

	return 0
}

Download

To run the above program for users supplied as input, just type:

gawk -f relprint.awk

But if you want to run the above program for all users in the database, then you must supply a non-zero value to the variable all and this can be accomplished via AWK -v option in the command line:

gawk -v all=1 -f relprint.awk

Writing application programs

It's time to enclose all of the above SPAWK stuff about printing relations in a handy shell script. In order to do so, we must take care of AWKPATH environment variable for spawk.awk to be accessible from any directory. Our application SPAWK scripts must also be located in a well known directory. We can also specify absolute pathnames for AWK scripts, but it's strongly recommended not to do so.

Let's assume that we are using /usr/local/share/awklib directory for storing system AWK libraries, e.g. spawk.awk for accessing databases, ppl.awk for producing random data etc. In this case the AWKPATH variable may be set as:

export AWKPATH=".:/usr/local/share/awklib"

Now, let's assume that our chat application uses /usr/local/apps/chat directory as a base directory. A possible directory structure under this directory would contain subdirectories as bin, database, lib etc. It's good to create an awklib subdirectory under /usr/local/apps/chat directory in order to store various AWK scripts concerning the chat application. At last, assume that the base directory of the chat application is stored in the CHAT_BASEDIR environment variable:

export CHAT_BASEDIR="/usr/local/apps/chat"

Assuming the above chat enviroment, we'll develop the relprint program in /usr/local/apps/chat/bin directory. There is no need to write any complicated C program, neither to write spaghetti SQL scripts. All we need is to move the relprint.awk SPAWK script to /usr/local/apps/chat/awk directory, and then write a trivial shell script to call AWK properly. Most of the job will be executed by SPAWK, leaving only trivial tasks for the shell to do, like parsing command line arguments etc:

#!/use/bin/bash

progname=`basename $0`
errs=

[ -z "${CHAT_BASEDIR}" ] && {
	echo "${progname}: CHAT_BASEDIR: variable not set" >&2
	errs="yes"
}

[ -z "${AWKPATH}" ] && {
	echo "${progname}: AWKPATH: variable not set" >&2
	errs="yes"
}

[ -n "${errs}" ] && exit 1

usage() {
	echo "usage: ${progname} { -a | [ files... ] }" >&2
	exit 1
}

all=0

while getopts ":a" opt
do
	case "${opt}" in
	a)
		all=1
		;;
	\?)
		echo "${progname}: ${OPTARG}: invalid option" >&2
		errs="yes"
		;;
	esac
done

[ -n "${errs}" ] && usage

shift `expr ${OPTIND} - 1`
[ ${all} -eq 1 -a $# -ne 0 ] && usage

exec gawk -v all=${all} -f "${CHAT_BASEDIR}/awklib/relprint.awk" $*

Download

Program refinement

Now that our goal has been accomplished, we can make some program refinements. To be more specific, we'll add an option for specifying the key column to be other than the user login name, that is email or full name. We'll also add some flexibility on what to print. We choose not to change the relprint program but rather to develop the relsel program and the corresponding SPAWK script:

@include "spawk.awk"

BEGIN {
	OFS = "\t"

	# Flag "all" controls the overall program's behavior on
	# how to specify users for counting their popularity.
	# If "all" has a non-zero value (default) then the program
	# will read login names, full names or emails from standard input.
	# In that case, we can specify the key column by setting "ucol"
	# variable (default 1).

	all += 0

	# Check the key (login/email/name) column. If not specified,
	# then the first column is assumed as the key column.

	ucol += 0

	if (ucol <= 0)
	ucol = 1

	# We have the option to specify users by keys other than the
	# user login name. To do so we must specify the key using the
	# "key" variable, e.g. awk -v key="email" will use user email
	# instead of user login name.

	valid_keys["login"]
	valid_keys["email"]
	valid_keys["name"]

	if (!key)
	key = "login"

	else if (!(key in valid_keys))
	fatal(key ": invalid key")

	# By default the login name and the key will be printed.
	# However, we can specify which columns to print by setting
	# the "projection" variable to a list of any user column names.
	# The list may be separated by any special character, e.g. comma,
	# colon, space etc.

	if (!projection)
	projection = (key == "login" ? "login" : "login," key)

	nprjcols = split(projection, prjcols, "[^a-zA-Z0-9_]+")

	# The user select SQL clause is fixed, so we store this in
	# the "uselect" variable, as it will be used more than once
	# in the script

	uselect = "SELECT `login`, `name`, `email` FROM `user`"

	spawk_sesami("chat", "xxx", "chat")

	if (all)
	exit(process_all())
}

NF < ucol {
	next
}

{
	spawk_query(uselect " WHERE `" key "` LIKE '" $ucol "'")

	while (spawk_fetchrow(user))
	process_user(user)
}

function process_user(user,		relation, count, i) {
	spawk_query("SELECT `relationship` FROM `relation` " \
		"WHERE `related` = '" user[1] "'")

	while (spawk_fetchrow(relation, 1))
	count[relation[0]]++

	user["login"] = user[1]
	user["name"] = user[2]
	user["email"] = user[3]

	for (i = 1; i <= nprjcols; i++)
	printf user[prjcols[i]] OFS

	print count["FRIEND"] + 0, count["BLOCKED"] + 0
}

function process_all(			user) {
	spawk_query(uselect)

	while (spawk_fetchrow(user))
	process_user(user)

	exit(0)
}

function fatal(msg) {
	print msg >"/dev/stderr"
	exit(2)
}

Download

The relsel shell script follows. It's very similar to relprint, but there are some new options and the SPAWK script passed to gawk is relsel.awk instead of relprint.awk:

#!/use/bin/bash

progname=`basename $0`
errs=

[ -z "${CHAT_BASEDIR}" ] && {
	echo "${progname}: CHAT_BASEDIR: variable not set" >&2
	errs="yes"
}

[ -z "${AWKPATH}" ] && {
	echo "${progname}: AWKPATH: variable not set" >&2
	errs="yes"
}

[ -n "${errs}" ] && exit 1

usage() {
	echo "usage: ${progname} [ -k key ] [ -p projection ] { -a | [ files... ] }" >&2
	exit 1
}

all=0
key=
projection=

while getopts ":ak:p:" opt
do
	case "${opt}" in
	a)
		all=1
		;;
	k)
		key="${OPTARG}"
		;;
	p)
		projection="${OPTARG}"
		;;
	\?)
		echo "${progname}: ${OPTARG}: invalid option" >&2
		errs="yes"
		;;
	esac
done

[ -n "${errs}" ] && usage

shift `expr ${OPTIND} - 1`
[ ${all} -eq 1 -a $# -ne 0 ] && usage

exec gawk -v all=${all} -v key="${key}" -v projection="${projection}" \
	-f "${CHAT_BASEDIR}/awklib/relsel.awk" $*

Download

Relation statistics

We'll close this introductory SPAWK course by developing one more SPAWK script to produce some interesting statistics based on our chat user relations. To be more specific, we want to calculate the number of concordant as well as the number of contradictive user relations in the database. A relation is considered to be concordant if the reverse relation exists and is of the same kind; conversely, a relation is considered as contradictive if the reverse relation exists and is of a different kind. This seems not to be a trivial task to accomplish, but SPAWK comes to our rescue:

@include "spawk.awk"

BEGIN {
	spawk_sesami("chat", "xxx", "chat")

	concordant = 0
	contradictive = 0

	spawk_query("SELECT `user`, `related`, `relationship` FROM `relation`")

	while (spawk_fetchrow(relation))
	process_relation(relation)

	# concordant and contradictive relations have been counted twice,
	# once for the "user" and once for the "related".

	print concordant / 2, contradictive / 2
}

function process_relation(relation,			reverse) {
	# skip self relations

	if (relation[1] == relation[2])
	return

	# select the reverse relation (if exists)

	spawk_query("SELECT `relationship` FROM `relation` " \
		"WHERE `user` = '" relation[2] "' AND `related` = '" relation[1] "'")

	# if the reverse relation does not exist, then do nothing

	if (!spawk_fetchone(reverse, 1))
	return

	# check for the same or different kind of the reverse relation

	if (reverse[0] == relation[3])
	concordant++

	else
	contradictive++
}

Download

Of course there exist dark corners in our program, most of them having to do with concurrency issues. For example, when relation rows are being inserted, deleted or modified while gawk process is still running, you may get weird (non integer) results in heavily used databases, as some of the concordant or contradictive relationships may not be count twice as they should. Such undesirable situations are almost impossible to avoid without imposing table locks, but doing so may lead to much more complicated situations. As it has been said by Brian Kernighan, one of the creators of AWK: Dark corners are basically fractal–no matter how much you illuminate, there’s always a smaller but darker one.

Fixing anomalies

SPAWK is capable of executing DML and DDL queries too. Based on relstats.awk, the relfix.awk script will accept the introdel flag in order to delete self relationships, that is relations where user and related is the same person. Relation statistics have also been extended to include self relationships.

@include "spawk.awk"

BEGIN {
	spawk_sesami("chat", "xxx", "chat")

	concordant = 0
	contradictive = 0

	# Non-zero "introdel" flag value causes self relationships to be delete.
	# SPAWK returns the number of affected rows on DML queries.

	introverted = introdel ? spawk_submit("DELETE FROM `relation` WHERE `user` = `related`") : 0

	spawk_query("SELECT `user`, `related`, `relationship` FROM `relation`")

	while (spawk_fetchrow(relation))
	process_relation(relation)

	# concordant and contradictive relations have been counted twice,
	# once for the "user" and once for the "related".

	print concordant / 2, contradictive / 2, introverted
}

function process_relation(relation,			reverse) {
	# skip self relations

	if (relation[1] == relation[2])
	return introverted++

	# select the reverse relation (if exists)

	spawk_query("SELECT `relationship` FROM `relation` " \
		"WHERE `user` = '" relation[2] "' AND `related` = '" relation[1] "'")

	# if the reverse relation does not exist, then do nothing

	if (!spawk_fetchone(reverse, 1))
	return

	# check for the same or different kind of the reverse relation

	if (reverse[0] == relation[3])
	concordant++

	else
	contradictive++
}

Download

To print relation statistics using relfix.awk:

gawk -f relfix.awk
146 130 22

To print relation statistics deleting any self relationships:

gawk -v introdel=1 -f relfix.awk
146 130 22

Print relation statistics once again:

gawk -f relfix.awk
146 130 0

The SPAWK library

The SPAWK library is just an AWK script that contains a BEGIN section and function definitions; you can view the code online, but it's better to download the libray and view the code locally. It's a good practice to locate the library in one of the AWKPATH directories, so you can include it in your SPAWK scripts as follows:

@include "spawk.awk"

Alternatively you can include the SPAWK library in the command line using the -f option:

awk -f spawk.awk -f your_script data_files…

However, including the library in the application AWK script is more compact and gives the reader a clear sign that database usage is involved in the sciprt.

The SPAWK API

SPAWK API consists of a small subset of SPAWK library functions that can be called from AWK scripts. Actually there are less than 10 functions in the API, each belonging to one of these three categories: authentication, query submition, fetching results, miscellaneous functions.


spawk_sesami(user, password [, database])

spawk_sesami is the one and only authentication function in the API. The parameters are straightforward, namely the database user name, the user's password and the database name (optional). These parameters are used whenever a new client is to be created. Usually spawk_sesami function is called in the BEGIN section of an AWK script.


spawk_query(query)

spawk_query submits a DQL query to the database server for execution. DQL queries are SQL queries that produce result rows; usually a DQL query starts with a SELECT statement. After query submission via spawk_query, all of the result rows must be retrieved from the client to release the client and get it ready to accept another query. Submitting another query while there exist result rows waiting to be read, causes a new client to be pushed in the SPAWK stack.


spawk_fetchrow(row [, nosplit])

spawk_fetchrow asks for the next result row to be retrieved from the client that processed the last query passed to spawk_query. If there are no more result rows to be retrieved, the function returns 0, else the result row is returned in row[0], while row[1], row[2],… row[n] are filled in with the corresponding column values and, finally, the number of columns is returned. If passed a (non zero) nosplit value, spawk_fetchrow does not split the row in columns and a value of 1 is returned on success.


spawk_fetchrest()

spawk_fetchrest is used to skip all remainig result rows.


spawk_fetchone(row [, nosplit])

spawk_fetchone function is just like spawk_fetchrow, but skips all remaining result rows. spawk_fetchone returns 0 if there's no result row to be returned.


spawk_submit(query)

spawk_submit submits a DML/DDL query to the database server for execution. Normally, no result rows willl be produced from such queries. spawk_submit returns the number of the rows affected, or 0 for DDL queries and DML queries that did not affect any rows. On failure -1 is returned.


spawk_error(msg)

spawk_error is a miscellaneous function that prints an error to the standard error.


spawk_fatal(msg, err)

spawk_fatal is just like spawk_error, but exits AWK program with exit status err.