When working with a large database, some SELECT queries can take a really long time. And “really long” starts at a minute or two, of course. When designing queries, one often limits the result set because one is not yet interested in the final data.

However, you’ll want to have the result eventually. In the meantime, you probably want to play around further with your queries. But you don’t want to keep your current database connection (say, the GUI used for query design) busy. I happen to be in this situation quite often and have written a little script which allows me to run the queries in the background.

I had the following requirements:

  • Flexible parametrization
  • Measure execution time
  • Keep results
  • Make it easy to reproduce the data

The script works as follows: one provides a file containing the query to be executed. The input file gets copied (for traceability/reproducibility) and executed, the execution time gets logged, and one can provide further arguments such as the hostname or user/password, which are passed on to mysql.

After a couple of iterations, below is what the final script looks like. It is tied to MySQL, but it can of course be adapted to your needs; a sketch for PostgreSQL follows the script.

#!/bin/bash

QUERYFILE=$1 # SQL file to execute
OUTDIR=$2 # existing directory that will hold the output files
ARGS=("${@:3}") # everything but the first 2 arguments; passed on to mysql
if [[ "$#" -lt 2 || ! -r "$QUERYFILE" || ! -d "$OUTDIR" ]] ; then
    echo "Usage: $0 /path/to/query.sql /path/to/existing/outdir [mysql arguments]"
    exit 1
fi
# Timestamped prefix shared by all output files of this invocation.
OUTFILEPREFIX=$OUTDIR/$(basename "${QUERYFILE}")_$(date +%Y%m%d_%H%M%S)

echo "Query:"
cat "$QUERYFILE"
echo "Results: $OUTFILEPREFIX*"
echo "Mysql: ${ARGS[*]}"

# Redirect all output to the log file.
# https://serverfault.com/a/103569/234378
exec 3>&1 4>&2 # save the original stdout/stderr on fds 3 and 4
trap 'exec 2>&4 1>&3' 0 1 2 3 # restore them on exit or signal
exec 1>"$OUTFILEPREFIX.log" 2>&1 # everything from here on goes to the log

set -x # be verbose
cp "$QUERYFILE" "$OUTFILEPREFIX.sql" # copy input file
date
time mysql "${ARGS[@]}" < "${QUERYFILE}" > "${OUTFILEPREFIX}.tsv" # run the query
date
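The only MySQL-specific bit is that final mysql invocation, so adapting the script mostly means swapping one line. As a rough sketch, the equivalent step for PostgreSQL might look like this (assuming psql, whose -A and -F flags produce unaligned, tab-separated output; untested here):

time psql "${ARGS[@]}" -A -F $'\t' -f "${QUERYFILE}" -o "${OUTFILEPREFIX}.tsv" # run the query via psql instead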


Let me explain the different steps in more detail. The only mandatory arguments are the input query file and the output directory. The output directory will contain a copy of the input file, the log file and the results.

The output files are timestamped in order to differentiate between several invocations that use the same (possibly modified) input file. Before running the query, the input parameters are printed to the screen for visual inspection. Then, all further output is redirected to the log file and the input file gets copied. Finally, the query is invoked. The date and time invocations make it possible to study the timings retrospectively.
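The redirection part deserves a closer look. Reduced to a minimal standalone sketch (out.log is just a placeholder name), the idiom from the linked answer works like this:

exec 3>&1 4>&2 # duplicate the original stdout/stderr to file descriptors 3 and 4
trap 'exec 2>&4 1>&3' 0 # restore them when the script exits
exec 1>out.log 2>&1 # from here on, all output goes to out.log
echo "this line ends up in out.log, not on the terminal"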

Typically, the script is then executed like so:

# Input file.
cat query.sql
use mydatabase;
SELECT * FROM mytable;

# Execute the query at server dbhost using username dbuser and providing a
# password at the prompt.
$SCRIPTS/runsql.sh query.sql /outdir -h dbhost -u dbuser -p

# Possible results.
ls -1 /outdir
query.sql_20190322_140308.log
query.sql_20190322_140308.sql
query.sql_20190322_140308.tsv

The input file gets copied to the output .sql file. The output .tsv file will contain the tab-separated result data. If any errors occurred, they can be found in the .log file. In order to see them immediately, I usually run

# Execute the script and show the latest (tail -n 1) log file immediately
# afterwards.
Q=query.sql ; O=/outdir ; $SCRIPTS/runsql.sh "$Q" "$O" ; cat "$(ls -1 "$O/$(basename "$Q")"*log | tail -n 1)"
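Since the results are plain tab-separated text, a quick way to eyeball them is to align the columns first; a sketch using the standard column and less utilities:

column -t -s $'\t' < /outdir/query.sql_20190322_140308.tsv | less -S # -S disables line wrapping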

I use the script regularly to execute a long-running query in a new tmux pane. Sometimes, I even forget about the query. Next day at work, I happily find all the results I need!
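In case you want to do the same, spawning the script in a detached pane can be as simple as this sketch (split-window -d keeps the focus in the current pane; the interactive -p password prompt is left out):

tmux split-window -d "$SCRIPTS/runsql.sh query.sql /outdir -h dbhost -u dbuser" # run in a new pane without switching to it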