Overview
Occasionally I find myself wanting to use a python function in a bash script. This is
fairly easy to do with python's -c
option and a heredoc, but can come with a small
initialisation overhead, especially if the python script needs to import any modules.
This is fine if it is only executed once within the bash script, but may present a
significant performance hit if run in a loop.
Instead of giving up and rewriting the whole script in python itself, we can use Linux named pipes to pipe to and from a single background instance of the python script.
This may be best described with an example. For the sake of demonstration, the python script just prints out the input it receives; you would obviously perform whatever python transformation you require.
Naive approach
naive.sh
# Sample data
arr=(apple orange pear strawberry raspberry blueberry grape banana)
# The main loop
for x in "${arr[@]}"; do
result="$(echo $x | python -c "
val = input()
# Perform python processing on val
print('python: ' + val)
")"
# Perform shell processing on $result
echo $result
done
When I time this with time ./naive.sh
, it takes about 0.2s real time to complete: this
is incredibly slow when all we are doing is printing 8 values! Because the overhead is
incurred on every iteration, the script will slow down linearly with the size of the data array.
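Most of that 0.2s is interpreter start-up cost, which you can measure in isolation. A rough check (the absolute numbers depend on your machine; only the comparison matters):

```shell
# Start-up cost of a bare interpreter vs. one that imports a module.
time python -c 'pass'
time python -c 'import json'
```

Every iteration of the naive loop pays this start-up cost again, which is exactly what the named pipe approach below avoids.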
Named pipe approach
fifos.sh
# Sample data
arr=(apple orange pear strawberry raspberry blueberry grape banana)
# Create temp file names
in_pipe="$(mktemp -u)"
out_pipe="$(mktemp -u)"
# Create input/output pipes
mkfifo $in_pipe
mkfifo $out_pipe
# Run python with -u to prevent output buffering
python -uc "
import sys
# Loop stdin, strip trailing newlines
for val in map(str.rstrip, sys.stdin):
    # Perform processing on val
    print('python: ' + val)
" <$in_pipe >$out_pipe &
# ^ Redirect stdin and stdout to pipes
# Hold pipes open
exec 3<>$in_pipe
exec 4<>$out_pipe
# The main loop
for x in "${arr[@]}"; do
# Write to input pipe
echo $x >$in_pipe
# Read from output pipe
read result <$out_pipe
# perform shell processing on $result
echo $result
done
# Cleanup: close pipes
exec 3>&-
exec 4>&-
# Cleanup: remove fifo files
rm $in_pipe
rm $out_pipe
When I time fifos.sh as a whole, it completes in 0.035s real time: quite an
improvement. You may think that 0.035s is still too long to print 8 values, but this
includes the one-off overhead of initialising the python script. Crucially,
that delay won't scale with the size of the input array. If I time just the main
loop (after the python script is already set up), it completes in 0.001s real time.
Explanation
Create named pipes
in_pipe="$(mktemp -u)"
mkfifo $in_pipe
We start by creating two named pipes to serve as the input/output to our python script.
The -u
option to mktemp
specifies that a name should be generated, but no file
created. The mkfifo
command then creates a named pipe at the specified location. You
can use hardcoded names in place of the mktemp
command, but you may not wish the pipe
files to clutter your working directory, and there may be a problem if files with those
names already exist.
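A quick way to see the two steps in isolation (the actual path is whatever mktemp generates, typically under /tmp):

```shell
pipe="$(mktemp -u)"   # generate an unused name; no file is created yet
mkfifo "$pipe"        # create a named pipe at that path
ls -l "$pipe"         # the leading 'p' in the mode column marks a FIFO
rm "$pipe"
```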
Prevent python from buffering
python -uc "
The -u
option is important as it makes python's output unbuffered;
without it, python buffers data from print
before flushing it
to the output, so values would not be available at the time we want to read them.
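If you would rather not disable buffering globally, an alternative (not used in the script above) is to flush each print explicitly. A minimal check using a plain pipe:

```shell
printf 'apple\norange\n' | python -c "
import sys
for val in map(str.rstrip, sys.stdin):
    print('python: ' + val, flush=True)  # flush=True replaces the need for -u
"
```

This prints python: apple and python: orange; either -u or flush=True ensures each value leaves python's buffer as soon as it is produced.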
Instruct python to iterate stdin
for val in map(str.rstrip, sys.stdin):
Because we will repeatedly be piping into the same python instance, we need a loop
inside our python script that will execute the same instructions for every input.
We can loop over sys.stdin
, which is an iterable of input lines (delimited by a
newline character). We likely don't want the newline character that appears at the end,
so use map(str.rstrip, ...)
to strip it (note that rstrip also removes any other trailing whitespace).
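You can see why the rstrip is needed: iterating sys.stdin yields each line with its newline still attached, as repr makes visible:

```shell
printf 'a\nb\n' | python -c "
import sys
for line in sys.stdin:
    print(repr(line))   # each raw line still ends in a newline character
"
```

This prints 'a\n' and 'b\n'; without the rstrip, those trailing newlines would end up inside our processed values.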
Redirect python's stdin/stdout and send to background
<$in_pipe >$out_pipe &
<$in_pipe
and >$out_pipe
replace stdin/stdout respectively with our input and output
pipes, and &
on the end sends the process to the background.
Keep the in/out pipes open
exec 3<>$in_pipe
exec 4<>$out_pipe
These commands are probably the least obvious, but they are critical. If you want to
understand these in depth you should research bash redirections, but I will attempt a
TLDR.
Our sys.stdin
python loop will continue iterating until it reaches end of file (EOF).
Linux signals EOF on a pipe once all of its writers have closed it.
We need to keep the input pipe open until we are done, otherwise EOF
would be signalled after our
first write to the pipe. The command exec 3<>$in_pipe
opens the input pipe read-write and assigns
it to the shell's file descriptor number 3, keeping it open until we explicitly
close it later.
The output pipe is a similar story: each read in the main loop opens the pipe and
closes it again, and once the last reader closes a pipe, python's next write would
fail with a broken pipe. We assign it to file descriptor 4 to hold it open.
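The effect is easy to demonstrate in isolation. Without the exec line below, cat would see EOF as soon as the first echo closed the pipe; with it, cat keeps reading until we close fd 3 (a sketch reusing the same mktemp/mkfifo pattern):

```shell
pipe="$(mktemp -u)"
mkfifo "$pipe"
cat "$pipe" &          # reader in the background
exec 3<>"$pipe"        # hold the pipe open on fd 3
echo first >"$pipe"    # each echo opens, writes and closes the pipe...
echo second >"$pipe"   # ...but fd 3 still counts as an open writer
exec 3>&-              # last writer closed: cat sees EOF and exits
wait
rm "$pipe"
```

cat prints both first and second, and only exits once fd 3 is closed.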
Main loop: write
echo $x >$in_pipe
Within our main loop: rather than echoing directly to python as in the naive example, we write each input to the input pipe attached to python's stdin.
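Because fd 3 already holds the pipe open read-write, an equivalent form (not used in fifos.sh) writes through the descriptor itself, and bash's read -u can even read back from it. A self-contained sketch:

```shell
pipe="$(mktemp -u)"
mkfifo "$pipe"
exec 3<>"$pipe"     # open read-write on fd 3
echo apple >&3      # write through the held-open descriptor
read -u 3 result    # bash: read a line back from fd 3
echo "$result"      # prints: apple
exec 3>&-
rm "$pipe"
```

Writing to the path, as fifos.sh does, and writing to the descriptor both land in the same pipe buffer.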
Main loop: read
read result <$out_pipe
Within our main loop: read output values from our $out_pipe
, which is where we
instructed python to write its results.
Cleanup: close the input pipe
exec 3>&-
Remember that python will keep iterating until its stdin
closes, but we have held
stdin
open by assigning it to fd 3. This command closes that file descriptor, which
causes python's sys.stdin
to reach EOF. The python for
loop exits, and the background python process terminates.
If your python script needs any cleanup, just insert it after the for val in map(str.rstrip, sys.stdin):
loop; it will run as normal when the loop finishes, before
python exits.
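Putting it together, closing the write end really does let code after the loop run. A minimal end-to-end sketch (the cleanup print stands in for whatever teardown your script needs):

```shell
pipe="$(mktemp -u)"
mkfifo "$pipe"
python -uc "
import sys
for val in map(str.rstrip, sys.stdin):
    print('python: ' + val)
print('cleanup done')        # runs after stdin reaches EOF
" <"$pipe" &
exec 3<>"$pipe"
echo apple >"$pipe"
exec 3>&-      # close the last writer: python's loop ends
wait           # by now python has printed its result and the cleanup line
rm "$pipe"
```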
Cleanup: other
exec 4>&-
rm $in_pipe
rm $out_pipe
Keep everything tidy by closing file descriptor 4 and removing our temporary pipe files. We're all done!