Processes, stdout, pipes and buffers using Ruby
Let's say you're using Ruby and need to run a "command", which means running another process with some arguments. Easy, right?
Maybe you need it to go through a shell, or perhaps not: that's question one out of... well, more than you'd probably like. Do you want to just wait for the command to complete and then have all its output captured in a variable? Or is it better to have it run asynchronously and keep a communication channel with the child process, say through STDIN, STDOUT or STDERR?
Whatever the case, sometimes it really is as simple as: run this command to completion and give me its output in a variable. If that is the case, then this does it:
a = `ls -lh /`
That'll get you the list of files and directories as one whole string inside the variable a.
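To make the "through a shell or not" question concrete, here is a minimal sketch. It assumes a Unix-like system with an echo executable on the PATH; the variable names direct and via_shell are only illustrative. The array form of IO.popen runs the program directly, while backticks hand a string containing shell syntax to the shell.

```ruby
# No shell: the array form passes '$HOME' to echo as a literal argument.
direct = IO.popen(['echo', '$HOME'], 'r') { |io| io.read }

# Through a shell: the shell expands $HOME before echo ever runs.
via_shell = `echo $HOME`

puts direct     # the literal text $HOME
puts via_shell  # the expanded value of the HOME environment variable
```

The array form is usually the safer default when arguments come from user input, since nothing gets interpreted by a shell.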
What about a longer command, like:
a = `yes`
That is where things get interesting!
That right there just never returns.
You see, the yes command is a program that endlessly prints the string "y" followed by a line break.
Focus on the word endlessly.
A yes written in Ruby would be as simple as: loop { puts("y") }. Again: endlessly.
So let's try something non-blocking (a.k.a. asynchronous):
fd = IO.popen(['yes'], 'r')
This returns immediately. If you called fd.read next, it would block indefinitely, so let's read just the first 16 bytes of that output:
fd.read(16)
=> "y\ny\ny\ny\ny\ny\ny\ny\n"
It's important to note that this read operation has consumed some data from a buffer.
And that raises the question: if the yes command keeps running in the background (say you're running these commands in irb), where is all that content being buffered? Is it going to eat up all my RAM?
As an experiment, let's put the following Ruby code in an executable at /tmp/experiment1.rb:
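#!/usr/bin/env ruby
line = 0
loop do
  line += 1
  line_str = "L" + line.to_s.rjust(10, "0") + "\n"
  # Write the line to STDOUT (this is the call that can block).
  print line_str
  # Out-of-band record of the last line that was printed.
  File.write("/tmp/experiment1.out", line_str)
end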
(Don't forget to chmod +x /tmp/experiment1.rb if you're running this experiment.)
Then let's run the following code in an irb REPL:
fd = IO.popen(['/tmp/experiment1.rb'], 'r')
After waiting just a couple of seconds, let's check the out-of-band value at /tmp/experiment1.out:
$ cat /tmp/experiment1.out
L0000005463
It might vary from computer to computer, but that's the value I got on mine.
Running that same cat command again yields the same result.
This means the loop in the program ran until line 5463 and then stopped.
In the irb REPL, I can get the child process PID with fd.pid. Then I check with ps aux | grep <that pid number> (ps aux | grep 9179 in my case) and I can see the process is still running:
myuser 9645 0.2 0.0 410059824 48 s038 S+ 8:07PM 0:00.00 grep 9179
myuser 9179 0.0 0.0 411455488 4960 s036 S+ 8:04PM 0:00.33 ruby /tmp/experiment1.rb
What is going on here? Well, the program hasn't been killed, but it isn't really running either. It is in a kind of "paused" state. More precisely, it is trying to write to standard output (STDOUT), but that write is blocked because the buffer is full.
Since every line the program prints is the same size in bytes, we can figure out with good precision what the size of the buffer is. In the out-of-band file, the output is L0000005463. The first line the program outputs is L0000000001.
Thus, we know that it has printed at least 5463 lines of 12 bytes each (one byte per character, plus one more for the line break).
That's a total of 65556 bytes, or a little over 64 KiB. In fact, it might have written a little more than that, but not enough to return from the function call it is blocked on.
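The arithmetic above can be double-checked in a few lines of Ruby. The 64 KiB capacity used for comparison is an assumption (a common pipe buffer size), not something the experiment measured directly, and the variable names are only illustrative.

```ruby
line_bytes = "L0000005463\n".bytesize   # "L" + 10 digits + line break
lines_written = 5463                    # last line number seen in /tmp/experiment1.out
total_bytes = lines_written * line_bytes

assumed_capacity = 64 * 1024            # assumption: a 64 KiB pipe buffer

puts line_bytes                         # 12
puts total_bytes                        # 65556
puts total_bytes - assumed_capacity     # 20
```

So the observed count overshoots 64 KiB by only 20 bytes, less than two lines' worth.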
This particular program will always block at the line print line_str, because that is the line that writes to STDOUT, which is the buffer that is limited in size here. If you ran the /tmp/experiment1.rb program in a terminal, it would never stop, because the terminal would be consuming the program's output nonstop.
In that case, it's likely that your terminal app would be the one eating away at your RAM, because it would be storing all those lines.
This raises the question: what if we do consume some of the output buffer? We can do that by reading from it with fd.read(n), where n is the number of bytes we want to read. Let's read exactly one line (12 bytes) and check what's in the /tmp/experiment1.out file.
In the irb REPL:
> one_line = fd.read(12)
=> "L0000000001\n"
We just read the first line. I ran the cat command again and nothing changed. But after reading a few more lines (that is, running the above again in the REPL), it started to make progress. Reading some 20k lines with fd.read(12 * 20_000) bumped the /tmp/experiment1.out content to L0000025953. This shows us what we'd already expect: consuming the buffer makes more room in it, so more data can be written to it, and the program writing to it is unblocked until the buffer is full again.
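The fill-then-drain behavior can also be observed inside a single Ruby process, without a child program. The following is a rough sketch using IO.pipe and write_nonblock; the 1024-byte chunk size is arbitrary, and the number of bytes the pipe accepts depends on the operating system, so no particular value is promised here.

```ruby
reader, writer = IO.pipe
chunk = 'x' * 1024
written = 0

# Keep writing until the kernel reports the pipe has no more room.
loop do
  result = writer.write_nonblock(chunk, exception: false)
  break if result == :wait_writable
  written += result
end
puts "pipe accepted #{written} bytes before a write would block"

# Consume what was written, then try writing again.
reader.read(written)
freed = writer.write_nonblock(chunk, exception: false)
puts "after draining, wrote #{freed} more bytes"
```

A regular blocking write, like the print in the experiment, would simply wait at the point where write_nonblock returns :wait_writable.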
Searching for "linux pipes" on the web will surely give you more resources to learn how those work.
To close that program, just use fd.close. The program on the other end will then terminate, since it can no longer write to the closed pipe.
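As a small sketch of that sequence, using the yes command from earlier (assumed to be installed, as it is on most Unix-like systems): read a bounded amount, close the pipe, then look at the child's status. IO#close on a pipe opened with IO.popen waits for the child and sets $?.

```ruby
io = IO.popen(['yes'], 'r')
first = io.read(16)   # consume only the first 16 bytes
io.close              # closes the pipe and waits for the child to finish
status = $?

puts first.inspect
puts status.inspect   # how the child ended; the exact form varies by system
```

The block form, IO.popen(['yes'], 'r') { |io| io.read(16) }, does the same close automatically when the block returns.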
Let's now change the scenario a bit to something more common: a program that produces a large amount of data (let's say many megabytes or gigabytes) but eventually exits. Let's call it /tmp/experiment2.rb:
#!/usr/bin/env ruby
max_line = Integer(ARGV[0])
exit_code = Integer(ARGV.fetch(1, '0'))
line = 0
loop do
  line += 1
  line_str = "L" + line.to_s.rjust(10, "0") + "\n"
  print line_str
  File.write("/tmp/experiment2.out", line_str)
  break if line >= max_line
end
exit(exit_code)
The first parameter is the number of the last line it'll print and the second parameter is the code it will exit with. Let's do a test run:
$ /tmp/experiment2.rb 4 9 ; echo $?
L0000000001
L0000000002
L0000000003
L0000000004
9
$ cat /tmp/experiment2.out
L0000000004
As expected, the last printed line is number 4, and so is the line in /tmp/experiment2.out. Also, the exit code was 9, as expected.
Now let's run it from irb, with 10k lines this time:
fd = IO.popen(['/tmp/experiment2.rb', '10000', '9'], 'r')
Let's write some code to read chunks of 12 bytes and process them:
lines_read = []
loop do
  line = fd.read(12)
  break unless line
  lines_read << line.chomp
end
lines_read.length
We can check that lines_read.length == 10_000. To get the exit code, we can use $?.exitstatus once the child has been waited on, which fd.close does for a pipe opened with IO.popen.
It feels weird to me to use a global variable like $?, but that's how it's done.
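Here is a self-contained sketch of the same flow that doesn't depend on /tmp/experiment2.rb. The child is an inline Ruby one-liner (a hypothetical stand-in that prints 5 lines and exits with code 9), started through RbConfig.ruby so that the same interpreter is used. Lines are read with each_line instead of fixed 12-byte chunks.

```ruby
require 'rbconfig'

# Hypothetical stand-in for experiment2.rb: 5 lines, then exit code 9.
child_script = '5.times { |i| puts format("L%010d", i + 1) }; exit 9'

io = IO.popen([RbConfig.ruby, '-e', child_script], 'r')
lines = io.each_line(chomp: true).to_a   # read until the child closes its STDOUT
io.close                                 # waits for the child and sets $?
status = $?

puts lines.length        # 5
puts lines.last          # L0000000005
puts status.exitstatus   # 9
```

Reading by line avoids hard-coding the record size, at the cost of assuming the output is newline-delimited.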
To achieve the same using Open3.popen2, one would do:
require 'open3'
si, so, ww = Open3.popen2('/tmp/experiment2.rb', '10000', '9')
Here, si holds the child's STDIN stream, so holds its STDOUT stream, and ww is a waiter object; more on that soon.
To consume the output, let's use the following:
lines_read = []
loop do
  line = so.read(12)
  break unless line
  lines_read << line.chomp
end
lines_read.length
We can call ww.alive? to check whether the child process is still running. Before consuming the buffer above, it returns true. After the whole buffer has been consumed, it returns false.
That waiter object gives us a nicer way to probe for the exit status:
> ww.value.exitstatus
=> 9
The child process PID can be obtained with ww.pid or ww.value.pid. It works even after the program has exited.
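Putting the Open3 pieces together, here is a minimal sketch using the block form of Open3.popen2, which closes the streams when the block returns. As before, the child is a hypothetical inline Ruby one-liner standing in for /tmp/experiment2.rb, and the variable names are only illustrative.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical stand-in for experiment2.rb: 5 lines, then exit code 9.
child_script = '5.times { |i| puts format("L%010d", i + 1) }; exit 9'

lines, status = Open3.popen2(RbConfig.ruby, '-e', child_script) do |stdin, stdout, wait_thr|
  stdin.close                                   # the child reads no input
  output = stdout.each_line(chomp: true).to_a   # drain STDOUT until EOF
  [output, wait_thr.value]                      # value blocks until the child exits
end

puts lines.length        # 5
puts status.exitstatus   # 9
puts status.pid          # PID of the child, still available after exit
```

Since wait_thr.value returns a Process::Status directly, there is no need to touch the $? global here.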