Sunday, August 17, 2014

The command "tee -" Considered Harmful

I spent much time debugging a corrupted buffer problem and thought the learning would be useful to a larger audience.

Note: I was debugging someone else script ! :-)

tl:dr: if you are using "tee -", you are better to know what you are doing.  If you have doubts, read on.

The script I was debugging is sending data over the network to a logging system.  That logging system is composed of many forwarder, splitter, sharder and so on.  Sending data to this system is easy, it only needs to be piped to netcat:
  • <data_producer> | nc <hostname> <port>
The problem was that we were receiving corrupted data at the other end of the logging system.  We already had some problem with this system because of bad handling of read/write buffers.  We thought at first it was a similar problem, but we were wrong.

The problem was on the producer side.  To keep a local copy of the data, a tee was introduced before sending the data to the network.  What should have looked like:
  • <data_producer> | tee <local_file> | nc <hostname> <port>
was written as (notice the "-" as the 1st argument of tee):
  • <data_producer> | tee - <local_file> | nc <hostname> <port>

From the tee manual, we can read:

       tee - read from standard input and write to standard output and files

       tee [OPTION]... [FILE]...

       If a FILE is -, copy again to standard output.

Wow ! we can use tee to duplicate its input.  Let's test this:

$ echo "Hello World" | tee -
Hello World
Hello World

But is it exactly doing what we want, let's try something else:

$ printf "Hello\nWorld\n" | tee -

Hum, it looks like tee outputs its input twice (one time after the other), not every line twice.  Will it always do that ?  Let's try something else:

$ ( echo Hello; sleep 1; echo World; ) | tee -

There is a pause in the output between Hello and World above, but more importantly, the output of tee is dependent of the timing of its input !  Weird, but we can get a sense of why.  Let's go further:

# print 50 lines of 100 chars each.
for n in $(seq 0 50); do
  # print the line number on 9 chars.
  printf "%-9d" $n
  # fill the line up to 99 chars.
  for i in $(seq 1 9); do
     printf "%-10d" $i
  echo # the last char is a \n.
done | tr " " "." | tee - 

I tested the above with 3 different systems obtaining 2 different results (the results are too long to put here, you can follow the links for the details):
On Cygwin, the first interesting line is starting with 10.  On openSUSE and on CentOS, it is at the line starting with 40.  If you do not get mangled output while testing, maybe it will be on the line starting with 81 or further (use 100 or more instead of 50 in the 1st for).

10.......1.........2....0........1.........2 ... 
                        ^ 1st wrong char at pos 1025

40.......1.. ... ...8.........9......0........1 ...
          1st wrong char at pos 4097 ^ 

So, what is happening here: we get mangled output and the 1st corrupted line ends like the 1st line of the input.  How is that possible ?  The answer lies in the way tee is implemented.  We can guess that:
  • tee calls read on stdin with a pre-allocated buffer (1024 bytes in CygWin and 4096 bytes in openSUSE and CentOS),
  • then it writes the buffer to stdout and to all the files passed as arguments.

So, if "-" is passed as an argument to tee, the buffer is written twice to stdout.  When the read buffer ends in the middle of a line, we will get output corruption.

I do not exactly know where the buffering happens (in the output of tr, in the pipe itself, or in the input of tee), but I do not think this is of capital importance.  The important thing to remember is that "tee -" gives unpredictable results.  I was not able to find a good use-case for this unpredictable command, if you are, please write it in the comments.

Coming back to the initial problem, the "-" was an error in the script, it should never has been used.  If you are using "tee -", think back at what you want to do:
  • if you want to print every line twice, you can use "... | sed -e p"
  • and if you want to print the complete input twice you can use "... | sed -e '1h;2,$H;$G'" (if the input does not fit in RAM, you can use "... | tee <tmp_file>; cat <tmp_file>").

PS: I found in the Single UNIX Specification v4 (registration required) that, in "tee -", the "-" should be treated as a file, not a stdout.  This was not specified in v3: it looks like I am not the first to hit this problem.

No comments:

Post a Comment