Wednesday, January 19, 2011

How to output Unix Join results in tab delimited format

If you like the performance of Unix join command to join two sorted data set, you'll like this one. For years I've converted the join output from space delimited to tab delimited using another script as this is join's default output format. Blogs here by GeekBrainDump, and JJinuxLand: UNIX: "join" show you the syntax and the way to insert a tab at the command line. Here is a summary:

For a left outer join with output in tab-delimited format:
join -a 1 -1 2 -2 2 -t '[use ctrl-v tab at the command line to insert the tab]' 1.txt 2.txt

Join is a very powerful command, and much faster than SQL join, there is a great tutorial of different ways of using the command:
Examples of the unix join command
Given the files:
::::::::::::::
1.txt
::::::::::::::
1 abc
2 lmn
3 pqr
::::::::::::::
2.txt
::::::::::::::
1 abc
3 lmn
9 opq

if we type:
join 1.txt 2.txtWe get
1 abc abc
3 pqr lmn
since the two files are trying to match on the 1st column, and both have a 1 and a 3 in that column. To tell join to use the second column to join with, we type:
join -1 2 -2 2 1.txt 2.txt (where "-1 2" stands for the 2nd filed of the 1st file, and "-2 2" stands for the 2nd field of the 2nd file) we get:
abc 1 1
lmn 2 3
which are just those fields that match in both files. This is refered to as an inner join. Inner joins look for rows that match rows in the other table. The problem with inner joins is that only rows that match between tables are returned.
If we type:
join -a1 -1 2 -2 2 1.txt 2.txt (where "-a1" says include all the records from the first file) we get:
abc 1 1
lmn 2 3
pqr 3
this is missing a number for pqr for the second file (since there is no pqr in that file) and is missing "9 opq" from the second file. This is an example of a left outer join, called such because it includes all of the rows from the left (or first) file specified.
If we type:
join -a1 -a2 -1 2 -2 2 1.txt 2.txt (adding "-a2" to tell join to also include all records from the second file) we get:
abc 1 1
lmn 2 3
opq 9
pqr 3
Which has all of the records. This is an example of a full outer join since it has all of the rows from both files. (Missing from these examples is the case where we had the -a2 without the -a1. That would have produced a right outer join which contained all of the records from the second (right) file and only those rows from the first (left) file that matched).
While this example has all of the rows from both files, we still have a problem since we can not tell which file the count is for on the last two records.
We now need to format the output using the "-o" command. The options for "-o" are as follows:
A "0" (that.s a zero) means display the join field
A number in the format of X.Y means to display the Y field from the X file (ex 2.1 means display the first field from the second file).
-e "0" says to replace any missing data fields with whatever is in-between the quotes (in this case a zero).
So if we type:
join -a1 -a2 -1 2 -2 2 -o 0 1.1 2.1 -e "0" 1.txt 2.txtwe get:
abc 1 1
lmn 2 3
opq 0 9
pqr 3 0
which has all of the records, and fills in the zeros for us. 

No comments:

Post a Comment