Shuffling the lines of a large file

There are many methods for sampling a dataset. One simple approach is random sampling, where you shuffle the instances of a collection, here represented by the lines of a file. Depending on the size of your file, this can be done with the 'shuf' tool provided by Linux coreutils. A common usage example is given below:

shuf [input-file] > [output-file]

However, this tool requires the whole input to fit in memory. If your file is larger than the available memory, you can use the Linux 'sort' command with the '-R' parameter. An example of its usage is as follows:

sort -R [input-file] > [output-file]

Recall that in some environments sort does not behave as expected with case-sensitive text (as highlighted in the sort manual). If this is your case, you can execute the following command before sorting:

export LC_ALL=C
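
As a quick sketch of the workflow (file names are illustrative), note that shuf can also draw a random sample directly with -n, which avoids shuffling the whole file when you only need a few lines:

```shell
# Toy input: a 10-line file
seq 1 10 > lines.txt
# Full shuffle: same lines, new order
shuf lines.txt > shuffled.txt
# Random sample of 3 lines, no full shuffle needed
shuf -n 3 lines.txt
```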

See ya!

Using mirrors in SED

It took me a long time to discover mirrors (backreferences) in sed. This is a feature that can help a lot in many replacement problems. Mirrors are denoted by \1, \2, \3, ..., \9 and are used to capture a part of the input string so you can reuse it in the replacement string, like a temporary variable. To use mirrors with the syntax below, you need extended regular expressions (flag -r). Let's see some applications:

Changing a date format:

$ echo 12/31/2004 | sed -r 's@([0-9][0-9])/([0-9]{2})/([0-9]{4})@\1.\2.\3@'
$ echo 12/31/2004 | sed -r 's@([0-9][0-9])/([0-9]{2})/([0-9]{4})@\2.\1.\3@'
$ echo 12/31/2004 | sed -r 's@([0-9][0-9])/([0-9]{2})/([0-9]{4})@\2-\1-\3@'

Inverting field order:

$ echo abcdefg | sed -r 's/(a)(b)(c)/\3:\2:\1/'
$ echo abcdefg | sed -r 's/(a)(b)(c).*/\1:\2:\3/'
$ echo abcdefg | sed -r 's/(a)(b)(c).*/\3:\2:\1/'
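
To make the captured groups concrete, here is a short sketch (outputs checked with GNU sed; the name-swap example is an extra illustration, not from the list above):

```shell
# Reorder a date: \1=month, \2=day, \3=year
echo 12/31/2004 | sed -r 's@([0-9]{2})/([0-9]{2})/([0-9]{4})@\2.\1.\3@'
# prints 31.12.2004

# Swap two words: \1 and \2 hold the captured fields
echo "john smith" | sed -r 's/([a-z]+) ([a-z]+)/\2 \1/'
# prints smith john
```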

See ya!

How to generate a bash script with an embedded tar.gz (self-extract)

Consider that you need to perform a routine on a remote server, where you need to decompress a tar.gz and execute a list of commands on that data. One alternative is to send the tar.gz file to the remote server through ftp or scp, log in, and then run a shell script or type the commands manually. Recall the Java JRE setup: it uses a script.bin that comes with an embedded tar.gz, which is self-extracted at the beginning of the script execution. To build the self-extracting script I followed a tutorial published by Stuart Wells, which consists of four steps:

1) Create/identify a tar.gz file that you wish to become self-extracting.

2) Create the self-extracting script (saved here as extract.sh; the name is just an example). A sample script is shown below:

#!/bin/bash
echo "Extracting file into `pwd`"
# Search for the line number where the script ends and the tar.gz begins
SKIP=`awk '/^__TARFILE_FOLLOWS__/ { print NR + 1; exit 0; }' $0`
# Remember our file name
THIS=$0
# Take the tarfile and pipe it into tar
tail -n +$SKIP $THIS | tar -xzf -
# Any commands placed here will run after the tar file is extracted.
echo "Finished"
exit 0
# NOTE: Don't place any newline characters after the last line below.
__TARFILE_FOLLOWS__

3) Concatenate the script and the tar file together (again, file names are illustrative):

> cat extract.sh example.tar.gz > selfextract.sh
> chmod +x selfextract.sh

4) Now test it in another directory:

> cp selfextract.sh /tmp
> cd /tmp
> ./selfextract.sh
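
Putting the four steps together, here is a minimal end-to-end sketch you can run in an empty directory (all file names are illustrative; the extractor is a stripped-down version of the sample script above):

```shell
# 1) Build a toy payload and archive it
mkdir -p payload
echo "hello" > payload/data.txt
tar -czf example.tar.gz payload

# 2) Write a minimal extractor stub; the marker must be its last line
cat > extract.sh <<'EOF'
#!/bin/bash
SKIP=`awk '/^__TARFILE_FOLLOWS__/ { print NR + 1; exit 0; }' $0`
tail -n +$SKIP $0 | tar -xzf -
exit 0
__TARFILE_FOLLOWS__
EOF

# 3) Append the archive and make the result executable
cat extract.sh example.tar.gz > selfextract.sh
chmod +x selfextract.sh

# 4) Run it elsewhere: it unpacks payload/ into the current directory
mkdir -p /tmp/sfx-test && cp selfextract.sh /tmp/sfx-test && cd /tmp/sfx-test
./selfextract.sh
cat payload/data.txt   # prints hello
```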

See ya!

Running a matlab script in command line (misleading .m)

The syntax to run a matlab script in batch mode from command line is:

matlab -nodesktop -nosplash -r [command]

Since I'm accustomed to passing a file name as an argument to several Unix tools, I got confused when executing a script in Matlab. In Matlab we should pass a command, not a file name, as the argument. So to run a script from the command line, first ensure that your script is in the Matlab path (or in your current directory). Then remember that to call a Matlab script (a file with several commands and no function definition) you just use the script file name without the .m extension. So if your script is named foo.m, you should call it as follows:

matlab -nodesktop -nosplash -r foo
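
If the script should run unattended (in a batch job, for example), a common pattern is to append exit so Matlab quits when the script finishes; here foo is still a stand-in for your own script name:

```shell
matlab -nodesktop -nosplash -r "foo; exit"
```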

See ya!

pssh: Execute a command on multiple hosts on terminal

If you deal with the task of managing multiple machines on a local network, you often need to apply a given setting to all hosts, or install a new program across the entire network. We've posted before about a tool called cluster-ssh, which handles this via a graphical interface (or by exporting X). However, if you cannot export X, you can use pssh, a tool that dispatches a command to multiple hosts from the terminal. It can be found in the Ubuntu repositories, but in Karmic Koala it does not work: the installation completes, but no binary is created in /bin, /usr/bin, or /usr/local/bin. The workaround is presented below:

sudo add-apt-repository ppa:thelupine/ppa
sudo apt-get update
sudo apt-get install pssh

After installing it, you can check, for instance, who is logged into each machine, what the load on each machine is, or which devices are mounted. First you need to create a file listing all hosts, one per line (the addresses below are illustrative):

### hosts.txt ###
192.168.0.101
192.168.0.102
192.168.0.103

Then you can dispatch the command to be distributed; the -i flag prints each host's output inline:

pssh -h hosts.txt -i "df -h"
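
A couple of further sketches (pssh must be installed and the hosts reachable via key-based ssh; the user name is illustrative):

```shell
# Check the load on every host
pssh -h hosts.txt -i "uptime"
# Run as a specific remote user, with a 10-second timeout
pssh -h hosts.txt -l admin -t 10 -i "who"
```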

See ya!