Shuffling the lines of a large file

There are many methods for sampling a dataset. One simple approach is random sampling, where you shuffle the instances of a collection, here represented by the lines of a file. Depending on the size of your file, this can be done with the ‘shuf’ tool from GNU coreutils. A common usage example is given below:

shuf [input-file] > [output-file]
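A minimal sketch of this usage, assuming a GNU coreutils environment; the file names lines.txt and shuffled.txt are placeholders:

```shell
# Create a small sample file, shuffle it with shuf, and check that the
# shuffled copy contains exactly the same lines in some order.
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\n' > lines.txt
shuf lines.txt > shuffled.txt
# Sorting both files normalizes the order, so identical sorted output
# means no line was lost or duplicated by the shuffle.
sort lines.txt > a.sorted
sort shuffled.txt > b.sorted
cmp -s a.sorted b.sorted && echo "same lines"
```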

However, this tool requires the entire input to fit in memory. If your file is larger than that, you can use ‘sort’ with the ‘-R’ option, which can spill to temporary files on disk. An example of its usage is as follows:

sort -R [input-file] > [output-file]
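A small sketch of the same idea. One caveat worth knowing: GNU sort -R orders lines by a random hash of their contents, so duplicate lines end up adjacent to each other rather than being shuffled independently. The file names big.txt and shuffled.txt are placeholders:

```shell
# Generate 100 distinct lines, "shuffle" them via sort -R, and confirm
# that the output still holds the same 100 lines.
seq 1 100 > big.txt
sort -R big.txt > shuffled.txt
wc -l < shuffled.txt          # still 100 lines
# Numeric sort restores the original order, proving nothing was lost.
sort -n shuffled.txt > restored.txt
cmp -s big.txt restored.txt && echo "same lines"
```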

Note that in some environments sort’s ordering is locale-dependent and may not handle case-sensitive text the way you expect (as the sort manual highlights). If this is your case, you can run the following command before sorting:

export LC_ALL=C
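With LC_ALL=C, sort compares raw bytes, so ASCII uppercase letters sort before all lowercase ones. A small sketch, setting the variable for a single command instead of exporting it globally (sample.txt is a placeholder):

```shell
# Mixed-case input sorted under the C locale: byte-wise comparison
# puts A and B (ASCII 65, 66) before a and b (ASCII 97, 98).
printf 'b\nA\na\nB\n' > sample.txt
LC_ALL=C sort sample.txt      # prints A, B, a, b on separate lines
```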

See ya!