Shuffling the lines of a large file

There are many methods for sampling a dataset. One simple approach is random sampling, where you shuffle the instances of a collection, here represented by the lines of a file. If your file fits in memory, this can be done with the ‘shuf’ tool provided by GNU coreutils. A common usage example is given below:

shuf [input-file] > [output-file]
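As a quick sanity check, here is a minimal sketch (assuming GNU coreutils and hypothetical file names input.txt and output.txt) that shuffles a small file and verifies that no lines were lost or changed:

```shell
# Generate 10 numbered lines, then shuffle them.
seq 10 > input.txt
shuf input.txt > output.txt

# The shuffled file has the same line count, and re-sorting it
# numerically restores the original content exactly.
wc -l < output.txt                  # 10
sort -n output.txt | diff - input.txt && echo "same lines"
```

Only the order changes; the multiset of lines is preserved.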

However, this tool requires the entire input to fit in memory. If your file is larger than the available memory, you can use the GNU ‘sort’ tool with the ‘-R’ option, which uses external sorting and therefore handles files of any size. An example of its usage is as follows:

sort -R [input-file] > [output-file]
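One caveat worth knowing: with GNU sort, ‘-R’ does not produce a uniform shuffle the way ‘shuf’ does. It sorts lines by a random hash of each key, so identical lines hash to the same value and always end up adjacent in the output. A small sketch of this behavior:

```shell
# sort -R permutes lines by sorting on a random hash of each key,
# so duplicate lines land next to each other in every run.
printf 'a\nb\na\nc\na\n' | sort -R
# the three "a" lines appear consecutively, in some random position
```

If your data contains many duplicate lines and you need them spread out, ‘shuf’ is the safer choice when memory allows.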

Note that in some environments ‘sort’ may behave unexpectedly on case-sensitive text, because the default locale affects how lines are compared (as highlighted in the sort manual). If this is your case, you can execute the following command before sorting:

export LC_ALL=C
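If you prefer not to change the locale for your whole shell session, you can also scope the override to the single command. A minimal sketch, again with hypothetical file names:

```shell
# Set LC_ALL only for this one invocation of sort, leaving the
# shell's locale untouched for everything else.
LC_ALL=C sort -R input.txt > output.txt
```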

See ya!
