File Manipulation: sort and uniq

What are sort and uniq?

Ordering and manipulating data in text files on Linux can be carried out using the sort and uniq utilities. The sort command orders the lines of a file, alphabetically by default and numerically on request, whereas the uniq command removes adjacent duplicate lines.

Let’s work through an example together. Please follow along!

First, using the nano editor (or any other text editor you prefer), create a text file named fruit.txt with the following content, one fruit per line and in this order: orange, pear, apple, banana, grape, satsuma, melon, pomegranate, banana, grape.
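If you would rather not open an editor, the same file can be created from the command line. This is a minimal sketch, assuming a POSIX shell with printf available; it is an alternative to the editor route, not part of the original course material.

```shell
# Write one fruit per line into fruit.txt (overwrites any existing file).
printf '%s\n' orange pear apple banana grape satsuma melon pomegranate banana grape > fruit.txt

# Check the result: the file should contain 10 lines.
wc -l < fruit.txt
```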

This content is taken from the Wellcome Connecting Science online course Bioinformatics for Biologists: An Introduction to Linux, Bash Scripting, and R.

How to use sort

The sort command reads a text file and prints the sorted lines to the screen (standard output). The original file is not changed.

sort fruit.txt

The sorted output can also be redirected into another text file.

sort fruit.txt > sorted_fruit.txt

You can reverse the order of the sort with the -r option.

sort -r fruit.txt

Scrambling the order of lines is also possible with the -R option.

sort -R fruit.txt
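One detail worth knowing, assuming GNU sort (the -R option is not part of POSIX): -R orders lines by a random hash of each line, so identical lines still end up next to each other. The sketch below recreates fruit.txt so that it runs on its own, and shows shuf, also from GNU coreutils, as an alternative that shuffles every line independently.

```shell
# Recreate fruit.txt so this snippet runs on its own.
printf '%s\n' orange pear apple banana grape satsuma melon pomegranate banana grape > fruit.txt

# With -R, the two "banana" lines and the two "grape" lines stay adjacent,
# so uniq can still collapse them: 8 distinct lines remain.
sort -R fruit.txt | uniq | wc -l

# shuf permutes all 10 lines independently; duplicates may be separated.
shuf fruit.txt
```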

The random bytes used for the shuffle can be taken from a source of your choice, such as the system random number generator, with the --random-source option.

sort -R --random-source=/dev/random fruit.txt

The -f option forces the sort to ignore the case of a letter when ordering lines.

sort -f fruit.txt
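To see what -f changes, a file with mixed case is needed, since fruit.txt is all lower case. The sketch below uses a made-up file, mixed_case.txt, and sets LC_ALL=C so that the result does not depend on your locale settings (in many other locales, sort already places upper- and lower-case letters next to each other).

```shell
# mixed_case.txt is a hypothetical example file, not part of the course material.
printf '%s\n' cherry Banana apple > mixed_case.txt

# Byte-order sort: upper-case letters come before lower-case ones.
LC_ALL=C sort mixed_case.txt
# Banana
# apple
# cherry

# With -f, case is ignored when comparing.
LC_ALL=C sort -f mixed_case.txt
# apple
# Banana
# cherry
```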

The -s option makes the sort stable: lines that compare as equal are output in the same order as they appeared in the original file.

sort -s fruit.txt
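With fruit.txt, -s makes no visible difference, because lines that compare as equal are also identical. The effect shows up when it is combined with an option such as -f. The sketch below assumes GNU sort and uses a made-up file, case_pairs.txt, with LC_ALL=C set so the result does not depend on locale.

```shell
# case_pairs.txt is a hypothetical example file, not part of the course material.
printf '%s\n' b A a B > case_pairs.txt

# Without -s, lines that tie under -f are ordered by a final byte-by-byte comparison.
LC_ALL=C sort -f case_pairs.txt
# A
# a
# B
# b

# With -s, tied lines keep their input order ("b" came before "B" in the file).
LC_ALL=C sort -f -s case_pairs.txt
# A
# a
# b
# B
```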

Duplicate lines can be removed with the -u option.

sort -u fruit.txt
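The options above all compare lines as text. For numbers, sort has a -n option, which the list above does not cover. The sketch below uses a made-up file, numbers.txt, to show the difference.

```shell
# numbers.txt is a hypothetical example file, not part of the course material.
printf '%s\n' 10 9 100 1 > numbers.txt

# Text comparison: "10" and "100" sort before "9" because "1" comes before "9".
LC_ALL=C sort numbers.txt
# 1
# 10
# 100
# 9

# Numeric comparison with -n.
sort -n numbers.txt
# 1
# 9
# 10
# 100
```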

How to use uniq

The uniq command accepts input from a text-based file and removes repeated lines, but only if they are adjacent to each other. That’s why it’s used in conjunction with sort: sorting brings duplicate lines together so that uniq can remove them.

sort fruit.txt | uniq
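The adjacency requirement is easy to check by counting lines with and without the sort step. This sketch recreates fruit.txt so that it runs on its own.

```shell
# Recreate fruit.txt so this snippet runs on its own.
printf '%s\n' orange pear apple banana grape satsuma melon pomegranate banana grape > fruit.txt

# uniq alone: the two banana lines (and the two grape lines) are not adjacent,
# so nothing is removed and all 10 lines are printed.
uniq fruit.txt | wc -l

# After sorting, duplicates are adjacent and uniq leaves 8 distinct lines.
sort fruit.txt | uniq | wc -l
```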

Case differences can be ignored when dropping duplicate adjacent lines, using the -i option.

sort fruit.txt | uniq -i

Combining -i with the -c option makes uniq prefix each line with the number of times it occurs in the file.

sort fruit.txt | uniq -ic
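A common follow-up is to rank the lines by how often they occur. The sketch below is one way to do that, not something from the original course: it pipes the counts into a second sort. The -i option is left out because fruit.txt is all lower case.

```shell
# Recreate fruit.txt so this snippet runs on its own.
printf '%s\n' orange pear apple banana grape satsuma melon pomegranate banana grape > fruit.txt

# Count each distinct line, then list the most frequent first.
# -k1,1nr sorts on the count (numeric, descending); -k2 breaks ties by name.
sort fruit.txt | uniq -c | sort -k1,1nr -k2
# banana and grape come first with a count of 2; the other six fruits have a count of 1.
```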

Using the -d option with -i prints only the lines that are duplicated, once per group of duplicates.

sort fruit.txt | uniq -id
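The counterpart of -d is -u, which prints only the lines that occur exactly once. The original text does not cover -u; the sketch below shows both side by side on fruit.txt, recreated here so that the snippet runs on its own.

```shell
# Recreate fruit.txt so this snippet runs on its own.
printf '%s\n' orange pear apple banana grape satsuma melon pomegranate banana grape > fruit.txt

# -d: one copy of each line that occurs more than once.
sort fruit.txt | uniq -d
# banana
# grape

# -u: only the lines that occur exactly once.
sort fruit.txt | uniq -u
# apple
# melon
# orange
# pear
# pomegranate
# satsuma
```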

This output can also be piped into another uniq command. Because -d already prints each duplicated line only once, the second uniq normally leaves the output unchanged.

sort fruit.txt | uniq -id | uniq -i

For more on sort and uniq, visit https://www.linode.com/docs/guides/manipulate-lists-with-sort-and-uniq/

Your task

How would you extract only the lines that occur more than once in fruit.txt into a new file named repeated_fruit.txt?

Post your answers to the comment area below and discuss your answers with the other learners.
