An innocuous question from a colleague about what tools I use to process text got me thinking. I don’t have a hard and fast rule for when to preference one method over another; mostly it comes down to:
- what type of text I’m dealing with
- where the text is coming from
- the urgency of the results
- whether it needs to be reproduced or shared
- what I feel like doing at the time!
In no particular order, I use these methods:
Manual processing, especially if it’s just a few lines or a small file. The law of diminishing returns long taught me that futzing around with a tool can take far longer.
In a text editor, such as Vi(m) or perhaps Emacs in the future. This is especially useful for substitutions, or rearranging large blocks of text, or if the lines don’t have a structure you can easily parse or differentiate.
Shell scrips or inline Bourne or OpenBSD oksh, using awk, sed, tr, grep/ag, and pipes. I’ve been known to write cruddy, once-off XML parsers in them to process data, which you should never do.
LibreOffice spreadsheets. awk is brilliant for processing columns of data, but if you want to do some visual sorting and data selection, sometimes a spreadsheet really is easier. Exporting to csv is also a cinch.
SQLite3 databaes. They’re cheap to make and import data into, and then you’ve got standard(ish) SQL queries. I only started doing this recently, but for certain types of data it’s very quick.
Perl. I’ve told myself I need to learn Python or improve Ruby, but Perl hashes are stupendously useful and unreasonably flexible for mapping data structures and pulling out relevant information. Plus then I get XML, JSON, YAML, TOML, and other parsers for free.
What’s less clear is where one tool’s domain ends, and another begins. Sets on a Venn diagram would overlap more than not.