Processing text
An innocuous question from a colleague about what tools I use to process text got me thinking. I don’t have a hard and fast rule for when to prefer one method over another; mostly it comes down to:
- what type of text I’m dealing with
- where the text is coming from
- the urgency of the results
- whether it needs to be reproduced or shared
- what I feel like doing at the time!
In no particular order, I use these methods:
- Manual processing, especially if it’s just a few lines or a small file. The law of diminishing returns has long taught me that futzing around with a tool can take far longer than just doing the edit by hand.
- In a text editor, such as Vi(m) or perhaps Emacs in the future. This is especially useful for substitutions, or rearranging large blocks of text, or if the lines don’t have a structure you can easily parse or differentiate. There’s a small substitution sketch after this list.
- Shell scripts, or inline Bourne shell or OpenBSD oksh, using awk, sed, tr, grep/ag, and pipes; a sample pipeline follows the list. I’ve been known to write cruddy, once-off XML parsers in them to process data, which you should never do.
- LibreOffice spreadsheets. awk is brilliant for processing columns of data, but if you want to do some visual sorting and data selection, sometimes a spreadsheet really is easier. Exporting to CSV is also a cinch, as the example after this list shows.
- SQLite3 databases. They’re cheap to make and import data into, and then you’ve got standard(ish) SQL queries; see the sketch below. I only started doing this recently, but for certain types of data it’s very quick.
- Perl. I’ve told myself I need to learn Python or improve my Ruby, but Perl hashes are stupendously useful and unreasonably flexible for mapping data structures and pulling out relevant information. Plus then I get XML, JSON, YAML, TOML, and other parsers for free. There’s a one-liner sketch after this list.
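To illustrate the editor case, this is a minimal sketch of the kind of substitution I mean, run non-interactively through ex so it can be scripted; notes.txt and the dates are hypothetical stand-ins:

```sh
# Run a whole-file substitution without opening the editor interactively.
# The 'e' flag stops a missing pattern from aborting the command chain,
# and 'x' writes the file back out and exits.
ex -sc '%s/2023-01/2024-01/ge|x' notes.txt
```

Inside the editor it’s the same :%s command, just typed at the colon prompt.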
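For the shell and awk side, here’s a rough sketch of the sort of pipeline I mean, assuming a hypothetical access.log in common log format where field 9 is the HTTP status:

```sh
# Tally status codes with an awk array, then show the most common first.
awk '{ count[$9]++ } END { for (code in count) print count[code], code }' access.log |
    sort -rn |
    head
```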
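When the spreadsheet route wins, getting the CSV back out doesn’t even need the GUI; this assumes a hypothetical data.ods and that your LibreOffice install exposes the soffice command:

```sh
# Convert a spreadsheet to CSV headlessly; the output lands in the
# current directory unless --outdir says otherwise.
soffice --headless --convert-to csv data.ods
```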
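The SQLite workflow is roughly this; people.csv, the people table, and the country column are made-up examples:

```sh
# Load a CSV into a throwaway database, then query it with ordinary SQL.
# .import creates the table from the CSV header row if it doesn't exist yet.
sqlite3 scratch.db <<'SQL'
.mode csv
.import people.csv people
SELECT country, COUNT(*) AS total FROM people GROUP BY country ORDER BY total DESC;
SQL
```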
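And the Perl hash trick as a one-liner; data.txt and the idea of tallying the first column are just placeholders for whatever structure you’re pulling apart:

```sh
# -n loops over input lines, -a autosplits them into @F, -l tidies newlines.
# The %seen hash tallies the first field, then END prints the totals.
perl -lane '$seen{$F[0]}++;
            END { printf "%6d %s\n", $seen{$_}, $_ for sort keys %seen }' data.txt
```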
What’s less clear is where one tool’s domain ends, and another begins. Sets on a Venn diagram would overlap more than not.