Initial reconnaissance.
Before getting too carried away, I wanted to reconnoiter the first column of the spreadsheet since it seemed to have the most diverse assortment of information in it.
Here are some command-line operations I performed in Terminal.app on the Mac after selecting the first column of FINALS.xlsx and copying to the clipboard:
Scale of stuff to look at
pbpaste | cat | wc
65536 9807 101150
- pbpaste takes the contents of the clipboard and sends it to stdout
- | is the pipe character, which "pipes" the stdout of one command to the stdin of the following command
- cat copies its input to stdout. I use cat here defensively: it seems to do some smart things with encodings, "conditioning" the text for use by other utilities. In this simple pipeline it could have been omitted with identical results, but I have encountered situations in which subsequent processing of the clipboard contents behaved better when I inserted cat after pbpaste. It may be superfluous voodoo.
- wc performs a word count, reporting the number of lines, words, and characters (strictly speaking, bytes when run without flags)

The clipboard apparently has
- 65536 lines, more than I want to look at. Most are probably empty. That number is 2^16 and probably represents the maximum number of possible rows.
- 9807 words
- 101150 characters
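If most of those lines really are empty, counting only the non-empty ones gives a more useful sense of scale. A minimal sketch, assuming made-up sample data: `sample_lines` is a hypothetical stand-in for `pbpaste`, and its contents are invented for illustration.

```shell
# Hypothetical stand-in for pbpaste: three lines with content, two empty.
sample_lines() {
  printf 'Problem 1\n\nFinal Exam\n12\n\n'
}

# wc -l counts every newline, so empty lines are included.
# tr strips the padding that some wc implementations add.
total=$(sample_lines | wc -l | tr -d ' ')

# grep -c . counts only lines containing at least one character.
nonempty=$(sample_lines | grep -c .)

echo "total lines:     $total"
echo "non-empty lines: $nonempty"
```

With the real clipboard, `pbpaste | grep -c .` would report how many of the 65536 lines actually hold something.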
Scale of unique stuff
pbpaste | cat | sort | uniq -c | wc
967 2571 14253
- sort sorts the lines
- uniq -c collapses runs of identical adjacent lines into one, which is why the sort comes first; the -c flag says to count how many instances of each line occurred
- I actually wanted to see the unique lines, but by starting with wc I got an idea of how much stuff I was going to need to look at: here, nearly 1000 lines.
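The reason sort has to come before uniq is easiest to see on a tiny input. A minimal sketch, assuming invented sample data (`sample_lines` is a hypothetical stand-in for `pbpaste`):

```shell
# Hypothetical stand-in for pbpaste: two distinct values, interleaved.
sample_lines() {
  printf 'b\na\nb\na\nb\n'
}

# Without sort, no two equal lines are adjacent, so uniq merges nothing.
unsorted_groups=$(sample_lines | uniq -c | wc -l | tr -d ' ')

# With sort, equal lines become adjacent and are counted together.
sorted_groups=$(sample_lines | sort | uniq -c | wc -l | tr -d ' ')

echo "groups without sort: $unsorted_groups"
echo "groups with sort:    $sorted_groups"
```

Only the sorted pipeline reports the true number of distinct lines.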
The unique stuff
pbpaste | cat | sort | uniq -c | less
- same as above, except wc is replaced with less, which lets me page backwards and forwards through the output.
The unique stuff of likely interest that isn't a problem number
pbpaste | cat | grep '^[A-Z]' | sort | uniq -c | less
- similar to above, but only show lines which start with a capital letter
- grep (the name comes from the ed command g/re/p: globally search for a regular expression and print) looks at lines and passes the ones which match to stdout, discarding non-matches
- ^ anchors the match to the start of the line
- [A-Z] matches any single character in the given range
- the single quotes protect the search pattern from interpretation by the shell
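The pattern can be tried on a few invented lines before pointing it at the clipboard. A minimal sketch, assuming made-up sample data (`sample_lines` is a hypothetical stand-in for `pbpaste`). Setting `LC_ALL=C` is a defensive addition not in the original pipeline: in some locales a range such as `[A-Z]` may match more than the ASCII capitals.

```shell
# Hypothetical stand-in for pbpaste: a mix of headings, numbers, and blanks.
sample_lines() {
  printf 'Final Exam\n12\n3a\nMidterm\n exam notes\n\n'
}

# Keep only lines whose first character is an ASCII capital letter.
capitalized=$(sample_lines | LC_ALL=C grep '^[A-Z]')

# grep -c reports how many lines matched instead of printing them.
capitalized_count=$(sample_lines | LC_ALL=C grep -c '^[A-Z]')

echo "$capitalized"
echo "matching lines: $capitalized_count"
```

The line beginning with a space is not matched, because `^` requires the capital letter to be the very first character.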
Check the other stuff
pbpaste | cat | grep -v '^[A-Z]' | sort | uniq -c | less
- same as above, except the -v flag tells grep to reverse its behavior and send lines which do not match to stdout
- Why? To see if I missed anything of interest.
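One way to confirm that nothing slips between the two views is to check that the matching and non-matching counts add up to the total. A minimal sketch, assuming invented sample data (`sample_lines` is a hypothetical stand-in for `pbpaste`):

```shell
# Hypothetical stand-in for pbpaste.
sample_lines() {
  printf 'Final Exam\n12\n3a\nMidterm\n\n'
}

total=$(sample_lines | wc -l | tr -d ' ')

# Lines that start with a capital letter.
matched=$(sample_lines | grep -c '^[A-Z]')

# -v inverts the match; combined with -c it counts everything else.
unmatched=$(sample_lines | grep -vc '^[A-Z]')

echo "$matched + $unmatched = $((matched + unmatched)) of $total lines"
```

Every line lands in exactly one of the two groups, so the sum should always equal the total.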