Handy Command-Line One-liners for Starting Data Scientists

[6/5/2017 update: I was asked if I had a PDF version of the one-liners below. Here it is: Data-Science-One-Liners.pdf]

Experienced data scientists use Unix/Linux command-line utilities (like grep, sed and awk) a great deal in everyday work. But starting data scientists, particularly those without programming experience, are often unaware of the power and elegance of these utilities.

When interviewing candidates for data scientist positions, I ask simple data manipulation questions that can be done with a command-line one-liner. But often the answer is “I will fire up R, import the CSV into a data frame, and then …” or “I will load the data into Postgres and then …”.

The command line can be much simpler and faster, especially for getting large data files ready for consumption by specialized tools like R. For example, rather than trying to load a million-row CSV into R and sample 10% of it, you can quickly create a 10% sample using this one-liner … (read the rest of the post on Medium)
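The one-liner itself is not reproduced in this excerpt, so what follows is only a rough sketch of the idea and not necessarily the command from the full post. It assumes a CSV with a single header row and no line breaks inside quoted fields. The file names data.csv and sample.csv are made up for illustration, and the first command just builds a small stand-in file so the example runs on its own.

```shell
# Build a small stand-in CSV (hypothetical data: 1000 rows plus a header).
seq 1 1000 | awk 'BEGIN { print "id,value" } { print $1 "," $1 * 2 }' > data.csv

# Keep the header (NR == 1) and pass every other row with probability 0.1.
# srand() seeds the generator from the current time, so each run differs.
awk 'BEGIN { srand() } NR == 1 || rand() < 0.1' data.csv > sample.csv

# Compare line counts of the original and the sample.
wc -l data.csv sample.csv
```

Because each row is kept independently, the sample is roughly 10% of the rows, not exactly 10%. The file is read once, line by line, so memory use stays small even for very large inputs.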
