
21 essential command line interface tools for Data Scientists

In this article, we are going to look at the most convenient tools for quick analysis of data stored in text files (logs, reports, etc.).

Most often the data we need is not stored on our own computer, so first we will examine how to gain access to a remote server and how to use it. For that we use SSH (Secure Shell), a cryptographic network protocol that allows remote login and other network services to operate securely over an unsecured network.

In the terminal you can use one of the following commands to connect to a remote server over SSH in Ubuntu:

$ ssh user@host
$ ssh -p port user@host

connection with the key

$ ssh -i key.pem user@host

command execution on the remote host

$ ssh -i key.pem user@host 'command'

The SSH client is already installed by default in Ubuntu. If not, you can install it from the terminal with the command

sudo apt-get install ssh

On Windows

If you are using Microsoft Windows, you need to install the free program PuTTY in order to work with SSH. To connect using PuTTY, follow these steps:

  1. Run PuTTY and enter your connection settings:

    a. Host Name:

    b. Port: 22 (leave as default)

    c. Connection Type: SSH (leave as default)

  2. Click Open to start an SSH session.

  3. Once the SSH Connection is open, you should see a terminal prompt asking for your username.

  4. Next, enter your password. Please note that when entering your password you will NOT see your cursor moving or any input characters (such as ******). This is standard PuTTY security behavior. Press Enter.

  5. You can start typing at the command prompt.

One of the nice features of SSH is the ability to create secure tunnels.

On Ubuntu

1. Tunnel from the network out into the world:

ssh -f -N -R 22: username@

After running the command, you can gain access back to your machine from the remote host with:

$ ssh localhost


  • -R - set up remote port forwarding. When the specified port on the remote machine is accessed, an SSH tunnel is used and the connection is forwarded to the specified host and port.

  • -N - do not run a command remotely; used only when forwarding ports.

  • -f - go to the background immediately after authentication on the remote system.

A very useful application of SSH tunnels is traffic encryption: for example, when you are using an open network and don't want anyone to be able to intercept your data.

2. Tunnel from the world into a network:

  ssh -f -N -L 80: username@

Open http://localhost:80 on your machine and you gain access to a web server located behind the remote host.


  • -L - set up local port forwarding. When the specified port on the local machine is accessed, the connection is forwarded through the tunnel to the specified host and port.

A tunnel from the world into a network (a reverse tunnel) is used when you need to reach a machine protected by a firewall or located behind NAT.

The connection is initiated by the remote machine, and we simply use the tunnel once it has been established. Any traffic can be sent through such a tunnel, not only ssh.

To use the SSH-tunneling in PuTTY:

  • In Connection -> SSH -> Tunnels, enter

Source port: 22
Destination: localhost:22

  • Select "Dynamic" and click the "Add" button.

  • In Session, enter the desired "Host Name" and select the SSH protocol. Then save the session by typing a name into "Saved Sessions", for example "HostName with ssh tunnel on 22", and click "Save". Double-click the name to make the connection.

  • Once the SSH connection is established, you can use your browser. If you open a web page that reports your IP address, you will see that it is now shown as the address of the remote machine.

Now we are connected to the host and can operate it. But some actions require root access. To obtain it (if root access is allowed on the host), execute in the terminal:

$ sudo -s

After entering the root user's password, you will get root access.


We now know everything we need to connect to the host and obtain the necessary rights, but that is not enough to analyze the data stored there. Let's consider the basic commands for acquiring and processing data on a remote host.

Work with file system

Let’s start by learning how to work with the file system. Here is the list of commands:

1. Work with files and folders

create a folder dir_name

mkdir dir_name

rename folder dir_name to dir_name2

mv dir_name dir_name2

delete the folder / file

rm -rf dir_name


  • -r, -R - process all nested subdirectories.
  • -i - ask for confirmation before each removal.
  • -f - do not report errors for files that do not exist; never ask for confirmation.

remove all files that begin with file2015; the symbol * matches any sequence of characters and can be used anywhere in the name

rm -rf file2015*

create file file_name.txt

touch file_name.txt

rename the file       

mv file_name.txt file_name2.txt 
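Putting the commands above together, here is a minimal runnable sketch (the /tmp paths and file names are made up for illustration):

```shell
# Clean slate for the demo (directory names are illustrative)
rm -rf /tmp/fs_demo /tmp/fs_demo2

mkdir /tmp/fs_demo                                         # create a folder
touch /tmp/fs_demo/file_name.txt                           # create an empty file
mv /tmp/fs_demo/file_name.txt /tmp/fs_demo/file_name2.txt  # rename the file
mv /tmp/fs_demo /tmp/fs_demo2                              # rename the folder
ls /tmp/fs_demo2                                           # shows file_name2.txt
```

Running rm -rf /tmp/fs_demo2 afterwards removes the whole folder and its contents.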

2. Permissions

ls -la - view the permissions and ownership of all files / folders in the directory

-l - use the long listing format

set full (777) permissions on the file file

chmod 777 file

set 777 permissions recursively for all files in the folder dir_name

chmod -R 777 dir_name

For more information, see the chmod manual (man chmod).
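As a quick runnable sketch of chmod (the file name under /tmp is made up):

```shell
# Create a file and change its permission bits
touch /tmp/perm_demo.sh
chmod 644 /tmp/perm_demo.sh   # rw-r--r--: readable by all, not executable
chmod 755 /tmp/perm_demo.sh   # rwxr-xr-x: owner may write, everyone may execute
ls -l /tmp/perm_demo.sh       # the long listing shows the permission bits
```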

3. Owner / Group

Set an owner and group of the file file.txt

chown ubuntu:ubuntu file.txt

set the owner and group of a folder recursively

chown -R ubuntu:ubuntu dir_name          

4. View disk space

view the disk space of all partitions

df -h

get folder size

du -sh dir_name/

get folder size and the size of subdirectories

du -h dir_name/*

get the file size

du -h filename          

The following commands can be used to obtain information about the state of the host system:

  • View available RAM and swap

  • View the list of current system processes and information about them


Now we can work with the file system and obtain the information that we need. Unfortunately, in most cases, it is not human-readable information. Consider the tools to facilitate data analysis.
When you need to analyze information, the first thing to do is find it. Generally, you can search for a file by name with the command

find /var -name search_name

You can also use the * symbol if the whole name is too long or unknown. In this case, use

find /var -name 'search_name*'

For example, we can look for the authentication logs on the host.
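A runnable sketch of the pattern search (the directory and file names are made up):

```shell
# Create a demo directory with two files
mkdir -p /tmp/find_demo
touch /tmp/find_demo/file2015_jan.log /tmp/find_demo/file2016.log

# Quote the pattern so the shell passes the * to find itself
find /tmp/find_demo -name 'file2015*'
```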

Once we’ve found the required files, we usually need to view them. You can use the following commands:

  • output the content of a file

    cat file_name
  • merge file_name1 and file_name2 into one file file_name3

    cat file_name1 file_name2 > file_name3
  • combine multiple files whose names begin with file_name

    cat file_name* > file_name_end
  • combine multiple files whose names begin with file_name from many directories beginning with some_dir

    cat some_dir*/file_name* > file_name_end
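A small runnable sketch of merging files with cat (file names are made up):

```shell
# Create two one-line files
printf 'part one\n' > /tmp/cat_a.txt
printf 'part two\n' > /tmp/cat_b.txt

# Merge the two files into a third one
cat /tmp/cat_a.txt /tmp/cat_b.txt > /tmp/cat_all.txt

# The merged file now contains both lines
cat /tmp/cat_all.txt
```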

You can also redirect the output stream not only of files, but also of scripts and commands:

  • redirect output, overwriting the file

    ./ > /path/to/file.txt
  • redirect output, appending to the file

    ./ >> /path/to/file.txt
  • send both the standard output (STDOUT) and the error stream (STDERR) to /dev/null:

    some_command > /dev/null 2>&1
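The difference between the three redirections can be seen in a short runnable sketch (the file name is made up):

```shell
# > overwrites the file, >> appends to it
echo 'first line'  >  /tmp/redirect_demo.txt
echo 'second line' >> /tmp/redirect_demo.txt

# Both stdout and stderr go to /dev/null, so nothing is printed;
# `|| true` keeps the demo going even though ls fails
ls /no_such_dir > /dev/null 2>&1 || true

# The file now has two lines
cat /tmp/redirect_demo.txt
```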

The next step is to save the file contents into a separate file for further analysis.

As we can see from the output, there is a lot of information that we don't need. We can use filtering to get just the right information; there is a very useful utility, grep, for this task. The syntax is presented below:

  • trim cat’s output to get only the lines containing some_line

    cat file_name | grep some_line

Grep can also be applied to the output of commands and scripts:

  • output only the lines that contain errors

    python | grep error
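Here is a runnable sketch of filtering with grep on a small sample log (the file name and its contents are made up):

```shell
# Build a four-line sample log
printf 'INFO start\nERROR disk full\nINFO done\nERROR timeout\n' > /tmp/sample.log

# Keep only the lines that contain ERROR
cat /tmp/sample.log | grep ERROR

# -c counts the matching lines instead of printing them
grep -c ERROR /tmp/sample.log
```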

We can print the authorization records relating only to ssh, using sshd as the filter.

You can also track changes to a file as they happen. This is useful, for example, when you are testing a web page: you can "catch" requests in the log as they arrive, or view new authorization records.

The following command prints new records appended to a file:

tail -f some_web_server_log

It is also possible to use it in conjunction with grep:

print only the new records which contain error

tail -f some_web_server_log | grep error

Let's take a look at tail in action. We track the authentication log with the command

tail -f /var/log/auth.log | grep sshd

After each new connection to the host via ssh, the new authorization records will be displayed.
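Since tail -f blocks while it waits for new lines, a runnable sketch of the same idea uses plain tail with -n on a sample file (the file name and content are made up):

```shell
# Build a small sample log
printf 'line 1\nline 2\nsshd session opened\nline 4\n' > /tmp/tail_demo.log

# Print the last two lines of the file
tail -n 2 /tmp/tail_demo.log

# Combine tail with grep to filter what it prints
tail -n 3 /tmp/tail_demo.log | grep sshd
```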

Another useful tool for analyzing text files is awk. With its help you can easily cope with text files of almost any structure. Awk is a command for contextual search and text processing; it can be viewed as its own small language inside the shell. There are several ways to run an awk program. If the program is short, it is easiest to include it in the command that runs awk, like this:

awk 'program' input-file1 input-file2 

For more information, please read the manual (man awk).

We can use awk to see when users connect and disconnect via ssh using the command:

$ awk '/sshd/ && /pam_unix/ {print ($1,$2,$3,$8,$11)}' /var/log/auth.log
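The same idea can be tried on a single made-up auth.log record (the line below is invented, but it follows the usual field layout, so $1-$3 are the date and $8 and $11 land on "opened" and the user name):

```shell
# One invented sample record in the usual auth.log layout
printf 'Jan 10 12:00:01 host sshd[123]: pam_unix(sshd:session): session opened for user alice\n' > /tmp/auth_sample.log

# awk splits each line on whitespace into $1, $2, ... and prints
# the chosen fields only for lines matching both sshd and pam_unix
awk '/sshd/ && /pam_unix/ {print $1, $2, $3, $8, $11}' /tmp/auth_sample.log
```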


There is also a useful command called sed. It copies the named files (standard input by default) to the standard output, editing them according to commands placed in a script (given inline or in a file, rather than interactively). With the -f flag the command takes the script from the file sfile. If there is only one -e script, the -e flag can be omitted. The -n flag suppresses the default output. A script consists of editing commands, one per line, in the format:

[addr [, addr]] cmd [args]

Sed cyclically transforms input lines into output lines.


  • replace every occurrence of John with Nick in the report.txt file

    sed 's/John/Nick/g' report.txt
  • delete lines 2 to 4 from the output

    who | sed '2,4d'
  • and so on.
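Both sed idioms above can be tried on a small sample file (the file name and its contents are made up):

```shell
# Build a three-line sample file
printf 'John went home\nJohn came back\nthe end\n' > /tmp/report_demo.txt

# s/old/new/g replaces every occurrence on each line
sed 's/John/Nick/g' /tmp/report_demo.txt

# 2,3d deletes lines 2 to 3, leaving only the first line
sed '2,3d' /tmp/report_demo.txt
```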

Consider some useful commands that facilitate work with the remote host. Tar and zip archives are often used to save space and traffic. The following commands can be used to work with them:

For tar:

  • create a compressed archive from a file or folder

    tar czf new-tar-file-name.tar.gz file-or-folder-to-archive

    c - create new archive.

    z - compress an archive using gzip.

    f - use archive file.

  • for multiple directories

    tar -czf new-tar-file-name.tar.gz file1 file2 folder1 folder2
  • extract an archive

    tar -xzf tar-file-name.tar.gz

    x - extract an archive.

    z - uncompress an archive using gzip.

    f - use an archive file.
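A full create-then-extract round trip can be sketched like this (all paths under /tmp are made up; -C tells tar to change directory first):

```shell
# Prepare a folder with one file to archive
rm -rf /tmp/tar_demo
mkdir -p /tmp/tar_demo/src /tmp/tar_demo/out
echo 'hello' > /tmp/tar_demo/src/data.txt

# c = create, z = gzip, f = archive file name
tar -czf /tmp/tar_demo/archive.tar.gz -C /tmp/tar_demo src

# x = extract, here into /tmp/tar_demo/out
tar -xzf /tmp/tar_demo/archive.tar.gz -C /tmp/tar_demo/out

# The extracted copy matches the original
cat /tmp/tar_demo/out/src/data.txt
```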

For zip:

  • zip files

    zip archive-name.zip file-or-folder-to-archive
  • unzip files

    unzip archive-name.zip -d destination_folder

where -d specifies the destination folder

And, of course, work with a remote host cannot be done without transferring files between the local computer and the remote host. To accomplish this, you can use the following commands:

  • download files from the remote server to the local machine (-r for recursive)

    scp -r username@server:(remote location) (local location)
  • upload files to the remote server

    scp -r (local location) username@server:(remote location)

Above, we obtained the connection and disconnection times of users via ssh with the help of the awk command. Now we can save all of this to a file, compress it into a tar archive and transfer it.


With this article we wanted to show you how many possibilities the CLI has and how it can help you on your data science path. You will definitely spend a lot of time at the command line, and this article was intended to help you understand the basic commands you can use in your work.

Each tool is useful in its own way, and combining them into a larger pipeline can produce a very powerful instrument. Now you can work with the classics such as grep, sed, and awk, and that opens up great opportunities for you.

We are very interested in which commands you prefer to use in your daily tasks, so please leave your comments below, and keep studying and improving. Good luck!
