21 essential command line interface tools for Data Scientists


   On Windows

   On Ubuntu

   Work with file system

   Conclusions

In this article, we are going to look at the most convenient tools for quick analysis of data stored in text files (logs, reports, etc.).

Most often, the data we need is not stored on our own computer, so we will first look at how to access a remote server and work on it. The most suitable tool for this is SSH (Secure Shell), a cryptographic network protocol that allows remote login and other network services to operate securely over an unsecured network.

In the terminal on Ubuntu, you can use one of the following commands to connect to a remote server over SSH:

connection with a user name

$ ssh user@host

connection to a non-default port

$ ssh -p port host

connection with a key

$ ssh -i key.pem user@host

command execution on the remote host

$ ssh -i key.pem user@host 'command'

An SSH client is installed by default on Ubuntu. If it is missing, you can install it from the terminal with the command sudo apt-get install ssh.

On Windows

If you are using Microsoft Windows, you need to install PuTTY, a free program available at http://www.putty.org/, in order to work with SSH. To connect using PuTTY, follow these steps:

  1. Run PuTTY and enter your connection settings:

    a. Host Name: example.com

    b. Port: 22 (leave as default)

    c. Connection Type: SSH (leave as default)

  2. Click Open to start an SSH session.

  3. Once the SSH Connection is open, you should see a terminal prompt asking for your username.

  4. Next, enter your password. Please note that while you type your password you will NOT see the cursor moving or any input characters (such as ******). This is normal behavior for password prompts in a terminal. Press Enter.

  5. You can start typing at the command prompt.

One of the nice features of SSH is the ability to create secure tunnels.

On Ubuntu

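1. Tunnel from the network into the world:

ssh -f -N -R 2222:192.168.0.1:22 username@1.1.1.1

Port 2222 is used here because port 22 on the host 1.1.1.1 is normally already taken by its own SSH server. Then, on the host 1.1.1.1, enter

$ ssh -p 2222 localhost

to gain access to the host 192.168.0.1.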

Options:

  • -R - perform remote port forwarding. When a connection is made to the given port on the remote machine, it is sent through the SSH tunnel and forwarded to the specified host and port.

  • -N - do not run a remote command. It is useful when only forwarding ports.

  • -f - go to the background immediately after authentication on the remote system.

A very useful application of SSH tunnels is traffic encryption, for example when you are using an open network and don't want anyone to be able to intercept your data.

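2. Tunnel from the world into a network:

 ssh -f -N -L 8080:192.168.0.1:80 username@1.1.1.1

Local port 8080 is used here because binding to ports below 1024, such as 80, usually requires root privileges. Then open http://localhost:8080 on your own machine
and gain access to the web host 192.168.0.1, which is located behind the host 1.1.1.1.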

Options:

  • -L - perform local port forwarding. When a connection is made to the given port on the local machine, it is sent through the SSH tunnel and forwarded to the specified host and port.

A tunnel from the network into the world (a reverse tunnel, item 1 above) is used when you need to reach a machine that is protected by a firewall or located behind NAT.

The principle of operation is that the connection is initiated by the machine inside the network, and we then connect back through the tunnel that is already established. You can send any traffic through such a tunnel, not only SSH.

To use SSH tunneling in PuTTY:

  • In Connection -> SSH -> Tunnels, enter

Source port: 22
Destination: localhost:22

  • Select the "Dynamic" option and click the "Add" button. With "Dynamic" selected, PuTTY opens a SOCKS proxy on the source port.

  • In Session, enter the desired "Host Name" and select the SSH protocol. Then save the settings by typing a name in "Saved Sessions", for example, "HostName with ssh tunnel on 22", and clicking "Save". Double-click the name to make the connection.

  • Once the SSH connection is established, you can configure your browser to use localhost and the source port as a SOCKS proxy. If you then open a web page that shows your IP address, you will see that it is now reported as the address of the remote machine.

Now we are connected to the host and can work on it. But some actions require root access. To obtain it (if root access is allowed on the host), run the following in the terminal:

$ sudo -s

After you enter your password (sudo asks for your own password, not root's), you will get a root shell.

We now know everything we need to connect to the host and obtain the necessary rights, but that is not enough to analyze the data on the host. Let's consider the basic commands for managing, acquiring, and processing data on a remote host.

Work with file system

Let's start by learning how to work with the file system. Here is a list of commands to:

1. Work with files and folders

create a folder dir_name

mkdir dir_name

rename folder dir_name to dir_name2

mv dir_name dir_name2

delete the folder / file

rm -rf dir_name

Options:

  • -r, -R - process all nested subdirectories.
  • -i - ask for confirmation before each removal.
  • -f - do not report an error for files that do not exist, and never ask for confirmation.

remove all files whose names begin with file2015; the symbol * matches any sequence of characters (including none), and you can use it anywhere in a name to stand for the unknown part

rm -rf file2015*

create file file_name.txt

touch file_name.txt

rename the file

mv file_name.txt file_name2.txt 
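
The commands above can be tried safely in a throwaway directory. The following is a minimal sketch, not part of the original walk-through: the temporary directory and the names file2015_a.txt, file2015_b.txt, file2016.txt and report.txt are made up for illustration.

```shell
# Sandbox walk-through; every file and folder name here is illustrative.
set -eu
work=$(mktemp -d)            # throwaway directory, so nothing real is touched
cd "$work"

mkdir dir_name               # create a folder
mv dir_name dir_name2        # rename it
touch file2015_a.txt file2015_b.txt file2016.txt   # create three empty files
rm -f file2015*              # the pattern matches only the two file2015_* files
mv file2016.txt report.txt   # rename a file

remaining=$(echo *)          # names left in the sandbox, in sorted order
echo "left in sandbox: $remaining"

cd /
rm -rf "$work"               # remove the sandbox itself
```

Working inside a directory created by mktemp keeps a mistyped rm -rf pattern from touching real data while you experiment.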

2. Permissions

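view the permissions and ownership of all files / folders in the directory

ls -la

-l - use the long listing format; -a - include hidden files

give everyone read, write, and execute permissions on the file file.sh

chmod 777 file.sh

set access 777 recursively (-R) for all files in the folder dir_name

chmod -R 777 dir_name

Mode 777 lets any user modify the file; a narrower mode such as 755 (only the owner can write) is usually the safer choice.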

For more information you can look here: https://ru.wikipedia.org/wiki/Chmod
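
To see how numeric and symbolic modes relate, the following sketch changes the mode of a scratch file and reads it back. It assumes GNU stat (as on Ubuntu), where -c '%a' prints the octal mode; the file name file.sh inside a temporary directory is only an example.

```shell
# Inspect permission changes on a scratch file.
# Assumption: GNU coreutils stat; BSD/macOS stat uses different options.
set -eu
work=$(mktemp -d)
script="$work/file.sh"
printf '#!/bin/sh\necho hello\n' > "$script"

chmod 644 "$script"                    # owner rw-, group r--, others r--
mode_plain=$(stat -c '%a' "$script")

chmod u+x "$script"                    # symbolic form: add execute for the owner
mode_owner_exec=$(stat -c '%a' "$script")

chmod 755 "$script"                    # owner rwx, group r-x, others r-x
mode_shared=$(stat -c '%a' "$script")

echo "$mode_plain -> $mode_owner_exec -> $mode_shared"
rm -rf "$work"
```

Each octal digit is the sum of read (4), write (2), and execute (1) for the owner, the group, and everyone else, in that order.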

3. Owner / Group

Set an owner and group of the file file.txt

chown ubuntu:ubuntu file.txt

set the owner and group of a folder recursively

chown -R ubuntu:ubuntu dir_name

4. View disk space

view free space on all partitions

df -h

get folder size

du -sh dir_name/

get the size of the folder's contents, including subdirectories

du -h dir_name/*

get the file size

du -h filename
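
A common follow-up is to find out which subdirectory takes the most space by combining du with sort. This is a small sketch under assumed names: the folders small and big and their random contents are created only for the demonstration.

```shell
# Find the largest subdirectory; folder names and contents are made up.
set -eu
work=$(mktemp -d)
mkdir "$work/small" "$work/big"
head -c 1024 /dev/urandom > "$work/small/a.bin"      # about 1 KiB
head -c 2097152 /dev/urandom > "$work/big/b.bin"     # about 2 MiB

# du -sk prints "size<TAB>path" in kilobytes; sort -n orders by size, ascending.
largest_path=$(du -sk "$work"/* | sort -n | tail -n 1 | cut -f 2)
largest=$(basename "$largest_path")
echo "largest subdirectory: $largest"

rm -rf "$work"
```

The -k flag is used instead of -h here so that the sizes sort correctly as plain numbers.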

The following commands can be used to obtain information about the state of the host system:

  • View available RAM and swap

    free
  • Console command that displays a list of current system processes and information about them

    top/htop

Now we can work with the file system and obtain the information that we need. Unfortunately, in most cases, it is not in a human-readable form. Let's consider the tools that facilitate data analysis.
When you need to analyze information, the first thing to do is to find it. The general form of the command is

find /var -name search_name

where /var is the directory to search in. You can also use the * symbol if the whole name is too long or unknown. In this case, quote the pattern so that the shell passes it to find unchanged:

find /var -name 'search_name*'

For example, to look for the authentication logs on the host:

find /var -name 'auth.log*'
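
The same kind of search can be tried without root access in a scratch directory. In this sketch the directory layout (log/auth.log, log/auth.log.1, log/app/app.log) is invented to imitate a log folder.

```shell
# Search a scratch tree by name pattern; the layout below is invented.
set -eu
work=$(mktemp -d)
mkdir -p "$work/log/app"
touch "$work/log/auth.log" "$work/log/auth.log.1" "$work/log/app/app.log"

# The quotes keep the shell from expanding * before find sees the pattern.
found=$(find "$work" -type f -name 'auth.log*' | sort)
count=$(printf '%s\n' "$found" | wc -l | tr -d ' ')
echo "$count matching files:"
printf '%s\n' "$found"

rm -rf "$work"
```

Adding -type f restricts the result to regular files, which is usually what you want when looking for logs.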

Once we've found the required files, we usually need to view them. You can use the following commands:

  • output the content of a file

    cat file_name
  • merge file_name1 and file_name2 into one file, file_name3

    cat file_name1 file_name2 > file_name3
  • combine all files whose names begin with file_name

    cat file_name* > file_name_end
  • combine all files whose names begin with file_name and that are located in directories whose names begin with some_dir

    cat some_dir*/file_name* > file_name_end

You can redirect the output not only of file commands, but also of scripts and other commands:

  • redirect output to a file, overwriting it

    ./some_script.sh > /path/to/file.txt
  • redirect output to a file, appending to it

    ./some_script.sh >> /path/to/file.txt
  • discard both the standard output (STDOUT) and the error stream (STDERR) by sending them to /dev/null:

    some_command > /dev/null 2>&1
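
The difference between these redirections is easiest to see on a command that writes to both streams. The sketch below uses a hypothetical shell function, noisy, that stands in for some_script.sh.

```shell
# Compare >, >> and 2>&1 using a stand-in command.
# "noisy" is a hypothetical function, not a real tool.
set -eu
work=$(mktemp -d)
out="$work/file.txt"
noisy() { echo "to stdout"; echo "to stderr" >&2; }

noisy > "$out" 2>/dev/null      # overwrite: the file holds 1 line
noisy >> "$out" 2>/dev/null     # append: the file holds 2 lines
noisy >> "$out" 2>&1            # stderr follows stdout into the file: 4 lines
lines=$(wc -l < "$out" | tr -d ' ')

discarded=$(noisy > /dev/null 2>&1)   # both streams thrown away, nothing captured
echo "lines in file: $lines, captured text: '$discarded'"

rm -rf "$work"
```

The order matters: 2>&1 must come after the redirection of standard output, because it means "send STDERR wherever STDOUT points right now".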

The next step is to save the contents of the files /var/log/auth.log and /var/log/auth.log.1 in the file /home/auth.log for further analysis:

cat /var/log/auth.log /var/log/auth.log.1 > /home/auth.log


As we can see from the output, there is a lot of information that we don’t need. Now we can use filtering to get just the right information. There is a very useful utility grep for this task. The syntax is presented below:

  • trim cat’s output to get only the lines containing some_line

    cat file_name | grep some_line

Grep can also be applied to the output of commands and scripts:

  • output only the lines that contain error

    python run_sum_script.py | grep error
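
A few grep options cover most everyday filtering. The following sketch runs them on a small made-up log; the file name app.log and its five lines are sample data, not output of any real program.

```shell
# Common grep options on sample data; the log lines are invented.
set -eu
work=$(mktemp -d)
log="$work/app.log"
cat > "$log" <<'EOF'
INFO start
ERROR disk full
info retry
error timeout
INFO done
EOF

exact=$(grep -c 'error' "$log")         # count lines with lowercase "error"
any_case=$(grep -ci 'error' "$log")     # -i ignores case
not_info=$(grep -vic 'info' "$log")     # -v inverts: lines without "info"
numbered=$(grep -n 'ERROR' "$log")      # -n prefixes the line number
echo "exact=$exact any_case=$any_case not_info=$not_info numbered=$numbered"

rm -rf "$work"
```

Note that grep can read the file directly, so cat file_name | grep some_line and grep some_line file_name give the same result.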

We can print only the authorization records relating to ssh by filtering on sshd:

cat /home/auth.log | grep sshd

You can also track changes to a file. It is useful, for example, when you are testing a web page and want to "catch" each request in the log as it arrives, or when you want to view new authorization records:

print new records as a third-party application writes them to a file

tail -f some_web_server_log

It is also possible to use it in conjunction with grep:

print only the new records that contain error

tail -f some_web_server_log | grep error

Let's take a look at tail in action. To track the file /var/log/auth.log, use the command

tail -f /var/log/auth.log | grep sshd

After each new connection to the host via ssh, the new authorization records will be displayed.

Another useful tool for text analysis of files is awk. With its help, you can easily cope with text files of any structure. Awk is a pattern scanning and text processing language; it can be viewed as a small programming language inside the shell. There are several ways to run an awk program. If the program is short, it is easiest to include it in the command that runs awk, like this:

awk 'program' input-file1 input-file2 

For more information please read the manual:

http://www.gnu.org/software/gawk/manual/gawk.html

We can use awk to see when users connect and disconnect via ssh using the command:

$ awk '/sshd/ && /pam_unix/ {print ($1,$2,$3,$8,$11)}' /var/log/auth.log
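
To show which fields the program above picks out, here is a sketch that runs it on a few synthetic log lines. The lines imitate the usual sshd pam_unix session format, but the host name, process IDs, and users (alice, bob) are invented, and a real auth.log may differ in layout.

```shell
# Run the awk filter on synthetic auth.log lines (all values are invented).
set -eu
work=$(mktemp -d)
log="$work/auth.log"
cat > "$log" <<'EOF'
Mar  1 10:15:01 host sshd[1234]: pam_unix(sshd:session): session opened for user alice by (uid=0)
Mar  1 10:20:44 host sshd[1234]: pam_unix(sshd:session): session closed for user alice
Mar  1 10:21:00 host CRON[1300]: pam_unix(cron:session): session opened for user root by (uid=0)
Mar  1 11:02:13 host sshd[1401]: pam_unix(sshd:session): session opened for user bob by (uid=0)
EOF

# $1-$3 are the date and time, $8 is "opened" or "closed", $11 is the user name.
events=$(awk '/sshd/ && /pam_unix/ {print $1, $2, $3, $8, $11}' "$log")
printf '%s\n' "$events"

# Count opened ssh sessions per user with an awk associative array.
opened=$(awk '/sshd/ && /pam_unix/ && $8 == "opened" {n[$11]++}
              END {for (u in n) print u, n[u]}' "$log" | sort)
printf '%s\n' "$opened"

rm -rf "$work"
```

The CRON line is skipped because it does not contain sshd, which is why both patterns are combined with &&.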

 

Another useful command is sed, the stream editor. It copies the given files (standard input by default) to the standard output, editing them according to the commands in a script. With the -f flag, the script is taken from the file sfile. If there is only one -e script option, the -e flag can be omitted. The -n flag suppresses the default output. The script consists of editing commands, one per line, in the format:

[addr [, addr]] cmd [args]

Sed works cyclically: it reads an input line, applies the script to it, and writes the result to the output.

Example:

  • Replace every occurrence of John with Nick in the report.txt file (the result is printed; the file itself is not changed)

    sed 's/John/Nick/g' report.txt
  • Delete lines 2 through 4 from the output of who

    who | sed '2,4d'
  • and so on.
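
The sed commands above can be checked on a small sample file. In this sketch, report.txt is created in a temporary directory with five invented lines.

```shell
# Try substitution, deletion and -n on a sample file (contents are invented).
set -eu
work=$(mktemp -d)
report="$work/report.txt"
printf 'John wrote this.\nReviewed by John and Ann.\nline 3\nline 4\nline 5\n' > "$report"

replaced=$(sed 's/John/Nick/g' "$report" | head -n 2)   # substitute on every line
trimmed=$(sed '2,4d' "$report")                         # drop lines 2 through 4
matching=$(sed -n '/Ann/p' "$report")                   # -n plus p: print only matches
untouched=$(head -n 1 "$report")                        # the file itself is unchanged

printf '%s\n---\n%s\n---\n%s\n' "$replaced" "$trimmed" "$matching"
rm -rf "$work"
```

Because sed writes to the standard output, redirect the result to a new file if you want to keep it.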

Let's consider some useful commands that make it easier to work with the remote host. Tar and zip archives are often used to save space and traffic. The following commands can be used to work with them:

For tar:

  • create a compressed archive, ignoring read errors during compression

    tar --ignore-failed-read -czf new-tar-file-name.tar.gz file-or-folder-to-archive

    c - create a new archive.

    z - compress the archive using gzip.

    f - use the given archive file.

  • create an archive from multiple files and directories

    tar -czf new-tar-file-name.tar.gz file1 file2 folder1 folder2

  • extract an archive

    tar -xzf tar-file-name.tar.gz

    x - extract an archive.

    z - uncompress the archive using gzip.

    f - use the given archive file.
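
A quick round trip is a good way to confirm that an archive contains what you expect before transferring it. The sketch below is an illustration with made-up names (file1, folder1/nested.txt, backup.tar.gz); it also uses the -t option to list an archive and -C to change directory, neither of which is covered above.

```shell
# Create, list and extract an archive in a sandbox; all names are made up.
set -eu
work=$(mktemp -d)
mkdir -p "$work/src/folder1" "$work/out"
echo "alpha" > "$work/src/file1"
echo "beta"  > "$work/src/folder1/nested.txt"

# -C switches to the given directory first, so the archive stores relative paths.
tar -czf "$work/backup.tar.gz" -C "$work/src" file1 folder1

listing=$(tar -tzf "$work/backup.tar.gz")        # t - list the archive contents
tar -xzf "$work/backup.tar.gz" -C "$work/out"    # extract into another folder
restored=$(cat "$work/out/folder1/nested.txt")

printf '%s\n' "$listing"
echo "restored content: $restored"
rm -rf "$work"
```

Listing with -t before extracting shows where the files will land, which helps avoid overwriting something in the current directory.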

For zip:

  • zip files (-r includes the contents of folders)

    zip -r new-zip-file-name.zip file-or-folder-to-archive
  • unzip files

    unzip new-zip-file-name.zip -d destination_folder

where -d destination_folder is the folder to extract into.

And, of course, working with a remote host is impossible without transferring files between the local computer and the remote host. To do this, you can use the following commands, both run on the local machine:

  • download files from the remote server

    scp -r username@server:(remote location) (local location)
  • upload files to the remote server

    scp -r (local location) username@server:(remote location)

where -r copies directories recursively.

Above, we obtained the times at which users connected and disconnected via ssh with the help of the awk command. Now we can save all of this to a file, compress it into a tar archive, and transfer it with scp.

Conclusions

With this article we wanted to show you how many possibilities the CLI has and how it can help you on your data science path. You will definitely spend a lot of time at the command line, and the article was intended to help you understand the basic commands you can use in your work.

Each tool is useful in its own way, and their combination in a larger pipeline can become a very powerful instrument. Now you can work with classics such as grep, sed, and awk, and this opens up great opportunities for you.
We are very interested in which commands you prefer to use in your daily tasks, so please leave your comments below, and continue to study and improve yourself. Good luck!

 
