An Introduction to Machine Learning with Web Data is now available!

I’m really excited that An Introduction to Machine Learning with Web Data is now available for purchase!

This is a 2 hour and 43 minute instructional video that walks you through basic machine learning algorithms, first theoretically and mathematically, and then with Python example code (which is available here).

This video is an instructional take that builds on the material I covered in my Strange Loop 2010 keynote, Machine Learning: A Love Story, and in the Data Bootcamp I did with Joe Adler, Drew Conway, and Jake Hofman at the Strata Conference in February.

I’d also like to acknowledge the many collaborators, colleagues, and friends who have contributed to my thinking about this material and how best to present it, particularly Chris Wiggins, who co-authored A Taxonomy of Data Science, and Andrew, Dennis, Jan, Jesse, and Julie, the members of the studio audience for the class (who were amazing).

If you like it, please leave it a good review! As always, questions and comments are welcome here or by e-mail.

Intro to the Linux Command Line

This document steps through the process of accessing a Linux server and introduces several basic commands.


We will access the server via the SSH, or “Secure SHell”, protocol. SSH provides encrypted communication between two machines. It is a secure alternative to telnet or rlogin, which send everything, including passwords, unencrypted.

First, make sure that you have an SSH client. If you are running Windows, I highly recommend downloading PuTTY. Click on the top Windows binary and save it to the desktop. You can run putty.exe directly, without installing it.

If you are using OS X, open a terminal and run ‘ssh’.
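On OS X or Linux, one quick way to confirm that an SSH client is available is to ask the shell where it lives. The sketch below is only an illustration: the variable name ssh_status is made up for this example, and the username@hostname in the comment is a placeholder for whatever values you have been assigned.

```shell
# Check whether an SSH client is on the PATH before trying to connect.
if command -v ssh >/dev/null 2>&1; then
    ssh_status="installed"
else
    ssh_status="missing"
fi
echo "ssh client: $ssh_status"

# To connect, you would then run something like the following,
# replacing both placeholders with your own values:
#   ssh username@hostname
```

The first connection will show the same kind of unknown-host warning described for PuTTY below.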

Logging In


Type your hostname into the box. You can find your hostname by asking your ISP, but a good bet is that “” will resolve to your server. You may need to request shell access.

Click open. You’ll be presented with a warning, because you haven’t connected to that machine before. Approve the connection.

Next, you’ll see a login prompt. Enter your username. Usernames will be assigned in class, but a good guess is your first initial and last name.

Enter your password. Both your username and password are case-sensitive.

You should see a screen like the one to the left. Congratulations! You’re logged into the system.

The Prompt

You’re now looking at an interactive shell prompt. The shell reads the commands that you type, interprets them, and asks the kernel to run the corresponding programs. This is how you interact with the operating system.

The prompt currently shows your username, the name of the machine (PVDACDLN-01), and your current directory (~, see below).

By default, our system assigns users to the bash shell. There are many different shells available. In general, they all do the same thing, so most users choose a shell based on what they are most comfortable with.

Tip: By default, OS X uses bash as the shell in a terminal. All of these commands will work nearly identically in OS X.


The remainder of this document is an introduction to Linux commands and a reference.

The Filesystem

The Linux filesystem is organized into a hierarchical series of folders, or directories. ‘/’ indicates the root directory, which contains all other files and directories.

pwd – pwd stands for “print working directory”. It echoes the current directory to the screen (this is where, in the filesystem, you currently are).

[ham@PVDACDLN-01 ~]$ pwd

cd – cd stands for “change directory”. It changes your current working directory to a new directory. If you try to change to a directory that does not exist, you’ll get a No such file or directory error.

Tip: you can type “cd ~” at any time to return to your home directory, or, if you know someone’s username, type “cd ~username” to cd to their home directory.

[ham@PVDACDLN-01 ~]$ cd public_html
[ham@PVDACDLN-01 public_html]$
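To try pwd and cd without touching your real home directory, something like the following sketch can be run in a throwaway directory. The directory is created with mktemp, and the variable names (scratch, inside, back, cd_result) are only illustrative.

```shell
# Work in a throwaway directory so nothing in your real home is touched.
scratch=$(mktemp -d)
mkdir "$scratch/public_html"

cd "$scratch"
cd public_html            # relative path: resolved from the current directory
inside=$(pwd)
echo "now in: $inside"

cd ..                     # ".." is the parent directory
back=$(pwd)
echo "back in: $back"

# Changing to a directory that does not exist fails with an error.
if cd no_such_directory 2>/dev/null; then
    cd_result="changed"
else
    cd_result="failed"
fi
echo "cd to a missing directory: $cd_result"
```

Because the failed cd leaves you where you were, it is safe to mistype a directory name; you simply get the error and stay put.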

ls – ls stands for “list”. It prints a list of files and directories in the current working directory to the screen. This version of ls presents directories in blue.

[ham@PVDACDLN-01 public_html]$ cd ~
[ham@PVDACDLN-01 ~]$ ls
example.txt public_html
[ham@PVDACDLN-01 ~]$

The ls command also accepts various arguments that change the output of the command. For example, the “-a” argument means “display hidden files”, while the “-l” argument means “use long listings”. Command-line switches can be used singly or combined.

Filenames that begin with a “.” are hidden by default. Also note the “.” (current directory) and “..” (parent directory) entries, which appear in every directory.

[ham@PVDACDLN-01 ~]$ ls -a
.              .bash_logout   .bashrc_backup  .gtkrc       .viminfo
..             .bash_profile  .emacs          .kde         .zshrc
.bash_history  .bashrc        example.txt     public_html
[ham@PVDACDLN-01 ~]$ ls -l
total 12
-rw-r--r-- 1 ham users    0 Feb 21 11:38 example.txt
drwxr-xr-x 2 ham users 4096 Feb 21 11:32 public_html
[ham@PVDACDLN-01 ~]$ ls -al
total 108
drwxr-xr-x 4 ham  users 4096 Feb 21 11:38 .
drwxr-xr-x 4 root root  4096 Feb  3 13:39 ..
-rw------- 1 ham  users  297 Feb 21 11:21 .bash_history
-rw-r--r-- 1 ham  users   24 Feb  3 13:39 .bash_logout
-rw-r--r-- 1 ham  users  191 Feb  3 13:39 .bash_profile
-rw-r--r-- 1 ham  users  142 Feb  3 13:41 .bashrc
-rw-r--r-- 1 ham  users  124 Feb  3 13:40 .bashrc_backup
-rw-r--r-- 1 ham  users  438 Feb  3 13:39 .emacs
-rw-r--r-- 1 ham  users    0 Feb 21 11:38 example.txt
-rw-r--r-- 1 ham  users  120 Feb  3 13:39 .gtkrc
drwxr-xr-x 3 ham  users 4096 Feb  3 13:39 .kde
drwxr-xr-x 2 ham  users 4096 Feb 21 11:32 public_html
-rw------- 1 ham  users  632 Feb  3 13:41 .viminfo
-rw-r--r-- 1 ham  users  658 Feb  3 13:39 .zshrc
[ham@PVDACDLN-01 ~]$
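The effect of the ls switches is easy to see in a small directory of your own. This is a minimal sketch, assuming a scratch directory made with mktemp; the file name .hidden_file and the variable names are invented for the example.

```shell
# Build a small directory with one visible file, one hidden file,
# and one subdirectory.
demo=$(mktemp -d)
touch "$demo/example.txt" "$demo/.hidden_file"
mkdir "$demo/public_html"

plain=$(ls "$demo")        # hidden entries are left out
all=$(ls -a "$demo")       # includes ., .., and .hidden_file

echo "ls:"
echo "$plain"
echo "ls -a:"
echo "$all"

# Switches can be combined: -al is the same as -a -l.
ls -al "$demo"
```

The long listing shows one entry per line, with permissions, owner, group, size, and modification time before each name.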

clear – The clear command requires no parameters. It clears the terminal and presents a prompt. It’s almost never necessary to do this, but it can be good for mental clarity.


who – The who command requires no parameters. It prints a list of all logged in users to the screen.

ham pts/1 Feb 21 13:52 (

top – every program or command that runs on the machine is a process. The top command displays running processes, sorted by how much CPU they are using, and refreshes the list continuously. Type ‘q’ to quit.

ps – ps stands for “process status”. It prints a snapshot of the currently running processes. By default it shows only the processes attached to your terminal; use the “-e” argument to see every process on the system.

[ham@PVDACDLN-01 ~]$ ps
  PID TTY          TIME CMD
24948 pts/1    00:00:00 bash
25366 pts/1    00:00:00 ps
[ham@PVDACDLN-01 ~]$

man – man stands for “manual”. It accepts one command name as a parameter and opens the man page for that command in a pager (usually less). Use the spacebar to page through it, or type ‘q’ to quit. Try typing ‘man ls’.

Man pages are generally dense and occasionally incomprehensible, but they do usually present every possible option for a command.

help – help is a bash command that provides help for builtin commands. It accepts a command name or pattern (partial name) as an argument.

Unfortunately, help only documents commands that are built into the shell itself; for everything else, use man. Try ‘help help’.

[ham@PVDACDLN-01 ~]$ help pwd
pwd: pwd [-PL]
Print the current working directory. With the -P option, pwd prints
the physical directory, without any symbolic links; the -L option
makes pwd follow symbolic links.
[ham@PVDACDLN-01 ~]$
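If you are not sure whether to reach for help or man, the type command (not covered above, so treat this as an optional extra) reports how the shell will interpret a command name. A minimal sketch, with variable names chosen only for the example:

```shell
# 'type' reports how the shell will interpret a command name.
cd_kind=$(type cd)
ls_kind=$(type ls)

echo "$cd_kind"   # cd is built into the shell, so 'help cd' has something to show
echo "$ls_kind"   # ls is normally a separate program, so reach for 'man ls'
```

The exact wording of the output varies a little between shells, so read it rather than relying on a fixed format.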

Manipulating Files and Directories

cp – the cp, or “copy” command, accepts two parameters. It copies the file named by the first parameter to the name given by the second. Each parameter may include a directory (relative or absolute) and a filename.

Use the -R (recursive) switch to copy entire directories.

[ham@PVDACDLN-01 ~]$ cp example.txt moreexamples.txt
[ham@PVDACDLN-01 ~]$ cp example.txt directory/example.txt
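The -R switch is not shown in the transcript above, so here is a small sketch of it, run in a scratch directory created with mktemp. Note that the destination directory must already exist before a file can be copied into it; the name directory_backup is invented for this example.

```shell
# Set up a scratch area with one file and one directory.
scratch=$(mktemp -d)
cd "$scratch"
mkdir directory
touch example.txt

cp example.txt directory/example.txt     # copy a single file into a directory
cp -R directory directory_backup         # copy the directory and everything in it

ls directory_backup                      # lists the copied contents
```

The original example.txt is left in place: cp always leaves the source untouched.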

mv – the mv, or “move” command, accepts two parameters. It moves the file named by the first parameter to the location named by the second. Each parameter may include a directory (relative or absolute) and a filename.

The mv command is also used for renaming files: moving a file to a new name in the same directory renames it.

[ham@PVDACDLN-01 ~]$ mv example.txt moreexamples.txt
[ham@PVDACDLN-01 ~]$ mv moreexamples.txt directory/example.txt

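rm – the rm, or “remove” command, deletes a file. Use the -r (recursive) switch to delete a directory and everything inside it. There is no trash or undo on the command line, so double-check the name before pressing Enter.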

[ham@PVDACDLN-01 ~]$ rm moreexamples.txt
[ham@PVDACDLN-01 ~]$ rm -r directory/
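Since rm cannot be undone, it is worth practising mv and rm somewhere harmless first. The following is a minimal sketch, assuming a scratch directory made with mktemp; the file names leftover.txt and notes.txt are invented for the example.

```shell
# Set up a scratch area with one file and one directory.
scratch=$(mktemp -d)
cd "$scratch"
mkdir directory
touch example.txt

mv example.txt moreexamples.txt            # rename: example.txt no longer exists
mv moreexamples.txt directory/example.txt  # move into directory/ under a new name

touch leftover.txt directory/notes.txt
rm leftover.txt                            # delete a single file
rm -r directory/                           # delete the directory and the files inside it

remaining=$(ls)
echo "files left: ${remaining:-none}"
```

After a mv, the old name is gone, so a second mv has to refer to the file by its new name.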

Resources and Help

Explore these links for additional information and explanations.