Handling Data with Pandas

What is Pandas?

pandas is a Python package commonly used for data analysis. It simplifies loading data from external sources such as text files and databases, and provides ways of analysing and manipulating data once it is loaded into your computer. The features provided in pandas automate and simplify many common tasks that would otherwise take many lines of code to write in plain Python.

Pandas is a hugely popular, and still growing, Python library used across a range of disciplines from environmental and climate science, through to social science, linguistics, biology, as well as a number of applications in industry such as data analytics, financial trading, and many others.

Pandas is best suited for structured, labelled data, in other words, tabular data, that has headings associated with each column of data. The official Pandas website describes Pandas’ data-handling strengths as:

Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
Any other form of observational / statistical data sets. The data actually need not be labelled at all to be placed into a pandas data structure.

Some other important points to note about Pandas are:

Pandas is fast. Python sometimes gets a bad rap for being a bit slow compared to ‘compiled’ languages such as C and Fortran. However, much of the performance-critical code inside Pandas is written in C (and in Cython, which compiles to C), so Pandas can process large datasets efficiently.
Pandas is a dependency of another library called statsmodels, making it an important part of the statistical computing ecosystem in Python.

2. Ways of running Python with Pandas

Here we briefly discuss the different ways you can follow this tutorial. There are lots of different ways to run Python programs, and I don’t want to prescribe any one way as being the ‘best’. Users of RStudio and Matlab may find that the Spyder programming environment most closely matches the feel of those tools, with a window for editing scripts and an ‘interactive’ mode that can be used alongside. For a more minimalist approach, you may prefer to write your scripts/programs in a text editor such as Notepad++ (Windows), vim, emacs, or another popular editor. (But do not use Windows Notepad!) Personally, this is how I like to work with Python, as it frees you from the distractions of an IDE like Spyder and reduces the number of problems that can arise from Spyder being set up incorrectly.

Finally, there is IPython, which lets you type in Python commands line-by-line, similar to Matlab and RStudio, or an R console session.

Some more information on the three methods is described below:

Spyder

If you are attending the workshop ‘live’ on-site at Edinburgh University, the easiest way is to use the Spyder IDE (Integrated Development Environment), which is installed on the GeoSciences computers. It can also be installed on your laptop relatively easily. It is included in the Anaconda Python distribution, which can be downloaded here. Be sure to download the Python 3 version!

Text Editor with the Linux/Mac Terminal or Command-line

You can follow this tutorial by writing scripts saved as .py files and then running the script from the terminal or command line with the python command. e.g.

python myscript.py

Interactively with IPython

IPython is an ‘interactive’ Python interpreter. It lets you type in Python commands line-by-line, and then immediately execute them. It’s a very useful way of quickly testing and exploring Python commands, because you don’t have to interact directly with the command line or run the entire script. Spyder has an IPython console built in to it (on the right hand panel), or it can be started in Linux/Mac from the command line by running:

ipython

Note for interactive (IPython) users: If you are following this tutorial with IPython, you do not need to use print functions to get IPython to display variables or other Python objects. IPython will automatically print out the value of a variable when you type in the variable name and press ENTER. So for example:

In [1]: my_var = "Hello, World!"

In [2]: my_var  # Now press ENTER
Out[2]: 'Hello, World!'

IPython users: When you see a print function used in this tutorial, e.g. print(my_var), you can omit it and simply type the variable name (e.g. my_var) and press ENTER.

On Windows you will find IPython in the start menu if it has been installed.

Conventions when using Pandas

All the examples in this tutorial assume you have installed the Python library pandas, either through installing a scientific Python distribution such as Anaconda, or by installing it using a package-manager, such as conda or pip. To use any of the features of Pandas, you will need to have an import statement at the top of your script like so:

import pandas as pd

By convention, the pandas module is almost always imported this way as pd. Every time we use a pandas feature thereafter, we can shorten what we type by just typing pd, such as pd.some_function().

If you are running Python interactively, such as in IPython, you will need to type in the same import statement at the start of each interactive session.

Try the following to see which version of Pandas you are running:

import pandas as pd

print(pd.__version__)

3. Understand the basic Pandas data structures

Pandas has two core data structures used to store data: The Series and the DataFrame.

Series

The series is a one-dimensional array-like structure designed to hold a single array (or ‘column’) of data and an associated array of data labels, called an index. We can create a series to experiment with by simply passing a list of data, let’s use numbers in this example:

import pandas as pd

my_series = pd.Series([4.6, 2.1, -4.0, 3.0])
print(my_series)

The output should be:

0    4.6
1    2.1
2   -4.0
3    3.0
dtype: float64
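
The left-hand column in that output is the index, which defaults to 0, 1, 2, and so on. As a minimal sketch (the labels "a" to "d" and the name labelled_series are illustrative assumptions), the same data can be given explicit labels through the index argument and then looked up by label:

```python
import pandas as pd

# Same data as my_series, but with explicit (illustrative) labels
labelled_series = pd.Series([4.6, 2.1, -4.0, 3.0], index=["a", "b", "c", "d"])

print(labelled_series["c"])          # look up one value by its label
print(list(labelled_series.index))   # the labels
print(labelled_series.values)        # the underlying NumPy array of data
```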

DataFrames

The DataFrame represents tabular data, a bit like a spreadsheet. DataFrames are organised into columns (each of which is a Series), and each column can store a single data-type, such as floating point numbers, strings, boolean values etc. DataFrames can be indexed by either their row or column names. (They are similar in many ways to R’s data.frame.)

We can create a DataFrame in Pandas from a Python dictionary, or by loading in a text file containing tabular data. First we are going to look at how to create one from a dictionary.

If we didn’t have any real data to play with from an external file, we could manually create a DataFrame from a Python dictionary. Assuming a dictionary named scottish_hills has already been defined, with column names as its keys and lists of column values as its values, we can load it into a Pandas DataFrame with this syntax:

dataframe = pd.DataFrame(scottish_hills)
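
The scottish_hills dictionary itself is not reproduced in this section. As a stand-in, the sketch below assumes it maps column names to equal-length lists; the five hill names and heights are copied from the sorted table later in this tutorial, and the exact structure of the original dictionary is an assumption.

```python
import pandas as pd

# Assumed structure: one key per column, each holding a list of values.
scottish_hills = {
    "Hill Name": ["Ben Nevis", "Ben Macdui (Beinn Macduibh)", "Braeriach",
                  "Cairn Toul", "Sgor an Lochain Uaine"],
    "Height": [1344.5, 1309.0, 1296.0, 1291.0, 1258.0],
}

dataframe = pd.DataFrame(scottish_hills)

print(dataframe)            # the whole table, with a default 0..4 row index
print(dataframe["Height"])  # a single column, returned as a Series
print(dataframe.dtypes)     # one data type per column
```

Note that dataframe.Height is attribute-style shorthand for dataframe["Height"]. The bracket form is needed for column names that contain spaces, such as "Hill Name".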

5. Learn how to filter data in a Pandas DataFrame

We can also apply conditions to the data we are inspecting, such as to filter our data.

dataframe.Height > 1300

This would return:

0     True
1     True
2    False
3    False
4    False
Name: Height, dtype: bool

This returns a new Series of True/False values though. To actually filter the data, we need to use this Series to mask our original DataFrame:

dataframe[dataframe.Height > 1300]
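
Conditions can also be combined. The sketch below rebuilds a small DataFrame so that it runs on its own (the values are taken from the sorted table later in this tutorial; the two-column layout and the names tall and mid_range are assumptions) and shows the mask, a combined condition, and picking one column from the filtered rows with .loc:

```python
import pandas as pd

dataframe = pd.DataFrame({
    "Hill Name": ["Ben Nevis", "Ben Macdui (Beinn Macduibh)", "Braeriach",
                  "Cairn Toul", "Sgor an Lochain Uaine"],
    "Height": [1344.5, 1309.0, 1296.0, 1291.0, 1258.0],
})

# A boolean Series: True where the condition holds.
tall = dataframe["Height"] > 1300

# Using the mask keeps only the rows marked True.
print(dataframe[tall])

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
mid_range = dataframe[(dataframe["Height"] > 1280) & (dataframe["Height"] < 1300)]
print(mid_range)

# .loc takes a row mask and a column label together.
print(dataframe.loc[tall, "Hill Name"])
```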

6. Learn how to read data from a file using Pandas

So far we have only created data in Python itself, but Pandas has built-in tools for reading data from a variety of external data formats, including Excel spreadsheets, raw text and .csv files. It can also interface with databases such as MySQL, but we are not going to cover databases in this tutorial.

We’ve provided the scottish_hills.csv file in this GitHub repository. The file contains all the mountains above 3000 feet (about 914 metres) in Scotland. We can load this easily into a DataFrame with the read_csv function.

If you are writing a complete script to follow the tutorial, create a new file and enter:

import pandas as pd

dataframe = pd.read_csv("scottish_hills.csv")
print(dataframe.head(10))

Run the script, and you should get the following output:

                    Hill Name  Height   Latitude  Longitude    Osgrid
0       A' Bhuidheanach Bheag   936.0  56.870342  -4.199001  NN660775
1               A' Chailleach   997.0  57.693800  -5.128715  NH136714
2               A' Chailleach   929.2  57.109564  -4.179285  NH681041
3  A' Chraileag (A' Chralaig)  1120.0  57.184186  -5.154837  NH094147
4             A' Ghlas-bheinn   918.0  57.255090  -5.303687  NH008231
5               A' Mhaighdean   967.0  57.719644  -5.346720  NH007749
6              A' Mharconaich   973.2  56.857002  -4.290668  NN604762
7                  Am Basteir   934.0  57.247931  -6.202982  NG465253
8                   Am Bodach  1031.8  56.741727  -4.983393  NN176650
9               Am Faochagach   953.0  57.771801  -4.853899  NH303793

We’ve used the head() method to give us only the first 10 rows of the DataFrame, and avoid printing all 282 hills out to screen…
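
If you do not have scottish_hills.csv to hand, read_csv also accepts a file-like object, so the same call can be tried on text held in memory. The sketch below is only a stand-in: the three rows are copied from the output above, the name sample is illustrative, and the real file may differ in column order or formatting.

```python
import io

import pandas as pd

# Three rows copied from the printed output, laid out as CSV text.
csv_text = """Hill Name,Height,Latitude,Longitude,Osgrid
A' Bhuidheanach Bheag,936.0,56.870342,-4.199001,NN660775
A' Chailleach,997.0,57.693800,-5.128715,NH136714
A' Chailleach,929.2,57.109564,-4.179285,NH681041
"""

# io.StringIO wraps the string so it can be read like a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows only
print(sample.shape)     # (number of rows, number of columns)
print(sample.dtypes)    # the numeric columns are parsed as float64
```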

It looks like this table contains the hills in alphabetical order. It would be nice to see them in order of height. We can sort the DataFrame using the sort_values method. You can add the following lines to your script:

sorted_hills = dataframe.sort_values(by=['Height'], ascending=False)

# Let's have a look at the top 5 to check
print(sorted_hills.head(5))

Run the script with these extra lines, and have a look at the output:

                       Hill Name  Height   Latitude  Longitude    Osgrid
                       Ben Nevis  1344.5  56.796891  -5.003675  NN166712
     Ben Macdui (Beinn Macduibh)  1309.0  57.070368  -3.669099  NN988989
                       Braeriach  1296.0  57.078298  -3.728389  NN953999
                      Cairn Toul  1291.0  57.054397  -3.710773  NN963972
           Sgor an Lochain Uaine  1258.0  57.058369  -3.725797  NN954976

We now have our hills sorted by height. Note how we’ve used the by=['Height'] argument to specify that we want to sort by height, and then the ascending=False argument to get the heights sorted in descending order, from highest to lowest.
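
Two further details of sort_values are worth a quick experiment: sorted rows keep their original index labels, and by accepts more than one column. A minimal sketch, using a small hand-built DataFrame (values copied from the tables above; the variable names are illustrative):

```python
import pandas as pd

hills = pd.DataFrame({
    "Hill Name": ["A' Chailleach", "Ben Nevis", "A' Chailleach", "Braeriach"],
    "Height": [997.0, 1344.5, 929.2, 1296.0],
})

# Highest first; the index still shows each row's original position.
by_height = hills.sort_values(by="Height", ascending=False)
print(by_height)

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after sorting.
print(by_height.reset_index(drop=True))

# Sort by name first, then by height within each name.
by_name_then_height = hills.sort_values(by=["Hill Name", "Height"])
print(by_name_then_height)
```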
