What is Pandas?
pandas is a package commonly used for data analysis. It simplifies loading data from external sources such as text files and databases, and provides ways of analysing and manipulating the data once it is loaded into your computer. The features provided in pandas automate and simplify many common tasks that would take many lines of code to write in the basic Python language.
Pandas is a hugely popular, and still growing, Python library used across a range of disciplines, from environmental and climate science through to social science, linguistics and biology, as well as in industry applications such as data analytics and financial trading.
Pandas is best suited to structured, labelled data: in other words, tabular data that has headings associated with each column. The official Pandas website describes Pandas’ data-handling strengths as:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of observational / statistical data sets. The data actually need not be labelled at all to be placed into a pandas data structure.
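As a rough illustration of the first two kinds of data in this list, the sketch below builds a small table with mixed column types and an irregularly spaced time series. All names and values here (samples, readings, the site labels and the dates) are made up for this example and are not taken from the Pandas documentation.

```python
import pandas as pd

# Tabular data with heterogeneously-typed columns (made-up values)
samples = pd.DataFrame({
    "site": ["A", "B", "C"],           # strings
    "depth_m": [1.5, 3.0, 4.5],        # floating point numbers
    "validated": [True, False, True],  # booleans
})
print(samples.dtypes)

# Time series data that is not fixed-frequency (made-up dates)
readings = pd.Series(
    [10.2, 11.0, 9.7],
    index=pd.to_datetime(["2020-01-01", "2020-01-04", "2020-02-10"]),
)
print(readings)
```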
Some other important points to note about Pandas are:
- Pandas is fast. Python sometimes gets a bad rap for being a bit slow compared to ‘compiled’ languages such as C and Fortran. But the performance-critical internals of Pandas are written in C and Cython, so its built-in operations on large datasets typically run much faster than equivalent loops written in pure Python.
- Pandas is a dependency of another library called statsmodels, making it an important part of the statistical computing ecosystem in Python.
2. Ways of running Python with Pandas
Here we briefly discuss the different ways you can follow this tutorial. There are lots of different ways to run Python programs, and I don’t want to prescribe any one way as being the ‘best’. Users of RStudio and Matlab may find that the Spyder programming environment most closely matches the feel of those tools, with a window for editing scripts and an ‘interactive’ mode that can be used alongside. For a more minimalist approach, you may prefer to write your scripts/programs in a text editor such as Notepad++ (Windows), vim, emacs, or another popular editor. (But do not use Windows Notepad!) Personally, this is how I like to work with Python, as it frees you from the distractions of an IDE like Spyder and reduces the number of problems that can arise from Spyder being set up incorrectly.
Finally, there is IPython, which lets you type in Python commands line-by-line, similar to Matlab, RStudio, or an R console session.
More information on the three methods is given below:
Spyder
If you are attending the workshop ‘live’ on-site at Edinburgh University, the easiest way is to use the Spyder IDE (Integrated Development Environment), which is installed on the GeoSciences computers. It can also be installed on your laptop relatively easily. It is included in the Anaconda Python distribution, which can be downloaded here. Be sure to download the Python 3 version!
Text Editor with the Linux/Mac Terminal or Command-line
You can follow this tutorial by writing scripts saved as .py files and then running the script from the terminal or command line with the python command, e.g.
python myscript.py
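For illustration, a minimal myscript.py might look like the sketch below. The file contents are an assumption made for this example (the function name build_series is hypothetical); any script that imports pandas would be run in the same way.

```python
# myscript.py (hypothetical contents, for illustration only)
import pandas as pd


def build_series():
    """Return a small Series of made-up numbers."""
    return pd.Series([1.0, 2.0, 3.0])


# This part only runs when the file is executed as a script,
# e.g. with: python myscript.py
if __name__ == "__main__":
    print(build_series())
```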
Interactively with IPython
IPython is an ‘interactive’ Python interpreter. It lets you type in Python commands line-by-line, and then immediately execute them. It’s a very useful way of quickly testing and exploring Python commands, because you don’t have to interact directly with the command line or run the entire script. Spyder has an IPython console built into it (in the right-hand panel), or it can be started on Linux/Mac from the command line by running:
ipython
Note for interactive (IPython) users: If you are following this tutorial with IPython, you do not need to use print functions to get IPython to display variables or other Python objects. IPython will automatically display a variable’s value when you type the variable name and press ENTER. For example:
In [1]: my_var = "Hello, World!"
In [2]: my_var # Now press ENTER
Out[2]: 'Hello, World!'
IPython users: When you see a print function used in this tutorial, e.g. print(my_var), you can omit it and simply type the variable name (e.g. my_var) and press ENTER.
On Windows you will find IPython in the start menu if it has been installed.
Conventions when using Pandas
All the examples in this tutorial assume you have installed the Python library pandas, either by installing a scientific Python distribution such as Anaconda, or by installing it with a package manager such as conda or pip. To use any of the features of Pandas, you will need an import statement at the top of your script, like so:
import pandas as pd
By convention, the pandas module is almost always imported this way as pd. Every time we use a pandas feature thereafter, we can shorten what we type by just typing pd, such as pd.some_function().
If you are running Python interactively, such as in IPython, you will need to type in the same import statement at the start of each interactive session.
Try the following to see which version of Pandas you are running:
import pandas as pd
print(pd.__version__)
3. Understand the basic Pandas data structures
Pandas has two core data structures used to store data: the Series and the DataFrame.
Series
The Series is a one-dimensional array-like structure designed to hold a single array (or ‘column’) of data and an associated array of data labels, called an index. We can create a Series to experiment with by simply passing in a list of data. Let’s use numbers in this example:
import pandas as pd
my_series = pd.Series([4.6, 2.1, -4.0, 3.0])
print(my_series)
The output should be:
0    4.6
1    2.1
2   -4.0
3    3.0
dtype: float64
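The numbers 0 to 3 in the left-hand column are the default index. As a small sketch of how the index acts as a set of data labels, the example below reuses the same data with explicit labels; the labels "a" to "d" and the name labelled_series are made up for illustration.

```python
import pandas as pd

# Same data as above, but with explicit (made-up) labels as the index
labelled_series = pd.Series([4.6, 2.1, -4.0, 3.0], index=["a", "b", "c", "d"])

print(labelled_series.index)   # the labels
print(labelled_series.values)  # the underlying data as a NumPy array
print(labelled_series["c"])    # look up a single value by its label
```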
DataFrames
The DataFrame represents tabular data, a bit like a spreadsheet. DataFrames are organised into columns (each of which is a Series), and each column can store a single data type, such as floating point numbers, strings, boolean values etc. DataFrames can be indexed by either their row or column names. (They are similar in many ways to R’s data.frame.)
We can create a DataFrame in Pandas from a Python dictionary, or by loading in a text file containing tabular data. First we are going to look at how to create one from a dictionary.
If we didn’t have any real data to play with from an external file, we could manually create a DataFrame from a Python dictionary. Using the scottish_hills dictionary above, we can load it into a Pandas DataFrame with this syntax:
dataframe = pd.DataFrame(scottish_hills)
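The scottish_hills dictionary itself is not reproduced in this section. As a stand-in, the sketch below assumes a dictionary whose keys are column names and whose values are equal-length lists, filled in with the five tallest hills from the sorted CSV output shown later in this tutorial. The exact contents of the original dictionary may differ.

```python
import pandas as pd

# Assumed stand-in for the scottish_hills dictionary:
# each key becomes a column name, each list becomes one column.
scottish_hills = {
    "Hill Name": ["Ben Nevis", "Ben Macdui", "Braeriach",
                  "Cairn Toul", "Sgor an Lochain Uaine"],
    "Height": [1344.5, 1309.0, 1296.0, 1291.0, 1258.0],
    "Latitude": [56.796891, 57.070368, 57.078298, 57.054397, 57.058369],
    "Longitude": [-5.003675, -3.669099, -3.728389, -3.710773, -3.725797],
}

dataframe = pd.DataFrame(scottish_hills)
print(dataframe)

# Each column of the DataFrame is a Series with its own data type
print(type(dataframe["Height"]))
print(dataframe.dtypes)
```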
5. Learn how to filter data in a Pandas DataFrame
We can also apply conditions to the data we are inspecting, for example to filter it:
dataframe.Height > 1300
This would return:
0     True
1     True
2    False
3    False
4    False
Name: Height, dtype: bool
Note that this returns a new Series of True/False values rather than the filtered rows. To actually filter the data, we need to use this Series as a mask on our original DataFrame:
dataframe[dataframe.Height > 1300]
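The two steps can be sketched separately as below. The small DataFrame is a stand-in that uses hill names and heights from the CSV output in this tutorial, and the variable names mask, tall_hills and mid_hills are made up for illustration. The bracket form dataframe["Height"] used here selects the same column as dataframe.Height.

```python
import pandas as pd

# Small stand-in DataFrame (heights taken from the CSV output in this tutorial)
dataframe = pd.DataFrame({
    "Hill Name": ["Ben Nevis", "Ben Macdui", "Braeriach",
                  "Cairn Toul", "Sgor an Lochain Uaine"],
    "Height": [1344.5, 1309.0, 1296.0, 1291.0, 1258.0],
})

# Step 1: build the boolean mask (one True/False value per row)
mask = dataframe["Height"] > 1300

# Step 2: use the mask to keep only the rows where it is True
tall_hills = dataframe[mask]
print(tall_hills)

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses.
mid_hills = dataframe[(dataframe["Height"] > 1260) & (dataframe["Height"] < 1300)]
print(mid_hills)
```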
6. Learn how to read data from a file using Pandas
So far we have only created data in Python itself, but Pandas has built-in tools for reading data from a variety of external data formats, including Excel spreadsheets, raw text and .csv files. It can also interface with databases such as MySQL, but we are not going to cover databases in this tutorial.
We’ve provided the scottish_hills.csv file in this GitHub repository. The file contains all the mountains above 3000 feet (about 914 metres) in Scotland. We can load this easily into a DataFrame with the read_csv function.
If you are writing a complete script to follow the tutorial, create a new file and enter:
import pandas as pd
dataframe = pd.read_csv("scottish_hills.csv")
print(dataframe.head(10))
Run the script, and you should get the following output:
                    Hill Name  Height   Latitude  Longitude    Osgrid
0       A' Bhuidheanach Bheag   936.0  56.870342  -4.199001  NN660775
1               A' Chailleach   997.0  57.693800  -5.128715  NH136714
2               A' Chailleach   929.2  57.109564  -4.179285  NH681041
3  A' Chraileag (A' Chralaig)  1120.0  57.184186  -5.154837  NH094147
4             A' Ghlas-bheinn   918.0  57.255090  -5.303687  NH008231
5               A' Mhaighdean   967.0  57.719644  -5.346720  NH007749
6              A' Mharconaich   973.2  56.857002  -4.290668  NN604762
7                  Am Basteir   934.0  57.247931  -6.202982  NG465253
8                   Am Bodach  1031.8  56.741727  -4.983393  NN176650
9               Am Faochagach   953.0  57.771801  -4.853899  NH303793
We’ve used the head() method to give us only the first 10 rows of the DataFrame, and avoid printing all 282 hills to the screen.
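If you do not have the file to hand, the same calls can be tried on CSV text held in memory. The sketch below assumes the file is comma-separated with a header row matching the column names in the output above, and copies three rows from that output; the names csv_text and small_df are made up for illustration.

```python
import io

import pandas as pd

# A few rows copied from the output above, written as CSV text.
# The comma-separated layout is an assumption about the real file.
csv_text = """Hill Name,Height,Latitude,Longitude,Osgrid
A' Bhuidheanach Bheag,936.0,56.870342,-4.199001,NN660775
A' Chailleach,997.0,57.693800,-5.128715,NH136714
A' Chailleach,929.2,57.109564,-4.179285,NH681041
"""

# read_csv accepts a file path or any file-like object
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.head(2))  # first 2 rows only
print(small_df.shape)    # (number of rows, number of columns)
print(small_df.dtypes)   # column types inferred by read_csv
```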
It looks like this table contains the hills in alphabetical order. It would be nice to see them in order of height. We can sort the DataFrame using the sort_values method. You can add the following lines to your script:
sorted_hills = dataframe.sort_values(by=['Height'], ascending=False)
# Let's have a look at the top 5 to check
print(sorted_hills.head(5))
Run the script with these extra lines, and have a look at the output:
                       Hill Name  Height   Latitude  Longitude    Osgrid
92                     Ben Nevis  1344.5  56.796891  -5.003675  NN166712
88   Ben Macdui (Beinn Macduibh)  1309.0  57.070368  -3.669099  NN988989
104                    Braeriach  1296.0  57.078298  -3.728389  NN953999
115                   Cairn Toul  1291.0  57.054397  -3.710773  NN963972
212        Sgor an Lochain Uaine  1258.0  57.058369  -3.725797  NN954976
We now have our hills sorted by height. Note how we’ve used the by=['Height'] argument to specify that we want to sort by height, and then the ascending=False argument to get the heights sorted in descending order, from highest to lowest.
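Because by takes a list, it can hold more than one column, and ascending can then be a matching list of True/False values. The sketch below illustrates this on a small stand-in DataFrame built from a few rows of the outputs above; the variable names are made up for this example. It also shows nlargest, a related DataFrame method that sorts descending and keeps the first n rows in one call.

```python
import pandas as pd

# Small stand-in DataFrame using rows from the outputs above
hills = pd.DataFrame({
    "Hill Name": ["A' Chailleach", "Am Bodach", "A' Chailleach", "Ben Nevis"],
    "Height": [997.0, 1031.8, 929.2, 1344.5],
})

# Sort by name (A to Z), breaking ties by height (highest first)
by_name_then_height = hills.sort_values(
    by=["Hill Name", "Height"], ascending=[True, False]
)
print(by_name_then_height)

# sort_values returns a new DataFrame; the original order is unchanged
print(hills)

# nlargest: the n rows with the largest values in the given column
print(hills.nlargest(2, "Height"))
```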