May 19, 2024


First steps with Pandas


Presentation

Pandas is a practical library for analyzing and visualizing large datasets, building on the functionality of NumPy and matplotlib.

Pandas defines three data structures:

  • Series: a labelled one-dimensional array, capable of holding any type of object.
  • DataFrame: a two-dimensional data structure, where columns can be of different types.
  • Panel: a three-dimensional data structure, essentially a dictionary of DataFrames (removed in pandas 0.25; a DataFrame with a MultiIndex is now used instead).

As DataFrames are two-dimensional arrays, they are ideal for CSV files. A DataFrame can be seen as a stack of Series sharing the same index (and therefore aligned).

Creating data structures

Creation of Series

First example of Series creation:

  >>> import pandas as pd
  >>> import numpy as np
  >>> series = pd.Series([1, 2, 3, 4, np.nan, "chain"])
  >>> NetRate = pd.Series(np.random.randint(0, 2, 100))

Here the Series dtype is object; it would have been float64 in the absence of the string, since np.nan is a numerical (float) value and the Series would then be treated as a series of numbers.
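This dtype behaviour can be checked directly:

```python
import numpy as np
import pandas as pd

# With a string present, pandas falls back to the generic object dtype
s_obj = pd.Series([1, 2, 3, 4, np.nan, "chain"])
# Without the string, np.nan promotes the integers to float64
s_num = pd.Series([1, 2, 3, 4, np.nan])
print(s_obj.dtype)  # object
print(s_num.dtype)  # float64
```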

Creating DataFrames

DataFrames can be created from dictionaries, lists of lists, Series, NumPy arrays or record arrays, Excel or CSV files, databases...

From a numpy array

Simple example of creation from a numpy array:

  >>> pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3))

We can, at the same time, name rows and columns, as follows:

  >>> pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3),
             columns = list('ABC'),
             index = list('XY'))

From a dictionary

Example of creating a dataframe from a dictionary (so the values are lists):

  >>> gender = [['F','H'][x] for x in np.random.randint(0, 2, 100)]
  >>> lateral = [['D','G'][x] for x in np.random.randint(0, 2, 100)]
  >>> age = np.random.randint(1, 101, 100)
  >>> df = pd.DataFrame({'Gender':gender, 'Lateral':lateral, 'Age':age})

From another DataFrame

To make a DataFrame from the columns of another DataFrame:

  >>> df1 = df[ ['Gender','Age'] ]

DataFrame and Series Concatenation

  • The NetRate Series can be concatenated to the DataFrame df as follows:
  >>> df1 = pd.concat([df, NetRate], axis=1)

To set the column name of a Series s within df1:

  >>> df1 = pd.concat([df, s.rename('name')], axis=1)
  • For example, to make an indicator for the variable Gender (1 for men, 0 for women):
  >>> code = pd.Series(np.zeros(df.shape[0]))
  >>> code[ df['Gender']=='H' ] = 1

Or even:

  >>> code = df['Gender'].eq('H').astype('int')

The date_range

  • Pandas supports date indexing:
  >>> dtindex = pd.date_range(start='2014-04-28 00:00', periods=96, freq='h')
  >>> df = pd.DataFrame(data, index=dtindex, columns=['measure'])

dtindex is a DatetimeIndex initialized on April 28, 2014 at 0:00 am, and includes 96 hourly frequency periods (4 days). Note that, in date_range, the number of periods can be replaced by an end date.
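For instance, the same 96 hourly stamps can be obtained by giving an end date instead of periods (here with the lowercase 'h' alias, which recent pandas versions prefer over 'H'):

```python
import pandas as pd

# 4 days of hourly stamps: the end date replaces periods=96
idx = pd.date_range(start='2014-04-28 00:00', end='2014-05-01 23:00', freq='h')
print(len(idx))  # 96
```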

  • In the previous example, to compute the (hourly) maximum per day, it is enough to resample to daily data 'D', and say how to aggregate:
  >>> df.resample('D').max()
  • The above can be converted to quarter-hourly data, filling each missing row with the last known value:
  >>> df.resample('15min').ffill()

In the absence of a filling method (e.g. with df.resample('15min').asfreq()), the new rows are NaN.

  • We can also fill the missing rows with the column mean:
  >>> df.resample('15min').asfreq().fillna(df.mean())

or interpolate missing data:

  >>> df.resample('15min').interpolate()
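The different filling strategies can be compared on a toy hourly series upsampled to quarter-hours (values 0, 2, 4 are made up for the example):

```python
import pandas as pd

idx = pd.date_range('2014-04-28', periods=3, freq='h')
df = pd.DataFrame({'measure': [0.0, 2.0, 4.0]}, index=idx)

ffilled = df.resample('15min').ffill()        # repeat the last known value
interp = df.resample('15min').interpolate()   # linear between known points
print(ffilled['measure'].iloc[1])  # 0.0 at 00:15
print(interp['measure'].iloc[1])   # 0.5 at 00:15
```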

Basic manipulations

Copy

  • Copy a DataFrame:
  >>> df1 = df.copy()

Modify the data

  • To add an index column:
  >>> df_indices = df.reset_index()

reset_index() keeps the old index as an ordinary column; with reset_index(drop=True), the old index is simply discarded instead.
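A quick sketch of both behaviours (the toy column and index labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'Age': [10, 20]}, index=['a', 'b'])
# The old index is kept as a regular column named 'index'
print(df.reset_index().columns.tolist())          # ['index', 'Age']
# With drop=True the old index is discarded
print(df.reset_index(drop=True).columns.tolist()) # ['Age']
```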

  • To rename the columns of a DataFrame:
  >>> df.rename(columns = lambda c: chr(65+c))

You can also rename the indices of a Series, and name it, with the arguments index (list) and name (str).

  • To sort the data according to a column:
  >>> df.sort_values(by='column')

with the option ascending=True/False.

  • Mathematical functions can be applied directly to a Series:
  >>> s.sum()

Data operations

On the columns

  • To delete a variable:
  >>> df.drop(["age"], axis = 1)

This returns a copy of the data, without affecting df; pass inplace=True to modify df directly. Alternatively:

  >>> del df["age"]
  • New columns can be created from existing columns, either by applying a dictionary or a function:
  >>> df['Sex'] = df['Gender'].map({'H':0, 'F':1})
  >>> df['Name_length'] = df['Name'].apply(len)

Correlations

Calculate a linear correlation between two columns of a DataFrame:

  >>> df['Age'].corr(df['Duration'])

Calculate the correlation matrix between each pair of variables:

  >>> corr_matrix = df.corr()
  >>> corr_matrix["nbInterventions"].sort_values(ascending=False)

Note that these are only linear correlations. Nonlinear relationships of the type "if x is close to 0, then y tends to increase" will not appear here.

Note: Do not hesitate to combine variables to see whether the correlation with the target increases. For example, the number of rooms per dwelling is more directly correlated with apartment prices in a given district than either the number of rooms or the number of dwellings taken separately: the ratio measures the average size of dwellings in that district.

Series Features

  • When two Series with named indices are added, they may have different sizes: only entries sharing an index label are added together.
    • The other rows become NaN.
    • Missing entries can be filled via the fill_value = 0 argument of the add method.
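Index alignment and fill_value can be sketched as follows (the toy labels and values are made up):

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])

# Plain addition aligns on labels; 'a' and 'c' become NaN
print((s1 + s2)['b'])                      # 12.0
# fill_value=0 treats missing labels as 0
print(s1.add(s2, fill_value=0).tolist())   # [1.0, 12.0, 20.0]
```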
  • The calculations are vectorized for Pandas Series, as under Numpy :
  >>> df['maximum rate']*np.log(df['age'])

View the data

Access to data

  • See the first n lines of a Dataframe:
  >>> df.head(n)

and the last n:

  >>> df.tail(n)

Data location

  • To extract the first 5 rows and 10 columns of the DataFrame:
  >>> df.iloc[:5, :10]

Slicing also works on Series.

  • To locate them with named indices:
  >>> df.loc[:5, 'A':'D']
  >>> df.loc[:5, ['A','D']]

Thus, between loc and iloc, the former refers to labels (names), the latter to integer positions. You can also slice with the column names directly, as follows:

  >>> df['Age'][0:3]
  >>> df[0:3]['Age']
  >>> df.Age[0:3]

All of the above are nearly equivalent to:

  >>> df.loc[0:3,'Age']

except that loc slices include the endpoint: df.loc[0:3] returns four rows, whereas df[0:3] returns three.
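The difference in endpoint handling between loc and iloc can be checked on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Age': range(10, 20)})   # default integer index 0..9
print(len(df.iloc[0:3]))         # 3 rows: iloc excludes position 3
print(len(df.loc[0:3, 'Age']))   # 4 rows: loc includes the label 3
```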
  • To isolate a subset of observations:
  >>> df.loc[df['Age']==10,:]
  • Logical operators for location, combining conditions:
  >>> df.loc[(df['type']=='A')&(df['Age']==2),:]

Use | for or, and ~ for not.

  • Isolate a subset of observations on a subset of values:
  >>> df.loc[df['type'].isin(['A','B']),:]

In the above, the : can be replaced by a list of columns.

Data iteration

You can iterate over the columns, or over the values of a column:

  >>> for col in df.columns:
  >>> for value in df['Age']:

Data information

To get the size of a DataFrame:

  >>> df.shape

and for general information:

  >>> df.info()

Column information

  • To display the names of the columns:
  >>> df.columns

and for the type of each column:

  >>> df.dtypes
  • Calculation of the average of a column:
  >>> df['Age'].mean()
  • Number of occurrences of each value of a column: use describe on the column, or value_counts(). To get the number of times a given value appears in a column:
  >>> (df['Age']==10).value_counts()

The return is in the form:

  False 250
  True 20
  • To obtain the indices of the sorted elements of a column:
  >>> df['Age'].argsort()

crosstab

  • To cross-tabulate two variables, we use crosstab:
  >>> pd.crosstab(df['sex'], df['heart'])
  heart   absence  presence
  sex
  female       67        20
  male         83       100

Use the normalize='index' option for row percentages.
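With normalize='index', each row of the cross table sums to 1; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'sex':   ['female', 'female', 'male', 'male'],
                   'heart': ['absence', 'presence', 'absence', 'absence']})
# normalize='index' turns counts into row proportions
ct = pd.crosstab(df['sex'], df['heart'], normalize='index')
print(ct.loc['male', 'absence'])   # 1.0
```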

  • Rather than a number of occurrences in the crossing, a calculation can be made on the populations of the crosses, for example the average age by sex and disease:
  >>> pd.crosstab(df['sex'], df['heart'], values = df['age'], aggfunc = pd.Series.mean)

describe

You can analyze one DataFrame per column, or one Series, via

  >>> df.describe()

which provides the count of the column, as well as the mean, standard deviation, min, max and quartiles. By default, describe reports only on numeric columns. We can pass the argument include='all' so as not to be limited to such data.

groupby

  • Pandas can do SQL-style GROUP BY. Thus,
  >>> df.groupby(['Gender','Lateral']).aggregate('mean')

which gives:

                     Age
  Gender Lateral
  F      D         45.47
         G         49.21
  H      D         41.57
         G         55.82
  • groupby() gives access to the sub-DataFrame associated with each value of the grouping variable.
  # Data split by sex
  >>> g = df.groupby('sex')
  # dimensions of the sub-DataFrame associated with men
  >>> g.get_group('masculine').shape
  # Average age of men
  >>> g.get_group('masculine')['age'].mean()
  • Apply various functions to different columns, by group:
  >>> g[ ['age','depression'] ].agg(['mean', 'std'])
  • We can iterate over groups:
  >>> for name, group in g:
  ...     print(name)  # group label
  ...     print(group['age'].mean())
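The get_group and iteration patterns above can be sketched on a tiny made-up frame (the sex and age values are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['masculine', 'feminine', 'masculine'],
                   'age': [30, 40, 50]})
g = df.groupby('sex')
print(g.get_group('masculine')['age'].mean())  # 40.0
for name, sub in g:
    # groups are iterated in sorted key order: feminine, then masculine
    print(name, len(sub))
```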

Graphics

  • Series and DataFrame have a plot function. For example:
  >>> df['Y'].plot()

To display the curves in the Jupyter notebook:

  >>> %matplotlib inline

Various basic plots

  • Histogram:
  >>> df.hist(column='age')

Histogram of age by sex:

  >>> df.hist(column='age', by='sex')

The number of bars can be specified with bins=20; if no column is specified, all the numerical columns are plotted in separate histograms.

  • Density plot:
  >>> df['age'].plot.kde()
  • Comparison of distributions with a boxplot:
  >>> df.boxplot(column='age', by='sex')
  • Pie chart:
  >>> df['sex'].value_counts().plot.pie()

Scatter plots

  >>> df.plot.scatter(x='age', y='ratemax')

Point size according to a third variable:

  >>> df.plot.scatter(x='age', y='ratemax', s=df['depression'])

or c='depression' to vary the colour of the points. You can use the predefined colour palette "jet" with the cmap option, which ranges from blue (low values) to red (high values), and add the colorbar as a legend:

  >>> import matplotlib.pyplot as plt
  >>> df.plot(kind = "scatter", x="longitude", y="latitude",
              c = "age", cmap = plt.get_cmap("jet"), colorbar = True)

To visualize areas where the point density is high, we can set alpha=0.1.

3D scatter plots

To display a 3D scatter plot, specifying the viewing angle:

  >>> import matplotlib.pyplot as plt
  >>> from mpl_toolkits.mplot3d import Axes3D
  >>> fig = plt.figure()
  >>> ax = fig.add_subplot(111, projection='3d')

Then, to change the viewing angle (in degrees):

  >>> ax.view_init(azim=30, elev=10)
  >>> ax.scatter(X[:,0], X[:,1], X[:,2])

Linear regression

To draw a point cloud with a linear regression line:

  >>> import seaborn as sns
  >>> sns.lmplot(x='Age', y='Duration', data=df, fit_reg=True)

Saving and importing data

CSV

  • Read a CSV with Pandas:
  >>> pd.read_csv('file.csv')

which returns a DataFrame. The CSV header row supplies the column names. Various options can be passed:

  >>> pd.read_csv('file.csv', parse_dates = True,
            index_col ='DateTime', names=['DateTime','Value'],
            header = None, sep = ',')
  • Write to a CSV:
  >>> df.to_csv('file.csv')

to_excel also exists.
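A write/read round trip can be sketched as follows; the file name, columns and the index=False argument are just example choices:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'DateTime': ['2014-04-28 00:00'], 'Value': [1.5]})
path = os.path.join(tempfile.mkdtemp(), 'file.csv')
df.to_csv(path, index=False)   # index=False avoids writing the row index as a column
df2 = pd.read_csv(path)
print(df2.shape)   # (1, 2)
```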

Text file

To read the data in a text file:

  >>> df=pd.read_table("file.txt", sep='\t', header=0)

where:

  • sep provides the field separator,
  • header specifies the line of field names.
