Pandas is a practical library for analyzing and visualizing large datasets, building on the functionality of NumPy and matplotlib.
Pandas defines three data structures:
As DataFrames are two-dimensional tables, they are ideal for CSV files. A DataFrame can be seen as a stack of Series that share the same index (and are therefore aligned).
First example of Series creation:
>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([1, 2, 3, 4, np.nan, "chain"])
>>> NetRate = pd.Series(np.random.randint(0, 2, 100))
Here the dtype of series is object. Without the string it would have been float64: np.nan is a floating-point value, so the Series would then be treated as a series of numbers.
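A minimal sketch of this dtype inference; the variable names numeric and mixed are only illustrative:

```python
import numpy as np
import pandas as pd

# Only numbers and NaN: pandas infers a floating-point dtype.
numeric = pd.Series([1, 2, 3, 4, np.nan])

# One string among the values forces the generic object dtype.
mixed = pd.Series([1, 2, 3, 4, np.nan, "chain"])

print(numeric.dtype)
print(mixed.dtype)
```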
DataFrames can be created from dictionaries, lists of lists, Series, NumPy arrays or record arrays, Excel or CSV files, databases...
Simple example of creation from a numpy array:
>>> pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3))
We can, at the same time, name the rows and columns, as follows:
>>> pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3), columns = list('ABC'), index = list('XY'))
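Note that reshape(2,3) only works when the array holds exactly 2 x 3 = 6 values. A small sketch, with values and frame as illustrative names, showing how the row and column labels can then be used for access:

```python
import numpy as np
import pandas as pd

# Six values fit a 2 x 3 array; any other count makes reshape raise a ValueError.
values = np.arange(1, 7).reshape(2, 3)

frame = pd.DataFrame(values, columns=list("ABC"), index=list("XY"))

print(frame.shape)
print(frame.loc["Y", "B"])  # second row, second column
```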
Example of creating a dataframe from a dictionary (so the values are lists):
>>> gender = [['F','H'][x] for x in np.random.randint(0, 2, 100)]
>>> lateral = [['D','G'][x] for x in np.random.randint(0, 2, 100)]
>>> age = np.random.randint(1, 101, 100)
>>> df = pd.DataFrame({'Genre':gender,'Lateral':lateral,'Age':age})
To make a DataFrame from the columns of another DataFrame:
>>> df1 = df[ ['Genre','Age'] ]
>>> df1 = pd.concat([df, NetRate], axis=1)
To set the column name of the Series s in df1:
>>> df1 = pd.concat([df, s.rename('name')], axis=1)
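As a small illustration of the concatenation above (df_demo, s_demo and the column name "Rate" are made up for the example), the renamed Series becomes a new column aligned on the shared index:

```python
import pandas as pd

df_demo = pd.DataFrame({"Age": [31, 45, 27]})
s_demo = pd.Series([0.2, 0.5, 0.9])

# rename() gives the Series a name, which becomes the column label.
combined = pd.concat([df_demo, s_demo.rename("Rate")], axis=1)

print(combined.columns.tolist())
```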
>>> code = pd.Series(np.zeros(df.shape[0]))
Or even:
>>> code = df['Genre'].eq('H').astype('int')
>>> dtindex = pd.date_range(start='2014-04-28 00:00', periods=96, freq='H')
dtindex is a DatetimeIndex initialized on April 28, 2014 at 0:00 am, and includes 96 hourly frequency periods (4 days). Note that, in date_range, the number of periods can be replaced by an end date.
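A short sketch of the two equivalent ways of calling date_range. The names by_periods and by_end are illustrative, and the lowercase alias "h" is used on the assumption that a recent pandas is installed (older material writes "H"):

```python
import pandas as pd

# 96 hourly timestamps starting on April 28, 2014 at midnight...
by_periods = pd.date_range(start="2014-04-28 00:00", periods=96, freq="h")

# ...or the same range described by its end date instead of a count.
by_end = pd.date_range(start="2014-04-28 00:00", end="2014-05-01 23:00", freq="h")

print(len(by_periods), by_periods[-1])
```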
>>> df.resample('D').max()
>>> df.resample('15min').ffill()
Without a fill method (for example with .asfreq()), the rows added by the resampling contain NaN.
>>> df.resample('15min').asfreq().fillna(df.mean())
or interpolate missing data:
>>> df.resample('15min').interpolate()
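The sketch below compares these resampling options on a tiny hourly DataFrame. The data and the names hourly, raw, filled, smooth and daily_max are invented for the illustration, and the method-chaining form assumes a recent pandas version:

```python
import pandas as pd

idx = pd.date_range("2014-04-28 00:00", periods=3, freq="h")
hourly = pd.DataFrame({"value": [0.0, 4.0, 8.0]}, index=idx)

raw = hourly.resample("15min").asfreq()          # new rows are left as NaN
filled = hourly.resample("15min").ffill()        # repeat the last known value
smooth = hourly.resample("15min").interpolate()  # linear interpolation
daily_max = hourly.resample("D").max()           # downsample: one row per day

print(smooth.head())
```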
>>> df1 = df.copy()
>>> df_indices = df.reset_index()
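reset_index(drop=True) discards the old index instead of inserting it as a column; it is useful for renumbering the rows after the NaN rows have been removed with dropna().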
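A minimal sketch of how reset_index(drop=True) is typically combined with dropna(); df_nan, cleaned and renumbered are illustrative names:

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"Age": [25.0, np.nan, 40.0]})

# dropna() removes the row containing NaN but keeps the original labels 0 and 2.
cleaned = df_nan.dropna()

# reset_index(drop=True) renumbers the rows 0, 1 and discards the old labels.
renumbered = cleaned.reset_index(drop=True)

print(cleaned.index.tolist(), renumbered.index.tolist())
```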
>>> df.rename(columns = lambda c: chr(65+c))
You can also rename the index of a Series, and name the Series itself, with the arguments index (list) and name (str).
>>> df.sort_values(by='column')
with the option ascending=True/False.
>>> s.sum()
>>> df.drop(["age"], axis = 1)
This returns a copy of the data without the column, leaving df unchanged. To remove the column from df itself, do:
>>> del df["age"]
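The difference between the two approaches can be checked on a toy DataFrame (people and without_age are illustrative names):

```python
import pandas as pd

people = pd.DataFrame({"age": [20, 30], "name": ["a", "b"]})

# drop() without inplace returns a new DataFrame; people is untouched.
without_age = people.drop(columns=["age"])
still_there = "age" in people.columns

# del modifies people directly.
del people["age"]

print(without_age.columns.tolist(), still_there, people.columns.tolist())
```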
>>> df['Sex'] = df.Genre.map({'man':0,'woman':1})
Calculate a linear correlation between two columns of a DataFrame:
>>> df['Age'].corr(df['Duration'])
Calculate the correlation matrix between each pair of variables:
>>> corr_matrix = df.corr()
>>> corr_matrix["nbInterventions"].sort_values(ascending=False)
Note that these are only linear correlations. Relationships of the type "when x is close to 0, y tends to increase" do not appear here.
Note: do not hesitate to combine variables to see whether the combination correlates better with the target. For example, the number of rooms per unit is more directly correlated with the price of an apartment in a given district than the number of rooms or the number of units taken separately: the ratio measures the average size of the dwellings in the district.
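Both remarks can be illustrated on synthetic data (x, y and the derived variable are invented for the sketch): a perfectly deterministic but non-linear relationship gives a linear correlation close to zero, while a suitably transformed variable correlates strongly with the target.

```python
import numpy as np
import pandas as pd

# y depends entirely on x, but not linearly.
x = pd.Series(np.linspace(-1, 1, 201))
y = x ** 2

linear = x.corr(y)             # Pearson correlation of x and y
transformed = x.abs().corr(y)  # correlation after transforming x

print(round(linear, 3), round(transformed, 3))
```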
>>> df['maximum rate']*np.log(df['age'])
>>> df.head(n)
and the last n:
>>> df.tail(n)
>>> df.iloc[:5, :10]
Slicing also works on Series.
>>> df.loc[:5,'A':'D']
Thus loc selects by label and iloc by integer position. You can also slice using the column names directly, as follows:
>>> df['Age'][0:3]
>>> df[0:3]['Age']
>>> df.Age[0:3]
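With the default integer index, all of the above is equivalent to the following (loc slices by label and includes the end label, hence 0:2):
>>> df.loc[0:2,'Age']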
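One point worth checking on a small example is that a label slice with loc includes its end point, whereas positional slices do not. In this sketch the DataFrame ages is invented and uses the default integer index:

```python
import pandas as pd

ages = pd.DataFrame({"Age": [10, 20, 30, 40, 50]})

by_position = ages["Age"][0:3]   # positions 0, 1, 2
by_iloc = ages.iloc[0:3, 0]      # positions 0, 1, 2
by_label = ages.loc[0:3, "Age"]  # labels 0 through 3, end included

print(len(by_position), len(by_iloc), len(by_label))
```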
>>> df.loc[df['Age']==10,:]
>>> df.loc[(df['type']=='A')&(df['Age']==2),:]
For "or", use |, and for "not", use ~.
>>> df.loc[df['type'].isin(['A','B']),:]
In the above, the : can be replaced by a list of columns.
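A compact sketch of these filters on made-up data (data, both, either and negated are illustrative names). Each condition needs its own parentheses, because & and | bind more tightly than ==:

```python
import pandas as pd

data = pd.DataFrame({"type": ["A", "B", "C", "A"], "Age": [2, 2, 5, 7]})

both = data.loc[(data["type"] == "A") & (data["Age"] == 2), :]    # and
either = data.loc[(data["type"] == "A") | (data["Age"] == 2), :]  # or

# not, combined with isin, and a column list instead of ":"
negated = data.loc[~data["type"].isin(["A", "B"]), ["Age"]]

print(len(both), len(either), negated["Age"].tolist())
```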
You can iterate over the columns, or over the values of a column:
>>> for col in df.columns:
...     print(col)
>>> for value in df['Age']:
...     print(value)
To get the size of a DataFrame:
>>> df.shape
and for general information:
>>> df.info()
>>> df.columns
and for the type of each column:
>>> df.dtypes
>>> df['Age'].mean()
>>> (df['Age']==10).value_counts()
The return is in the form:
False    250
True      20
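The same counting pattern on a tiny, invented Series (ages_demo and counts are illustrative names); the result is itself a Series indexed by True and False:

```python
import pandas as pd

ages_demo = pd.Series([10, 12, 10, 15, 10])

# The comparison yields booleans; value_counts() tallies each of them.
counts = (ages_demo == 10).value_counts()

print(counts[True], counts[False])
```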
>>> df['Age'].argsort()
>>> pd.crosstab(df['sex'],df['heart'])
heart   absence  presence
sex
male         83       100
Use the normalize='index' option to get row percentages.
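A minimal sketch of the effect of normalize='index', on a made-up table (patients, counts and shares are illustrative names): each row of the result sums to 1.

```python
import pandas as pd

patients = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "female"],
    "heart": ["absence", "presence", "absence", "absence", "presence"],
})

counts = pd.crosstab(patients["sex"], patients["heart"])
shares = pd.crosstab(patients["sex"], patients["heart"], normalize="index")

print(counts)
print(shares)
```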
>>> pd.crosstab(df['sex'], df['heart'], values = df['age'], aggfunc = pd.Series.mean)
You can summarize a DataFrame column by column, or a Series, via
>>> df.describe()
which provides the number of non-missing values in each column, as well as the mean, standard deviation, min, max and quartiles. By default, describe returns information only on numeric data. We can pass the argument include='all' so as not to be limited to such data.
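To see the difference, here is a sketch on a small mixed DataFrame (mixed_df and its contents are invented): the default call keeps only the numeric column, while include='all' adds rows such as unique, top and freq for the text column.

```python
import pandas as pd

mixed_df = pd.DataFrame({"age": [20, 30, 40], "sex": ["m", "f", "m"]})

numeric_only = mixed_df.describe()
everything = mixed_df.describe(include="all")

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
```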
>>> df.groupby(['Genre','Lateral']).aggregate(np.mean)
which gives:
                  Age
Gender Lateral
woman  right    45.47
       left     49.21
man    right    41.57
       left     55.82
# Data split by gender
>>> g = df.groupby('sex')
# dimension of the sub-DataFrame associated with men
>>> g.get_group('masculine').shape
# average age of men
>>> g.get_group('masculine')['age'].mean()
>>> g[ ['age','depression'] ].agg([pd.Series.mean, pd.Series.std])
>>> for group in g:
... print(pd.Series.mean(group[1]['age']))
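The group operations above can be tried end to end on a toy table. The DataFrame survey and its values are invented for this sketch, and the string names "mean" and "std" are used in agg as one possible spelling:

```python
import pandas as pd

survey = pd.DataFrame({
    "sex": ["masculine", "feminine", "masculine", "feminine", "feminine"],
    "age": [30, 20, 50, 40, 60],
})

g = survey.groupby("sex")

men = g.get_group("masculine")         # sub-DataFrame for one group
stats = g["age"].agg(["mean", "std"])  # several statistics at once

# Iterating over a groupby yields (group name, sub-DataFrame) pairs.
means = {name: group["age"].mean() for name, group in g}

print(stats)
```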
>>> df['Y'].plot()
To display the curves in the Jupyter notebook:
>>> %matplotlib inline
>>> df.hist(column='age')
Histogram of age by sex:
>>> df.hist(column='age', by='sex')
The number of bars can be specified, for example with bins=20. If no column is specified, every numerical column is plotted in its own histogram.
>>> df['age'].plot.kde()
>>> df.boxplot(column='age', by='sex')
>>> df['sex'].value_counts().plot.pie()
>>> df.plot.scatter(x='age', y='ratemax')
Point size according to a third value:
>>> df.plot.scatter(x='age', y='ratemax', s=df['depression'])
or c='depression' to vary the color of the points. With the cmap option you can use a predefined color palette such as "jet", which ranges from blue (low values) to red (high values), and add the colorbar as a legend:
>>> df.plot(kind = "scatter", x="longitude", y="latitude", c = "age", cmap = plt.get_cmap("jet"), colorbar = True)
>>> plt.legend()
To visualize the places where the point density is high, we can set alpha=0.1.
To display a 3D point cloud, specifying the angle of view:
>>> import matplotlib.pyplot as plt
>>> from mpl_toolkits.mplot3d import Axes3D
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
To change the angle of view (in degrees):
>>> ax.view_init(azim=30, elev=10)
>>> ax.scatter(X[:,0], X[:,1], X[:,2])
To draw a point cloud with a linear regression line:
>>> import seaborn as sns
>>> sns.lmplot(x='Age', y='Duration', data=df, fit_reg=True)
>>> pd.read_csv('file.csv')
which returns a DataFrame. The CSV headers are the names of the columns. Various options can be passed:
>>> pd.read_csv('file.csv', parse_dates = True, index_col ='DateTime', names=['DateTime','Value'], header = None, sep = ',')
>>> df.to_csv('file.csv')
to_excel also exists.
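A self-contained sketch of reading and writing CSV with the options shown above. To avoid depending on a file, the CSV content is an in-memory string (csv_text, an assumption of this example) wrapped in io.StringIO, and to_csv is called without a path so that it returns the text:

```python
import io

import pandas as pd

# Two header-less rows: a timestamp and a value.
csv_text = "2014-04-28 00:00,1.5\n2014-04-28 01:00,2.5\n"

frame = pd.read_csv(
    io.StringIO(csv_text),
    names=["DateTime", "Value"],  # column names, since the data has no header
    header=None,
    index_col="DateTime",         # use the timestamps as the index
    parse_dates=True,             # and parse them as dates
    sep=",",
)

# Without a path, to_csv returns the CSV content as a string.
out = frame.to_csv()

print(type(frame.index).__name__)
print(out.splitlines()[0])
```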
To read the data in a text file:
>>> df=pd.read_table("file.txt", sep='\t', header=0)
where: