Pandas. Data processing

Pandas is an essential data analysis library within the Python ecosystem. For more details, see the official pandas documentation.

Data structures

Pandas operates with three basic data structures: Series, DataFrame, and Panel (the last has been deprecated in recent versions of pandas). There are extensions to this list, but for the purposes of this material the first two are more than enough.

We start by importing NumPy and Pandas using their conventional short names:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: randn = np.random.rand # To shorten notation in the code that follows (despite the name, this aliases the uniform sampler np.random.rand)

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = pd.Series(data, index=index)

The first mandatory argument can be

  • array-like
  • dictionary
  • scalar

Array-like

If data is an array-like, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [4]: s = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [5]: s
Out[5]: 
a    0.803862
b    0.474233
c    0.224809
d    0.580670
e    0.747660
dtype: float64

In [6]: s.index
Out[6]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [7]: pd.Series(randn(5))
Out[7]: 
0    0.280662
1    0.782841
2    0.550591
3    0.260844
4    0.261612
dtype: float64

Dictionary

Dictionaries already have a natural candidate for the index, so passing the index separately seems redundant, although possible.

In [8]: d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [9]: pd.Series(d)
Out[9]: 
a    0.0
b    1.0
c    2.0
dtype: float64

In [10]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[10]: 
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [11]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[11]: 
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Series is similar to array

Slicing and other operations on a Series produce results very similar to those on an array, but with a twist: the index is also sliced and always remains a part of the data container.

In [12]: s.iloc[0]
Out[12]: 0.8038617545671074

In [13]: s[:3]
Out[13]: 
a    0.803862
b    0.474233
c    0.224809
dtype: float64

In [14]: s[s > s.median()]
Out[14]: 
a    0.803862
e    0.747660
dtype: float64

In [15]: s.iloc[[4, 3, 1]]
Out[15]: 
e    0.747660
d    0.580670
b    0.474233
dtype: float64

Similarly to NumPy arrays, Series support vectorized operations, which are much faster than equivalent explicit Python loops.

In [16]: s + s
Out[16]: 
a    1.607724
b    0.948467
c    0.449618
d    1.161341
e    1.495321
dtype: float64

In [17]: s * 2
Out[17]: 
a    1.607724
b    0.948467
c    0.449618
d    1.161341
e    1.495321
dtype: float64

In [18]: np.exp(s)
Out[18]: 
a    2.234152
b    1.606782
c    1.252084
d    1.787236
e    2.112053
dtype: float64

A key difference between Series and array is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [19]: s[1:] + s[:-1]
Out[19]: 
a         NaN
b    0.948467
c    0.449618
d    1.161341
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

Series is similar to dictionary

A few examples illustrate the analogy.

In [20]: s['a']
Out[20]: 0.8038617545671074

In [21]: s['e'] = 12.

In [22]: s
Out[22]: 
a     0.803862
b     0.474233
c     0.224809
d     0.580670
e    12.000000
dtype: float64

In [23]: 'e' in s
Out[23]: True

In [24]: 'f' in s
Out[24]: False

Name attribute

Series can also have a name attribute which will become very useful when summarizing data with tables and plots.

In [25]: s = pd.Series(np.random.randn(5), name='random series')

In [26]: s
Out[26]: 
0    1.156323
1    0.508412
2    0.839466
3    0.635020
4   -2.117160
Name: random series, dtype: float64

In [27]: s.name
Out[27]: 'random series'

DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • A Series
  • Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.

In [28]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....: 

In [29]: df = pd.DataFrame(d)

In [30]: df
Out[30]: 
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [31]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[31]: 
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [32]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[32]: 
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

The row and column labels can be accessed respectively by accessing the index and columns attributes:

In [33]: df.index
Out[33]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [34]: df.columns
Out[34]: Index(['one', 'two'], dtype='object')

From dict of array-likes

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [35]: d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

In [36]: pd.DataFrame(d)
Out[36]: 
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [37]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[37]: 
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

From a list of dicts

In [38]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [39]: pd.DataFrame(data2)
Out[39]: 
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [40]: pd.DataFrame(data2, index=['first', 'second'])
Out[40]: 
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [41]: pd.DataFrame(data2, columns=['a', 'b'])
Out[41]: 
   a   b
0  1   2
1  5  10

From a dict of tuples

In [42]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....: 
Out[42]: 
       a              b      
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (unless another column name is provided).
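
For example, a minimal sketch (the Series name 'values' and the data are illustrative):

```python
import numpy as np
import pandas as pd

# A named Series...
s = pd.Series(np.arange(3.0), index=['a', 'b', 'c'], name='values')

# ...becomes a one-column DataFrame
df = pd.DataFrame(s)

print(df.index.tolist())    # ['a', 'b', 'c']  (same index as the Series)
print(df.columns.tolist())  # ['values']       (the Series name becomes the column name)
```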

Basic functionality

Here are the data sets that will be used below.

In [43]: index = pd.date_range('1/1/2000', periods=8)

In [44]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [45]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ....:                   columns=['A', 'B', 'C'])
   ....: 

Head and Tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

In [46]: long_series = pd.Series(np.random.randn(1000))

In [47]: long_series.head()
Out[47]: 
0   -2.204349
1    0.937905
2    0.080988
3   -0.906226
4   -0.175094
dtype: float64

In [48]: long_series.tail(3)
Out[48]: 
997    1.354623
998    0.666298
999    0.421843
dtype: float64

Attributes and the raw values

Pandas objects have a number of attributes enabling you to access the metadata:

  • shape: gives the axis dimensions of the object, consistent with ndarray

  • Axis labels

    • Series: index (only axis)
    • DataFrame: index (rows) and columns

Note, these attributes can be safely assigned to!

In [49]: df[:2]
Out[49]: 
                   A         B         C
2000-01-01  2.051798 -0.219221 -0.306257
2000-01-02 -0.074402  1.372959 -0.142413

In [50]: df.columns = [x.lower() for x in df.columns]

In [51]: df
Out[51]: 
                   a         b         c
2000-01-01  2.051798 -0.219221 -0.306257
2000-01-02 -0.074402  1.372959 -0.142413
2000-01-03 -1.737715 -0.567596 -2.157891
2000-01-04 -0.198540  0.429065 -1.384287
2000-01-05  0.927509  0.899038 -0.414452
2000-01-06  0.688222 -1.121087 -0.494530
2000-01-07 -0.142898 -1.443570 -0.152030
2000-01-08 -0.330306 -1.388025  2.315416

To get the actual data inside a data structure, one need only access the values property:

In [52]: s.values
Out[52]: array([-1.29503689, -1.05896664, -0.81633751, -1.78494207,  0.79900381])

In [53]: df.values
Out[53]: 
array([[ 2.05179753, -0.21922114, -0.3062565 ],
       [-0.07440165,  1.37295907, -0.1424128 ],
       [-1.73771544, -0.56759615, -2.15789072],
       [-0.19853994,  0.42906546, -1.38428694],
       [ 0.92750875,  0.89903831, -0.41445218],
       [ 0.68822161, -1.1210866 , -0.49453017],
       [-0.14289804, -1.44356981, -0.15203033],
       [-0.33030559, -1.38802519,  2.31541608]])

Descriptive statistics

Series and DataFrame offer a large number of methods for computing descriptive statistics and other related operations. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer:

  • Series: no axis argument needed
  • DataFrame: “index” (axis=0, default), “columns” (axis=1)

In [54]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....: 

In [55]: df.mean(0)
Out[55]: 
one      1.071293
three   -0.060576
two      0.988616
dtype: float64

In [56]: df.mean(1)
Out[56]: 
a    1.288946
b    1.213444
c   -0.026536
d    0.423999
dtype: float64

All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [57]: df.sum(0, skipna=False)
Out[57]: 
one           NaN
three         NaN
two      3.954464
dtype: float64

In [58]: df.sum(axis=1, skipna=True)
Out[58]: 
a    2.577893
b    3.640333
c   -0.079609
d    0.847998
dtype: float64

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [59]: ts_stand = (df - df.mean()) / df.std()

In [60]: ts_stand.std()
Out[60]: 
one      1.0
three    1.0
two      1.0
dtype: float64

In [61]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [62]: xs_stand.std(1)
Out[62]: 
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64

Series also has a method nunique() which will return the number of unique non-null values:

In [63]: series = pd.Series(np.random.randn(500))

In [64]: series[20:500] = np.nan

In [65]: series[10:20] = 5

In [66]: series.nunique()
Out[66]: 11

Summarizing data: describe

There is a convenient describe() method which computes a variety of summary statistics about a Series or the columns of a DataFrame:

In [67]: series = pd.Series(np.random.randn(1000))

In [68]: series[::2] = np.nan

In [69]: series.describe()
Out[69]: 
count    500.000000
mean      -0.025277
std        1.038469
min       -3.262829
25%       -0.668483
50%        0.005974
75%        0.642145
max        3.495553
dtype: float64

In [70]: frame = pd.DataFrame(np.random.randn(1000, 5),
   ....:                      columns=['a', 'b', 'c', 'd', 'e'])
   ....: 

In [71]: frame.iloc[::2] = np.nan

In [72]: frame.describe()
Out[72]: 
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.024593    0.004004    0.049564   -0.010214    0.070455
std      1.013887    0.993945    0.999204    0.998887    0.986454
min     -3.592443   -3.087312   -3.650603   -3.456179   -3.870670
25%     -0.692612   -0.704272   -0.669172   -0.603277   -0.559860
50%      0.079318    0.026328    0.055656    0.017281    0.126330
75%      0.695259    0.652323    0.744712    0.646474    0.734037
max      2.859714    3.222803    3.066365    2.673373    2.883375

You can select specific percentiles to include in the output:

In [73]: series.describe(percentiles=[.05, .25, .75, .95])
Out[73]: 
count    500.000000
mean      -0.025277
std        1.038469
min       -3.262829
5%        -1.732858
25%       -0.668483
50%        0.005974
75%        0.642145
95%        1.651018
max        3.495553
dtype: float64

For a non-numerical Series object, describe() will give a simple summary of the number of unique values and most frequently occurring values:

In [74]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [75]: s.describe()
Out[75]: 
count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if none are present, only categorical columns:

In [76]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [77]: frame.describe()
Out[77]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

This behaviour can be controlled by providing a list of types as include/exclude arguments. The special value 'all' can also be used:

In [78]: frame.describe(include=['object'])
Out[78]: 
         a
count    4
unique   2
top     No
freq     2

In [79]: frame.describe(include=['number'])
Out[79]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [80]: frame.describe(include='all')
Out[80]: 
          a         b
count     4  4.000000
unique    2       NaN
top      No       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000

Index of Min/Max Values

The idxmin() and idxmax() methods on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [81]: s1 = pd.Series(np.random.randn(5))

In [82]: s1
Out[82]: 
0   -0.338489
1   -1.488132
2   -0.158881
3   -0.402446
4    1.045299
dtype: float64

In [83]: s1.idxmin(), s1.idxmax()
Out[83]: (1, 4)

In [84]: df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])

In [85]: df1
Out[85]: 
          A         B         C
0 -0.667021  0.067390 -0.574891
1  0.477648  0.310007 -1.071278
2  0.658044 -0.749574  0.669567
3 -0.154715 -1.198873 -0.662044
4 -1.504282  1.379718  0.694963

In [86]: df1.idxmin(axis=0)
Out[86]: 
A    4
B    3
C    1
dtype: int64

In [87]: df1.idxmin(axis=1)
Out[87]: 
0    A
1    C
2    B
3    B
4    A
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [88]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [89]: df3
Out[89]: 
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [90]: df3['A'].idxmin()
Out[90]: 'd'

Value counts (histogramming) / Mode

The value_counts() Series method and top-level function compute a histogram of a 1D array of values.

In [91]: data = np.random.randint(0, 7, size=50)

In [92]: data
Out[92]: 
array([3, 6, 3, 3, 2, 2, 3, 6, 0, 0, 4, 1, 0, 3, 3, 3, 0, 4, 5, 4, 0, 3, 2,
       4, 1, 0, 0, 3, 5, 5, 2, 4, 4, 4, 5, 5, 3, 5, 0, 2, 0, 1, 2, 6, 4, 6,
       6, 2, 0, 3])

In [93]: s = pd.Series(data)

In [94]: s.value_counts()
Out[94]: 
3    11
0    10
4     8
2     7
5     6
6     5
1     3
dtype: int64

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [95]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [96]: s5.mode()
Out[96]: 
0    3
1    7
dtype: int64

In [97]: df5 = pd.DataFrame({'A': np.random.randint(0, 7, size=50),
   ....:                     'B': np.random.randint(-10, 15, size=50)})
   ....: 

In [98]: df5.mode()
Out[98]: 
     A   B
0  5.0  -9
1  NaN  13

Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:

In [99]: arr = np.random.randn(20)

In [100]: factor = pd.cut(arr, 4)

In [101]: factor
Out[101]: 
[(-0.623, 0.539], (-1.785, -0.623], (0.539, 1.701], (-0.623, 0.539], (-1.785, -0.623], ..., (0.539, 1.701], (-1.785, -0.623], (0.539, 1.701], (-1.785, -0.623], (-0.623, 0.539]]
Length: 20
Categories (4, object): [(-2.951, -1.785] < (-1.785, -0.623] < (-0.623, 0.539] < (0.539, 1.701]]

In [102]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [103]: factor
Out[103]: 
[(0, 1], (-1, 0], (0, 1], (-1, 0], (-5, -1], ..., (0, 1], (-5, -1], (1, 5], (-1, 0], (0, 1]]
Length: 20
Categories (4, object): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [104]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [105]: factor
Out[105]: 
[(-0.274, 0.64], (-0.905, -0.274], (0.64, 1.701], (-0.905, -0.274], [-2.947, -0.905], ..., (0.64, 1.701], [-2.947, -0.905], (0.64, 1.701], (-0.905, -0.274], (-0.274, 0.64]]
Length: 20
Categories (4, object): [[-2.947, -0.905] < (-0.905, -0.274] < (-0.274, 0.64] < (0.64, 1.701]]

In [106]: pd.value_counts(factor)
Out[106]: 
(0.64, 1.701]       5
(-0.274, 0.64]      5
(-0.905, -0.274]    5
[-2.947, -0.905]    5
dtype: int64

We can also pass infinite values to define the bins:

In [107]: arr = np.random.randn(20)

In [108]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [109]: factor
Out[109]: 
[(-inf, 0], (0, inf], (-inf, 0], (-inf, 0], (-inf, 0], ..., (0, inf], (0, inf], (-inf, 0], (0, inf], (0, inf]]
Length: 20
Categories (2, object): [(-inf, 0] < (0, inf]]

Function application

Row or Column-wise Function Application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:

In [110]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   .....: 'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   .....: 'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   .....: 

In [111]: df
Out[111]: 
        one     three       two
a -0.857537       NaN -0.569614
b  0.379606  2.636734 -2.030135
c  2.231425  0.988285  0.859822
d       NaN -1.472950 -0.389827

In [112]: df.apply(np.mean)
Out[112]: 
one      0.584498
three    0.717356
two     -0.532439
dtype: float64

In [113]: df.apply(np.mean, axis=1)
Out[113]: 
a   -0.713575
b    0.328735
c    1.359844
d   -0.931388
dtype: float64

In [114]: df.apply(lambda x: x.max() - x.min())
Out[114]: 
one      3.088961
three    4.109683
two      2.889957
dtype: float64

In [115]: df.apply(np.cumsum)
Out[115]: 
        one     three       two
a -0.857537       NaN -0.569614
b -0.477930  2.636734 -2.599750
c  1.753494  3.625018 -1.739928
d       NaN  2.152069 -2.129755

Depending on the return type of the function passed to apply(), the result will either be of lower dimension or the same dimension.
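
A brief sketch of both cases (the frame and functions here are illustrative): a function that collapses each column to a scalar yields a Series, while a function that returns a same-length Series per column yields a DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['x', 'y'])

# Reduction: each column collapses to a single number -> lower-dimensional Series
reduced = df.apply(lambda col: col.sum())
print(type(reduced).__name__)   # Series

# Transformation: each column maps to a same-length Series -> DataFrame of the same shape
demeaned = df.apply(lambda col: col - col.mean())
print(type(demeaned).__name__)  # DataFrame
```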

apply() combined with some cleverness can be used to answer many questions about a data set. For example, suppose we wanted to extract the date where the maximum value for each column occurred:

In [116]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   .....: index=pd.date_range('1/1/2000', periods=1000))
   .....: 

In [117]: tsdf.apply(lambda x: x.idxmax())
Out[117]: 
A   2000-09-07
B   2000-07-08
C   2001-08-04
dtype: datetime64[ns]

You may also pass additional arguments and keyword arguments to the apply() method. For instance, consider the following function you would like to apply:

In [118]: def subtract_and_divide(x, sub, divide=1):
   .....:     return (x - sub) / divide
   .....: 

In [119]: df.apply(subtract_and_divide, args=(5,), divide=3)
Out[119]: 
        one     three       two
a -1.952512       NaN -1.856538
b -1.540131 -0.787755 -2.343378
c -0.922858 -1.337238 -1.380059
d       NaN -2.157650 -1.796609

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [120]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....: index=pd.date_range('1/1/2000', periods=10))
   .....: 

In [121]: tsdf.iloc[4:8] = np.nan

In [122]: tsdf
Out[122]: 
                   A         B         C
2000-01-01 -0.266824  0.489647  0.667215
2000-01-02 -0.160550  1.951419  3.066388
2000-01-03  1.845223 -1.963037  1.004272
2000-01-04  1.743428 -0.480957 -1.656485
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN
2000-01-09 -0.393886  1.202304 -1.313902
2000-01-10  1.177766  0.066169  0.286364

In [123]: tsdf.apply(pd.Series.interpolate)
Out[123]: 
                   A         B         C
2000-01-01 -0.266824  0.489647  0.667215
2000-01-02 -0.160550  1.951419  3.066388
2000-01-03  1.845223 -1.963037  1.004272
2000-01-04  1.743428 -0.480957 -1.656485
2000-01-05  1.315966 -0.144305 -1.587969
2000-01-06  0.888503  0.192348 -1.519452
2000-01-07  0.461040  0.529000 -1.450935
2000-01-08  0.033577  0.865652 -1.382418
2000-01-09 -0.393886  1.202304 -1.313902
2000-01-10  1.177766  0.066169  0.286364

Applying elementwise Python functions

Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value. For example:

In [124]: df
Out[124]: 
        one     three       two
a -0.857537       NaN -0.569614
b  0.379606  2.636734 -2.030135
c  2.231425  0.988285  0.859822
d       NaN -1.472950 -0.389827

In [125]: df['one'].map(lambda x: len(str(x)))
Out[125]: 
a    19
b    18
c    17
d     3
Name: one, dtype: int64

In [126]: df.applymap(lambda x: len(str(x)))
Out[126]: 
   one  three  two
a   19      3   19
b   18     17   19
c   17     18   18
d    3     19   19

Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

  • Reorders the existing data to match a new set of labels
  • Inserts missing value (NA) markers in label locations where no data for that label existed
  • If specified, fills in data for missing labels using a fill method (highly relevant to working with time series data)

Here is a simple example:

In [127]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [128]: s.reindex(['e', 'b', 'f', 'd'])
Out[128]: 
e    1.333735
b    0.631174
f         NaN
d   -0.812845
dtype: float64

With a DataFrame, you can simultaneously reindex the index and columns:

In [129]: df
Out[129]: 
        one     three       two
a -0.857537       NaN -0.569614
b  0.379606  2.636734 -2.030135
c  2.231425  0.988285  0.859822
d       NaN -1.472950 -0.389827

In [130]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[130]: 
      three       two       one
c  0.988285  0.859822  2.231425
f       NaN       NaN       NaN
b  2.636734 -2.030135  0.379606

Reindexing to align with another object

You may wish to take an object and reindex its axes to be labeled the same as another object.

In [131]: df.reindex_like(df.iloc[:2, 2:])
Out[131]: 
        two
a -0.569614
b -2.030135

Aligning objects with each other with align

The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):

  • join='outer': take the union of the indexes (default)
  • join='left': use the calling object’s index
  • join='right': use the passed object’s index
  • join='inner': intersect the indexes

It returns a tuple with both of the reindexed Series:

In [132]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [133]: s1 = s[:4]

In [134]: s2 = s[1:]

In [135]: s1.align(s2)
Out[135]: 
(a   -1.059234
 b    0.259373
 c    1.928774
 d   -0.058033
 e         NaN
 dtype: float64, a         NaN
 b    0.259373
 c    1.928774
 d   -0.058033
 e    2.048256
 dtype: float64)

In [136]: s1.align(s2, join='inner')
Out[136]: 
(b    0.259373
 c    1.928774
 d   -0.058033
 dtype: float64, b    0.259373
 c    1.928774
 d   -0.058033
 dtype: float64)

In [137]: s1.align(s2, join='left')
Out[137]: 
(a   -1.059234
 b    0.259373
 c    1.928774
 d   -0.058033
 dtype: float64, a         NaN
 b    0.259373
 c    1.928774
 d   -0.058033
 dtype: float64)

For DataFrames, the join method will be applied to both the index and the columns by default:

In [138]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   .....: 'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   .....: 'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   .....: 

In [139]: df2 = pd.DataFrame({'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'e']),
   .....: 'three' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])})
   .....: 

In [140]: df2
Out[140]: 
      three       two
a -1.144876  1.172496
b  0.605939  0.144176
c -1.051258 -0.085485
d -0.057191       NaN
e       NaN  0.947694

In [141]: df.align(df2, join='inner')
Out[141]: 
(      three       two
 a       NaN  0.688195
 b -0.359800  1.210830
 c -0.350740  0.032263
 d -1.083179 -0.881242,       three       two
 a -1.144876  1.172496
 b  0.605939  0.144176
 c -1.051258 -0.085485
 d -0.057191       NaN)

You can also pass an axis option to only align on the specified axis:

In [142]: df.align(df2, join='inner', axis=0)
Out[142]: 
(        one     three       two
 a  1.664420       NaN  0.688195
 b -0.173819 -0.359800  1.210830
 c -0.640030 -0.350740  0.032263
 d       NaN -1.083179 -0.881242,       three       two
 a -1.144876  1.172496
 b  0.605939  0.144176
 c -1.051258 -0.085485
 d -0.057191       NaN)

Filling while reindexing

reindex() takes an optional parameter method which is a filling method chosen from the following options:

  • pad / ffill: Fill values forward
  • bfill / backfill: Fill values backward
  • nearest: Fill from the nearest index value

These methods require that the index is monotonically increasing or decreasing.
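
A short sketch of why the ordering matters (the index values are arbitrary): reindexing with a fill method over an unordered index raises a ValueError, while the sorted index forward-fills as expected.

```python
import pandas as pd

# A Series whose index is not monotonic
s = pd.Series([1.0, 2.0, 3.0], index=[0, 5, 2])

try:
    s.reindex(range(6), method='ffill')
    monotonic_error = False
except ValueError as e:
    monotonic_error = True   # pandas refuses to fill over an unordered index
    print('ValueError:', e)

# After sorting the index, forward-filling works
filled = s.sort_index().reindex(range(6), method='ffill')
print(filled.tolist())
```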

We illustrate these fill methods on a simple Series:

In [143]: rng = pd.date_range('1/3/2000', periods=8)

In [144]: ts = pd.Series(np.random.randn(8), index=rng)

In [145]: ts2 = ts[[0, 3, 6]]

In [146]: ts
Out[146]: 
2000-01-03   -1.338091
2000-01-04   -1.180912
2000-01-05   -0.184432
2000-01-06   -0.040998
2000-01-07   -0.589521
2000-01-08   -0.149317
2000-01-09   -0.540106
2000-01-10   -0.230329
Freq: D, dtype: float64

In [147]: ts2
Out[147]: 
2000-01-03   -1.338091
2000-01-06   -0.040998
2000-01-09   -0.540106
dtype: float64

In [148]: ts2.reindex(ts.index)
Out[148]: 
2000-01-03   -1.338091
2000-01-04         NaN
2000-01-05         NaN
2000-01-06   -0.040998
2000-01-07         NaN
2000-01-08         NaN
2000-01-09   -0.540106
2000-01-10         NaN
Freq: D, dtype: float64

In [149]: ts2.reindex(ts.index, method='ffill')
Out[149]: 
2000-01-03   -1.338091
2000-01-04   -1.338091
2000-01-05   -1.338091
2000-01-06   -0.040998
2000-01-07   -0.040998
2000-01-08   -0.040998
2000-01-09   -0.540106
2000-01-10   -0.540106
Freq: D, dtype: float64

In [150]: ts2.reindex(ts.index, method='bfill')
Out[150]: 
2000-01-03   -1.338091
2000-01-04   -0.040998
2000-01-05   -0.040998
2000-01-06   -0.040998
2000-01-07   -0.540106
2000-01-08   -0.540106
2000-01-09   -0.540106
2000-01-10         NaN
Freq: D, dtype: float64

In [151]: ts2.reindex(ts.index, method='nearest')
Out[151]: 
2000-01-03   -1.338091
2000-01-04   -1.338091
2000-01-05   -0.040998
2000-01-06   -0.040998
2000-01-07   -0.040998
2000-01-08   -0.540106
2000-01-09   -0.540106
2000-01-10   -0.540106
Freq: D, dtype: float64

Dropping labels from an axis

A method closely related to reindex() is drop(). It removes a set of labels from an axis:

In [152]: df
Out[152]: 
        one     three       two
a  1.664420       NaN  0.688195
b -0.173819 -0.359800  1.210830
c -0.640030 -0.350740  0.032263
d       NaN -1.083179 -0.881242

In [153]: df.drop(['a', 'd'], axis=0)
Out[153]: 
        one    three       two
b -0.173819 -0.35980  1.210830
c -0.640030 -0.35074  0.032263

In [154]: df.drop(['one'], axis=1)
Out[154]: 
      three       two
a       NaN  0.688195
b -0.359800  1.210830
c -0.350740  0.032263
d -1.083179 -0.881242

Renaming / mapping labels

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [155]: s
Out[155]: 
a   -1.059234
b    0.259373
c    1.928774
d   -0.058033
e    2.048256
dtype: float64

In [156]: s.rename(str.upper)
Out[156]: 
A   -1.059234
B    0.259373
C    1.928774
D   -0.058033
E    2.048256
dtype: float64

If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values). But if you pass a dict or Series, it need only contain a subset of the labels as keys:

In [157]: df.rename(columns={'one' : 'foo', 'two' : 'bar'},
   .....: index={'a' : 'apple', 'b' : 'banana', 'd' : 'durian'})
   .....: 
Out[157]: 
             foo     three       bar
apple   1.664420       NaN  0.688195
banana -0.173819 -0.359800  1.210830
c      -0.640030 -0.350740  0.032263
durian       NaN -1.083179 -0.881242

The rename() method also provides an inplace named parameter that is by default False and copies the underlying data. Pass inplace=True to rename the data in place.
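
A minimal sketch of the difference (the example Series is illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

renamed = s.rename(str.upper)   # default inplace=False: returns a new object
print(s.index.tolist())         # ['a', 'b', 'c'] -- the original is untouched

result = s.rename(str.upper, inplace=True)  # modifies s itself and returns None
print(result, s.index.tolist())             # None ['A', 'B', 'C']
```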

Sorting by index and value

There are two obvious kinds of sorting that you may be interested in: sorting by label and sorting by actual values. The primary method for sorting axis labels (indexes) across data structures is the sort_index() method.

In [158]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
   .....: columns=['three', 'two', 'one'])
   .....: 

In [159]: unsorted_df.sort_index()
Out[159]: 
      three       two       one
a       NaN  0.688195  1.664420
b -0.359800  1.210830 -0.173819
c -0.350740  0.032263 -0.640030
d -1.083179 -0.881242       NaN

In [160]: unsorted_df.sort_index(ascending=False)
Out[160]: 
      three       two       one
d -1.083179 -0.881242       NaN
c -0.350740  0.032263 -0.640030
b -0.359800  1.210830 -0.173819
a       NaN  0.688195  1.664420

In [161]: unsorted_df.sort_index(axis=1)
Out[161]: 
        one     three       two
a  1.664420       NaN  0.688195
d       NaN -1.083179 -0.881242
c -0.640030 -0.350740  0.032263
b -0.173819 -0.359800  1.210830

In older pandas versions, DataFrame.sort_index() accepts an optional by argument for axis=0, which uses an arbitrary vector or a column name of the DataFrame to determine the sort order (later pandas releases moved this functionality to sort_values()):

In [162]: df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})

In [163]: df1.sort_index(by='two')
Out[163]: 
   one  three  two
0    2      5    1
2    1      3    2
1    1      4    3
3    1      2    4

The by argument can take a list of column names, e.g.:

In [164]: df1[['one', 'two', 'three']].sort_index(by=['one','two'])
Out[164]: 
   one  two  three
2    1    2      3
1    1    3      4
3    1    4      2
0    2    1      5
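In later pandas releases, sort_index(by=...) was deprecated and the by-column sort moved to sort_values(), which takes the same by argument. The equivalent of the call above, as a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                    'two': [1, 3, 2, 4],
                    'three': [5, 4, 3, 2]})

# Modern spelling of df1.sort_index(by=['one', 'two']):
# sort by 'one', break ties by 'two'
out = df1.sort_values(by=['one', 'two'])
```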

Smallest / largest values

Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result. (The full sort shown in In [167] below uses order(), which later pandas releases renamed to sort_values().)

In [165]: s = pd.Series(np.random.permutation(10))

In [166]: s
Out[166]: 
0    1
1    9
2    3
3    4
4    8
5    7
6    6
7    5
8    2
9    0
dtype: int64

In [167]: s.order()
Out[167]: 
9    0
0    1
8    2
2    3
3    4
7    5
6    6
5    7
4    8
1    9
dtype: int64

In [168]: s.nsmallest(3)
Out[168]: 
9    0
0    1
8    2
dtype: int64

In [169]: s.nlargest(3)
Out[169]: 
1    9
4    8
5    7
dtype: int64
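In sufficiently recent pandas versions, DataFrame has the same pair of methods; they additionally take the column(s) to order by. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
                   'b': list('abdce')})

# Three whole rows with the largest values in column 'a'
top = df.nlargest(3, 'a')
```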

Sorting by a multi-index column

You must be explicit about sorting when the columns form a MultiIndex, and fully specify all levels to by.

In [170]: df1.columns = pd.MultiIndex.from_tuples([('a','one'),('a','two'),('b','three')])

In [171]: df1.sort_index(by=('a','two'))
Out[171]: 
    a         b
  one two three
3   1   2     4
2   1   3     2
1   1   4     3
0   2   5     1

Indexing and selecting data

Different Choices for Indexing

Pandas supports three types of multi-axis indexing.

  • .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

    • A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index)
    • A list or array of labels ['a', 'b', 'c']
    • A slice object with labels 'a':'f', (note that contrary to usual python slices, both the start and the stop are included!)
    • A boolean array
  • .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. Allowed inputs are:

    • An integer e.g. 5
    • A list or array of integers [4, 3, 0]
    • A slice object with ints 1:7
    • A boolean array
  • .ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. .ix is the most general and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes. (Note: .ix was deprecated in pandas 0.20 and later removed; prefer the explicit .loc and .iloc.)

However, when an axis is integer based, .ix supports ONLY label based access, not positional access. In such cases, it’s usually better to be explicit and use .iloc or .loc.
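The label-versus-position distinction is easiest to see on a Series whose integer labels are not in positional order (a minimal sketch):

```python
import pandas as pd

# Integer labels deliberately out of positional order
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # label 0 -> the second element
by_position = s.iloc[0]  # position 0 -> the first element
```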

Selection By Position

A few basic examples:

In [172]: s1 = pd.Series(np.random.randn(5),index=list(range(0,10,2)))

In [173]: s1
Out[173]: 
0   -0.712657
2   -0.416819
4    2.068256
6    1.355886
8    0.445316
dtype: float64

In [174]: s1.iloc[:3]
Out[174]: 
0   -0.712657
2   -0.416819
4    2.068256
dtype: float64

In [175]: s1.iloc[3]
Out[175]: 1.3558857604071533

In [176]: s1.iloc[:3] = 0

In [177]: s1
Out[177]: 
0    0.000000
2    0.000000
4    0.000000
6    1.355886
8    0.445316
dtype: float64

With a DataFrame:

In [178]: df1 = pd.DataFrame(np.random.randn(6,4),
   .....:                    index=list(range(0,12,2)),
   .....:                    columns=list(range(0,8,2)))
   .....: 

In [179]: df1
Out[179]: 
           0         2         4         6
0  -0.193341 -0.007379  0.189982  0.155440
2   2.217550 -0.176657  0.505735 -1.413420
4   0.158097  0.045783  1.342493 -0.579776
6  -0.071243  0.828100  0.086477  0.253491
8  -2.096503 -0.230619  0.134489  2.463292
10 -0.370085  1.681855  0.964464 -1.247107

In [180]: df1.iloc[:3]
Out[180]: 
          0         2         4         6
0 -0.193341 -0.007379  0.189982  0.155440
2  2.217550 -0.176657  0.505735 -1.413420
4  0.158097  0.045783  1.342493 -0.579776

In [181]: df1.iloc[1:5, 2:4]
Out[181]: 
          4         6
2  0.505735 -1.413420
4  1.342493 -0.579776
6  0.086477  0.253491
8  0.134489  2.463292

In [182]: df1.iloc[[1, 3, 5], [1, 3]]
Out[182]: 
           2         6
2  -0.176657 -1.413420
6   0.828100  0.253491
10  1.681855 -1.247107

In [183]: df1.iloc[1:3, :]
Out[183]: 
          0         2         4         6
2  2.217550 -0.176657  0.505735 -1.413420
4  0.158097  0.045783  1.342493 -0.579776

In [184]: df1.iloc[:, 1:3]
Out[184]: 
           2         4
0  -0.007379  0.189982
2  -0.176657  0.505735
4   0.045783  1.342493
6   0.828100  0.086477
8  -0.230619  0.134489
10  1.681855  0.964464

In [185]: df1.iloc[1, 1]
Out[185]: -0.17665674332428558

In [186]: df1.iloc[1]
Out[186]: 
0    2.217550
2   -0.176657
4    0.505735
6   -1.413420
Name: 2, dtype: float64

Boolean indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped using parentheses.

Using a boolean vector to index a Series works exactly as in a numpy ndarray:

In [187]: s = pd.Series(range(-3, 4))

In [188]: s
Out[188]: 
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [189]: s[s > 0]
Out[189]: 
4    1
5    2
6    3
dtype: int64

In [190]: s[(s < -1) | (s > 0.5)]
Out[190]: 
0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [191]: s[~(s < 0)]
Out[191]: 
3    0
4    1
5    2
6    3
dtype: int64
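The parentheses are not optional: & and | bind more tightly than comparisons, so an ungrouped expression parses differently and typically raises an error. A sketch:

```python
import pandas as pd

s = pd.Series(range(-3, 4))

# Grouped correctly: each comparison yields a boolean Series first
ok = s[(s > -2) & (s < 2)]

# Ungrouped: `s > -2 & s < 2` parses as a chained comparison around
# `(-2 & s)` and raises, because a Series has no single truth value
try:
    s[s > -2 & s < 2]
    raised = False
except ValueError:
    raised = True
```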

You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame):

In [192]: df = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                    'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                    'c' : np.random.randn(7)})
   .....: 

In [193]: df
Out[193]: 
       a  b         c
0    one  x -0.593495
1    one  y -2.165981
2    two  y  0.235662
3  three  x -0.817657
4    two  y -1.375797
5    one  x  1.164189
6    six  x -0.036121

In [194]: df[df['c'] > 0]
Out[194]: 
     a  b         c
2  two  y  0.235662
5  one  x  1.164189

In [195]: criterion = df['a'].map(lambda x: x.startswith('t'))

In [196]: df[criterion]
Out[196]: 
       a  b         c
2    two  y  0.235662
3  three  x -0.817657
4    two  y -1.375797

In [197]: df[criterion & (df['b'] == 'x')]
Out[197]: 
       a  b         c
3  three  x -0.817657

In [198]: df.loc[criterion & (df['b'] == 'x'), 'b':'c']
Out[198]: 
   b         c
3  x -0.817657
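For row filters like the ones above, DataFrame.query() offers a string-expression alternative; the frame below is hypothetical, mirroring the shape of the example data:

```python
import pandas as pd

df = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                   'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                   'c': [1.0, -2.0, 0.5, -0.8, -1.4, 1.2, -0.03]})

# Same idea as df[(df['c'] > 0) & (df['b'] == 'x')]
out = df.query("c > 0 and b == 'x'")
```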

Indexing with isin

Consider the isin method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want:

In [199]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1])

In [200]: s
Out[200]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [201]: s.isin([2, 4, 6])
Out[201]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [202]: s[s.isin([2, 4, 6])]
Out[202]: 
2    2
0    4
dtype: int64

The same method is available for Index objects and is useful for cases when you don’t know which of the sought labels are in fact present (note that indexing with a list containing missing labels, as in In [204] below, was later deprecated and now raises KeyError; prefer reindex() or the isin() approach):

In [203]: s[s.index.isin([2, 4, 6])]
Out[203]: 
4    0
2    2
dtype: int64

In [204]: s[[2, 4, 6]]
Out[204]: 
2    2.0
4    0.0
6    NaN
dtype: float64

In addition to that, MultiIndex allows selecting a separate level to use in the membership check:

In [205]: s_mi = pd.Series(np.arange(6),
   .....: index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....: 

In [206]: s_mi
Out[206]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [207]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[207]: 
0  c    2
1  a    3
dtype: int64

In [208]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[208]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64

DataFrame also has an isin method. When calling isin, pass a set of values as either an array or dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

In [209]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....: 'ids2': ['a', 'n', 'c', 'n']})
   .....: 

In [210]: df
Out[210]: 
  ids ids2  vals
0   a    a     1
1   b    n     2
2   f    c     3
3   n    n     4

In [211]: df.isin(['a', 'b', 1, 3])
Out[211]: 
     ids   ids2   vals
0   True   True   True
1   True  False  False
2  False  False   True
3  False  False  False

In [212]: df.isin({'ids': ['a', 'b'], 'vals': [1, 3]})
Out[212]: 
     ids   ids2   vals
0   True  False   True
1   True  False  False
2  False  False   True
3  False  False  False
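Combining the dict form with all() and ~ selects, or excludes, whole rows; a sketch over the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})

mask = df.isin({'ids': ['a', 'b'], 'vals': [1, 3]})

# Rows where both criteria columns match...
row_match = mask[['ids', 'vals']].all(axis=1)
selected = df[row_match]

# ...and the complement, via ~
excluded = df[~row_match]
```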

Set / Reset Index

DataFrame has a set_index method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex), to create a new, indexed DataFrame:

In [213]: data = pd.DataFrame({'a' : ['bar', 'bar', 'foo', 'foo'],
   .....:                      'b' : ['one', 'two', 'one', 'two'],
   .....:                      'c' : ['z', 'y', 'x', 'w'],
   .....:                      'd' : range(1, 5)})
   .....: 

In [214]: data
Out[214]: 
     a    b  c  d
0  bar  one  z  1
1  bar  two  y  2
2  foo  one  x  3
3  foo  two  w  4

In [215]: data.set_index('c')
Out[215]: 
     a    b  d
c             
z  bar  one  1
y  bar  two  2
x  foo  one  3
w  foo  two  4

In [216]: data.set_index(['a', 'b'])
Out[216]: 
         c  d
a   b        
bar one  z  1
    two  y  2
foo one  x  3
    two  w  4

In [217]: data.set_index(['a', 'b'], inplace=True)

reset_index is the inverse operation to set_index.

In [218]: data.reset_index()
Out[218]: 
     a    b  c  d
0  bar  one  z  1
1  bar  two  y  2
2  foo  one  x  3
3  foo  two  w  4

In [219]: data.reset_index(level='a')
Out[219]: 
       a  c  d
b             
one  bar  z  1
two  bar  y  2
one  foo  x  3
two  foo  w  4
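reset_index() also accepts drop=True, which discards the index instead of reinserting it as a column; sketched on a local rebuild of the frame above:

```python
import pandas as pd

data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': range(1, 5)}).set_index('c')

# drop=True: the old 'c' index is thrown away, not restored as a column
out = data.reset_index(drop=True)
```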

Todo

Complete Pandas section