Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets. Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He convinced AQR to allow him to open-source the library. Another AQR employee, Chang She, joined as the second major contributor to the library in 2012. Many versions of pandas have been released since then; at the time of writing, the latest version was 1.4.1.
Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.
Key features of Pandas:
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
Size mutability: columns can be inserted into and deleted from DataFrame and higher dimensional objects.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing, and subsetting of large data sets.
Group by functionality for aggregating and transforming data.
Time series functionality.
Flexible merging, concatenation, and reshaping of data.
A lightweight alternative is to install Pandas using the popular Python package installer, pip.
pip install pandas
Pandas has traditionally dealt with the following three data structures:
1. Series
2. DataFrame
3. Panel (deprecated in version 0.20 and removed in version 0.25)
With Pandas data structures, the mental effort of the user is reduced. For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.
All Pandas data structures are value mutable (their contents can be changed), and all except Series are size mutable; a Series is size immutable. Note: DataFrame is widely used and is one of the most important data structures. Panel was rarely used and is no longer available in current versions of pandas.
Series
Series is a one-dimensional array-like structure with homogeneous data.
Key Points:
Homogeneous data
Size Immutable
Values of Data Mutable
pandas.Series(data, index=idx)
Here data may be a Python sequence (such as a list), an ndarray, a scalar value, or a Python dictionary, and the index argument is optional.
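The four kinds of input listed above can be sketched as follows. The variable names and sample values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# From a Python list: the index defaults to 0, 1, 2, 3
s_list = pd.Series([10, 15, 18, 22])

# From a NumPy ndarray with explicit labels
s_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# From a scalar: the value is repeated once for every label in the index
s_scalar = pd.Series(7, index=["a", "b", "c"])

# From a dictionary: the keys become the index labels
s_dict = pd.Series({"jan": 31, "feb": 28, "mar": 31})

print(s_list, s_array, s_scalar, s_dict, sep="\n\n")
```

Note that a scalar needs an index to tell pandas how many times to repeat the value.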
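How to create a Series with a custom index
import numpy as np
import pandas as pd
# The labels passed as index replace the default positions 0, 1, 2, 3
arr = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(arr, index=['first', 'second', 'third', 'fourth'])
print(s)
How to create a Series with the default index
import numpy as np
import pandas as pd
# Without an index argument, the labels default to 0, 1, 2, 3
arr = np.array([10, 15, 18, 22])
s = pd.Series(arr)
print(s)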
We can say that a Series is a labeled one-dimensional array that can hold any type of data.
✓ The data of a Series is always mutable, meaning its values can be changed.
✓ The size of a Series is immutable, meaning it cannot be changed.
✓ A Series may be considered a data structure with two arrays, one of which holds the index (labels) and the other the actual data.
✓ The row labels in a Series are called the index.
head(): It is used to access the first 5 rows of a series.
Note: To access the first 3 rows we can call series_name.head(3).
tail(): It is used to access the last 5 rows of a series.
Note: To access the last 4 rows we can call series_name.tail(4).
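A short sketch of head() and tail() on a ten-element Series; the name marks and its values are hypothetical.

```python
import pandas as pd

# Ten sample values: 10, 20, ..., 100
marks = pd.Series(range(10, 110, 10))

first_five = marks.head()    # default: first 5 rows
first_three = marks.head(3)  # first 3 rows
last_five = marks.tail()     # default: last 5 rows
last_four = marks.tail(4)    # last 4 rows

print(first_three)
print(last_four)
```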
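Series provides the label-based indexer loc, the position-based indexer iloc, and the [] operator to access its elements.
Selection using loc (index labels):
Syntax: series_name.loc[StartRange : StopRange]
Both StartRange and StopRange are labels, and the element at StopRange is included in the result.
Selection using iloc (integer positions):
Syntax: series_name.iloc[StartRange : StopRange]
Both StartRange and StopRange are integer positions, and the element at StopRange is excluded from the result.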
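The difference between the two indexers shows up most clearly at the end of a range. The following is a minimal sketch with made-up labels: a label slice with loc includes its stop label, while a position slice with iloc stops before its stop position.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22],
              index=["first", "second", "third", "fourth"])

# loc slices by label and includes the stop label "third"
by_label = s.loc["first":"third"]

# iloc slices by integer position and excludes position 2
by_position = s.iloc[0:2]

# [] with a single label returns one element
single = s["fourth"]

print(by_label)
print(by_position)
print(single)
```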
Slicing is a way to retrieve subsets of data from a pandas object. The slice syntax is SERIES_NAME[start:end:step].
Here start is the position of the first item, end is the position at which the slice stops (the item at end is not included), and step is the increment between successive items.
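As an illustration of the start:end:step form, here is a small sketch on a Series with the default index; the values are arbitrary.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22, 30, 41])

first_three = s[:3]       # positions 0, 1, 2
every_second = s[1:5:2]   # positions 1 and 3
reversed_s = s[::-1]      # a negative step walks backwards

print(every_second)
```

Integer slices written with [] are positional, so the item at the end position is left out, as in ordinary Python lists.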
DataFrame
A DataFrame is a two-dimensional object that is useful for representing data in the form of rows and columns. It is similar to a spreadsheet or an SQL table, and it is the most commonly used pandas object. Once we store the data in a DataFrame, we can perform various operations that are useful in analyzing and understanding the data.
1. A DataFrame has two axes (indices):
➢ Row index (axis=0)
➢ Column index (axis=1)
2. It is similar to a spreadsheet, whose row index is called the index and whose column index is called the column name.
3. A DataFrame contains heterogeneous data.
4. The size of a DataFrame is mutable.
5. The data of a DataFrame is mutable.
A data frame can be created using any of the following:
1. Series 2. Lists 3. Dictionary 4. A NumPy 2D array.
import pandas as pd
s = pd.Series(['a','b','c','d'])
df=pd.DataFrame(s)
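The other three sources can be sketched in the same way. The column names and values below are invented for the example.

```python
import numpy as np
import pandas as pd

# From a list of lists: each inner list is one row
df_lists = pd.DataFrame([["Asha", 21], ["Ravi", 23]],
                        columns=["name", "age"])

# From a dictionary: each key is a column name, each value a column
df_dict = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [21, 23]})

# From a NumPy 2D array: rows and columns follow the array's shape
df_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                        columns=["a", "b", "c"])

print(df_lists)
print(df_dict)
print(df_array)
```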
To access the records of a data frame row-wise or column-wise, we iterate over it. Pandas provides 2 functions to perform iteration:
1. iterrows() 2. iteritems() (deprecated in pandas 1.5 and removed in 2.0; use items() in newer versions)
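A minimal sketch of both kinds of iteration follows. It uses items() for the column-wise loop, because iteritems() was removed in pandas 2.0 while items() works in old and new versions alike. The data frame contents are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"List1": [10, 15], "List2": [20, 20]})

# Row-wise: iterrows() yields (row label, row as a Series)
row_totals = {}
for row_label, row in df.iterrows():
    row_totals[row_label] = int(row.sum())

# Column-wise: items() yields (column name, column as a Series)
column_totals = {}
for column_name, column in df.items():
    column_totals[column_name] = int(column.sum())

print(row_totals)
print(column_totals)
```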
import pandas as pd
s = pd.Series([10, 15, 18, 22])
df = pd.DataFrame(s)
df.columns = ['List1']  # Rename the default column of the data frame to List1
df['List2'] = 20  # Create a new column List2 with all values set to 20
df['List3'] = df['List1'] + df['List2']  # Add List1 and List2 and store the result in a new column List3
We can delete a column from a data frame by using any of the following:
1. del 2. pop() 3. drop()
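The three approaches behave slightly differently, as this sketch with made-up columns suggests: del and pop() change the data frame in place, while drop() returns a new data frame by default.

```python
import pandas as pd

df = pd.DataFrame({"List1": [10, 15], "List2": [20, 20],
                   "List3": [30, 35], "List4": [1, 2]})

# del removes the column in place and returns nothing
del df["List4"]

# pop() removes the column in place and returns it as a Series
popped = df.pop("List3")

# drop() returns a new data frame and leaves df unchanged
trimmed = df.drop(columns=["List2"])

print(df.columns.tolist())
print(trimmed.columns.tolist())
```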
Pandas provides the loc and iloc indexers to access a subset of a data frame by row and column. loc selects by label and iloc selects by integer position.
Df.loc[StartRow : EndRow, StartColumn : EndColumn]
Df.iloc[StartRowIndex : EndRowIndex, StartColumnIndex : EndColumnIndex]
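The two forms above can select the same block of a data frame, as in this sketch. The row labels r1 to r3 and the column names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Asha", "Ravi", "Meena"],
     "age": [21, 23, 30],
     "city": ["Pune", "Delhi", "Agra"]},
    index=["r1", "r2", "r3"],
)

# Label-based: both end labels ("r2" and "age") are included
by_label = df.loc["r1":"r2", "name":"age"]

# Position-based: the stop positions (row 2, column 2) are excluded
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```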
The method head() gives the first 5 rows and the method tail() returns the last 5 rows.
Boolean indexing lets us select data from a DataFrame using a boolean vector: the rows where the vector is True are selected. A DataFrame can also be created with a boolean index and then queried by True or False.
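One common form of boolean indexing builds the boolean vector from a condition on a column. The sketch below takes that approach; the data and the age thresholds are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"],
                   "age": [21, 17, 30]})

# A comparison on a column produces a boolean Series, one value per row
is_adult = df["age"] >= 18

# Passing the boolean Series to [] keeps only the rows marked True
adults = df[is_adult]

# Conditions can be combined with & (and) and | (or), using parentheses
young_adults = df[(df["age"] >= 18) & (df["age"] < 25)]

print(adults)
print(young_adults)
```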
Pandas provides various facilities for easily combining Series and DataFrame objects.
pd.concat(objs, axis=0, join='outer', ignore_index=False)
• objs − This is a sequence or mapping of Series or DataFrame objects.
• axis − {0, 1, ...}, default 0. This is the axis to concatenate along.
• join − {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis (or axes). Outer for union and inner for intersection.
• ignore_index − boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1.
• join_axes − Older versions accepted a list of Index objects to use for the other (n-1) axes instead of performing inner/outer set logic. This parameter was removed in pandas 1.0; reindex the result instead.
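A brief sketch of how the axis, join, and ignore_index parameters affect the result. All data frames here are invented for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "name": ["Asha", "Ravi"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["Meena", "John"]})
df3 = pd.DataFrame({"score": [88, 92]})
df4 = pd.DataFrame({"id": [5], "city": ["Pune"]})

# axis=0 (default) stacks rows; ignore_index relabels them 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the objects side by side, aligned on the row index
side_by_side = pd.concat([df1, df3], axis=1)

# join='inner' keeps only the columns present in every object
common_only = pd.concat([df1, df4], join="inner", ignore_index=True)

print(stacked)
print(side_by_side)
print(common_only)
```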
Two DataFrames might hold different kinds of information about the same entity and be linked by some common feature or column. To join such DataFrames, pandas provides several functions, such as merge() and join().
Full Outer Join: The full outer join combines the results of both the left and the right outer joins. The joined data frame will contain all records from both data frames, with NaN filled in for missing matches on either side. You can perform a full outer join by passing outer as the how argument of the merge() function.
Inner Join: The inner join produces only those records that match in both data frames. You have to pass inner as the how argument of the merge() function.
Right Join: The right join produces a complete set of records from data frame B (the right-side data frame), with the matching records (where available) from data frame A (the left-side data frame). If there is no match, the columns from the left side will contain null. You have to pass right as the how argument of the merge() function.
Left Join: The left join produces a complete set of records from data frame A (the left-side data frame), with the matching records (where available) from data frame B (the right-side data frame). If there is no match, the columns from the right side will contain null. You have to pass left as the how argument of the merge() function.
Joining on Index: Sometimes you have to perform the join on the indexes, or row labels. For that, you have to set right_index (for the index of the right data frame) and left_index (for the index of the left data frame) to True.
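The join types described above can be compared side by side on two small data frames. This is only a sketch; the id, name, and marks columns are assumptions made for the example.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
right = pd.DataFrame({"id": [2, 3, 4], "marks": [75, 82, 90]})

# Inner: only ids present in both frames (2 and 3)
inner = pd.merge(left, right, on="id", how="inner")

# Full outer: every id from either frame, NaN where a side has no match
outer = pd.merge(left, right, on="id", how="outer")

# Left: every id from left; marks is NaN for id 1
left_join = pd.merge(left, right, on="id", how="left")

# Right: every id from right; name is NaN for id 4
right_join = pd.merge(left, right, on="id", how="right")

# Joining on the index instead of a column
by_index = pd.merge(left.set_index("id"), right.set_index("id"),
                    left_index=True, right_index=True)

print(outer)
print(by_index)
```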
CSV File
A CSV (comma-separated values) file allows data to be saved in a tabular format. It is a plain-text file that holds the kind of data found in a spreadsheet or database table. Files in the CSV format can be imported into and exported from programs that store data in tables, such as Microsoft Excel or OpenOffice. The data fields in each row are most often separated, or delimited, by a comma, and individual rows are separated by a newline.
The pd.read_csv() function is used to read a CSV file.
To export a data frame into a CSV file, we first create a data frame, say df1, and then call df1.to_csv(r'E:\Dataframe1.csv') to export df1 into the CSV file Dataframe1.csv.
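A round trip through a CSV file might look like the sketch below. To keep it runnable on any machine it writes to a temporary directory rather than to the E: drive, and the column names are hypothetical.

```python
import os
import tempfile

import pandas as pd

df1 = pd.DataFrame({"name": ["Asha", "Ravi"], "marks": [75, 82]})

# Write to a temporary location; index=False leaves out the row labels
csv_path = os.path.join(tempfile.mkdtemp(), "Dataframe1.csv")
df1.to_csv(csv_path, index=False)

# Read the file back into a new data frame
df2 = pd.read_csv(csv_path)

print(df2)
```

Without index=False, the row labels are written as an extra unnamed column, which then reappears as a column when the file is read back.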
Panel
Panel is a three-dimensional data structure with heterogeneous data. The names of the 3 axes are intended to give some semantic meaning to operations involving panel data and, in particular, econometric analysis of panel data. Panel was deprecated in pandas 0.20 and removed in version 0.25, so the description here applies only to older releases.
In those releases, Panel.shape can be used to get a tuple of axis dimensions.
Key Points
Heterogeneous data
Size Mutable
Data Mutable
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
Pandas is used in conjunction with other libraries for data science. It is built on top of the NumPy library, which means that many NumPy structures are used or replicated in Pandas. The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.