Ninja with Laptop: Python Pandas

How to use pandas? (https://pandas.pydata.org/)

Simple, download and use Anaconda, everything comes bundled.

if you wish to install it on Server then do all the pip install, etc.

To use pandas in any python script using following:

import numpy as np

import pandas as pd

You are all set to use Pandas.

What is Pandas? why use Pandas?

Pandas is useful for data manipulation, just like you do in Excel.

you may retrieve & store data from Excel, CSV, JSON files or even Database tables using SQL queries.

It seems some people use it as ETL (extract transform and load)

What Pandas can do?

Pandas is good at data (array) manipulation (called as DataFrames), it gives great functionality like SQL to carry out similar operations on 2D arrays.

Functionality to expect out of the box?

How to load the data into 2D array?

Load the data using following command:

mydata = pd.read_csv('Documents\mydata.csv',index_col=1)

Output:

fname	lname	grade	department	birthdate	joindate
empid
emp01	Asish	Kulkarni	4	IT	1977-01-02	2014-02-04
emp02	Bharat	Joshi	5	Finance	1967-05-05	2014-02-04
emp03	Ashish	Goenka	3	Sales	1999-07-08	2015-02-07
emp04	Himesh	Shah	2	Pre-Sales	1977-01-02	2019-02-04
emp05	Asish	Deshpande	4	Accounts	1988-01-02	2014-02-04
emp06	Amit	Kulkarni	4	IT	1977-01-02	2014-02-04
emp07	Asish	Lele	1	IT	2000-01-02	2014-02-04
emp08	Manish	Bhate	1	IT	2000-01-09	2014-02-04
emp08	Nilima	Bhat	1	IT	2000-01-10	2014-02-04
emp09	Aparna	Jamatani	1	IT	2000-07-02	2014-02-04

One more example:

twoD = pd.DataFrame({'lname' : 'Downey','fname': pd.Categorical(["Robert","Pepper","Maguna","Bob"]),'age': pd.Categorical(["43","40","12","4"]), 'bdate': pd.Categorical([pd.Timestamp('19700101'),pd.Timestamp('19760101'),pd.Timestamp('20080702'),pd.Timestamp('20080702')])})

Output:

lname	fname	age	bdate
0	Downey	Robert	43	1970-01-01
1	Downey	Pepper	40	1976-01-01
2	Downey	Maguna	12	2008-07-02
3	Downey	Bob	4	2008-07-02

How to save data to desired destination?

df.to_csv('output.csv', mode='w')

mydata2.to_json(r'newDataFrame.json',orient='table')

How to display data with desired columns or rows?

Similar to Unix head and tail command:

df.head(5)

df.tail(5)

Using key as 'emp09' from index and selecting columns

df.loc['emp09',['fname','lname']]

Using numeric value for selecting range 3 to 5 and column 0 & 1.

df.iloc[3:5,[0,1]]

How to sort and list the data?

mydata2.sort_values(by="grade")

Filter data using like

officer = mydata2[mydata2["grade"] > 1]

senior = mydata2[mydata2["birthdate"] < "19990101"]

empoloyee = mydata2[mydata2["fname"].str.contains("sis")]

How to add columns & rows?

Add a column name City with default value as "Mumbai"

mydata2.insert(5, "City","Mumbai", True)

Add a row with data with values

newrec2 = pd.Series({'fname': 'Manoj', 'lname': 'Pethe'},name='emp10')

mydata2.append(newrec2)

How to produce statistics? average / max / min / aggregation /cumsum

run the method on the column, returns a value

mydata2['grade'].sum()

mydata2['grade'].max()

mydata2['grade'].min()

Group by

mydata2.groupby(['department']).count()

mydata2.groupby(['department','grade']).agg('sum')

Joins

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
...                       'B': ['B0', 'B1', 'B2']})

df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN

Merge

Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7

Delete column

del df['column_name']

Date & Time functions

Since this is getting too lengthy, I would create another article.

How to schedule a batch to reduce I/O?

Simple answer is reduce the size of dataset given to Pandas before you begin the transformation.

How to generate graph?

Since this is getting too lengthy, I would create another article.

Ninja with Laptop

Saturday, July 4, 2020

Python Pandas - Practical Guide

No comments:

Post a Comment

Counter

About Me