Saturday, July 4, 2020

Python Pandas - Practical Guide

How to use pandas? (https://pandas.pydata.org/)

Simple, download and use Anaconda, everything comes bundled.
if you wish to install it on Server then do all the pip install, etc.

To use pandas in any python script using following:

import numpy as np
import pandas as pd

You are all set to use Pandas.


What is Pandas? why use Pandas?

Pandas is useful for data manipulation, just like you do in Excel.
you may retrieve & store data from Excel, CSV, JSON files or even Database tables using SQL queries.

It seems some people use it as ETL (extract transform and load)

What Pandas can do?

Pandas is good at data (array) manipulation (called as DataFrames), it gives great functionality like SQL to carry out similar operations on 2D arrays.


Functionality to expect out of the box?

  • How to load the data into 2D array?
Load the data using following command:

mydata = pd.read_csv('Documents\mydata.csv',index_col=1)

Output:
fnamelnamegradedepartmentbirthdatejoindate
empid
emp01AsishKulkarni4IT1977-01-022014-02-04
emp02BharatJoshi5Finance1967-05-052014-02-04
emp03AshishGoenka3Sales1999-07-082015-02-07
emp04HimeshShah2Pre-Sales1977-01-022019-02-04
emp05AsishDeshpande4Accounts1988-01-022014-02-04
emp06AmitKulkarni4IT1977-01-022014-02-04
emp07AsishLele1IT2000-01-022014-02-04
emp08ManishBhate1IT2000-01-092014-02-04
emp08NilimaBhat1IT2000-01-102014-02-04
emp09AparnaJamatani1IT2000-07-022014-02-04



One more example:
twoD = pd.DataFrame({'lname' : 'Downey','fname': pd.Categorical(["Robert","Pepper","Maguna","Bob"]),'age': pd.Categorical(["43","40","12","4"]), 'bdate': pd.Categorical([pd.Timestamp('19700101'),pd.Timestamp('19760101'),pd.Timestamp('20080702'),pd.Timestamp('20080702')])})

Output:
lnamefnameagebdate
0DowneyRobert431970-01-01
1DowneyPepper401976-01-01
2DowneyMaguna122008-07-02
3DowneyBob42008-07-02
  • How to save data to desired destination?
df.to_csv('output.csv', mode='w')
mydata2.to_json(r'newDataFrame.json',orient='table')
  • How to display data with desired columns or rows?
Similar to Unix head and tail command:
df.head(5)
df.tail(5)

Using key as 'emp09' from index and selecting columns
df.loc['emp09',['fname','lname']]

Using numeric value for selecting range 3 to 5  and column 0 & 1.
df.iloc[3:5,[0,1]]

  • How to sort and list the data?
mydata2.sort_values(by="grade")
  • Filter data using like
officer = mydata2[mydata2["grade"] > 1]

senior = mydata2[mydata2["birthdate"] < "19990101"]

empoloyee = mydata2[mydata2["fname"].str.contains("sis")]
    • How to add columns & rows?
    Add a column name City with default value as "Mumbai"

    mydata2.insert(5, "City","Mumbai", True) 

    Add a row with data with values

    newrec2 = pd.Series({'fname': 'Manoj', 'lname': 'Pethe'},name='emp10')

    mydata2.append(newrec2)

    • How to produce statistics? average / max / min / aggregation /cumsum
    run the method on the column, returns a value

    mydata2['grade'].sum()

    mydata2['grade'].max()

    mydata2['grade'].min()

    • Group by
    mydata2.groupby(['department']).count()

    mydata2.groupby(['department','grade']).agg('sum')
    • Joins
    df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
    ...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

    other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
    ...                       'B': ['B0', 'B1', 'B2']})

    df.set_index('key').join(other.set_index('key'))
          A    B
    key
    K0   A0   B0
    K1   A1   B1
    K2   A2   B2
    K3   A3  NaN
    K4   A4  NaN
    K5   A5  NaN


    df.join(other.set_index('key'), on='key')
      key   A    B
    0  K0  A0   B0
    1  K1  A1   B1
    2  K2  A2   B2
    3  K3  A3  NaN
    4  K4  A4  NaN
    5  K5  A5  NaN


    • Merge

    Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.
    >>> df1.merge(df2, left_on='lkey', right_on='rkey',
    ...           suffixes=('_left', '_right'))
      lkey  value_left rkey  value_right
    0  foo           1  foo            5
    1  foo           1  foo            8
    2  foo           5  foo            5
    3  foo           5  foo            8
    4  bar           2  bar            6
    5  baz           3  baz            7

    • Delete column
    del df['column_name']
    • Date & Time functions
    Since this is getting too lengthy, I would create another article.



    How to schedule a batch to reduce I/O? 


    Simple answer is reduce the size of dataset given to Pandas before you begin the transformation.

    How to generate graph?

    Since this is getting too lengthy, I would create another article.

    No comments:

    Post a Comment