Python Pandas


In data science and machine learning, one of the most important tasks is organizing the data, also called cleaning the data, so that learning algorithms can be applied to the dataset. In a typical project, a large share of the work (often quoted as about 80%) goes into preparing the data so that algorithms can be run on it. Data can come from various sources: files, databases, or even logs.

Pandas is a data manipulation library built on top of NumPy that gives flexibility in handling complex datasets. Pandas provides two main data structures, DataFrame and Series. Unlike a plain NumPy array, both attach labels to their axes.

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

The basic method to create a Series is to call:

import pandas as pd

s = pd.Series(data, index=index)
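As a minimal sketch of the call above, the values in data and the labels in index below are made up purely for illustration; any list of values with a matching list of labels would work the same way.

```python
import pandas as pd

# Illustrative values and labels (assumptions, not from the original text).
data = [10, 20, 30]
index = ["a", "b", "c"]

s = pd.Series(data, index=index)

print(s["b"])         # look up a value by its label
print(list(s.index))  # the axis labels, collectively the index
```

Looking up s["b"] uses the label rather than the position, which is the main practical difference from indexing a plain array.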

A DataFrame is a 2-dimensional labeled data structure: essentially a two-dimensional array with attached row and column labels, whose columns can hold different types and may contain missing data. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used pandas object. Like Series, a DataFrame can be built from many kinds of input, for example a dictionary d:

import pandas as pd

df = pd.DataFrame(d)
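The following sketch fills in one possible d for the call above. The column names and values are hypothetical and chosen only to show columns of different types and a missing value.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column label.
d = {
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 45, None],        # None is stored as a missing value
    "score": [88.5, 92.0, 79.0],
}

df = pd.DataFrame(d)

print(df)               # rows get a default integer index
print(df.dtypes)        # each column has its own type
print(df.isna().sum())  # count of missing values per column
```

Selecting a single column, such as df["score"], returns a Series, which is why a DataFrame can be thought of as a dictionary of Series objects.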

Detailed documentation can be found in the official pandas documentation.

Download the source CSV file from Kaggle and put it in a folder named Dataset.
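Once the file is in place, it can be loaded into a DataFrame. The sketch below is one way to do that. The helper name load_dataset and the file name data.csv are placeholders, since the text does not name the file; replace them with the name of the file you actually downloaded.

```python
from pathlib import Path

import pandas as pd


def load_dataset(filename, folder="Dataset"):
    """Read a CSV file from the given folder into a DataFrame.

    load_dataset is a hypothetical helper written for this example.
    """
    path = Path(folder) / filename
    if not path.exists():
        raise FileNotFoundError(f"Expected the downloaded file at {path}")
    return pd.read_csv(path)


if __name__ == "__main__":
    # "data.csv" is a placeholder file name, not the real dataset name.
    try:
        df = load_dataset("data.csv")
        print(df.head())  # first five rows, a quick sanity check
    except FileNotFoundError as err:
        print(err)
```

Checking that the file exists first gives a clearer message than the default error when the folder or file name is misspelled.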
