My Bible of Data Wrangling

Notice: this content will be updated constantly since I work on data wrangling a lot. So everytime I find a common problem in cleansing data, I will post it here.

As a data scientist, you may find yourself in the same situation every time you are tasked to clean a dataset using pandas. It’s quite annoying to me that I have to Google search (more specifically using Stackoverflow) to solve the same problem that I have faced before such as trouble with datetime format or using pyplot. Therefore, I have combined all of the Pandas tricks into a cheat sheet for my own purpose of reusing them in the future

Credit: Suzi Eszterhas/Minden Pictures

Here’s the list of all the tricks:

Get a quick summary of a DataFrame

df.dtypes

df.shape

df.info

Columns of a DataFrame

# list all column names

df.columns

# change column names

df.columns = ["col3", "col4", "col5"]

# unique values in this column

df.column1.unique()

# count each unique value in a column

df.column1.values_count()

# rearrange column order

df = df[[col1, col2, col3, col4]]

# split a column

df['col2'] = [d.time() for d in df['col1']]

Multiple filters

df[(df.col > value1) & (df.col < value2)] # and

df[(df.col > value1) | (df.col < value2)] # or

Dealing with Datetime format:

# define dates as strings

date_str1 = 'Wednesday, June 6, 2018'

date_str2 = '6/6/18'

date_str3 = '06-06-2018'

# define dates as datetime objects

date_dt1 = datetime.strptime(date_str1, '%A, %B %d, %Y')

date_dt2 = datetime.strptime(date_str2, '%m/%d/%y')

date_dt3 = datetime.strptime(date_str3, '%m-%d-%Y')

time_dt3 = datetime.strptime(date_str3, '%m-%d-%Y').time() # print time only

# convert to date format

df.col = pd.to_datetime(df.col) # or

df.col = pd.to_datetime(df.col, format = '%m/%d/%Y')

# create date ranges with start date and end date

pd.date_range(start_date, periods=10, freq='D')

pd.date_range(start_date, end_date, freq='W')

# import time

time.strftime("%H:%M:%S")

time.strftime("%d/%m/%Y")

# get current date and time

now = datetime.datetime.now()

now.hour

now.mintue

now.year

now.day

now.month

# compare datetime

pd.to_datetime("06:00:00").time() <= df.datetimecolumn[index]

What is the difference between strptime() and strftime()?

stringftime() is used to convert a datetime object to string object

stringptime(), the other way around, converts a date (and time) which is represented in a string format to a datetime object.

This works similarly in R programming as well.

Iterate over a Dataframe

for item, row in df.iterrows():

print row()

Sort values and reset index

# sort value by column

df = df.sort_values(['column1', "column2", "column3"])

# groupby function

df.groupby(by=['column1','column2'])['column3'].mean().reset_index(name='mean_values_grouped')

# reset index

df.reset_index(drop = True, inplace = True)

Duplicates

# print duplicate rows

df[df.duplicated() == True]

# drop duplicate rows, keep the first record of duplicate

df.drop_duplicates(keep=’first’, inplace=True)

Filter column elements by multiple elements contained on a list

df[df['A'].isin([3, 6])]

Series of values

# change value type of series

series.astype(float)

# apply a function to each item of a series

series.apply(lambda x: 0 if x=='a' else 1)

Sum values of all columns

df.sum(axis=1)

Use apply for multiple columns

def my_function(a, b):

return a + b

df.apply(lambda row: my_function(row['a'], row['b']), axis=1)

Combine different dataframe

pd.concat([df1, df2, df3])

pd.merge(df1, df2, on=”itemId”, how=”inner”)

Output dataframe

output.to_csv("./output_date.csv", index=False)

output.to_excel("./output_date.csv", index=False)

Handling large datasets

df_iter = pd.read_csv(file, chunksize=10)

df_iter.get_chunk()

For example:

i = 0

for a in df_iter:

# do some processing chunk = df_iter.get_chunk()

i += 1

new_chunk = chunk.apply(lambda x: do_something(x), axis=1) new_chunk.to_csv("chunk_output_%i.csv" % i )

Dive in Data Science

if data == ocean: data_scientist = diver

My Bible of Data Wrangling