My Bible of Data Wrangling
Notice: this content will be updated constantly since I work on data wrangling a lot. So everytime I find a common problem in cleansing data, I will post it here.
As a data scientist, you may find yourself in the same situation every time you are tasked to clean a dataset using pandas. It’s quite annoying to me that I have to Google search (more specifically using Stackoverflow) to solve the same problem that I have faced before such as trouble with datetime format or using pyplot. Therefore, I have combined all of the Pandas tricks into a cheat sheet for my own purpose of reusing them in the future
Credit: Suzi Eszterhas/Minden Pictures
Here’s the list of all the tricks:
Get a quick summary of a DataFrame
df.dtypes
df.shape
df.info
Columns of a DataFrame
# list all column names
df.columns
# change column names
df.columns = ["col3", "col4", "col5"]
# unique values in this column
df.column1.unique()
# count each unique value in a column
df.column1.values_count()
# rearrange column order
df = df[[col1, col2, col3, col4]]
# split a column
df['col2'] = [d.time() for d in df['col1']]
Multiple filters
df[(df.col > value1) & (df.col < value2)] # and
df[(df.col > value1) | (df.col < value2)] # or
Dealing with Datetime format:
# define dates as strings
date_str1 = 'Wednesday, June 6, 2018'
date_str2 = '6/6/18'
date_str3 = '06-06-2018'
# define dates as datetime objects
date_dt1 = datetime.strptime(date_str1, '%A, %B %d, %Y')
date_dt2 = datetime.strptime(date_str2, '%m/%d/%y')
date_dt3 = datetime.strptime(date_str3, '%m-%d-%Y')
time_dt3 = datetime.strptime(date_str3, '%m-%d-%Y').time() # print time only
# convert to date format
df.col = pd.to_datetime(df.col) # or
df.col = pd.to_datetime(df.col, format = '%m/%d/%Y')
# create date ranges with start date and end date
pd.date_range(start_date, periods=10, freq='D')
pd.date_range(start_date, end_date, freq='W')
# import time
time.strftime("%H:%M:%S")
time.strftime("%d/%m/%Y")
# get current date and time
now = datetime.datetime.now()
now.hour
now.mintue
now.year
now.day
now.month
# compare datetime
pd.to_datetime("06:00:00").time() <= df.datetimecolumn[index]
What is the difference between strptime() and strftime()?
stringftime() is used to convert a datetime object to string object
stringptime(), the other way around, converts a date (and time) which is represented in a string format to a datetime object.
This works similarly in R programming as well.
Iterate over a Dataframe
for item, row in df.iterrows():
print row()
Sort values and reset index
# sort value by column
df = df.sort_values(['column1', "column2", "column3"])
# groupby function
df.groupby(by=['column1','column2'])['column3'].mean().reset_index(name='mean_values_grouped')
# reset index
df.reset_index(drop = True, inplace = True)
Duplicates
# print duplicate rows
df[df.duplicated() == True]
# drop duplicate rows, keep the first record of duplicate
df.drop_duplicates(keep=’first’, inplace=True)
Filter column elements by multiple elements contained on a list
df[df['A'].isin([3, 6])]
Series of values
# change value type of series
series.astype(float)
# apply a function to each item of a series
series.apply(lambda x: 0 if x=='a' else 1)
Sum values of all columns
df.sum(axis=1)
Use apply for multiple columns
def my_function(a, b):
return a + b
df.apply(lambda row: my_function(row['a'], row['b']), axis=1)
Combine different dataframe
pd.concat([df1, df2, df3])
pd.merge(df1, df2, on=”itemId”, how=”inner”)
Output dataframe
output.to_csv("./output_date.csv", index=False)
output.to_excel("./output_date.csv", index=False)
Handling large datasets
df_iter = pd.read_csv(file, chunksize=10)
df_iter.get_chunk()
For example:
i = 0
for a in df_iter:
# do some processing chunk = df_iter.get_chunk()
i += 1
new_chunk = chunk.apply(lambda x: do_something(x), axis=1) new_chunk.to_csv("chunk_output_%i.csv" % i )