Category: Data Analysis

Random Pandas Notes, Note 2

2019-11-21

Reference: https://www.mikulskibartosz.name/how-to-split-a-list-inside-a-dataframe-cell-into-rows-in-pandas/

Given a table containing a column of lists, like:

id	name	tags
1	winter squash	[‘60-minutes’, ‘time-to-make’, ‘healthy’]
2	braised pork	[‘2-hours-more’, ‘time-to-make’, ‘chinese’]
3	chilli Beef	[‘4-hours-more’, ‘chinese’]

Like the ‘unwind’ function in mongodb, to turn it into:

id	name	tags
1	winter squash	‘60-minutes’
1	winter squash	‘time-to-make’
1	winter squash	‘healthy’
2	braised pork	‘2-hours-more’
2	braised pork	‘time-to-make’
2	braised pork	‘chinese’
3	chilli Beef	‘4-hours-more’
3	chilli Beef	‘chinese’

Here’s who we do it:

tags_df['tags'] = tags_df.tags.apply(lambda x: x[1:-1].split(','))
clean_tag_df = tags_df.tags.apply(pd.Series).merge(tags_df, right_index = True, left_index = True) \
    .drop(["tags"], axis = 1) \
    .melt(id_vars = ['recipe_id'], value_name = "tags") \
    .drop("variable", axis = 1).dropna()

Breaking it down, we have the following to turn the tags from a string into a list of strings.

1	tags_df['tags'] = tags_df.tags.apply(lambda x: x[1:-1].split(','))

Next step, turning a list of tags into multiple columns:

1	tags_df.tags.apply(pd.Series)

It turns the chart like:

1	2	3
‘60-minutes’	‘time-to-make’	‘healthy’
‘2-hours-more’	‘time-to-make’	‘chinese’
‘4-hours-more’	‘chinese’	NaN

Then, we join tags with the rest of the list:

1	prev_df.merge(tags_df, right_index = True, left_index = True)

Then, we drop the duplicated “tags” column and unwind different columns of the tags into different rows:

1	prev_df.drop(["tags"], axis = 1).melt(id_vars = ['recipe_id'], value_name = "tags")

Lastly, we remove the “variable” column, which we might not need:

1	prev_df.drop("variable", axis = 1).dropna()

Confusion Matrix Summary

2019-10-21

A simple cheat-sheet for confusion matrix.

Random Pandas Notes, Note 1

2019-10-19

To change a row value based on more than one conditions, use the following:

1	df['A'] = np.where(((df['B'] == 'some_value') & (df['C'] == 'some_other_value')), true_value, false_value)

Another equally efficient approach is using loc:
Note that this method does not give a default value when the condition is not met. So, the same code might be required to run twice or using .fillna() method.

1	df.loc[(df.B == 'some_value') \| (df.C == 'some_other_value'), 'A'] = true_value

ps: for ‘and’ or ‘or’ operation, use symbol ‘&’ or ‘|’.

ZEXI JIN

Data Analysis

Random Pandas Notes, Note 2

Confusion Matrix Summary

Random Pandas Notes, Note 1