Data Analysis
Random Pandas Notes, Note 2
Reference: https://www.mikulskibartosz.name/how-to-split-a-list-inside-a-dataframe-cell-into-rows-in-pandas/
Given a table containing a column of lists, like:
| id | name | tags | 
|---|---|---|
| 1 | winter squash | [‘60-minutes’, ‘time-to-make’, ‘healthy’] | 
| 2 | braised pork | [‘2-hours-more’, ‘time-to-make’, ‘chinese’] | 
| 3 | chilli Beef | [‘4-hours-more’, ‘chinese’] | 
Like the ‘unwind’ function in mongodb, to turn it into:
| id | name | tags | 
|---|---|---|
| 1 | winter squash | ‘60-minutes’ | 
| 1 | winter squash | ‘time-to-make’ | 
| 1 | winter squash | ‘healthy’ | 
| 2 | braised pork | ‘2-hours-more’ | 
| 2 | braised pork | ‘time-to-make’ | 
| 2 | braised pork | ‘chinese’ | 
| 3 | chilli Beef | ‘4-hours-more’ | 
| 3 | chilli Beef | ‘chinese’ | 
Here’s who we do it:
1  | tags_df['tags'] = tags_df.tags.apply(lambda x: x[1:-1].split(','))  | 
Breaking it down, we have the following to turn the tags from a string into a list of strings.
1  | tags_df['tags'] = tags_df.tags.apply(lambda x: x[1:-1].split(','))  | 
Next step, turning a list of tags into multiple columns:
1  | tags_df.tags.apply(pd.Series)  | 
It turns the chart like:
| 1 | 2 | 3 | 
|---|---|---|
| ‘60-minutes’ | ‘time-to-make’ | ‘healthy’ | 
| ‘2-hours-more’ | ‘time-to-make’ | ‘chinese’ | 
| ‘4-hours-more’ | ‘chinese’ | NaN | 
Then, we join tags with the rest of the list:
1  | prev_df.merge(tags_df, right_index = True, left_index = True)  | 
Then, we drop the duplicated “tags” column and unwind different columns of the tags into different rows:
1  | prev_df.drop(["tags"], axis = 1).melt(id_vars = ['recipe_id'], value_name = "tags")  | 
Lastly, we remove the “variable” column, which we might not need:
1  | prev_df.drop("variable", axis = 1).dropna()  | 
Random Pandas Notes, Note 1
To change a row value based on more than one conditions, use the following:
1  | df['A'] = np.where(((df['B'] == 'some_value') & (df['C'] == 'some_other_value')), true_value, false_value)  | 
Another equally efficient approach is using loc:
Note that this method does not give a default value when the condition is not met. So, the same code might be required to run twice or using .fillna() method.
1  | df.loc[(df.B == 'some_value') | (df.C == 'some_other_value'), 'A'] = true_value  | 
ps: for ‘and’ or ‘or’ operation, use symbol ‘&’ or ‘|’.
