
Random Pandas Notes, Note 2

Reference: https://www.mikulskibartosz.name/how-to-split-a-list-inside-a-dataframe-cell-into-rows-in-pandas/

Given a table containing a column of lists, like:

id  name           tags
1   winter squash  ['60-minutes', 'time-to-make', 'healthy']
2   braised pork   ['2-hours-more', 'time-to-make', 'chinese']
3   chilli Beef    ['4-hours-more', 'chinese']

We want to turn it into the following, similar to what the $unwind stage does in MongoDB:

id  name           tags
1   winter squash  '60-minutes'
1   winter squash  'time-to-make'
1   winter squash  'healthy'
2   braised pork   '2-hours-more'
2   braised pork   'time-to-make'
2   braised pork   'chinese'
3   chilli Beef    '4-hours-more'
3   chilli Beef    'chinese'

Here's how we do it:

import pandas as pd

# assumes tags_df has the columns recipe_id and tags
tags_df['tags'] = tags_df.tags.apply(lambda x: x[1:-1].split(','))
clean_tag_df = tags_df.tags.apply(pd.Series).merge(tags_df, right_index = True, left_index = True) \
    .drop(["tags"], axis = 1) \
    .melt(id_vars = ['recipe_id'], value_name = "tags") \
    .drop("variable", axis = 1).dropna()

Breaking it down, the first line turns each tags value from a string into a list of strings by stripping the surrounding brackets and splitting on commas:

tags_df['tags'] = tags_df.tags.apply(lambda x: x[1:-1].split(','))

Next step, turning a list of tags into multiple columns:

tags_df.tags.apply(pd.Series)

It turns the table into one column per tag position, padding shorter lists with NaN:

0               1               2
'60-minutes'    'time-to-make'  'healthy'
'2-hours-more'  'time-to-make'  'chinese'
'4-hours-more'  'chinese'       NaN

Then, we join the tag columns with the rest of the table, matching rows by index:

prev_df.merge(tags_df, right_index = True, left_index = True)

Then, we drop the duplicated “tags” column and unwind different columns of the tags into different rows:

prev_df.drop(["tags"], axis = 1).melt(id_vars = ['recipe_id'], value_name = "tags")

Lastly, we remove the "variable" column, which we might not need, and drop the rows whose tag is NaN:

prev_df.drop("variable", axis = 1).dropna()
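
To see the steps above run end to end, here is a minimal sketch on made-up sample data. The three rows and the extra quote stripping are assumptions added for this example, not part of the original snippet. It also compares the result with DataFrame.explode, which is available in pandas 0.25 and later and may be a shorter route to the same table.

```python
import pandas as pd

# Hypothetical sample data: tags are stored as strings, as in the original snippet.
tags_df = pd.DataFrame({
    "recipe_id": [1, 2, 3],
    "tags": [
        "['60-minutes', 'time-to-make', 'healthy']",
        "['2-hours-more', 'time-to-make', 'chinese']",
        "['4-hours-more', 'chinese']",
    ],
})

# Step 1: string -> list of strings. Unlike the original one-liner, this
# also strips the spaces and quotes around each tag (an added assumption).
tags_df["tags"] = tags_df["tags"].apply(
    lambda s: [t.strip().strip("'") for t in s[1:-1].split(",")]
)

# Steps 2 to 5: one column per tag, join back by index, melt into rows, clean up.
clean_tag_df = (
    tags_df["tags"].apply(pd.Series)
    .merge(tags_df, left_index=True, right_index=True)
    .drop(["tags"], axis=1)
    .melt(id_vars=["recipe_id"], value_name="tags")
    .drop("variable", axis=1)
    .dropna()
    .sort_values("recipe_id")
    .reset_index(drop=True)
)

# Possible alternative on pandas >= 0.25: explode the list column directly.
exploded = tags_df.explode("tags").reset_index(drop=True)

print(clean_tag_df)
```

Both clean_tag_df and exploded should contain the same (recipe_id, tag) pairs, one per row.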

Random Pandas Notes, Note 1

To set a column value based on more than one condition, use the following:

df['A'] = np.where(((df['B'] == 'some_value') & (df['C'] == 'some_other_value')), true_value, false_value)

Another equally efficient approach is using loc.
Note that this method does not assign a default value to the rows where the condition is not met. So you may need a second assignment for the opposite condition, or a call to .fillna() afterwards.

df.loc[(df.B == 'some_value') | (df.C == 'some_other_value'), 'A'] = true_value

PS: to combine conditions, use the symbols '&' (and) and '|' (or) instead of the Python keywords, and wrap each condition in parentheses.
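
A small sketch of both approaches on made-up data may help. The DataFrame contents, the column name A2, and the values "match" and "no_match" are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering all four combinations of the two conditions.
df = pd.DataFrame({"B": ["x", "x", "y", "y"], "C": ["p", "q", "p", "q"]})

# np.where: both branches are given, so every row gets a value.
df["A"] = np.where((df["B"] == "x") & (df["C"] == "p"), "match", "no_match")

# loc: only the rows that satisfy the condition are assigned;
# the remaining rows stay NaN until they are filled explicitly.
df.loc[(df["B"] == "x") | (df["C"] == "p"), "A2"] = "match"
df["A2"] = df["A2"].fillna("no_match")

print(df)
```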

A Brief Summary of Sorting Algorithm, Part 1

  1. Basics:
    • in-place sorting vs. out-of-place sorting
    • internal sorting vs. external sorting
    • stable vs. unstable sorting
  2. Bubble Sort
  3. Selection Sort
  4. Insertion Sort
  5. Merge Sort
  6. Quick Sort
  7. Heap Sort

Internal vs. external sorting

Internal sorting and external sorting describe where the sorting occurs:

  • internal sorting takes place entirely in memory
  • external sorting also uses the hard disk or other external storage, typically because the data does not fit in memory
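
As a rough illustration of external sorting, the sketch below sorts the input in small chunks, writes each sorted chunk to a temporary file, and then merges the files with heapq.merge. The names external_sort and chunk_size are made up for this example, and a real external sort would also have to manage buffering and the number of open files.

```python
import heapq
import os
import tempfile
from itertools import islice


def _read_ints(path):
    """Yield the integers stored one per line in the file at path."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(numbers, chunk_size=3):
    """Sort an iterable of ints by sorting chunks in memory and merging them from disk."""
    paths = []
    it = iter(numbers)
    try:
        while True:
            # Internal sort of one chunk that fits in memory.
            chunk = sorted(islice(it, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".txt")
            with os.fdopen(fd, "w") as f:
                f.writelines(f"{x}\n" for x in chunk)
            paths.append(path)
        # k-way merge of the sorted runs, read lazily from disk.
        # The result is collected into a list only for convenience here.
        return list(heapq.merge(*(_read_ints(p) for p in paths)))
    finally:
        for p in paths:
            os.remove(p)


print(external_sort([5, 2, 9, 1, 7, 3, 8]))
```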

Stable vs. unstable sorting

A sorting algorithm is said to be stable if two objects with equal keys appear in the same order in sorted output as they appear in the input array to be sorted.

  • stable sorting algorithms include:
    1. Bubble Sort
    2. Insertion Sort
    3. Merge Sort
    4. Counting Sort
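
Stability is easiest to see with records that share a key. The sketch below uses Python's built-in sorted, which is guaranteed to be stable; the records themselves are made up for this example.

```python
# Hypothetical (name, key) records; two pairs of records share the same key.
records = [("pork", 2), ("squash", 1), ("beef", 2), ("tofu", 1)]

# Sort by the numeric key only.
by_key = sorted(records, key=lambda r: r[1])

# Records with equal keys keep their input order:
# "squash" stays before "tofu", and "pork" stays before "beef".
print(by_key)
```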

Bubble Sort

Bubble Sort is a stable sorting algorithm. The algorithm repeatedly compares two elements that are next to each other and swaps them if the left one is larger than the right one. Time complexity of bubble sort is O(n*n).

def bubbleSort(arr):
    n = len(arr)
    for i in range(n):
        # after pass i, the last i+1 elements are in their final positions
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr
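
A common refinement, sketched here as an optional variant rather than as part of the original code, is to stop as soon as a full pass makes no swap. The worst case stays O(n*n), but an already sorted input then needs only one pass. The function name is made up for this example.

```python
def bubble_sort_early_exit(arr):
    """Bubble sort that stops once a pass makes no swap."""
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            # No swap in this pass means the list is already sorted.
            break
    return arr


print(bubble_sort_early_exit([3, 1, 2]))
```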

Selection Sort

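In every iteration of selection sort, the minimum element (considering ascending order) from the unsorted subarray is picked and moved to the end of the sorted subarray. The usual swap-based implementation, including the one below, is not stable, but selection sort can be made stable by inserting the minimum instead of swapping. Time complexity of selection sort is O(n*n).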

def selectionSort(arr):
    for i in range(len(arr)):
        # find the index of the smallest element in arr[i:]
        min_idx = i
        for j in range(i+1, len(arr)):
            if arr[min_idx] > arr[j]:
                min_idx = j
        # swap it into position i
        arr[i], arr[min_idx] = arr[min_idx], arr[i]
    return arr
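
One way to make selection sort stable, sketched below, is to pull the minimum out and insert it at the front of the unsorted part instead of swapping, so the elements in between keep their relative order. The function name and the key parameter are additions for this example.

```python
def stable_selection_sort(arr, key=lambda x: x):
    """Selection sort that inserts the minimum instead of swapping it."""
    for i in range(len(arr)):
        min_idx = i
        for j in range(i + 1, len(arr)):
            # Strict comparison keeps the first of several equal minima.
            if key(arr[j]) < key(arr[min_idx]):
                min_idx = j
        # Remove the minimum and insert it at position i; the elements
        # between i and min_idx shift right by one and keep their order.
        arr.insert(i, arr.pop(min_idx))
    return arr


# "a" and "b" share the key 2; a swap-based version would reverse them.
print(stable_selection_sort([("a", 2), ("b", 2), ("c", 1)], key=lambda r: r[1]))
```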

Insertion Sort

Insertion sort is stable. In every iteration of insertion sort, the first element of the unsorted part is selected and inserted into its correct location in the sorted part of the array. Time complexity of insertion sort is O(n*n).

def insertionSort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i-1
        # shift the larger elements of the sorted part one step to the right
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
    return arr

Merge Sort

Merge sort is stable. Merge sort is a divide-and-conquer algorithm. It divides the input array into two halves, calls itself on each half, and then merges the two sorted halves. Time complexity is O(n*log(n)).

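def mergeSort(arr):
    if len(arr) > 1:
        mid = len(arr)//2
        L = mergeSort(arr[:mid])  # sorted copy of the left half
        R = mergeSort(arr[mid:])  # sorted copy of the right half
        i = j = k = 0
        # merge the two sorted halves back into arr
        while i < len(L) and j < len(R):
            if L[i] <= R[j]:  # <= takes the left element on ties, which keeps the sort stable
                arr[k] = L[i]
                i += 1
            else:
                arr[k] = R[j]
                j += 1
            k += 1
        # copy whatever is left in L
        while i < len(L):
            arr[k] = L[i]
            i += 1
            k += 1
        # copy whatever is left in R
        while j < len(R):
            arr[k] = R[j]
            j += 1
            k += 1
    return arr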

Quick Sort

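For quick sort, we pick an element as the pivot (a random element in general; the code below always uses the last element). We compare each element with the pivot to partition the list so that the elements smaller than the pivot come before it and the rest come after it. After that, quick sort recursively sorts the two parts in the same divide-and-conquer fashion. Average time complexity for quick sort is O(n*log(n)); the worst case is O(n*n). Quick sort can be made stable, although the in-place version below is not.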

def partition(arr, low, high):
    i = low - 1        # index of the last element known to be smaller than the pivot
    pivot = arr[high]  # the last element is used as the pivot
    for j in range(low, high):
        if arr[j] < pivot:
            i = i + 1
            arr[i], arr[j] = arr[j], arr[i]
    # put the pivot right after the smaller elements
    arr[i+1], arr[high] = arr[high], arr[i+1]
    return i + 1

def quickSort(arr, low, high):
    if low < high:
        pi = partition(arr, low, high)
        quickSort(arr, low, pi-1)
        quickSort(arr, pi+1, high)
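
The text mentions a random pivot, while the code above always takes the last element. One way to reconcile the two, sketched here as an assumption rather than as the author's version, is to swap a randomly chosen element into the last position before partitioning. The name randomized_quick_sort and the default arguments are made up for this example.

```python
import random


def partition(arr, low, high):
    """Partition arr[low:high+1] around arr[high] and return the pivot's final index."""
    i = low - 1
    pivot = arr[high]
    for j in range(low, high):
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1


def randomized_quick_sort(arr, low=0, high=None):
    """In-place quick sort that picks the pivot uniformly at random."""
    if high is None:
        high = len(arr) - 1
    if low < high:
        # Move a random element to the end so partition uses it as the pivot.
        p = random.randint(low, high)
        arr[p], arr[high] = arr[high], arr[p]
        pi = partition(arr, low, high)
        randomized_quick_sort(arr, low, pi - 1)
        randomized_quick_sort(arr, pi + 1, high)
    return arr


print(randomized_quick_sort([9, 4, 7, 1, 4, 8]))
```

A random pivot does not remove the O(n*n) worst case, but it makes that case unlikely for any fixed input, such as an already sorted list.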

Heap Sort

  1. Build a max heap from the input data.
  2. At this point, the largest item is stored at the root of the heap. Swap it with the last item of the heap, then reduce the size of the heap by 1. Finally, heapify the root of the tree.
  3. Repeat step 2 while the size of the heap is greater than 1.
    Heapify: this procedure compares a node with its children, swaps it with the larger child if needed, and then calls itself recursively on that child, so that the largest value of the subtree ends up at its root. Time complexity for heapify is O(log(n)) and time complexity for building a heap is O(n). Since heapify runs once for each extracted element, heap sort has an overall time complexity of O(n*log(n)).
    def heapify(arr, n, i):
        largest = i    # initialize largest as root
        l = 2 * i + 1  # index of the left child
        r = 2 * i + 2  # index of the right child
        # left child is larger than the root
        if l < n and arr[i] < arr[l]:
            largest = l
        # right child is larger than the largest so far
        if r < n and arr[largest] < arr[r]:
            largest = r
        # change the root if needed
        if largest != i:
            arr[i], arr[largest] = arr[largest], arr[i]  # swap
            # heapify the affected subtree
            heapify(arr, n, largest)

    def heapSort(arr):
        n = len(arr)
        # build a max heap, starting from the last node that has a child
        for i in range(n // 2 - 1, -1, -1):
            heapify(arr, n, i)
        # move the current maximum to the end and shrink the heap
        for i in range(n-1, 0, -1):
            arr[i], arr[0] = arr[0], arr[i]  # swap
            heapify(arr, i, 0)
        return arr
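
For comparison, the same idea can be sketched with Python's standard heapq module. Note the differences from the code above: heapq implements a min-heap, and this version builds a new list instead of sorting in place. The function name is made up for this example.

```python
import heapq


def heap_sort_with_heapq(items):
    """Return a new sorted list by building a min-heap and popping repeatedly."""
    heap = list(items)   # copy, so the input is left untouched
    heapq.heapify(heap)  # build the heap in O(n)
    # Each heappop removes the current minimum in O(log(n)).
    return [heapq.heappop(heap) for _ in range(len(heap))]


print(heap_sort_with_heapq([12, 11, 13, 5, 6, 7]))
```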

Django Basics, Part 1

Django directory levels

  1. manage.py: command-line tool we use to interact with the Django project
  2. project directory:
    1. __init__.py: tells Python to treat the directory as a package
    2. settings.py: Django configuration file
    3. urls.py: URL routing configuration
    4. wsgi.py: entry point that lets a WSGI-compatible web server serve the project

Database (MySQL) config

  • dependencies used:
    mysql 5.7.24 h56848d4_0
    mysql-connector-c 6.1.11 hccea1a4_0
    mysql-connector-python 8.0.17 py27h3febbb0_0 anaconda
    mysql-python 1.2.5 py27h1de35cc_0 anaconda
    pymysql 0.9.3 py27_0
  • in __init__.py of the project directory, add the following code so that PyMySQL is used in place of MySQLdb:
    import pymysql
    pymysql.install_as_MySQLdb()
  • in settings.py: add database configurations to connect to the database
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': '<database name>',
            'USER': '<user name>',
            'PASSWORD': '<password>',
            'HOST': 'localhost',
            'PORT': '3306',
        }
    }
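
To keep credentials out of settings.py, one option is to read them from environment variables. This is only a sketch: the variable names DB_NAME, DB_USER, DB_PASSWORD, DB_HOST, and DB_PORT are made up for this example and are not something Django defines.

```python
import os

# Hypothetical environment variable names; adjust them to your deployment.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': os.environ.get('DB_NAME', ''),
        'USER': os.environ.get('DB_USER', ''),
        'PASSWORD': os.environ.get('DB_PASSWORD', ''),
        'HOST': os.environ.get('DB_HOST', 'localhost'),
        'PORT': os.environ.get('DB_PORT', '3306'),
    }
}
```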

Init new application

  • init new application:
    python manage.py startapp <myApp>
  • the command above creates, among other files:
    • admin.py: admin site configuration
    • models.py: models
    • views.py: views
  • in settings.py, add the new app to INSTALLED_APPS:
    INSTALLED_APPS = [
        'django.contrib.admin',
        'django.contrib.auth',
        'django.contrib.contenttypes',
        'django.contrib.sessions',
        'django.contrib.messages',
        'django.contrib.staticfiles',
        'myApp'
    ]

models.py file

  • the following is an example of models.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    from django.db import models

    # Create your models here.
    class Grades(models.Model):
        gname = models.CharField(max_length=20)
        gdate = models.DateTimeField()
        ggirlnum = models.IntegerField()
        gboynum = models.IntegerField()
        isDelete = models.BooleanField(default=False)

        def __str__(self):
            return "%s-%d-%d"%(self.gname, self.gboynum, self.ggirlnum)


    class Students(models.Model):
        sname = models.CharField(max_length=20)
        sgender = models.BooleanField(default=True)
        sage = models.IntegerField()
        scontend = models.CharField(max_length=20)
        isDelete = models.BooleanField(default=False)
        # on_delete is required from Django 2.0 on; CASCADE was the implicit default before
        sgrade = models.ForeignKey("Grades", on_delete=models.CASCADE)

        def __str__(self):
            return "%s-%d-%s"%(self.sname, self.sage, self.scontend)
  • create the migration file and execute it; this creates the database tables according to models.py:
    python manage.py makemigrations # create migration file
    python manage.py migrate # execute migration file

To add new entries with manage.py shell commands

  1. enter shell:
    python manage.py shell
  2. import packages in shell:
    from myApp.models import Grades, Students
    from django.utils import timezone
    from datetime import datetime
  3. add new entry to db:
    grade1 = Grades()
    grade1.gname = "something"
    grade1.gdate = datetime(year=,month=,day=) # fill in the year, month, and day
    grade1.ggirlnum = 70
    grade1.gboynum = 35
    grade1.save()
  4. some other commands:
    Grades.objects.all()    # get all data entries in Grades
    g = Grades.objects.get(pk=1) # get the Grades object whose primary key is 1
    g.gboynum = 45
    g.save() # update the existing entry
    stu1 = g.students_set.create(sname=...) # add a new student to grade g in one line
    g.delete() # delete the object, including its row in the db

To run server

  • execute:
    python manage.py runserver ip:port
  • Admin site:
    • used to publish, add, modify, and delete content
    • requires 'django.contrib.admin' in INSTALLED_APPS, which is there by default
    • to use it, append "/admin" to the site URL and log in
    • the language and time zone can be changed in settings.py (LANGUAGE_CODE and TIME_ZONE)

(to be continued…)