CODE
import pandas
import numpy
pandas.set_option('display.float_format', lambda x: '%f' % x)
data = pandas.read_csv('20151009gap.csv', low_memory=False)
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)
# Convert the variables of interest to numeric; entries that cannot be parsed become NaN
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
#Week 3 - Managing Data Assignment
#Making a copy of data to sub5 data frame
sub5=data.copy()
#Filling empty records with the average value of the column; I do this individually, for the numeric columns only
#fillna replaces NaN with the column mean, so rows with missing values are kept in the analysis instead of being dropped
sub5['polityscore'].fillna((sub5['polityscore'].mean()), inplace=True)
sub5['suicideper100th'].fillna((sub5['suicideper100th'].mean()), inplace=True)
sub5['employrate'].fillna((sub5['employrate'].mean()), inplace=True)
sub5['incomeperperson'].fillna((sub5['incomeperperson'].mean()), inplace=True)
sub5['co2emissions'].fillna((sub5['co2emissions'].mean()), inplace=True)
# categorize quantitative variable based on customized splits using cut function - making a new variable polity4
# splits into 4 groups
sub5['polity4'] = pandas.cut(sub5.polityscore, [-10, -5, 0, 5, 10])
f1 = sub5['polity4'].value_counts(sort=False)
f2 = sub5['polity4'].value_counts(sort=False, normalize=True)
print('\n\n Polity score divided into 4 parts; the frequency and percentage of each are given below \n\n')
print(f1)
print(f2)
# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['suicide4']=pandas.qcut(sub5.suicideper100th, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f3 = sub5['suicide4'].value_counts(sort=False)
f4 = sub5['suicide4'].value_counts(sort=False, normalize=True)
print('\n\n Suicide rate divided into 4 parts; the frequency and percentage of each are given below \n\n')
print(f3)
print(f4)
# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['employ4']=pandas.qcut(sub5.employrate, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f5 = sub5['employ4'].value_counts(sort=False)
f6 = sub5['employ4'].value_counts(sort=False, normalize=True)
print('\n\n Employment rate divided into 4 parts; the frequency and percentage of each are given below \n\n')
print(f5)
print(f6)
OUTPUT
Polity score divided into 4 parts; the frequency and percentage of each are given below
(-10, -5] 23
(-5, 0] 27
(0, 5] 71
(5, 10] 90
dtype: int64
(-10, -5] 0.107981
(-5, 0] 0.126761
(0, 5] 0.333333
(5, 10] 0.422535
dtype: float64
Suicide rate divided into 4 parts; the frequency and percentage of each are given below
1=0%tile 54
2=25%tile 53
3=50%tile 53
4=75%tile 53
dtype: int64
1=0%tile 0.253521
2=25%tile 0.248826
3=50%tile 0.248826
4=75%tile 0.248826
dtype: float64
Employment rate divided into 4 parts; the frequency and percentage of each are given below
1=0%tile 56
2=25%tile 68
3=50%tile 36
4=75%tile 53
dtype: int64
1=0%tile 0.262911
2=25%tile 0.319249
3=50%tile 0.169014
4=75%tile 0.248826
dtype: float64
ANALYSIS
1 - Handling missing data
To handle missing data, I filled the missing records with the mean of the column. This was done through code such as
sub5['polityscore'].fillna((sub5['polityscore'].mean()), inplace=True)
Mean imputation keeps every observation in the analysis instead of dropping rows with missing values.
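As a minimal sketch of the same idea (with made-up toy numbers rather than the gapminder data), fillna with the column mean works like this:

```python
import pandas as pd
import numpy as np

# Hypothetical series with one missing value
s = pd.Series([2.0, 4.0, np.nan, 6.0])

# Mean of the non-missing values is 4.0, so NaN is replaced with 4.0
filled = s.fillna(s.mean())
print(filled.tolist())  # [2.0, 4.0, 4.0, 6.0]
```

Note the mean is computed over the non-missing values only, so the imputation does not shift the column average.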
2 - Making three new variables
I collapsed polityscore, suicideper100th and employrate into three new variables named polity4, suicide4 and employ4. Each was split into 4 groups, using two different methods: pandas.cut with custom bin edges for polity4, and pandas.qcut quartile splits for suicide4 and employ4.
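The difference between the two methods can be shown on a small made-up series (the values below are illustrative, not from the dataset):

```python
import pandas as pd

scores = pd.Series([-9, -3, 1, 2, 4, 7, 8, 9])

# cut: fixed, user-chosen bin edges (same edges as polity4 above),
# so group sizes depend on where the data falls
fixed = pd.cut(scores, [-10, -5, 0, 5, 10])

# qcut: data-driven quantile edges, so each group holds ~25% of the rows
quart = pd.qcut(scores, 4, labels=["Q1", "Q2", "Q3", "Q4"])

print(fixed.value_counts(sort=False))  # counts 1, 1, 3, 3
print(quart.value_counts(sort=False))  # counts 2, 2, 2, 2
```

This is why the polity4 groups below have very different sizes while the quartile-based groups are close to equal.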
3 - Analysis outcome
From the output I see that polity scores from 5 to 10 account for about 42% of the data.
The suicide rate is evenly distributed across the four groups, which is expected from a quartile split; I will need to look into this variable more closely to find new insight.
Finally, the employment rate groups are uneven despite the quartile split (tied values can make qcut bins unequal), with the 50-75th percentile group the smallest. That points me to that specific range to investigate why it holds the fewest countries.