Sunday, October 18, 2015

Data Management and Visualization - Wesleyan University (Coursera) - Week 3 - Ammar Shigri

CODE


import pandas
import numpy


#Show floats in plain (non-scientific) notation
pandas.set_option('display.float_format', lambda x: '%f' % x)



data = pandas.read_csv('20151009gap.csv', low_memory=False)

print(len(data))            #Number of observations (Rows)
print(len(data.columns))    #Number of Variables (Columns)




#Convert variables to numeric; values that cannot be parsed become NaN
#(convert_objects is deprecated; pandas.to_numeric is the current equivalent)
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')



#Week 3 - Managing Data Assignment




#Making a copy of data to sub5 data frame
sub5=data.copy()

#Filling empty records with the average value of the column; I am doing this individually for the numeric columns only

#The fillna function is used to fill NaN with mean values. This helps make our analysis more accurate, thus managing empty records.

sub5['polityscore'].fillna((sub5['polityscore'].mean()), inplace=True)
sub5['suicideper100th'].fillna((sub5['suicideper100th'].mean()), inplace=True)
sub5['employrate'].fillna((sub5['employrate'].mean()), inplace=True)
sub5['incomeperperson'].fillna((sub5['incomeperperson'].mean()), inplace=True)
sub5['co2emissions'].fillna((sub5['co2emissions'].mean()), inplace=True)



# categorize quantitative variable based on customized splits using cut function - making a new variable polity4
# splits into 4 groups
sub5['polity4'] = pandas.cut(sub5.polityscore, [-10, -5, 0, 5, 10])
f1 = sub5['polity4'].value_counts(sort=False)
f2 = sub5['polity4'].value_counts(sort=False, normalize=True)

print ('\n\n Polity score divided into 4 parts; the frequency and percentage of each is given \n\n')
print(f1)
print(f2)


# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['suicide4']=pandas.qcut(sub5.suicideper100th, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f3 = sub5['suicide4'].value_counts(sort=False)
f4 = sub5['suicide4'].value_counts(sort=False, normalize=True)

print ('\n\n Suicide rate divided into 4 parts; the frequency and percentage of each is given \n\n')
print(f3)
print(f4)



# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['employ4']=pandas.qcut(sub5.employrate, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f5 = sub5['employ4'].value_counts(sort=False)
f6 = sub5['employ4'].value_counts(sort=False, normalize=True)

print ('\n\n Employment rate divided into 4 parts; the frequency and percentage of each is given \n\n')
print(f5)
print(f6)





OUTPUT


Polity score divided into 4 parts; the frequency and percentage of each is given 


(-10, -5]    23
(-5, 0]      27
(0, 5]       71
(5, 10]      90
dtype: int64
(-10, -5]   0.107981
(-5, 0]     0.126761
(0, 5]      0.333333
(5, 10]     0.422535
dtype: float64


Suicide rate divided into 4 parts; the frequency and percentage of each is given 


1=0%tile     54
2=25%tile    53
3=50%tile    53
4=75%tile    53
dtype: int64
1=0%tile    0.253521
2=25%tile   0.248826
3=50%tile   0.248826
4=75%tile   0.248826
dtype: float64


Employment rate divided into 4 parts; the frequency and percentage of each is given 


1=0%tile     56
2=25%tile    68
3=50%tile    36
4=75%tile    53
dtype: int64
1=0%tile    0.262911
2=25%tile   0.319249
3=50%tile   0.169014
4=75%tile   0.248826
dtype: float64


ANALYSIS
1 - Handling missing data
To handle missing data, I filled the missing records with the mean of the column. This was done through the code 

sub5['polityscore'].fillna((sub5['polityscore'].mean()), inplace=True)

This helps to make our analysis more accurate.
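The same mean-imputation idea can be shown on a tiny, made-up series (the values here are purely illustrative, not from the gapminder data):

```python
import pandas

# Hypothetical series with one missing value
s = pandas.Series([2.0, 4.0, None, 6.0])

# Series.mean() skips NaN, so the mean of the observed values is 4.0;
# fillna then replaces the missing entry with that mean
filled = s.fillna(s.mean())
print(filled.tolist())  # [2.0, 4.0, 4.0, 6.0]
```

Note that mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when interpreting later statistics.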

2 - Making three new variables
I collapsed polity score, suicide rate and employment rate to make 3 new variables named polity4, suicide4 and employ4. The data was split into 4 parts for each variable, and I used 2 different methods to accomplish this: the pandas.cut and pandas.qcut functions.
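The difference between the two methods can be seen on a small, made-up series (illustrative values only): pandas.cut makes bins of equal width over the value range, while pandas.qcut makes bins holding roughly equal counts of observations.

```python
import pandas

# Hypothetical skewed values, for illustration only
scores = pandas.Series([1, 2, 3, 4, 10, 20, 30, 40])

# cut: 4 bins of equal *width* over the range 1..40 -> unequal counts
by_width = pandas.cut(scores, 4)
print(by_width.value_counts(sort=False))  # counts 5, 1, 1, 1

# qcut: 4 bins of (roughly) equal *count* -> unequal widths
by_count = pandas.qcut(scores, 4)
print(by_count.value_counts(sort=False))  # counts 2, 2, 2, 2
```

This is why the polity4 frequencies above are unequal (custom splits via cut) while the suicide4 quartile frequencies come out nearly equal (qcut).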

3 - Analysis outcome
From the output I see that polity scores from 5 to 10 account for about 42% of the data.
The suicide rate is fairly evenly distributed across the four quartiles; I will need to look into this more closely to find new insights.
Finally, the employment rate is also fairly evenly distributed, but the smallest group is the third quartile (50 to 75%). That gives me a specific area to examine further to learn why the count is lowest there.


