Thursday, October 22, 2015

Data Management and Visualization - Wesleyan University (Coursera) - Week 4 - Ammar Shigri

CODE

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

pandas.set_option('display.float_format',lambda x:'%f'%x)



data = pandas.read_csv('20151009gap.csv', low_memory=False)

print(len(data))            #Number of observations (Rows)
print(len(data.columns))    #Number of Variables (Columns)




# convert_objects has been removed from pandas; to_numeric with errors='coerce' turns non-numeric entries into NaN
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')


#Making a copy of data to sub5 data frame
sub5=data.copy()

#Filling empty records with the average value of the column; I am doing this individually, for the numeric columns only
#fillna replaces NaN with the column mean, so no row is dropped because of a missing value.
#The result is assigned back to the column; calling fillna with inplace=True on a selected column may not update sub5 in recent pandas versions.
sub5['polityscore'] = sub5['polityscore'].fillna(sub5['polityscore'].mean())
sub5['suicideper100th'] = sub5['suicideper100th'].fillna(sub5['suicideper100th'].mean())
sub5['employrate'] = sub5['employrate'].fillna(sub5['employrate'].mean())
sub5['incomeperperson'] = sub5['incomeperperson'].fillna(sub5['incomeperperson'].mean())
sub5['co2emissions'] = sub5['co2emissions'].fillna(sub5['co2emissions'].mean())



# categorize quantitative variable based on customized splits using cut function - making a new variable polity4
# splits into 4 groups
sub5['polity4'] = pandas.cut(sub5.polityscore, [-10, -5, 0, 5, 10])
f1 = sub5['polity4'].value_counts(sort=False)
f2 = sub5['polity4'].value_counts(sort=False, normalize=True)

print ('\n\n Polity Score divided into 4 parts, frequency and percentage of each is given \n\n')
print(f1)
print(f2)


# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['suicide4']=pandas.qcut(sub5.suicideper100th, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f3 = sub5['suicide4'].value_counts(sort=False)
f4 = sub5['suicide4'].value_counts(sort=False, normalize=True)

print ('\n\n suicide Score divided into 4 parts, frequency and percentage of each is given \n\n')
print(f3)
print(f4)



# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['employ4']=pandas.qcut(sub5.employrate, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f5 = sub5['employ4'].value_counts(sort=False)
f6 = sub5['employ4'].value_counts(sort=False, normalize=True)

print ('\n\n employrate Score divided into 4 parts, frequency and percentage of each is given \n\n')
print(f5)
print(f6)


"""
#basic scatterplot:  Q->Q
scat1 = seaborn.regplot(x="polityscore", y="suicideper100th", data=data)
plt.xlabel('polityscore')
plt.ylabel('Suicide rate per 100th')
plt.title('Scatterplot for the Association Between Suicide Rate and polityscore')



scat2 = seaborn.regplot(x="employrate", y="suicideper100th", data=data)
plt.xlabel('employrate')
plt.ylabel('Suicide rate per 100th')
plt.title('Scatterplot for the Association Between Suicide Rate and employrate')
"""

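# quartile split of the employment rate (qcut with 4 groups gives a quartile split)
# the split is taken on employrate itself; the earlier version split incomeperperson by mistake
print ('employment - 4 categories - quartiles')
data['employrate']=pandas.qcut(data.employrate, 4, labels=["1=25%tile","2=50%tile","3=75%tile","4=100%tile"])
g1 = data['employrate'].value_counts(sort=False, dropna=True)
print (g1)

# bivariate bar graph: mean suicide rate within each employment quartile
# factorplot was renamed catplot in seaborn 0.9; errorbar=None replaces ci=None from seaborn 0.12 on
seaborn.catplot(x='employrate', y='suicideper100th', data=data, kind="bar", errorbar=None)
plt.xlabel('employrate')
plt.ylabel('suicideper100th')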

OUTPUT




From the bar chart, and even from the scatter plot, I see that there does not seem to be any direct relationship between the researched variables. Further analysis is required to look into other variables, or maybe the combined effect of the variables will have to be studied.
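
The bar chart plots the mean of suicideper100th within each group on the x-axis, so the same comparison can also be read from a small table. Below is a minimal sketch of that idea. The data frame holds made-up values for illustration (not GapMinder data), and group_means is a hypothetical helper name, not something from the course code.

```python
import pandas

def group_means(frame, group_col, value_col, q=4):
    """Split group_col into q quantile groups and return the mean of value_col per group."""
    # labels=False returns the group number (0 to q-1) instead of an interval
    groups = pandas.qcut(frame[group_col], q, labels=False)
    return frame.groupby(groups)[value_col].mean()

# Illustrative values only, not taken from the GapMinder file
toy = pandas.DataFrame({
    'employrate':      [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0],
    'suicideper100th': [10.0, 12.0,  8.0,  9.0, 11.0,  7.0, 10.0, 13.0],
})

means = group_means(toy, 'employrate', 'suicideper100th')
print(means)
```

If the group means are close to each other and show no steady rise or fall across the quartiles, that matches the impression of no direct relationship.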

Sunday, October 18, 2015

Data Management and Visualization - Wesleyan University (Coursera) - Week 3 - Ammar Shigri

CODE


import pandas
import numpy


pandas.set_option('display.float_format',lambda x:'%f'%x)



data = pandas.read_csv('20151009gap.csv', low_memory=False)

print(len(data))            #Number of observations (Rows)
print(len(data.columns))    #Number of Variables (Columns)




# convert_objects has been removed from pandas; to_numeric with errors='coerce' turns non-numeric entries into NaN
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')



#Week 3 - Managing Data Assignment




#Making a copy of data to sub5 data frame
sub5=data.copy()

#Filling empty records with the average value of the column; I am doing this individually, for the numeric columns only

#fillna replaces NaN with the column mean, so no row is dropped because of a missing value.
#The result is assigned back to the column; calling fillna with inplace=True on a selected column may not update sub5 in recent pandas versions.

sub5['polityscore'] = sub5['polityscore'].fillna(sub5['polityscore'].mean())
sub5['suicideper100th'] = sub5['suicideper100th'].fillna(sub5['suicideper100th'].mean())
sub5['employrate'] = sub5['employrate'].fillna(sub5['employrate'].mean())
sub5['incomeperperson'] = sub5['incomeperperson'].fillna(sub5['incomeperperson'].mean())
sub5['co2emissions'] = sub5['co2emissions'].fillna(sub5['co2emissions'].mean())



# categorize quantitative variable based on customized splits using cut function - making a new variable polity4
# splits into 4 groups
sub5['polity4'] = pandas.cut(sub5.polityscore, [-10, -5, 0, 5, 10])
f1 = sub5['polity4'].value_counts(sort=False)
f2 = sub5['polity4'].value_counts(sort=False, normalize=True)

print ('\n\n Polity Score divided into 4 parts, frequency and percentage of each is given \n\n')
print(f1)
print(f2)


# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['suicide4']=pandas.qcut(sub5.suicideper100th, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f3 = sub5['suicide4'].value_counts(sort=False)
f4 = sub5['suicide4'].value_counts(sort=False, normalize=True)

print ('\n\n suicide Score divided into 4 parts, frequency and percentage of each is given \n\n')
print(f3)
print(f4)



# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
sub5['employ4']=pandas.qcut(sub5.employrate, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
f5 = sub5['employ4'].value_counts(sort=False)
f6 = sub5['employ4'].value_counts(sort=False, normalize=True)

print ('\n\n employrate Score divided into 4 parts, frequency and percentage of each is given \n\n')
print(f5)
print(f6)





OUTPUT


Polity Score divided into 4 parts, frequency and percentage of each is given 


(-10, -5]    23
(-5, 0]      27
(0, 5]       71
(5, 10]      90
dtype: int64
(-10, -5]   0.107981
(-5, 0]     0.126761
(0, 5]      0.333333
(5, 10]     0.422535
dtype: float64


suicide Score divided into 4 parts, frequency and percentage of each is given 


1=0%tile     54
2=25%tile    53
3=50%tile    53
4=75%tile    53
dtype: int64
1=0%tile    0.253521
2=25%tile   0.248826
3=50%tile   0.248826
4=75%tile   0.248826
dtype: float64


employrate Score divided into 4 parts, frequency and percentage of each is given 


1=0%tile     56
2=25%tile    68
3=50%tile    36
4=75%tile    53
dtype: int64
1=0%tile    0.262911
2=25%tile   0.319249
3=50%tile   0.169014
4=75%tile   0.248826
dtype: float64


ANALYSIS
1 - Handling missing data
To handle missing data, I filled the missing records with the mean of the column. This was done through code such as:

sub5['polityscore'] = sub5['polityscore'].fillna(sub5['polityscore'].mean())

This keeps every country in the analysis instead of dropping the rows that have a missing value.
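
To make the effect of the mean fill concrete, here is a minimal sketch on a small made-up series (illustrative values, not GapMinder data). It shows that the fill leaves the column mean unchanged but shrinks the spread, which is worth keeping in mind when many records are missing.

```python
import numpy
import pandas

# Illustrative values only; NaN marks a missing record
rates = pandas.Series([2.0, 4.0, numpy.nan, 6.0, numpy.nan, 8.0])

# mean() skips NaN by default, so the fill value is the mean of the observed records
filled = rates.fillna(rates.mean())

print(filled.tolist())
print(rates.mean(), filled.mean())   # the mean is the same before and after the fill
print(rates.std(), filled.std())     # the standard deviation is smaller after the fill
```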

2 - Making three new variables
I collapsed polity score, suicide rate and employrate to make 3 new variables named polity4, suicide4 and employ4. The data was split into 4 parts for each variable, and I used 2 different methods to accomplish this: pandas.cut, which splits at edges that I choose, and pandas.qcut, which splits at the quantiles of the data.
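
The difference between the two functions can be sketched on a small made-up series (illustrative scores, not GapMinder data): cut uses fixed edges, so the group sizes depend on where the values fall, while qcut computes the edges from the data so that the groups are about equal in size.

```python
import pandas

# Illustrative scores only, all distinct so that no ties affect the quartiles
scores = pandas.Series([-9, -7, -3, 1, 2, 3, 4, 6, 7, 8, 9, 10])

# cut: fixed edges chosen by the analyst; group sizes may differ
by_edges = pandas.cut(scores, [-10, -5, 0, 5, 10])

# qcut: edges taken from the quartiles of the data; group sizes are about equal
by_quartile = pandas.qcut(scores, 4, labels=['q1', 'q2', 'q3', 'q4'])

edge_counts = by_edges.value_counts(sort=False)
quartile_counts = by_quartile.value_counts(sort=False)
print(edge_counts)
print(quartile_counts)
```

Note that the intervals made by cut are open on the left by default, so a value exactly equal to the lowest edge (-10 here) is left out unless include_lowest=True is passed.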

3 - Analysis outcome
From the output I see that polity scores above 5 and up to 10 account for 42% of the data.
The suicide rate is evenly distributed across the four groups, which is expected because qcut splits at the quartiles. I will need to look into this more closely to gain new insight.
Finally, the employment rate is less evenly distributed: the third group (50th to 75th percentile) is the smallest, at about 17% of the data. That gives me a specific area to look into, to find out why the count is lowest between the 50th and 75th percentile.
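
One possible reason for the unequal employ4 counts (56, 68, 36, 53) is the mean fill itself. This is an assumption that I have not checked against the GapMinder file. Filling the gaps with the mean creates many identical values, and qcut has to place tied values in the same group. A minimal sketch with made-up numbers:

```python
import numpy
import pandas

# Illustrative values only: 8 observed rates plus 4 missing records
observed = [40.0, 44.0, 48.0, 52.0, 56.0, 60.0, 64.0, 68.0]
rates = pandas.Series(observed + [numpy.nan] * 4)

# All 4 gaps receive the same value, the mean of the observed rates (54.0)
filled = rates.fillna(rates.mean())

groups = pandas.qcut(filled, 4, labels=['q1', 'q2', 'q3', 'q4'])
counts = groups.value_counts(sort=False)
print(counts)
```

In this toy case the four tied values all land in the second group, which makes it larger and leaves the third group smaller, a pattern similar to the one in the employ4 output.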



Sunday, October 11, 2015

Data Management and Visualization - Wesleyan University (Coursera) - Week 2 - Ammar Shigri

CODE

import pandas
import numpy


pandas.set_option('display.float_format',lambda x:'%f'%x)



data = pandas.read_csv('20151009gap.csv', low_memory=False)

print(len(data))            #Number of observations (Rows)
print(len(data.columns))    #Number of Variables (Columns)




# convert_objects has been removed from pandas; to_numeric with errors='coerce' turns non-numeric entries into NaN
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')


"""
print("count of similar suiciderate distributed over countries, plus percentage of that value")
c1 = data["suicideper100th"].value_counts(sort=False, dropna = False)
print (c1)

p1 = data["suicideper100th"].value_counts(sort=False, normalize=True)
print (p1)


print("count of similar employment rate distributed over countries, plus percentage of that value")
c2 = data["employrate"].value_counts(sort=False, dropna = False)
print (c2)

p2 = data["employrate"].value_counts(sort=False, normalize=True)
print (p2)


print("count of similar income distributed over countries, plus percentage of that value")
c3 = data["incomeperperson"].value_counts(sort=False, dropna = False)
print (c3)

p3 = data["incomeperperson"].value_counts(sort=False, normalize=True)
print (p3)


print("count of similar political score distributed over countries, plus percentage of that value")
c4 = data["polityscore"].value_counts(sort=False, dropna = False)
print (c4)

p4 = data["polityscore"].value_counts(sort=False, normalize=True)
print (p4)


print("count of similar Co2Emissions distributed over countries, plus percentage of that value")
c5 = data["co2emissions"].value_counts(sort=False, dropna = False)
print (c5)

p5 = data["co2emissions"].value_counts(sort=False, normalize=True)
print (p5)


ct1 = data.groupby("suicideper100th").size()
print (ct1)

pt1 = data.groupby("suicideper100th").size()*100/len(data)
print(pt1)
"""



sub1=data[(data['incomeperperson']<=110000) & (data['polityscore']<=10) & (data['employrate']<=60) & (data['suicideper100th']>=1)]



sub2=sub1.copy()


print('\n\n\n\n\n\n\ncriteria-1 / what is polity score when Employrate is less than or equal to 60\n')

print('\n\ncount employrate from table with criteria-1\n')
c6 = sub2["employrate"].value_counts(sort=False)
print (c6)


"""
print('percentage suicideper100th from table with criteria-1')
p6 = sub2["suicideper100th"].value_counts(sort=False, normalize=True)
print (p6)
"""


print('\n\ncount polityscore from table with criteria-1\n')
c7 = sub2["polityscore"].value_counts(sort=False)
print (c7)

"""
print('\npercentage polityscore per person from table with criteria-1\n')
p7 = sub2["polityscore"].value_counts(sort=False, normalize=True)
print (p7)
"""

print('\n\ncount suicideper100th from the table with criteria-1\n')
c10 = sub2["suicideper100th"].value_counts(sort=False)
print (c10)





sub3=data[(data['incomeperperson']<=110000) & (data['polityscore']<=10) & (data['employrate']>60) & (data['suicideper100th']>=1)]



sub4=sub3.copy()


print('\n\n\n\n\n\n criteria-2 / what is polity score when Employrate is greater than 60\n')

# the labels below refer to criteria-2, since sub4 holds the rows with employrate above 60
print('\n\n\ncount employrate from table with criteria-2\n')
c8 = sub4["employrate"].value_counts(sort=False)
print (c8)


print('\n\n\ncount polityscore from table with criteria-2\n')
c9 = sub4["polityscore"].value_counts(sort=False)
print (c9)

print('\n\n\ncount suicideper100th from the table with criteria-2\n')
c11 = sub4["suicideper100th"].value_counts(sort=False)
print (c11)















OUTPUT from the code


criteria-1 / what is polity score when Employrate is less than or equal to 60
count employrate from table with criteria-1
52.099998    1
57.200001    1
56.299999    2
51.299999    1
58.599998    1
59.299999    1
53.099998    1
58.799999    2
46.900002    1
59.900002    3
58.500000    1
34.900002    1
46.400002    1
54.400002    1
37.400002    1
57.900002    1
39.000000    1
40.099998    1
47.799999    1
42.400002    1
47.099998    1
45.700001    1
46.000000    2
47.299999    3
48.599998    2
49.599998    1
50.500000    1
51.400002    1
52.700001    1
53.400002    2
            ..
59.099998    2
48.700001    1
58.900002    3
49.500000    1
41.099998    1
57.599998    1
56.400002    1
50.900002    2
52.500000    1
51.000000    2
42.799999    1
56.799999    1
57.299999    1
59.000000    1
44.299999    1
41.599998    1
50.700001    1
57.500000    2
53.500000    1
58.200001    2
54.599998    1
51.200001    2
55.099998    1
55.400002    1
59.799999    1
55.900002    2
46.200001    1
59.700001    1
42.000000    1
56.500000    1
Name: employrate, dtype: int64
count polityscore from table with criteria-1
0.000000       2
1.000000       1
2.000000       1
3.000000       2
4.000000       2
5.000000       4
6.000000       4
7.000000       5
8.000000      11
9.000000      12
10.000000     22
-10.000000     1
-9.000000      3
-8.000000      1
-7.000000      4
-6.000000      1
-4.000000      4
-3.000000      3
-2.000000      2
Name: polityscore, dtype: int64
count suicideper100th from the table with criteria-1
5.888479     1
1.799904     1
2.206169     1
3.741588     1
4.848770     1
5.931845     1
6.597168     1
7.699330     1
15.542603    1
9.216544     1
10.171870    1
11.213970    1
12.367980    1
13.094370    1
14.776250    1
15.953850    1
16.959240    1
17.032646    1
18.583826    1
16.234370    1
20.317930    1
22.404560    1
6.882952     1
26.874690    1
9.875281     1
3.940259     1
10.059320    1
33.341860    1
35.752872    1
2.816705     1
            ..
7.858619     1
8.021970     1
9.211085     1
8.188375     1
5.838315     1
20.162010    1
10.571910    1
7.060184     1
18.954570    1
13.637060    1
3.563325     1
27.874160    1
19.422610    1
6.265789     1
4.417507     1
6.021882     1
7.745065     1
5.213720     1
12.122269    1
4.119620     1
10.645740    1
12.872222    1
15.714571    1
12.216769    1
13.089616    1
7.214221     1
15.538490    1
6.105282     1
20.369590    1
14.091530    1
Name: suicideper100th, dtype: int64
criteria-2 / what is polity score when Employrate is greater than 60
count employrate from table with criteria-2
62.299999    1
73.199997    1
71.000000    1
73.099998    1
72.000000    1
68.000000    1
62.400002    1
61.799999    1
75.199997    1
60.400002    2
71.699997    1
61.700001    1
62.700001    1
63.900002    1
63.799999    1
78.199997    2
63.700001    1
63.099998    1
60.900002    1
61.500000    3
63.200001    1
63.500000    1
64.500000    1
65.099998    1
66.199997    1
65.900002    1
68.099998    1
65.000000    3
70.400002    2
71.599998    1
72.800003    1
64.199997    1
65.599998    1
75.699997    1
76.000000    1
77.000000    1
78.900002    1
79.800003    1
80.699997    1
81.300003    1
64.900002    1
83.199997    2
81.500000    1
61.000000    2
60.700001    1
71.800003    1
66.000000    1
64.300003    1
67.300003    1
83.000000    1
68.300003    1
68.900002    1
60.500000    1
66.800003    1
71.300003    1
61.299999    1
Name: employrate, dtype: int64
count polityscore from table with criteria-2
0.000000       2
1.000000       2
2.000000       1
4.000000       2
5.000000       3
6.000000       6
7.000000       8
8.000000       7
9.000000       2
10.000000     10
-1.000000      4
-10.000000     1
-8.000000      1
-7.000000      6
-6.000000      1
-5.000000      2
-4.000000      2
-3.000000      2
-2.000000      3
Name: polityscore, dtype: int64
count suicideper100th from the table with criteria-2
1.380965     1
2.034178     1
4.414990     1
5.767406     1
6.057740     1
7.443826     1
8.470030     1
9.873761     1
10.100990    1
11.396111    1
8.973104     1
13.548420    1
14.554677    1
16.913248    1
2.234896     1
11.980497    1
25.404600    1
26.219198    1
9.127511     1
1.658908     1
10.550375    1
2.515721     1
4.777007     1
4.751084     1
10.071942    1
9.847460     1
1.922485     1
12.411181    1
13.905267    1
12.019036    1
            ..
14.547167    1
4.961071     1
11.653322    1
8.204222     1
4.409532     1
8.283071     1
8.913363     1
10.171735    1
8.211067     1
8.164005     1
9.257976     1
10.129350    1
1.392951     1
6.369888     1
13.239810    1
14.713020    1
9.927033     1
11.655210    1
4.907702     1
6.288555     1
4.484753     1
12.179760    1
11.115830    1
14.680936    1
12.289122    1
13.117949    1
7.184853     1
6.811439     1
10.937718    1
14.538357    1
Name: suicideper100th, dtype: int64



Description

Definition of variables
1 - Employrate: 2007 total employees age 15+ (% of population). Percentage of total population, age above 15, that has been employed during the given year.
2 - Polityscore: 2009 Democracy score (Polity). Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.
3 - Incomeperperson: 2010 Gross Domestic Product per capita in constant 2000 US$. Inflation, but not the differences in the cost of living between countries, has been taken into account.
4 - Co2emissions: 2006 cumulative CO2 emission (metric tons). Total amount of CO2 emission in metric tons since 1751. CDIAC (Carbon Dioxide Information Analysis Center).
5 - Suicideper100th: 2005 suicide, age adjusted, per 100,000. Mortality due to self-inflicted injury, per 100,000 standard population, age adjusted.

I am particularly interested in finding out how suicide rate relates to employment rate, polity score, income per person and CO2 emissions.

From the output of the code I was able to observe one interesting fact about how the polity score relates to employrate.

As you can see, when employrate is 60 or less, the polityscore is concentrated at the higher values (8, 9 and 10 are the most frequent scores),
whereas when the employrate is higher than 60, the polityscore is more evenly distributed from higher to lower values.

This is an interesting find, as I previously assumed that employrate and polityscore are positively correlated, which may not be the case and needs to be investigated.
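
The comparison described above can also be summarized in one table instead of two separate frequency lists. Below is a minimal sketch, assuming a small made-up data frame (illustrative rows, not GapMinder data); the column name employ_group is a hypothetical name introduced for this example.

```python
import pandas

# Illustrative rows only, not GapMinder data
toy = pandas.DataFrame({
    'employrate':  [45.0, 52.0, 58.0, 59.0, 61.0, 66.0, 72.0, 80.0],
    'polityscore': [10, 9, 8, 10, -7, 6, 10, -2],
})

# Same threshold as criteria-1 (60 or less) and criteria-2 (above 60)
toy['employ_group'] = toy['employrate'].apply(lambda rate: '<=60' if rate <= 60 else '>60')

# Count, mean and median of polityscore within each employment group
summary = toy.groupby('employ_group')['polityscore'].agg(['count', 'mean', 'median'])
print(summary)
```

On the real data, a clearly higher mean or median in one group would support the pattern seen in the frequency tables, while similar values would suggest it is weaker than it looks.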

My core goal is to find how suicide rate relates to other variables. But by doing frequency tables I was able to identify something I was not initially aware of.

I will do further research as the course progresses to identify interesting facts about the data.




Friday, October 2, 2015

Data Management and Visualization - Wesleyan University (Coursera) - Ammar Shigri

Selecting a research question


Dataset

I have selected GapMinder data. It enables me to analyze data across multiple countries.

Question
I am particularly interested in finding out: how does suicide rate relate to employment rate?

Hypothesis

It is believed that "employment rate", plus variables such as "income", has a negative correlation with the variable "suicide rate". I would like to test this with data analysis.

I will also analyze relationship of Suicide rate with the following variables

1 - Polityscore 
2 - Incomeperperson
3 - Co2emissions
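
One simple way to check a hypothesis of negative correlation is the Pearson correlation coefficient, which pandas provides through Series.corr. Below is a minimal sketch on made-up values (illustrative only, not GapMinder data). It is one possible approach under these assumptions, not necessarily the method the course will use.

```python
import pandas

# Illustrative values only, not GapMinder data
toy = pandas.DataFrame({
    'employrate':      [50.0, 55.0, 60.0, 65.0, 70.0],
    'suicideper100th': [12.0, 11.0,  9.0,  8.0,  5.0],
})

# Pearson correlation coefficient, between -1 and 1;
# a value below 0 indicates a negative linear association
r = toy['employrate'].corr(toy['suicideper100th'])
print(round(r, 3))
```

The same call can be repeated for each of the other variables listed above, once they have been converted to numeric columns.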

Definition of variables

1 - Employrate: 2007 total employees age 15+ (% of population). Percentage of total population, age above 15, that has been employed during the given year.
2 - Polityscore: 2009 Democracy score (Polity). Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.
3 - Incomeperperson: 2010 Gross Domestic Product per capita in constant 2000 US$. Inflation, but not the differences in the cost of living between countries, has been taken into account.
4 - Co2emissions: 2006 cumulative CO2 emission (metric tons). Total amount of CO2 emission in metric tons since 1751. CDIAC (Carbon Dioxide Information Analysis Center).
5 - Suicideper100th: 2005 suicide, age adjusted, per 100,000. Mortality due to self-inflicted injury, per 100,000 standard population, age adjusted.


Literature Search terms

Google scholar - "suicide and unemployment" - "Suicide and income"

Literature References

Suicide and Suicidal Behavior - http://epirev.oxfordjournals.org/content/30/1/133.full
GapMinder data - http://www.gapminder.org/
Unemployment and suicide. Evidence for a causal association? - http://jech.bmj.com/content/57/8/594.full

Literature Summary

After the research I have found that similar research has been conducted, and it suggests that a rise in unemployment results in an increase in suicide rates. I would like to look into this as well as explore other relations suicide rate might have with income, CO2 emissions and polity score.

I want to see how suicide rate relates to different variables, and to take my analysis further by looking at how it relates to the combined effect of these variables as well, moving into predictive analysis and forecasting the value of the suicide rate.