#Crosstabulation | Explore Tumblr posts and blogs

timothy-mokoka · 2 years ago

Text

Assignment 2: Hypothesis Testing with Chi-Square Test of Independence

Introduction:

This assignment examines a sample of 2,412 marijuana (cannabis) users between the ages of 18 and 30 from the NESARC dataset. My research question is as follows:

Is the number of Cannabis joints smoked per day amongst young adults in USA between the Ages of 18 and 30 the leading cause of mental health disorders such as depression and anxiety?

My Hypothesis Test statements are as follows:

H0: The number of Cannabis joints smoked per day amongst young adults in USA between the Ages of 18 and 30 is not the leading cause of mental health disorders such as depression and anxiety.

Ha: The number of Cannabis joints smoked per day amongst young adults in USA between the Ages of 18 and 30 is the leading cause of mental health disorders such as depression and anxiety.

Explanation of the Code:

I used the crosstabulation function to produce a contingency table of observed counts and percentages for each mental health disorder, i.e. depression and anxiety. I did this to examine whether cannabis use status (1 = Yes and 2 = No), the categorical explanatory variable ‘S3BQ1A5’, is associated with the categorical response variables depression (‘MAJORDEP12’) and anxiety (‘GENAXDX12’). I therefore ran a Chi-Square Test of Independence twice, once for each response variable, calculating the chi-square values and the corresponding p-values so that the null hypothesis can be rejected or retained according to the findings.

To visualize the association between the frequency of cannabis use and the depression diagnosis I used the factor-plot function to produce a bivariate graph. I also used the crosstabulation function to test the association between the frequency of cannabis use (‘S3BD5Q2E’) and major depression (‘MAJORDEP12’). After this third Chi-Square Test of Independence I performed a post hoc test using the Bonferroni Adjustment, since the explanatory variable has more than two levels. Doing this makes it possible to identify which pairs of groups differ without inflating the Type I error rate.

Code / Syntax:

# -*- coding: utf-8 -*-
"""
Created on Fri Mar 31 12:20:15 2023

@author: Oteng
"""

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

nesarc = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Sets pandas to show all columns in a dataframe
pandas.set_option('display.max_columns', None)

# Sets pandas to show all rows in a dataframe
pandas.set_option('display.max_rows', None)

nesarc.columns = map(str.upper, nesarc.columns)

pandas.set_option('display.float_format', lambda x: '%f' % x)

# Changes the variables of interest to numeric
nesarc['AGE'] = pandas.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['S3BQ4'] = pandas.to_numeric(nesarc['S3BQ4'], errors='coerce')
nesarc['S3BQ1A5'] = pandas.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
nesarc['S3BD5Q2B'] = pandas.to_numeric(nesarc['S3BD5Q2B'], errors='coerce')
nesarc['S3BD5Q2E'] = pandas.to_numeric(nesarc['S3BD5Q2E'], errors='coerce')
nesarc['MAJORDEP12'] = pandas.to_numeric(nesarc['MAJORDEP12'], errors='coerce')
nesarc['GENAXDX12'] = pandas.to_numeric(nesarc['GENAXDX12'], errors='coerce')

# Subset of my sample of interest
subset1 = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30)] # Ages between 18-30
subsetc1 = subset1.copy()

subset2 = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30) & (nesarc['S3BQ1A5']==1)] # Cannabis users, between age 18-30
subsetc2 = subset2.copy()

# Setting missing data for frequency and cannabis use, variables S3BD5Q2E, S3BQ1A5
subsetc1['S3BQ1A5'] = subsetc1['S3BQ1A5'].replace(9, numpy.nan)
subsetc2['S3BD5Q2E'] = subsetc2['S3BD5Q2E'].replace('BL', numpy.nan)
subsetc2['S3BD5Q2E'] = subsetc2['S3BD5Q2E'].replace(99, numpy.nan)

# Contingency table of observed counts of major depression diagnosis (response variable) within cannabis use (explanatory variable), in ages 18-30
contab1 = pandas.crosstab(subsetc1['MAJORDEP12'], subsetc1['S3BQ1A5'])
print(contab1)

# Column percentages
colsum = contab1.sum(axis=0)
colpcontab = contab1/colsum
print(colpcontab)

# Chi-square calculations for major depression within cannabis use status
print('Chi-square value, p value, expected counts, for major depression within cannabis use status')
chsq1 = scipy.stats.chi2_contingency(contab1)
print(chsq1)

# Contingency table of observed counts of general anxiety diagnosis (response variable) within cannabis use (explanatory variable), in ages 18-30
contab2 = pandas.crosstab(subsetc1['GENAXDX12'], subsetc1['S3BQ1A5'])
print(contab2)

# Column percentages
colsum2 = contab2.sum(axis=0)
colpcontab2 = contab2/colsum2
print(colpcontab2)

# Chi-square calculations for general anxiety within cannabis use status
print('Chi-square value, p value, expected counts, for general anxiety within cannabis use status')
chsq2 = scipy.stats.chi2_contingency(contab2)
print(chsq2)

# Contingency table of observed counts of major depression diagnosis (response variable) within frequency of cannabis use (10 level explanatory variable), in ages 18-30
contab3 = pandas.crosstab(subset2['MAJORDEP12'], subset2['S3BD5Q2E'])
print(contab3)

# Column percentages
colsum3 = contab3.sum(axis=0)
colpcontab3 = contab3/colsum3
print(colpcontab3)

# Chi-square calculations for major depression within frequency of cannabis use groups
print('Chi-square value, p value, expected counts for major depression associated frequency of cannabis use')
chsq3 = scipy.stats.chi2_contingency(contab3)
print(chsq3)

# Dictionary with details of frequency variable reverse-recode
recode1 = {1: 9, 2: 8, 3: 7, 4: 6, 5: 5, 6: 4, 7: 3, 8: 2, 9: 1}
# Change variable name from S3BD5Q2E to CUFREQ
subsetc2['CUFREQ'] = subsetc2['S3BD5Q2E'].map(recode1)

subsetc2["CUFREQ"] = subsetc2["CUFREQ"].astype('category')

# Rename graph labels for better interpretation
subsetc2['CUFREQ'] = subsetc2['CUFREQ'].cat.rename_categories(["2 times/year","3-6 times/year","7-11 times/year","Once a month","2-3 times/month","1-2 times/week","3-4 times/week","Nearly every day","Every day"])

# Graph percentages of major depression within each cannabis smoking frequency group
# (factorplot was renamed catplot in seaborn 0.9 and later removed; on current seaborn call seaborn.catplot instead)
plt.figure(figsize=(12,4)) # Change plot size
ax1 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=subsetc2, kind="bar", ci=None)
ax1.set_xticklabels(rotation=40, ha="right") # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.show()

# Post hoc test, pair comparison of frequency groups 1 and 9, 'Every day' and '2 times a year'
recode2 = {1: 1, 9: 9}
subsetc2['COMP1v9'] = subsetc2['S3BD5Q2E'].map(recode2)

# Contingency table of observed counts
ct4 = pandas.crosstab(subsetc2['MAJORDEP12'], subsetc2['COMP1v9'])
print(ct4)

# Column percentages
colsum4 = ct4.sum(axis=0)
colpcontab4 = ct4/colsum4
print(colpcontab4)

# Chi-square calculations for pair comparison of frequency groups 1 and 9, 'Every day' and '2 times a year'
print('Chi-square value, p value, expected counts, for pair comparison of frequency groups -Every day- and -2 times a year-')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

# Post hoc test, pair comparison of frequency groups 2 and 6, 'Nearly every day' and 'Once a month'
recode3 = {2: 2, 6: 6}
subsetc2['COMP2v6'] = subsetc2['S3BD5Q2E'].map(recode3)

# Contingency table of observed counts
ct5 = pandas.crosstab(subsetc2['MAJORDEP12'], subsetc2['COMP2v6'])
print(ct5)

# Column percentages
colsum5 = ct5.sum(axis=0)
colpcontab5 = ct5/colsum5
print(colpcontab5)

# Chi-square calculations for pair comparison of frequency groups 2 and 6, 'Nearly every day' and 'Once a month'
print('Chi-square value, p value, expected counts for pair comparison of frequency groups -Nearly every day- and -Once a month-')
cs5 = scipy.stats.chi2_contingency(ct5)
print(cs5)

Output:

Explanation: When testing the association between cannabis use and major depression amongst young adults aged between 18 and 30 years, the Chi-Square Test of Independence shows that those who were cannabis users in the last 12 months were more likely to have been diagnosed with major depression (about 18%) than non-users of cannabis (8.4%), X2 = 171.6, 1 df, p-value = 3.16e-39. Since the p-value is extremely small, the results provide strong evidence against the null hypothesis. Thus, we reject the null hypothesis in favour of the alternative hypothesis, since there is a positive association between cannabis use and major depression.

Explanation: When testing the association between cannabis use and general anxiety, the Chi-Square Test of Independence reveals that, amongst young adults aged between 18 and 30 years, those who were cannabis users were more likely to have been diagnosed with general anxiety in the last 12 months (3.8%) than non-users of cannabis (1.6%), X2 = 40.22, 1 df, p-value = 2.26e-10. These results provide enough evidence against the null hypothesis to reject it in favour of the alternative hypothesis, which indicates a positive association between cannabis use and general anxiety.

Explanation: This third Chi-Square Test of Independence shows that, for cannabis users aged between 18 and 30 years, the frequency of cannabis use and major depression in the past 12 months were significantly associated, X2 = 35.18, 10 df, p-value = 0.00011.

Explanation: The bivariate graph above, presenting my sample of interest, shows a positive association between the frequency of cannabis use and major depression in the past 12 months. The distribution is skewed to the left, which indicates that the more often individuals aged 18 to 30 smoked cannabis, the more likely they were to have experienced major depression in the past 12 months.

Explanation: The post hoc comparison (with the Bonferroni Adjustment) of the rate of major depression between the “Every day” and “2 times a year” frequency categories gives a p-value of 0.00019, and the percentages of major depression diagnosis for the two groups are 23.7% and 11.6% respectively. Since the p-value is smaller than the Bonferroni-adjusted threshold (0.00019 < 0.0011), we can conclude that these rates differ from one another. Therefore, we reject the null hypothesis for this pair in favour of the alternative hypothesis.

Explanation: For the post hoc comparison (with the Bonferroni Adjustment) of major depression between the “Nearly every day” and “Once a month” frequency categories, the p-value is 0.046 and the percentages of major depression for the two groups are 23.3% and 13.7% respectively. Since the p-value is bigger than the Bonferroni-adjusted threshold (0.046 > 0.0011), these rates are not significantly different from one another. Thus, in this instance, we fail to reject the null hypothesis.
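The Bonferroni threshold of 0.0011 used in the two post hoc comparisons can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes 10 frequency groups (so 45 possible pairs) and a family-wise alpha of 0.05, and the names n_groups, alpha and reported are illustrative and not part of the script above.

```python
from math import comb

alpha = 0.05          # family-wise significance level (assumed)
n_groups = 10         # assumed number of cannabis-frequency groups
n_comparisons = comb(n_groups, 2)         # number of possible pairs
adjusted_alpha = alpha / n_comparisons    # Bonferroni-adjusted threshold

# p-values reported in the two post hoc explanations above
reported = {
    "Every day vs 2 times a year": 0.00019,
    "Nearly every day vs Once a month": 0.046,
}
# A pair is significant only if its p-value is below the adjusted threshold
decisions = {pair: p < adjusted_alpha for pair, p in reported.items()}

print(n_comparisons, round(adjusted_alpha, 4))
for pair, reject in decisions.items():
    print(pair, "-> reject H0" if reject else "-> fail to reject H0")
```

Under these assumptions only the first pair falls below the adjusted threshold, which agrees with the two explanations.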

0 notes

educadacademy · 2 years ago

Text

Oracle Database SQL Training

The Oracle Database SQL course is an online course that helps you prepare for the OCP exam. We offer diverse Oracle Database SQL exam preparation. This course covers all the features of SQL, such as editing and running queries, running reports, writing transactions, writing short programs, and more. We have a team of certified Oracle trainers to assist you. It is a practice-based online SQL course that helps you get a full grip on Oracle Database SQL.

Restricting and Sorting Data

Limit the rows that are retrieved by a query

Sort the rows that are retrieved by a query

Use substitution variables

Use the SQL row limiting clause

Create queries using the PIVOT and UNPIVOT clause

Use pattern matching to recognize patterns across multiple rows in a table

Using the Set Operators

Explain set operators

Use a set operator to combine multiple queries into a single query

Control the order of rows returned

Using Single-Row Functions to Customize Output

Describe various types of functions that are available in SQL

Use character, number, and date and analytical (PERCENTILE_CONT, STDDEV, LAG, LEAD) functions in SELECT statements

Use conversion functions

Manipulating Data

Describe the DML statements

Insert rows into a table

Update rows in a table

Delete rows from a table

Control transactions

Reporting Aggregated Data Using the Group Functions

Identify the available group functions

Use group functions

Group data by using the GROUP BY clause

Include or exclude grouped rows by using the HAVING clause

Using DDL Statements to Create and Manage Tables

Categorize the main database objects

Review the table structure

Describe the data types that are available for columns

Create tables

Create constraints for tables

Describe how schema objects work

Truncate tables, and recursively truncate child tables

Use 12c enhancements to the DEFAULT clause, invisible columns, virtual columns and identity columns in table creation/alteration

Displaying Data from Multiple Tables

Use equijoins and nonequijoins

Use a self-join

Use outer joins

Generate a Cartesian product of all rows from two or more tables

Use the cross outer apply clause

Creating Other Schema Objects

Create simple and complex views with visible/invisible columns

Retrieve data from views

Create, maintain and use sequences

Create private and public synonyms

Using Subqueries to Solve Queries

Use subqueries

List the types of subqueries

Use single-row and multiple-row subqueries

Create a lateral inline view in a query

Managing Objects with Data Dictionary Views

Query various data dictionary views

Managing Schema Objects

Manage constraints

Create and maintain indexes including invisible indexes and multiple indexes on the same columns

Create indexes using the CREATE TABLE statement

Create function-based indexes

Drop columns and set column UNUSED

Perform flashback operations

Create and use external tables

Controlling User Access

Differentiate system privileges from object privileges

Grant privileges on tables and on a user

View privileges in the data dictionary

Grant roles

Distinguish between privileges and roles

Manipulating Large Data Sets

Manipulate data using subqueries

Describe the features of multitable INSERTs

Use multitable inserts

Unconditional INSERT

Pivoting INSERT

Conditional ALL INSERT

Conditional FIRST INSERT

Merge rows in a table

Track the changes to data over a period of time

Use explicit default values in INSERT and UPDATE statements

Managing Data in Different Time Zones

Use various date time functions

Tz_offset

from_tz

to_timestamp

to_timestamp_tz

to_yminterval

to_dsinterval

current_date

current_timestamp

localtimestamp

dbtimezone

sessiontimezone

Generating Reports by Grouping Related Data

Use the ROLLUP operation to produce subtotal values

Use the CUBE operation to produce crosstabulation values

Use the GROUPING function to identify the row values created by ROLLUP or CUBE

Use GROUPING SETS to produce a single result set

Retrieving Data Using Subqueries

Use multiple-column subqueries

Use scalar subqueries

Use correlated subqueries

Update and delete rows using correlated subqueries

Use the EXISTS and NOT EXISTS operators

Use the WITH clause

Hierarchical Retrieval

Interpret the concept of a hierarchical query

Create a tree-structured report

Format hierarchical data

Exclude branches from the tree structure

Regular Expression Support

Use meta Characters

Use regular expression functions to search, match and replace

Use replacing patterns

Use regular expressions and check constraints

International Student Fee : 300 USD | 395 CAD | 1,125 AED | 1,125 SAR

Flexible Class Options

Corporate Group Training | Fast-Track

Week End Classes For Professionals SAT | SUN

Online Classes – Live Virtual Class (L.V.C), Online Training

#onlineclasses #onlinecourses #oracletraining #oracle database #educad_academy

0 notes

winportables · 3 years ago

Text

StatPlus Pro Portable

You can work with various statistical tools and graphical analysis methods, such as Analysis of Variance (ANOVA), Design of Experiments (DOE), as well as regression, time series, and survival analysis. StatPlus Pro Portable is an advanced statistical analysis program intended to help you perform everything from data transformation and sampling to complex regression and non-parametric analysis, survival analysis, and other functions. The application comes with a multitude of charts (histograms, bars, areas, dot charts, pie charts, statistics, control charts) and spreadsheets of mathematical, statistical and financial functions. It also provides support for an Excel add-in that allows you to perform statistical tasks directly from the Excel interface.

Clean feature line

StatPlus Pro Portable reveals a well-structured GUI where you can enter data directly into a spreadsheet or import it from HTML, XLS, CSV, SAV, ODS, or other file formats. Thanks to the multi-tab design, you can work with different tabs at the same time and quickly switch between them.

Editing functions

Editing functions are implemented to help you perform clipboard-related tasks (cut, copy, paste), delete entries, search for items, and undo or redo your actions. A spell checker is included in the package. Additionally, you can insert cells, charts, symbols, functions, comments, images, and hyperlinks. Each cell can be customized in terms of layout (such as horizontal or vertical alignment), color, font, and border. You can print the information, email it, or export it to the same file formats as the input.

Analysis tools

StatPlus Pro Portable supports a wide range of statistical utilities, so be prepared to spend some of your time discovering them. These include mean comparison t-tests, the Pagurova criterion and the G criterion, the F test, one- and two-sample z-tests, correlation coefficients (Pearson, Fechner) and covariation, normality tests, crosstabulation, and frequency table analysis (discrete / continuous). Additionally, you can perform analysis of variance (ANOVA) with one-, two-, or three-way designs, data classification, design of experiments (DOE), as well as non-parametric statistics, such as 2 × 2 table analysis (e.g., chi-square, Yates chi-square, Fisher's exact test), rank correlations, and Cochran's Q test. You can perform regression analysis (for example, logistic regression, polynomial regression), time series analysis (for example, moving average, Fourier analysis, data processing), survival analysis (Cox proportional hazards regression), power and sample size analysis (PASS), and data processing (e.g., random number generation, matrix operations, sampling). The tool allows you to generate charts such as Gantt, arrow, bubble, error, and pie charts, and control charts such as X-bar, R-chart, S-chart, P-chart, C-chart, U-chart and CUSUM-chart. Graphics can be printed or exported to BMP, GIF, JPEG, PDF, SVG, or other file formats.

Release year: 2021
Version: 6.2.5.0
System: Windows® 2000 / XP / Vista / 7 / 8 / 8.1. On Windows 10 it is POSSIBLE, BUT NOT GUARANTEED!
Interface language: Multilanguage, English included
File size: 75.88 MB
Format: Rar
Execute as an administrator: There's no need

0 notes

sanjeev216-blog · 5 years ago

Link

#tableau #crosstab #crosstabview #creation #crosstabulation #dataanalytics #datavisualization

0 notes

acemywriter · 3 years ago

Text

Quantitative Analysis Report: Crosstabulation

*INSTRUCTIONS ALSO UPLOADED IN FILES SECTION. QUANTITATIVE ANALYSIS REPORT: CROSSTABULATION AND CORRELATION ANALYSIS ASSIGNMENT INSTRUCTIONS OVERVIEW You will take part in several data analysis assignments in which you will develop a report using tables and figures from the IBM SPSS® output file of your results. Using the resources and readings provided, you will interpret these results and test…

View On WordPress

#Writing

0 notes

essaynook · 4 years ago

Text

Provide a table showing the average sales revenue, variable cost and contribution margin per region per brand.

Doing some basic Google Colab steps with provided data, like making rows and tables and using mathematical functions. The output file has to be JSON and Colab (Python). Needed = JSON / Google Colab file with: • A crosstabulation table showing the number of transactions and test whether the relationship between these two variables is significant using the chi-square test of independence. • Calculate the…

View On WordPress

0 notes

fufupaw · 4 years ago

Text

This week's main Discussion requires you to answer the question completely and c

This week’s main Discussion requires you to answer the question completely and correctly to receive full credit. This week we talk about the uses of a crosstabulation (crosstabs) and the benefits of creating this “snapshot” of your data. For this forum, provide a brief introduction to your study to remind your classmates what we are reading about here. Include: 1. Your overall research…

View On WordPress

0 notes

the-social-networks · 4 years ago

Text

Digital Community and Fandom: Reality TV

WEEK 4

Reality television is an easy ratings grab yet an often criticised genre, popular amongst audiences but still strongly associated with “over the top emotions” (Kavka, 2019) or self-absorbed, wannabe celebrities. Though seemingly the most hated television genre (Morning Consult et al. 2018), it is a guilty pleasure for viewers that garners strong fanbases, with dedicated forums and social media pages made by fans (or fascinated haters) of the Kardashians, The Real Housewives or MAFS. So what makes reality tv so fascinating for audiences, and what role does social media have in its success, and vice versa?

I used the above gif as an illustration of the major influence reality tv has on creating social publics. Kim Kardashian, famous for... being famous, has created an empire along with her family from their reality tv show. However, their success came during the advent of social media, with their success furthered by fandoms online - as well as critics - constantly discussing, mocking, or enjoying the show via the sharing of memes and iconic moments from the program (such as the gif above). Keeping Up With the Kardashians has covered a plethora of issues, with the world watching and debating these topics online. From minor family drama to Caitlyn Jenner’s transition, the Kardashians showcase the privilege and naivety we expect from Beverly Hills rich kids. However, the universal themes and social issues raised throughout its 10 year run acted as a catalyst for important discourse across social media.

In week 4, the lecture addressed that reality tv is less reliant on television as a medium than it is on social media, with platforms giving reality stars the opportunity to present an even deeper look into their ‘personal’ lives and to “perform amplified versions” (Arcy, 2018) of themselves for audiences online. This idea of monetising and incentivising every aspect of the star’s life not only makes audiences feel as though they are engaging with the content on a deeper level (e.g. buying the same lipstick worn and promoted by their favourite Kardashian), it also creates a marketing tool for the show itself as well as for brands that wish to be associated with it. The active participation of viewers and the two-way communication that social media provides give reality television an aspect of realism that other programming may lack; however, as these shows become more and more intertwined with their fanbases online, the authenticity of these stars and these shows starts to fade. An example of this is Love Island, a show that relies on its viewers being active audiences of both television and social media.

This symbiotic relationship isn’t always positive or beneficial for producers, as Xavier L’Hoiry notes. Discussing the relationships that individual viewers have with each other online, L’Hoiry argues that though Love Island’s social media marketing strategy worked in engaging fans, it also caused issues for the show itself. Fans had the ability to access, share and discuss footage that proved the tv show’s editing was misleading, and creating an air of doubt about the realism of the show. Despite this, it seems the fans were “not seeking to counter organizational surveillance in order to destroy these systems” (L’Hoiry, 2019), but rather were so invested in the content that they wanted to know more, investigate more and uncover every detail surrounding the show, without these issues affecting ratings.

In my opinion, reality television is becoming less authentic in order to remain entertaining. However, this strong focus on editing and on manipulating social media to fit a particular narrative can also create digital publics, such as hashtags, and prompt important social issues to be discussed online around the controversies or conversations that appear on these shows.

References

Arcy, J. (2018) The digital money shot: Twitter wars, The Real Housewives, and transmedia storytelling, Celebrity Studies, 9:4, 487-502.

Hajru, A., Graham, T. (2011) Reality TV as a trigger of everyday political talk in the net-based public sphere, European Journal of Communication, 26:1, 18-32.

L’Hoiry, X. (2019) Love Island, Social Media, and Sousveillance: New Pathways of Challenging Realism in Reality TV, Frontiers in Sociology, 4:59.

Morning Consult, The Hollywood Reporter. (2018) National Tracking Poll #181129 Crosstabulation Results, viewed 15 April 2021 <https://morningconsult.com/wp-content/uploads/2018/11/181129_crosstabs_HOLLYWOOD_REPORTER_Reality-TV.pdf>.

#fandoms #reality tv #mda20009 #digital communities

0 notes

ansprasad · 4 years ago

Text

Data Analysis Tools - Assignment 4

used python for checking depression as a moderating variable in cigarettes smoked vs nicotine dependence

loaded data using pandas

data = pd.read_csv('nesarc.csv', low_memory=False)

converted to numeric using the following code

data['TAB12MDX'] = data['TAB12MDX'].apply(pd.to_numeric, errors='coerce')
data['CHECK321'] = data['CHECK321'].apply(pd.to_numeric, errors='coerce')
data['S3AQ3B1'] = data['S3AQ3B1'].apply(pd.to_numeric, errors='coerce')
data['S3AQ3C1'] = data['S3AQ3C1'].apply(pd.to_numeric, errors='coerce')
data['AGE'] = data['AGE'].apply(pd.to_numeric, errors='coerce')

subsetted the target population

# .copy() so that the new columns added below go onto an independent dataframe
sub1 = data[(data['AGE']>=18) & (data['AGE']<=25) & (data['CHECK321']==1)].copy()

recoded cigarettes similar to what was done in class

recode1 = {1: 30, 2: 22, 3: 14, 4: 6, 5: 2.5, 6: 1}
sub1['USFREQMO'] = sub1['S3AQ3B1'].map(recode1)

def USQUAN(row):
    if row['S3AQ3B1'] != 1:
        return 0
    elif row['S3AQ3C1'] <= 5:
        return 3
    elif row['S3AQ3C1'] <= 10:
        return 8
    elif row['S3AQ3C1'] <= 15:
        return 13
    elif row['S3AQ3C1'] <= 20:
        return 18
    elif row['S3AQ3C1'] > 20:
        return 37

sub1['USQUAN'] = sub1.apply(lambda row: USQUAN(row), axis=1)

after dropping na values, the data is crosstabulated using pd.crosstab

Name: S3AQ3C1, dtype: int64

USQUAN    0.0  3.0  8.0  13.0  18.0  37.0
TAB12MDX
0         289  130  210    43   114    20
1          97  119  267    91   254    67

and the column percentages are found

USQUAN         0.0       3.0       8.0      13.0      18.0      37.0
TAB12MDX
0         0.748705  0.522088  0.440252  0.320896  0.309783  0.229885
1         0.251295  0.477912  0.559748  0.679104  0.690217  0.770115

the chi-square value is computed using

cs1 = scipy.stats.chi2_contingency(ct1)

the results are as below

chi-square value, p value, expected counts
(194.52141019317162, 4.218547040348835e-40, 5,
 array([[182.90182246, 117.98589065, 226.02116402, 63.49441505, 174.37272193, 41.22398589],
        [203.09817754, 131.01410935, 250.97883598, 70.50558495, 193.62727807, 45.77601411]]))
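as a sanity check, the expected counts and the chi-square value can be recomputed by hand from the observed crosstab, without scipy. this is only a sketch: the counts are copied from the table printed above, and the variable names (observed, expected, chi2, dof) are my own and not from the assignment code.

```python
# Observed counts copied from the crosstab printed earlier in this post
observed = [
    [289, 130, 210, 43, 114, 20],   # TAB12MDX = 0
    [97, 119, 267, 91, 254, 67],    # TAB12MDX = 1
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 2), dof, round(expected[0][0], 2))
```

the statistic, the degrees of freedom and the first expected count should agree with the scipy output above up to rounding.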

Now the dataset is subsetted using depression

sub3=sub1[(sub1['MAJORDEPLIFE']== 0)] sub4=sub1[(sub1['MAJORDEPLIFE']== 1)]

factorplot in seaborn indicates

cross tabulation is done again for the two subsets using pd.crosstab and chisquared is done using

ct2=pd.crosstab(sub3['TAB12MDX'], sub3['USQUAN'])

cs2= scipy.stats.chi2_contingency(ct2)

ct3=pd.crosstab(sub4['TAB12MDX'], sub4['USQUAN'])

cs3= scipy.stats.chi2_contingency(ct3)

the outputs of the print statements are as under

USQUAN    0.0  3.0  8.0  13.0  18.0  37.0
TAB12MDX
0         231  110  183    41    98    20
1          64   75  171    60   164    40

USQUAN         0.0       3.0       8.0      13.0      18.0      37.0
TAB12MDX
0         0.748705  0.522088  0.440252  0.320896  0.309783  0.229885
1         0.251295  0.477912  0.559748  0.679104  0.690217  0.770115

chi-square value, p value, expected counts
(119.8838461347068, 3.321507405356043e-24, 5,
 array([[160.29037391, 100.52108194, 192.34844869, 54.87907717, 142.35958632, 32.60143198],
        [134.70962609, 84.47891806, 161.65155131, 46.12092283, 119.64041368, 27.39856802]]))

association between smoking quantity and nicotine dependence for those WITH depression

USQUAN    0.0  3.0  8.0  13.0  18.0  37.0
TAB12MDX
0          58   20   27     2    16     0
1          33   44   96    31    90    27

USQUAN         0.0       3.0       8.0      13.0      18.0      37.0
TAB12MDX
0         0.748705  0.522088  0.440252  0.320896  0.309783  0.229885
1         0.251295  0.477912  0.559748  0.679104  0.690217  0.770115

chi-square value, p value, expected counts
(87.90481311162473, 1.8504851262047968e-17, 5,
 array([[25.20945946, 17.72972973, 34.07432432, 9.14189189, 29.36486486, 7.47972973],
        [65.79054054, 46.27027027, 88.92567568, 23.85810811, 76.63513514, 19.52027027]]))

the same has also been plotted using seaborn.factorplot

it can be seen that the moderating variable depression does not have a significant effect on the relationship between cigarettes smoked and nicotine dependence
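to compare the two groups side by side, the share with nicotine dependence at each smoking quantity can be recomputed per subgroup from the observed counts. this is a rough sketch: it assumes the first stratified crosstab above belongs to those without depression and the second to those with depression, and the names strata and dependence_rate are my own.

```python
# Counts copied from the two stratified crosstabs above (rows: TAB12MDX = 0, then 1)
strata = {
    "without depression": [[231, 110, 183, 41, 98, 20], [64, 75, 171, 60, 164, 40]],
    "with depression": [[58, 20, 27, 2, 16, 0], [33, 44, 96, 31, 90, 27]],
}
quantities = [0, 3, 8, 13, 18, 37]   # USQUAN levels, in column order

# Share of each quantity group with nicotine dependence (TAB12MDX = 1)
dependence_rate = {}
for name, (not_dependent, dependent) in strata.items():
    dependence_rate[name] = [
        d / (n + d) for n, d in zip(not_dependent, dependent)
    ]

for name, rates in dependence_rate.items():
    print(name, dict(zip(quantities, [round(r, 3) for r in rates])))
```

if the dependence rate climbs with quantity in both subgroups, the association has the same direction with and without depression, which is what the two significant chi-square results also suggest.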

0 notes

acedemicsblog · 4 years ago

Text

Criminal homework help

Quantitative Analysis Report: Crosstabulation & Correlation Assignment Instructions Overview You will take part in several data analysis assignments in which you will develop a report using tables and figures from the IBM SPSS® output file of your results. Using the resources and readings provided, you will interpret these results and test the hypotheses and writeup these…

View On WordPress

0 notes

lnct-mca · 5 years ago

Text

Chi-Square Test of Independence

The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related). It is a nonparametric test.

This test is also known as:

Chi-Square Test of Association.

This test utilizes a contingency table to analyze the data. A contingency table (also known as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in columns. Each variable must have two or more categories. Each cell reflects the total count of cases for a specific pair of categories.
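As an illustration of the definition above, a contingency table can be built from raw case-level data by counting each pair of categories. The snippet is a minimal sketch in plain Python; the data and the category names are hypothetical and made up for the example.

```python
from collections import Counter

# Hypothetical raw data: one (row category, column category) pair per case
cases = [
    ("smoker", "yes"), ("smoker", "no"), ("smoker", "yes"),
    ("non-smoker", "no"), ("non-smoker", "no"), ("non-smoker", "yes"),
    ("non-smoker", "no"),
]

counts = Counter(cases)
row_levels = sorted({r for r, _ in cases})
col_levels = sorted({c for _, c in cases})

# table[i][j] = number of cases in row category i and column category j
table = [[counts[(r, c)] for c in col_levels] for r in row_levels]

print(col_levels)
for level, row in zip(row_levels, table):
    print(level, row)
```

Each cell holds the count of cases for one specific pair of categories, and the cell counts add up to the total number of cases.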

There are several tests that go by the name "chi-square test" in addition to the Chi-Square Test of Independence. Look for context clues in the data and research question to make sure what form of the chi-square test is being used.

Common Uses

The Chi-Square Test of Independence is commonly used to test the following:

Statistical independence or association between two or more categorical variables.

The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons between continuous variables or between categorical and continuous variables. Additionally, the Chi-Square Test of Independence only assesses associations between categorical variables, and can not provide any inferences about causation.

If your categorical variables represent "pre-test" and "post-test" observations, then the chi-square test of independence is not appropriate. This is because the assumption of the independence of observations is violated. In this situation, McNemar's Test is appropriate.

Data Requirements

Your data must meet the following requirements:

Two categorical variables.

Two or more categories (groups) for each variable.

Independence of observations: there is no relationship between the subjects in each group, and the categorical variables are not "paired" in any way (e.g. pre-test/post-test observations).

Relatively large sample size: expected frequencies for each cell are at least 1, and expected frequencies are at least 5 for the majority (80%) of the cells.
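The two expected-frequency rules can be checked by hand, but a small Python sketch may make them concrete. It uses the expected-count formula from the Test Statistic section (row total times column total, divided by the grand total). The helper names and the `sparse_table` counts are hypothetical; `class_rank_table` borrows the observed counts from the 2x2 example later in this post.

```python
def expected_counts(table):
    """Expected cell counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_expected_count_rules(table):
    """True if every expected count is >= 1 and at least 80% are >= 5."""
    cells = [e for row in expected_counts(table) for e in row]
    all_at_least_1 = min(cells) >= 1
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return all_at_least_1 and share_at_least_5 >= 0.8


class_rank_table = [[79, 148], [152, 9]]  # counts from the 2x2 example
sparse_table = [[1, 2], [3, 40]]          # hypothetical counts, too sparse

print(meets_expected_count_rules(class_rank_table))
print(meets_expected_count_rules(sparse_table))
```

In the sparse table the first cell has an expected count of 3 × 4 / 46, which is below 1, so the check fails even though the total sample is not tiny.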

Hypotheses

The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square Test of Independence can be expressed in two different but equivalent ways:

H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"

H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]"

Test Statistic

The test statistic for the Chi-Square Test of Independence is denoted $\chi^2$, and is computed as:

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where

$o_{ij}$ is the observed cell count in the $i$th row and $j$th column of the table

$e_{ij}$ is the expected cell count in the $i$th row and $j$th column of the table, computed as

$$e_{ij} = \frac{\text{row } i \text{ total} \times \text{col } j \text{ total}}{\text{grand total}}$$

The quantity $(o_{ij} - e_{ij})$ is sometimes referred to as the residual of cell $(i, j)$, denoted $r_{ij}$.

The calculated $\chi^2$ value is then compared to the critical value from the $\chi^2$ distribution table with degrees of freedom $df = (R - 1)(C - 1)$ and the chosen confidence level. If the calculated $\chi^2$ value is greater than the critical $\chi^2$ value, then we reject the null hypothesis.
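To see the formula and the decision rule in action, here is a minimal Python sketch. The 2x2 counts are hypothetical, chosen so the arithmetic is easy to follow, and the small critical-value lookup holds the standard table values for α = 0.05 only. The function name is an assumption for illustration.

```python
def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Critical values of the chi-square distribution at alpha = 0.05.
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815}

observed = [[10, 20], [20, 10]]  # hypothetical counts
statistic, df = chi_square_statistic(observed)
reject_null = statistic > CRITICAL_VALUES_05[df]

print(round(statistic, 3), df, reject_null)
```

Every expected count here is 15 and every residual is ±5, so the statistic is 4 × 25 / 15, which exceeds the critical value for one degree of freedom.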

Data Set-Up

There are two different ways in which your data may be set up initially. The format of the data will determine how to proceed with running the Chi-Square Test of Independence. At minimum, your data should include two categorical variables (represented in columns) that will be used in the analysis. The categorical variables must include at least two groups. Your data may be formatted in either of the following ways:

IF YOU HAVE THE RAW DATA (EACH ROW IS A SUBJECT):

Cases represent subjects, and each subject appears once in the dataset. That is, each row represents an observation from a unique subject.

The dataset contains at least two nominal categorical variables (string or numeric). The categorical variables used in the test must have two or more categories.

IF YOU HAVE FREQUENCIES (EACH ROW IS A COMBINATION OF FACTORS):

Cases represent the combinations of categories for the variables.

Each row in the dataset represents a distinct combination of the categories.

The value in the "frequency" column for a given row is the number of unique subjects with that combination of categories.

You should have three variables: one representing each category, and a third representing the number of occurrences of that particular combination of factors.

Before running the test, you must activate Weight Cases, and set the frequency variable as the weight.

An example of using the chi-square test for this type of data can be found in the Weighting Cases tutorial.
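Outside SPSS, the frequency layout amounts to summing a count column into the cells of a table, which is roughly what weighting by the frequency variable achieves. The Python sketch below illustrates the idea. The `table_from_frequencies` helper is hypothetical, and the rows reuse the counts from the 2x2 class rank example later in this post.

```python
# One row per combination of categories, plus a frequency column.
frequency_rows = [
    ("Underclassman", "Off-Campus", 79),
    ("Underclassman", "On-Campus", 148),
    ("Upperclassman", "Off-Campus", 152),
    ("Upperclassman", "On-Campus", 9),
]


def table_from_frequencies(rows):
    """Build a contingency table, weighting each row by its frequency."""
    row_labels, col_labels, counts = [], [], {}
    for row_cat, col_cat, freq in rows:
        if row_cat not in row_labels:
            row_labels.append(row_cat)
        if col_cat not in col_labels:
            col_labels.append(col_cat)
        # Add rather than overwrite, in case a combination is listed twice.
        counts[(row_cat, col_cat)] = counts.get((row_cat, col_cat), 0) + freq
    table = [[counts.get((r, c), 0) for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


row_labels, col_labels, table = table_from_frequencies(frequency_rows)
print(table)
print(sum(sum(row) for row in table))  # total number of subjects
```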

Run a Chi-Square Test of Independence

In SPSS, the Chi-Square Test of Independence is an option within the Crosstabs procedure. Recall that the Crosstabs procedure creates a contingency table or two-way table, which summarizes the distribution of two categorical variables.

To create a crosstab and perform a chi-square test of independence, click Analyze > Descriptive Statistics > Crosstabs.

A Row(s): One or more variables to use in the rows of the crosstab(s). You must enter at least one Row variable.

B Column(s): One or more variables to use in the columns of the crosstab(s). You must enter at least one Column variable.

Also note that if you specify one row variable and two or more column variables, SPSS will print crosstabs for each pairing of the row variable with the column variables. The same is true if you have one column variable and two or more row variables, or if you have multiple row and column variables. A chi-square test will be produced for each table. Additionally, if you include a layer variable, chi-square tests will be run for each pair of row and column variables within each level of the layer variable.

C Layer: An optional "stratification" variable. If you have turned on the chi-square test results and have specified a layer variable, SPSS will subset the data with respect to the categories of the layer variable, then run chi-square tests between the row and column variables. (This is not equivalent to testing for a three-way association, or testing for an association between the row and column variable after controlling for the layer variable.)

D Statistics: Opens the Crosstabs: Statistics window, which contains fifteen different inferential statistics for comparing categorical variables. To run the Chi-Square Test of Independence, make sure that the Chi-square box is checked.

E Cells: Opens the Crosstabs: Cell Display window, which controls which output is displayed in each cell of the crosstab. (Note: in a crosstab, the cells are the inner sections of the table. They show the number of observations for a given combination of the row and column categories.) There are three options in this window that are useful (but optional) when performing a Chi-Square Test of Independence:

1. Observed: The actual number of observations for a given cell. This option is enabled by default.

2. Expected: The expected number of observations for that cell (see the test statistic formula).

3. Unstandardized Residuals: The "residual" value, computed as observed minus expected.

F Format: Opens the Crosstabs: Table Format window, which specifies how the rows of the table are sorted.

Example: Chi-square Test for 3x2 Table

PROBLEM STATEMENT

In the sample dataset, respondents were asked their gender and whether or not they were a cigarette smoker. There were three answer choices: Nonsmoker, Past smoker, and Current smoker. Suppose we want to test for an association between smoking behavior (nonsmoker, current smoker, or past smoker) and gender (male or female) using a Chi-Square Test of Independence (we'll use α = 0.05).

BEFORE THE TEST

Before we test for "association", it is helpful to understand what an "association" and a "lack of association" between two categorical variables looks like. One way to visualize this is using clustered bar charts. Let's look at the clustered bar chart produced by the Crosstabs procedure.

This is the chart that is produced if you use Smoking as the row variable and Gender as the column variable (running the syntax later in this example):

The "clusters" in a clustered bar chart are determined by the row variable (in this case, the smoking categories). The color of the bars is determined by the column variable (in this case, gender). The height of each bar represents the total number of observations in that particular combination of categories.

This type of chart emphasizes the differences within the categories of the row variable. Notice how within each smoking category, the heights of the bars (i.e., the number of males and females) are very similar. That is, there are an approximately equal number of male and female nonsmokers; approximately equal number of male and female past smokers; approximately equal number of male and female current smokers. If there were an association between gender and smoking, we would expect these counts to differ between groups in some way.

RUNNING THE TEST

Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).

Select Smoking as the row variable, and Gender as the column variable.

Click Statistics. Check Chi-square, then click Continue.

(Optional) Check the box for Display clustered bar charts.

Click OK.

SYNTAX

CROSSTABS
  /TABLES=Smoking BY Gender
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT
  /COUNT ROUND CELL
  /BARCHART.

OUTPUT

TABLES

The first table is the Case Processing summary, which tells us the number of valid cases used for analysis. Only cases with nonmissing values for both smoking behavior and gender can be used in the test.

The next tables are the crosstabulation and chi-square test results.

The key result in the Chi-Square Tests table is the Pearson Chi-Square.

The value of the test statistic is 3.171.

The footnote for this statistic pertains to the expected cell count assumption (i.e., expected cell counts are all greater than 5): no cells had an expected count less than 5, so this assumption was met.

Because the test statistic is based on a 3x2 crosstabulation table, the degrees of freedom (df) for the test statistic is $df = (R - 1)(C - 1) = (3 - 1)(2 - 1) = 2 \times 1 = 2$.

The corresponding p-value of the test statistic is p = 0.205.

DECISION AND CONCLUSIONS

Since the p-value is greater than our chosen significance level (α = 0.05), we do not reject the null hypothesis. Rather, we conclude that there is not enough evidence to suggest an association between gender and smoking.

Based on the results, we can state the following:

No association was found between gender and smoking behavior ($\chi^2(2) = 3.171$, p = 0.205).
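The reported p-value can be checked by hand. With exactly 2 degrees of freedom, the upper-tail probability of the chi-square distribution has the closed form exp(−x/2), so no statistics library is needed. The Python sketch below relies on that special case only and would not apply to other degrees of freedom. The function name is an assumption for illustration.

```python
import math


def chi2_upper_tail_df2(statistic):
    """Upper-tail probability of a chi-square variable with 2 df."""
    return math.exp(-statistic / 2)


statistic = 3.171  # Pearson Chi-Square reported in the output
alpha = 0.05

p_value = chi2_upper_tail_df2(statistic)
# Inverting the same closed form gives the critical value for 2 df.
critical_value = -2 * math.log(alpha)

print(round(p_value, 3))         # 0.205
print(round(critical_value, 3))  # 5.991
print(p_value > alpha)           # True, so H0 is not rejected
```

Both routes agree: the statistic (3.171) is below the critical value, and the p-value is above α.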

Example: Chi-square Test for 2x2 Table

PROBLEM STATEMENT

Let's continue the row and column percentage example from the Crosstabs tutorial, which described the relationship between the variables RankUpperUnder (upperclassman/underclassman) and LiveOnCampus (lives on campus/lives off-campus). Recall that the column percentages of the crosstab appeared to indicate that upperclassmen were less likely than underclassmen to live on campus:

The proportion of underclassmen who live off campus is 34.8%, or 79/227.

The proportion of underclassmen who live on campus is 65.2%, or 148/227.

The proportion of upperclassmen who live off campus is 94.4%, or 152/161.

The proportion of upperclassmen who live on campus is 5.6%, or 9/161.
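These four percentages are each cell count divided by the total for its class rank group. A minimal Python sketch of that arithmetic follows. The dictionary layout and the helper name are assumptions for illustration; the counts are the ones quoted above.

```python
counts = {
    "Underclassman": {"Off-Campus": 79, "On-Campus": 148},
    "Upperclassman": {"Off-Campus": 152, "On-Campus": 9},
}


def within_group_percentages(counts):
    """Percentage of each group's total that falls in each category."""
    result = {}
    for group, cells in counts.items():
        total = sum(cells.values())
        result[group] = {
            category: round(100 * n / total, 1) for category, n in cells.items()
        }
    return result


percentages = within_group_percentages(counts)
for group, cells in percentages.items():
    print(group, cells)
```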

Suppose that we want to test the association between class rank and living on campus using a Chi-Square Test of Independence (using α = 0.05).

BEFORE THE TEST

The clustered bar chart from the Crosstabs procedure can act as a complement to the column percentages above. Let's look at the chart produced by the Crosstabs procedure for this example:

The height of each bar represents the total number of observations in that particular combination of categories. The "clusters" are formed by the row variable (in this case, class rank). This type of chart emphasizes the differences within the underclassmen and upperclassmen groups. Here, the difference between the number of students living on campus and the number living off-campus is much starker within each class rank group.

RUNNING THE TEST

Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).

Select RankUpperUnder as the row variable, and LiveOnCampus as the column variable.

Click Statistics. Check Chi-square, then click Continue.

(Optional) Click Cells. Under Counts, check the boxes for Observed and Expected, and under Residuals, click Unstandardized. Then click Continue.

(Optional) Check the box for Display clustered bar charts.

Click OK.

SYNTAX

CROSSTABS
  /TABLES=RankUpperUnder BY LiveOnCampus
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT EXPECTED RESID
  /COUNT ROUND CELL
  /BARCHART.

OUTPUT

TABLES

The first table is the Case Processing summary, which tells us the number of valid cases used for analysis. Only cases with nonmissing values for both class rank and living on campus can be used in the test.

The next table is the crosstabulation. If you elected to check off the boxes for Observed Count, Expected Count, and Unstandardized Residuals, you should see the following table:

With the Expected Count values shown, we can confirm that all cells have an expected value greater than 5.

Computation of the expected cell counts and residuals (observed minus expected) for the crosstabulation of class rank by living on campus:

| | Off-Campus | On-Campus | Total |
|---|---|---|---|
| Underclassman | $o_{11} = 79$; $e_{11} = \frac{227 \times 231}{388} = 135.147$; $r_{11} = 79 - 135.147 = -56.147$ | $o_{12} = 148$; $e_{12} = \frac{227 \times 157}{388} = 91.853$; $r_{12} = 148 - 91.853 = 56.147$ | row 1 total = 227 |
| Upperclassman | $o_{21} = 152$; $e_{21} = \frac{161 \times 231}{388} = 95.853$; $r_{21} = 152 - 95.853 = 56.147$ | $o_{22} = 9$; $e_{22} = \frac{161 \times 157}{388} = 65.147$; $r_{22} = 9 - 65.147 = -56.147$ | row 2 total = 161 |
| Total | col 1 total = 231 | col 2 total = 157 | grand total = 388 |

These numbers can be plugged into the chi-square test statistic formula:

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(-56.147)^2}{135.147} + \frac{(56.147)^2}{91.853} + \frac{(56.147)^2}{95.853} + \frac{(-56.147)^2}{65.147} = 138.926$$

We can confirm this computation with the results in the Chi-Square Tests table:

The row of interest here is Pearson Chi-Square and its footnote.

The value of the test statistic is 138.926.

Because the crosstabulation is a 2x2 table, the degrees of freedom (df) for the test statistic is $df = (R - 1)(C - 1) = (2 - 1)(2 - 1) = 1$.

The corresponding p-value of the test statistic is so small that it is cut off from display. Rather than writing "p = 0.000", we write the mathematically correct statement p < 0.001.

DECISION AND CONCLUSIONS

Since the p-value is less than our chosen significance level α = 0.05, we can reject the null hypothesis, and conclude that there is an association between class rank and whether or not students live on-campus.

Based on the results, we can state the following:

There was a significant association between class rank and living on campus ($\chi^2(1) = 138.9$, p < .001).
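For readers who want to reproduce the whole 2x2 example outside SPSS, here is a minimal Python sketch that recomputes the expected counts, residuals, test statistic, and p-value from the four observed counts. It is an independent check, not the procedure SPSS runs internally. The p-value uses the closed form erfc(√(x/2)), which holds only for 1 degree of freedom.

```python
import math

# Rows: Underclassman, Upperclassman. Columns: Off-Campus, On-Campus.
observed = [[79, 148], [152, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
residuals = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

statistic = sum(
    res ** 2 / e
    for res_row, exp_row in zip(residuals, expected)
    for res, e in zip(res_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form upper-tail probability, valid for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(statistic / 2))

print([[round(e, 3) for e in row] for row in expected])
print(round(statistic, 3), df)
print(p_value < 0.001)
```

The printed expected counts and statistic should match the crosstabulation and the Pearson Chi-Square row described above, up to rounding.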
