
Lecture 4. Criteria of Measurement Quality

Rymarchyk G.K. Validity. URL: <http://trochim.human.cornell.edu/tutorial/rymarchk/rymar2.htm>

Social science research differs from research in fields such as physics and chemistry for many reasons. One reason is that the things social science research is trying to measure are intangible, such as attitudes, behaviors, emotions, and personalities. Whereas in physics you can use a ruler to measure distance, and in chemistry you can use a graduated cylinder to measure volume, in social science research you cannot pour emotions into a graduated cylinder or use a ruler to measure how big someone's attitude is (no pun intended).

As a result, social scientists have developed their own means of measuring such concepts as attitudes, behaviors, emotions, and personalities. Some of these techniques include surveys, interviews, assessments, ink blots, drawings, dream interpretations, and many more. A difficulty in using any method to measure a phenomenon of social science is that you never know for certain whether you are measuring what you want to measure.

Validity is an element of social science research that addresses the issue of whether the researcher is actually measuring what s/he says s/he is. As an example, let us pretend we want to measure attitude. A psychologist by the name of Kurt Goldstein developed a way to measure "abstract attitude" by assessing several different abilities in brain injury patients, such as the ability to separate their internal experience from the external world, the ability to shift from one task to another, and the ability to recognize an organized whole, break it into component parts, and then reorganize it as before. Carl Jung defined attitude as introversion and extraversion. Raymond Cattell defined attitude in three components: intensity of interest, interest in an action, and interest in action toward an object (Hall & Lindzey, 1978).

Are any of these things what you think of when someone mentions the word "attitude?" Do any of these definitions of attitude seem like they are defining the same thing? Do they seem valid to you?

A definition of attitude that would seem to possess more validity to you might be the definition provided in the American Heritage Dictionary: "A state of mind or feeling with regard to some matter; disposition" (1987, p. 140). This definition of attitude may appear to you to be the most valid.

Validity in social science research has several different components - some people feel there are only three components of validity, and others feel there are four. All four will be addressed on this page. Ideally, you would want to establish all of these facets of validity for your research measure prior to administering it for your actual research project.

Face validity requires that your measure appears relevant to your construct to an innocent bystander, or more specifically, to those you wish to measure. Face validity can be established by your Mom - just ask her if she thinks your survey could adequately and completely assess someone's attitude. If Mom says yes, then you have face validity. However, you may want to take this one step further and ask individuals similar to those you wish to study if they feel the same way your Mom does about your survey. The reason for asking these people is that people can sometimes become resentful and uncooperative if they think they are being misrepresented to others, or worse, if they think you are misrepresenting yourself to them. For instance, if you tell people you are measuring their attitudes, but your survey asks them how much money they spend on alcohol, they may think you have lied to them about your study. Or if your survey only asks how they feel about negative things (e.g., whether their car was stolen or whether they were beaten up), they may think that you are going to find that these people all have negative attitudes, when that may not be true. So, it is important to establish face validity with your population of interest.

In order to have a valid measure of a social construct, one should never stop at achieving only face validity, as this is not sufficient. However, one should never skip establishing face validity, because without it you cannot achieve the other components of validity.

Content validity is very similar to face validity, except that instead of asking your Mom or members of your population of interest, you must ask experts in the field (unless your Mom is an expert on attitude). The theory behind content validity, as opposed to face validity, is that experts are aware of nuances in the construct - rare or elusive aspects of which the layperson may not be aware. For example, if you submitted your attitude survey to Kurt Goldstein for a content validity check, he might say you need something to assess whether your respondents can break something down into component parts and then resynthesize it, as this is an important aspect of attitude, and otherwise you have no content validity. For an example of a study where a content validity check was used for an attitude assessment, click here (http://www.coe.uh.edu/innnsitee/elec_pub/html1995/097.htm). Another example measures influences (http://www.joe.uwex.edu/test/joe/1990fall/rb3.html), and another measures impacts (http://www.joe.uwex.edu/test/joe/1992summer/rb2.html).

Many studies proceed once content validity has been achieved; however, this does not necessarily mean the measures used are entirely valid. Criterion validity is a more rigorous test than face or content validity. Criterion validity means your attitude assessment can predict or agree with constructs external to attitude. Two types of criterion validity exist:

Predictive validity - Can your attitude survey predict? For example, if someone scores high, indicating that they have a positive attitude, can high attitude scores also be predictive of job promotion? If you administer your attitude survey to someone and s/he rates high, indicating a positive attitude, then later that week s/he is fired from his/her job and his/her spouse divorces him/her, you may not have predictive validity.

Concurrent validity - Does your attitude survey give scores that agree with other things that go along with attitude? For example, if someone scores low, indicating that they have a negative attitude, are low attitude scores concurrent with (i.e., do they happen at the same time as) negative remarks from that person? High blood pressure? If you administer your attitude survey to someone who is cheerful and smiling a lot, but they rate low, indicating a negative attitude, your survey may not have concurrent validity. For an extremely thorough example of research on the use of solution-focused group therapy with school children, which includes a concurrent validity check, click here (http://www.ezonline.com/grafton/solution.html).

Finally, the most rigorous validity test you can put your attitude survey through is the construct validity check. Do the scores your survey produces correlate with other related constructs in the anticipated manner? For example, if your attitude survey has construct validity, attitude scores should correlate positively with life satisfaction survey scores (negative attitudes going with low satisfaction) and negatively with life stress scores. These other constructs do not necessarily have to be predictive or concurrent, though oftentimes they are. For an in-depth discussion of construct validity, click here (http://trochim.human.cornell.edu/kb/constval.htm). To see what some of the threats are to construct validity, click here (http://trochim.human.cornell.edu/tutorial/driebe/tweb1.htm).

Pelstring L. The NEP and Measurement Validity. 1997. URL: <http://trochim.human.cornell.edu/tutorial/pelstrng/validity.htm>

An Introduction to Measurement

In general, scales are meant to "weigh" an object. In social science, scales are used to "weigh" or gauge a behavior or a personality quality like self-esteem, for example. In the late 1970s, many researchers began to examine environmental attitude and potential ways to "gauge" this concept. Think about it for a moment: if you wanted to measure environmental attitude...

  • How would you go about this?

  • How would you even define environmental attitude?

  • What would be some of the questions you might ask an individual to measure this concept?

  • How do you know those questions are the "right" ones to ask to get at this concept of environmental attitude?

As you can see, it is very tricky to measure--let alone define--something like environmental attitude. Getting the "right" answers to the questions above means operationalizing the construct environmental attitude accurately--defining exactly what you mean by environmental attitude and developing a scale that captures this concept. In other words, developing a scale that is "valid" and accurately able to measure the concept of interest--in this case environmental attitude.

When we think about measurement validity we are essentially talking about construct validity--"the approximate truth of the conclusion that your operationalization accurately reflects its construct" (Trochim web site). Clearly, if we want to measure environmental attitude, we first need to operationalize it or define exactly what we think an individual's environmental attitude might be. Luckily, after a literature search on environmental attitude, we have found one study that has operationalized environmental attitude and developed a scale to measure it.

The New Environmental Paradigm

In the 1960s and 1970s, social scientists' interest in the concept of environmental attitude increased. There was a great deal of concern relating to the environment during this period: the Ohio Cuyahoga River caught fire in 1969, capturing national attention; the first Earth Day was held in 1970; the National Environmental Policy Act was signed that same year; and energy conservation became a primary goal in the mid and late 1970s as oil embargoes severely impacted the nation. As a result of these and many other incidents, funding for research directed at the environment and human interaction with the environment became more of a priority.

In 1978, social scientists Dunlap and Van Liere published an article in The Journal of Environmental Education that summarized their efforts to measure a fairly new environmental mind-set they and other researchers believed was becoming a predominant influence. At the time, many social scientists believed that a "paradigmatic" shift--a change in many people's way of thinking--was occurring. People were becoming disenchanted with the so-called "Dominant Social Paradigm," which emphasized human ability to control and manage the environment, limitless natural resources, private property rights, and unlimited industrial growth.

The New Environmental Paradigm, on the other hand, emphasized environmental protection, limited industrial growth, and population control, among other issues. The two social scientists developed the New Environmental Paradigm scale to measure this mind-set. Since its development, the scale has been used in many other studies--both replicating as well as modifying the scale. Many of the studies conducted since then have questioned whether in fact a paradigmatic shift is occurring or has occurred. But most researchers agree that the scale developed by Dunlap and Van Liere is one valid measure of environmental attitude; it comprises the 12 items listed below. Agreement and disagreement with these statements constitute acceptance or rejection of the NEP.

The New Environmental Paradigm Scale

  • We are approaching the limit of the number of people the earth can support.

  • The balance of nature is very delicate and easily upset.

  • Humans have the right to modify the natural environment.

  • Humankind was created to rule over the rest of nature.

  • When humans interfere with nature it often produces disastrous consequences.

  • Plants and animals exist primarily to be used by humans.

  • To maintain a healthy economy we will have to develop a "steady state" economy where industrial growth is controlled.

  • Humans must live in harmony with nature in order to survive.

  • The earth is like a spaceship with only limited room and resources.

  • Humans need not adapt to the natural environment because they can remake it to suit their needs.

  • There are limits to growth beyond which our industrialized society cannot expand.

  • Mankind is severely abusing the environment.
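The text notes that agreement or disagreement with these statements constitutes acceptance or rejection of the NEP, but it does not give a scoring key. The sketch below is a hypothetical one: it assumes a 4-point agree/disagree response format, and it assumes the four anthropocentric items (items 3, 4, 6, and 10 in the list above) are reverse-keyed so that a higher total always means stronger endorsement of the NEP. Both assumptions are mine, not Dunlap and Van Liere's published key.

```python
# Hypothetical scoring sketch for the 12-item NEP scale.
# Assumed (not stated in the text): responses coded
# 1 = strongly disagree ... 4 = strongly agree, with the four
# anthropocentric items (3, 4, 6, 10) reverse-keyed.

REVERSED = {3, 4, 6, 10}   # 1-based positions of reverse-keyed items
SCALE_MAX = 4              # assumed 4-point agree/disagree format

def nep_score(responses):
    """Total NEP score for a list of 12 item responses (each 1..4)."""
    if len(responses) != 12:
        raise ValueError("expected 12 item responses")
    total = 0
    for i, r in enumerate(responses, start=1):
        if not 1 <= r <= SCALE_MAX:
            raise ValueError(f"item {i}: response {r} out of range")
        # Flip reverse-keyed items so high always means pro-NEP.
        total += (SCALE_MAX + 1 - r) if i in REVERSED else r
    return total

# A respondent who strongly agrees with every pro-NEP item and
# strongly disagrees with every reverse-keyed item scores the maximum:
pro_nep = [1 if i in REVERSED else 4 for i in range(1, 13)]
print(nep_score(pro_nep))  # 48 (12 items x 4 points)
```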

So What is a Valid Scale?

Validity is "a set of standards by which research can be judged" or "the best available approximation to the truth or falsity of a given inference, proposition, or conclusion" (Trochim web site). Validity can be divided into the following areas: Conclusion Validity, Internal Validity, Construct Validity, and External Validity.

I will not attempt to define all of the above kinds of validity. For a detailed explanation of validity types, please see the Knowledge Base constructed by Professor William Trochim. The kind of validity this web site is concerned with is Measurement Validity--and falls mostly under the domain of Construct Validity. Dunlap and Van Liere attempted to prove that their NEP scale was valid by addressing three elements of measurement validity: construct, predictive, and face validity.

Construct Validity

Construct validity is often considered the most difficult kind of validity to achieve--it essentially comprises both predictive and face validity. I will do my best to differentiate the types of validity, but be aware that many kinds of validity overlap and are sometimes difficult to distinguish. By construct validity we mean assessing how well an idea or concept is translated from the "land of theory" in your head into an actual measure or scale in the land of reality. In terms of the NEP, achieving construct validity (and thus achieving measurement validity) meant that Dunlap and Van Liere had to spell out exactly what they meant by the new environmental paradigm, as well as develop an actual scale that could accurately measure whether or not this paradigm was in fact part of an individual's attitude make-up.

There are essentially three conditions that must be met to ensure construct validity:

  1. The concept requires a specific theoretical framework--in other words, you must explicitly state what you mean by the NEP.

  2. You must be able to show that your operationalization acts the way it theoretically should.

  3. The data you gather must support these theoretical views.

How did Dunlap and Van Liere achieve construct validity for their scale? Well, in order to meet the first condition, they reviewed the literature to find out how others had defined the concept. In addition, they consulted scientists and ecologists to determine whether their definition and development of the NEP and scale items met with agreement among experts. [By the way, this discussion on establishing construct validity also overlaps with Content Validity--see Trochim's web site for more information on this kind of validity.] What I have just outlined falls under Face Validity below, so skip to that section for more detailed information. They met the second condition by predicting results they might achieve with their scale [skip to Predictive Validity for more information]. By achieving predictive validity, they also essentially achieved the third condition--their data from two samples supported their theoretical views.

Predictive Validity (and Concurrent Validity)

What is predictive validity? Recall from above that validity means "the approximate truth of the conclusion that your operationalization accurately reflects its construct." The definition of predictive validity can be found in the name--PREDICTIVE. How well does the NEP Scale predict what it theoretically should predict? Well, Dunlap and Van Liere were able to test their scale on two samples--a sample of the general public as well as a sample of environmental group members. They theorized that the environmental group members would score BETTER than the general public on the NEP scale. And they were right--the mean total scale score for environmental group members was 43.8. This compares with a mean scale score of 36.3 for the general public.

What exactly did the two researchers accomplish with two distinct samples and an explicit statement that the environmental group members would score better than the general public? First, they were able to demonstrate meeting the second condition for establishing construct validity. They offered a theory about the results they would receive from these two samples--one sample would score better than another sample based on environmental attitudes. The two researchers were correct in their prediction--the environmental groups did score higher than the general public.

They not only established predictive validity--meeting the second condition for construct validity--they were also able to demonstrate concurrent validity. Concurrent Validity is established when a scale is able to "distinguish between two groups that it theoretically should be able to distinguish between" (Trochim web site). We would expect, as did Dunlap and Van Liere, that environmental group members would care MORE about the environment--and thus score higher on the NEP scale--than the members of the general public.
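The known-groups logic above can be sketched numerically. The raw scores below are invented for illustration (the article reports only the two group means, 43.8 and 36.3), but the procedure is the same: compute each group's mean NEP score and check that the groups differ in the predicted direction.

```python
# Known-groups sketch of predictive/concurrent validity:
# environmental group members should score higher on the NEP
# than the general public. Raw scores are invented; only the
# environmental-group mean is calibrated to the reported 43.8.
from statistics import mean

env_group = [46, 44, 41, 45, 43]   # hypothetical NEP totals
public    = [35, 38, 34, 37, 36]   # hypothetical NEP totals

print(f"environmental group mean: {mean(env_group):.1f}")
print(f"general public mean:      {mean(public):.1f}")
print(f"difference:               {mean(env_group) - mean(public):.1f}")
```

In practice one would also test whether the difference between group means is statistically significant, not just in the predicted direction.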

Additional Tests of Predictive Validity

Dunlap and Van Liere also tested the NEP scale against other measures of environmental attitude and behavior. Their additional scales contained a list of environmental activities and lists of state and federal environmental programs. Respondents were asked to report how often they performed the behaviors and how much they supported the various state and federal programs. The data from these scales was compared with the NEP scale data to see if respondents who performed environmental behaviors and supported government environmental programs also scored well on the NEP scale. Again, Dunlap and Van Liere found that their NEP scale did in fact correlate well with their other measures of environmental attitude and behavior.

Face Validity

Now let's look at face validity--the most subjective and weakest method of establishing measurement validity. Face Validity essentially looks at whether the scale appears to be a good measure of the construct "on its face." As mentioned earlier, Dunlap and Van Liere established face validity by conducting a literature review of what they considered to be crucial aspects of this new environmental paradigm and developing a list of scale items that constitute the paradigm. In addition to the literature review, environmental scientists and ecologists also aided in developing and writing scale questions. By submitting the NEP scale for review by experts in environmental issues, the two researchers were able to bolster face validity.

Summary

Let's take a moment to recap exactly what Dunlap and Van Liere did to ensure measurement validity of their New Environmental Paradigm Scale. First, the researchers operationalized--or explicitly defined--what they meant by the NEP and how they were going to measure the NEP. Second, they worked with a panel of experts who approved of the content of their scale. Third, they used two separate population samples--an environmental group sample and a general public sample. Fourth, the researchers also used several scales to measure environmental attitude. Fifth, they theorized how their scale would work in relation to the other scales and with the different population samples--in other words, what kind of data they would expect to get. And last but not least, their data actually supported their theory and predictions.

Remember what the three conditions are to ensure construct validity?

  1. The concept requires a specific theoretical framework--in other words, you must explicitly state what you mean by the NEP.

  2. You must be able to show that your operationalization acts the way it theoretically should.

  3. The data you gather must support these theoretical views.

What do you think? Did Dunlap and Van Liere meet these conditions? I would say yes. They provided a theoretical framework for their concept, demonstrated that their operationalization acted the way they predicted it would, and produced data to support their theoretical views.

How Stable and Consistent Is Your Instrument? A Brief Look at Reliability. URL: <http://trochim.human.cornell.edu/tutorial/johnson/melody.htm>

This web page was designed to provide you with basic information on an important characteristic of a good measurement instrument: reliability. Prior to starting any research project, it is important to determine how you are going to measure a particular phenomenon. This process of measurement is important because it allows you to know whether you are on the right track and whether you are measuring what you intend to measure. Both reliability and validity are essential for good measurement, because they are your first line of defense against forming inaccurate conclusions (i.e., incorrectly accepting or rejecting your research hypotheses). Although this tutorial will only address general issues of reliability, you can access more detailed information by clicking on the words or titles that are highlighted.

What is Reliability?

I am sure you are familiar with terms such as consistency, predictability, dependability, stability, and repeatability. Well, these are the terms that come to mind when we talk about reliability. Broadly defined, reliability of a measurement refers to the consistency or repeatability of the measurement of some phenomenon. If a measurement instrument is reliable, the instrument can measure the same thing more than once, or using more than one method, and yield the same result. When we speak of reliability, we are not speaking of individuals; we are actually talking about scores.

The observed score is one of the major components of reliability. The observed score is just that - the score you would observe in a research setting. The observed score is composed of a true score and an error score. The true score is a theoretical concept. Why is it theoretical? Because there is no way to really know what the true score is (unless you're God). The true score reflects the true value of a variable. The error score is the reason why the observed score differs from the true score. The error score is further broken down into method (or systematic) error and trait (or random) error. Method error refers to anything that causes a difference between the observed score and the true score due to the testing situation. For example, any type of disruption (loud music, talking, traffic) that occurs while students are taking a test may distract them and affect their scores on the test. On the other hand, trait error is caused by factors related to the characteristics of the person taking the test that may randomly affect measurement. An example of trait error at work is when individuals are tired, hungry, or unmotivated. These characteristics can affect their performance on a test, making the scores seem worse than they would be if the individuals were alert, well-fed, or motivated.

Reliability can be viewed as the ratio of the true score to the true score plus the error score, or:

reliability = true score / (true score + error score)

Okay, now that you know what reliability is and what its components are, you're probably wondering how to achieve reliability. Simply put, the degree of reliability can be increased by decreasing the error score. So, if you want a reliable instrument, you must decrease the error.

As previously stated, you can never know the actual true score of a measurement. Therefore, it is important to note that reliability cannot be calculated; it can only be estimated. The best way to estimate reliability is to measure the degree of correlation between the different forms of a measurement. The higher the correlation, the higher the reliability.

3 Aspects of Reliability

Before going on to the types of reliability, I must briefly review 3 major aspects of reliability: equivalence, stability, and homogeneity. Equivalence refers to the degree of agreement between 2 or more measures administered nearly at the same time. In order for stability to occur, a distinction must be made between the repeatability of the measurement and that of the phenomena being measured. This is achieved by employing two raters. Lastly, homogeneity deals with assessing how well the different items in a measure seem to reflect the attribute one is trying to measure. The emphasis here is on internal relationships, or internal consistency.

Types of Reliability

Now back to the different types of reliability. The first type of reliability is parallel forms reliability. This is a measure of equivalence, and it involves administering two different forms to the same group of people and obtaining a correlation between the two forms. The higher the correlation between the two forms, the more equivalent the forms.

The second type of reliability, test-retest reliability, is a measure of stability which examines reliability over time. The easiest way to measure stability is to administer the same test at two different points in time (to the same group of people, of course) and obtain a correlation between the two tests. The problem with test-retest reliability is the amount of time you wait between testings. The longer you wait, the lower your estimation of reliability.

Finally, the third type of reliability is inter-rater reliability, a measure of homogeneity. With inter-rater reliability, two people rate a behavior, object, or phenomenon and determine the amount of agreement between them. To determine inter-rater reliability, you take the number of agreements and divide them by the number of total observations.
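The agreement calculation just described is simple enough to sketch directly. The rating categories below are invented for illustration:

```python
# Inter-rater reliability as simple percent agreement:
# the number of agreements divided by the total number of
# observations, as described above. Ratings are invented.

def interrater_agreement(rater1, rater2):
    """Proportion of observations on which two raters agree."""
    if len(rater1) != len(rater2):
        raise ValueError("raters must score the same observations")
    agreements = sum(a == b for a, b in zip(rater1, rater2))
    return agreements / len(rater1)

r1 = ["on-task", "off-task", "on-task", "on-task", "off-task"]
r2 = ["on-task", "off-task", "on-task", "off-task", "off-task"]
print(interrater_agreement(r1, r2))  # 0.8 (4 agreements out of 5)
```

Note that raw percent agreement does not correct for agreements that would occur by chance; chance-corrected statistics exist for that purpose, but they are beyond the scope of this tutorial.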

The Relationship Between Reliability and Validity

The relationship between reliability and validity is a simple one to understand: a measurement can be reliable, but not valid. However, a measurement must first be reliable before it can be valid. Thus reliability is a necessary, but not sufficient, condition of validity. In other words, a measurement may consistently assess a phenomenon (or outcome), but unless that measurement tests what you want it to, it is not valid.

Remember: When designing a research project, it is important that your measurements are both reliable and valid. If they aren't, then your instruments are basically useless and you decrease your chances of accurately measuring what you intended to measure.
