Re: Using google to weed out too-obscure answers (Final thought)

castrioti <*no_reply_at_yahoogroups.com*> · Fri, 03 May 2002 16:43:47 -0000

Now that I'm finished with finals and papers, I can offer some final 
thoughts on this matter:

> I think a good metric here is required, and with some work (like 
> controlling for the size of the google database), is possible.
> I don't know the size of google's database, but the word "the" 
> shows up "2,560,000,000" times, so that is a lower bound.

Using "the" as the baseline has been working quite well. I've yet to 
see a quizbowl question asking for a definite article.

> If we assign "B" to be the baseline measure, and then look at 
> ratios, we could develop something a little bit more stable
> e.g.
> 
> LOG (GoogleCount(B)/GoogleCount(X))
> 
> where X is the word in question
> and B is the baseline word or wordphrase (and preferably B >> 
> any X we are likely to test).

I thought I'd test this and compare the two scales (all logs are 
base 10). A few examples:

For a question about The Tales of Hoffmann whose giveaway is Jacques 
Offenbach: 
Levinson-Castrioti difficulty (10 - log 2,020) = 6.69
Proposed scale (Levinson Method?) log(2,560,000,000/2,020) = 6.10

For a question about Thomas Hardy's Return of the Native whose 
giveaway is Clym Yeobright: 
Levinson-Castrioti (or -Mathews) difficulty (10 - log 570) = 7.24
Proposed scale log(2,560,000,000/570) = 6.65

For a question about Eleanor of Aquitaine whose giveaway is "wife of 
Henry II of England":
Levinson-Castrioti difficulty (10 - log 9,800) = 6.01
Proposed scale log(2,560,000,000/9,800) = 5.42

For a question about the Raman Effect whose giveaway is "named for an 
Indian physicist":
Levinson-Castrioti difficulty (10 - log 1,030) = 6.99
Proposed scale log(2,560,000,000/1,030) = 6.40

For a question about Mjollnir whose giveaway is "hammer of Thor":
Levinson-Castrioti difficulty (10 - log ~1,290) = 6.89
Proposed scale log(2,560,000,000/~1,290) = 6.30

Levinson-Castrioti difficulty for "the": 0.59
Proposed scale difficulty for "the": 0.00
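
For anyone who wants to rerun these numbers, here is a minimal Python 
sketch of the two formulas, assuming base-10 logs and the hit counts 
quoted above. The function and variable names are my own labels for 
illustration, not anything official. It also shows why the two columns 
always differ by the same amount: the gap is just 10 - log(count for 
"the"), about 0.59.

```python
import math

# Google hit count for "the" quoted earlier in this thread (the baseline B)
BASELINE_THE = 2_560_000_000


def lc_difficulty(hits):
    """Levinson-Castrioti scale: 10 - log10(hits)."""
    return 10 - math.log10(hits)


def proposed_difficulty(hits, baseline=BASELINE_THE):
    """Proposed scale: log10(baseline / hits)."""
    return math.log10(baseline / hits)


# Hit counts for three of the giveaways listed above
examples = {
    "Jacques Offenbach": 2020,
    "Clym Yeobright": 570,
    "named for an Indian physicist": 1030,
}

for phrase, hits in examples.items():
    old = lc_difficulty(hits)
    new = proposed_difficulty(hits)
    print(f"{phrase}: old scale {old:.2f}, proposed scale {new:.2f}")

# The two scales differ by a constant that depends only on the baseline
offset = 10 - math.log10(BASELINE_THE)
print(f"constant gap between the scales: {offset:.2f}")
```

Swapping in a different baseline word only shifts every proposed-scale 
value by the same constant, so the ordering of answers by difficulty 
is the same on both scales.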

I'll have to repeat this in a year or so to see how much "difficulty 
drift" the first scale shows, but I agree that the proposed scale 
will eliminate negative ranks, should they occur in the future, and 
will likely be much more stable as Google's database expands. I'll 
use Levinson's proposed scale for now.
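
To make the drift argument concrete, here is a small Python sketch 
under one loud assumption: that every hit count, including the one for 
the baseline word, grows by the same factor as the index expands. Real 
growth will not be that uniform, and the growth factor and function 
names below are hypothetical. Under that assumption the first scale 
drops by log10 of the growth factor while the ratio-based scale does 
not move at all.

```python
import math


def lc_difficulty(hits):
    """First scale: 10 - log10(hits)."""
    return 10 - math.log10(hits)


def proposed_difficulty(hits, baseline):
    """Ratio-based scale: log10(baseline / hits)."""
    return math.log10(baseline / hits)


hits = 2020                 # count quoted above for Jacques Offenbach
baseline = 2_560_000_000    # count quoted above for "the"
growth = 2.0                # hypothetical: every count doubles

# Change in each scale after the index grows
lc_drift = lc_difficulty(hits * growth) - lc_difficulty(hits)
proposed_drift = (
    proposed_difficulty(hits * growth, baseline * growth)
    - proposed_difficulty(hits, baseline)
)

print(f"first scale drift:    {lc_drift:+.3f}")
print(f"proposed scale drift: {proposed_drift:+.3f}")

# The first scale goes negative once a term passes 10**10 hits;
# the ratio-based scale stays >= 0 as long as baseline >= hits.
print(lc_difficulty(2 * 10**10) < 0)
```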

--Wesley