Recent Forum Posts

I'm looking for a partner. I have class at the following times:
M 10:30-11:30
T, Th 10:30-1:30
W, F 2:30-5:30


I get out of class at 1:30, so I have lots of free time.

e-mail: gdemart@cs

OK, the assignment description asks us to "implement tf-idf as described in the class slides", so I've been wondering:
1. Are we really supposed to return the vector-space similarity between the query (treated as a document) and the current document as the "relevance score"?

2. Since the normalizer is defined as sqrt(sum of the squared tf-idf weights), shouldn't each individual
term have its own normalizer, instead of the whole document sharing one normalizer?

3. If the query is treated as a document, then the tf would be 1 for each
term, and since idf/normalizer is a constant, wouldn't the whole "vector-space similarity"
business just amount to changing the weight formula to weight = tf * (idf/normalizer)^2?

Sorry to have ranted on like this. A lot of my questions probably sound silly, but I am just really confused …
Any explanation is appreciated.
cheers, jiatao

Hi everyone,

Someone asked a couple of questions that I'll quote and answer here.

Question 1: Treating the query as a document

According to slide 44 of the IR lecture we covered in class yesterday, you're supposed to treat your
query as a document and calculate its weight W for each term. Do the same for the
document you are trying to rank. Then take the dot product of the two weight
vectors (from the current document and the query) to measure how close
together they are.
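The dot-product step described above can be sketched as follows. This is a Python sketch, not the assignment's actual (presumably Java) code; `query_weights` and `doc_weights` are hypothetical term-to-weight maps, and the length normalization shown here corresponds to the normalizer discussed later in the thread.

```python
import math

def cosine_similarity(query_weights, doc_weights):
    """Cosine similarity between two term->tf-idf-weight dicts.

    Hypothetical names: nothing here comes from the assignment's API.
    """
    # Dot product over the terms the query actually contains.
    dot = sum(w * doc_weights.get(term, 0.0)
              for term, w in query_weights.items())
    # Euclidean length of each weight vector (the "normalizer").
    q_len = math.sqrt(sum(w * w for w in query_weights.values()))
    d_len = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_len == 0.0 or d_len == 0.0:
        return 0.0
    return dot / (q_len * d_len)
```

Identical vectors score 1.0 and vectors with no shared terms score 0.0, which is the "how close together they are" number used as the relevance score.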

It is unclear to me how to calculate the (normalized) W for the query.
The formula for the unnormalized W is tf * log(N/nk), where nk is the number of times the
term appears in all documents. Does nk include the query as well, since
you are counting the query as a document?

One generally does not include the query document itself in this
count. The reason is that people like to precompute the idf term,
and obviously we can't do that if the query contents can change it.
In practice, the number of documents is so large that this isn't a
real concern regarding the idf value; the difference between including
the query or not will usually be vanishingly small.

What about when we are calculating W for the document we are trying to rank?
Do we need to count the words inside the query toward
log(N/nk)? (Thus N would increase by 1, and nk would increase by the number of
times the term appears in the query?)

One does not consider within-document frequency to compute this. nk is
just the number of documents in which the term's count is >= 1.
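The two answers above pin down idf: nk counts documents containing the term at least once (not total occurrences), and the query is not included in the corpus. A minimal sketch under those assumptions, with `corpus` as a hypothetical list of documents, each a list of terms:

```python
import math

def idf(term, corpus):
    """idf(term) = log(N / nk).

    nk = number of *documents* containing the term at least once,
    not the total number of occurrences, and the query is not
    counted as a document. `corpus` is a hypothetical structure;
    in practice idf would be precomputed over the fixed corpus.
    """
    n_k = sum(1 for doc in corpus if term in doc)
    if n_k == 0:
        return 0.0  # term appears nowhere; contributes no weight
    return math.log(len(corpus) / n_k)
```

Because the corpus is fixed, these values can be computed once up front, which is exactly why the query is left out of the count.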

Question 2: getNormalizer()?

After talking to the professor, he also wants this normalized. I tried writing
it out, and it is very long: two loops to calculate the denominator for each of the
current document and the query, and another two loops to calculate the unnormalized W
for the document and the query. You also need to go through the query k^2 times to find out
how many times each term occurs in it, since you cannot convert the
query array to a Document. It's long enough that I think I'm doing something wrong.
So I was wondering: what does the getNormalizer() function in the Document class do?

I mentioned in a previous email what the getNormalizer() result
actually contains. While you don't have to incorporate the query,
you're right that it takes a loop through the whole corpus to compute.
That's a little burdensome for >1M docs, so we've precomputed it
for you.


[CSE454] some more questions by gdemartgdemart, 21 Oct 2006 21:10

Hi everyone,

A few people have asked for a more precise description of the
result from the getNormalizer() call in Document.

It's the per-document length normalizer. It's not computed
per-term-weight. There's just one unique value per document.
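In other words, the precomputed value is the Euclidean length of the document's unnormalized tf-idf weight vector. A hedged sketch of what such a per-document value would look like, using a hypothetical term-to-weight dict (the real `getNormalizer()` is precomputed by the course staff, not computed like this at query time):

```python
import math

def get_normalizer(weights):
    """One value per document: sqrt of the sum of the squared
    unnormalized tf-idf weights. `weights` is a hypothetical
    term->weight map, not the assignment's actual representation.
    """
    return math.sqrt(sum(w * w for w in weights.values()))
```

Dividing each document's dot-product score by this single value is what makes the similarity a cosine, so no per-term normalizer is needed.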


[CSE454] getNormalizer() by gdemartgdemart, 21 Oct 2006 21:09

Hi everyone,

There's an inconsistency in the assignment about whether your
DocumentRanker code should inherit from "IRanker" or "IExtendedRanker".

Sorry about that; I made a versioning error in the assignment text.
You should inherit from "IExtendedRanker". Its methods are exactly the
same as the "IRanker" interface.

—Mike by jarjar, 09 Oct 2006 22:09
guanyuguanyu 05 Oct 2006 15:28
in discussion Forum Discussion / Cool URLs »

Healia is the premier consumer health search engine for finding high quality and personalized health information on the Web. It serves as an independent, unbiased gateway to the highest quality health information resources. We engineered Healia primarily with consumers and patients in mind, but health professionals and researchers will also find it useful. by guanyuguanyu, 05 Oct 2006 15:28

The Internet creates new opportunities for content authors to provide rich interactive experiences for their readers. Standard in most web pages today are links to related content including audio and video, the ability to purchase products and services as well as the ability to interact with the author or publisher and, at times, other readers.

One could argue that a downside of paper is that it does not offer the same rich interactivity that is available with a digital document. But what if, while reading your local newspaper, you could find other articles written by the author of an op-ed piece, buy a book, get a review of a new consumer product, and find other readers to chat with about advice offered by the gardening columnist? by mattgarmattgar, 05 Oct 2006 15:05
Pluggd is a vertical search engine for searching the audio of podcasts. They say they’re in beta, but audio search features don’t appear to be working yet. They do, however, have a neat working demo based on a single ESPN podcast which shows what they are working towards. by (account deleted), 05 Oct 2006 13:36
A website with popular links from other, more popular websites like flickr and youtube. Aggregated content so you don't have to look for it. by gdemartgdemart, 05 Oct 2006 06:40
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.