Let’s take that thought to its conclusion. If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this? And if Microsoft is unwilling to stop incorporating Google’s clicks in Bing’s rankings, doesn’t that argue that Google’s clicks account for much more than 1/1000th of Bing’s rankings?
My take on this is quite the opposite. If MS thinks they're doing nothing wrong, they shouldn't stop. They should do what is best for their customers.
Reversing what they do now because Google "caught" them would, IMO, imply they were doing something wrong and got caught.
I personally think what Bing is doing is great. I regularly use Bing and Google. I wish there were a streamlined way for me to broadcast to both of them... "for search term X this is the relevant link!"
MS has found a way to do this, and as long as it's opt-in, I like it. I wish Google had the same (although I do realize that since Google owns so much of the traffic it is less of an issue for them -- it's a net loss in terms of the flow of information).
But maybe a question for Matt... Can you and MS work together for a toolbar that does just this for both engines?
As a user this isn't a matter of Bing copying Google or not. It's about the fact that search relevance still kind of sucks. I feel like I can make the search experience better, especially on those occasions when I go to your competitor's site.
I wonder if you would be ok if Bing did the same thing to amazon. That is, imagine they used toolbar/IE logs to infer that people went to amazon, searched for LCD TVs and then purchased model X. Then they could boost pages about X in search, or implement a "bestselling" feature. After all, they are "just" using clickstream data here. Similarly, they could track Netflix, and so on.
IMHO, there's a loophole in Bing's argument that the user click data is free for them to use. What about the site - do they have no ability to consent/dissent to being tracked?
Maybe robots.txt needs to be updated for the toolbar.
Absolutely on all of those sites. And Wikipedia. Why would I not want them to? The only reason I could think of that I would not want them to is if I thought it would create worse search links as a result.
Although being able to personalize the search queries would be incredible. So when I search for "Movie XYZ" -- it can also look at my clickstream and see that I spend a lot of time in Netflix and Netflix has that movie, and they can point me to that movie directly in my search results ("You can stream Movie XYZ now!") -- kind of like a built-in Clicker -- I'd love it.
But the problem is search relevance sucks. I'm in my freakin web browser all day, yet it contributes nothing to my search relevance. WTF!?
But again, key to me is opt-in. If anything, the one takeaway I would take if I were MS is to make it more apparent. But besides that. Everything you said, and more, please. And from Google too -- Bing shouldn't be the only one to benefit.
I'm not talking about using just visits to improve search though, I'm talking about using the pattern of interaction on a site to basically replicate a site's data. That is, imagine netflix recommends movie X to you, now Bing infers that Netflix has done so using custom code to parse their IE logs, and then recommend movie X to you.
"Stefan Weitz, director of the Bing search engine at Microsoft, said in an interview the company studies how certain users interact with Google in order to improve Bing. ..."
I'm not sure I get your example (not clear when Bing would recommend a movie to me). So let me give one that I do understand.
I go to Netflix and search for the movie "Network". I end up clicking on "The Social Network". Later on Bing, if I search for "movie Network" -- I'd hope that one of the movies that comes back is "The Social Network", based on that clickthrough data from Netflix.
In my mind the only thing that is borderline unethical is that we've artificially limited this clickstream data to one company. I'd like to give it out more broadly. It's my data right?
I was considering the case when you go to Netflix, and they recommend the movie "Social Network" to you, and you click on it. Surely the URL will have sufficient info to claim that it's a recommendation.
Now Bing implements a movie recommendation service, which also magically recommends "Social Network" for you, based on click tracking of Netflix.
IMHO, what Bing is doing with Google is exactly the same scenario as above (extract click data from a competitor's service, and surface the exact same data in their own product).
Why wouldn't I want Bing to recommend the movie? The first thing Bing should do is use Netflix's APIs to suck down all of my ratings of movies. Then it should use my clickstream data for Netflix, IMDB, Hulu, Amazon, Facebook etc... And if I'm using Media Center or SageTV, I want those things to also get consolidated together.
Are you saying you want your Netflix data in one silo, your Hulu data elsewhere, your TV data elsewhere? Don't get me wrong, I want this to all be opt-in, but when I opt-in on MY behavior, I want them to use it.
I would argue that users own their clickstream data and are free to give it away. (The question then becomes whether Microsoft makes sufficiently clear exactly which data users are giving to them.) Clickstream data is basically a list of URLs; I don’t think (parts of) such lists should be controlled by website owners.
I’m not sure, though, whether I have overlooked something that would make it useful or fair to give website owners control over clickstream data. I’m honestly open to suggestions.
But that’s only part of the problem. Even if everyone were to agree that collecting clickstream data from users with their consent is ok, we still don’t know how those who collect the data are allowed to use it. That’s certainly not an easy question.
You're right. There seems to be a gray area around who owns the "click data":
- "Are users entitled to give away their navigation history for free?"
- "To what extent are search engines allowed to implement specific features based on that data?"
What is being data mined is a bit more than a user broadcasting their own preference on the correct result. The user is broadcasting a URL which is selected based on two factors:
- the user's preference
- Google's ranking algorithm.
Had Google not ranked that URL, the user wouldn't be broadcasting it. If there was some way to extract the factor of the user's preference of URLs as a signal without the factor of Google's ranking algorithm, you would be more spot on. Unfortunately this isn't possible. Also, in most cases the ranking algorithm component of this broadcasted data is the more useful factor of the two.
Yup this is the issue Google wants to raise. Expect Google lobbying soon for some law along the lines of "clicks on somesite.com belong to somesite.com and can only be shared with another party with somesite.com's permission.". Then in some Google ToS the "user" will be granted access to the clicks that the user makes on Google's search results page... but third party apps like Bing toolbar won't have access.
Perhaps :-) Google is a very smart, calculating company. They wouldn't dedicate so many resources to tricking the Bing toolbar if they weren't interested in getting something in return. Perhaps what Google wants in return is avoiding the following scenario: MS bundles the Bing toolbar with IE and data sharing is default opt-in. MS gets a ton of clickstream data from people using Google. Bing takes some of Google's marketshare.
Is Google's ranking algorithm somehow more sacred than my HTML? Google mines that to create their page rankings. Sure there's robots.txt but likewise Google could opt out of serving information to Internet Explorer since that's the only browser which has Bing toolbar. The entire search engine industry is based on using whatever someone has failed to opt out of.
I agree with you. Particularly, why not just stop using those clicks and reduce the negative coverage and perception of this seems like a very biased suggestion.
All of the negative perception seems to come from Google deliberately inserting outlier data. For MS to back down on that basis immediately makes them look like they were doing much more than what is proven.
Best for their customers in what time frame? They might get better results in the short term, but it's much less clear what the long term effects are.
What if it slows down innovation in search quality and moves effort into marketing and shiny baubles? After all, why bother spending resources on researching, implementing and evaluating the algorithms when your competitors will, within a couple of months, gain most of the improvements automatically, with no effort?
You complain about search relevance sucking. Why are you so eager to support a practice that has a high chance of causing it to stagnate?
> My take on this is quite the opposite. If MS thinks they're doing nothing wrong, they shouldn't stop. They should do what is best for their customers.
Does this mean anyone should feel free to ignore robots.txt if they think it will be better for their customers?
"Clickstream owned by user. Opt out by turning off toolbar."
It is not clear to me that the user completely owns the clickstream. If Google's terms of use say you cannot share their search results with anyone else, are you legally allowed to send the clickstream to Microsoft?
I guess if it really comes down to it Google can encrypt the query in the URL. If Microsoft circumvents this (like by intercepting the JavaScript that encrypts it, or pulling the query out of the HTML input element), then it will demonstrate that they're really intentionally copying Google's results and this "clickstream" business was just a convenient cover.
IF Google say you cannot share their data with anyone else, then you are not legally allowed to share the answer with your Dad.
IF Google say you must send a 'thank you' email to them after each query, you are legally obliged to do that.
IF Google say you cannot share their data in an automatic, 'clickstream' fashion, then you cannot legally use Microsoft's toolbar. And if you break that agreement, Microsoft cannot legally use the clickstream data you send it.
I think Google would require a clickthrough TOS to enforce this. AFAIK Google has no TOS listed on its page, but they do have a copyright notice. And a link would not be sufficient.
But yes, if they did a clickthrough TOS then they could enforce that. Of course the day Google adds a clickthrough TOS there will be some serious scrutinizing.
Doesn't the browser own the click even before the Web page gets it?
If Google's terms of use want to claim that a browser can't record clicks on their web pages, then perhaps browser terms of use will say they can do whatever they want with any data returned to them by the server.
Agreed. And his example of how his mom could not understand what the IE disclosure really meant in terms of information sharing is silly. Your typical mom also would not understand the true extent of data collection from using Google's own products. Does that make it wrong?
If this joint toolbar comes to fruition, which it never will, the crowd will decide the obvious winner for practically all queries. At that point what will be the difference between Bing and Google?
> At that point what will be the difference between Bing and Google?
Speed of innovation. It's not our job to make it easy for them to have differentiation. I just want great results. The more parity we give them with respect to data, the harder they'll have to work to innovate in order to differentiate. AFAICT, that benefits us the users.
I didn't suggest it should be our job to make it easy for them. I want great results too and I want it without ads and spam getting in the way as well. In fact there should be more tools integrated into browsers to allow the user to own their own click stream and browsing pattern data and "sell" it to the highest bidder. Do you know of such tools? I don't. We are even at the point both in terms of computational power and data storage costs that each user could potentially have their own web crawler/search engine. The bing and google bars basically steal this information from users and in return provide little to no improvement on search results. The point I was making was that there is absolutely no incentive on either side to share knowledge and converge onto a high quality, common query result set because both companies try really really hard to not appear like the other guy and the result set is part of this.
"Copying" was a pretty brutal word to use-- not surprising that it raised MSFT's hackles a bit.
MS clearly uses toolbar users' clickstreams (on and off Google) to improve their own search efforts. Google created an artificial scenario where the ONLY input was Google search behavior and lo, the search results are exactly the same. Whether or not that steps over a line (I don't feel that it does), it's not "copying" in my book.
Another interesting point is that Google has been beating the "open" drum for a long time now. No walled gardens, right? If a Facebook user should have the right to take his data with him wherever he wants to go, shouldn't a Google user be able to fork over their behavior data to Bing?
Matt's point about MS's lack of clarity when getting folks' permission to grab their clickstream was dead-on. THAT is pretty outrageous and MS should be ashamed of that.
Regardless of all that, hats off to Matt for keeping a cool head and stating his position in a respectful way.
I don't understand how this isn't a settled issue. If clickstream data is 1 of 1000 signals, and you create clickstream data for a specialized query that will never trigger off another signal, then your created data will be reflected. That sounds exactly like what happened.
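To make the "1 of 1000 signals" point concrete, here is a toy sketch. Everything in it is an assumption for illustration: the signal names, the weights, and the example hosts are made up, and nobody outside Microsoft knows how Bing actually combines its signals. It only shows why a tiny signal can decide the ranking when it is the only signal that has any data.

```python
def score(signals, weights):
    """Weighted sum over whatever signals exist for a (query, url) pair."""
    return sum(weights[name] * value for name, value in signals.items())

# Hypothetical weights: clickstream is one very small signal among others.
weights = {"links": 0.5, "on_page": 0.3, "freshness": 0.199, "clickstream": 0.001}

def rank(candidates):
    """Order candidate URLs from highest to lowest combined score."""
    return sorted(candidates, key=lambda url: score(candidates[url], weights), reverse=True)

# An ordinary query: every signal has data, so clickstream barely matters.
normal = {
    "a.example": {"links": 0.9, "on_page": 0.8, "freshness": 0.5, "clickstream": 0.0},
    "b.example": {"links": 0.2, "on_page": 0.3, "freshness": 0.5, "clickstream": 1.0},
}

# A made-up query: no page mentions it, so clickstream is the only signal present.
synthetic = {
    "a.example": {"clickstream": 0.0},
    "planted.example": {"clickstream": 1.0},
}

print(rank(normal))
print(rank(synthetic))
```

For the ordinary query the heavily clicked page still loses to the page with better links and on-page signals. For the made-up query the planted page wins, because 0.001 of something beats 1.0 of nothing.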
You'd have to make the argument that using this data is wrong, somehow. But to make that argument, you'd basically have to argue that users shouldn't be able to share their habits with whomever they want to. I doubt that argument can be made in a compelling way.
I'm surprised Google is pushing this further, and a little disappointed.
Except in the "original" [torsoraphy] search. Forget Bing having to compete with Google's "spell correction team;" Bing need only use Google [copy] as a high frequency signal on tailing queries.
That sounds like exactly what happened, and it's wrong.
Google is setting trends on tail terms due to its massive market share. Once the trend is set, Bing captures the trend by monitoring users' behavior and provides that to the other users. By reflecting users' behavior in its own index, Bing seems to replicate Google's index structure which created that behavior. Had there been no Google, there would have been different trends and Bing would capture those instead.
It's not Bing's fault that Google is so influential, and certainly not a reason for them to stop deducing relevance from user behavior. Just because Google creates trends, does not mean they own trends. The users own their own behavior and are free to share it.
This comment really gets to the meat of the issue here. Search is about discovering and then predicting user trends. Google in fact is creating user trends through its position in the market. Should Bing be locked out of the market because Google is currently in the dominant position?
If Bing is allowed to discover user trends then it will necessarily end up being a replica of Google's results in some cases. There's just no way around it.
I think you can make the argument that in order to have a "free market" of ideas (when it comes to building the best search engine) you shouldn't cannibalize the ideas of your competitor. If Bing's signal consisted of some generic method for mining clickstream data that just happened to pick up on clicks originating from Google, then this wouldn't be a problem. The fact that they're deliberately deconstructing Google URLs (per the Microsoft research paper that Matt cites) makes this questionable.
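For anyone wondering what the difference between "generic mining" and "deconstructing Google URLs" looks like in code, here is a minimal sketch. It assumes the search query sits in the `q` URL parameter (which is how Google search URLs look); the function names and the clicked URL are hypothetical, and this is not taken from the Microsoft paper or from the toolbar.

```python
from urllib.parse import urlparse, parse_qs

def generic_event(referrer, clicked):
    """Generic mining: URLs are opaque, only the page-to-page transition is kept."""
    return {"from_host": urlparse(referrer).netloc, "to": clicked}

def targeted_event(referrer, clicked):
    """Targeted mining: relies on knowing that the engine keeps the query in `q`."""
    parsed = urlparse(referrer)
    query = parse_qs(parsed.query).get("q", [None])[0]
    return {"from_host": parsed.netloc, "query": query, "to": clicked}

ref = "https://www.google.com/search?q=torsoraphy"
click = "https://en.example.org/tarsorrhaphy"  # hypothetical result page

print(generic_event(ref, click))
print(targeted_event(ref, click))
```

The first function would treat a click from Google the same as a click from any other page. The second one only works if someone wrote code that knows how a particular engine encodes its queries, which is the part that makes people uneasy.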
You should be able to share your results with whoever you want but that's not what happens. You share your data with just bing and google. You don't really own any of the data you are generating so saying you should be able to share it with whoever you want is saying hop before you've jumped.
Of course not. Try inquiring google or bing about obtaining all the information they have gathered from your toolbar and see what response you get. Actually I'll tell you: "The gathered data is anonymous and we have no way of verifying who sent it to us". So how exactly do you own information that you have absolutely no access to and with no way to transfer to competing search companies?
That’s decidedly not what owning your clickstream data means. The argument is that users should be able to decide to give away their data; whether they can later access that data is irrelevant. Access to the collected clickstream data is simply not part of the agreement, and as long as users are not coerced or tricked into agreeing, that’s certainly unproblematic. (I can similarly agree to give away my photo collection without getting any right to access it ever again. That doesn’t mean that I don’t own my photo collection.)
You just said you can give away your photos by agreeing to not have any access rights to it and still own those same photos. It seems like you are using a different definition of "ownership" or maybe "click stream" than I am so this argument is pointless unless you define exactly what you mean by those words. By "click stream" I mean the aggregate history of clicks along with contextual information related to those clicks as one coherent, single entity and not individual clicks looked at one by one each of which was owned by the person that generated the click and who transferred ownership of that single click to somebody else. If you meant giving away a copy of your collection then that's a little different but that's not what's happening with user generated click streams. The only click stream copy is the one at google, microsoft headquarters and you have no access to it so you don't technically own it.
No. You can give something you own away. You obviously don’t own it anymore after you gave it away. Being able to give what you own away is part of ownership. (You can’t give something away you don’t own.)
Since you can still make a local copy of your clickstream data (every browser has some sort of history function) you don’t even really give it away, you keep it and hand over a copy (maybe not technically but it is still the exact same data) to Bing.
Here is an analogy: Imagine a scale which, after you agreed to it, automatically sends your weight to the manufacturer every time you use it. The manufacturer made it clear when you agreed that you will not be able to access your weight history. While slightly creepy, I don’t see any ethical or legal problems with that. You own the weight displayed on the scale, you can therefore also agree to send this weight to whomever you want, even if they don’t let you access your weight history. You are also free to make your own local copy of your weight history (for example by writing your weight on a piece of paper every time you use the scale).
"Google created an artificial scenario where the ONLY input was Google search behavior and lo, the search results are exactly the same."
Cutts addresses this:
"As we said in our blog post, the whole reason we ran this test was because we thought this practice was happening for lots and lots of different queries, not simply rare queries."
If this isn't enough to convince you, nothing will be. I'm not sure how the evidence could be any more convincing.
Content from Google's result database ends up used to generate Bing's results, full stop. Proved. It may be technical error, it may be only used a little bit, it may be deliberate, it may be accidental and perfectly ethical by some people's standards, there's all kinds of ways to legitimately interpret this fact. But Google has demonstrated beyond a shadow of a doubt that input from Google is used to generate some non-zero bit of Bing's results. There is absolutely, positively no other explanation that accounts for the facts of the case.
I know. My point is that they've established beyond a shadow of a doubt that Google information is ending up in Bing's results. The default presumption is that any data could end up anywhere; the odd argument is that they have specially cased this and they only grab clickstream data when they have no other sources. One would expect that if the data is being collected it's being fed into the search algorithms and always being considered as a weight, not just sometimes.
I hate to be a bit harsh, but that argument sounds more like spin or rationalization than a logical argument. It sounds nice but it doesn't make sense if you try to actually map it back to the real world, where somebody had to type real code to produce the effect you're hypothesizing, and they weren't writing this code with this debate in mind in advance.
Of course it's always being considered, but the question is how much effect it has in practice. I guess this depends on your ontological categories, but personally I distinguish between "Bing is copying (if you want to call it that, and fair enough) certain Google results" and "Bing is a copy of Google" (which I think has been alleged or insinuated, but not shown). The difference is a matter of quantity turning into quality.
I think they probably have statistics that convince them of the latter, and they should show them, so we can see if people outside Google also find them convincing.
The problem is that the results would not be so dramatic. It would require statistical analysis to show that there was a difference. Most people's eyes would glaze over -- even though copying was occurring and genuine damage was being done (to market share, to business model, etc.), similar to a low-grade toxin in the environment.
Imagine I launch a search engine with no data. Then I feed it with urls IE users click on after their google search.
I will eventually end up with an exact copy of google database. So I think that this technique can be called "copying".
Now if Bing uses this technique for 0.1% of their data, then it can be said that 0.1% of their data are copied from Google database.
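The thought experiment above can be sketched in a few lines. This is a toy, assuming a made-up click log and hypothetical hosts; the class name is mine and it is not a claim about how any real engine works. It just shows that an engine with no index of its own, fed only (query, clicked URL) pairs, ends up returning whatever the source engine's users clicked.

```python
from collections import Counter, defaultdict

class ClickOnlyEngine:
    """An 'engine' with no crawler and no index, only observed clicks."""

    def __init__(self):
        self.clicks = defaultdict(Counter)

    def observe(self, query, url):
        # One toolbar-style event: a user searched `query` elsewhere and clicked `url`.
        self.clicks[query][url] += 1

    def search(self, query, k=3):
        # Rank purely by how often each URL was clicked for this query.
        return [url for url, _ in self.clicks[query].most_common(k)]

# Made-up log: users mostly click the source engine's top result.
log = (
    [("lcd tv", "x.example")] * 5
    + [("lcd tv", "y.example")] * 3
    + [("lcd tv", "z.example")]
)

engine = ClickOnlyEngine()
for query, url in log:
    engine.observe(query, url)

print(engine.search("lcd tv"))
print(engine.search("never seen"))
```

Note that the copy is only as good as the clicks: results nobody clicks never show up, and queries nobody typed return nothing. Whether you call the output a "copy" of the source engine or a record of user behavior is exactly the disagreement in this thread.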
That's a straw man argument. That's not what they did. They had a search engine with data (lots of it), using many of the same factors that Google does (page/domain authority, on-page markup, etc). They feed that algorithm a lot of data. One bit of data they feed it is search behavior of toolbar users (presumably ALL search engines).
In your scenario, that'd be a copy. In the reality scenario (described above), I think it's not.
> search behavior of toolbar users (presumably ALL search engines)
The situation is even less compelling than this. There's no evidence to suggest it's not the complete navigation behaviour of toolbar users, not just when they're on search engines.
Ironically, you are refuting as a straw man a proposition I never made. I have never claimed Bing did that.
I just highlight the fact the way they get data from google is technically a copy, since it has the capacity to duplicate the entire google database (which again is not what they are doing).
What do you suppose google does when they are faced with a novel query and their algorithm returns 10 equally good results as matches? At that point you might as well provide a whole bunch of users some permutation of the matches and take into account which links are clicked on the most. This is exactly what bing is doing and if you crawl the web then you will indeed end up with google's database, the only difference will be how you rank results.
That's not what bing is doing in this case. What bing is doing is taking information from Google, as collected by users, and presenting that information to their own users.
Ya, and? They are also taking information from a whole bunch of other sites collected by their users and incorporating that information into ranking search results. I don't really see how this is inherently wrong.
I'm not saying that it's inherently wrong, I was attempting to explain to you how your example (Google using A/B testing to determine which results are the most relevant) had no relation to what bing is doing.
Matt's assertion about "Suggested Sites" sending this data seems to be conjecture. I ran some packet captures and didn't find anything of that kind.
However, if you install the Bing Toolbar then it does send URL clickstream data. It explicitly asks you beforehand if you want to send info about "the searches you do, websites you visit..." though.
Excellent analysis, thank you, especially in the distinction between "suggested sites" and the Bing toolbar's behavior. I can't argue with your methods. However, I do differ with some of your conclusions:
> The behaviour I’ve seen explains Google’s experiments, but does not support the accusation that Bing set out to copy Google.
I don't think it's so much about "set out to copy Google" necessarily, as it is that they are explicitly parsing Google queries and results from the clickthrough data and using it (quite directly) for their own results. What they set out to do is immaterial given what is provably happening.
> Bing Toolbar is tracking user clicks and Bing could use the result to improve search results. I don’t personally see any great distinction between this behaviour and Google’s many tracking, indexing and scraping endeavours which they use to improve their own search results.
The difference is that Google has proven that the results of certain queries are being directly fed as Bing results. If Microsoft does the same with Google rankings, I'd see your point, but right now the evidence only points in one direction.
> While I personally dislike the privacy implications, Bing Toolbar is pretty upfront about it when it gets installed (unlike much web page user tracking.) The fact that the tracking is plain HTTP not HTTPS, with the content in plaintext, would seem to indicate that they weren’t seeking to hide anything.
I'd be interested to see if Google over HTTPS queries are being transmitted by the toolbar over HTTP. That would be a pretty serious privacy violation IMO, especially when you pair that with unencrypted wifi at Starbucks. See the AOL search log fiasco: http://www.somethingawful.com/d/weekend-web/aol-search-log.p...
> The difference is that Google has proven that the results of certain queries are being directly fed as Bing results. If Microsoft does the same with Google rankings, I'd see your point, but right now the evidence only points in one direction.
However, these are also the only experiments that have been run - outlier data, gaming the algorithms. If you only test one possible outlier scenario and don't control against any other, it's fairly ambitious to stand up and say "this is exactly what is happening!"
IMHO it would have been better for Microsoft to respond along these lines, instead of going into counterspin mode, though.
Claiming that Bing is copying outlier results is still a fair claim that Bing is copying results, especially considering that outlier queries are hard in general for search engines to get right: http://portal.acm.org/citation.cfm?id=1277939
What control are you suggesting? I read a suggestion elsewhere of using only Firefox for some queries and seeing if they show up, which is silly, of course.
without a possible mechanism of action there's no point of using something as a control; I might as well wait to see if the files on my disabled usb drive show up on bing. they were "gaming the algorithms" because they suspected that the algorithms existed, and it wouldn't have worked if there was nothing to game.
Set up some other dynamic sites (not indexed or linked anywhere) with random nonsense words in their URL strings, and a single link to some unrelated site. Click the link. Repeat, wait two weeks.
Same as what google did, just not on google.com.
Of course, if Bing has other "relevance" indicators (ala BingPageRank) then this might not work because it will rule the nonsense control site irrelevant linkspam, and not put it in (even Google's experiment only got a 9% success rate.) It would be better if someone like Facebook injected the test nonsense strings in their URLs.
Something of this sort would at least show they tried to differentiate between "scrapes all clickstream data" and "uses Google's search results".
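Setting up the control pages is the easy part. A rough sketch, assuming hypothetical hosts (`control.example`, `unrelated.example`) and a helper I made up for the purpose; the clicking and the two-week wait are still manual:

```python
import secrets
import string

def nonsense_word(length=12):
    """Random lowercase string that should not match any real word or query."""
    return "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))

def control_page(base="https://control.example"):
    """One unlinked, unindexed page whose URL carries the nonsense token."""
    token = nonsense_word()
    return {
        "token": token,                          # later searched for on Bing
        "page_url": f"{base}/{token}",           # page visited with the toolbar on
        "target": "https://unrelated.example/",  # the single link that gets clicked
    }

pages = [control_page() for _ in range(100)]
print(pages[0])
```

If searching Bing for the tokens later brings up the target, the toolbar is feeding in clicks from any site; if only the google.com variant works, that points at something Google-specific. As noted above, a low hit rate would not prove much either way.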
Thank you. And this is precisely the problem with this whole fiasco--it's just a bunch of conjecture and assertions. But because it's from Google it's taken with much more weight than it would be otherwise.
Google has created precisely the conversation they wanted based on their assertions then gets indignant when Microsoft refuses to step into it.
The article in the WSJ insinuates that it can come from either IE or the toolbar:
Stefan Weitz, director of the Bing search engine at Microsoft, said in an interview the company studies how certain users interact with Google in order to improve Bing. It does this by looking at "clickstream data," or information that users of Microsoft's Internet Explorer or the Bing search toolbar voluntarily share with the company
Fascinating. I was planning on doing the same thing and now I don't have to -- thank you for taking the time. As you mention, this stuff could all be happening server-side, of course.
Oh man. That research paper he quoted strongly indicates that MS is specifically targeting Google.
I was totally wrong. Bing are copying Google. I'm sorry (I am the "What on earth are Google thinking" author).
I still think the honeypot experiments didn't support the conclusion. But this paper coupled with Bing's lacklustre pseudo-denial strongly indicates that my view of events was not accurate and that Bing were indeed blatantly piggybacking off Google's hard work.
I'm really disappointed. I gave Bing the benefit of the doubt, saw that there was another conclusion which explained Google's observations and I advocated that. But now it really looks like MS overstepped the mark and were deliberately picking out Google URLs in order to use signals from Google's algorithms to improve Bing.
That really is cheating.
edit: although still it's not conclusive. I don't know what to think now. I think I should stick to writing code :)
In the spirit of completeness, the paper says they use Google, Bing, and Yahoo.
But the paper doesn't say this is what Bing uses. Looks like it, but we should be clear. This is MSR, not Bing.
Good paper though. But it is another example of MS actually not thinking it's bad. They published a paper on this where they spell out how MSR would do this. Did people in the academic community protest that this would be copying if implemented?
Yeah, I mean you could make the argument that it's just one piece of MS research among thousands of others, etc etc, but the fact that they've sunk resource into literally spying on Google (et al) is just too much. It could still turn out that Bing don't use that technique... but it's looking a lot less likely. Especially given this whole thing started based on Google noticing their spelling corrections popping up on Bing, and stealing Google's spelling corrections is exactly what this paper was about.
In one of the earlier threads there was discussion of MSR being in a bubble, and this is a good example. Did the people publishing that paper realize it could look like copying Google's results? Almost certainly not; as a researcher, this would look to me like I was showing that the approach worked across the three biggest search engines. Researchers are supposed to do stuff like that! But in this other non-research context, with billions of dollars at stake and Microsoft's track record of being the evil empire, it can look very different ...
[disclaimer: I was in MSR from 1999-2005 and then did competitive strategy for MS in 2006-2007. so that influences my perspective.]
Those examples to me are like finding a maggot in a piece of meat. Maybe that's the only one, and if you just remove it the rest is ok, but it's going to make you pretty suspicious especially if it already smells.
The reality is that it's almost impossible for google to measure how much their results influence bing. After all, they are both trying to rank the same internet. The best they can do is prove the obvious cases and let bing do the explaining.
Yeah, this was a point I wanted to make in one of my blog posts. If Google wanted to claim some kind of ownership over SERPs, that would be ridiculous because you could easily argue that there is only one correct set of "ten links ordered from most relevant to least" for each query. Eventually all search engines will converge on the same ten results for every query.
I don't think Google ever made such a claim of ownership, but it was taken that way by some.
user24, I didn't make the connection that you did those "What on earth are Google thinking" posts--just wanted to say thanks for your tweet earlier. It made me feel like doing the post wasn't such a bad idea.
I just wish you'd found that research right in the beginning! It's still not 100% conclusive, but it adds a lot more weight to your claims.
But still, I don't understand why Bing felt it necessary to specifically target you! The generic approach of mining all URLs indiscriminately, which I advocated, would have had largely the same impact and would be defensible against claims of cheating. It's crazy. There was no need to copy directly from Google and yet it looks very much like they still did.
I haven't got much to add since I've expressed my views previously, plus this blog post does a good job of calmly pointing out the issues. It's worth emphasising the penultimate paragraph from the blog post since I feel it's key here though:
"Since people at Microsoft might not like this post, I want to reiterate that I know the people (especially the engineers) at Bing work incredibly hard to compete with Google, and I have huge respect for that. It's because of how hard those engineers work that I think Microsoft should stop using clicks on Google in Bing's rankings. If Bing does better on a search query than Google does, that’s fantastic. But an asterisk that says "we don't know how much of this win came from Google" does a disservice to everyone. I think Bing's engineers deserve to know that when they beat Google on a query, it's due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark"
I agree with this entirely. Given Bing's resources, it seems bizarre that they were relying on Google's results even a little bit. And until they sort this issue out, I definitely agree that it's difficult to know when to give credit to Bing.
(Plus it's made all the more difficult considering that Bing keep sort of denying this whole mess, despite the quite conclusive proof. I suspect that if they do stop using Google's results, we won't know about it considering the hole that Bing have dug themselves with their denials. Ah well, que sera sera).
Matt Cutts should know better. Being part of '1000 signals' does not mean all signals are weighted evenly. It does not even mean the signals are weighted the same across all query types. This is machine learning - the actual weighting is learned and dynamic (always shifting) and not controlled. And there is absolutely no reason for Microsoft to take out a particular signal just because Google asked. There needs to be proof of unethical behavior, of which there is none.
The Chrome and Gmail EULAs are as bad as the IE8 Suggested Sites bit mentioned in this article. Gmail basically reads your email to provide contextually relevant ads. Does my mom know that? No.
Not a plug, but I blogged about this @ http://roshank.posterous.com/google-versus-bing-no-one-is-be... . I believe this should be a discussion on ethics - and feel it is ethical for a company to do whatever it wants with data contained in its own software application.
I don't think you read the whole article, because Matt quotes Nate Silver in saying that exact thing
You said: " -Being part of '1000 signals' does not mean all signals are weighted evenly."
Matt quotes Nate: "First, not all of the inputs are necessarily equal. It could be, for instance, that the Google results are weighted so heavily that they are as important as the other 999 inputs combined."
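To put numbers on the point both of you are making, here is a minimal sketch. It assumes the ranking score is a plain weighted sum of 1000 signals, which is a simplification, and the weights are invented for illustration; nobody outside Bing knows the real ones.

```python
# Hypothetical ranking score: a weighted sum of 1000 signals.
# All weights and values here are made up for illustration.

def share_of_score(weights, values, index):
    """Return the fraction of the total score contributed by one signal."""
    contributions = [w * v for w, v in zip(weights, values)]
    return contributions[index] / sum(contributions)

n = 1000
values = [1.0] * n                   # assume every signal fires equally

even = [1.0] * n                     # case 1: all signals weighted evenly
skewed = [999.0] + [1.0] * (n - 1)   # case 2: signal 0 weighs as much as the other 999

print(share_of_score(even, values, 0))    # 0.001
print(share_of_score(skewed, values, 0))  # 0.5
```

Same "one of 1000 signals" description in both cases, but the signal accounts for 0.1% of the score in the first and half of it in the second.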
Google should also quit trying to build the invite-your-facebook-friends feature in order to get virality for their services. Because, you know, it does a big disservice to their engineers and marketing people that they must rely on facebook.
And if they are not relying on facebook, why not just get rid of it all together?
"Because it helps the users".
Oh I see. But so do Bing's actions.
"Because it's individual users giving permission"
Same with Bing.
I think Matt lays out a strong argument for why they have made such a stink. And it sounds like a pretty compelling defense for their actions to go public.
But...
1) I think they will regret it bigtime when all the attention they are causing makes the bored government officials poke their head in and realize that at a macro level, neither Bing nor Google does anything to protect user search privacy.
2) I think Google has more to lose by bringing this to light than they have to win. Despite Matt's defense, it's hard to see it as anything other than being petty and pedantic. But this is their response to anything that threatens their search pageviews, and that's understandable even if erroneous.
They should focus on trying to be innovative again, that was the Google I respected.
> They should focus on trying to be innovative again, that was the Google I respected.
I work in search quality at Google. I'm busting my ass every day working on fundamental reimaginings of how results get ranked. I'm going to keep doing that regardless of what bing does, because it makes the world a better place, and it's fun.
But, suppose the stuff I'm working on works out, and tomorrow Google shows up with a wholly new set of awesome results. This is very possible, there is a ton of headroom left in search quality, I've seen the experiments myself.
Then after a few weeks of sniffing clicks, Bing comes up with the same set of revolutionary results, but they have no idea how they got there, they have no idea what the evidence is to rank them there, all they know is that people like those results on Google. Is this fair? Is this ethical? Is this even legal?
People in this whole debate have the idea that the user is creating this association between the result and the query, like the user searched through the whole web and came back saying "hey microsoft, check out this great result for this query! I found it! Isn't it awesome?" They aren't. Users click on whatever results you put in front of them, generally starting with the top and working their way down.
Ranking results is not a science with some objective optimal conclusion. Ranking results is fundamentally subjective, and while data-driven, is ultimately an opinion. The user does have some discriminating power in this whole feedback loop, but it's minuscule compared to Google figuring out how to show them that result in the first place. Bing is taking the closest proxy that they can practically acquire for Google's opinion, and using it directly in their ranking.
You have computed a relationship between FOO and BAR. For as long as you keep it to yourself, this ephemeral relationship is a secret, yours to keep.
Once you tell the world that FOO and BAR are related, and the whole world looks at it and says "Yay, it IS related!", the relationship stops being ephemeral and becomes actual, reflected in actions of the users. You no longer own that relationship.
You can't claim ownership of facts, even if you discovered them first, or created them into existence.
>You can't claim ownership of facts, even if you discovered them first, or created them into existence.
It's not as straightforward as all that. Suppose you write a book. The existence of that book is a fact. The words that are in that book are now a fact. Now, I create a book that says, "The following book was written by DennisM, this is a list of the words that are in that book, in order" and then proceed to duplicate your book. Those are the words that are in the book, it's a fact, no one can deny that it's a fact. That's not defensible, even though I'm just stating a fact.
We're in a somewhat grey area here, dealing with ethical issues that have never been clearly dealt with before. However, I feel that a general principle of the web applies: If I create some data, and don't at least implicitly give you permission to use it, it's unethical for you to use it.
This is the central idea behind copyright, behind robots.txt, behind plagiarism, and even behind privacy.
In your analogy there was a wholesale copyright infringement, entirely absent from bing sting. Here's a much closer analogy:
I wrote a book, in which I described some new way to dance. Before the book was published, the ideas were mine to keep secret. After the book was published, a bunch of readers picked up on the idea. They all started dancing in a certain way, and someone described their behavior. That description will be a de facto copy of my book, and yet it is entirely legitimate. Because users own their behavior, not the author of the book who inspired them.
Now if someone simply copied pages from my book, that would be a copyright violation. But that's not what happened. As it is, it's an original work of art. Would that suck for me as a dance-inventor? Obviously. Do I want to live in a society where a description of my behavior is owned by the person who inspired or directed it? Absolutely not.
Your idea of data ownership is contrary to tradition and contrary to the law. Facts (such as the relationship between FOO and BAR) cannot be owned in any way, shape or form, except as a trade secret. You cannot copyright a fact, or trademark a fact. In some cases you can patent the application of a fact to a problem, but that's not the case here, as you can't and don't want to patent the relationship between a word and a target page.
Just because you put effort into something, does not mean it's yours. You probably wish it were true, but again, that's not how the law works, and not what the tradition is. You can't own facts.
Actually, it is as straightforward as all that. What's alleged is that someone searches for DenisM's book on Google, and when they click on one of the results, just that piece of data is forwarded to Microsoft. Not a scraping of all of Google's search results and nothing like the contents of the book.
As to "the central idea of copyright", bear in mind that in addition to not being able to copyright ideas or data, there's also the notion of "fair use".
Explain this to me. In terms of raw bytes it's probably more data than every book that has ever been written. It was certainly more expensive to write than any particular book. What is the important difference here?
To make it a little more concrete, suppose someone was doing the same thing with google's streetview data. Now, it's an empirical fact that if you go to such-and-such an address, and look around, it looks like the pictures that google took. Does that mean it's ok to use those pictures, just because their resemblance to reality is factual?
> In terms of raw bytes it's probably more data than every book
You're referring to the aggregate; so far as I know, copyright law works on specific instances. Let's say the Kindle had a button that allowed you to isolate a couple of sentences and broadcast them to your friends on Facebook, or wherever. The small portion of the text being shared would fall under fair use and not be a copyright infringement, even though the aggregate of all Kindle users might be large.
> What is the important difference here?
I think the biggest difference between the two sides debating this issue is that one side thinks the amount of effort Google put into creating search results is a relevant concern, while my side thinks that once those results are shared with users Google's legal/moral claims to any ownership of that data evaporate.
The streetview point highlights the problem. If Google and I both take pictures of the Church of Mxyzptlk at 14th and Main, we both own copyright to our respective pictures. The fact that I created my picture after "piggybacking" off of a Google search for Mxyzptlk is irrelevant to copyright. So far as I know (IANAL) the combination of a link with search terms is a utilitarian set of data rather than a creative expression and hence not copyrightable.
I remember back when people hand-crafted directories for the web. They'd put a lot of work into updating their directory and then in just a couple of days an automated spider would look at their site and a search engine would use their link structure to point everybody directly to the best pages without having to go through the directory. The search engine had no idea of what evidence the directory used. Was it fair? Was it ethical? Was it legal?
robots.txt is a good example of something that was introduced to make the new search engine technology more ethical but I'm not sure what point you are trying to make. If you're saying that we need a new robots.txt option saying "don't use clickstream data to help people find this page", I don't disagree; I'm not sure how many sites would take advantage of it though.
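For anyone who hasn't looked at how the existing opt-out works, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name and URLs are made up. Note that this only covers crawling: robots.txt as it stands today has no directive for clickstream data, which is exactly the gap being discussed.

```python
from urllib.robotparser import RobotFileParser

# A tiny robots.txt, supplied inline so nothing is fetched over the network.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# "ExampleBot" and the example.com URLs are hypothetical.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/page.html")
print(allowed, blocked)  # True False
```

A well-behaved crawler runs a check like this before fetching a page. A toolbar that observes what the user does in the browser never fetches anything from the site on its own, so nothing comparable gets consulted.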
The point is that the creators of the data were given a say in how it was used. Googlers have busted a gut to provide 99% of the value of the clickstream data so they should determine how that data is used.
There's a more general point here: If something is difficult to do but easy to copy, then society should prevent that copying so that the creator can be rewarded for their effort. This maximises creativity and productivity and boosts GDP. Musicians, inventors, artists, authors, drug companies and software developers all rely on this principle. If we didn't have copyright and patents and robots.txt then there would be no incentive to produce half of the things people want, and we would all be much worse off. Google (and any other search engine) should be subject to the same rules.
You're addressing the letter of the "law" not the spirit. Do we really need a new formal standard to indicate that this is unethical? To me it's pretty plain from the standards the web has already agreed upon.
To others it's just as plain that Microsoft's behavior is ethical. I'm still waiting to hear more before making up my mind. I don't blame you for being upset and frustrated, and can see why you think it's unethical. But a lot of people see it differently.
It would be great for Google and Microsoft to both come clean about all the different factors they use in their search but I'm not holding my breath.
Absolutely a new standard is needed. This is the first time any major player has suggested that data from a user's clickstream should be covered by robots.txt or anything like it.
The comparison to a web directory is perfect - in the absence of a robots.txt, Google and other crawlers think the data is absolutely fair game. There isn't yet a corresponding analogy for clickstream data so it's fair game. Are you sure all of Google's tools respect robots.txt for passive analysis of user initiated actions on other websites?
I was using "standard" in two different ways there. Let me fix that:
>You're addressing the letter of the "law" not the spirit. Do we really need a new formal standard to indicate that this is unethical? To me it's pretty plain from the ethical standards the web has already agreed upon.
> Then after a few weeks of sniffing clicks, Bing comes up with the same set of revolutionary results, but they have no idea how they got there, they have no idea what the evidence is to rank them there, all they know is that people like those results on Google.
But this whole "no idea how they got there" is kind of bogus. The whole notion of link analysis is to infer from a link that a page is important. Now they're inferring from a click that a page is important. The relative importance of that page is a function of the search query, like the relative importance of a link is a function of the page from which it originated.
Link analysis and click throughs are both second order effects, right? If CNN.com starts linking to a page, you don't really know why that page is important, except for the fact that CNN.com now points to it.
So let's be clear: MS knows how it got there as much as link analysis does. It's because the second order effect of a click implies relevance, just as a link does.
>Its because the second order effect of a click implies relevance, just as a link does.
There are two pieces of information in a click, one provided by google, and the other provided by the user.
Google says, "'foo.com' is a good result for [foo]."
The user says, "yup."
Which is providing more information here?
Everyone at google is happy to admit that if the internet didn't exist, we would have a very hard time ranking it. Is bing ready to admit that if google didn't exist, they'd have a hard time ranking the internet?
The user provides at least two pieces of information.
1) The search query (which you attempted to hide in Google's information)
2) A URL they retrieve from Google
Google provides one piece:
1) An ordered list
By FAR the most important piece of information is the search query.
In fact, given a choice between having the search query or an arbitrary ordered list, I'm sure Bing would value the search query the most.
Clickstream data from Google is almost certainly but one piece of clickstream data (even if you are special-cased). Just as Microsoft.com is just one page you index. Now you wouldn't say that you'd have a hard time ranking the internet if Microsoft.com didn't exist, would you? Of course not.
And I think this is where Google is going astray. It's fine to call a spade a spade, but you're overreaching. Trying to argue that MS couldn't index w/o Google is simply absurd. And if Matt was trying to be diplomatic, you're certainly not being so by implying that MS would have a hard time indexing w/o Google.
That's not an accurate view of how search works. The user provides the query, but the results that are returned don't need to have anything obvious to do with the query. The terms might not be on the page, and might not be found in any data remotely associated with the page. We don't just intersect posting lists here.
Furthermore, with suggest and instant, often the query itself is provided by google.
I didn't say the ordered list was obvious. But it is an ordered list (at least to the user -- it may not be ordered all the way down or partially so).
Fair enough point about suggest and instant. In those cases Google is adding value to the query, but I think we'd both agree that it is incrementally so. Instant generally just helps you get to the eventual query faster. Spell checking does add value, but it's a +epsilon. In either case, you're right, we should add this to our function.
In any case I think my point still holds. The query is the most important aspect. The browsed page, for indexing purposes is the second most important aspect. The association between the query and the URL is the least important aspect.
IOW, the least value for Bing is the search algorithm. The query and page to add to the index (as a browsed page) are the most important.
>The association between the query and the URL is the least important aspect.
I don't understand why you believe this. This is the entire basis for why search is hard. It is the hard problem that search tries to solve. Many companies have spent in aggregate, billions of dollars in R&D trying to solve this, and all but a few have folded. It's an "AI-complete" problem in that solving it perfectly would be a sufficient demonstration of strong AI. It's the whole reason why Google exists.
The reason is that with the URL and the query, there's a decent chance you can derive the association.
Now the reason I say this is that to my best approximation the key difference that makes Google better than Bing (when it is) is the index size, and not the search algorithm.
So the two key pieces of value are:
A) queries that people don't do on your site, which would have generated poor relevance
B) Pages that are actually browsed that you don't have indexed or up to date.
Now don't get me wrong. There is value in the association, but I think I'd capture 80% of the value with the above. Now as you point out, these lists are often non-obvious, but Bing does a near equal job of creating the lists. And if you give them the search terms where they have gaps, and fill the index, then I think the gap closes most of the way.
And let's be clear, just because they clicked the link doesn't mean it's a good link. See ehow.com or expertsexchange. But it does have some value.
And frankly I think MS would have been willing to let it go if Google had gone straight to MS with it and said, "We have this data. Even though it's not unethical we can spin it to the media to make it look bad. Kill the association." MS probably would have, as the net win isn't that huge. Now the PR loss in pulling it would be worse than the PR loss in keeping it (with no integrity loss, since they [and I] think it is perfectly ethical).
I'm pretty thoroughly sure this is false. I'm sorry I can't show data to back it up, but Google has many many systems that are years ahead of bing's technology. I'm kinda hamstrung here by being unable to reveal anything about them. To a decent approximation, both Google and Bing likely have the whole internet that matters (and that they are allowed to crawl) in their indexes. Bing is just unable to return this data for as many queries as Google is.
>but Bing does a near equal job of creating the lists.
To the extent that this is true, how much of it is due to data harvested from Google search? This is something that google can't really demonstrate with evidence, and what I would have hoped that bing would clarify in a public statement if they had anything defensible to say.
> To a decent approximation, both Google and Bing likely have the whole internet that matters (and that they are allowed to crawl) in their indexes. Bing is just unable to return this data for as many queries as Google is.
Maybe this is true, but it's not apparent. I had commented on this a month or so back that I thought Bing was better at "vague" queries, where I don't know exactly what I'm looking for, but I'll know it when I see it.
Whereas Google is really good at targeted queries. I need info on the HP battery model number 003D434F90. These searches in Bing will often bring back literally nothing, while Google will often have one or two links, but they happen to be the link that I want. The text of the query is almost always found in these pages.
From that I infer it is index size, since the text is in the page.
> To the extent that this is true, how much of it is due to data harvested from Google search?
While I find the quality of searches similar I don't find the results to be similar, if that makes sense. If Bing is harvesting your results, they're still using other very clever methods to surface other equally good, yet different results.
One obvious and very public example of how the algorithm matters more than the index size would be Cuil. It was launched with lots of hype about how it would have an index several times larger than Google's. And of course the results were notoriously bad.
I'm stunned that anyone could think that the core issue in this controversy is indexing or page discovery. It should have been totally obvious from the examples in the initial article and blog post that this was specifically about copying ranking.
> Is this fair? Is this ethical? Is this even legal?
Oh please. At worst, Microsoft's customers are sharing results of their browsing activities, based on information you gave them. How in the world could that be illegal?
Your team is really coming across as whiny on this issue, which seems to me precisely the wrong tactic to take in response to Bing's increasing market share. It makes your team look defensive, as if your advantage is slipping away and this sort of theatrical display is all you've got in response.
Others interpret this situation as Bing being unable to compete effectively except by looking over Google's shoulder. And then denying it when called out with clear-cut evidence.
Keep doing what you are doing and don't take what I wrote too personally. It's obvious that there are a lot of amazingly talented individuals working hard at Google.
IMHO - One of your colleagues should have stopped that blog post from being put up.
You have a valid point that the click-through measures more of what Google puts out rather than what users find useful. But Bing has access not just to the click-through rate but also the rest of the clickstream. IF (and that's an all-caps if) the value of the click is weighted by engagement metrics on the resulting page, the association is more closely related to the value the user finds in the page rather than the results that Google returned. Would you have a problem with Bing doing that? As a user, and not a developer, I rather like it because it's a much more direct metric of whether a link is useful or not.
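To make that IF concrete, here is a rough sketch of what weighting a click by engagement could look like. Everything in it is an assumption for illustration: the event format, the dwell-time thresholds, and the function name are hypothetical, and nothing here is known to match what Bing actually does.

```python
from collections import defaultdict

def score_clicks(events, min_dwell=5.0, cap=60.0):
    """Sum engagement-weighted clicks per (query, url) pair.

    events: iterable of (query, url, dwell_seconds) tuples.
    min_dwell and cap are made-up thresholds, in seconds.
    """
    scores = defaultdict(float)
    for query, url, dwell in events:
        if dwell < min_dwell:
            continue  # treat a quick bounce as no evidence of usefulness
        # A click counts for more the longer the user stayed, up to the cap.
        scores[(query, url)] += min(dwell, cap) / cap
    return dict(scores)

events = [
    ("lcd tv", "http://a.example/review", 90.0),  # long visit, full weight
    ("lcd tv", "http://a.example/review", 30.0),  # medium visit, half weight
    ("lcd tv", "http://b.example/spam", 2.0),     # bounce, ignored
]
scores = score_clicks(events)
print(scores)
```

Under a scheme like this the page the user bounced from gets no credit at all, even though it received a click from the same results page.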
They are free to do that on their own search results. If they can't figure out how to rank the document in the first place, they shouldn't be getting data about the association from google.
I'm certain they never stopped trying to be innovative, but Bing is able to use that click stream data to reap the benefits of Google's innovations. Google is on the right side of this argument even if it makes them seem whiny to some people.
The part that seems interesting to me is that Google scrapes other entities' data (Reader, News, Books, Scholar) in dozens of other ways, and that behaviour is generally regarded as totally justifiable.
Because Google has a symbiotic relationship with the entities whose data it's grabbing. It uses that data to drive traffic or attention to the originators and owners of that data. Everyone wants Google to use their data to drive them traffic, and those who do not can use robots.txt. Bing is being parasitic to Google in this situation by generating no value for them.
> Bing is being parasitic to Google in this situation by generating no value for them.
Well, perhaps Microsoft thinks its business involves providing value to their customers rather than its competitors. What seems consistently missing from the Google PR campaign has been how Microsoft's behavior is a bad thing for the user.
Google doesn't spend exorbitant amounts of money so that they can provide value to Bing's users. If Bing's users want value they can use Google instead.
Again you put the focus on Google. And again I ask, how is Microsoft's alleged behavior harming the user?
Bing's increasing market share is presumably already due to providing value. If Google's unique sales proposition has become, "we find obscure misspelled words better than those copycats at Microsoft", do you think that's going to make people less inclined to try Bing?
Because the focus is on Google vs. Bing, not Google vs. Bing's users or Google's users vs. Bing or users vs. users.
>how is Microsoft's alleged behavior harming the user?
It's not harming Bing's users to provide Google's results. What's in it for Google to help Bing's users? It's not as if Google is an NPO humanitarian effort to help people find the information they seek.
>If Google's unique sales proposition has become, "we find obscure misspelled words better than those copycats at Microsoft"
Obscure misspelled words are just one thing that Bing is able to get better results on thanks to this. They also benefit from every other kind of query that a user can make on Google.
The AFP could have opted out if they wanted to. Copyright and fair use laws are not something we should be dragging into this. Your examples have nothing to do with the argument at hand between Google and Bing.
I'd suggest that they're indirectly relevant, because they all address the question of what constitutes fair indexing and what constitutes "copying" or theft of innovative/valuable information.
The evidence suggests Bing is scraping clickstream data to work out where its users go and where they stay, and this includes google search URLs with embedded queries. The evidence doesn't seem to suggest that Bing treats google any differently to any other URL on the web.
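As a rough illustration of how little special-casing that would take, here is a sketch that pulls an embedded query out of any URL in a clickstream. The function name and the choice of `q` as the parameter are my assumptions; it treats every site the same, and it is not a claim about how Bing's pipeline actually works.

```python
from urllib.parse import urlsplit, parse_qs

def query_from_url(url, param="q"):
    """Return the value of a query-string parameter, or None if absent.

    Works on any URL; `param` defaults to "q", a common name for
    search terms, but that default is an assumption.
    """
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None

print(query_from_url("http://www.google.com/search?q=tarsorrhaphy&hl=en"))  # tarsorrhaphy
print(query_from_url("http://shop.example/find?q=lcd+tv"))                  # lcd tv
print(query_from_url("http://news.example/story/123"))                      # None
```

Pair the extracted term with the next URL the user visits and you have a (query, destination) association, without the code ever mentioning a particular search engine.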
So, the questions could come down to: who has the right to disseminate URL strings and mine them for valuable information? Which seems not indirectly related to "who has the right to digitise other forms of media and mine them for valuable information?"
> 1) ...neither Bing nor Google does anything to protect user search privacy.
do you mean external or internal privacy? in terms of leaking information, SSL for search is about as good as you're going to get for privacy...if it's the browser or an extension (toolbar) that's watching searches, there's nothing a web site can do.
> 2) I think Google has more to lose by bringing this to light than they have to win.
That may be true, but there was a pretty cutting colbert segment on this last night where nuances about clickstreams weren't really a concern. personally I think google should have taken the humour route in the first place, as the "smoking gun" isn't all that damning at first sight. it requires some thought, which leaves plenty of room for disagreement and doubt.
> I think google should have taken the humour route in the first place
Good point. I think the current PR campaign, and really that is what it seems like now, is counterproductive. It used to be the response to Bing was "What, is that like Microsoft Bob for search?" The longer this campaign goes on, the more it increases the user perception of Bing as a serious competitor. Even if true, that's not really a good message to be promoting, even unintentionally.
Google has something to gain from all this. Consider the implications to Bing/Google competition for marketshare in the following scenario:
MS decides to bundle and auto-install the Bing toolbar in IE. Also default opt-in to the "share clickstream data with MS" option. Now you have tons of users with the Bing toolbar _and_ using Google to search. MS could "use" or "copy" the results of Google's hard work on search and pagerank and consequently provide some serious competition for search marketshare.
I found the discussion video on that page fascinating, in how Harry Shum's take on the other topics echoed his defence of Bing on this issue.
On the copying: Bing has dramatically improved over recent history because we use lots of inputs and we would have preferred if Google had talked to us privately so we could figure out how to make this less obvious on the long tail.
On web spam: Google is the industry leader and needs to share more with us little guys so we can all work together to beat this.
On search quality: Google needs to disclose their quality metrics so the industry as a whole can understand what users want. Then we can all make search better for everyone.
I thought Cutts was very gracious in his responses despite looking incredulous at hearing some of this stuff.
Look, this "let's all get along and work together as an industry to fix problems for the user"... it's bullshit. Either compete fairly or don't, but don't pretend that Google owes you data so you can get a leg up.
That shows a lot of bad faith on the part of Matt Cutts. It has been explained to death, here and elsewhere, what most likely happened. The fact that he continues to make the same accusation... well frankly he lost a lot of credibility with me. One more manipulative corporate drone, one less genuine hacker.
This is, by far, the best treatment of the issue I've seen anywhere, on account of how it's the only level-headed one apart from Nate Silver's. There's a little needless inflation at the beginning with the screenshot comparison. It does have the obvious and expected slant. On the other hand, there's no enormous hyperbole or anything like that. There are no vacuous statements. The word "copying" appears a few times, but that's at least descriptive, and there's no use of other loaded terms like "cheating", "stealing", and "unethical".
Also, for what it's worth, it moved me from 80/20 certain that nothing fishy is going on to 50/50: the spell correction paper shows to a certainty that this has been considered. I'm sure Googlers know as well as I do that not everything in a research paper finds its way into the product, and this one in particular may have been nixed by management types for this very reason. It's still awfully suspicious.
I still think the original experiment completely fails to demonstrate anything unethical, and I still think the original info release was both hyperbolic and needlessly inflammatory. It does demonstrate a need for some more information, which seems to be all this post is asking for. If it had looked more like this post, I think the 'net could have been spared a lot of controversy. Maybe Matt Cutts should be writing these things, though far be it from me to decide that.
I'm shocked at the amount of support on HN (thus far) for Bing's attitude on this issue.
I understand that they may well not target Google's SERPs specifically in their clickstream analysis but they should certainly have excluded Google from it, for ethical reasons.
Google state that Bing created associations from clickstreams through Google's SERPs on common queries (e.g. the tarsorrhaphy spell check test), not just long tail queries. Given that Google is extremely popular, this must have given a lot of weight to clickstream signals resulting from Google SERPs on many occasions, for common queries. That is entirely unethical and I'm shocked so many here don't find a problem with this.
Take the case of a highly-ranked great result on Google for a particular term, which Bing rates lower due to inferior algorithms. Bing's analysis of Google users would send that result higher in the Bing SERPs, mainly due to Google's expertise in highlighting that site, and only to a small degree due to the user's choice of clicking on it. That, to me, does fall under "copying" Google's results. It may not be illegal or intentional, but copying it remains.
Bing should have excluded Google from their clickstreams, and I certainly hope Google exclude Bing from theirs. (Matt Cutts stated they do in the video.)
Let's see. You're ok with Bing doing click analysis, but you want them to specifically throw the clicks away when the user stopped by Google, because it would be unethical. And also probably remove clickstreams that involve Yahoo's directory. And StackOverflow answers, because it would be cheating. And blogs referencing interesting links. And... or just accept that clickstream analysis is a great source of data that will be an aggregation of many, many other things on the Internet. Including Google.
A quick thought experiment on this subject: say I search for a term on Bing, find a link I want and then put that link on my blog - when Google indexes that blog have they copied Bing?
Sure, it's different, but is it meaningfully different? I made the link between the two terms, and I also consented for that data to be used in both cases (assuming the data comes from the Bing toolbar and an agreeable robots.txt). I just don't see how the data that Microsoft is using is off limits.
Just putting the link on your blog doesn't establish the connection between the keywords and the link. If you put both the keywords and the link on your blog then it allows Google to independently establish that connection using the merits of their own algorithms.
The "incriminating" part here is that Microsoft appears to be intentionally parsing the keywords out of the search. Which means they are intentionally looking at a click and saying a) this is a Google search and here are the terms used and then b) this is a result that Google returned, and they are then using that to fill their own index. If they generically parsed the URL for terms then you might argue they are not giving Google special treatment, they are just doing this for every page the user goes to. However that's a bit hard to buy - if they really did that they would end up with all kinds of garbage associations from opaque URLs. So they must have a signal saying "this was a Google search, treat it better than the others". Or they are somewhere in between the two. It's not clear to me where they fall on this scale - it's generally murky. At worst, I'd say they are copying, at best, I'd say it's sneaky but clever and fair game. The minute they single out Google and say "hey, this must be a good result" I think they crossed a line.
Yeah, you'd have to title the link with the search term. Personally I think the analogy someone else made to web directories is perfect, and that discussion is more interesting.
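To make the distinction in the comment above concrete, here is a minimal sketch of the two extremes: special-casing known search hosts versus generically treating every query-string value as candidate terms. Everything in it is an assumption for illustration (the `KNOWN_SEARCH_PARAMS` table, both function names, the example URLs); nothing here is known to reflect what the Bing toolbar actually does.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical table: hosts treated as search engines, and the
# query-string parameter assumed to carry the search terms.
KNOWN_SEARCH_PARAMS = {"www.google.com": "q", "www.bing.com": "q"}


def special_cased_terms(url):
    """Return search terms only when the URL is on a known search host."""
    parts = urlparse(url)
    param = KNOWN_SEARCH_PARAMS.get(parts.netloc)
    if param is None or parts.path != "/search":
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None


def generic_terms(url):
    """Treat every query-string value as candidate terms, for any site."""
    parts = urlparse(url)
    return [v for vals in parse_qs(parts.query).values() for v in vals]


search_url = "https://www.google.com/search?q=tarsorrhaphy&hl=en"
opaque_url = "https://example.com/item?id=8f3a9c&ref=home"

print(special_cased_terms(search_url))  # terms from the search host only
print(generic_terms(opaque_url))        # opaque values, the "garbage" case
```

The generic version happily returns opaque identifiers as if they were keywords, which is the "garbage associations" problem the comment describes; the special-cased version avoids that only by knowing in advance which sites are search engines.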
It seems entirely appropriate if you have a legitimate right to use that link to parse it for structure.
I still don't understand why Google hasn't done the exact same test with a control variable. Run the exact same test again on a domain that isn't google.com.
There are three reasons for Google not to report results of a control-case test:
1) It didn't occur to them to run a control in their experiment.
2) It did occur to them but they decided not to do it for some reason.
3) They did run the control but it did not bolster their case.
As you point out, if they ran it and it bolstered their case there would be no reason not to report it. Of the above reasons, #2 seems the least likely (because it's easy to set up a control and it would better clarify the situation). I do not have enough evidence to judge whether #1 or #3 is more likely.
Most of my thoughts have already been echoed elsewhere in this thread, but:
> "To me, what the experiment proved was that clicks on Google are being incorporated in Bing’s rankings."
It proved that clicks on other websites are being incorporated in Bing's rankings, which had already been public knowledge, I think. It didn't prove that only or disproportionately clicks on Google are being thus incorporated, although that is what Google is repeatedly claiming.
> "If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this?"
Is Matt Cutts suggesting that Bing special-case an exclusion for Google results?
I totally agree with this. There's not really any good reason why Microsoft should piggyback off of Google's search results. The fact that they're able to observe Google's results by observing user clicks in a browser rather than by simply harvesting results directly off of Google's website confuses the matter, but ultimately it's irrelevant. This is just a sneaky way for Microsoft to observe a user's interaction with an information service (Google), to see what information the user obtained, and then offer the same information themselves.
The technical aspect is plain. Copying Google is a great idea. If you're trying to predict the weather and are given a general bag of indicators, the one that is a well-reasoned expert opinion of the weather is probably a very important source. It may even account for a majority of your own predictive power. It would be stupid to ignore it if your goal is to simply make the best prediction.
Likewise, aggressively scraping Google is a smart move. Then you add some new innovation atop it and have a real opportunity to return more informed responses. This is done all the time in science.
In some sense Google simply has to acknowledge that they are a pretty important segment of the web, not some separate entity from it.
So only the legal/ethical question remains. In science it's unethical to work atop someone else's project without crediting them. I doubt Bing would be interested in adding a Powered by Google bar. Moreover, since Bing could directly profit off Google's work undercutting actual algorithmic progress through pure marketing competition (hypothetically, anyway, I am sure that Bing has added tech too) I feel like it's better to restrict this sort of thing.
I think it's fair to say that, much like commercial images on Flickr or sample songs, it is unethical and illegal to copy digital services and goods and then either claim them as your own or profit off of them. I think Google results are suitably close in spirit to this.
So maybe Bing and Girl Talk need to team up and discover and defend the ethical rights of sampling digital goods.
Seems like Bing is having trouble explaining exactly what they are doing, and Google is having trouble explaining why they should stop.
Is it illegal? If not, why should Bing stop doing it?
> I think Bing’s engineers deserve to know that when they beat Google on a query, it’s due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark.
Kind of rings hollow to me. If I were Bing, I'd want to do what's best for my users.
Well, the whole thing's kinda meta, isn't it? The content isn't on either site, it's just, really, the content identifiers. Not even that, this is about a sorting of identifiers against arbitrary strings of text. Whatever distinctions we have here are gonna appear incredibly small (and petty), but it's really the bread and butter of search.
Both are map makers in a sense, providing guides to what they didn't create, but still spent plenty of effort to make that guide. In the literal, map-making world, the artifact is the printed map. In that world, the way to check for copyright infringement is to see if mapping errors are duplicated as well.
The presumption being that if the errors were copied, then so was the good data.
Short-term, the benefit was that users would get discounts on high-quality data, as they only had to pay for the efforts of the map-copy, not the original map data acquisition. Of course, then you're just waiting for the quality to drop, as there's less and less incentive to actually do the map-making work. The margins go down, and the original sources have to update their maps less often to keep their costs low enough to be competitive.
There isn't a printed page with web search; the product is the output of a continuously-running dataset & algorithm.
But, I'm gonna ask, in the web-search world, how do you define copying and how do you test for it? If you don't think there is a valid definition, please don't count yourself the same as the group who thinks that there is a valid definition and this isn't it. They're two separate things.
(I've framed the question how I see it, and I work for Google, but I'm obviously no official speaker -- I've only been here a few months, and don't work in search quality. This is (almost definitionally) a fanboi war of sorts, and I wanted to stay out. I probably should have :( )
Google should do more testing. For example: repeat the same tests (different nonsense search terms and results just to be sure) where none of the data sent to/from the browser contains any google specific terminology. If the Bing toolbar isn't special-casing google then I think the test results should be the same.
That video was way more interesting than this semi-fabricated drama. Search result spam, adsense role in it, and how to (not) tackle it is what I'd like to hear more about. Google search results degraded over the years to the point where I have a bing search on my wp7 phone as default search and I actually don't mind/don't care, because search results from google are not an imperative for me anymore (and I'm not cheerleading for anyone).
Either bring a lawsuit, or be quiet. You aren't going to shame Microsoft into changing their current policy, and even if it does improve results by only 1/1000, as long as they are legally within their right to do it, why wouldn't they?
It's laughable to think this "controversy" is putting a black eye on to Bing in any way shape or form, except among perhaps the most technical and anal retentive among the population.
I understand the down votes, my comment wasn't all that constructive. But seriously, the amount of attention paid to this issue from Google might be better spent making their own products better. At best this is an issue for lawyers. The low road Google is taking by making a big stink in the 'blogosphere' is asinine to me, but clearly many of you disagree by continually voting up these stories.
> Well, no, that's a research paper that says that they have made experiments in that direction, but this doesn't imply that this is currently done in Bing.
I answered Matt's questions and raised some other points, still waiting for an answer. Matt, could you please comment?
The top 4 results (including one which has the domain name of the query) are the same but in a different order. The rest of the results appear totally different.
Even if one random query you ran returned an identical first 10 results, exactly what would that prove? Statistically significant how?
I'd start with the fact that they both have the keywords of the search string in the page title and in the URL string.
Look, I'm not saying it's impossible, I'm just saying that two different algorithms can yield the same results without one having used the other as an "indicator". To trot out a tired phrase, correlation does not imply causation (and I'd argue you don't even have correlation at this point.)
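For anyone who wants to put a number on "the top 4 are the same but the rest differ", here is a minimal sketch of a set-overlap (Jaccard) measure over two top-k result lists. The function name and the placeholder URLs are made up for illustration. As the comment above says, a high overlap on one query would not by itself show that one engine used the other as a signal.

```python
def overlap_at_k(results_a, results_b, k=10):
    """Jaccard overlap of the top-k results from two engines, ignoring order."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


# Placeholder result lists: same four URLs on top in a different order,
# six unrelated URLs below them on each side.
engine_a = ["u1", "u2", "u3", "u4"] + [f"a{i}" for i in range(6)]
engine_b = ["u4", "u3", "u2", "u1"] + [f"b{i}" for i in range(6)]

print(overlap_at_k(engine_a, engine_b))
```

To say anything about significance you would need this over many queries, plus a baseline for how often two independent engines agree anyway.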
Check your competitor's page right now. It will take you less time than reading the next post on Hacker News, and what you find may (or may not) surprise you. That consulting tip is on the house. :-)
My two favorite takeaways from this blog post are:
1) You can increase your site's rankings by installing the Bing Toolbar amongst a group of people and have said group search google for your target keyword and click through on your result.
2) Google is able and willing to manually screw around with search results in a seemingly easy manner.
As to the issue of Bing slurping up Google's data for their own purposes: scraping the web is a double-edged sword. They aren't doing anything illegal. If anything, this whole ordeal has made both search engines look like a bunch of teenagers fighting over who stole whose boyfriend.
> You can increase your site's rankings by installing the Bing Toolbar amongst a group of people and have said group search google for your target keyword and click through on your result.
But you can just skip the toolbar altogether. Just go to Bing and do the search and click your site. Likewise, if you go to Google and do the same, you'll increase your ranking on Google. There is nothing new here, except if you install the toolbar and go to Google you can increase it on Bing also.
This is something people have been doing for a while.
> Google is able and willing to manually screw around with search results in a seemingly easy manner.
Google can do this even if Bing stops this current practice. Just go to Bing, do a search, go to the last page and click a link.
Fundamentally, with respect to fraud, nothing has really changed.
Unless Bing weighs clicks on Google SERPs more strongly than clicks on Bing SERPs! After all, Google is the market leader...
This is the problem, we really don't know how hollow Bing is any more. If it's a thin UI over results that grow closer to Google, it's intellectually shallow. I wouldn't trust Bing search to combat, e.g., click fraud and SEO better than Google. If anything I would expect the edge cases and tricky behavior to derange Bing further.
Easy to catch and filter. But yes, in theory, this will improve results. The proportion of users clicking the 10th link on a page is much lower than the proportion who click the 1st. If more users click the 10th link than the 1st, that suggests a ranking problem. So clicks help fix ranking.
My question is... did Google install the Bing toolbar and then perform these searches on Google and click the links? If not, is Bing suggesting that users searched for those convoluted strings, in IE, with the Bing toolbar, on Google's website, and then clicked those links?
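As a rough illustration of that idea, here is a minimal sketch that flags result positions which received more clicks than some higher-ranked position for the same query. The function name and the click counts are invented for the example, and a real system would presumably also correct for position bias and filter out fraudulent clicks first.

```python
from collections import Counter


def misranked_positions(clicked_positions, num_results=10):
    """Return positions that out-clicked at least one higher-ranked position."""
    counts = Counter(clicked_positions)  # missing positions count as 0
    flagged = []
    for pos in range(2, num_results + 1):
        if any(counts[pos] > counts[higher] for higher in range(1, pos)):
            flagged.append(pos)
    return flagged


# Invented click log for one query: position 10 gets more clicks than position 1.
clicks = [1] * 5 + [2] * 3 + [10] * 8
print(misranked_positions(clicks))
```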
I hope this doesn't sound hyperbolic, that is really my understanding of the issue. Sorry, I'm not a big SEO guy, it's very possible my understanding or comprehension here is just flawed.
Really, -1? When I asked a question and prefaced it with a huge disclaimer stating that I might not understand the full implications of how Bing is using their toolbar metrics?
I just don't understand how Bing can claim they were using toolbar metrics when it seems highly unlikely that any user would be searching for these keywords. Let alone on Google, when they have the Bing toolbar installed.