15 July 2010

A followup on the "Cake Today" blogpost thefts

A couple weeks ago I reported that a bot-generated pseudoblog called "Cake Today" was stealing content from TYWKIWDBI and reposting the material with links to Amazon Cake sales.  I was puzzled in part by the fact that this website was horribly mistranslating my text in the process.  Here, for example, are two sentences I wrote about the American Lady caterpillars:
"Several weeks ago I wrote about the host plant and the eggs of the American Lady. Now I can offer some photos about the young caterpillars... The first instar has a semi-translucent body and is very difficult to see except for that black head."
And here's how those sentences were rendered at the other website:
"Several weeks ago I wrote most the patron plant and the eggs of the dweller Lady. Now I crapper substance whatever photos most the young caterpillars...  The prototypal instar has a semi-translucent embody and is rattling difficult to wager eliminate for that black head."
Obviously the text had been rendered into some other language (??Klingon) and then translated back into English.  But why??  TYWKIWDBI reader Andrew offered this concise explanation: "It's been re-translated to make it look like original content to search engines," and reader Kirk contacted Ed Kohler of The Deets, who explained the process as a type of "black hat SEO" ["search engine optimization"]:
"Since the content isn't really meant to be read (just draw traffic from search that can then leave through ads) it doesn't matter if it's particularly legible. Also, he could publish additional sites in additional languages. Again, not really any more work other than setting it up."
Interesting.  The auto-retranslation might also bypass any legal claims that they are stealing my content.  So what to do about it?  I did ponder the obvious option of trying to sabotage the process by posting something embarrassing to them.  The post I wrote about them stealing material was posted at their site (!), so the process was clearly automated.  I could post anything, and since their site did the repost within minutes I could then delete whatever I wrote from my site.

But then I thought why bother?  Since TYWK is a nonprofit blog, I'm not actually losing anything.  There is also the principle of not bringing a knife to a gunfight.  Whoever was doing this was a log-power more technically sophisticated than I am, and if they perceived that I was trying to mess with them, they might know of ways to hit back.

The best response seemed to be to report the site, as several of my readers suggested.  I looked into the process, and it seemed to be complicated.   While I was diddling around, TYWKIWDBI-oldtimers soubriquet and Nathan reported the offender on my behalf, and within a couple days the site was taken down by Blogger/Google.

I'm posting this now in case other bloggers encounter a similar situation in the future.   One can reasonably expect that this sort of thing happens all the time, and that in the vast world of the intertubes we wouldn't even be aware of it unless we have occasion to run a search for text we've written or TinEye one of our posted photos.
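
For the technically curious, here is a rough sketch of why the retranslation trick makes a stolen post harder to find with an exact-text search, and how a fuzzier comparison can still flag it. It uses the two versions of the American Lady sentence quoted above. The function names and the choice of three-word "shingles" are my own assumptions for illustration; nothing here is how any search engine actually works.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = ("Several weeks ago I wrote about the host plant "
            "and the eggs of the American Lady.")
scraped = ("Several weeks ago I wrote most the patron plant "
           "and the eggs of the dweller Lady.")

# An exact-match search for the original sentence finds nothing...
exact_match = original.lower() in scraped.lower()

# ...but a partial overlap of word sequences still survives.
overlap = jaccard(shingles(original), shingles(scraped))

print("exact match:", exact_match)
print("shingle overlap:", round(overlap, 2))
```

The point is only that swapping a handful of words defeats a quoted-phrase search, while enough of the original wording remains that searching for shorter fragments can still turn up the copy.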

My sincere thanks to all the TYWK readers for your technical advice and assistance during this incident.

And now back to our regularly-scheduled programming...

6 comments:

  1. One interesting thing about that blog is that they somehow used HTML or CSS to cover up the banner at the top of the page. You know, that place with the "Report Abuse" link?

    So, yeah. Some clever stuff going on there, trying to get people to land there and click on affiliate links and ads. I wonder how much money something like that makes, in the end.

  2. I love this story, and I think you dealt with it the right way. Some blogs post a line at the end of each post saying "If you aren't reading this on xxxxxxx.com, it's stolen content". Maybe a bit over the top, since 90% of the time, it will be readers on your own website reading that. But in your case, they wouldn't have cared because they weren't targeting human readers anyway.

  3. Google has removed the whole blog!

  4. That's great to hear. A little less litter on the web.

  5. Long story short, you can't fight back if you don't own your blogging platform - it's up to blogger.com (and, as a company, they have very little financial incentive to do anything).

    I came up with a fun little system to poison my site content when bots I didn't approve of happened by (it's not too hard to figure out which visitors are bots if you have full access to the log files) - you can see examples of what bots see here:

    pseudo-RSS feed

    ... and this leaves the somewhat whimsical pastime of watching site scrapers publish complete gibberish.

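For readers who run their own server and are curious what the approach in the last comment might look like, here is a minimal sketch. It is not the commenter's actual system, which isn't described in any detail. It assumes access logs in the common "combined" format, and the rules for spotting a bot (a blank user agent, or more than `max_hits` requests from one address) are hypothetical placeholders, as are all the function names.

```python
import random
import re
from collections import Counter

# Matches a "combined" format access log line; captures the client
# address and the user-agent string.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def suspected_bots(log_lines, max_hits=100):
    """Return addresses that look automated under two crude rules."""
    hits = Counter()
    bots = set()
    for line in log_lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines in some other format
        ip, agent = match.groups()
        hits[ip] += 1
        # Hypothetical rules: no user agent, or too many requests.
        if agent in ("", "-") or hits[ip] > max_hits:
            bots.add(ip)
    return bots


def poison(text, seed=0):
    """Shuffle the words so the text is useless to a scraper."""
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def serve(ip, text, bots):
    """Give suspected bots the scrambled text, everyone else the real one."""
    return poison(text) if ip in bots else text
```

None of this is possible on a hosted platform like Blogger, which is the commenter's point: you need the log files and control over what the server sends back.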