If you go to Google today, you'll notice that its logo has been blacked out.
If you go to Wikipedia, you'll see a headline that reads "Imagine a World Without Free Knowledge," followed by this text:
"For over a decade, we have spent millions of hours building the largest encyclopedia in human history. Right now, the U.S. Congress is considering legislation that could fatally damage the free and open Internet. For 24 hours, to raise awareness, we are blacking out Wikipedia."
The reason for these actions by Google, Wikipedia, and other major web-based companies is a pair of bills currently being considered by both houses of the U.S. Congress. They are best known by their acronyms, SOPA and PIPA.
Diffbot is one of those applications (and companies) you probably are not even aware of when you use it, but that's not necessarily a problem for the company's co-founder and CEO Michael Tung.
That's because his product is a "visual learning robot" that hundreds of developers are using to translate web content into better mobile apps, and as such it stays pretty much under the hood.
"We've invented this visual ID algorithm," says Tung. "One of our core insights is that the entire web can be classified down to 30 page types. There are product pages, event pages, news pages -- we can identify them visually with 99.999 accuracy."
Diffbot technology identifies each page's components, such as nav bars, footers, etc., as part of its identification process. Design standards are such that there is a high degree of similarity between the various page types grouped by category.
One customer using Diffbot at present is AOL's recently launched Editions, which is a personalized daily magazine for the tablet.
Quick. Of the thousands of web-based content companies headquartered in San Francisco, which one has the highest traffic?
The numbers for this website are mind-boggling:
- 414 million unique visitors monthly
- 12 billion page views monthly
Here are a few more hints: This site contains roughly 18 million original articles, which is five million more than the entire 160-year archive of The New York Times. Plus, all of those articles have been created in only the past ten years.