CommonCrawl

Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that everyone can access and analyze.

07/17/2014

Curious as to how widely Google Analytics is used across the web, or how big data can help fix spelling and grammar? Come down to RiskIQ in SF on the 23rd for the answers! Two tech talks (Stephen Merity, Oskar Singer) followed by food and drinks!

Please join us for an evening of discussion about big open data! There will be two excellent presentations describing projects done with Common Crawl data.

07/17/2014

Up for a midweek big data challenge? New 183TB dataset release containing over 2.6 billion web pages!

The April 2014 crawl is now available! The new dataset is over 183TB in size and contains approximately 2.6 billion web pages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.
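As a rough illustration of the location given in the announcement, the sketch below joins the bucket name and path into a single s3:// URI. The bucket and prefix come from the post; the helper name `crawl_s3_uri` is hypothetical, and the code only builds a string without contacting S3.

```python
# Bucket name and key prefix as stated in the announcement.
BUCKET = "aws-publicdatasets"
CRAWL_PREFIX = "/common-crawl/crawl-data/CC-MAIN-2014-15/"


def crawl_s3_uri(bucket: str = BUCKET, prefix: str = CRAWL_PREFIX) -> str:
    """Build an s3:// URI from a bucket name and a key prefix.

    Hypothetical helper for illustration only; it performs no network access.
    """
    # S3 keys do not start with a slash, so drop any leading one before joining.
    return f"s3://{bucket}/{prefix.lstrip('/')}"


if __name__ == "__main__":
    print(crawl_s3_uri())
```

The resulting URI could then be handed to an S3 client of your choice, for example as the argument to `aws s3 ls` in the AWS CLI.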

Address

San Francisco, CA
