>The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
>Access to the Common Crawl corpus hosted by Amazon is free. You may use Amazon’s cloud platform to run analysis jobs directly against it or you can download parts or all of it.
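If you just want to poke at the corpus, a single WARC segment (roughly 1 GB) is enough to play with, and it can be pulled over plain HTTPS without an AWS account. A minimal sketch, assuming the data.commoncrawl.org prefix and a 2018 crawl ID; check the crawl announcements for the real paths:

# Minimal sketch: fetch one WARC segment from the public Common Crawl dataset over HTTPS.
# BASE and CRAWL are assumptions taken from the 2018 crawl announcements, not gospel.
import gzip
import urllib.request

BASE = "https://data.commoncrawl.org/"
CRAWL = "CC-MAIN-2018-43"

# warc.paths.gz lists every WARC file in the crawl, one relative path per line.
with urllib.request.urlopen(BASE + f"crawl-data/{CRAWL}/warc.paths.gz") as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

first = paths[0]
print(f"{len(paths)} WARC files in {CRAWL}; fetching the first one: {first}")

# Stream a single segment to disk instead of trying to mirror the whole crawl.
with urllib.request.urlopen(BASE + first) as resp, open("sample.warc.gz", "wb") as out:
    while chunk := resp.read(1 << 20):
        out.write(chunk)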
Downloading the raw WARC, metadata, or text extracts in full is probably way too much for any individual, though Common Crawl also releases PageRank and host-to-host link metadata, for example:
>Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018
https://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/
>Host-level graph
>5.66 GB cc-main-2018-aug-sep-oct-host-vertices.paths.gz nodes ⟨id, rev host⟩, paths of the 42 vertex files
>23.60 GB cc-main-2018-aug-sep-oct-host-edges.paths.gz edges ⟨from_id, to_id⟩, paths of the 98 edge files
>9.63 GB cc-main-2018-aug-sep-oct-host.graph graph in BVGraph format
>2 kB cc-main-2018-aug-sep-oct-host.properties
>10.83 GB cc-main-2018-aug-sep-oct-host-t.graph transpose of the graph (outlinks inverted to inlinks)
>2 kB cc-main-2018-aug-sep-oct-host-t.properties
>1 kB cc-main-2018-aug-sep-oct-host.stats WebGraph statistics
>13.47 GB cc-main-2018-aug-sep-oct-host-ranks.txt.gz harmonic centrality and pagerank
>Domain-level graph
>0.60 GB cc-main-2018-aug-sep-oct-domain-vertices.txt.gz nodes ⟨id, rev domain, num hosts⟩
>5.95 GB cc-main-2018-aug-sep-oct-domain-edges.txt.gz edges ⟨from_id, to_id⟩
>3.24 GB cc-main-2018-aug-sep-oct-domain.graph graph in BVGraph format
>2 kB cc-main-2018-aug-sep-oct-domain.properties
>3.39 GB cc-main-2018-aug-sep-oct-domain-t.graph transpose of the graph
>2 kB cc-main-2018-aug-sep-oct-domain-t.properties
>1 kB cc-main-2018-aug-sep-oct-domain.stats WebGraph statistics
>1.89 GB cc-main-2018-aug-sep-oct-domain-ranks.txt.gz harmonic centrality and pagerank
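The ranks files at the bottom are just gzipped, tab-separated text, so you can skim the top of the domain list without touching the 13.47 GB host file. A minimal sketch; the exact column layout (harmonic centrality rank/value, PageRank rank/value, reversed domain name) is an assumption, so check the header line of the real file:

# Minimal sketch: peek at the first entries of the domain ranks dump without
# decompressing all 1.89 GB. The column layout is an assumption -- print the
# header line and adjust.
import gzip

with gzip.open("cc-main-2018-aug-sep-oct-domain-ranks.txt.gz", "rt") as f:
    header = next(f).rstrip("\n")
    print(header)
    for i, line in enumerate(f):
        if i >= 10:  # just the 10 top-ranked domains
            break
        print(line.rstrip("\n").split("\t"))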
>>1896 >Damn that image turned out shit when thumbnailed and compressed.
I think hokage's thumbnailer routine is prioritizing file size over the image not looking like shit.
A thread for sharing data dumps. Below are some dumps to get the thread going; a quick sketch for streaming the Pushshift files follows the list:
4chan/pol/ 2013-2019 (18 GB for posts; 42.4 GB for thumbnails)
https://archive.org/details/4plebs-org-data-dump-2019-01
https://archive.org/details/4plebs-org-thumbnail-dump-2019-01
Reddit 2006-2018 (446 GB for comments; 145 GB for submissions)
https://files.pushshift.io/reddit/
Gab.ai 2016-2018 (4.06 GB)
https://files.pushshift.io/gab/
Hacker News 2006-2018 (2.04 GB)
https://files.pushshift.io/hackernews/
Google Books Ngrams 1505-2008 (lots of GB if you want 3+ Ngrams)
https://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Stack Exchange till 2018-12 (~59 GB)
https://archive.org/details/stackexchange
GB != GiB (1 GB = 10^9 bytes, 1 GiB = 2^30 bytes, about 7% larger), so mind which unit a listing actually uses.
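As promised above, a minimal sketch for streaming the Pushshift dumps (Reddit, Gab, Hacker News) without unpacking them first, assuming the monthly files are newline-delimited JSON. RC_2015-01.bz2 is only an example name; later months switch to .xz or .zst, so swap in lzma or zstandard accordingly.

# Minimal sketch: stream one Pushshift monthly Reddit comments file as
# newline-delimited JSON. The filename and the bz2 compression are assumptions.
import bz2
import json

count = 0
with bz2.open("RC_2015-01.bz2", "rt") as f:
    for line in f:
        comment = json.loads(line)
        count += 1
        if count <= 5:  # show a few records, then just count the rest
            print(comment["subreddit"], comment["author"], comment["score"])
print(f"{count} comments total")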