Data Dumps General Nanonymous No.1894 >>1896
File: 9fee46955193aee74345648d1a2697c7ef50efb5c5d05b6f68b578597725cebf.jpg (169.37 KiB)

A thread for sharing data dumps, below are some dumps to get the thread going:

4chan/pol/ 2013-2019 (18 GB for posts; 42.4 GB for thumbnails)
https://archive.org/details/4plebs-org-data-dump-2019-01
https://archive.org/details/4plebs-org-thumbnail-dump-2019-01

Reddit 2006-2018 (446 GB for comments; 145 GB for submissions)
https://files.pushshift.io/reddit/

Gab.ai 2016-2018 (4.06 GB)
https://files.pushshift.io/gab/

Hacker News 2006-2018 (2.04 GB)
https://files.pushshift.io/hackernews/

Google Books Ngrams 1505-2008 (lots of GB if you want 3+ Ngrams)
https://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Stack Exchange till 2018-12 (~59 GB)
https://archive.org/details/stackexchange

GB != GiB — sizes above are decimal gigabytes (10^9 bytes), not gibibytes (2^30 bytes), so the files are a bit smaller than they look in GiB.
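The Pushshift dumps in particular are easy to work with: each monthly file is newline-delimited JSON, one object per line, compressed with bz2, xz, or zstd depending on the month. A minimal streaming sketch for a zstd month — the file name is a hypothetical example, and the third-party `zstandard` package is assumed; older months need `bz2` or `lzma` instead:

```python
import io
import json

import zstandard  # third-party: pip install zstandard (assumed available)

DUMP = "RC_2018-10.zst"  # hypothetical month from files.pushshift.io/reddit/

with open(DUMP, "rb") as fh:
    # Later Pushshift .zst files use a long zstd window; raising
    # max_window_size avoids decompression errors on those months.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with dctx.stream_reader(fh) as reader:
        lines = io.TextIOWrapper(reader, encoding="utf-8")
        n = 0
        for line in lines:
            n += 1
            comment = json.loads(line)  # one JSON object per line
            if n <= 3:
                print(comment.get("subreddit"), comment.get("author"))
        print(n, "comments total")
```

And to put the GB != GiB note in numbers: the 446 GB comment dump is 446 * 10^9 bytes, which is about 415.3 GiB (446e9 / 2^30).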

Kiwix ZIM files OP No.1896 >>1899

Kiwix ZIM files
Has all Wikimedia and StackExchange sites for offline browsing. The ZIM format is a custom container for XZ-compressed web content (e.g., HTML). You can also download the "portable" files (.zip), which include the search index as well. A minimal reading sketch follows the mirror list below.
Some places where you can download ZIM files:
https://ftp.fau.de/kiwix/
https://mirrors.dotsrc.org/kiwix/
https://download.kiwix.org/
https://ftp.nluug.nl/pub/kiwix/
https://ftp.acc.umu.se/mirror/kiwix.org/
https://mirror.isoc.org.il/pub/kiwix/ (Israeli server)
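If you want to poke at a ZIM without Kiwix itself, the openzim project ships Python bindings. A rough sketch, assuming the `libzim` package is installed; the file name and article path are hypothetical, since entry paths differ per ZIM (inspect the archive before hardcoding anything):

```python
# pip install libzim  (openzim's Python bindings; assumed available)
from libzim.reader import Archive

zim = Archive("wikipedia_en_all_nopic.zim")  # hypothetical file name
print("entries:", zim.entry_count)
print("main page:", zim.main_entry.get_item().path)

# Look up one article by its internal path (path is an assumption; the
# real layout depends on how the ZIM was scraped).
path = "A/OpenBSD"
if zim.has_entry_by_path(path):
    entry = zim.get_entry_by_path(path)
    html = bytes(entry.get_item().content).decode("utf-8")
    print(html[:200])
```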

>>1894
Damn that image turned out shit when thumbnailed and compressed.

Common Crawl Nanonymous No.1897
File: e16886dacf1b73adca017e46d157ac1dd1caff5e814a367b9d931edb20b2f8f4.png (6.19 KiB)

Common Crawl
https://commoncrawl.org/

>The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

>Access to the Common Crawl corpus hosted by Amazon is free. You may use Amazon’s cloud platform to run analysis jobs directly against it or you can download parts or all of it.

Downloading the raw WARC, metadata, or text extracts is probably way too much for any individual, but Common Crawl also releases PageRank and host-to-host link metadata, for example:
>Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018
https://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/

>Host-level graph
>5.66 GB cc-main-2018-aug-sep-oct-host-vertices.paths.gz nodes ⟨id, rev host⟩, paths of 42 vertices files
>23.60 GB cc-main-2018-aug-sep-oct-host-edges.paths.gz edges ⟨from_id, to_id⟩, paths of 98 edges files
>9.63 GB cc-main-2018-aug-sep-oct-host.graph graph in BVGraph format
>2 kB cc-main-2018-aug-sep-oct-host.properties
>10.83 GB cc-main-2018-aug-sep-oct-host-t.graph transpose of the graph (outlinks inverted to inlinks)
>2 kB cc-main-2018-aug-sep-oct-host-t.properties
>1 kB cc-main-2018-aug-sep-oct-host.stats WebGraph statistics
>13.47 GB cc-main-2018-aug-sep-oct-host-ranks.txt.gz harmonic centrality and pagerank

>Domain-level graph
>0.60 GB cc-main-2018-aug-sep-oct-domain-vertices.txt.gz nodes ⟨id, rev domain, num hosts⟩
>5.95 GB cc-main-2018-aug-sep-oct-domain-edges.txt.gz edges ⟨from_id, to_id⟩
>3.24 GB cc-main-2018-aug-sep-oct-domain.graph graph in BVGraph format
>2 kB cc-main-2018-aug-sep-oct-domain.properties
>3.39 GB cc-main-2018-aug-sep-oct-domain-t.graph transpose of the graph
>2 kB cc-main-2018-aug-sep-oct-domain-t.properties
>1 kB cc-main-2018-aug-sep-oct-domain.stats WebGraph statistics
>1.89 GB cc-main-2018-aug-sep-oct-domain-ranks.txt.gz harmonic centrality and pagerank
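The .graph/.properties files are BVGraph format and need the Java WebGraph framework to read, but the ranks file is just a gzipped text table, one host per line, with harmonic centrality and PageRank columns. A sketch for pulling the top hosts — the exact column layout here is an assumption, so check the header line of the real file first:

```python
import gzip

RANKS = "cc-main-2018-aug-sep-oct-host-ranks.txt.gz"

with gzip.open(RANKS, "rt", encoding="utf-8") as fh:
    header = fh.readline().split()  # column names; inspect before trusting
    print(header)
    for i, line in enumerate(fh):
        cols = line.rstrip("\n").split("\t")
        # Assumed layout: rank/value columns first, reversed host name last
        # (e.g. "org.commoncrawl" instead of "commoncrawl.org").
        rev_host = cols[-1]
        host = ".".join(reversed(rev_host.split(".")))
        print(cols[0], host)
        if i >= 9:
            break
```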

You can search their URL index here:
http://index.commoncrawl.org/

Here are the instructions for downloading the CommonCrawl data:
https://commoncrawl.org/the-data/get-started/
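The URL index speaks the pywb CDX API, so you can query it with plain HTTP and get one JSON record per matching capture. A sketch against a single crawl's index — the crawl ID is an assumption, pick a current one from the index.commoncrawl.org front page:

```python
import json
import urllib.parse
import urllib.request

# Crawl ID is an assumption; the index front page lists the available ones.
INDEX = "http://index.commoncrawl.org/CC-MAIN-2018-43-index"

params = urllib.parse.urlencode({
    "url": "commoncrawl.org",
    "matchType": "prefix",  # all captures under this URL prefix
    "output": "json",
})
with urllib.request.urlopen(f"{INDEX}?{params}") as resp:
    for line in resp:
        capture = json.loads(line)  # one JSON record per line
        print(capture["timestamp"], capture["url"], capture.get("status"))
```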

Torrent Dumps/Metadata OP No.1898
File: 593249a2f2f5be70d0b8ddff764c99aa70f48ee6e67818dac472b5b8862a2baa.png (1.83 KiB)

The Pirate Bay (no seed/leech ratio metadata)
https://thepiratebay.org/static/dump/csv/

Kickass Torrents June 2015
https://web.archive.org/web/20150609001718if_/http://kat.cr/dailydump.txt.gz (~640 MB)
Info: https://web.archive.org/web/20150518164224/https://kat.cr/api/

TorrentProject July 2016
https://web.archive.org/web/20160721213429if_/https://torrentproject.se/dailydump.txt.gz (~610 MB)
Info: https://web.archive.org/web/20160721213302/https://torrentproject.se/api

Bitsnoop May 2016
https://web.archive.org/web/20160327181910if_/http://ext.bitsnoop.com/export/b3_all.txt.gz
https://web.archive.org/web/20170324033525/https://bitsnoop.com/info/api.html

OfflineBay (Electron-based software)
https://github.com/techtacoriginal/offlinebay
https://www.offlinebay.com/
https://pirates-forum.org/Thread-Release-OfflineBay-v2-Open-source-and-No-more-Java-dependency
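All of these dumps are just compressed text, so offline search is trivial: stream the gzip, substring-match the title, and build a magnet link from the infohash (magnet links only need `xt=urn:btih:<infohash>`; trackers are optional). A sketch assuming a KAT-style pipe-separated layout where the first field is the infohash and the second the name — that column order is an assumption, so check a few lines of whichever dump you grab:

```python
import gzip
import sys
import urllib.parse

DUMP = "dailydump.txt.gz"  # e.g. the Kickass Torrents dump above
QUERY = sys.argv[1].lower() if len(sys.argv) > 1 else "debian"

with gzip.open(DUMP, "rt", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 2:
            continue
        infohash, name = fields[0], fields[1]  # assumed column order
        if QUERY in name.lower():
            print(f"magnet:?xt=urn:btih:{infohash}&dn={urllib.parse.quote(name)}")
```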

Nanonymous No.1899 >>1900

>>1896
>Damn that image turned out shit when thumbnailed and compressed.
I think hokage's thumbnailer routine is prioritizing file size over the image not looking like shit.

Nanonymous No.1900

>>1899
Should've added more contrast.