Downloading The Data

The data was downloaded to a set of CSV files using the code available at https://github.com/szymonlipinski/hackernews_dowloader.

This produced the following files:

## 504M /home/data/hn/1554994838_1565866358__19635134_20704074.data.csv
## 485M /home/data/hn/1401111353_1421029195__994369_8872196.data.csv
## 467M /home/data/hn/1300452676_1326824346__2340190_3475825.data.csv
## 450M /home/data/hn/1209976952_1271054983__181299_1258580.data.csv
## 506M /home/data/hn/1543560409_1554994838__18567251_19635134.data.csv
## 505M /home/data/hn/1556804358_1567598222__19807762_20876231.data.csv
## 493M /home/data/hn/1454584884_1468706219__11033266_12108033.data.csv
## 505M /home/data/hn/1531176188_1543560409__17494018_18567251.data.csv
## 490M /home/data/hn/1421029195_1437923693__8872196_9951246.data.csv
## 499M /home/data/hn/1493867970_1506462121__14262172_15342765.data.csv
## 465M /home/data/hn/1271054983_1300452676__1258580_2340190.data.csv
## 478M /home/data/hn/1367793791_1384391674__5660073_6729887.data.csv
## 82M  /home/data/hn/1554994838_1556804358__19635134_19807762.data.csv
## 505M /home/data/hn/1519132696_1531176188__16420052_17494018.data.csv
## 474M /home/data/hn/1326824346_1348853030__3475825_4586677.data.csv
## 498M /home/data/hn/1468706219_1481804010__12108033_13184043.data.csv
## 502M /home/data/hn/1481804010_1493867970__13184043_14262172.data.csv
## 502M /home/data/hn/1506462121_1519132696__15342765_16420053.data.csv
## 476M /home/data/hn/1348853030_1367793791__4586677_5660073.data.csv
## 498M /home/data/hn/1437923693_1454584884__9951246_11033266.data.csv
## 69M  /home/data/hn/1160418111_1209976952__1_181299.data.csv
## 478M /home/data/hn/1384391674_1401111353__6729887_7799657.data.csv
## 9,7G /home/data/hn

Creating The Database Structure

The data is too large to keep in memory in R for processing on my machine. An alternative is to keep it in a database; I chose PostgreSQL.

The table structure for the CSV data is:

## 
## CREATE TABLE raw_data (
##     title TEXT,
##     url TEXT,
##     author TEXT,
##     points INT,
##     story_text TEXT,
##     comment_text TEXT,
##     num_comments INT,
##     story_id INT,
##     story_title TEXT,
##     story_url TEXT,
##     parent_id INT,
##     created_at_i INT,
##     type TEXT,
##     object_id INT
## );

All the files were loaded with:

## #!/bin/bash
## 
## 
## if (( $# != 4 )); then
##     echo "Loads data from csv files to a postgres database"
##     echo "USAGE:"
##     echo "./load_files.sh DBNAME DBUSER TABLE_NAME FILES_DIRECTORY"
##     exit 0
## fi
## 
## DBNAME=$1
## DBUSER=$2
## TABLE_NAME=$3
## FILES_DIRECTORY=$4
## 
## for f in $FILES_DIRECTORY/*.csv
## do
##     echo "Loading $f"
##     psql $DBNAME -U $DBUSER -c "\\COPY $TABLE_NAME FROM $f WITH CSV DELIMITER ',' HEADER " 
## done

The loading time was about 6s per file.
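
At this point, a quick sanity check can confirm that the load worked. A minimal example (the raw_data table still contains duplicates here, and the column aliases are just illustrative):

## -- Count the loaded rows and check the range of ids.
## SELECT count(*)       AS total_rows,
##        min(object_id) AS first_id,
##        max(object_id) AS last_id
## FROM raw_data;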

Basic Data Cleaning

Removing Duplicates

According to the documentation of the downloader program:

Some entries in the files are duplicated, which is basically caused by the Algolia API limitations. What's more, Hackernews users can edit their entries, so when downloading the data again after some time, some entries may be different. The mechanism loading the data into a processing pipeline should update the entries when it encounters a duplicated entry id.

To remove the duplicates, I used a simple query which creates a new table without the duplicated rows. The primary key for the data is the object_id column, so to make things faster, I created an index on it and used DISTINCT ON:

## BEGIN;
## 
## CREATE INDEX i_raw_data_object_id ON raw_data (object_id);
## 
## CREATE TABLE data AS 
## SELECT DISTINCT ON (object_id) *
## FROM raw_data;
## 
## DROP TABLE raw_data;
## 
## COMMIT;
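
Note that DISTINCT ON without an ORDER BY keeps an arbitrary row from each group of duplicates, which is good enough here. A simple, illustrative check that the deduplication worked:

## -- After deduplication, every object_id should appear exactly once,
## -- so both counts should be equal.
## SELECT count(*) AS total_rows, count(DISTINCT object_id) AS distinct_ids
## FROM data;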

Adding Indices

I also need some indices on the data table for faster searching. I omitted the text columns, except for the ones where I will search on the whole value, like type = 'comment'.

## CREATE INDEX i_data_author       ON data (author);
## CREATE INDEX i_data_points       ON data (points);
## CREATE INDEX i_data_num_comments ON data (num_comments);
## CREATE INDEX i_data_story_id     ON data (story_id);
## CREATE INDEX i_data_parent_id    ON data (parent_id);
## CREATE INDEX i_data_created_at_i ON data (created_at_i);
## CREATE INDEX i_data_type         ON data (type);
## CREATE INDEX i_data_object_id    ON data (object_id);
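
After the bulk load and index creation, it may also help to refresh the planner statistics; the EXPLAIN below is just an illustrative way to inspect the plan chosen for a typical filter:

## -- Refresh the planner statistics for the new table.
## ANALYZE data;
## 
## -- Inspect the plan chosen for a typical filter on the indexed type column.
## EXPLAIN SELECT count(*) FROM data WHERE type = 'comment';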

Preprocessing Data

During the further data processing, I will need to repeat some operations. To speed things up, I will calculate a couple of things now and store them in the database. I like to use materialized views for this for two reasons:

  1. They can be easily refreshed to recalculate the data again (see the example after this list).
  2. They don’t change the original data.
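
For example, after loading newer data, each of the views defined in the following sections can be recalculated with a single statement:

## REFRESH MATERIALIZED VIEW dates;
## REFRESH MATERIALIZED VIEW urls;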

Calculating The Dates

The only date field in the data table is created_at_i, which is an integer with the number of seconds since Jan 1st, 1970 (the Unix epoch). As I will need to aggregate dates by week, day of week, month, and year, I will calculate these parts now to decrease the query time later:

## create materialized view dates as
## select
## object_id,
## timestamp 'epoch' + created_at_i * interval '1 second' as date,
## date_part('year',   timestamp 'epoch' + created_at_i * interval '1 second') as year,
## date_part('month',  timestamp 'epoch' + created_at_i * interval '1 second') as month,
## date_part('week',   timestamp 'epoch' + created_at_i * interval '1 second') as week,
## date_part('day',    timestamp 'epoch' + created_at_i * interval '1 second') as day,
## date_part('dow',    timestamp 'epoch' + created_at_i * interval '1 second') as dow,
## date_part('hour',   timestamp 'epoch' + created_at_i * interval '1 second') as hour,
## date_part('minute', timestamp 'epoch' + created_at_i * interval '1 second') as minute,
## date_part('second', timestamp 'epoch' + created_at_i * interval '1 second') as second,
## to_char(timestamp 'epoch' + created_at_i * interval '1 second', 'yyyy-MM')  as year_month
## from data;

For faster searching, I will add some indices on the above view:

## create index i_dates_object_id on dates(object_id);
## create index i_dates_year on dates(year);
## create index i_dates_month on dates(month);
## create index i_dates_date on dates(date);
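
As a sketch of the kind of aggregation this view is meant to speed up, counting comments per month could look like this (not necessarily the exact query used later):

## -- Comments per month, using the precalculated year_month column.
## SELECT d.year_month, count(*) AS comments
## FROM data e
## JOIN dates d USING (object_id)
## WHERE e.type = 'comment'
## GROUP BY d.year_month
## ORDER BY d.year_month;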

Getting URLs

I will also extract all the URLs from the specific fields. For now, I will mark the source field of each URL, as it is possible that the distribution of URLs in story texts is different than in comments.

## -- The urls can be everywhere
## -- If the entry type is a story, then it has fields like: title, url
## -- If it's a comment, then it has comment_text, story_title, story_url
## -- Jobs can have url, title, and story_text
## create materialized view 
## urls as
## with url_data as 
## (
##     select
##         distinct
##         object_id, 'comment_text' as type,
##         unnest(
##             regexp_matches(comment_text, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)',  'gi')
##         ) url
##     from data
##     UNION ALL
##     select
##         distinct
##         object_id, 'story_title',
##         unnest(
##             regexp_matches(title, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)',  'gi')
##         ) url
##     from data
##     UNION ALL
##     select
##         distinct
##         object_id, 'story_text',
##         unnest(
##             regexp_matches(story_text, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)',  'gi')
##         ) url
##     from data
##     UNION ALL
##     select
##         distinct
##         object_id, 'url',
##         unnest(
##             regexp_matches(url, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)',  'gi')
##         ) url
##     from data
## ),
## clean_urls as (
##      SELECT DISTINCT 
##         object_id, 
##         type, 
##         case when rtrim(url, './') ilike '%(%)%' 
##              then rtrim(url, './') 
##              else rtrim(url, './)')
##         end as url
##      FROM url_data
##      WHERE url not like '%...'
## ),
## parts as (
##  SELECT 
##     object_id, type, url,
## 
##     (regexp_matches(lower(url), '^(\w*)://[^/]*/?.*/?$')::TEXT[])[1] as protocol,
##     (regexp_matches(lower(url), '^\w*://([^/]*)/?.*/?$')::TEXT[])[1] as domain,
##     (regexp_matches(lower(url), '^\w*://(?:www.)?([a-zA-Z0-9_\.-]*).*$')::TEXT[])[1] as domain_without_www,
##     (regexp_matches(url, '^\w*://[^/]*(/.*)/?$')::TEXT[])[1] as full_path,
##     (regexp_matches(url, '^\w*://[^/]*/.*/?\?(.*)/?$')::TEXT[])[1] as params,
##     (regexp_matches(url, '^\w*://[^/]*(/[^?#]*?)/?')::TEXT[])[1] as path
##  FROM clean_urls
## )
## select
##  * 
## from parts;

For faster searching, I will add some indices on the above view:

## create index i_urls_object_id on urls(object_id);
## create index i_urls_protocol on urls(protocol);
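
A quick way to see how the extracted URLs are distributed across the source fields (an illustrative query, not part of the original processing):

## -- Number of extracted URLs per source field.
## SELECT type, count(*) AS urls_found
## FROM urls
## GROUP BY type
## ORDER BY urls_found DESC;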

Database Size

The sizes of the tables, materialized views, and indices:

Tables

##   Table Name    Size
## 1       data 9764 MB
## 2   urls_old  858 MB

Materialized Views

##   View Name    Size
## 1     dates 2156 MB
## 2      urls 1241 MB

Indices

##    Table Name          Index Name   Size
## 1        data       i_data_author 505 MB
## 2        data i_data_created_at_i 414 MB
## 3        data i_data_num_comments 414 MB
## 4        data    i_data_object_id 414 MB
## 5        data    i_data_parent_id 414 MB
## 6        data       i_data_points 414 MB
## 7        data     i_data_story_id 414 MB
## 8        data         i_data_type 414 MB
## 9       dates        i_dates_date 414 MB
## 10      dates       i_dates_month 414 MB
## 11      dates   i_dates_object_id 414 MB
## 12      dates        i_dates_year 414 MB
## 13       urls    i_urls_object_id 116 MB
## 14       urls     i_urls_protocol 116 MB

Protocol Distribution

The URLs are gathered regardless of the field type; we count one URL per object_id. So, if the same URL appears in the same entry in both the title and the description, it is counted as one. However, if it appears in a story and in a comment to this story, it is counted as two separate URLs.
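
A sketch of the kind of query behind these numbers, following the counting described above (one URL per entry, split by protocol and month); this is illustrative, not necessarily the exact query used:

## -- Count each (object_id, url) pair once, regardless of the field it appeared in,
## -- and group the counts by month and protocol.
## SELECT d.year_month, p.protocol, count(*) AS urls_count
## FROM (
##     SELECT DISTINCT object_id, protocol, url
##     FROM urls
##     WHERE protocol IN ('http', 'https')
## ) p
## JOIN dates d USING (object_id)
## GROUP BY d.year_month, p.protocol
## ORDER BY d.year_month, p.protocol;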

The number of https links started growing very fast in 2011. In the middle of 2013, the number of http links started decreasing. What's more interesting, the http links haven't disappeared yet.

The number of comments per month is growing almost linearly. The number of stories also grows, but much more slowly.

The number of stories and comments with links is also growing, which can be caused by the overall growth of the number of entries.

Almost every story has a link, but less than 25% of the comments do.

The percentage of stories with links and the percentage of comments with links stay almost constant over time. It looks like the growth of the number of entries with links, shown in the previous charts, is caused by the overall growth of the number of entries.

The Most Active Domains

The domains with the biggest number of posts are:

The domains with the biggest total number of points of all time are:
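
Both rankings can be computed with queries along these lines (a sketch only; the LIMIT and the grouping on domain_without_www are my assumptions):

## -- Domains with the biggest number of entries linking to them.
## SELECT domain_without_www, count(DISTINCT object_id) AS posts
## FROM urls
## GROUP BY domain_without_www
## ORDER BY posts DESC
## LIMIT 20;
## 
## -- Domains with the biggest total number of points.
## SELECT u.domain_without_www, sum(e.points) AS total_points
## FROM (SELECT DISTINCT object_id, domain_without_www FROM urls) u
## JOIN data e USING (object_id)
## GROUP BY u.domain_without_www
## ORDER BY total_points DESC NULLS LAST
## LIMIT 20;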

XKCD :: Comics with The Biggest Number of Points

https://xkcd.com/927/
https://xkcd.com/1354/
https://xkcd.com/1838/
https://xkcd.com/538/
https://xkcd.com/1827/
https://xkcd.com/1110/
https://xkcd.com/936/
https://xkcd.com/1053/
https://xkcd.com/radiation/
https://xkcd.com/1150/

Wikipedia :: Pages with The Biggest Number of Points

John McCarthy (computer scientist) John McCarthy (September 4, 1927 – October 24, 2011) was an American computer scientist and cognitive scientist. McCarthy was one of the founders of the discipline of artificial intelligence.[1] He coined the term “artificial intelligence” (AI),[2] developed the Lisp programming language family, significantly influenced the design of the ALGOL programming language, popularized timesharing, and was very influential in the early development of AI. https://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)

High-Tech Employee Antitrust Litigation High-Tech Employee Antitrust Litigation is a 2010 United States Department of Justice (DOJ) antitrust action and a 2013 civil class action against several Silicon Valley companies for alleged “no cold call” agreements which restrained the recruitment of high-tech employees. https://en.wikipedia.org/wiki/High-Tech_Employee_Antitrust_Litigation

Timeline of the far future While the future can never be predicted with absolute certainty,[1] present understanding in various scientific fields allows for the prediction of some far-future events, if only in the broadest outline. These fields include astrophysics, which has revealed how planets and stars form, interact, and die; particle physics, which has revealed how matter behaves at the smallest scales; evolutionary biology, which predicts how life will evolve over time; and plate tectonics, which shows how continents shift over millennia. https://en.wikipedia.org/wiki/Timeline_of_the_far_future

Wikipedia:Wikipedia Signpost/2017-02-27/Op-ed In biology, the hallmarks of an aggressive cancer include limitless and exponential multiplication of ordinarily beneficial cells, even when the body signals that further multiplication is no longer needed. The Wikipedia page on the wheat and chessboard problem explains that nothing can keep growing exponentially forever. In biology, the unwanted growth usually terminates with the death of the host. Exponential spending increases can often lead to the same undesirable result in organizations. https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2017-02-27/Op-ed

Great Molasses Flood The Great Molasses Flood, also known as the Boston Molasses Disaster or the Great Boston Molasses Flood, and sometimes referred to locally as the Boston Molassacre,[1][2] occurred on January 15, 1919, in the North End neighborhood of Boston, Massachusetts. A large storage tank burst, filled with 2,300,000 US gal (8,700 m3; 8,706,447 liters)[3] (ca 12,000 tons; 10,886 metric tons; 24,000,000 lbs)[4] of molasses, and a wave of molasses rushed through the streets at an estimated 35 mph (56 km/h), killing 21 and injuring 150.[5] The event entered local folklore and residents claimed for decades afterwards that the area still smelled of molasses on hot summer days.[6][5] https://en.wikipedia.org/wiki/Great_Molasses_Flood

User talk:Jimbo Wales Congratulations, Jimbo Wales! Despite your request for donations by the end of June, you have refunded all donations from a small donor. I am glad that Wikipedia is so successful. 84.120.0.236 (talk) 14:11, 5 August 2019 (UTC) https://en.wikipedia.org/wiki/User_talk:Jimbo_Wales

Hy Hy (alternately, Hylang) is a programming language, a dialect of the language Lisp designed to interact with the language Python by translating expressions into Python’s abstract syntax tree (AST). Hy was introduced at Python Conference (PyCon) 2013 by Paul Tagliamonte.[1] https://en.wikipedia.org/wiki/Hy

Potato paradox The potato paradox is a mathematical calculation that has a counter-intuitive result. The so-called paradox involves dehydrating potatoes by a seemingly minuscule amount, and then calculating a change in mass which is larger than expected. https://en.wikipedia.org/wiki/Potato_paradox

Zero rupee note A zero-rupee note is a banknote imitation issued in India as a means of helping to fight systemic political corruption. The notes are “paid” in protest by angry citizens to government functionaries who solicit bribes in return for services which are supposed to be free. Zero rupee notes, which are made to resemble the regular 50 rupee banknote of India, are the creation of a non-governmental organization known as 5th Pillar which has, since their inception in 2007, distributed over 2.5 million notes as of August 2014. The notes remain in current use and thousands of notes are distributed every month. https://en.wikipedia.org/wiki/Zero_rupee_note

Room 641A Coordinates: 37°47′07″N 122°23′48″W / 37.78528°N 122.39667°W / 37.78528; -122.39667 https://en.wikipedia.org/wiki/Room_641A

Arstechnica :: Pages with The Biggest Number of Points

SpaceX plans worldwide satellite Internet with low latency, gigabit speed SpaceX has detailed ambitious plans to bring fast Internet access to the entire world with a new satellite system that offers greater speeds and lower latency than existing satellite networks. https://arstechnica.com/information-technology/2016/11/spacex-plans-worldwide-satellite-internet-with-low-latency-gigabit-speed

Xamarin now free in Visual Studio SAN FRANCISCO—Microsoft bought Xamarin, the popular C#-and-.NET-on-iOS-and-Android, last month. At its Build developer conference today, the company announced the first big step for its new acquisition: Xamarin is now included in every Visual Studio version. https://arstechnica.com/information-technology/2016/03/xamarin-now-free-in-visual-studio

How “omnipotent” hackers tied to NSA hid for 14 years—and were found at last CANCUN, Mexico — In 2009, one or more prestigious researchers received a CD by mail that contained pictures and other materials from a recent scientific conference they attended in Houston. The scientists didn’t know it then, but the disc also delivered a malicious payload developed by a highly advanced hacking operation that had been active since at least 2001. The CD, it seems, was tampered with on its way through the mail. https://arstechnica.com/security/2015/02/how-omnipotent-hackers-tied-to-the-nsa-hid-for-14-years-and-were-found-at-last

Facebook scraped call, text message data for years from Android phones [Updated] [Update, March 25, 2018, 20:24 Eastern Time]: Facebook has responded to this and other reports regarding the collection of call and SMS data with a blog post that denies Facebook collected call data surreptitiously. The company also writes that it never sells the data and that users are in control of the data uploaded to Facebook. This “fact check” contradicts several details Ars found in analysis of Facebook data downloads and testimony from users who provided the data. More on the Facebook response is appended to the end of the original article below. https://arstechnica.com/information-technology/2018/03/facebook-scraped-call-text-message-data-for-years-from-android-phones

Microsoft and GitHub team up to take Git virtual file system to macOS, Linux One of the more surprising stories of the past year was Microsoft’s announcement that it was going to use the Git version control system for Windows development. Microsoft had to modify Git to handle the demands of Windows development but said that it wanted to get these modifications accepted upstream and integrated into the standard Git client. https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux

Sorry, Comcast: Voters say “yes” to city-run broadband in Colorado Voters in Fort Collins, Colorado, yesterday approved a ballot question that authorizes the city to build a broadband network, rejecting a cable and telecom industry campaign against the initiative. https://arstechnica.com/tech-policy/2017/11/voters-reject-cable-lobby-misinformation-campaign-against-muni-broadband

Google’s constant product shutdowns are damaging its brand It’s only April, and 2019 has already been an absolutely brutal year for Google’s product portfolio. The Chromecast Audio was discontinued January 11. YouTube annotations were removed and deleted January 15. Google Fiber packed up and left a Fiber city on February 8. Android Things dropped IoT support on February 13. Google’s laptop and tablet division was reportedly slashed on March 12. Google Allo shut down on March 13. The “Spotlight Stories” VR studio closed its doors on March 14. The goo.gl URL shortener was cut off from new users on March 30. Gmail’s IFTTT support stopped working March 31. https://arstechnica.com/gadgets/2019/04/googles-constant-product-shutdowns-are-damaging-its-brand

Patent war goes nuclear: Microsoft, Apple-owned “Rockstar” sues Google Canada-based telecom Nortel went bankrupt in 2009 and sold its biggest asset—a portfolio of more than 6,000 patents covering 4G wireless innovations and a range of technologies—at an auction in 2011. https://arstechnica.com/tech-policy/2013/10/patent-war-goes-nuclear-microsoft-apple-owned-rockstar-sues-google

After 37 years, Voyager 1 has fired up its trajectory thrusters At present, the Voyager 1 spacecraft is 21 billion kilometers from Earth, or about 141 times the distance between the Earth and Sun. It has, in fact, moved beyond our Solar System into interstellar space. However, we can still communicate with Voyager across that distance. https://arstechnica.com/science/2017/12/after-37-years-voyager-has-fired-up-its-trajectory-thrusters

Mickey Mouse will be public domain soon—here’s what that means As the ball dropped over Times Square last night, all copyrighted works published in 1923 fell into the public domain (with a few exceptions). Everyone now has the right to republish them or adapt them for use in new works. https://arstechnica.com/tech-policy/2019/01/a-whole-years-worth-of-works-just-fell-into-the-public-domain

Github :: Pages with The Biggest Number of Points

https://github.com/minimaxir/big-list-of-naughty-strings

https://github.com/jmdugan/blocklists/blob/master/corporations/facebook/all

https://github.com/Eloston/ungoogled-chromium

https://github.com/dear-github/dear-github

https://github.com/docker/docker.github.io/issues/6910

https://github.com/blog/2164-introducing-unlimited-private-repositories

https://github.com/kubernetes/kubernetes/blob/ec2e767e59395376fa191d7c56a74f53936b7653/pkg/controller/volume/persistentvolume/pv_controller.go

https://github.com/shadowsocks/shadowsocks-iOS/issues/124

https://github.com/mxgmn/WaveFunctionCollapse

https://github.com/open-guides/og-aws.git

Twitter :: Tweets and Accounts with The Biggest Number of Points

Sometimes ad blockers don't display the Twitter cards; in that case, you will see just the links.

https://twitter.com/lemiorhan/status/935578694541770752
https://twitter.com/ctavan/status/1044282084020441088
https://twitter.com/FrancescoC/status/1119596234166218754
https://twitter.com/Senficon/status/1014814460488413185
https://twitter.com/sarahjeong/status/735924335412543488
https://twitter.com/GambleLee/status/862307447276544000
https://twitter.com/NASA/status/539814651404754944
https://twitter.com/dustinkirkland
https://twitter.com/hintjens/status/783254242052206592
https://twitter.com/KodyKinzie/status/1146196570083192832

Software Version

About The Author

Made by Szymon Lipiński.

Published at https://www.simononsoftware.com/hackernews-links-analysis/