The data was downloaded to a set of CSV files using the code available at https://github.com/szymonlipinski/hackernews_dowloader.
This produced the following files:
## 504M /home/data/hn/1554994838_1565866358__19635134_20704074.data.csv
## 485M /home/data/hn/1401111353_1421029195__994369_8872196.data.csv
## 467M /home/data/hn/1300452676_1326824346__2340190_3475825.data.csv
## 450M /home/data/hn/1209976952_1271054983__181299_1258580.data.csv
## 506M /home/data/hn/1543560409_1554994838__18567251_19635134.data.csv
## 505M /home/data/hn/1556804358_1567598222__19807762_20876231.data.csv
## 493M /home/data/hn/1454584884_1468706219__11033266_12108033.data.csv
## 505M /home/data/hn/1531176188_1543560409__17494018_18567251.data.csv
## 490M /home/data/hn/1421029195_1437923693__8872196_9951246.data.csv
## 499M /home/data/hn/1493867970_1506462121__14262172_15342765.data.csv
## 465M /home/data/hn/1271054983_1300452676__1258580_2340190.data.csv
## 478M /home/data/hn/1367793791_1384391674__5660073_6729887.data.csv
## 82M /home/data/hn/1554994838_1556804358__19635134_19807762.data.csv
## 505M /home/data/hn/1519132696_1531176188__16420052_17494018.data.csv
## 474M /home/data/hn/1326824346_1348853030__3475825_4586677.data.csv
## 498M /home/data/hn/1468706219_1481804010__12108033_13184043.data.csv
## 502M /home/data/hn/1481804010_1493867970__13184043_14262172.data.csv
## 502M /home/data/hn/1506462121_1519132696__15342765_16420053.data.csv
## 476M /home/data/hn/1348853030_1367793791__4586677_5660073.data.csv
## 498M /home/data/hn/1437923693_1454584884__9951246_11033266.data.csv
## 69M /home/data/hn/1160418111_1209976952__1_181299.data.csv
## 478M /home/data/hn/1384391674_1401111353__6729887_7799657.data.csv
## 9,7G /home/data/hn
The whole dataset is too large to keep in memory in R for processing on my machine. An alternative is to keep it in a database; I chose PostgreSQL.
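For completeness, this is roughly the database setup the rest of the post assumes; the role, database name, and password match the connection settings used in the R code later on, but the exact setup may of course differ:
## -- Hypothetical setup; the names and password match the connection
## -- settings in the R code below.
## CREATE ROLE hn LOGIN PASSWORD 'hn';
## CREATE DATABASE hn OWNER hn;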
The table structure for the CSV data is:
##
## CREATE TABLE raw_data (
## title TEXT,
## url TEXT,
## author TEXT,
## points INT,
## story_text TEXT,
## comment_text TEXT,
## num_comments INT,
## story_id INT,
## story_title TEXT,
## story_url TEXT,
## parent_id INT,
## created_at_i INT,
## type TEXT,
## object_id INT
## );
All the files were loaded with the following script:
## #!/bin/bash
##
##
## if (( $# != 4 )); then
## echo "Loads data from csv files to a postgres database"
## echo "USAGE:"
## echo "./load_files.sh DBNAME DBUSER TABLE_NAME FILES_DIRECTORY"
## exit 0
## fi
##
## DBNAME=$1
## DBUSER=$2
## TABLE_NAME=$3
## FILES_DIRECTORY=$4
##
## for f in "$FILES_DIRECTORY"/*.csv
## do
##     echo "Loading $f"
##     psql "$DBNAME" -U "$DBUSER" -c "\\COPY $TABLE_NAME FROM '$f' WITH CSV DELIMITER ',' HEADER"
## done
The loading time was about 6s per file.
According to the documentation of the downloader program:
Some entries in the files are duplicated, which is mostly due to the limitations of the Algolia API. What's more, Hacker News users can edit their entries, so when the data is downloaded again after some time, some entries may differ. A mechanism loading the data into a processing pipeline should update an entry when it encounters a duplicated entry id.
To remove the duplicates, I used a simple query which creates a new table without the duplicated rows. The natural primary key for the data is the object_id column, so to make things faster, I created an index on it and used DISTINCT ON:
## BEGIN;
##
## CREATE INDEX i_raw_data_object_id ON raw_data (object_id);
##
## CREATE TABLE data AS
## SELECT DISTINCT ON (object_id) *
## FROM raw_data;
##
## DROP TABLE raw_data;
##
## COMMIT;
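A quick way to confirm that the deduplication worked (my own sanity check, not part of the original pipeline) is to compare the total number of rows with the number of distinct object_id values; the two counts should be equal:
## -- Sanity check: after the rewrite every object_id should appear exactly once.
## SELECT count(*)                  AS total_rows,
##        count(DISTINCT object_id) AS distinct_object_ids
## FROM data;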
I also need some indices on the data table for faster searching. I omitted the long text columns, keeping only the ones that are compared as whole values, like type = 'comment'.
## CREATE INDEX i_data_author ON data (author);
## CREATE INDEX i_data_points ON data (points);
## CREATE INDEX i_data_num_comments ON data (num_comments);
## CREATE INDEX i_data_story_id ON data (story_id);
## CREATE INDEX i_data_parent_id ON data (parent_id);
## CREATE INDEX i_data_created_at_i ON data (created_at_i);
## CREATE INDEX i_data_type ON data (type);
## CREATE INDEX i_data_object_id ON data (object_id);
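After a bulk load and this much index creation, it is usually worth refreshing the planner statistics so that the new indices are actually used. This is my addition rather than a required step; a single ANALYZE should be enough:
## -- Refresh planner statistics after the bulk load and index creation.
## ANALYZE data;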
In the further data processing, I will need to repeat some operations. To speed them up, I will precalculate a couple of things and store them in the database as materialized views.
The only date field in the data table is created_at_i, which is an integer holding the number of seconds since Jan 1st, 1970 (a Unix timestamp). As I will need to aggregate by week, day of week, month, and year, I will calculate these parts now to reduce query times later:
## create materialized view dates as
## select
## object_id,
## timestamp 'epoch' + created_at_i * interval '1 second' as date,
## date_part('year', timestamp 'epoch' + created_at_i * interval '1 second') as year,
## date_part('month', timestamp 'epoch' + created_at_i * interval '1 second') as month,
## date_part('week', timestamp 'epoch' + created_at_i * interval '1 second') as week,
## date_part('day', timestamp 'epoch' + created_at_i * interval '1 second') as day,
## date_part('dow', timestamp 'epoch' + created_at_i * interval '1 second') as dow,
## date_part('hour', timestamp 'epoch' + created_at_i * interval '1 second') as hour,
## date_part('minute', timestamp 'epoch' + created_at_i * interval '1 second') as minute,
## date_part('second', timestamp 'epoch' + created_at_i * interval '1 second') as second,
## to_char(timestamp 'epoch' + created_at_i * interval '1 second', 'yyyy-MM') as year_month
## from data;
For faster searching, I will add some indices on the above view:
## create index i_dates_object_id on dates(object_id);
## create index i_dates_year on dates(year);
## create index i_dates_month on dates(month);
## create index i_dates_date on dates(date);
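One thing worth remembering about materialized views: they are snapshots, so if the data table ever gets new rows, the view has to be refreshed manually, for example:
## -- Refresh the snapshot after the underlying data changes
## -- (not needed for a one-off analysis like this one).
## REFRESH MATERIALIZED VIEW dates;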
I will also extract all the URLs from the relevant fields. For now, I will mark the source field of each URL, as the distribution of URLs in story texts may differ from the one in comments.
## -- The urls can be everywhere
## -- If the entry type is a story, then it has fields like: title, url
## -- If it's a comment, then it has comment_text, story_title, story_url
## -- Jobs can have url, title, and story_text
## create materialized view
## urls as
## with url_data as
## (
## select
## distinct
## object_id, 'comment_text' as type,
## unnest(
## regexp_matches(comment_text, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## UNION ALL
## select
## distinct
## object_id, 'story_title',
## unnest(
## regexp_matches(title, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## UNION ALL
## select
## distinct
## object_id, 'story_text',
## unnest(
## regexp_matches(story_text, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## UNION ALL
## select
## distinct
## object_id, 'url',
## unnest(
## regexp_matches(url, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## ),
## clean_urls as (
## SELECT DISTINCT
## object_id,
## type,
## case when rtrim(url, './') ilike '%(%)%'
## then rtrim(url, './')
## else rtrim(url, './)')
## end as url
## FROM url_data
## WHERE url not like '%...'
## ),
## parts as (
## SELECT
## object_id, type, url,
##
## (regexp_matches(lower(url), '^(\w*)://[^/]*/?.*/?$')::TEXT[])[1] as protocol,
## (regexp_matches(lower(url), '^\w*://([^/]*)/?.*/?$')::TEXT[])[1] as domain,
## (regexp_matches(lower(url), '^\w*://(?:www.)?([a-zA-Z0-9_\.-]*).*$')::TEXT[])[1] as domain_without_www,
## (regexp_matches(url, '^\w*://[^/]*(/.*)/?$')::TEXT[])[1] as full_path,
## (regexp_matches(url, '^\w*://[^/]*/.*/?\?(.*)/?$')::TEXT[])[1] as params,
## (regexp_matches(url, '^\w*://[^/]*(/[^?#]*?)/?')::TEXT[])[1] as path
## FROM clean_urls
## )
## select
## *
## from parts;
For faster searching, I will add some indices on the above view:
## create index i_urls_object_id on urls(object_id);
## create index i_urls_protocol on urls(protocol);
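Before relying on the regular expressions above, it is worth spot-checking them on a literal URL. The address below is made up; a query along these lines shows what the main parts resolve to:
## -- Spot-check the URL-splitting regexes on a hypothetical literal URL.
## SELECT
##     (regexp_matches(lower(u), '^(\w*)://[^/]*/?.*/?$')::TEXT[])[1] AS protocol,
##     (regexp_matches(lower(u), '^\w*://([^/]*)/?.*/?$')::TEXT[])[1] AS domain,
##     (regexp_matches(lower(u), '^\w*://(?:www.)?([a-zA-Z0-9_\.-]*).*$')::TEXT[])[1] AS domain_without_www
## FROM (VALUES ('https://www.example.com/some/page?x=1')) AS t(u);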
The sizes of the tables, materialized views, and indices:
require("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "hn",
host = "localhost", port = 5432,
user = "hn", password = "hn")
tables <- dbGetQuery(con, '
SELECT
tablename "Table Name",
pg_size_pretty(pg_relation_size(tablename::text)) "Size"
FROM
pg_tables where schemaname=\'public\'
ORDER BY
tablename
')
views <- dbGetQuery(con, '
SELECT
matviewname "View Name",
pg_size_pretty(pg_relation_size(matviewname::text)) "Size"
FROM
pg_matviews where schemaname=\'public\'
ORDER BY
matviewname
')
indices <- dbGetQuery(con, '
SELECT
tablename "Table Name",
indexname "Index Name",
pg_size_pretty(pg_relation_size(indexname::text)) "Size"
FROM
pg_indexes
WHERE schemaname = \'public\'
ORDER BY tablename, indexname;
')
## Table Name Size
## 1 data 9764 MB
## 2 urls_old 858 MB
## View Name Size
## 1 dates 2156 MB
## 2 urls 1241 MB
## Table Name Index Name Size
## 1 data i_data_author 505 MB
## 2 data i_data_created_at_i 414 MB
## 3 data i_data_num_comments 414 MB
## 4 data i_data_object_id 414 MB
## 5 data i_data_parent_id 414 MB
## 6 data i_data_points 414 MB
## 7 data i_data_story_id 414 MB
## 8 data i_data_type 414 MB
## 9 dates i_dates_date 414 MB
## 10 dates i_dates_month 414 MB
## 11 dates i_dates_object_id 414 MB
## 12 dates i_dates_year 414 MB
## 13 urls i_urls_object_id 116 MB
## 14 urls i_urls_protocol 116 MB
The URLs are gathered regardless of the field type; we count one URL per object_id. So, if the same URL appears in both the title and the description of the same entry, it is counted once. However, if it appears in a story and in a comment to this story, it is counted as two separate URLs.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "hn",
host = "localhost", port = 5432,
user = "hn", password = "hn")
urls <- dbGetQuery(con, "
WITH distinct_urls AS (
SELECT DISTINCT object_id, protocol, url
FROM urls
UNION
SELECT DISTINCT object_id, 'all',
domain || '/' || path as url
FROM urls
)
SELECT protocol, year_month date, count(*)
FROM distinct_urls
JOIN dates USING (object_id)
WHERE date < date_trunc('month', now())
GROUP BY protocol, year_month
ORDER BY year_month desc, protocol
")
dateRange <- dbGetQuery(con, "SELECT min(year) min, max(year) max FROM dates")
breaks <- mapply({function(year) sprintf("%s-01", year)},
seq(dateRange$min[1], dateRange$max[1]))
gg <- ggplot(data = urls, aes(x = date, y = count, group = protocol)) +
geom_point(aes(color = protocol)) +
geom_smooth(method = "loess", se = FALSE, aes(color = protocol)) +
labs(title = "Protocol Distribution For URLs By Month",
x = "Date",
y = "Count",
color = "Protocol Type") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
The number of https links started growing very fast in 2011. In the middle of 2013, the number of http links started decreasing. What's more interesting, the http links haven't disappeared yet.
entries <- dbGetQuery(con, "
WITH all_entries as (
SELECT year_month date, data.type, count(data.object_id)
FROM data
JOIN dates USING(object_id)
WHERE data.type in ('story', 'comment')
AND date < date_trunc('month', now())
GROUP BY year_month, data.type
),
entries_with_comments as (
SELECT year_month date, data.type, count(data.object_id)
FROM data
JOIN urls USING(object_id)
JOIN dates USING(object_id)
WHERE data.type in ('story', 'comment')
AND date < date_trunc('month', now())
GROUP BY year_month, data.type
)
SELECT e.date, e.type, e.count allCount, c.count commentsCount,
100.0 * c.count / e.count percent
FROM all_entries e
JOIN entries_with_comments c USING (date, type)
ORDER BY e.date, e.type
")
gg <- ggplot(data = entries, aes(x = date, y = allcount, group = type)) +
geom_point(aes(color = type)) +
geom_smooth(method = "loess", se = FALSE, aes(color = type)) +
labs(title = "Number of Entries By Month",
x = "Date",
y = "Count",
color = "Entry Type") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
The number of comments per month is growing almost linearly. The number of stories is also growing, but much more slowly.
gg <- ggplot(data = entries, aes(x = date, y = commentscount, group = type)) +
geom_point(aes(color = type)) +
geom_smooth(method = "loess", se = FALSE, aes(color = type)) +
labs(title = "Entries With URLs By Month",
x = "Date",
y = "Count",
color = "Entry Type") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)