Monthly Readings #24: 2018 March

Monthly Readings #23: 2018 Jan

The USE Method

https://www.evernote.com/l/AATojLawMd1HpJz61cnRXrRRR2vmpGoTIvk

Solving the Linux storage scalability bottlenecks

https://www.evernote.com/l/AAQVd0liGrRDiaXQhwMA7dOW0qnLNwqLyBI

Looking at Disk Utilization and Saturation

https://www.evernote.com/l/AATvmxqSyOBEcoaSVrzwP-HoMuZ0skZELb4

A Comparison of NVMe and AHCI: Overview

https://www.evernote.com/l/AATUyzwnc1ZPrYiO7MDzkqvj-raF1x7KkBc

15 Feb 2018 by fleuria

Monthly Readings #22: 2017 Dec

Transactions in HBase

https://www.evernote.com/l/AARJfjNX7NpPe5YbKNDSsy6fxssk9TyoS_o

Omid, Reloaded: Scalable and Highly-Available Transaction Processing

https://www.evernote.com/l/AAQxcmbp-JxHf7nftOa9yLLujC_RWTnkxnY

Phoenix: ACID Transactions

https://www.evernote.com/l/AAQGoMUzXe5BCawhBSy_JpdyhBh2F7fN0jk

Allocation Efficiency in High-Performance Go Services

https://www.evernote.com/l/AASvlOb-RpNCp6MIpNDaKE-E_HoxvXoegiE

Error handling patterns in Go

https://www.evernote.com/l/AAR7O4SRR_VOUpP_w83iyN03_R9ymm3o4i4

阿里电商架构演变之路

https://www.evernote.com/l/AAR9ySau20BNfoAJWqZuFPXjMpTQiCrt-XI

31 Dec 2017 by fleuria

Monthly Readings #21: 2017 Augest

Dremel made simple with Parquet

https://www.evernote.com/l/AAQKTqv0eOdL9bdRDp8t4YULZa0jwozkmIc

Pseudo GTID and easy replication management

https://www.evernote.com/l/AAQ7t9l8Px9D7qd_WzyaMFffXvcVyEz0AKM

Understanding GC pauses in JVM, HotSpot’s minor GC.

https://www.evernote.com/l/AASyfwT65NZJMar2aDb_asYUigabLm1fnCA

Understanding GC pauses in JVM, HotSpot’s CMS collector.

https://www.evernote.com/l/AAQSmFgCqN1DHZNoKIcAtt_LOvz4OAAxcwM

Colum-Store vs Row-Stores: How Different Are They Really

https://docs.google.com/document/d/1vtxqDyREUfDOPubEcGC2PE-VYt7ZwDXGJCbnHEy8uMo/edit#

16 Sep 2017 by fleuria

Monthly Readings #20: 2017 July

Mining Massive Datasets 学习笔记: LSH

怎样在大规模数据集中发现相似的项对列表?

最糙的解法是嵌套两层 for 循环遍历所有的项对,计算彼此的相似性再按相似性排序,这一计算量显然是不可接受的。而 k-d tree 之类的多维搜索树,面向高维度的数据集性能也退化的厉害,几乎与嵌套两层 for 循环无异。

最近在看《大数据:互联网大规模数据挖掘与分布式处理》介绍了 LSH (Locality Sensitive Hashing, 距离敏感哈希) 算法,它的思路是对数据项计算哈希,想办法将相似的项对分到同一个哈希桶中,从而简化大大计算过程。过程很有意思,在这里记一下。

集合的 Jaccard 相似度

集合 S 和 T 的 Jaccard 相似度表示为 S 和 T 的交集与并集之间的差:

集合相似度可以用于衡量文本文档的相似性,比如论文抄袭、重复新闻的检测;可以应用于协同过滤的场景,如一位用户喜欢的书籍集合和我的书籍集合相似,那么可以认为我们喜好相似;也可以将 uber 的驾车行驶路线表示为集合,继而可以通过 LSH 算法找出相似路线。

将文本文档表示为集合:Shingling

将文本文档表示为集合的 Shingling 方法和搜索引擎的 N-gram 是同义词,如 k-shingle 等于长度为 k 的任意子串。

对于重复新闻检测,可以将 shingle 定义为停用词加上后续的两个词形成 shingle 集,这一做法基于假定:新闻正文中富含停用词,而正文内混插的广告语通常是简短的口号性的文本,停用词少。这样做有助于减少同一新闻内容受站点插入的广告造成的干扰。

集合的摘要表示: MinHash

shingle 集合非常大,远远大于原始文档的体积。幸运的是可以通过 SimHash 将集合压缩为摘要,而摘要仍能用于相似性的计算。

minHash 之所以有用,在于两个集合的在经过 minHash 转换之后相等的概率恰好等于两者的 Jaccard 相似度。

设 shingle 集中有两个 shingle 集合 S 和 T,分别求取 minHash 等价于在两个集合中随机抽取一个元素:

  • 总样本空间为
  • 两个元素相等的样本空间为

易知随机抽取的两个元素相等的概率等于 Jaccard 相似度。

类似 bloom filter,使用的哈希函数越多,Jaccard 相似度的估算也就越精确。生成 100 个哈希函数,分别应用于一个集合所得的 minHash 列表可作为一个集合的 minHash 签名,对两个集合的签名做概率计算,即可估算出彼此的 Jaccard 相似度。

LSH 的哈希桶

哈希表可以保证相同的键分到同一个哈希桶,而相似性哈希的精髓在于,将划分到同一哈希桶的对象视为相似,并通过增加样本数提高相似性判定的精确性。

对每个集合计算 minHash,并将 minHash 分别划分到哈希桶中,分到同一桶中的集合会被视为可能相似的候选集。

为了提高精确性,计算多个 minHash 并分桶,同一对集合可能存在于多个桶中,后续通过一次排重扫描,找出哈希桶重复度最高的一批集合匹配,最后在它们中间计算 Jaccard 距离,进一步确认出最相似的集合。

Reference

01 Jul 2017 by fleuria

Monthly Readings #19: 2017 Jun

Is Kanban a part of Scrum, and is it simply the task board?

https://www.evernote.com/l/AARzQeN-h8pLTYcYQXT4mSc3o-bKqoOL0Ao

What is scrum?

https://www.evernote.com/l/AATlTXZAYGFJppz4h3_UOXtUqnbtKSDAfqI

互联网广告拍卖机制设计

https://www.evernote.com/l/AAQBt8ARkWtPGK0m1N04To7vIz-fLVXbjmM

Monte Carlo Tree Search

https://www.evernote.com/l/AASOQIGeQ-xA4K7Zw0dI9ZOQPq8Ob1_awzc

What is the difference between “hill climbing” and “greedy” algorithms?

https://www.evernote.com/l/AAQZ4srFHh5AV4fX5uHNwa23dhaCYRwUg6M

Impossible Engineering Problems Often Aren’t

https://www.evernote.com/l/AAQKyUWqRNRMf65MM_O2_uzD460Y9H4hWhk

Locality Sensitive Hashing By Spark

https://www.evernote.com/l/AAQ9tPAfbd1NUphCmSAYeyFLumT80BbS8mw

15 Jun 2017 by fleuria

Monthly Readings #18: 2017 March

Spark GraphX源码分析: 分布式图计算

https://www.evernote.com/l/AATTemn2ag1BBYcxkyREZRGxDp481A4Uauo

Impala Concepts and Architecture

https://www.evernote.com/l/AAQENXLmPghDC5TIukKM8gExGhu7kGgEMiE

09 Apr 2017 by fleuria

Monthly Readings #17: 2017 Jan

Semi-Synchronous Replication at Facebook:

https://www.evernote.com/l/AATV7nQ9kf9OQbS0BNfY2IHjdu99iM4_GXM

The Architecture of Schemaless, Uber Engineering’s Trip Datastore Using MySQL

https://www.evernote.com/l/AASmU-mwfQxJ3YWeogHM2yzFwsbn_nWKWZc

Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL

https://www.evernote.com/l/AAQlAIyDfWVOoqlnk6SCipe9lmJP3z6QElY

Why Uber Engineering Switched from Postgres to MySQL

https://www.evernote.com/l/AAT4duPitB1GA4gpjjx1DX_vYa3wezBXNbY

Sharding Pinterest: How we scaled our MySQL fleet

https://www.evernote.com/l/AATkU-5xjDJPT4WXSk-indgUaqdCwK8HFI4

MySQL Backup in Facebook

https://www.evernote.com/l/AAQEbbfOhr5ASYpl4Dkhdlz7mtD-qKrVjls

If You Must Deploy Multi-Master Replication, Read This First

https://www.evernote.com/l/AAQxpve8CxVL9bL69Bi5DSWJGj4arRJvD9w

Best Practices for Amazon RDS

https://www.evernote.com/l/AATzBLB_6AZN1b8BKEkMTbvgi0dPQmvwH04

gh-ost: Triggerless design

https://www.evernote.com/l/AARj67QKJqlBlIfudgyiAgRfoKGtQZrDxe4

gh-ost: Sub-second replication lag throttling

https://www.evernote.com/l/AARjhPwJHbNAEqQ6kUmV1NFlVioDqFJPrF8

31 Jan 2017 by fleuria

Monthly Readings #16: December

Apache Hadoop Goes Realtime at Facebook:

https://www.evernote.com/l/AAQKwpeL_xFHKaqhsBCed9Bhlz2f9mjnP_E

gh-ost:

https://www.evernote.com/l/AATAfa7ydpNB_or2vZ-jW9a8Ps-A2d5fm1Q

LVS: How virtual server works?:

https://www.evernote.com/l/AATi2DGeTs9Nc7tQY_pYMhElH7-RF6JvWug

Ten years of KVM:

https://www.evernote.com/l/AASTEQ5KkdtEZ6fnFIO8QF0RSrHHIuroEtw

01 Jan 2017 by fleuria

Monthly Readings #15: November

Druid: A Real-time Analytical Data Store

https://www.evernote.com/l/AASFinGZRzVKJ6I5orCqFQ8Uxu52rHbL8Ug

Error Handling in Node.js

https://www.evernote.com/l/AATLkkYbySBBnbejRGJCKkfhjjA9HqOv8yQ

Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform

https://www.evernote.com/l/AATPH3DPMjtJgYf4zGfDxOho7Qnm4tnSWmo

Making “Push on Green” a Reality

https://www.evernote.com/l/AATrdAOKFDdA9a7a9bE0EFk1pQnpj21oFd4

Big Data in Real-Time at Twitter

https://www.evernote.com/l/AAQfykd2fw1IOJZMZYFZK4RVVbLKmr-SDag

04 Dec 2016 by fleuria

Monthly Readings #14: October

FollowFeed: LinkedIn’s Feed Made Faster and Smarter

https://www.evernote.com/l/AARahELmw89Pw78UcnP_H5M0-dGdOlDxNCk

Design Decisions For Scaling Your High Traffic Feeds

https://www.evernote.com/l/AARsZ-atKkZHppDeskPAHfdR0kVo6NUKdFs

Etsy Activity Feeds Architecture

https://www.evernote.com/l/AARhu-dZlMFAR4yN0qevIrtyf62LDX6qJ8A

What I Wish I Had Known Before Scaling Uber to 1000 Services

https://www.evernote.com/l/AASoq1-krsdI2oKJhD8f5V-gd_N2yUCO9Ig

Wasting Time TDDing The Wrong Things

https://www.evernote.com/l/AATqMOkwCS5HB6ydLcL8UZXHX_AzsEAjW3E

There is not Fork: an Abstraction for Efficient, Concurrent, and Concise Data Access

https://www.evernote.com/l/AAQJF0wQWKxCLaLAoqVMhQXyJkneH1SY-0o

dataloader.js

https://www.evernote.com/l/AASdRH_2b0NGq7od1OxCMpP9Nc6yjks2YZU

Comparing Redux and Relay

https://www.evernote.com/l/AASmZ_MkGtxF1b7bmrwfYX4ddyAG80n0Cs8

29 Oct 2016 by fleuria


... MORE