TinyURL

Leave a comment

April 29, 2017 by oneOokay

设计一个TinyURL, system design相关.

TinyURL is a URL shortening service where you enter a URL such as https://leetcode.com/problems/design-tinyurl and it returns a short URL such as http://tinyurl.com/4e9iAk. unique key/identifier: 6 digit.

http://tinyurl.com/是固定的,然后把你的original long url以某种方式转化成6 digit字符.

首先这样一个base62的identifier:一共可以存:62^6个url. 如果run out of urls,那么就可以变成7位digit.

以何种方式存储以及以何种方式转化是最开始的两个问题.

从最简单的implementation不涉及数据库开始以及不限制identifier的format开始:

最简单的是只用一个array string[]来存: http://tinyurl.com/0;http://tinyurl.com/1;…

  • pros:
    • 如果一个long url已经存过了,没办法知道它,只能继续把它存到array里面,浪费空间
    • 人可以知道我已经存了多少个url
    • 人可以刷到自己想要的数字:666
    • 能够存的url的数目很少:比如6个digit的才能存999999个,比起62^6少太多.
  • cons:超简单的encode和decode.

接着就是用HashMap了<String, String>: <identifier, longUrl>:需要一个mechanism来根据longUrl来产生一个unique identifier. 就想到了hashing.

需要两个hashMap, 一个来存<identifier,longUrl>,另一个存:<longUrl,identifier>这样如果在同样的longUrl进行非初次encoding请求的时候,能够直接返回已经存在的identifier.

如何hashing:(简单的implementation)

  • 一个charset string, 取一个random的6 digit string(要会写这个代码).
    • charset string包含所有可以用来encoding的chars
    • 产生Random数的方法:
      • Random rand = new Random(); rand.nextInt(int num); 会产生0 – (num-1)之间的一个random int数字.
      • (int)Math.random() % charset.length(): Math.random()产生的一个double,所以要转化.
  • 问题:hash collision: 重复generate identifiers直到找到一个没有用过的.
  • 可以由一串不重复的数字:可以是sequential numbers 或者request的timestamp + string.hashcode(). 把这串数字转换成base62的string来得到identifier.这样就不会有collision.
  • 如何取得baseUrl之后的identifiers:就会删掉baseUrl.
    • shortUrl.replace(baseUrl, "")

代码:

    private HashMap<String, String> identifierMap = new HashMap<>();
    private HashMap<String, String> longUrlMap = new HashMap<>();
    static String baseUrl = "http://tinyurl.com/";
    static String charSet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // Encodes a URL to a shortened URL.
    public String encode(String longUrl) {
        if (longUrlMap.containsKey(longUrl)) return baseUrl + longUrlMap.get(longUrl);
        
        String key = null;
        do {
            StringBuilder tinyUrl = new StringBuilder();
            for(int i = 0; i < 6; i ++){
                char c = charSet.charAt((int)(Math.random() % charSet.length()));
                tinyUrl.append(c);
            }
            key = tinyUrl.toString();
        }
        while (identifierMap.containsKey(key)); //while这里要一个分号
        identifierMap.put(key, longUrl);
        longUrlMap.put(longUrl, key);
        return baseUrl + key;
    }

    // Decodes a shortened URL to its original URL.
    public String decode(String shortUrl) {
        String key = shortUrl.replace(baseUrl, "");
        return identifierMap.containsKey(key)? identifierMap.get(key) : "";
    }

接下来进入Industry了,存数据库.

flat file database:不够,没有index,读写太慢. 每次读写都要读写整个file.

  • A flat file database is a database which is stored on its host computer system as an ordinary unstructured file called a “flat file”. To access the structure of the data and manipulate it, the file must be read in its entirety into the computer’s memory. Upon completion of the database operations, the file is again written out in its entirety to the host’s file system. In this stored mode the database is said to be “flat”, meaning that it has no structure for indexing and there are usually no structural relationships between the records. A flat file can be a plain text file or a binary file.
  • Flat file databases are most often used in a “transactional” nature and when entire file processing is required, where Relational Databases are generally found in data warehousing implementations where direct record access is essential.

所以还是relational database. memory比较便宜….

table: id|identifier|longUrl|count|lastVisitedAt

  1. 业界一般都是increment id,把id用base36或者base62转换成identifier
  2. 用MD5或者SHA256对longURL进行hashing得到一串数字,把数字用base36或者base62转化成identifier
  3. 用一个standalone key generation service(KGS) generates random six letter strings beforehand and stores them in a database. Whenever we want to shorten a URL, we will just take one of the already generated keys and use it. This approach will make things quite simple and fast since we will not be encoding the URL or worrying about duplications or collisions. KGS will make sure all the keys inserted in our keys database are unique.

这个会是一个read-heavy的system. bottle neck在读的peak的时候…(?)

在一个病毒性传播的tinyUrl有N多人点开的时候,我们就需要cache了.

  • How would you scale the service? For example, a viral link which is shared in social media could result in a peak QPS at a moment’s notice.
  • How could you handle redundancy? i,e, if a server is down, how could you ensure the service is still operational?
  • Keep URLs forever or prune, pros/cons? How we do pruning? 
  • What API would you provide to a third-party developer? 
  • If you can enable caching, what would you cache and what’s the expiry time? 

这五个绝对是经典的system design 问题了.不过我不记得了=.=要看一些design twitter回头再来看这个.

Estimate the maximum number of URLs a single machine can store.

Assume each row in the database consists of an ID (4 bytes), long url (2090 bytes)=2kb, short url identifier (6 bytes), so each row is about 2100 bytes. That means 10 million URLs take about 21GB. Assume a machine has ~100GB of storage => ~50 million URLs per machine. That’s a lot of URLs :)can also combines timestamp

怎么估算QPS?

301 redirect:

  • A 301 redirect is a method of telling web browsers and search engines that a web page or site has been permanently moved to a new location. Usually a 301 redirect includes the address to which the resource has been moved. Web browsers will typically follow 301 redirects to the new location automatically, without the need for user action.Branded Link Shortener
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: