구현을 통한 Bloom Filter 알아보기 with Python3

Study/Data Structure 2020. 6. 11. 22:39

Bloom Filter를 구현하기에 앞서 Bloom Filter의 정의를 보겠습니다.

위키피디아에 따르면 "Bloom Filter는 원소가 집합에 속하는지 여부를 검사하는데 사용되는 확률적 자료구조이다.

Bloom Filter의 핵심은 바로 '확률적 자료구조'라는데 있습니다.

'블룸 필터에 의해 어떤 원소가 집합에 속한다고 판단된 경우 실제로는 원소가 집합에 속하지 않는 긍정 오류가 발생하는 것이 가능하지만, 반대로 원소가 집합에 속하지 않는 것으로 판단되었는데 실제로는 원소가 집합에 속하는 부정 오류는 절대로 발생하지 않는다는 특성이 있다'라고 합니다.

예를 들어 원소 'dog'이 집합에 속한다고 알려주지만 실제로는 'dog'이 집합의 원소가 아닐 수 있다는 것 입니다. 그리고 'cat'이라는 원소가 집합에 속하지 않는다면 절대로 집합에 포함될 일을 없다는 특징이 있다고 합니다.

블룸 필터를 구성하는 요소로 m비트 크기의 비트 배열과, k개의 해시 함수가 필요합니다.

아이템 'dog'을 필터에 추가하고 싶을 때 처음에 m비트 크기의 배열에 전부 0값이 들어가 있습니다. 'dog'을 대해

k개의 해시함수의 입력으로 넣고 k개의 해시함수가 출력하는 값을 비트 배열 인덱스에 해당하는 위치에 1로 설정합니다.

m이 10인 비트 배열이라면

k가 3개인 해시함수

h1('dog') % 10 = 1

h2('dog') % 10 = 2

h3('dog') % 10 = 3

를 통해서 나온 출력 값 1,2,3을 비트 배열에 해당하는 인덱스에 1로 설정합니다.

만일 'cat'을 해시함수에 넣고 나온 입력값도 1, 2, 3이라면 'cat'을 넣지 않았음에도 'cat'을 넣었다고 말합니다.

m을 비트 배열의 크기로, k를 해시 함수의 수로, n을 필터에 삽입 할 것으로 예상되는 요소 수로 지정하면 p의 확률은 다음과 같이 계산 될 수 있습니다.

예상되는 개수의 요소 n이 알려져 있고 원하는 확률이 p이면 비트 배열 m의 크기는 다음과 같이 계산 될 수 있습니다.

m이 비트 배열의 크기이고 n이 삽입 될 요소의 수이면 k는 다음과 같이 계산 될 수 있습니다.

hash map, trie, array, linked list를 사용하여 집합에 원소를 저장할 수 있지만 메모리 효율적이지 못합니다. 그러나 블룸 필터는 데이터 자체를 저장하는 것이 아닙니다. 어느 정도의 해시 충돌을 허용해야 메모리 공간을 줄일 수 있습니다.

블룸 필터의 해시함수는 independent하며 uniform distribution이어야 합니다. 그리고 빨라야 합니다. 빠르고 간단하면서 independent하고 암호학적 해시함수가 아닌 것으로 murmur 해시함수가 있습니다. 암호학적 해시함수는 안정적이고 좋지만 계산비용이 너무 비쌉니다. 해시함수 k가 증가할 수록 블룸필터는 느려질 것 입니다. 암호학적 해시함수를 쓰는지 않는게 좀 찝찝하긴 하지만 성능면에선 확실히 우수합니다.

아래는 Python3를 이용한 bloom filter 구현입니다.

# Python 3 program to build Bloom Filter 
# Install mmh3 and bitarray 3rd party module first 
# pip install mmh3 
# pip install bitarray 
import math 
import mmh3 
from bitarray import bitarray 
  
class BloomFilter(object): 
  
    ''' 
    Class for Bloom filter, using murmur3 hash function 
    '''
  
    def __init__(self, items_count,fp_prob): 
        ''' 
        items_count : int 
            Number of items expected to be stored in bloom filter 
        fp_prob : float 
            False Positive probability in decimal 
        '''
        # False posible probability in decimal 
        self.fp_prob = fp_prob 
  
        # Size of bit array to use 
        self.size = self.get_size(items_count,fp_prob) 
  
        # number of hash functions to use 
        self.hash_count = self.get_hash_count(self.size,items_count) 
  
        # Bit array of given size 
        self.bit_array = bitarray(self.size) 
  
        # initialize all bits as 0 
        self.bit_array.setall(0) 
  
    def add(self, item): 
        ''' 
        Add an item in the filter 
        '''
        digests = [] 
        for i in range(self.hash_count): 
  
            # create digest for given item. 
            # i work as seed to mmh3.hash() function 
            # With different seed, digest created is different 
            digest = mmh3.hash(item,i) % self.size 
            digests.append(digest) 
  
            # set the bit True in bit_array 
            self.bit_array[digest] = True
  
    def check(self, item): 
        ''' 
        Check for existence of an item in filter 
        '''
        for i in range(self.hash_count): 
            digest = mmh3.hash(item,i) % self.size 
            if self.bit_array[digest] == False: 
  
                # if any of bit is False then,its not present 
                # in filter 
                # else there is probability that it exist 
                return False
        return True
  
    @classmethod
    def get_size(self,n,p): 
        ''' 
        Return the size of bit array(m) to used using 
        following formula 
        m = -(n * lg(p)) / (lg(2)^2) 
        n : int 
            number of items expected to be stored in filter 
        p : float 
            False Positive probability in decimal 
        '''
        m = -(n * math.log(p))/(math.log(2)**2) 
        return int(m) 
  
    @classmethod
    def get_hash_count(self, m, n): 
        ''' 
        Return the hash function(k) to be used using 
        following formula 
        k = (m/n) * lg(2) 
  
        m : int 
            size of bit array 
        n : int 
            number of items expected to be stored in filter 
        '''
        k = (m/n) * math.log(2) 
        return int(k)

from bloomfilter import BloomFilter 
from random import shuffle 
  
n = 20 #no of items to add 
p = 0.05 #false positive probability 
  
bloomf = BloomFilter(n,p) 
print("Size of bit array:{}".format(bloomf.size)) 
print("False positive Probability:{}".format(bloomf.fp_prob)) 
print("Number of hash functions:{}".format(bloomf.hash_count)) 
  
# words to be added 
word_present = ['abound','abounds','abundance','abundant','accessable', 
                'bloom','blossom','bolster','bonny','bonus','bonuses', 
                'coherent','cohesive','colorful','comely','comfort', 
                'gems','generosity','generous','generously','genial'] 
  
# word not added 
word_absent = ['bluff','cheater','hate','war','humanity', 
               'racism','hurt','nuke','gloomy','facebook', 
               'geeksforgeeks','twitter'] 
  
for item in word_present: 
    bloomf.add(item) 
  
shuffle(word_present) 
shuffle(word_absent) 
  
test_words = word_present[:10] + word_absent 
shuffle(test_words) 
for word in test_words: 
    if bloomf.check(word): 
        if word in word_absent: 
            print("'{}' is a false positive!".format(word)) 
        else: 
            print("'{}' is probably present!".format(word)) 
    else: 
        print("'{}' is definitely not present!".format(word))

코드 및 내용 출처:

https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/

https://ko.wikipedia.org/wiki/%EB%B8%94%EB%A3%B8_%ED%95%84%ED%84%B0

'Study > Data Structure' 카테고리의 다른 글

[Data Structure] 자료구조 - Array (0)	2020.09.21
Trie 자료구조 (0)	2020.06.11
스택 Stack (0)	2020.04.18

ABOUT ME

DE 개발일지 DE 개발일지

'Study > Data Structure' 카테고리의 다른 글

티스토리툴바

ABOUT ME

'Study > Data Structure' 카테고리의 다른 글

관련글 관련글 더보기

티스토리툴바