[Python] Removing consecutive duplicates from a list while preserving order

루ㅌ 2020. 2. 11. 18:25

https://exmemory.tistory.com/4?category=791023

[Python] Removing duplicates from a list while preserving order

To remove duplicates from a list without preserving order, you use list(set(list())) as above. Sometimes, though, the order has to be preserved, and in that case the code below removes the duplicates! def ordered_un..

exmemory.tistory.com

The post linked above, 'Removing duplicates from a list while preserving order', took a list such as

x = ['a', 'a', 'b', 'b', 'c', 'c']

and removed the duplicates while preserving order, giving ['a', 'b', 'c'].
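The linked post's own code is cut off in the preview, so the snippet below is not that code. It is only a minimal sketch of the same idea, assuming Python 3.7+ where dict keys keep insertion order; the helper name ordered_dedup is made up for illustration.

```python
def ordered_dedup(items):
    # Hypothetical helper (not the linked post's function).
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return list(dict.fromkeys(items))


x = ['a', 'a', 'b', 'b', 'c', 'c']
print(ordered_dedup(x))  # ['a', 'b', 'c']

# Every later occurrence of a value is dropped, adjacent or not.
print(ordered_dedup(['a', 'b', 'a', 'c', 'a']))  # ['a', 'b', 'c']
```

Note that this kind of deduplication works on single values across the whole list, which matters for the spam case discussed next.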

While tokenizing a large volume of text data, I needed to remove the tokens produced by spam posts that repeat the same text over and over.

Example: ['말', '어', '어', '엉', '더', '어', '어', '엉', '말', '어', '어', '엉', '말', '어', '어', '엉', '더', '어', '어', '엉', '말', '어', '어', '엉', '말', '어', '어', '엉', '더', '어', '어', '엉', '말', '어', '어', '엉', '말', '어', '어', '엉', '더', '어', '어', '엉']

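The code above deduplicates individual values; it does not target a sequence of several values that repeats back to back. To collapse a list such as

x = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']

into ['a', 'b', 'c'] by removing the consecutively repeated sequences, I wrote new code.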

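def find_repeat(find_list, input_list, mem_len):
    # Compare only as many trailing items of find_list as input_list has.
    # (mem_len is kept in the signature but is not used here.)
    if len(input_list) != len(find_list):
        find_list = find_list[-len(input_list):]

    # Return the shortest length at which the tail of find_list
    # equals the head of input_list.
    for index in range(1, len(input_list) + 1):
        if find_list[len(input_list) - index:] == input_list[:index]:
            return index
    return -1  # no repeat found

def remove_repeated_word(input_list, mem_len=4):
    # The first mem_len items are copied as-is; checking starts after them.
    return_list = input_list[:mem_len]
    check_index = mem_len
    for index, data in enumerate(input_list):
        if index < check_index:
            continue  # already copied, or part of a repeat being skipped
        # Does the upcoming window repeat the tail of what has been kept so far?
        find_data = find_repeat(return_list[-mem_len:], input_list[index:index + mem_len], mem_len)
        if find_data != -1:
            check_index = index + find_data  # skip the repeated items
        else:
            return_list.append(data)
    return return_list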

The function remove_repeated_word() takes the list to clean up and an argument called mem_len, which sets the maximum length of a repeated sequence that can be removed.

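For example, with mem_len set to 3, repeated sequences of up to 3 consecutive values can be removed, whereas with mem_len set to 2 nothing is removed from the x above, because its repeating unit is 3 values long.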

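Conversely, with mem_len set to 10, the values in the initial x[:mem_len] are not deduplicated, because the first mem_len items are copied to the result without being checked.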
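The three mem_len cases described above can be checked with a short script. This is only a sketch: the two functions are repeated from the post in condensed form so that the snippet runs on its own, and the sample list is the x from earlier.

```python
def find_repeat(find_list, input_list, mem_len):
    # Compare only as many trailing items as input_list has.
    if len(input_list) != len(find_list):
        find_list = find_list[-len(input_list):]
    for index in range(1, len(input_list) + 1):
        if find_list[len(input_list) - index:] == input_list[:index]:
            return index
    return -1


def remove_repeated_word(input_list, mem_len=4):
    return_list = input_list[:mem_len]
    check_index = mem_len
    for index, data in enumerate(input_list):
        if index < check_index:
            continue
        window = input_list[index:index + mem_len]
        find_data = find_repeat(return_list[-mem_len:], window, mem_len)
        if find_data != -1:
            check_index = index + find_data
        else:
            return_list.append(data)
    return return_list


x = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']

# mem_len matches the length of the repeating unit: repeats are removed.
print(remove_repeated_word(x, mem_len=3))  # ['a', 'b', 'c']

# mem_len shorter than the repeating unit: nothing is removed.
print(remove_repeated_word(x, mem_len=2) == x)  # True

# mem_len longer than the list: everything is copied unchecked.
print(remove_repeated_word(x, mem_len=10) == x)  # True
```

In practice this suggests choosing mem_len close to the length of the repeating unit you expect in the data, since values that are too small or too large both leave repeats in place.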