187. Repeated DNA Sequences

Description

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

  • For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA , it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence , return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Constraints:

  • 1 <= s.length <= 10^5

  • s[i] is either 'A', 'C', 'G', or 'T'.

Tags

Hash Table, Bit Manipulation

Solution

We first encode each type of nucleotide into an interger between 0 (00) and 3 (11). Then, we slide a window with length of 10 over the given string and use a hash table to check duplicates. The encoding precedures are as follows:

  • Map 'A', 'C', 'G', and 'T' into 0, 1, 2 and 3, respectively, and use an integer bitmask to represent the substring a.k.a the sliding window in binary form;

  • Left shift bitmask by 2 bits, since we are using 00~11 to represent each type of nucleotide;

  • Add the incoming nucleotide (binary form) onto the bitmask;

  • Clear the highest 2 bits by & (1111 1111 1111 1111 1111).

Complexity

  • Time complexity: O(nL)O(n)O(n-L) \sim O(n)

  • Space complexity: O(nL)O(n)O(n-L) \sim O(n)

Code

func findRepeatedDnaSequences(s string) []string {
	var ans []string
	bitSet, dict, L, bitmask := map[int]int{}, map[byte]int{'A': 0, 'T': 1, 'C': 2, 'G': 3}, 10, 0
	for i := 0; i < len(s); i++ {
		bitmask = bitmask<<2 | dict[s[i]]
		bitmask &= (1<<32 - 1) >> (32 - L*2) // 1111 1111 1111 1111 1111
		if i >= L-1 {                        // start from s[0:10)
			if bitSet[bitmask] == 1 {
				ans = append(ans, s[i-L+1:i+1])
			}
			bitSet[bitmask]++
		}
	}
	return ans
}

Reference

Last updated

Was this helpful?