187. Repeated DNA Sequences

Description

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

  • For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA , it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence , return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Constraints:

  • 1 <= s.length <= 10^5

  • s[i] is either 'A', 'C', 'G', or 'T'.

Tags

Hash Table, Bit Manipulation

Solution

We first encode each type of nucleotide into an interger between 0 (00) and 3 (11). Then, we slide a window with length of 10 over the given string and use a hash table to check duplicates. The encoding precedures are as follows:

  • Map 'A', 'C', 'G', and 'T' into 0, 1, 2 and 3, respectively, and use an integer bitmask to represent the substring a.k.a the sliding window in binary form;

  • Left shift bitmask by 2 bits, since we are using 00~11 to represent each type of nucleotide;

  • Add the incoming nucleotide (binary form) onto the bitmask;

  • Clear the highest 2 bits by & (1111 1111 1111 1111 1111).

Complexity

  • Time complexity: O(nL)O(n)O(n-L) \sim O(n)

  • Space complexity: O(nL)O(n)O(n-L) \sim O(n)

Code

Reference

Last updated

Was this helpful?