Most cloud providers offer services for managing Personally Identifiable Information (PII), but they come at a cost, especially when a massive volume of data needs to be scanned. In this article, we will go through identifying PII using Presidio, an open-source library created by Microsoft and released under the MIT License. Another use case is to run Presidio’s custom models alongside other services that do not support custom models.

  1. Installing Presidio
  2. Using built-in recogniser
  3. Using Regex recogniser
  4. Bringing it all together
  5. Conclusion

Installing Presidio

Presidio works with 3 NLP engines: spaCy, transformers and stanza. When installing Presidio, we can select one of these engines depending on the use case. For this demonstration, we will use transformers, which also relies on spaCy for embeddings and processing.

pip install "presidio_analyzer[transformers]"
pip install presidio_anonymizer
python -m spacy download en_core_web_sm
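
Before moving on, it can help to confirm that the packages are visible to the interpreter. The snippet below is a minimal sketch using only the standard library; the helper name `check_installation` is hypothetical, and the module names are assumed to match what the commands above install.

```python
from importlib.util import find_spec

# Top-level module names the install commands above are assumed to provide.
REQUIRED_MODULES = ["presidio_analyzer", "presidio_anonymizer", "spacy", "en_core_web_sm"]


def check_installation(modules=REQUIRED_MODULES):
    """Return a {module_name: is_importable} mapping without importing anything."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_installation().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any module reported as missing points to an install step that needs to be re-run in the active environment.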

Using built-in recogniser

Presidio has a module called AnalyzerEngine which does the heavy lifting for you. It takes 3 major inputs to get started:
i. text – text to analyse
ii. entities – an array of entities to find in the input string
iii. language – language of the input text (for example, 'en')

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

results = analyzer.analyze(text="I am Yash Mehta, you can reach out to me on 0400000000",
                           entities=["PERSON", "PHONE_NUMBER"],
                           language='en')
print(results)
----------------------------------------------------------------------------

python3 main.py
[type: PERSON, start: 5, end: 15, score: 0.85, type: PHONE_NUMBER, start: 44, end: 54, score: 0.4]

The output contains all the identified entities, position in the string, and confidence.
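
To make the positions and scores concrete, here is a minimal sketch that maps the offsets back to the original string and filters by confidence. It does not call Presidio: `Finding` is a hypothetical stand-in that mirrors the fields printed above (Presidio's result objects expose the same attribute names), and the values are copied from that output.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields printed above.
Finding = namedtuple("Finding", ["entity_type", "start", "end", "score"])

text = "I am Yash Mehta, you can reach out to me on 0400000000"
findings = [
    Finding("PERSON", 5, 15, 0.85),
    Finding("PHONE_NUMBER", 44, 54, 0.4),
]


def matched_text(text, finding):
    """Slice the original string using the start/end offsets (end is exclusive)."""
    return text[finding.start:finding.end]


def confident(findings, threshold):
    """Keep only findings whose score is at least the threshold."""
    return [f for f in findings if f.score >= threshold]


for f in findings:
    print(f.entity_type, repr(matched_text(text, f)), f.score)
```

With a threshold of 0.5, only the PERSON finding would be kept, which is worth bearing in mind when a low-scoring phone number still needs to be masked.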

Using Regex recogniser

Presidio provides options to embed regex matching within the framework using PatternRecognizer. The output looks similar to that of AnalyzerEngine.

from presidio_analyzer import Pattern, PatternRecognizer

numbers_pattern = Pattern(name="numbers_pattern", 
                          regex=r"(?:\+?(61))? ?(?:\((?=.*\)))?(0?[2-57-8])\)? ?(\d\d(?:[- ](?=\d{3})|(?!\d\d[- ]?\d[- ]))\d\d[- ]?\d[- ]?\d{3})$", 
                          score=0.5
                )

# Define the recognizer with one or more patterns
number_recognizer = PatternRecognizer(
    supported_entity="NUMBER", patterns=[numbers_pattern]
)

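# Same sample sentence as in the previous section
text = "I am Yash Mehta, you can reach out to me on 0400000000"

numbers_result = number_recognizer.analyze(text=text, entities=["NUMBER"])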

print(numbers_result)
----------------------------------------------------------------------------

python3 main.py
[type: NUMBER, start: 43, end: 54, score: 0.5]
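
The match starts at 43 rather than 44 because the pattern begins with an optional space, so the space before the number becomes part of the match. The pattern also ends with `$`, so it only finds a number at the very end of the text. A minimal sketch with Python's standard `re` module (not Presidio itself, which may compile the pattern with different flags) illustrates both points; the second sample sentence is made up for illustration.

```python
import re

# Same pattern as in the recogniser above.
PHONE_REGEX = (
    r"(?:\+?(61))? ?(?:\((?=.*\)))?(0?[2-57-8])\)? ?"
    r"(\d\d(?:[- ](?=\d{3})|(?!\d\d[- ]?\d[- ]))\d\d[- ]?\d[- ]?\d{3})$"
)

text = "I am Yash Mehta, you can reach out to me on 0400000000"

match = re.search(PHONE_REGEX, text)
print(match.span())            # offsets of the match, including the leading space
print(repr(match.group(0)))

# Hypothetical sentence where the number is not at the end of the text:
# the trailing "$" anchor prevents a match.
mid_sentence = re.search(PHONE_REGEX, "Call 0400000000 today")
print(mid_sentence)
```

If numbers in the middle of a sentence need to be caught, the trailing `$` would have to be removed or replaced, for example with a word boundary.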

Bringing it all together

Now we will look at how to de-identify the results we obtained in the previous steps. In this section, we will combine both types of recognisers in one program and then feed the results into AnonymizerEngine.

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer import Pattern, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

text = "I am Yash Mehta, you can reach out to me on 0400000000"

# Initialising built-in analyzer 
analyzer = AnalyzerEngine()

# Initialising anonymizer engine
anonymizer_engine = AnonymizerEngine()

results = analyzer.analyze(text=text,
                           entities=["PERSON"],
                           language='en')

# Initializing pattern 
numbers_pattern = Pattern(name="numbers_pattern", 
                          regex=r"(?:\+?(61))? ?(?:\((?=.*\)))?(0?[2-57-8])\)? ?(\d\d(?:[- ](?=\d{3})|(?!\d\d[- ]?\d[- ]))\d\d[- ]?\d[- ]?\d{3})$", 
                          score=0.5
                )

# Initialize the recognizer with one or more patterns
number_recognizer = PatternRecognizer(
    supported_entity="NUMBER", patterns=[numbers_pattern]
)

numbers_result = number_recognizer.analyze(text=text, entities=["NUMBER"])

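# No operators are passed here, so the anonymizer falls back to its default:
# each entity is replaced with its type in angle brackets, e.g. <PERSON>.
# Presidio's catch-all operator key is "DEFAULT", not "ALL"; to substitute a
# fixed value instead, pass
# operators={"DEFAULT": OperatorConfig("replace", {"new_value": "BIP"})}.
deidentified_text = anonymizer_engine.anonymize(
    text=text,
    analyzer_results=results + numbers_result,
)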

print(deidentified_text.text)
----------------------------------------------------------------------------

python3 main.py
I am <PERSON>, you can reach out to me on<NUMBER>
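
The space before `<NUMBER>` is missing because the regex match started at offset 43 and so included that space. The replacement step itself can be sketched in plain Python as below. This is an illustration of the idea under stated assumptions, not Presidio's implementation: `replace_spans` is a hypothetical helper, and the spans are copied from the outputs shown earlier.

```python
text = "I am Yash Mehta, you can reach out to me on 0400000000"

# (entity_type, start, end) spans copied from the earlier outputs.
spans = [("PERSON", 5, 15), ("NUMBER", 43, 54)]


def replace_spans(text, spans):
    """Replace each span with <ENTITY_TYPE>, working right to left."""
    # Starting from the end of the string keeps the earlier offsets valid,
    # because replacements never shift text that is still to be processed.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


print(replace_spans(text, spans))
```

Processing the spans left to right would also work, but every offset after the first replacement would then need adjusting by the change in length.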

Conclusion

In this article, we built a PII de-identification pipeline and looked at how Presidio's analyzer and anonymizer engines work. When budget is the primary concern in de-identifying PII data, tools like Presidio help keep costs down.

For more information about Presidio, visit https://microsoft.github.io/presidio/

