Create a keyword dictionary

Data loss prevention (DLP) in Office 365 can identify, monitor, and protect your sensitive information. Identifying sensitive information sometimes requires looking for keywords, particularly when identifying generic content (such as healthcare-related communication) or inappropriate or explicit language. While you can create keyword lists in sensitive information types, keyword lists are limited in size and require modifying XML to create or edit them. Keyword dictionaries provide simpler management of keywords and at a much larger scale, supporting up to 100,000 terms per dictionary.

Basic steps to creating a keyword dictionary

The keywords for your dictionary could come from a variety of sources, most commonly from a file (such as a .csv or .txt list), from a list you enter directly in the cmdlet, or from an existing dictionary. When you create a keyword dictionary, you follow the same core steps:

  1. Connect to Security & Compliance Center PowerShell - see this topic.

  2. Define or load your keywords from your intended source - the cmdlet to create a keyword dictionary accepts a comma-separated list of keywords, so this step will vary slightly depending on where your keywords come from.

  3. Encode your keywords - once loaded, they're converted to a byte array before they're imported.

  4. Create your dictionary - choose a name and description and create your dictionary.

Create a keyword dictionary from a file

Often when you need to create a large dictionary, it's to use keywords from a file or a list exported from some other source. In this case, you'll create a keyword dictionary containing a list of inappropriate language to screen in external email. You need to first connect to the Security & Compliance Center PowerShell.

  1. Copy the keywords into a text file and make sure that each keyword is on a separate line.

  2. Save the text file with Unicode encoding. In Notepad > Save As > Encoding > Unicode.

  3. Read the file into a variable by running this cmdlet:

    $fileData = Get-Content <filename> -Encoding Byte -ReadCount 0
  4. Create the dictionary by running this cmdlet:

    New-DlpKeywordDictionary -Name <name> -Description <description> -FileData $fileData

Modifying an existing keyword dictionary

You might need to modify keywords in one of your keyword dictionaries, or modify one of the built-in dictionaries. In this example, we’ll modify some terms in PowerShell, save the terms locally where you can modify them in an editor, and then update the previous terms in place. First, retrieve the dictionary object:

$dict = Get-DlpKeywordDictionary -Name "Diseases"

Printing $dict will show the various variables. The keywords themselves are stored in an object on the backend, but $dict.KeywordDictionary contains a string representation of them, which you’ll use to modify the dictionary. Before you modify the dictionary, you need to turn the string of terms back into an array using the .split(‘,’) method. Then you’ll clean up the unwanted spaces between the keywords with the .trim() method, leaving just the keywords to work with.

$terms = $dict.KeywordDictionary.split(',').trim()

Now you'll remove some terms from the dictionary. Because the example dictionary has only a few keywords, you could just as easily skip to exporting the dictionary and editing it in Notepad, but dictionaries generally contain a large amount of text, so you’ll first learn this way to edit them easily in PowerShell.

In the last step, you saved the keywords to an array. There are several ways to remove items from an array, but as a straightforward approach, you'll create an array of the terms you want to remove from the dictionary, and then copy only the dictionary terms to it that aren’t in the list of terms to remove.

Run the command $terms to show the current list of terms. The output of the command looks like this:

aarskog's syndrome
abandonment
abasia
abderhalden-kaufmann-lignac
abdominalgia
abduction contracture
abetalipoproteinemia
abiotrophy
ablatio
ablation
ablepharia
abocclusion
abolition
aborter
abortion
abortus
aboulomania
abrami's disease

Run this command to specify the terms that you want to remove:

$termsToRemove = @('abandonment', 'ablatio')

Run this command to actually remove the terms from the list:

$updatedTerms = $terms | Where-Object{ $_ -notin $termsToRemove }

Run the command $updatedTerms to show the updated list of terms. The output of the command looks like this (the specified terms have been removed):

aarskog's syndrome
abasia
abderhalden-kaufmann-lignac
abdominalgia
abduction contracture
abetalipo proteinemia
abiotrophy
ablation
ablepharia
abocclusion
abolition
aborter
abortion
abortus
aboulomania
abrami's disease

Now save the dictionary locally and add a few more terms. You could add the terms right here in PowerShell, but you’ll still need to export the file locally to ensure it’s saved with Unicode encoding and contains the BOM.

Save the dictionary locally by running the following:

Set-Content $updatedTerms -Path "C:\myPath\terms.txt"

Now simply open the file, add your additional terms, and save with Unicode encoding (UTF-16). Now you’ll upload the updated terms and update the dictionary in place.

PS> Set-DlpKeywordDictionary -Identity "Diseases" -FileData (Get-Content -Path "C:myPath\terms.txt" -Encoding Byte -ReadCount 0)

Now the dictionary has been updated in place. Note that the Identity field takes the name of the dictionary. If you wanted to also change the name of your dictionary using the set- cmdlet, you would just need to add the -Name parameter to what’s above with your new dictionary name.

Using keyword dictionaries in custom sensitive information types and DLP policies

Keyword dictionaries can be used as part of the match requirements for a custom sensitive information type, or as a sensitive information type themselves. Both require creating a custom sensitive information type. Follow the instructions in the linked article to create a sensitive information type. Once you have the XML, you’ll need the GUID identifier for the dictionary to use it.

<Entity id="9e5382d0-1b6a-42fd-820e-44e0d3b15b6e" patternsProximity="300" recommendedConfidence="75">
	<Pattern confidenceLevel="75">
		<IdMatch idRef=". . ."/>
	</Pattern>
</Entity>

To get the identity of your dictionary, run this command and copy the Identity property value:

Get-DlpKeywordDictionary -Name "Diseases"

The output of the command looks like this:

RunspaceId        : 138e55e7-ea1e-4f7a-b824-79f2c4252255
Identity          : 8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f
Name              : Diseases
Description       : Names of diseases and injuries from ICD-10-CM lexicon
KeywordDictionary : aarskog's syndrome, abandonment, abasia, abderhalden-kaufmann-lignac, abdominalgia, abduction contracture, abetalipo
                    proteinemia, abiotrophy, ablatio, ablation, ablepharia, abocclusion, abolition, aborter, abortion, abortus, aboulomania,
                    abrami's disease, abramo
IsValid           : True
ObjectState       : Unchanged

Paste the identity into your custom sensitive information type's XML and upload it. Now your dictionary will appear in your list of sensitive information types and you can use it right in your policy, specifying how many keywords are required to match.

<Entity id="d333c6c2-5f4c-4131-9433-db3ef72a89e8" patternsProximity="300" recommendedConfidence="85">
      <Pattern confidenceLevel="85">
        <IdMatch idRef="8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f" />
      </Pattern>
    </Entity>
    <LocalizedStrings>
      <Resource idRef="d333c6c2-5f4c-4131-9433-db3ef72a89e8">
        <Name default="true" langcode="en-us">Diseases</Name>
        <Description default="true" langcode="en-us">Detects various diseases</Description>
      </Resource>
    </LocalizedStrings>
Connect with an expert
Contact us
Expand your skills
Explore training

Was this information helpful?

Thank you for your feedback!

Thank you for your feedback! It sounds like it might be helpful to connect you to one of our Office support agents.

×