CloudNine Analyst: Understanding Deduplication Options

Deduplication options for messages and items in CloudNine Analyst

CloudNine Analyst offers several deduplication options that help pare data sets down to a manageable population. However, there are instances where you may not want to deduplicate at a global level. Here is a breakdown of the options available:
  1. Do not deduplicate 
  2. Globally deduplicate (provided hash)
  3. Globally deduplicate (system hash)
  4. Deduplicate based on Import
  5. Deduplicate based on Evidence Container

Do Not Deduplicate

If this option is selected, no deduplication will occur and every item will be loaded into the platform, regardless of the deduplication settings applied to earlier imports.

Globally Deduplicate (Provided Hash)

Items are deduplicated based on the hash value provided in the load file (MD5, SHA-1, or other). Each item is compared against every item already in the designated Project and against the other items in the current import, in the order in which they are read.
 
If this option is selected, you must provide a hash value column in your load file.
This hash value will be leveraged to perform the deduplication process.
 
If this hash value is not provided, the item will be skipped (rejected) during import. 
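
The behavior described above can be sketched in a few lines of Python. This is an illustration only, not CloudNine Analyst's implementation: the column name MD5HASH, the DocID field, and the case-insensitive comparison of hash values are all assumptions made for the example.

```python
def dedupe_by_provided_hash(rows, existing_hashes, hash_column="MD5HASH"):
    """Split load-file rows into loaded, duplicate, and rejected lists.

    hash_column is a hypothetical column name; use whatever your load file defines.
    """
    seen = {h.lower() for h in existing_hashes}  # hashes already in the project
    loaded, duplicates, rejected = [], [], []
    for row in rows:  # rows are handled in the order they are read
        # Assumption: hash values are compared case-insensitively.
        value = (row.get(hash_column) or "").strip().lower()
        if not value:
            rejected.append(row)    # no hash provided: item is skipped
        elif value in seen:
            duplicates.append(row)  # matches the project or an earlier row
        else:
            seen.add(value)
            loaded.append(row)
    return loaded, duplicates, rejected


rows = [
    {"DocID": "A1", "MD5HASH": "aaa"},
    {"DocID": "A2", "MD5HASH": "AAA"},  # same hash as A1, different case
    {"DocID": "A3", "MD5HASH": ""},     # missing hash
    {"DocID": "A4", "MD5HASH": "bbb"},  # already present in the project
]
loaded, duplicates, rejected = dedupe_by_provided_hash(rows, existing_hashes={"bbb"})
print([r["DocID"] for r in loaded])      # ['A1']
print([r["DocID"] for r in duplicates])  # ['A2', 'A4']
print([r["DocID"] for r in rejected])    # ['A3']
```

Because the first occurrence in read order is the one that is kept, the order of rows in the load file determines which copy survives.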
 

Globally Deduplicate (System Hash)

Each item is hashed and deduplicated, based on the system-generated MD5 hash value, against every item already in the designated Project and against the other items in the current import, in the order in which they are read. The fields used to create the system hash value vary by metadata type.
Global deduplication of decentralized communications such as chat, SMS, and MMS may not remove every duplicate, because mobile and app-based data is stored differently depending on content storage settings. Deduplication may also have an unexpected effect on messages and threads. Deduplication looks for messages with exactly the same content, sent at exactly the same time, with the same senders and recipients. If any metadata differs between copies of a message because of local device settings, those copies may not be deduplicated.
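
To see why a small metadata difference prevents a match, consider the following sketch. It assumes a message hash built from sender, recipients, sent time, and body. The actual fields CloudNine Analyst uses for each metadata type are not listed here, so the field set, the separator, and the function name message_hash are hypothetical.

```python
import hashlib


def message_hash(sender, recipients, sent_at, body):
    """MD5 over a hypothetical field set: sender, sorted recipients, time, body."""
    parts = [sender, ";".join(sorted(recipients)), sent_at, body]
    # A separator character keeps adjacent fields from running together.
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


# The same message collected from two devices with identical metadata.
a = message_hash("+15550001", ["+15550002"], "2023-04-01T12:00:00Z", "On my way")
b = message_hash("+15550001", ["+15550002"], "2023-04-01T12:00:00Z", "On my way")

# The same message, but one device recorded the time one second later.
c = message_hash("+15550001", ["+15550002"], "2023-04-01T12:00:01Z", "On my way")

print(a == b)  # True: identical fields give identical hashes
print(a == c)  # False: a one-second difference changes the hash
```

Any hash-based comparison behaves this way: the inputs must match exactly, so two copies of a message that a reviewer would consider the same can still be treated as distinct items.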

Deduplicate Based on Import

Each item is hashed and deduplicated, based on the system-generated MD5 hash value, only against the other items in the current import, in the order in which they are read. In effect, this option limits deduplication to the records being imported.
If this option is selected, items will not be deduplicated against any previously imported items.
 

Deduplicate Based on Evidence Container

Each item is hashed and deduplicated against every item in the designated Project that originated from the designated evidence container (the drop-down selection in step 1 of your import) and against the other items in the current import, in the order in which they are read.
This option is helpful if you wish to deduplicate items within a specific device or set of devices defined by a single source of evidence (such as a custodian). It is sometimes referred to as "custodial level deduplication".
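
The difference between the global, import, and evidence container options comes down to which previously loaded items a new item is compared against. The sketch below models that idea under simplifying assumptions: each item is a dictionary with a precomputed "hash" and a "container" label, and the option names "global", "import", and "container" are hypothetical shorthand rather than product settings.

```python
def comparison_pool(existing_items, option, container=None):
    """Return the hashes of previously loaded items that new items are checked against."""
    if option == "global":
        return {item["hash"] for item in existing_items}
    if option == "import":
        return set()  # earlier imports are ignored
    if option == "container":
        return {item["hash"] for item in existing_items
                if item["container"] == container}
    raise ValueError(f"unknown option: {option}")


def dedupe(incoming, existing_items, option, container=None):
    """Keep the first occurrence of each hash, in read order."""
    seen = comparison_pool(existing_items, option, container)
    kept = []
    for item in incoming:
        if item["hash"] in seen:
            continue  # duplicate of a pooled item or of an earlier incoming item
        seen.add(item["hash"])
        kept.append(item)
    return kept


existing = [
    {"hash": "h1", "container": "Phone-A"},
    {"hash": "h2", "container": "Laptop-B"},
]
incoming = [{"hash": "h1"}, {"hash": "h2"}, {"hash": "h2"}, {"hash": "h3"}]

kept_global = [i["hash"] for i in dedupe(incoming, existing, "global")]
kept_import = [i["hash"] for i in dedupe(incoming, existing, "import")]
kept_container = [i["hash"] for i in dedupe(incoming, existing, "container", "Phone-A")]

print(kept_global)     # ['h3']
print(kept_import)     # ['h1', 'h2', 'h3']
print(kept_container)  # ['h2', 'h3']
```

In every scope, duplicates within the current import are still removed. Only the set of earlier items used for comparison changes.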