
PDF metadata: what is PDF metadata and what properties does it have?


What is metadata?

Metadata is data that describes a file rather than forming part of its visible content. It can be found in practically any digital file: text documents, spreadsheets, photos, videos, audio recordings and so on.

In office documents, metadata can record who created the document, who modified it, who last accessed it and on which dates, how long it was edited, the device or software used to create it, and the company and department to which it belongs.

The main reason for creating metadata is to facilitate the search for relevant information using various search criteria. Metadata can help organise electronic documents, facilitate interoperability between organisations, provide digital identification and support document lifecycle management.

Metadata is usually hidden: it is not visible under the standard settings of the application used to work on the file. Viewing it requires a specific configuration or even dedicated software that reveals this hidden data.

Office documents may also contain, in addition to metadata, other information hidden in the content itself: text and objects formatted as invisible, data outside the document viewing area, or comments and tracked revisions together with the identity of whoever made each of them. This is often referred to as Hidden Information or Hidden Data. Because it is not visible to the naked eye, users may not be aware that it exists, which makes it a risk if the document is distributed to people outside the organisation.

Therefore, as the number of channels through which files are shared keeps growing, individuals and organisations should establish measures to protect their private and confidential information. These measures should include procedures and tools for reviewing and cleaning documents and files, to minimise the risk of sensitive information being revealed through metadata or hidden data.

 

Metadata in office documents and PDFs

Office documents and PDFs carry embedded metadata in their document properties. This metadata includes information such as: title, subject, comments, tags, author, creation and modification dates, date of printing, last user who modified the document, editing time, statistics, etc.

Metadata in document properties can be standard metadata (fields predefined by the application), filled in automatically by the application or manually by the user or the organisation. It can also be custom metadata: fields that the user or organisation defines and fills in.
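As an illustration of how accessible these properties are, here is a minimal sketch that lists the standard properties of a Microsoft Office file (.docx, .xlsx, .pptx). It assumes the usual package layout, where the standard properties live in a part named docProps/core.xml inside the ZIP container. The function name read_core_properties is our own, not part of any product or library.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the standard (core) properties sit at this conventional path.
CORE_PROPS_PATH = "docProps/core.xml"


def read_core_properties(source):
    """Return the standard properties of an Office Open XML file as a dict.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        if CORE_PROPS_PATH not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read(CORE_PROPS_PATH))

    properties = {}
    for child in root:
        # Tags look like "{namespace}creator"; keep only the local name.
        name = child.tag.rsplit("}", 1)[-1]
        properties[name] = (child.text or "").strip()
    return properties
```

Calling read_core_properties("report.docx") on a typical file returns entries such as creator and lastModifiedBy, which are exactly the fields discussed in the risks section.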

These documents, in addition to containing metadata in their properties, may have more specific metadata associated with them in various formats (XMP, RDF, etc.), either embedded in the document or separate from it (e.g. in separate files called “sidecar files”).
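To make the XMP case concrete, the following sketch looks for an embedded XMP packet in the raw bytes of a PDF and pulls out a few common properties. It rests on two assumptions: the packet is stored uncompressed and unencrypted, and it is wrapped in an x:xmpmeta element. The helper names are hypothetical, and a real tool would use a proper PDF parser instead of a byte search.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XMP = "http://ns.adobe.com/xap/1.0/"
PDF = "http://ns.adobe.com/pdf/1.3/"

# Assumption: the packet is plain text wrapped in <x:xmpmeta> ... </x:xmpmeta>.
XMP_PACKET = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def extract_xmp(data):
    """Return the first XMP packet found in `data` as an XML element, or None."""
    match = XMP_PACKET.search(data)
    return ET.fromstring(match.group(0)) if match else None


def simple_property(root, namespace, name):
    """Read a simple XMP property, written either as an attribute or an element."""
    qname = f"{{{namespace}}}{name}"
    for description in root.iter(f"{{{RDF}}}Description"):
        if qname in description.attrib:
            return description.attrib[qname]
        child = description.find(qname)
        if child is not None and child.text:
            return child.text.strip()
    return None


def summarise_xmp(root):
    """Collect the authors and the software names recorded in the packet."""
    creators = [
        item.text.strip()
        for creator in root.iter(f"{{{DC}}}creator")
        for item in creator.iter(f"{{{RDF}}}li")
        if item.text
    ]
    return {
        "creators": creators,
        "creator_tool": simple_property(root, XMP, "CreatorTool"),
        "producer": simple_property(root, PDF, "Producer"),
    }
```

The same two functions work on a sidecar file, since a sidecar contains the same kind of packet as a standalone document.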

 

Risks and Threats

Metadata is a source of risk, as it may contain sensitive information that should not be disclosed to people outside the organisation. It is therefore necessary for organisations and users to be aware of the risk posed by the leakage of such sensitive information, such as customer data, intellectual property, financial details or any other information that would be inconvenient for the organisation to disclose.

The following figure shows an example of the impact that could be caused by exposing certain information stored in the metadata of a document.

 

Figure 1. Example of a document revealing sensitive information through the metadata of the document properties.

 

As can be seen in the figure, the implications and severity of the risk vary depending on the type of information that can be disclosed or deduced. At best, it will only damage the reputation of the organisation (for example, if a customer deduces that the document they received was copied from another document). At worst, it could lead to invalid contracts, litigation, penalties or serious damage to the organisation.

Social engineering, in the context of information security, can be defined as the art of finding out sensitive information and/or manipulating individuals into performing certain actions, resulting in a breach of the organisation's security.

It uses a multitude of methods and techniques, and metadata and hidden data are a very useful resource for it, as a large amount of valuable information about the organisation can be extracted relatively easily for use in subsequent attacks.

In office and PDF documents, metadata and hidden data may contain information such as the name, initials or even username of whoever created or modified the document, the name of the computer, its operating system and the application that created the document, e-mail addresses, etc. This data could be used in different ways:

  • Employee names, complemented by a search on social networks (e.g. LinkedIn), can yield a whole list of the organisation's employees, their positions and even their e-mail addresses, which can then be used for phishing attacks.
  • The operating system and applications used on the organisation's computers reveal its technological environment and allow more effective targeted attacks.
  • Usernames make it possible to deduce the naming convention used in the organisation and to compose e-mail addresses for phishing or brute-force attacks.
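From the defender's side, the same fields can be checked before a document leaves the organisation. The sketch below takes a dictionary of metadata that has already been extracted and flags the entries that match the risks listed above. The field lists and the function name flag_sensitive are illustrative assumptions, not a complete policy.

```python
import re

# Illustrative assumptions: adjust both sets to the formats you actually handle.
IDENTITY_FIELDS = {"author", "lastmodifiedby", "manager"}
SOFTWARE_FIELDS = {"producer", "creatortool", "application"}

# Deliberately simple pattern; it is not a full e-mail address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def flag_sensitive(metadata):
    """Return (field, reason) pairs for non-empty metadata worth reviewing."""
    findings = []
    for field, value in metadata.items():
        if not value:
            continue
        key = field.lower()
        if key in IDENTITY_FIELDS:
            findings.append((field, "identifies a person or account"))
        if key in SOFTWARE_FIELDS:
            findings.append((field, "reveals software in use"))
        if EMAIL_RE.search(str(value)):
            findings.append((field, "contains an e-mail address"))
    return findings
```

A non-empty result does not mean the document is unsafe. It only tells the reviewer which fields deserve a second look before the file is shared.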

Below is a table of some of the metadata and hidden data that may be present in documents and their associated risks.

 

[Table: types of metadata and hidden data that may be present in documents and their associated risks]

 

The use of automatic metadata inspection and deletion tools can bring great benefits to the organisation:

  • Reduced risk, by automatically scrubbing metadata from documents before they can be distributed outside the organisation, which avoids financial costs and reputational damage.
  • Increased security, by preventing the disclosure of private or sensitive information.
  • Time savings, as the tools work automatically and avoid repeating the steps involved in cleaning documents manually.
  • Compliance with rules and regulations, and with the organisation's Document Management Policy.
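To give an idea of what automatic scrubbing involves, here is a minimal sketch that blanks selected standard properties of a Microsoft Office file while copying every other part unchanged. It is an assumption-laden illustration and not how MetaClean works: it only touches docProps/core.xml, and it leaves comments, revisions and other hidden data in place. The name scrub_core_properties and the list of fields are our own choices.

```python
import zipfile
import xml.etree.ElementTree as ET

CORE_PROPS_PATH = "docProps/core.xml"
FIELDS_TO_CLEAR = {"creator", "lastModifiedBy", "title", "subject", "keywords", "description"}

# Keep the usual prefixes when the XML is written back, because attribute
# values such as xsi:type="dcterms:W3CDTF" refer to them by name.
ET.register_namespace("cp", "http://schemas.openxmlformats.org/package/2006/metadata/core-properties")
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")
ET.register_namespace("dcterms", "http://purl.org/dc/terms/")
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")


def scrub_core_properties(source, target):
    """Copy the package from `source` to `target`, emptying selected properties.

    Both arguments may be file paths or binary file-like objects.
    """
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PROPS_PATH:
                root = ET.fromstring(data)
                for child in root:
                    if child.tag.rsplit("}", 1)[-1] in FIELDS_TO_CLEAR:
                        child.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item, data)
```

Writing to a separate target, rather than editing the file in place, keeps the original available in case the cleaned copy needs to be compared or regenerated.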

MetaClean offers several solutions for automatic metadata processing, for web and file servers as well as for e-mail clients (Outlook).

MetaClean Sync detects in real time when a file has been created or modified and applies the established metadata policy to it.

It runs as a background service to automatically monitor selected disk drives for the specified file types (Microsoft Word, Excel, PDF, etc.), all transparently and without user intervention.
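The general idea of watching folders for new or changed documents can be sketched with simple polling. This is only an illustration of the concept under our own assumptions, not a description of how MetaClean Sync is implemented: a production service would more likely rely on the operating system's file change notifications. The function name and the extension list are hypothetical.

```python
from pathlib import Path

# Assumption: the file types to watch are identified by extension alone.
WATCHED_EXTENSIONS = {".docx", ".xlsx", ".pdf"}


def find_changed_files(root, last_seen):
    """Return watched files under `root` that are new or modified.

    `last_seen` maps each path to the modification time recorded on the
    previous scan and is updated in place.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in WATCHED_EXTENSIONS:
            continue
        modified = path.stat().st_mtime_ns
        if last_seen.get(path) != modified:
            last_seen[path] = modified
            changed.append(path)
    return changed
```

A background loop would call this function at a fixed interval and hand each returned path to whatever cleaning step the metadata policy prescribes.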

MetaClean Sync is positioned as a comprehensive solution for preventing the leakage of sensitive information when documents are shared, since all files are sanitised before they are shared by any of the available means (email, social networks, web/FTP servers, etc.).

REFERENCES