
PDF Indexing Filter for native Windows10 applications

Info

If PDF files are not being indexed in your libraries, check that the correct Windows10 PDF filter is registered. This how-to applies only to Win10; for Win7, see the other PDF IFilter article.

PDF Indexing: How-To Inspect and Change the Filter Handlers

First, open the Indexing Options panel in the Control Panel:

Control Panel for PDF Indexing Options


Now click on Indexing Options / Advanced / File Types. This shows you the list of file extensions and the default Filter Handler registered for each one. After installing an Adobe Filter, you can see that it adds a Handler for PDF that it calls “PDF Filter”.

Installed PDF Indexing Filter


Any indexing of PDF content at this point will use the Adobe Filter. To get PDF indexing working with Windows10 Store Universal Windows Platform (UWP) apps like Noggle, you need to use the native Windows10 PDF filter, which already ships with Windows10. To change it, you need to know the GUIDs of the filters. Please note them now:

What’s the GUID for the native Windows10 UWP PDF Filter?

Adobe GUID: {E8978DA6-047F-4E3D-9C78-CDBE46041603}
Windows10 GUID: {6C337B26-3E38-4F98-813B-FBA18BAB64F5}

That’s great, but now what if you want to switch back and forth?

Default Handlers in the Registry

How do we find out where the default handler is configured in the Registry? Open the Registry Editor by typing RegEdit in the Windows search box and running the desktop command that appears.

Let’s look at HKEY_CLASSES_ROOT\.pdf. In my case, it contains a PersistentHandler sub-key whose default value is a GUID. This GUID names the registry branch that defines the Filter Handler for PDFs.

RegEdit PDF Indexing GUID

Note: this GUID is not constant like the IFilter GUIDs are. Yours will be different.

So let’s take a look at {F6594…..382E} by searching for it. This brings us to HKEY_CLASSES_ROOT\CLSID\{F6594…..382E}:

RegEdit PDF Indexing Filter Handler


And there it is: under PersistentAddInsRegistered, the (Default or Standard) value points to the Adobe GUID {E8978DA6-047F-4E3D-9C78-CDBE46041603}. As you’ve probably guessed, to change the default handler to the native Windows10 PDF handler, we just have to replace this GUID with the Windows10 GUID: {6C337B26-3E38-4F98-813B-FBA18BAB64F5}. Let’s try it.

RegEdit PDF Indexing Windows 10 IFilter


Now let’s take another look at Advanced Indexing Options:

PDF Indexing Win10 Filter activated


And we’re on the Windows10 “Reader Search Handler” for PDF indexing with UWP apps. That’s it!
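If you switch back and forth often, the manual edit above could be captured in a .reg file. The following is only a minimal sketch: the function name `build_reg_file` and its parameters are hypothetical, the handler GUID differs on every machine, and the name of the sub-key below PersistentAddInsRegistered has to be read from your own registry. It only produces the file text and does not touch the registry itself.

```python
# Filter GUIDs as listed in this article.
ADOBE_FILTER = "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"
WIN10_FILTER = "{6C337B26-3E38-4F98-813B-FBA18BAB64F5}"


def build_reg_file(handler_guid: str, addin_subkey: str, filter_guid: str) -> str:
    """Return .reg file text that sets the (Default) value of the filter entry.

    handler_guid -- the machine-specific GUID found under .pdf\\PersistentHandler
    addin_subkey -- the sub-key name found under PersistentAddInsRegistered
    filter_guid  -- the filter to activate (ADOBE_FILTER or WIN10_FILTER)
    """
    key = (
        "HKEY_CLASSES_ROOT\\CLSID\\" + handler_guid
        + "\\PersistentAddInsRegistered\\" + addin_subkey
    )
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + key + "]",
        '@="' + filter_guid + '"',  # "@" addresses the (Default) value
        "",
    ]
    return "\r\n".join(lines)
```

Writing the returned text to a file and importing it with RegEdit should have the same effect as the manual change. Changing these keys typically requires administrator rights, and exporting the key as a backup first is a sensible precaution.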

Summary

Here is how the registry entries are structured to define the default or standard handler:

HKEY_CLASSES_ROOT\.pdf
    PersistentHandler
        (Default)={PDF Handler GUID}
            |
            ˅
HKEY_CLASSES_ROOT\CLSID\{PDF Handler GUID}
    PersistentAddInsRegistered
        {Some other GUID}
            (Default or Standard)={Filter GUID} <– Change this
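The lookup chain above can be followed programmatically to find out which filter is currently active. This is a sketch under assumptions, not a tested tool: `resolve_pdf_filter` and `winreg_readers` are hypothetical names, the registry access is passed in as two small functions so the logic can be tried without a real registry, and the Windows-only part relies on Python's standard `winreg` module.

```python
def resolve_pdf_filter(read_default, list_subkeys):
    """Follow .pdf -> PersistentHandler -> CLSID -> PersistentAddInsRegistered.

    read_default(path) returns the (Default) value of a key below
    HKEY_CLASSES_ROOT, or None. list_subkeys(path) returns its sub-key names.
    """
    handler = read_default(r".pdf\PersistentHandler")
    if handler is None:
        return None
    base = "CLSID\\" + handler + "\\PersistentAddInsRegistered"
    for sub in list_subkeys(base):
        value = read_default(base + "\\" + sub)
        if value:
            return value  # this is the Filter GUID
    return None


def winreg_readers():
    """Build the two reader functions on top of winreg (Windows only)."""
    import winreg

    def read_default(path):
        try:
            with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, path) as key:
                value, _type = winreg.QueryValueEx(key, "")
                return value
        except OSError:
            return None

    def list_subkeys(path):
        names = []
        try:
            with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, path) as key:
                index = 0
                while True:
                    try:
                        names.append(winreg.EnumKey(key, index))
                    except OSError:  # no more sub-keys
                        break
                    index += 1
        except OSError:
            pass
        return names

    return read_default, list_subkeys
```

On a Windows machine, `resolve_pdf_filter(*winreg_readers())` would return the Filter GUID, which can then be compared against the Adobe and Windows10 GUIDs listed earlier.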

Finally, you can check if the correct iFilter is available via the SearchFilterView Tool:


SearchFilterView Tool with correct Windows10 Filter handler activated for the extension .pdf

 

References:

How To Article for Win7 / Desktop Apps:

Tool to check available filter components:

Technical Info from Microsoft:

https://msdn.microsoft.com/en-us/library/windows/desktop/dd940433(v=vs.85).aspx

 

PDF – Indexing on 64bit platforms (Win 7 / Desktop Apps)

Info

This documentation refers to Win7 or Desktop Applications. If you use Noggle via the Windows Store as a native UWP app, please refer to the original Win10 article in the knowledge base!

PDF iFilter Interface

Adobe does not bundle the iFilter interface in the latest version of Adobe Acrobat Reader 11.x or DC 64bit. You need to manually activate the Adobe iFilter Add-On in order to be able to index and search PDF documents.

Click here to download and install the Adobe iFilter interface: Activate Adobe iFilter Add-On (64bit, Version 11.x or DC)

You should be fine if you use an older version or also have the 32-bit Acrobat Reader installed. If not, please update so that Noggle can also index your PDF files.

The Adobe PDF iFilter enables indexing Adobe PDF documents using Noggle indexing clients. This allows the user to easily search for text within Adobe PDF documents. The key benefits include:

  • Integrates with existing operating systems and enterprise tools.
  • Provides an easy solution to search within local Adobe PDF documents.
  • Greatly increases your ability to accurately locate information.

As shown below, the iFilter is either bundled with the product or provided as an add-on. 32-bit Acrobat 9.x-11.x products bundle a 32-bit PDF iFilter. 64-bit product installs require that the add-on be installed separately. If you already have an iFilter plugin from a previous install, reinstall it.

iFilter availability for both Acrobat and Reader

Version      | 32-bit        | 64-bit  | iFilter version and notes
Reader 8.x   | bundled       | None    | Version 6.
Acrobat 8.x  | bundled       | bundled | Version 6.
All 9.x      | bundled       | Add on  | Version 9. First added in 10.1. 32 bit not in 10.0-10.0.3
10.x         | bundled       | Add on  | Version 9. Security improved with 10.1
11.x         | bundled       | Add on  | Version 11. Updated for 11.x products and its supported platforms.
DC           | not available | Add on  | Version 11. No change for DC products and their supported platforms.

What is a noggle library?


The Noggle library functions are based on Lucene, an open source, highly scalable text search-engine library available from the Apache Software Foundation. Web sites like Wikipedia and LinkedIn have been powered by Lucene.

Noggle brings the best available search and indexing technology right to your desktop in the Noggle App.

With Lucene in the back end, Noggle is able to achieve fast search responses because, instead of searching the text directly, it searches an index: the “noggle library”. This is the equivalent of finding the pages in a book related to a keyword by looking in the index at the back of the book, as opposed to scanning the words on each page.

Noggle library tools focus mainly on text indexing and searching. The library is the core element that is used to build different search capabilities. Based on Lucene, the noggle library core has many features. It:

  • Has powerful, accurate, and efficient search algorithms.
  • Calculates a score for each document that matches a given query and returns the most relevant documents ranked by the scores.
  • Supports many powerful query types, such as PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, BooleanQuery, and more.
  • Supports parsing of human-entered rich query expressions.
  • Allows users to extend the searching behavior using custom sorting, boosting and extending search ideas.
  • Uses a file-based locking mechanism to prevent concurrent index modifications.
  • Allows searching and indexing simultaneously.

The Noggle library core lets you index any data available in textual format. Noggle therefore uses pre-processing and parsing techniques to extract the plain text from source formats such as Word, PowerPoint, Excel, and PDF files. Noggle can be used with almost any data source as long as textual information can be extracted from it. Before building the library by indexing the data, Noggle first converts the data to plain text. For this it uses custom parsers and data converters, mainly based on the Microsoft IFilter technology.

Indexing is a process of converting text data into a format that facilitates rapid searching. A simple analogy is an index you would find at the end of a book: That index points you to the location of topics that appear in the book.

Noggle stores the input data in a data structure called an inverted index, which is kept on the file system or in memory as a set of index files. Most Web search engines use an inverted index. It lets users perform fast keyword look-ups and find the documents that match a given query. Before the text data is added to the index, it is processed by a custom noggle analyzer.
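To make the idea of an inverted index concrete, here is a minimal in-memory sketch. It is not how Noggle or Lucene store their index files; the function names and the simple whitespace tokenization are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query only touches the entries for its own terms, which is why the look-up stays fast even when the indexed documents are large.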

The analyzer converts the text data into the fundamental unit of searching, which is called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to root form, changing words to lowercase, etc. Analysis happens just before indexing and query parsing. Analysis converts text data into tokens, and these tokens are added as terms in the Noggle library index.
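The analysis steps listed above can be sketched in a few lines. This is a deliberately naive illustration, not the actual Noggle analyzer: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Hypothetical, very small stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "the", "of", "to", "in", "is"}


def naive_stem(word):
    """Strip a few common English suffixes; a stand-in for real stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop punctuation, remove stop words, reduce to root form."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]
```

The same function would be applied to both the documents and the query, so that a search for "indexing" can match a document containing "indexed".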

As a result, a high-performance library is created which can be shared with your peers to execute search requests over the full content in milliseconds. The indexing and library building process not only provides fast search results; it also attaches relevance ranking scores to them.
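As a rough illustration of relevance ranking, the sketch below scores documents with a simple TF-IDF weighting. This is a swapped-in textbook technique, not the scoring formula that Lucene or Noggle actually use, and the function name `rank` is hypothetical.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: in how many documents the term occurs.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0 or counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)   # term frequency in this document
            idf = math.log(1 + n_docs / df)   # rarer terms weigh more
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Documents in which the query terms make up a larger share of the text come first, and documents without any query term are left out of the result.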

Once you decide to share a noggle library with one of your peers, the library is encrypted and obfuscated as soon as it leaves your client for the noggle network. Only the named peer is able to decrypt the library, so your library is always secure in the noggle network.