Web Crawling for Elastic App Search

Elastic App Search is an excellent solution… once you get your content into it.

Even with the new 7.9 release, getting content into App Search is still complicated.

A few weeks ago, Elastic announced the latest release of its Enterprise Search product line. The new 7.9 release features tighter integration between Kibana and Enterprise Search, and it enables some powerful features for App Search. For example, App Search can now inherit index lifecycle management (ILM) policies to manage logs and analytics automatically. ILM simplifies managing retention policies so that you keep what you are interested in and nothing extra.

Elastic Cloud includes App Search. 

Most companies today use Elasticsearch in some fashion, at some level of their organization. These deployments, while business-critical, are often siloed to a specific function. The team supporting them may not have the resources, skill sets, or desire to dig deep and onboard other high-value use cases onto the platform. The Elastic Stack's flexibility can also lead to analysis paralysis when choosing the right platform configuration for an organization's specific landscape and requirements. Elastic App Search simplifies implementing and maintaining search by encapsulating a best-of-breed configuration for the standard search use cases. This gives you strong relevancy out of the box.


If your organization is new to App Search, you can sign up for a free 14-day trial of Elastic Cloud and explore the technology stack. Once your account is set up, creating an App Search deployment will have you up and running in moments (it's that fast). In this process, you have the option of choosing your cloud platform. Azure or GCP, it makes no difference to Elastic. If the cloud is not your thing right now, or your use case needs to run on your own hardware, you can deploy on-prem using a variety of methods.

The Simplicity of the Cloud. 

Web Crawling App Search and the ‘Data Hurdle.’

App Search does not currently have a web crawler or other basic connectors.

Several years ago, we ran a survey asking which content sources and connectors were in use with the Google Search Appliance. We discovered that a majority of respondents needed to index content that was web-based, on file shares, or stored in databases.

App Search provides a rich set of APIs for ingesting your data. Unfortunately, it currently does not offer any way to consume your data without writing a fair bit of code. This complexity around getting your data in creates what we are calling the 'data hurdle' for what should be straightforward use cases (like crawling a website). As a result, a search project on App Search typically starts with its most burdensome task: indexing the data. This hurdle slows down your time to value and unnecessarily complicates using the technology to solve business problems.
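To give a sense of the code involved, here is a minimal Python sketch of pushing documents through the App Search documents API. It is an illustration only, not part of our connector. The base URL, engine name, and API key are placeholders you would supply, and the batch size of 100 reflects our understanding of the per-request document limit, so check it against the documentation for your version.

```python
import json
import urllib.request
from typing import Dict, Iterator, List

Document = Dict[str, object]


def batches(docs: List[Document], size: int = 100) -> Iterator[List[Document]]:
    """Yield the documents in slices of at most `size` (assumed API limit)."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def build_index_request(base_url: str, engine: str, api_key: str,
                        docs: List[Document]) -> urllib.request.Request:
    """Build the POST request that sends one batch of documents to an engine."""
    url = f"{base_url.rstrip('/')}/api/as/v1/engines/{engine}/documents"
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # a private API key
        },
        method="POST",
    )


def index_documents(base_url: str, engine: str, api_key: str,
                    docs: List[Document]) -> List[object]:
    """Send every batch and collect the per-document results App Search returns."""
    results: List[object] = []
    for batch in batches(docs):
        request = build_index_request(base_url, engine, api_key, batch)
        with urllib.request.urlopen(request) as response:
            results.extend(json.load(response))
    return results
```

Even this small example leaves out what makes ingestion hard in practice: fetching the content, extracting text, tracking changes and deletions, and retrying failures.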

A Web Crawler for Elastic App Search – Revisiting our Roots.  

Having built one of the first SharePoint connectors (before Microsoft thoroughly documented their APIs) as well as connectors to other enterprise systems (Box, Drive, Egnyte, etc.), we understand the challenges of getting content out of systems and into search platforms. So we set out to develop the App Search Connector to serve the typical customer need: getting content in quickly.

We built the connector on ManifoldCF, the leading open-source ETL platform. The ManifoldCF project has been around for quite some time, and MC+A is one of the contributors who maintain its Elasticsearch connector. ManifoldCF provides a host of ingestion sources, or inputs, and our connector lets you output them to App Search. The web crawler can handle content that sits behind a web form or is protected by a variety of standard methods. You configure the crawler to process the web form and then traverse the content behind it.
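For readers new to crawling, the traversal step can be sketched in a few lines of Python. This is a generic breadth-first illustration, not ManifoldCF's implementation: the `crawl` function, the `fetch` callable, and the `max_pages` limit are hypothetical names, and form login, robots.txt handling, and throttling are all omitted.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, fetch: Callable[[str], Optional[str]],
          max_pages: int = 50) -> Dict[str, str]:
    """Breadth-first crawl from `seed`, staying on the seed's host.

    `fetch` returns a page's HTML, or None when the page is unavailable.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from how pages are retrieved, which is where authentication and form handling would live in a real crawler.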

Deployment Options

We’ve simplified the deployment of ManifoldCF by using Docker containers.  You can quickly deploy to the cloud of your choice or use our installation scripts if you are still provisioning virtual machines. 

Your deployment, your choice, but in either case, we can deploy the crawler to your environment as fast as you can orchestrate the environment.

docker-compose up

Additional Features Beyond Web Crawling

In addition to web crawling, once you have the platform enabled, you can ingest file share content and connect to databases with ease. The MC+A Elastic App Search Connector opens up a wide range of content repositories. Additionally, the platform integrates with Apache Tika, which parses metadata and text from over a thousand file types (such as PPT, XLS, and PDF). With one straightforward technology platform, content like PDF and Word files can be turned into searchable text.

Seeing is Believing – Schedule a Demo Today
