Extend your storage to improve integrity, variety and velocity
Are you looking for ways to improve how you interact with your data? Are you outgrowing your current system and evaluating other options? Do you already use Solr to provide search indexes, with established workflows for processing data through it? If so, you are already using Solr as a NoSQL solution, but could you do more? If you are not currently using Solr, you may find its search features and established workflows a good fit. This post explains why Solr is more than a search index: it is a viable NoSQL solution that can play a supporting role in improving the integrity, variety and velocity of your data.
A variety of data can lead to a variety of integrity needs. Evaluating data integrity usually means evaluating the storage type and engine, the schema capabilities and the workflow involved in maintaining integrity.
Historically, Solr has been used as an indexing tool, but recent updates to its storage engine allow it to be used as a storage solution. When defining which fields are indexed, you can also define whether a field is stored, which gives you the opportunity to store your data and return a stored document. Even though developing with Solr feels similar to working with documents, it actually stores the data as key-value pairs at the field level. Both Solr’s indexes and stored fields live on the file system and support master-slave replication as well as sharding. Solr 4 introduced additional storage integrity features: Optimistic Locking and Atomic Updates. Optimistic Locking allows multiple threads to operate on a document while Solr manages inserts and updates, so your application only needs to understand puts, providing a true datastore solution. Atomic Updates extend the storage interaction by allowing an update to be performed on a single key-value field instead of on the complete document.
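As a minimal sketch of how Atomic Updates and Optimistic Locking look from the application side, the Python below builds the JSON body for an update request. It assumes the unique key field is named id; the helper name build_atomic_update, the document id and the field names are made up for illustration. The "set" modifier and the _version_ field are the parts that come from Solr 4.

```python
import json


def build_atomic_update(doc_id, changes, version=None):
    """Build one atomic-update document for Solr's JSON update format.

    `changes` maps a field name to its new value; each value is wrapped in a
    {"set": value} modifier so only that field is rewritten.
    """
    doc = {"id": doc_id}  # assumes the schema's unique key is "id"
    for field, value in changes.items():
        doc[field] = {"set": value}
    if version is not None:
        # Optimistic locking: Solr rejects the update if the stored
        # _version_ of the document no longer matches this value.
        doc["_version_"] = version
    return doc


# Hypothetical document and field names, for illustration only.
payload = json.dumps([build_atomic_update("book-1", {"price": 12.5}, version=42)])
print(payload)
```

The payload would be posted to the core's update handler with a JSON content type. If another writer changed the document first, the version check fails and the application can re-read the document and retry.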
Solr requires a schema to define documents and indexes, and this schema usually lives alongside the Solr config file in the config folder. While Solr requires some strongly defined fields and field types, it has evolved into a flexible schema solution through dynamic fields, which let you declare schema rules dynamically. You can define as many dynamic fields as needed.
<dynamicField name="example_*" type="text" indexed="true" stored="true"/>
Once you define dynamic fields in your Solr schema.xml, a restart of Solr is required for the changes to take effect. After that, new fields matching the dynamic field rules require no schema update, so you can add new fields to documents effortlessly. Combining strong fields with dynamic fields results in a hybrid schema: one that is strict and protects data while providing rules for flexibility.
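To illustrate the hybrid schema idea, here is a rough Python sketch of the check an application could run before sending a document. It is not Solr's own matching code. The static field names id and title are assumptions; the example_* pattern is taken from the dynamic field above.

```python
from fnmatch import fnmatchcase

# Assumed strict fields, plus the dynamic field pattern from the schema above.
STATIC_FIELDS = {"id", "title"}
DYNAMIC_FIELD_PATTERNS = ["example_*"]


def is_accepted(field_name, static_fields=STATIC_FIELDS, patterns=DYNAMIC_FIELD_PATTERNS):
    """Return True if the field is declared explicitly or matches a dynamic rule."""
    if field_name in static_fields:
        return True
    # fnmatchcase does case-sensitive glob matching, close enough to a
    # trailing-wildcard dynamic field for illustration purposes.
    return any(fnmatchcase(field_name, pattern) for pattern in patterns)


doc = {"id": "1", "example_color": "red", "colour": "red"}
rejected = [name for name in doc if not is_accepted(name)]
print(rejected)
```

Here example_color passes without any schema change, while colour is flagged because it matches neither a strict field nor a dynamic rule.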
One advantage Solr has as a supporting storage solution is workflow. Because of Solr’s popularity as a search index, many frameworks have already integrated workflows for data management. Usually the frameworks provide batch insert, update and delete, with the ability to clear and rebuild the index. These established workflows often run time-based updates and require updating the Solr schema in the event of a schema change, but with little effort they can be improved to provide more real-time updates and dynamic fields. Enhancing established workflows in this way is one method of extending your storage solution.
Solr provides a variety of ways to access, query, group and boost your data for exploring, reporting and ranking.
You can access Solr using a RESTful API, a Java library or libraries for other languages. My preference is to use Solr’s RESTful API for queries and puts on documents. The URL approach provides flexibility for deployment and maintenance and simplifies integration across applications. Solr 3.x and above support both XML and JSON for puts and returns, providing a document-style way to manage data. In addition to accessing Solr in different ways, you can also segment your data into separate collections with a multicore setup or by using multiple indexes within a single core.
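As a small sketch of the URL approach, the Python below builds a query URL that asks for a JSON response. The host, port and core name (localhost:8983 and collection1) are assumptions for a local install, and select_url is a made-up helper name.

```python
from urllib.parse import urlencode


def select_url(base_url, core, query, rows=10):
    """Build a select URL that returns JSON for the given query string."""
    params = {"q": query, "wt": "json", "rows": rows}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Assumed local Solr instance and core name.
url = select_url("http://localhost:8983/solr", "collection1", "title:solr")
print(url)
```

Because the request is just a URL, the same call works from a browser, a shell script or any HTTP client, which is much of the appeal of this approach.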
Solr supports several query parsers, allowing you to select the one that best fits your needs. You can define a default query parser in your config file and override it per request. This plug-and-play approach lets you choose the query parser that fits your application best, and even switch parsers dynamically. The DisMax query parser is my preference for its ability to change field-level boosting at query time to customize Solr’s scoring. In addition to field-level boosting, Solr also supports subqueries, date fields, range queries, surround (or span) queries, synonyms, grouping with facets, and geospatial searches for latitude and longitude queries. If you have very specific needs, you can write your own query parser.
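Query-time boosting with DisMax can be sketched as a set of request parameters. In the Python below, defType and qf are the DisMax parameters for selecting the parser and weighting fields; the field names title and body and the boost values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def dismax_params(text, boosts):
    """Build request parameters for a DisMax query with per-field boosts.

    `boosts` maps a field name to its weight, rendered as field^weight in qf.
    """
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {"defType": "dismax", "q": text, "qf": qf, "wt": "json"}


# Hypothetical fields: matches in the title count four times as much as the body.
params = dismax_params("solr nosql", {"title": 2.0, "body": 0.5})
print(urlencode(params))
```

Since the boosts travel with each request, an application can change how results are ranked without touching the index or restarting Solr.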
Solr has a number of built-in features to group, map (when sharding) and reduce your data for discovery. Solr is well known for its faceting abilities. Facets, by default, return a grouping on a term in a field or a unique count of a term in a field, but you can also use a query facet to return counts for arbitrary terms and expressions. You can sort, limit and offset your facets, perform date and range faceting and, with the pivot method, perform tree-level faceting. You may also consider using SpatialSearch on your geospatial fields to produce facets on distances. Additional data exploration is available with word counts: Solr can return word counts across a document or a field. To provide word counts, you will need to enable term vectors on the field so that the TermVectorComponent can use them.
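A field facet request and its result can be sketched as follows. The facet parameters are standard Solr request parameters; the field name category and the sample response are invented for illustration and are not real data. The sketch assumes the default JSON layout, in which each facet field comes back as a flat list of alternating terms and counts.

```python
def facet_params(field, limit=5):
    """Request only facet counts (rows=0) for one field across all documents."""
    return {
        "q": "*:*",
        "rows": 0,
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
    }


def facet_counts(flat):
    """Turn a flat [term, count, term, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Illustrative response fragment, not output from a real index.
sample_response = {
    "facet_counts": {"facet_fields": {"category": ["books", 12, "music", 7]}}
}
counts = facet_counts(sample_response["facet_counts"]["facet_fields"]["category"])
print(counts)
```

Setting rows to 0 keeps the response small when only the grouping is of interest, which suits reporting and exploration use cases.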
<dynamicField name="example_*" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
With a RESTful API, several programming libraries, plug-and-play query options and built-in tools for grouping, Solr provides a variety of ways to store, index and retrieve your data, giving you the opportunity to provide your client with a variety of data to interact with.
With its scaling improvements, support for date ranges and geospatial fields, and recent real-time features, Solr is now a viable option for exploring velocity in data.
Scaling with Solr has taken some big steps in recent years. Not only does Solr support an improved multi-core setup, but it also supports master-slave replication and sharding. For advanced management of sharding, a SolrCloud setup managed with ZooKeeper supports the growth of your data.
With the ability to support large data sets, Solr is prepared to grow with your data, but it can also support another form of velocity: spatial velocity. With support for date ranges and geospatial fields, Solr provides the ability to store velocity not only in relation to time, but also in relation to space. Solr also provides for the discovery of this data with faceting.
Recent additions in Solr 4 provide real-time or near-real-time features. Realtime Get retrieves documents in real time, avoiding the traditional lag defined by the timed commit interval in Solr’s config file. Realtime Get relies on Solr’s update log to retrieve the latest document changes. You can also use log files to monitor real-time searches, and since you can use a RESTful API, you can monitor the log files of other applications (like Apache) that are passing search parameters. In addition to Realtime Get, you can perform a hard commit, a commitWithin, or a soft commit. Traditionally, Solr has performed index updates on a timed interval defined in Solr’s configuration file. One traditional option is to perform a "hard commit", instructing Solr to commit the changes now. A second option is to define a "commit within" interval, a type of soft commit. A third option is to perform a "soft commit", which commits the updates in memory immediately. A "hard commit" will still be performed over a longer interval, writing the changes to disk. In the event of a power outage, soft commits will be lost. With Realtime Get, monitoring of log files and the flexibility to choose between hard commits and soft commits to memory, Solr provides several approaches to handling a more real-time experience.
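The commit options and Realtime Get described above can be sketched as plain URLs. The commit, softCommit and commitWithin parameters and the /get handler are part of Solr 4, although /get must be enabled in the Solr config. The host, core name, helper names and document id below are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/collection1"  # assumed host and core name


def update_url(mode, within_ms=1000):
    """Return an update URL for one of the three commit styles."""
    if mode == "hard":
        params = {"commit": "true"}  # commit now and write to disk
    elif mode == "soft":
        params = {"softCommit": "true"}  # make changes searchable from memory
    elif mode == "within":
        params = {"commitWithin": within_ms}  # commit within this many milliseconds
    else:
        raise ValueError(f"unknown commit mode: {mode}")
    return f"{BASE}/update?{urlencode(params)}"


def realtime_get_url(doc_id):
    """Return a Realtime Get URL that fetches the latest version of one document."""
    return f"{BASE}/get?{urlencode({'id': doc_id, 'wt': 'json'})}"


print(update_url("within", within_ms=5000))
print(realtime_get_url("book-1"))
```

A common pattern is to send updates with a short commitWithin, read your own writes back through Realtime Get, and leave hard commits to the longer interval in the configuration file.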
When extending your framework to support multiple backends, it is important to consider the objectives and requirements. Solr supports data integrity through sharding, replication, optimistic concurrency, atomic updates and a hybrid schema for strong and flexible definitions. In addition, Solr provides established workflows ready to be enhanced with newer features. Solr also comes with a variety of methods to access, query, group, facet and reduce your data, not only to return scored search results with boosting, but also to provide reporting and data exploration. Recent updates to the Solr platform have expanded its ability to manage larger data sets, to offer true “put” tolerance with atomic field updates and to provide real-time experiences. If you already use Solr, or already need its powerful search capabilities, you may consider using it for more than just a search engine. Solr may be a great fit to extend your current storage with a NoSQL solution that fits your integrity, variety and velocity needs.