Class SortedSSTableWriter


  • public class SortedSSTableWriter
    extends java.lang.Object
    SSTableWriter that expects sorted data
    Note for implementors: the bulk writer always sorts the data in an entire Spark partition before writing. One benefit is that the output SSTables are sorted and non-overlapping, which allows Cassandra to optimize the import because the SSTables can technically be treated as a single large SSTable. You might want to introduce an SSTableWriter for unsorted data, say UnsortedSSTableWriter, and stop sorting the entire partition (i.e. stop using repartitionAndSortWithinPartitions). Doing so would eliminate the useful property that the output SSTables are globally sorted and non-overlapping. Unless you can think of a better use case, we should stick with this SortedSSTableWriter.
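
    To illustrate the non-overlapping property described above, the following minimal sketch splits an already sorted token sequence into fixed-size chunks, each standing in for one SSTable, and checks that consecutive chunks never overlap. The class SortedChunksSketch, its methods, and the chunk size are hypothetical and are not part of the bulk writer.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the bulk writer API.
public class SortedChunksSketch {
    // Splits sorted tokens into chunks of at most chunkSize entries and
    // returns {first, last} for each chunk. For sorted input these are
    // also the minimum and maximum token of the chunk.
    public static List<BigInteger[]> chunkRanges(List<BigInteger> sortedTokens, int chunkSize) {
        List<BigInteger[]> ranges = new ArrayList<>();
        for (int start = 0; start < sortedTokens.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, sortedTokens.size());
            ranges.add(new BigInteger[] { sortedTokens.get(start), sortedTokens.get(end - 1) });
        }
        return ranges;
    }

    // Returns true when every range starts at or after the end of the previous range.
    public static boolean nonOverlapping(List<BigInteger[]> ranges) {
        for (int i = 1; i < ranges.size(); i++) {
            if (ranges.get(i)[0].compareTo(ranges.get(i - 1)[1]) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<BigInteger> tokens = new ArrayList<>();
        for (long t = -5; t <= 5; t++) {
            tokens.add(BigInteger.valueOf(t * 1000));
        }
        System.out.println(nonOverlapping(chunkRanges(tokens, 4)));
    }
}
```

    Without the global sort, two chunks could cover intersecting token ranges, which is the property an unsorted writer would give up.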

    Threading Model:

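    This class has limited thread-safety guarantees:
      • addRow(BigInteger, Map) is not thread-safe and must be called from the same thread that calls close(BulkWriterContext).
      • prepareSStablesToSend(BulkWriterContext, Set) is synchronized and may be called concurrently from background threads.
      • close(BulkWriterContext) is synchronized to prevent races with concurrent prepareSStablesToSend(BulkWriterContext, Set) calls.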
    • Field Detail

      • CASSANDRA_VERSION_PREFIX

        public static final java.lang.String CASSANDRA_VERSION_PREFIX
        See Also:
        Constant Field Values
    • Constructor Detail

      • SortedSSTableWriter

        public SortedSSTableWriter​(org.apache.cassandra.bridge.SSTableWriter tableWriter,
                                   java.nio.file.Path outDir,
                                   DigestAlgorithm digestAlgorithm,
                                   int partitionId)
      • SortedSSTableWriter

        public SortedSSTableWriter​(BulkWriterContext writerContext,
                                   java.nio.file.Path outDir,
                                   DigestAlgorithm digestAlgorithm,
                                   int partitionId)
    • Method Detail

      • getPackageVersion

        @NotNull
        public java.lang.String getPackageVersion​(java.lang.String lowestCassandraVersion)
      • addRow

        public void addRow​(java.math.BigInteger token,
                           java.util.Map<java.lang.String,​java.lang.Object> boundValues)
                    throws java.io.IOException
        Add a row to be written.

        Threading: This method MUST be called from the same thread that calls close(BulkWriterContext) (typically the RecordWriter thread). It is NOT thread-safe and must not be called concurrently with any other method on this instance.

        Parameters:
        token - the hashed token of the row's partition key. Tokens must be passed in monotonically increasing order across successive calls.
        boundValues - bound values of the columns in the row
        Throws:
        java.io.IOException - if an I/O error occurs while adding the row
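
        The ordering requirement on token can be checked on the caller's side before each addRow call. The sketch below is one way to do that; TokenOrderGuard is a hypothetical helper, and the source does not state whether the writer itself enforces the ordering. Equal consecutive tokens are accepted here on the assumption that rows sharing a partition key share a token.

```java
import java.math.BigInteger;

// Hypothetical caller-side guard; not part of the bulk writer API.
public class TokenOrderGuard {
    private BigInteger lastToken; // null until the first token is seen

    // Throws if the token is smaller than the previous one.
    // Equal tokens are allowed (assumption: rows of one partition share a token).
    public void check(BigInteger token) {
        if (lastToken != null && token.compareTo(lastToken) < 0) {
            throw new IllegalArgumentException(
                "Token " + token + " is smaller than previous token " + lastToken);
        }
        lastToken = token;
    }

    public static void main(String[] args) {
        TokenOrderGuard guard = new TokenOrderGuard();
        guard.check(BigInteger.valueOf(-10));
        guard.check(BigInteger.valueOf(-10)); // same token again is accepted
        guard.check(BigInteger.valueOf(42));
        try {
            guard.check(BigInteger.ONE); // out of order
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

        Like addRow itself, this guard is not thread-safe and is meant to be used from the single thread that feeds the writer.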
      • setSSTablesProducedListener

        public void setSSTablesProducedListener​(java.util.function.Consumer<java.util.Set<org.apache.cassandra.bridge.SSTableDescriptor>> listener)
      • rowCount

        public long rowCount()
        Returns:
        the total number of rows written
      • bytesWritten

        public long bytesWritten()
        Returns:
        the total number of bytes written
      • sstableCount

        public int sstableCount()
        Returns:
        the total number of sstables written
      • prepareSStablesToSend

        public java.util.Map<java.nio.file.Path,​Digest> prepareSStablesToSend​(@NotNull
                                                                                    BulkWriterContext writerContext,
                                                                                    java.util.Set<org.apache.cassandra.bridge.SSTableDescriptor> sstables)
                                                                             throws java.io.IOException
        Prepares a set of SSTables to be sent to replicas by calculating digests and validating them.

        This method is called when SSTables are produced during the write process (before final close). It processes newly-produced SSTables, calculates their file digests, validates them, and updates the internal counters.

        Threading: This method is thread-safe and may be called concurrently from background threads (e.g., from DirectStreamSession.onSSTablesProduced(Set) via the executor service). It is synchronized to protect shared state (overallFileDigests, sstableCount, bytesWritten) from concurrent access with close(BulkWriterContext).

        Parameters:
        writerContext - the bulk writer context
        sstables - the set of SSTable descriptors to prepare
        Returns:
        a map of file paths to their digests, or an empty map if the writer is already closed
        Throws:
        java.io.IOException - if an I/O error occurs
      • close

        public void close​(BulkWriterContext writerContext)
                   throws java.io.IOException
        Closes this writer, flushes any remaining data, calculates digests, and validates all SSTables.

        This method performs the final flush of the SSTable writer, processes any SSTables that were not already handled by prepareSStablesToSend(BulkWriterContext, Set), calculates their digests, and validates all written SSTables.

        Threading: This method MUST be called from the same thread that calls addRow(BigInteger, Map) (typically the RecordWriter thread). It is synchronized to prevent races with concurrent prepareSStablesToSend(BulkWriterContext, Set) calls from background threads.

        This method is idempotent: once the first call completes, subsequent calls return early.

        Parameters:
        writerContext - the bulk writer context
        Throws:
        java.io.IOException - if an I/O error occurs during closing
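
        The interaction between close and prepareSStablesToSend described above (a shared lock, an early return after close, and an empty result once closed) can be sketched with a small stand-in class. ClosableDigestTracker, its String-based map, and the placeholder digest are hypothetical; the sketch shows the locking pattern only and does not use the real writer classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the locking pattern; not the library's implementation.
public class ClosableDigestTracker {
    private final Map<String, String> digests = new HashMap<>(); // shared state guarded by "this"
    private boolean closed = false;
    private int closeRuns = 0; // counts how often the close logic actually executed

    // May be called from background threads; returns an empty map once closed.
    public synchronized Map<String, String> prepare(Set<String> files) {
        if (closed) {
            return Collections.emptyMap();
        }
        Map<String, String> result = new HashMap<>();
        for (String file : files) {
            String digest = Integer.toHexString(file.hashCode()); // placeholder, not a real digest
            digests.put(file, digest);
            result.put(file, digest);
        }
        return result;
    }

    // Idempotent: only the first call does any work.
    public synchronized void close() {
        if (closed) {
            return;
        }
        closed = true;
        closeRuns++;
    }

    public synchronized int closeRuns() {
        return closeRuns;
    }

    public synchronized int trackedFiles() {
        return digests.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ClosableDigestTracker tracker = new ClosableDigestTracker();
        Thread background = new Thread(() -> tracker.prepare(Set.of("a-Data.db", "b-Data.db")));
        background.start();
        background.join();
        tracker.close();
        tracker.close(); // second call returns early
        System.out.println(tracker.trackedFiles() + " files, close ran " + tracker.closeRuns() + " time(s)");
    }
}
```

        Because both methods synchronize on the same instance, a background prepare call either finishes before close starts or observes the closed flag and returns an empty map.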
      • validateSSTables

        public void validateSSTables​(@NotNull
                                     BulkWriterContext writerContext)
      • validateSSTables

        public void validateSSTables​(@NotNull
                                     BulkWriterContext writerContext,
                                     @NotNull
                                     java.nio.file.Path outputDirectory,
                                     @Nullable
                                     java.util.Set<java.nio.file.Path> dataFilePaths)
        Validates SSTables. If dataFilePaths is null, finds all SSTables under the writer's output directory and validates them.
        Parameters:
        writerContext - the bulk writer context
        outputDirectory - output directory of the SSTable writer
        dataFilePaths - paths of the SSTable data files to validate. May be null, in which case all SSTables under the output directory are validated.
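
        The null handling of dataFilePaths can be sketched as follows. DataFileResolver is a hypothetical helper, and selecting files with the "*-Data.db" glob is an assumption based on the usual SSTable data file naming; the library may locate SSTables differently.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper mirroring the documented null handling; not the library's code.
public class DataFileResolver {
    // Returns the given paths as-is, or every "*-Data.db" file directly under
    // outputDirectory when dataFilePaths is null.
    public static Set<Path> resolve(Path outputDirectory, Set<Path> dataFilePaths) throws IOException {
        if (dataFilePaths != null) {
            return dataFilePaths;
        }
        Set<Path> found = new TreeSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDirectory, "*-Data.db")) {
            for (Path path : stream) {
                found.add(path);
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sstables");
        Files.createFile(dir.resolve("nb-1-big-Data.db"));
        Files.createFile(dir.resolve("nb-1-big-Index.db"));
        // Only the data file is selected when no explicit set is given.
        System.out.println(resolve(dir, null));
    }
}
```

        Note that an empty set is not the same as null in this sketch: an empty set selects nothing, while null selects everything in the directory.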
      • getTokenRange

        public com.google.common.collect.Range<java.math.BigInteger> getTokenRange()
      • getOutDir

        public java.nio.file.Path getOutDir()
      • fileDigestMap

        public java.util.Map<java.nio.file.Path,​Digest> fileDigestMap()
        Returns:
        a view of the file digest map
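
        A "view" of a map typically reflects later changes to the backing map without copying it. The sketch below shows one common way to expose such a view with Collections.unmodifiableMap. DigestMapView and its String-based map are hypothetical, and the source says only that a view is returned; that the view is read-only is an assumption made here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a map view; not the library's implementation.
public class DigestMapView {
    private final Map<String, String> digests = new HashMap<>();

    public void record(String file, String digest) {
        digests.put(file, digest);
    }

    // Read-only view: reflects later record() calls but rejects modification by callers.
    public Map<String, String> view() {
        return Collections.unmodifiableMap(digests);
    }

    public static void main(String[] args) {
        DigestMapView holder = new DigestMapView();
        Map<String, String> view = holder.view();
        holder.record("nb-1-big-Data.db", "abc123"); // recorded after the view was obtained
        System.out.println(view.containsKey("nb-1-big-Data.db"));
        try {
            view.put("other", "value");
        } catch (UnsupportedOperationException e) {
            System.out.println("view is read-only");
        }
    }
}
```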