Database Design and Normalization Interview Questions and Answers (2026) Interview Questions | JiQuest

add

#

Database Design and Normalization Interview Questions and Answers (2026)

BACKEND INTERVIEW PREPARATION
Database Design & Normalization
Master 100 database design, ER modeling and normalization interview questions covering keys, relationships, 1NF–5NF, denormalization, schema design and data warehousing for backend and system design interviews.
⏳ 50 min read 📝 100 Q&As 🎯 Beginner to Advanced
⚡ Quick Reference
1NFAtomic column values, no repeating groups, each row unique
2NF1NF + no partial dependency on part of a composite key
3NF2NF + no transitive dependency (non-key → non-key)
BCNFEvery determinant of a functional dependency is a candidate key
Primary vs Foreign keyPK uniquely identifies a row; FK references a PK in another table
Surrogate vs Natural keySurrogate: system-generated (id); Natural: real-world value (email, ISBN)
DenormalizationDeliberate redundancy to speed reads at the cost of write complexity
OLTP vs OLAP / Star vs SnowflakeOLTP: transactional, normalized; OLAP: analytical. Star: denormalized dimensions; Snowflake: normalized dimensions
Normalization Progression
Unnormalized
repeating groups
1NF
atomic values
2NF
no partial deps
3NF
no transitive deps
BCNF
every determinant a key

Database Design Fundamentals & ER Modeling

Q1. What is database design and why does it matter?

A: Database design is the process of structuring data into tables, columns, keys and relationships so it can be stored without redundancy and retrieved efficiently. Good design prevents update anomalies, enforces data integrity at the storage layer, and scales with application growth. Poor design is expensive to fix later because schema migrations on large production tables are risky and slow.

Q2. What are the three phases of database design?

A: Conceptual design captures entities, attributes and relationships independent of any DBMS (usually as an ER diagram). Logical design converts that model into relations (tables, columns, keys) and normalizes them. Physical design maps the logical model onto a specific engine: data types, indexes, partitioning, storage parameters and tablespaces.

Q3. What is an ER diagram and what are its main components?

A: An Entity-Relationship diagram models the domain visually. Its components are entities (things with independent existence, drawn as rectangles), attributes (properties, ovals), and relationships (associations between entities, diamonds), annotated with cardinality (1:1, 1:N, M:N). ER diagrams are the standard bridge between business requirements and the logical schema.

Q4. What is the difference between a strong entity and a weak entity?

A: A strong entity has its own primary key and exists independently (e.g., orders). A weak entity cannot be uniquely identified without its owning entity — its key includes the owner's key (e.g., order_items identified by order_id + line_no). Weak entities are implemented with a composite key that embeds the parent's foreign key, and are deleted when the parent is deleted.

CREATE TABLE orders (
  order_id BIGINT PRIMARY KEY
);
CREATE TABLE order_items (          -- weak entity
  order_id BIGINT NOT NULL REFERENCES orders(order_id) ON DELETE CASCADE,
  line_no  INT    NOT NULL,
  PRIMARY KEY (order_id, line_no)   -- identity depends on the owner
);

Q5. What are simple, composite, derived and multivalued attributes?

A: Simple attributes are atomic (age). Composite attributes decompose into parts (address → street, city, zip). Derived attributes are computable from others (age from date_of_birth) and are usually not stored. Multivalued attributes hold several values per entity (phone numbers) and must be moved to a child table in a relational schema.

-- multivalued attribute becomes its own table
CREATE TABLE customer_phones (
  customer_id BIGINT NOT NULL REFERENCES customers(id),
  phone       VARCHAR(20) NOT NULL,
  phone_type  VARCHAR(10) NOT NULL,   -- 'MOBILE', 'HOME', 'WORK'
  PRIMARY KEY (customer_id, phone)
);

Q6. What is the difference between cardinality and participation (modality)?

A: Cardinality states the maximum number of relationship instances an entity can participate in (one or many). Participation states the minimum: total (mandatory, every instance must participate) or partial (optional). For example, every order must belong to exactly one customer (total, 1), while a customer may have zero or many orders (partial, N). Participation is enforced with NOT NULL foreign keys.

Q7. What is the difference between a schema and an instance?

A: The schema is the structural definition of the database — tables, columns, types, constraints — which changes rarely and only via DDL. An instance (or state) is the actual data stored at a moment in time, which changes with every DML operation. Interviewers use this to check you separate design-time decisions from runtime data.

Q8. What is the difference between a logical and a physical data model?

A: The logical model defines entities, attributes, keys and normalized relationships in a platform-neutral way. The physical model targets a specific engine and adds concrete data types, index definitions, partitioning schemes, denormalization decisions and storage options. The same logical model can produce different physical models for PostgreSQL, MySQL or Oracle.

Q9. What is data independence?

A: Data independence means changes at one level of the database do not force changes above it. Physical data independence: you can change storage structures or indexes without changing the logical schema or application queries. Logical data independence: you can evolve the logical schema (add columns, split tables) while shielding applications through views. It is a core benefit of the relational model.

Q10. Why does choosing correct column data types matter in schema design?

A: Data types drive storage size, index size, comparison semantics and integrity. Oversized types (VARCHAR(4000) everywhere, BIGINT for small enums) bloat indexes and cache; wrong types (storing dates or money as VARCHAR) disable range queries, sorting and constraint checking. A rule of thumb: use the smallest type that safely fits the domain, DECIMAL for money, and native DATE/TIMESTAMP types for time.

-- Bad
price      VARCHAR(20),      -- no arithmetic, bad sorting: '9' > '10'
created_at VARCHAR(30)       -- no range scans, no timezone handling

-- Good
price      DECIMAL(12,2) NOT NULL CHECK (price >= 0),
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()

Keys: Primary, Foreign, Candidate, Composite, Surrogate

Q11. What is a primary key and what rules must it obey?

A: A primary key uniquely identifies each row in a table. It must be unique, NOT NULL, and there is exactly one per table (though it may span multiple columns). Ideally it is also stable (never changes) and minimal (no unnecessary columns), because every foreign key and secondary index references it.

CREATE TABLE users (
  user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email   VARCHAR(255) NOT NULL UNIQUE
);

Q12. What is the difference between a primary key and a unique constraint?

A: Both enforce uniqueness, but a table can have only one primary key and many unique constraints. Primary key columns are implicitly NOT NULL; unique columns may allow NULLs (and most databases permit multiple NULLs in a unique column). Foreign keys conventionally reference the primary key, though they can reference any unique constraint.

Q13. What is a foreign key and what does it enforce?

A: A foreign key is a column (or set of columns) that references the primary/unique key of another table, enforcing referential integrity: you cannot insert a child row pointing at a non-existent parent, and you cannot delete a parent that still has children (unless a cascade rule says otherwise). It turns implicit application-level relationships into guarantees the database enforces.

CREATE TABLE orders (
  order_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

Q14. What is the difference between a super key and a candidate key?

A: A super key is any set of columns that uniquely identifies a row — it may contain extra columns. A candidate key is a minimal super key: removing any column breaks uniqueness. A table can have several candidate keys (e.g., user_id and email); the designer picks one as the primary key and the rest become alternate keys.

Q15. What is a composite key and when do you use one?

A: A composite key is a primary key made of two or more columns whose combination is unique. It is natural for junction tables, weak entities and time-series facts. Trade-off: child tables must carry all key columns as foreign keys, so for deep hierarchies a surrogate key plus a composite unique constraint is often cleaner.

CREATE TABLE enrollments (
  student_id  BIGINT NOT NULL REFERENCES students(id),
  course_id   BIGINT NOT NULL REFERENCES courses(id),
  enrolled_on DATE   NOT NULL,
  PRIMARY KEY (student_id, course_id)   -- composite key
);

Q16. What is the difference between a surrogate key and a natural key?

A: A natural key is a real-world attribute that is inherently unique (email, ISBN, national ID). A surrogate key is a meaningless system-generated identifier (auto-increment, sequence, UUID). Surrogates are stable, compact and immune to business-rule changes; natural keys carry meaning but can change (people change emails) and may be long or legally sensitive. Common practice: surrogate primary key plus a unique constraint on the natural key.

Q17. Auto-increment integers vs UUIDs as primary keys — trade-offs?

A: Auto-increment IDs are small (8 bytes), index-friendly (sequential inserts append to the right of the B-tree) but leak row counts and are awkward across shards. UUIDs are globally unique, generatable on the client and safe to merge across databases, but random UUIDv4 causes B-tree page splits and index bloat. UUIDv7 (time-ordered) or database sequences per shard are common compromises.

-- MySQL
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY

-- PostgreSQL sequential
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY

-- PostgreSQL UUID (prefer time-ordered v7 where available)
id UUID PRIMARY KEY DEFAULT gen_random_uuid()

Q18. Can a foreign key be NULL, and can a table reference itself?

A: Yes to both. A nullable foreign key models an optional relationship — NULL means "no parent", and referential integrity is only checked for non-NULL values. A self-referencing foreign key models hierarchies within one table, such as employees pointing to their manager.

CREATE TABLE employees (
  emp_id     BIGINT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL,
  manager_id BIGINT NULL REFERENCES employees(emp_id)  -- NULL = CEO
);

Q19. What is an alternate key?

A: An alternate key is any candidate key that was not chosen as the primary key. It is implemented as a unique (usually NOT NULL) constraint. Example: in a users table with surrogate user_id as primary key, email and username are alternate keys — still unique, still usable for lookups, but not referenced by foreign keys.

Q20. Why is using email or SSN as a primary key usually a bad idea?

A: Primary keys should be immutable, but emails change and SSNs can be corrected — and changing a PK cascades into every referencing foreign key and index. Natural identifiers are also long (bigger indexes), sometimes reused, and legally sensitive (an SSN copied into ten child tables is a compliance problem). Keep them as unique alternate keys and use a surrogate as the PK.

Relationships & Cardinality

Q21. What relationship cardinalities exist and give an example of each?

A: One-to-one (1:1): a user has one passport record. One-to-many (1:N): a customer places many orders, each order belongs to one customer. Many-to-many (M:N): students enroll in many courses and courses have many students. 1:N is implemented with a foreign key on the "many" side; M:N always requires a junction table.

Q22. How do you implement a one-to-one relationship?

A: Put a foreign key on one table and make it UNIQUE (or use the same primary key value in both tables — a "shared primary key"). 1:1 splits are used to isolate optional or sensitive data (user_profiles, user_credentials), to separate rarely-used wide columns from a hot table, or to apply different access controls.

CREATE TABLE user_profiles (
  user_id BIGINT PRIMARY KEY REFERENCES users(user_id), -- PK = FK
  bio     TEXT,
  avatar_url VARCHAR(500)
);
-- PK-as-FK guarantees at most one profile per user

Q23. How do you implement a one-to-many relationship?

A: Place a foreign key on the child ("many") table pointing to the parent. Make it NOT NULL if participation is mandatory. Always index the foreign key column — the database only auto-indexes it in MySQL/InnoDB, not in PostgreSQL, and unindexed FKs make joins and cascading deletes slow.

CREATE TABLE orders (
  order_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers(customer_id)
);
CREATE INDEX idx_orders_customer ON orders(customer_id);

Q24. How do you implement a many-to-many relationship?

A: With a junction (bridge/associative) table containing foreign keys to both sides; the pair forms the primary key so the same association cannot be duplicated. Neither base table stores anything about the other — all linkage lives in the junction table.

CREATE TABLE student_courses (
  student_id BIGINT NOT NULL REFERENCES students(id),
  course_id  BIGINT NOT NULL REFERENCES courses(id),
  PRIMARY KEY (student_id, course_id)
);
CREATE INDEX idx_sc_course ON student_courses(course_id); -- reverse lookups

Q25. When does a junction table deserve its own attributes and surrogate key?

A: As soon as the relationship itself has data — enrollment date, grade, role, quantity — those columns belong on the junction table, effectively promoting it to a first-class entity. If other tables must reference the association (e.g., certificates reference an enrollment), give it a surrogate key and keep the pair as a unique constraint.

CREATE TABLE enrollments (
  enrollment_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  student_id    BIGINT NOT NULL REFERENCES students(id),
  course_id     BIGINT NOT NULL REFERENCES courses(id),
  enrolled_on   DATE NOT NULL,
  grade         CHAR(2),
  UNIQUE (student_id, course_id)   -- still one enrollment per pair
);

Q26. How do you model a hierarchy like employee→manager (self-referencing relationship)?

A: The simplest form is the adjacency list: a nullable manager_id foreign key to the same table. Querying a whole subtree then requires a recursive query. Alternatives for read-heavy trees are the path/materialized-path column, nested sets, or a closure table (see Q74).

WITH RECURSIVE reports AS (
  SELECT emp_id, name, manager_id FROM employees WHERE emp_id = 42
  UNION ALL
  SELECT e.emp_id, e.name, e.manager_id
  FROM employees e JOIN reports r ON e.manager_id = r.emp_id
)
SELECT * FROM reports;   -- everyone under employee 42

Q27. What is the difference between identifying and non-identifying relationships?

A: In an identifying relationship the child's primary key includes the parent's key — the child cannot exist without the parent (order_items includes order_id in its PK). In a non-identifying relationship the foreign key is an ordinary attribute outside the child's PK (orders reference customers, but an order's identity does not include customer_id). Identifying relationships correspond to weak entities.

Q28. What is a ternary relationship and how do you implement it?

A: A ternary relationship involves three entities simultaneously — e.g., a supplier supplies a part to a project. It is implemented as a single junction table with three foreign keys, with the primary key covering whichever combination is truly unique. Decomposing it into three pairwise M:N tables is not equivalent: you lose the fact of which supplier supplied which part to which project.

Q29. How do you enforce optional vs mandatory participation in a relationship?

A: Mandatory on the child side: make the foreign key NOT NULL. Optional: allow NULL. Mandatory on the parent side ("every order must have at least one item") cannot be expressed by plain FK constraints — enforce it in a transaction at the application/service layer, with deferred constraints, or with triggers, and treat it as a business invariant.

Q30. What is a polymorphic association and why is it problematic in SQL?

A: A polymorphic association is a child that can belong to different parent types, modeled as (parent_type, parent_id) — e.g., comments on either posts or photos. The problem: no real foreign key can reference "one of several tables", so the database cannot enforce integrity. Cleaner alternatives are one child table per parent type, separate nullable FK columns with a CHECK that exactly one is set, or a shared "commentable" supertable that both parents reference.

Normalization: 1NF to 5NF & Functional Dependencies

Q31. What is normalization and what problems does it solve?

A: Normalization is the systematic decomposition of tables to eliminate redundancy and the anomalies it causes: update anomalies (change a fact in one row, forget it elsewhere), insert anomalies (cannot record a fact without unrelated data), and delete anomalies (deleting a row destroys an unrelated fact). Each normal form removes a specific class of redundancy driven by functional dependencies.

Q32. What is a functional dependency?

A: A functional dependency X → Y means that the value of attribute set X uniquely determines the value of Y: any two rows agreeing on X must agree on Y. Example: employee_id → employee_name, or zip_code → city. Functional dependencies are the formal machinery behind every normal form — 2NF, 3NF and BCNF are all defined in terms of which FDs are allowed.

Q33. What is First Normal Form (1NF)?

A: 1NF requires every column to hold a single atomic value, no repeating groups or arrays, and each row to be uniquely identifiable. A comma-separated phone list or columns like phone1, phone2, phone3 violate the spirit of 1NF; the fix is a child table with one value per row.

-- Violates 1NF
CREATE TABLE customers_bad (
  id     BIGINT PRIMARY KEY,
  phones VARCHAR(200)          -- '9876,8765,7654'  ✗
);
-- 1NF
CREATE TABLE customer_phones (
  customer_id BIGINT NOT NULL REFERENCES customers(id),
  phone       VARCHAR(20) NOT NULL,
  PRIMARY KEY (customer_id, phone)
);

Q34. What is Second Normal Form (2NF) and what is a partial dependency?

A: 2NF = 1NF plus no non-key attribute depends on only part of a composite key (a partial dependency). In (student_id, course_id, student_name, grade), student_name depends only on student_id — a partial dependency causing the name to repeat for every course. Fix: split student attributes into the students table. Tables with single-column keys are automatically in 2NF.

-- Violates 2NF: student_name depends only on student_id
CREATE TABLE enrollments_bad (
  student_id   BIGINT,
  course_id    BIGINT,
  student_name VARCHAR(100),    -- partial dependency ✗
  grade        CHAR(2),
  PRIMARY KEY (student_id, course_id)
);
-- 2NF: move student_name to students; keep only full-key facts
CREATE TABLE enrollments (
  student_id BIGINT REFERENCES students(id),
  course_id  BIGINT REFERENCES courses(id),
  grade      CHAR(2),
  PRIMARY KEY (student_id, course_id)
);

Q35. What is Third Normal Form (3NF) and what is a transitive dependency?

A: 3NF = 2NF plus no non-key attribute depends on another non-key attribute (a transitive dependency: key → A → B). In employees(emp_id, dept_id, dept_name), dept_name depends on dept_id, not directly on emp_id — so renaming a department means updating thousands of employee rows. Fix: extract a departments table.

-- Violates 3NF: emp_id → dept_id → dept_name
CREATE TABLE employees_bad (
  emp_id    BIGINT PRIMARY KEY,
  dept_id   BIGINT,
  dept_name VARCHAR(100)        -- transitive dependency ✗
);
-- 3NF
CREATE TABLE departments (dept_id BIGINT PRIMARY KEY, dept_name VARCHAR(100) NOT NULL);
CREATE TABLE employees   (emp_id BIGINT PRIMARY KEY,
                          dept_id BIGINT NOT NULL REFERENCES departments(dept_id));

Q36. How is BCNF different from 3NF?

A: BCNF (Boyce-Codd Normal Form) is stricter: for every non-trivial functional dependency X → Y, X must be a candidate key. 3NF permits one exception — a dependency where Y is a prime attribute (part of some candidate key). Tables with overlapping composite candidate keys can be in 3NF yet violate BCNF. In practice most 3NF tables are already in BCNF.

Q37. Give a concrete example of a table in 3NF but not BCNF.

A: Consider bookings(student_id, subject, tutor) where each tutor teaches exactly one subject (tutor → subject) and a student has one tutor per subject (student_id, subject → tutor). tutor is not a candidate key, yet it determines subject, violating BCNF; the table is still 3NF because subject is a prime attribute. The BCNF decomposition is tutors(tutor, subject) and bookings(student_id, tutor) — note this decomposition does not preserve the (student_id, subject) dependency, which is the classic BCNF trade-off.

Q38. What is Fourth Normal Form (4NF)?

A: 4NF removes non-trivial multivalued dependencies: cases where two independent multi-valued facts about the same key are stored in one table, forcing a cartesian product of rows. Example: employee_info(emp_id, skill, language) where skills and languages are unrelated — every skill×language combination must be stored. Fix: split into employee_skills and employee_languages.

Q39. What is Fifth Normal Form (5NF)?

A: 5NF (project-join normal form) eliminates join dependencies: a table that can be losslessly reconstructed only by joining three or more of its projections. It matters for cyclic constraints like "if a supplier supplies a part and that part is used in a project and the supplier serves that project, then the supplier supplies that part to that project". Rare in practice; mostly asked to test theoretical depth.

Q40. Explain insert, update and delete anomalies with one example.

A: Take a denormalized orders(order_id, customer_name, customer_city, product). Update anomaly: the customer moves city — you must update every one of their order rows or data becomes inconsistent. Insert anomaly: you cannot record a new customer until they place an order. Delete anomaly: deleting a customer's only order erases the customer's existence entirely. Normalizing customers into their own table removes all three.

Q41. What is a full functional dependency?

A: Y is fully functionally dependent on X if X → Y holds and Y does not depend on any proper subset of X. It is the yardstick of 2NF: with key (student_id, course_id), grade is fully dependent (needs both), while student_name is only partially dependent (needs just student_id) and must be moved out.

Q42. What are prime and non-prime attributes?

A: A prime attribute is a member of at least one candidate key; a non-prime attribute belongs to none. The distinction matters because 2NF and 3NF constrain only non-prime attributes, while BCNF constrains all of them — which is exactly why a table can satisfy 3NF but fail BCNF (Q37).

Q43. What normal form should a production OLTP database target?

A: 3NF (effectively BCNF, since they rarely differ) is the standard target for OLTP systems: it removes practically all update anomalies while keeping join counts reasonable. 4NF/5NF violations are uncommon if you model entities cleanly. From there, denormalize selectively and deliberately only where measured read performance demands it — never as the default.

Q44. What are Armstrong's axioms?

A: Armstrong's axioms are the inference rules for deriving all functional dependencies implied by a given set: reflexivity (if Y ⊆ X then X → Y), augmentation (X → Y implies XZ → YZ), and transitivity (X → Y and Y → Z imply X → Z). They are sound and complete, and are used to compute attribute closures when finding candidate keys.

Q45. What are lossless-join decomposition and dependency preservation?

A: A decomposition is lossless if joining the resulting tables reproduces exactly the original rows — no spurious tuples. It is dependency-preserving if every original functional dependency can be checked without joining tables back together. 3NF decomposition can always achieve both; BCNF guarantees losslessness but sometimes must sacrifice dependency preservation — a classic reason to stop at 3NF.

Denormalization & When to Break the Rules

Q46. What is denormalization and when is it justified?

A: Denormalization deliberately reintroduces redundancy into a normalized schema to reduce joins and speed up reads. It is justified when a measured, hot read path is too slow despite proper indexing — typical cases: dashboards, feed queries, search listings, report tables. The cost is write-side complexity: every duplicate copy must be kept in sync, so denormalize only with a clear synchronization strategy.

Q47. What are common denormalization techniques?

A: Duplicating a frequently joined column onto the child table (e.g., customer_name on orders), precomputed aggregate columns (order_count, total_spent), summary/rollup tables refreshed periodically, and collapsing 1:1 tables together. Each trades one extra UPDATE path for the removal of a join or aggregation at read time.

-- summary table refreshed nightly (or incrementally by trigger/job)
CREATE TABLE daily_sales_summary (
  sales_date   DATE NOT NULL,
  product_id   BIGINT NOT NULL REFERENCES products(id),
  units_sold   INT NOT NULL,
  revenue      DECIMAL(14,2) NOT NULL,
  PRIMARY KEY (sales_date, product_id)
);

Q48. What are the risks of denormalization and how do you mitigate them?

A: The main risk is drift: the duplicated value diverges from the source of truth, and the update anomaly you normalized away returns. Mitigations: keep one table authoritative and treat copies as caches, update copies in the same transaction (or via triggers/CDC pipelines), add reconciliation jobs that detect divergence, and document every denormalized column so future developers know it is derived.

Q49. Materialized view vs manually denormalized table — which and why?

A: A materialized view is a database-managed precomputed query result: the definition lives in the database and refresh is declarative, which eliminates drift bugs. A hand-rolled summary table gives full control over incremental updates and works on engines with weak materialized-view support (MySQL has none natively). Prefer materialized views when the engine supports the refresh cadence you need; otherwise maintain a summary table via jobs or triggers.

-- PostgreSQL
CREATE MATERIALIZED VIEW top_products AS
SELECT product_id, SUM(quantity) AS units
FROM order_items GROUP BY product_id;

REFRESH MATERIALIZED VIEW CONCURRENTLY top_products;  -- non-blocking

Q50. How do read-heavy vs write-heavy workloads change schema design?

A: Read-heavy systems tolerate denormalization, aggressive indexing, and materialized aggregates because writes are rare enough to absorb the maintenance cost. Write-heavy systems want the opposite: lean normalized tables, minimal indexes (each index is extra write I/O), append-only patterns, and aggregation pushed to asynchronous consumers. Most real systems split the difference with a normalized write model plus derived read models (CQRS).

Q51. How do you design a counter column (e.g., likes_count) safely?

A: Store the source of truth as rows (one row per like) and keep the counter as a denormalized column updated atomically with UPDATE ... SET likes_count = likes_count + 1 in the same transaction as the insert. For very hot counters, shard the counter across N rows or buffer increments in Redis and flush periodically, accepting slight staleness. Never compute it by COUNT(*) on every page view.

BEGIN;
INSERT INTO post_likes (post_id, user_id) VALUES (10, 77);  -- truth
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 10; -- cache
COMMIT;
-- unique PK (post_id, user_id) on post_likes prevents double-likes

Q52. When is storing JSON in a relational column an acceptable design?

A: JSON columns suit genuinely schemaless, read-mostly payloads: user preferences, third-party API responses, product attributes that vary wildly per category. They are a poor fit for data you filter, join, aggregate or constrain — you lose foreign keys, type checking and easy indexing. Rule: model what the database must understand as columns; keep what it merely stores as JSON.

CREATE TABLE products (
  id         BIGINT PRIMARY KEY,
  name       VARCHAR(200) NOT NULL,     -- relational: queried, constrained
  price      DECIMAL(12,2) NOT NULL,
  attributes JSONB                      -- flexible: {"color":"red","ram_gb":16}
);
CREATE INDEX idx_products_attrs ON products USING GIN (attributes);

Constraints & Data Integrity

Q53. What types of constraints does SQL provide?

A: PRIMARY KEY (unique + not null row identity), FOREIGN KEY (referential integrity), UNIQUE (no duplicate values), NOT NULL (mandatory values), CHECK (arbitrary boolean condition per row), and DEFAULT (value when none supplied). Together they let the database reject bad data regardless of which application or script writes it.

CREATE TABLE accounts (
  account_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  owner_id   BIGINT NOT NULL REFERENCES users(user_id),
  iban       VARCHAR(34) NOT NULL UNIQUE,
  balance    DECIMAL(14,2) NOT NULL DEFAULT 0 CHECK (balance >= 0),
  status     VARCHAR(10) NOT NULL DEFAULT 'ACTIVE'
             CHECK (status IN ('ACTIVE','FROZEN','CLOSED'))
);

Q54. What are the ON DELETE / ON UPDATE referential actions?

A: RESTRICT / NO ACTION rejects deleting a parent with children (the safe default). CASCADE deletes/updates children automatically. SET NULL detaches children by nulling the FK. SET DEFAULT repoints children at a default parent. Choose per relationship: cascade for owned/weak entities (order → order_items), restrict for referenced master data (product referenced by orders), set null for optional links.

CREATE TABLE order_items (
  order_id   BIGINT NOT NULL REFERENCES orders(order_id)  ON DELETE CASCADE,
  product_id BIGINT NOT NULL REFERENCES products(id)      ON DELETE RESTRICT,
  line_no    INT NOT NULL,
  PRIMARY KEY (order_id, line_no)
);

Q55. What is a CHECK constraint and what are its limits?

A: A CHECK constraint validates a boolean expression on each row at write time — ranges, enumerations, cross-column rules like end_date > start_date. Limits: it can only see the current row (no subqueries in most engines), so rules involving other rows or tables ("no overlapping bookings") need exclusion constraints, triggers or application logic. Note that MySQL only started enforcing CHECK in 8.0.16.

CREATE TABLE bookings (
  id         BIGINT PRIMARY KEY,
  start_date DATE NOT NULL,
  end_date   DATE NOT NULL,
  guests     INT  NOT NULL,
  CHECK (end_date > start_date),
  CHECK (guests BETWEEN 1 AND 10)
);

Q56. Why prefer NOT NULL with defaults over nullable columns?

A: NULL introduces three-valued logic: NULL = NULL is unknown, NULLs vanish from aggregates and normal comparisons, and every consumer must handle the missing case. Making columns NOT NULL with a sensible DEFAULT removes an entire class of bugs and simplifies queries. Reserve NULL for genuinely unknown or inapplicable values — not as a lazy default.

Q57. What are entity, referential and domain integrity?

A: Entity integrity: every table has a primary key and no part of it is NULL — each row is identifiable. Referential integrity: every foreign key value matches an existing parent row or is NULL. Domain integrity: every column value belongs to its legal domain, enforced by data types, NOT NULL, CHECK and lookup-table FKs. A well-designed schema enforces all three declaratively.

Q58. Why can ON DELETE CASCADE be dangerous?

A: A single parent delete can silently fan out into millions of child deletions across cascade chains, causing long lock-holding transactions, replication lag, and unrecoverable data loss from one mistaken statement. Guidelines: cascade only across true ownership (weak entities), never across independent business entities like customers → orders, and consider soft deletes or RESTRICT plus explicit archival for anything valuable.

Q59. What are deferred constraints?

A: Deferrable constraints (PostgreSQL, Oracle) are checked at COMMIT instead of after each statement. They solve chicken-and-egg cases: inserting two rows that reference each other, swapping unique values between two rows, or bulk loads in arbitrary order. Declared with DEFERRABLE INITIALLY DEFERRED; within a transaction you can toggle with SET CONSTRAINTS. MySQL does not support them.

Q60. Should business rules be enforced in the database or the application?

A: Both, at different levels. Structural invariants that must never be violated — uniqueness, referential integrity, non-negative balances — belong in the database, because it is the only layer every writer passes through (apps, migrations, ad-hoc scripts). Complex, changeable workflow rules belong in the application/service layer. The interview-ready answer: the database is the last line of defense, not the only one.

Q61. How do unique constraints interact with NULLs, and how do you enforce "unique among active rows"?

A: In most databases (PostgreSQL, Oracle, MySQL) a unique constraint allows multiple NULLs because NULL is not equal to NULL (SQL Server allows only one). To enforce uniqueness over a subset — the classic soft-delete case "email unique among non-deleted users" — use a partial unique index in PostgreSQL, or include a nullable deleted_at surrogate in the key.

-- PostgreSQL: uniqueness applies only to live rows
CREATE UNIQUE INDEX uq_users_email_active
ON users(email) WHERE deleted_at IS NULL;

Q62. What are the pros and cons of triggers for data integrity?

A: Pros: triggers run for every write path, can enforce multi-row/multi-table rules that CHECK cannot, and can maintain audit trails or denormalized copies automatically. Cons: hidden control flow (developers forget they exist), harder testing and debugging, per-row overhead on bulk operations, and portability pain across engines. Use them sparingly for cross-cutting concerns like auditing; keep business logic out of them.

Schema Design for Real Scenarios: E-commerce, Social, Banking

Q63. Design the core tables of an e-commerce system.

A: The backbone is customers, products, orders, and order_items (the junction resolving the M:N between orders and products, and the natural place for quantity and the price paid). Orders carry status, addresses (or FK to an addresses table) and totals; payments and shipments hang off orders as separate 1:N tables so an order can have retries and partial shipments.

CREATE TABLE orders (
  order_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers(id),
  status      VARCHAR(20) NOT NULL DEFAULT 'PENDING',
  total       DECIMAL(14,2) NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE order_items (
  order_id   BIGINT NOT NULL REFERENCES orders(order_id) ON DELETE CASCADE,
  line_no    INT NOT NULL,
  product_id BIGINT NOT NULL REFERENCES products(id),
  quantity   INT NOT NULL CHECK (quantity > 0),
  unit_price DECIMAL(12,2) NOT NULL,   -- price AT purchase time
  PRIMARY KEY (order_id, line_no)
);

Q64. Why does order_items store unit_price when products already has a price?

A: Because the product's current price is mutable, but the price the customer actually paid is a historical fact. If order_items only referenced products.price, every price change would silently rewrite past orders, invoices and revenue reports. This is deliberate, correct duplication — snapshotting point-in-time facts is not a normalization violation, since paid price and current price are different attributes.

Q65. How do you model products whose attributes vary by category (EAV vs JSON)?

A: Laptops need RAM and CPU; shirts need size and color — a fixed column set cannot fit both. Options: EAV (entity-attribute-value rows: flexible, fully queryable, but joins explode and typing is weak), JSON column (simple, indexable with GIN in PostgreSQL, weaker validation), or table-per-category for a few stable categories. Modern default: common columns relational + JSONB for variant attributes, with search often delegated to Elasticsearch.

-- EAV alternative (use only when attributes must be first-class rows)
CREATE TABLE product_attributes (
  product_id BIGINT NOT NULL REFERENCES products(id),
  attr_name  VARCHAR(50) NOT NULL,
  attr_value VARCHAR(255) NOT NULL,
  PRIMARY KEY (product_id, attr_name)
);

Q66. How do you design inventory tracking that survives concurrency?

A: Keep a stock quantity per product/warehouse and decrement it atomically with a guarded UPDATE — UPDATE inventory SET qty = qty - 1 WHERE product_id = ? AND qty >= 1 — checking rows-affected to detect oversell, plus a CHECK (qty >= 0) as a safety net. For auditability, also append rows to an inventory_movements table (deltas with reason codes) so the balance can be recomputed and reconciled; reservations with expiry handle carts and pending checkouts.

Q67. How do you model followers in a social network?

A: Following is a self-referencing M:N on users, stored as a junction table of (follower_id, followee_id) pairs. The composite PK prevents duplicate follows, a CHECK prevents self-follows, and the second index serves the reverse query ("who follows me?"). At very large scale this table shards well by follower_id.

CREATE TABLE follows (
  follower_id BIGINT NOT NULL REFERENCES users(id),
  followee_id BIGINT NOT NULL REFERENCES users(id),
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (follower_id, followee_id),
  CHECK (follower_id <> followee_id)
);
CREATE INDEX idx_follows_followee ON follows(followee_id);

Q68. Fan-out on write vs fan-out on read for a news feed — schema implications?

A: Fan-out on read stores each post once and builds the feed at request time by joining posts with follows — normalized, cheap writes, expensive reads for users following thousands of accounts. Fan-out on write maintains a denormalized per-user feed table (user_id, post_id, created_at) appended when someone they follow posts — fast reads, heavy write amplification for celebrity accounts. Real systems hybridize: fan-out on write for normal users, on read for mega-follower accounts.

Q69. How do you store threaded comments?

A: The adjacency list — a nullable parent_comment_id self-FK — is the standard choice, with recursive CTEs to load a thread. If you need "fetch entire subtree sorted" without recursion, add a materialized path column (e.g., 0001.0005.0002) that sorts lexicographically, at the cost of maintaining it on moves.

CREATE TABLE comments (
  comment_id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  post_id           BIGINT NOT NULL REFERENCES posts(id),
  parent_comment_id BIGINT NULL REFERENCES comments(comment_id),
  author_id         BIGINT NOT NULL REFERENCES users(id),
  body              TEXT NOT NULL,
  created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_comments_post   ON comments(post_id);
CREATE INDEX idx_comments_parent ON comments(parent_comment_id);

Q70. How do you design a banking ledger (double-entry)?

A: Use double-entry bookkeeping: every transfer is one transaction row plus two (or more) immutable ledger entries whose amounts sum to zero — a debit on one account, a credit on another. Money is never "updated", only appended, so every balance is explainable as the sum of its entries and errors are corrected by compensating entries, never by editing history.

CREATE TABLE transactions (
  txn_id     BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  txn_type   VARCHAR(20) NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE ledger_entries (         -- append-only, never UPDATE/DELETE
  entry_id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  txn_id     BIGINT NOT NULL REFERENCES transactions(txn_id),
  account_id BIGINT NOT NULL REFERENCES accounts(account_id),
  amount     DECIMAL(14,2) NOT NULL   -- negative = debit, positive = credit
);
-- invariant: SUM(amount) per txn_id = 0 (enforced in the service layer)

Q71. Should you store account balance as a column or derive it from the ledger?

A: The ledger is the source of truth; a balance column is a denormalized cache maintained in the same transaction as the ledger insert (with optimistic locking or SELECT ... FOR UPDATE to serialize concurrent transfers). Deriving the balance by summing millions of entries per read is too slow, but the cached balance must always be reconcilable against the ledger — run periodic consistency checks and never accept writes that touch the balance without a ledger entry.

Q72. What data type should store money and why?

A: Use exact decimal types — DECIMAL(p,s)/NUMERIC — or integer minor units (amount in cents as BIGINT, common in payment systems and multi-currency designs). Never FLOAT/DOUBLE: binary floating point cannot represent 0.1 exactly and rounding errors compound across millions of rows. Store the currency code alongside the amount (ISO 4217) rather than assuming one currency.

Q73. How do you model addresses — columns on the user or a separate table?

A: A separate addresses table with a 1:N from user, because users have multiple addresses (shipping, billing, work) that change over time. For orders, either snapshot the address into the order row (immutable historical fact, like unit_price in Q64) or reference an immutable address version — never point orders at a mutable address row, or editing an address rewrites shipping history.

Q74. Compare adjacency list, materialized path, nested sets and closure table for hierarchies.

A: Adjacency list (parent_id): simplest, cheap writes, subtree reads need recursive CTEs. Materialized path (path string): fast subtree reads via prefix match, moves rewrite descendants' paths. Nested sets (left/right numbers): very fast subtree reads, inserts renumber half the tree — best for near-static trees. Closure table (every ancestor-descendant pair): fast reads and clean SQL for any depth query, at the cost of O(depth) rows per node and an extra table.

CREATE TABLE category_closure (
  ancestor_id   BIGINT NOT NULL REFERENCES categories(id),
  descendant_id BIGINT NOT NULL REFERENCES categories(id),
  depth         INT NOT NULL,
  PRIMARY KEY (ancestor_id, descendant_id)
);
-- all descendants of category 5:
-- SELECT descendant_id FROM category_closure WHERE ancestor_id = 5;

Data Warehousing: Star, Snowflake, OLTP vs OLAP

Q75. What is the difference between OLTP and OLAP?

A: OLTP (Online Transaction Processing) handles many small concurrent reads/writes — orders, payments, logins — and favors normalized schemas, row storage and strict ACID transactions. OLAP (Online Analytical Processing) handles few but huge aggregate queries over historical data and favors denormalized dimensional schemas, columnar storage and batch loading. Mixing them on one database makes analytics queries starve transactional workloads, which is why warehouses exist.

Q76. What is a star schema?

A: A star schema has one central fact table (measurable events: sales, clicks) surrounded by denormalized dimension tables (who, what, where, when), each joined by a single foreign key. Dimensions are intentionally flat — one join per dimension — which keeps analytical queries simple and fast, and is the default dimensional model.

CREATE TABLE fact_sales (
  date_key     INT    NOT NULL REFERENCES dim_date(date_key),
  product_key  BIGINT NOT NULL REFERENCES dim_product(product_key),
  store_key    BIGINT NOT NULL REFERENCES dim_store(store_key),
  customer_key BIGINT NOT NULL REFERENCES dim_customer(customer_key),
  quantity     INT NOT NULL,
  revenue      DECIMAL(14,2) NOT NULL
);
CREATE TABLE dim_product (   -- flat, denormalized dimension
  product_key BIGINT PRIMARY KEY,   -- surrogate
  product_id  BIGINT NOT NULL,      -- natural/business key
  name        VARCHAR(200), category VARCHAR(100), brand VARCHAR(100)
);

Q77. Star vs snowflake schema — trade-offs?

A: A snowflake schema normalizes dimensions into sub-dimensions (dim_product → dim_category → dim_department), saving some storage and easing dimension maintenance, but every extra level adds a join to analytical queries and complexity for BI tools. Star keeps dimensions flat and redundant for query speed and simplicity. Since dimension tables are tiny relative to facts, the storage savings rarely justify snowflaking — star is the usual recommendation.

Q78. What are fact tables and dimension tables?

A: Fact tables record business events at a declared grain, containing numeric measures (quantity, revenue) plus foreign keys to dimensions; they are huge, narrow and append-only. Dimension tables hold descriptive context — product names, store regions, calendar attributes — and are small, wide and slowly changing. Queries filter and group by dimension attributes while aggregating fact measures.

Q79. What are additive, semi-additive and non-additive facts?

A: Additive facts sum correctly across all dimensions (revenue, units sold). Semi-additive facts sum across some dimensions but not time — account balance or inventory level can be summed across accounts but not across days (use last-value or average over time). Non-additive facts never sum meaningfully (ratios, percentages, unit price) — store their additive components and compute the ratio at query time.

Q80. What are Slowly Changing Dimensions (SCD) Types 1, 2 and 3?

A: Type 1: overwrite the old value — no history, simplest. Type 2: insert a new dimension row with validity dates and a current flag — full history, the warehouse workhorse; facts link to the row version valid at event time. Type 3: keep an extra "previous_value" column — one step of history only, for rare "before/after reorg" comparisons.

-- SCD Type 2 dimension
CREATE TABLE dim_customer (
  customer_key BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- surrogate
  customer_id  BIGINT NOT NULL,        -- business key, repeats per version
  city         VARCHAR(100),
  valid_from   DATE NOT NULL,
  valid_to     DATE NOT NULL DEFAULT '9999-12-31',
  is_current   BOOLEAN NOT NULL DEFAULT TRUE
);

Q81. Why do warehouses use surrogate keys instead of source-system keys?

A: Because dimension rows are versioned (SCD Type 2 gives one business key many rows), because data is integrated from multiple sources whose natural keys can collide or change format, and because compact integer surrogates keep the enormous fact table's foreign keys and indexes small. The business key is retained as an attribute for lineage and lookups, but facts always join on the surrogate.

Q82. What is the grain of a fact table and why must you fix it first?

A: The grain is the exact real-world event one fact row represents — "one order line", "one page view", "one account balance per day". It is the first dimensional-design decision because every measure and dimension must be consistent with it; mixing grains in one table (order-level shipping cost next to line-level quantity) double-counts under aggregation. When in doubt choose the finest grain available — you can always aggregate up.

Q83. What is the difference between ETL and ELT?

A: ETL transforms data on a separate processing tier before loading the warehouse — traditional when warehouse compute was scarce. ELT loads raw data first and transforms inside the warehouse with SQL (dbt on Snowflake/BigQuery/Redshift), exploiting cheap elastic compute and keeping raw data replayable. ELT dominates modern cloud stacks; ETL persists where data must be cleansed or anonymized before it may land.

Q84. What is a date dimension and why not just use a date column?

A: A date dimension pre-computes one row per calendar day with attributes like day of week, month, quarter, fiscal period, and is_holiday. Joining facts to it makes "revenue by fiscal quarter, weekends only" a simple filter instead of scattered date functions, keeps fiscal logic in one place, and is shared (conformed) across all fact tables so every report buckets time identically.

CREATE TABLE dim_date (
  date_key     INT PRIMARY KEY,      -- 20260703 (YYYYMMDD)
  full_date    DATE NOT NULL,
  day_of_week  VARCHAR(9) NOT NULL,
  month_name   VARCHAR(9) NOT NULL,
  quarter      SMALLINT NOT NULL,
  fiscal_year  SMALLINT NOT NULL,
  is_weekend   BOOLEAN NOT NULL,
  is_holiday   BOOLEAN NOT NULL
);

Soft Deletes, Auditing & Versioning Patterns

Q85. What is a soft delete and how is it implemented?

A: A soft delete marks a row as deleted instead of removing it — typically a deleted_at TIMESTAMP NULL column (NULL = live), which also records when. It preserves history, enables undo, and avoids breaking foreign keys from historical records. Every query must then filter WHERE deleted_at IS NULL, usually centralized via a view or ORM-level filter such as Hibernate's @SQLRestriction.

ALTER TABLE users ADD COLUMN deleted_at TIMESTAMPTZ NULL;

-- delete
UPDATE users SET deleted_at = now() WHERE user_id = 42;

-- all reads
SELECT * FROM users WHERE deleted_at IS NULL;

Q86. What are the drawbacks of soft deletes?

A: Every query, index and unique constraint must account for dead rows: forgotten filters resurrect deleted data, unique constraints block re-registering a deleted email (fix with partial unique indexes, Q61), tables and indexes grow forever, and foreign keys cannot stop children referencing a "deleted" parent. GDPR-style erasure also still requires real deletion or anonymization. Mitigate by periodically archiving soft-deleted rows to an archive table and hard-deleting.

-- keep live-row scans fast: index only live rows
CREATE INDEX idx_users_live ON users(email) WHERE deleted_at IS NULL;

Q87. How do you design an audit trail table?

A: An append-only table capturing who changed what, when: table/entity name, row key, action (INSERT/UPDATE/DELETE), the changed data (old/new snapshots as JSON, or one row per changed column), the acting user and timestamp, and ideally a request/correlation ID. It must be insert-only — no updates or deletes — and is typically partitioned by time for retention management.

CREATE TABLE audit_log (
  audit_id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  table_name VARCHAR(63)  NOT NULL,
  row_pk     VARCHAR(100) NOT NULL,
  action     CHAR(1) NOT NULL CHECK (action IN ('I','U','D')),
  old_data   JSONB,
  new_data   JSONB,
  changed_by BIGINT NOT NULL,
  changed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Q88. Trigger-based vs application-based auditing — which is better?

A: Trigger-based auditing catches every write path (including ad-hoc SQL and migrations) and cannot be forgotten, but it lacks application context (which user, which use case) unless the app passes it via session variables, and it adds write latency. Application-based auditing (e.g., Hibernate Envers, interceptors) knows the business context but misses out-of-band writes. High-assurance systems combine both, or use CDC (change data capture from the WAL/binlog) for a low-overhead, complete stream.

Q89. What are temporal (system-versioned) tables?

A: A system-versioned temporal table keeps every historical version of each row with system-maintained validity timestamps; the database moves the old version into history automatically on UPDATE/DELETE. You can query the table "AS OF" any past instant — built into SQL Server, MariaDB and Oracle; in PostgreSQL it is emulated with triggers and range columns. It gives auditing plus time travel without hand-rolled version tables.

-- SQL Server / MariaDB style
CREATE TABLE products (
  id    BIGINT PRIMARY KEY,
  price DECIMAL(12,2) NOT NULL,
  valid_from TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
  valid_to   TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
  PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
) WITH SYSTEM VERSIONING;

SELECT * FROM products FOR SYSTEM_TIME AS OF '2026-01-01 00:00:00';

Q90. How does an optimistic-locking version column work?

A: Each row carries a version integer; every UPDATE includes WHERE version = :readVersion and increments it. If another transaction updated the row in between, zero rows match, and the application knows it read stale data and must retry or report a conflict — concurrency control without holding locks. JPA/Hibernate automates this with @Version, throwing OptimisticLockException on conflict.

UPDATE accounts
SET balance = 900, version = version + 1
WHERE account_id = 42 AND version = 7;
-- 0 rows affected => concurrent modification, retry the read-modify-write

Q91. How does event sourcing differ from an audit log?

A: An audit log is a byproduct: the current-state tables remain the source of truth and the log records what happened to them. In event sourcing the append-only event stream is the source of truth — current state is a projection rebuilt by replaying events (with snapshots for speed). Event sourcing gives perfect history and temporal queries by construction, but costs significant complexity (versioned event schemas, projections, eventual consistency), so most systems settle for state + audit log.

Q92. What standard bookkeeping columns should most tables carry?

A: Conventional columns: created_at and updated_at timestamps (server-generated: DEFAULT now() and trigger/ORM-maintained), often created_by/updated_by user references, deleted_at where soft deletes apply, and version where optimistic locking applies. Use TIMESTAMP WITH TIME ZONE (store UTC), and generate values in the database or a single service layer so they stay consistent across writers.

Multi-Tenancy, Scalability & ORM Mapping Considerations

Q93. What are the three main multi-tenancy strategies?

A: Shared schema: all tenants in the same tables with a tenant_id column — cheapest, scales to many tenants, weakest isolation. Schema-per-tenant: one schema per tenant in a shared database — moderate isolation, migrations multiply per tenant. Database-per-tenant: strongest isolation, per-tenant backup/restore and tuning, highest operational cost. Choice is driven by tenant count, compliance requirements and noisy-neighbor tolerance; large SaaS often mixes them (shared for small tenants, dedicated for enterprise).

-- shared-schema model
CREATE TABLE invoices (
  tenant_id  BIGINT NOT NULL REFERENCES tenants(id),
  invoice_id BIGINT NOT NULL,
  amount     DECIMAL(14,2) NOT NULL,
  PRIMARY KEY (tenant_id, invoice_id)   -- tenant leads every key
);

Q94. In a shared-schema design, how do you prevent cross-tenant data leaks?

A: Defense in depth: put tenant_id first in primary keys and composite indexes so every query is naturally tenant-scoped and fast; inject the tenant filter centrally (Hibernate @TenantId/filters, repository base classes) rather than trusting each query; and enforce it in the database with PostgreSQL Row-Level Security so even a buggy query cannot cross tenants.

ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON invoices
  USING (tenant_id = current_setting('app.tenant_id')::BIGINT);
-- app sets: SET app.tenant_id = '42'; per connection/transaction

Q95. What is the difference between partitioning and sharding?

A: Partitioning splits one logical table into pieces (by range, list or hash) within a single database server — the engine prunes partitions transparently, and old partitions can be dropped instantly. Sharding distributes data across multiple servers, each holding a subset keyed by a shard key; the application or a router picks the shard, and cross-shard joins/transactions become your problem. Partition first; shard only when one machine truly cannot hold the write load or data volume.

-- PostgreSQL declarative range partitioning
CREATE TABLE events (
  event_id   BIGINT NOT NULL,
  occurred_at TIMESTAMPTZ NOT NULL,
  payload    JSONB,
  PRIMARY KEY (event_id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2026_07 PARTITION OF events
  FOR VALUES FROM ('2026-07-01') TO ('2026-08-01');

Q96. How do you choose a shard key?

A: A good shard key has high cardinality, distributes reads and writes evenly (no celebrity hotspots), and appears in nearly every query so requests route to a single shard. tenant_id or user_id are typical; monotonically increasing keys like timestamps concentrate all inserts on one shard, and low-cardinality keys like country create imbalance. Changing a shard key later means resharding — one of the most painful migrations — so co-locate data that must be queried and transacted together.

Q97. How do read replicas and replication lag affect schema and application design?

A: Replicas scale reads but are asynchronously behind the primary, so a user can write and then not see their own write on a replica. Designs must route writes and read-your-own-writes flows to the primary (or use session pinning/LSN tracking), tolerate staleness in feeds and reports, and keep unique/integrity enforcement on the primary. Replication also amplifies the cost of huge transactions and unindexed cascading deletes, reinforcing lean write-path design.

Q98. How do you map a many-to-many with extra columns in JPA/Hibernate?

A: @ManyToMany cannot hold columns on the join table, so model the junction as its own entity with two @ManyToOne sides — mirroring the SQL design in Q25. Even for plain M:N, many teams prefer the explicit entity because it survives requirement changes and gives control over the composite key and indexes.

@Entity
public class Enrollment {
    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @ManyToOne(fetch = FetchType.LAZY) @JoinColumn(name = "student_id")
    private Student student;

    @ManyToOne(fetch = FetchType.LAZY) @JoinColumn(name = "course_id")
    private Course course;

    private LocalDate enrolledOn;   // relationship attribute
    private String grade;
}

Q99. What Hibernate inheritance strategies exist and how do they shape the schema?

A: SINGLE_TABLE: one table for the whole hierarchy with a discriminator column — fastest (no joins) but subclass columns must be nullable. JOINED: one table per class joined by shared PK — normalized, allows NOT NULL subclass columns, costs joins. TABLE_PER_CLASS: one independent table per concrete class — no joins for single-type queries, but polymorphic queries need UNIONs and identity generation is awkward. Default to SINGLE_TABLE for small hierarchies, JOINED when subclass integrity matters.

@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
@DiscriminatorColumn(name = "payment_type")
public abstract class Payment { @Id private Long id; }

@Entity @DiscriminatorValue("CARD")
public class CardPayment extends Payment { private String last4; }

@Entity @DiscriminatorValue("UPI")
public class UpiPayment extends Payment { private String vpa; }

Q100. How does the N+1 problem relate to schema and mapping design, and how do you fix it?

A: N+1 happens when loading N parents lazily triggers one extra query per parent for its association — 1 query becomes N+1. It is a mapping/fetching issue, not a schema flaw: keep associations FetchType.LAZY by default and fetch explicitly per use case with JOIN FETCH, @EntityGraph, or batch fetching (@BatchSize). Schema-side, make sure every foreign key backing an association is indexed, or the generated joins will be slow even after N+1 is fixed.

// N+1: one query for orders, then one per order for items
List<Order> orders = em.createQuery("SELECT o FROM Order o", Order.class)
                       .getResultList();
orders.forEach(o -> o.getItems().size());

// Fix: fetch join — single SQL query
List<Order> orders = em.createQuery(
    "SELECT DISTINCT o FROM Order o JOIN FETCH o.items", Order.class)
    .getResultList();
No comments
Leave a Comment